1. “Novelty” by Dictionary Definition Only
- Claim: “World’s First Ultra-Reasoning Framework” built purely on prompt engineering
- Reality: Every major API-based agent architecture—from ReAct to AutoGPT to LangChain agents—already does exactly this: hierarchical/multi-agent prompting, task decomposition, role assignment, and voting. There is zero comparison to cited prior work (and no rigorous ablation) demonstrating why HDA2A beats the dozens of existing recipes.
2. Methodology ≠ Method
- Distribution / Round / Voting Systems (pp. 2–5):
- Diagrams on pages 2 and 4 are 100% conceptual fluff—no pseudocode, no complexity analysis, no real stopping criteria.
- The “voting” system loops until unanimous binary agreement—guaranteed non-termination whenever the agents persistently disagree or simply refuse to answer. No round cap, no tie-breaking fallback.
- No temperature settings, no prompt-length budgets, no cost/latency trade-offs.
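To underline how cheap a real stopping criterion would be, here is a minimal sketch of a bounded voting loop with a majority fallback (the `agent(task)` callables are hypothetical stand-ins for LLM calls; nothing like this appears in the paper):

```python
from collections import Counter

def vote(agents, task, max_rounds=3):
    """Bounded voting loop: stop early on unanimous agreement,
    otherwise fall back to the majority answer after max_rounds.
    `agents` is a list of hypothetical callables standing in for LLM calls."""
    for round_no in range(1, max_rounds + 1):
        ballots = [agent(task) for agent in agents]
        answer, count = Counter(ballots).most_common(1)[0]
        if count == len(agents):   # unanimous: terminate early
            return answer, round_no
    return answer, max_rounds      # majority fallback: never loops forever
```

A dozen lines of pseudocode like this, plus a complexity bound, is the minimum the "Voting Systems" section needed.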
3. Experimental “Evidence” = Anecdote
- IMO Problems (Sec 3.1, pp. 5–7):
- Two Olympiad questions “solved” by fiat—the solutions either trivialize the problem (designing a tailor-made polynomial after seeing the answer!) or simply lack rigor.
- No error bars, no baseline runs, no statistical significance: just “deepseek r1 got 18 hallucinations corrected.”
- Graphene Hypothesis (Sec 3.2, p. 8):
- A random chemistry protocol full of jargon (“Sacrificial WO/Ni Transfer Method”) with zero citations to materials-science literature—classic LLM hallucination dressed up as “phase 1/2/3 roadmap.”
4. Missing Everything Critical
- Cost, Latency & Scalability: They brag about “model-agnosticism” but never report how many API calls or \$\$\$ per query.
- Failure Modes & Limits: No discussion of what happens when all Sub-AIs hallucinate, or if one goes rogue.
- Reproducibility: The GitHub link is mentioned but no commit hash, no CI/tests, no install instructions—classic ghost repo.
- Comparisons: No head-to-head with off-the-shelf CoT, Self-Consistency, Debate, or Tree-of-Thought.
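For reference, the Self-Consistency baseline the paper never compares against fits in a few lines: sample several answers at nonzero temperature and majority-vote. A minimal sketch (the `sample_answer(question, rng)` callable is a hypothetical stand-in for a temperature-sampled LLM call):

```python
from collections import Counter
import random

def self_consistency(sample_answer, question, n_samples=5, seed=0):
    """Self-Consistency baseline: draw n_samples answers and return
    the majority vote. `sample_answer` is a hypothetical stand-in
    for a temperature-sampled model call."""
    rng = random.Random(seed)
    samples = [sample_answer(question, rng) for _ in range(n_samples)]
    return Counter(samples).most_common(1)[0][0]
```

Any claimed improvement over this ten-line baseline needs reported win rates, not anecdotes.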
5. Overcooked Hype, Undercooked Science
- Overuses words like “ultra,” “first,” “hierarchal” [sic], “deepseek,” “metacognition”—yet delivers zero measurable improvements.
- Treats natural-language prompts as if they were formal proofs, but the actual proofs (pp. 6–7) read like ChatGPT scrapings with boldface roles slapped on top.