# Sample Reports
Explore real evaluation outputs from our DREAM benchmark suite.
## Agent Comparison — Deep Research v2 Suite

| Agent | Provider | Depth | Reasoning | Evidence | Accuracy | Multi-step |
|---|---|---|---|---|---|---|
| AutoResearch v2.1 | LabAI | 92 | 87 | 94 | 89 | 81 |
| DeepHermes-3 | NousResearch | 85 | 83 | 88 | 79 | 77 |
| WebPilot Pro | WebPilot AI | 78 | 80 | 82 | 81 | 74 |
| AgentX-Research | Startup Labs | 76 | 72 | 79 | 77 | 70 |
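The per-metric scores above can be collapsed into a single ranking. A minimal sketch, assuming an unweighted mean across the five metrics (the DREAM suite's actual aggregation scheme is not specified here, so the equal weighting is an assumption):

```python
# Per-agent metric scores, transcribed from the comparison above.
scores = {
    "AutoResearch v2.1": {"Depth": 92, "Reasoning": 87, "Evidence": 94, "Accuracy": 89, "Multi-step": 81},
    "DeepHermes-3":      {"Depth": 85, "Reasoning": 83, "Evidence": 88, "Accuracy": 79, "Multi-step": 77},
    "WebPilot Pro":      {"Depth": 78, "Reasoning": 80, "Evidence": 82, "Accuracy": 81, "Multi-step": 74},
    "AgentX-Research":   {"Depth": 76, "Reasoning": 72, "Evidence": 79, "Accuracy": 77, "Multi-step": 70},
}

def overall(metrics):
    # Unweighted mean of the five metric scores (assumed weighting).
    return sum(metrics.values()) / len(metrics)

# Rank agents from highest to lowest overall score.
ranked = sorted(scores, key=lambda agent: overall(scores[agent]), reverse=True)
for agent in ranked:
    print(f"{agent}: {overall(scores[agent]):.1f}")
```

Under equal weighting, AutoResearch v2.1 leads at 88.6, followed by DeepHermes-3 at 82.4.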
## Reasoning Trace — AutoResearch v2.1
Task: "Evaluate the current evidence for GLP-1 receptor agonists in treating neurodegenerative diseases."
| Step | Action | Detail | Duration | Status |
|---|---|---|---|---|
| 1 | Query Decomposition | Broke main question into 4 sub-questions | 2.3s | |
| 2 | Source Retrieval | Retrieved 12 relevant papers from Semantic Scholar | 8.1s | |
| 3 | Evidence Extraction | Extracted 23 key claims with citations | 5.4s | |
| 4 | Cross-Validation | Validated 21/23 claims against ground truth | 3.7s | |
| 5 | Synthesis | Generated coherent 800-word research summary | 4.2s | |
| 6 | Self-Critique | Identified 2 weak arguments, revised conclusion | 3.1s | |
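The six steps in the trace above follow a common research-agent loop: decompose, retrieve, extract, validate, synthesize, then self-critique. A minimal sketch of that control flow, with every function a hypothetical stub rather than AutoResearch's actual API:

```python
# Hypothetical sketch of the six-stage pipeline shown in the trace.
# All function bodies are stubs standing in for the agent's real tools.

def decompose(question):
    # Step 1: break the main question into sub-questions.
    return [f"{question} (sub-question {i})" for i in range(1, 5)]

def retrieve(sub_questions):
    # Step 2: fetch candidate sources for the sub-questions (stubbed).
    return [f"paper-{i}" for i in range(12)]

def extract_claims(papers):
    # Step 3: pull key claims, each tied to its source citation (stubbed).
    return [(paper, f"claim extracted from {paper}") for paper in papers]

def validate(claims):
    # Step 4: cross-validate claims; this stub keeps everything.
    return claims

def synthesize(validated):
    # Step 5: draft a summary from the validated claims.
    return f"Summary of {len(validated)} validated claims."

def critique_and_revise(draft):
    # Step 6: self-critique pass; may revise weak arguments (no-op here).
    return draft

def run_pipeline(question):
    # Chain the six stages in trace order.
    subs = decompose(question)
    papers = retrieve(subs)
    claims = extract_claims(papers)
    validated = validate(claims)
    draft = synthesize(validated)
    return critique_and_revise(draft)

print(run_pipeline("Evaluate GLP-1 receptor agonists for neurodegeneration"))
```

Each stage consumes the previous stage's output, which is what lets the trace report per-step durations and intermediate counts.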
## Benchmark Summary Statistics

| Agents Evaluated | Total Scenarios | Avg. Score | Median Score | Top Score | Suites Available |
|---|---|---|---|---|---|
| 142 | 3,550 | 76.3 | 78.1 | 94.2 | 8 |