Sample Reports

Explore real evaluation outputs from our DREAM benchmark suite.

Agent Comparison — Deep Research v2 Suite

Agent                               Organization   Overall  Depth  Reasoning  Evidence  Accuracy  Multi-step
AutoResearch v2.1 (Top Performer)   LabAI          88.6     92     87         94        89        81
DeepHermes-3                        NousResearch   82.4     85     83         88        79        77
WebPilot Pro                        WebPilot AI    79.1     78     80         82        81        74
AgentX-Research                     Startup Labs   74.8     76     72         79        77        70
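
The page does not state how the headline score for each agent is derived from its five dimension scores. As a rough check, and assuming (this is only an assumption) that it is an unweighted mean, the sketch below recomputes the mean for each agent and compares it with the reported figure. The helper name `check_overall` is hypothetical.

```python
from statistics import mean

# Reported headline score and dimension scores, copied from the comparison
# above in the order Depth, Reasoning, Evidence, Accuracy, Multi-step.
agents = {
    "AutoResearch v2.1": (88.6, [92, 87, 94, 89, 81]),
    "DeepHermes-3": (82.4, [85, 83, 88, 79, 77]),
    "WebPilot Pro": (79.1, [78, 80, 82, 81, 74]),
    "AgentX-Research": (74.8, [76, 72, 79, 77, 70]),
}


def check_overall(agents):
    """Return {name: (reported, unweighted_mean, gap)} rounded to one decimal.

    Assumption: the headline score is a plain mean of the five dimensions.
    The source does not confirm this.
    """
    result = {}
    for name, (reported, dims) in agents.items():
        avg = round(mean(dims), 1)
        result[name] = (reported, avg, round(reported - avg, 1))
    return result


for name, (reported, avg, gap) in check_overall(agents).items():
    print(f"{name}: reported={reported}, mean={avg}, gap={gap}")
```

Under this assumption, three of the four headline scores match the mean exactly. WebPilot Pro differs by 0.1, which may reflect rounding of the displayed dimension scores or a weighting the page does not describe.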

Reasoning Trace — AutoResearch v2.1

Task: "Evaluate the current evidence for GLP-1 receptor agonists in treating neurodegenerative diseases."

Step  Action               Detail                                             Duration
1     Query Decomposition  Broke main question into 4 sub-questions           2.3s
2     Source Retrieval     Retrieved 12 relevant papers from Semantic Scholar 8.1s
3     Evidence Extraction  Extracted 23 key claims with citations             5.4s
4     Cross-Validation     Validated 21/23 claims against ground truth        3.7s
5     Synthesis            Generated coherent 800-word research summary       4.2s
6     Self-Critique        Identified 2 weak arguments, revised conclusion    3.1s
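
The per-step figures in the trace can be rolled up into a few summary numbers. The sketch below is illustrative only: it uses the durations and claim counts listed above, and the variable names are hypothetical rather than part of any DREAM tooling.

```python
# Step names and durations (seconds), copied from the reasoning trace above.
steps = [
    ("Query Decomposition", 2.3),
    ("Source Retrieval", 8.1),
    ("Evidence Extraction", 5.4),
    ("Cross-Validation", 3.7),
    ("Synthesis", 4.2),
    ("Self-Critique", 3.1),
]

# Total wall-clock time across the six steps, rounded to one decimal.
total_seconds = round(sum(duration for _, duration in steps), 1)

# The step that took the longest.
slowest_step = max(steps, key=lambda step: step[1])

# Share of extracted claims that passed cross-validation (step 4: 21 of 23).
validated, extracted = 21, 23
validation_rate = round(100 * validated / extracted, 1)

print(f"Total duration: {total_seconds}s")
print(f"Slowest step: {slowest_step[0]} ({slowest_step[1]}s)")
print(f"Claims validated: {validation_rate}%")
```

With the figures shown, the run takes 26.8 seconds in total, source retrieval is the slowest step, and about 91.3% of the extracted claims pass cross-validation.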

Benchmark Summary Statistics

Agents Evaluated: 142
Total Scenarios: 3,550
Average Score: 76.3
Median Score: 78.1
Top Score: 94.2
Suites Available: 8