How It Works
Five steps from agent submission to production-ready evaluation report.
Submit Your Agent
Connect your research agent via our API, or upload traces from a previous run. We support major frameworks, including LangChain, AutoGen, and CrewAI, as well as custom implementations.
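A minimal sketch of what submission might look like, assuming a REST-style API. The endpoint URL, payload fields, and EVAL_API_KEY environment variable are illustrative placeholders, not the actual interface:

```python
import json
import os

import requests

# Placeholder endpoint and auth scheme -- consult the API docs for the real ones.
API_URL = "https://api.example.com/v1/agents"
API_KEY = os.environ["EVAL_API_KEY"]  # hypothetical env var holding your key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Option A: register a live agent by its callback endpoint.
resp = requests.post(
    API_URL,
    headers=HEADERS,
    json={
        "name": "my-research-agent",
        "framework": "langchain",  # or "autogen", "crewai", "custom"
        "endpoint": "https://my-agent.example.com/invoke",
    },
    timeout=30,
)
resp.raise_for_status()
agent_id = resp.json()["id"]

# Option B: upload traces captured from a previous run (one JSON object per line).
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]
requests.post(
    f"{API_URL}/{agent_id}/traces",
    headers=HEADERS,
    json={"traces": traces},
    timeout=30,
).raise_for_status()
```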
Select Evaluation Suite
Choose from pre-built task suites or create custom scenarios. Each suite contains multi-turn research tasks calibrated for a specific domain: biomedical, legal, financial, or general knowledge.
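A run configuration combining a pre-built suite with a custom scenario might look like the sketch below. The suite identifier and field names are assumptions for illustration, not the platform's actual schema:

```python
# Hypothetical run configuration; suite names and fields are placeholders.
run_config = {
    "suite": "biomedical-v2",  # pre-built domain suite (invented identifier)
    "custom_scenarios": [      # optionally add your own multi-turn tasks
        {
            "prompt": "Summarize recent evidence on statin use in adults over 75.",
            "turns": 3,                                # multi-turn research task
            "ground_truth": "gt/statins_elderly.json", # reference answers for Accuracy
        }
    ],
    "seed": 42,  # fix randomness so runs are reproducible
}
```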
Agent Runs Scenarios
Your agent executes a battery of research tasks under controlled conditions. We capture every reasoning step, tool call, source retrieval, and synthesis decision for analysis.
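To make the captured data concrete, here is a hypothetical shape for a single recorded step. The field names and step kinds are assumptions, not the platform's actual trace schema:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical shape of one captured step in a run; illustrative only.
@dataclass
class TraceStep:
    step: int                 # position in the reasoning chain
    kind: str                 # "reasoning" | "tool_call" | "retrieval" | "synthesis"
    content: str              # the agent's text, or tool-call arguments
    sources: list[str] = field(default_factory=list)   # URLs retrieved at this step
    metadata: dict[str, Any] = field(default_factory=dict)

# Example: a single retrieval step as it might appear in a captured run.
example = TraceStep(
    step=3,
    kind="retrieval",
    content="search: statin efficacy adults over 75",
    sources=["https://example.com/source-1"],
)
```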
DREAM Metrics Computed
Our scoring engine evaluates the agent on five axes: Depth, Reasoning, Evidence quality, Accuracy, and Multi-step coherence. Each metric uses calibrated rubrics validated against expert human judgment.
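As a rough illustration of how five axis scores could roll up into one number. The 0-1 rubric scales, equal default weights, and weighted-mean aggregation here are assumptions, not the published scoring rule:

```python
# Sketch: each axis is scored 0-1 by a rubric, then rolled up to a 0-100
# composite. Equal default weights and the weighted mean are assumptions.
DREAM_AXES = ("depth", "reasoning", "evidence", "accuracy", "multistep")

def dream_score(axis_scores, weights=None):
    """Combine five 0-1 axis scores into a single 0-100 composite."""
    weights = weights or {axis: 1.0 for axis in DREAM_AXES}
    total = sum(weights[a] for a in DREAM_AXES)
    return 100.0 * sum(axis_scores[a] * weights[a] for a in DREAM_AXES) / total

scores = {"depth": 0.82, "reasoning": 0.74, "evidence": 0.91,
          "accuracy": 0.88, "multistep": 0.69}
print(round(dream_score(scores), 1))  # 80.8
```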
Report Delivered
Receive a detailed benchmark report with per-task breakdowns, reasoning trace visualizations, comparative rankings, and actionable improvement recommendations.
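The report is easiest to picture as structured data. This sketch of a per-task breakdown is hypothetical; the real report's fields and format may differ:

```python
# Invented report structure for illustration; key names are placeholders.
report = {
    "agent_id": "agt_123",
    "suite": "biomedical-v2",
    "overall": 80.8,
    "per_task": [
        {"task": "statins_elderly",
         "depth": 0.82, "reasoning": 0.74, "evidence": 0.91,
         "accuracy": 0.88, "multistep": 0.69,
         "notes": "Missed two relevant sub-questions on drug interactions."},
    ],
    "rank": {"percentile": 87, "cohort": "biomedical-v2 submissions"},
    "recommendations": [
        "Broaden sub-question generation before retrieval.",
    ],
}
```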
The DREAM Framework
Five calibrated dimensions for evaluating agentic research quality.
Depth
Measures how thoroughly the agent explores a research question — topic coverage, sub-question generation, and information completeness.
Reasoning
Evaluates logical coherence across the research chain — argument structure, inferential validity, and conclusion support.
Evidence
Scores the quality, relevance, and diversity of sources cited — fact grounding, source authority, and citation accuracy.
Accuracy
Validates factual correctness of all claims and conclusions against verified ground-truth datasets.
Multi-step
Assesses performance on complex tasks requiring planning, tool-use sequencing, and multi-turn synthesis.
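To show how the calibrated rubrics from step four map onto these dimensions, here is an invented set of anchor points for the Evidence axis. The actual rubrics are validated against expert human judgment and are not reproduced here:

```python
# Illustrative rubric anchors for the Evidence axis (scores on a 0-1 scale).
EVIDENCE_RUBRIC = {
    1.00: "Every claim grounded in an authoritative, correctly cited source.",
    0.75: "Most claims grounded; minor citation inaccuracies.",
    0.50: "Mixed grounding; some claims rest on low-authority sources.",
    0.25: "Sparse or weak sourcing; citations frequently inaccurate.",
    0.00: "Claims ungrounded or sources fabricated.",
}
```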