How It Works

Five steps from agent submission to production-ready evaluation report.

01

Submit Your Agent

Connect your research agent via our API or upload traces from a previous run. We support the major agent frameworks — LangChain, AutoGen, and CrewAI — as well as custom implementations.

curl -X POST https://api.dreameval.ai/v1/evaluate \
  -H 'Authorization: Bearer dk_...' \
  -d '{"agent_url": "https://...", "suite": "deep-research-v2"}'
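The same submission can be made from Python. A minimal sketch using only the standard library, assuming the endpoint and request fields shown in the curl example above (the response format is not specified here, so this only assembles the request):

```python
import json
import urllib.request

API_URL = "https://api.dreameval.ai/v1/evaluate"  # endpoint from the curl example

def build_submission(agent_url: str, suite: str) -> bytes:
    """Serialize the request body shown in the curl example."""
    return json.dumps({"agent_url": agent_url, "suite": suite}).encode("utf-8")

def make_request(api_key: str, agent_url: str, suite: str) -> urllib.request.Request:
    """Assemble the POST request; sending it (urlopen) is left to the caller."""
    return urllib.request.Request(
        API_URL,
        data=build_submission(agent_url, suite),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = make_request("dk_example", "https://agent.example.com/run", "deep-research-v2")
```

Pass the resulting request to `urllib.request.urlopen(req)` to submit; the key and agent URL here are placeholders.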
02

Select Evaluation Suite

Choose from pre-built task suites or create custom scenarios. Each suite contains multi-turn research tasks calibrated for specific domains — biomedical, legal, financial, or general knowledge.

Available suites: deep-research-v2, biomedical-qa, legal-analysis, financial-dd, custom
03

Agent Runs Scenarios

Your agent executes a battery of research tasks under controlled conditions. We capture every reasoning step, tool call, source retrieval, and synthesis decision for analysis.

Avg. scenarios per suite: 25 | Avg. eval time: 45 min | Max context: 128k tokens
04

DREAM Metrics Computed

Our scoring engine evaluates the agent on five axes: Depth, Reasoning, Evidence quality, Accuracy, and Multi-step coherence. Each metric uses calibrated rubrics validated against expert human judgment.

Inter-annotator agreement: κ = 0.87 | Metric reliability: α = 0.92
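The κ figure above is Cohen's kappa, which measures agreement between two annotators corrected for chance. A minimal sketch of the standard computation — the ratings below are illustrative only, not DreamEval data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative pass/fail ratings from two hypothetical annotators.
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(rater_1, rater_2)  # 5/6 observed, 0.5 expected → 2/3
```

A kappa of 0.87 on a calibrated rubric indicates near-expert-level consistency between graders.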
05

Report Delivered

Receive a detailed benchmark report with per-task breakdowns, reasoning trace visualizations, comparative rankings, and actionable improvement recommendations.

Formats: Interactive dashboard, PDF, JSON API response
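The JSON response can be processed programmatically, for example to find which DREAM metric to improve first. A sketch assuming a hypothetical report schema — the field names below are illustrative, not the documented API:

```python
import json

# Hypothetical report payload; field names are illustrative only.
report = json.loads("""
{
  "suite": "deep-research-v2",
  "tasks": [
    {"id": "t1", "metrics": {"depth": 0.9, "reasoning": 0.8, "evidence": 0.7,
                             "accuracy": 0.85, "multistep": 0.7}},
    {"id": "t2", "metrics": {"depth": 0.7, "reasoning": 0.75, "evidence": 0.8,
                             "accuracy": 0.9, "multistep": 0.8}}
  ]
}
""")

def weakest_metric(report):
    """Average each metric across tasks and return the lowest-scoring one."""
    totals = {}
    for task in report["tasks"]:
        for name, score in task["metrics"].items():
            totals[name] = totals.get(name, 0.0) + score
    n = len(report["tasks"])
    averages = {name: total / n for name, total in totals.items()}
    return min(averages, key=averages.get)

focus = weakest_metric(report)
```

In this illustrative payload, Evidence averages lowest across tasks, so source quality would be the first improvement target.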

The DREAM Framework

Five calibrated dimensions for evaluating agentic research quality.

D

Depth

Measures how thoroughly the agent explores a research question — topic coverage, sub-question generation, and information completeness.

R

Reasoning

Evaluates logical coherence across the research chain — argument structure, inferential validity, and conclusion support.

E

Evidence

Scores the quality, relevance, and diversity of sources cited — fact grounding, source authority, and citation accuracy.

A

Accuracy

Validates factual correctness of all claims and conclusions against verified ground-truth datasets.

M

Multi-step

Assesses performance on complex tasks requiring planning, tool-use sequencing, and multi-turn synthesis.
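The five dimensions can be rolled up into a single composite score. A sketch assuming equal weights — the weighting is an assumption for illustration, not DreamEval's published scoring scheme:

```python
# Equal weights are an assumption; DreamEval's actual weighting is not specified here.
DREAM_DIMENSIONS = ("depth", "reasoning", "evidence", "accuracy", "multistep")

def dream_composite(scores, weights=None):
    """Weighted mean of the five DREAM dimension scores (each in [0, 1])."""
    weights = weights or {d: 1.0 for d in DREAM_DIMENSIONS}
    total_weight = sum(weights[d] for d in DREAM_DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DREAM_DIMENSIONS) / total_weight

composite = dream_composite({"depth": 0.9, "reasoning": 0.8, "evidence": 0.7,
                             "accuracy": 0.85, "multistep": 0.75})
```

Passing a custom `weights` dict lets a team emphasize the dimensions that matter most for their domain, e.g. weighting Accuracy higher for biomedical tasks.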

Request a Demo