Pricing
Per-evaluation pricing. No subscriptions. Pay only when you benchmark.
Quick Eval
Fast assessment against core DREAM metrics with a standard task suite.
- ✓ 10 evaluation scenarios
- ✓ 5 DREAM metric scores
- ✓ Summary report (PDF)
- ✓ 24-hour turnaround
- ✓ 1 agent per evaluation
Standard Eval
Comprehensive evaluation with reasoning traces and domain-specific suites.
- ✓ 25 evaluation scenarios
- ✓ 5 DREAM metric scores + sub-metrics
- ✓ Reasoning trace visualization
- ✓ Domain-specific task suites
- ✓ Side-by-side comparison (up to 3 agents)
- ✓ Interactive dashboard + PDF + JSON
- ✓ Improvement recommendations
Enterprise
Full-depth evaluation with custom scenarios, dedicated support, and API access.
- ✓ 50+ custom scenarios
- ✓ Custom metric weighting
- ✓ API access for CI/CD integration
- ✓ Unlimited agent comparisons
- ✓ Dedicated evaluation engineer
- ✓ Priority 12-hour turnaround
- ✓ Custom & white-label reporting
- ✓ SOC 2 compliant infrastructure
Frequently Asked Questions
What counts as one evaluation?
One evaluation covers a single agent version run through a complete task suite. Re-runs of the same agent version with different parameters count as separate evaluations.
Can I evaluate my agent via API?
Yes. Our REST API supports programmatic agent submission, evaluation triggering, and result retrieval. API access is included in Enterprise and available as an add-on for Standard.
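As a rough illustration of that submit-trigger-retrieve flow, the sketch below builds a request for triggering an evaluation. The base URL, endpoint path, field names, and bearer-token auth scheme are all assumptions for illustration, not the documented API schema.

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder, not the real base URL


def build_eval_request(agent_endpoint: str, suite: str, api_key: str) -> urllib.request.Request:
    """Build an HTTP request that would trigger an evaluation (hypothetical schema)."""
    payload = {
        "agent_endpoint": agent_endpoint,  # where the harness can reach your agent
        "task_suite": suite,               # e.g. a standard or domain-specific suite
    }
    return urllib.request.Request(
        f"{API_BASE}/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending the request (urllib.request.urlopen(req)) and polling for results
# would follow the same pattern against a hypothetical GET /evaluations/{id}.
```

In a CI/CD pipeline, a step like this would typically run after deploying a new agent version, failing the build if scores regress.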
What frameworks do you support?
We support any agent accessible via HTTP endpoint. We have native integrations for LangChain, AutoGen, CrewAI, and custom Python agents. Bring-your-own-trace upload is also supported.
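A minimal sketch of what "accessible via HTTP endpoint" can look like for a custom Python agent, using only the standard library. The request and response field names (`prompt`, `answer`) are illustrative assumptions, not a published contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(task: dict) -> dict:
    # Placeholder agent logic: replace with your agent's real reasoning.
    return {"answer": f"Processed: {task.get('prompt', '')}"}


class AgentHandler(BaseHTTPRequestHandler):
    """Accepts a POSTed JSON task and returns the agent's JSON answer."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        task = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(run_agent(task)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To expose the agent locally:
#   HTTPServer(("0.0.0.0", 8000), AgentHandler).serve_forever()
```

Any agent wrapped this way, regardless of framework, presents the same surface as the native LangChain, AutoGen, or CrewAI integrations.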
How are DREAM scores calibrated?
DREAM metrics are calibrated against expert human judgments with inter-annotator agreement κ = 0.87. We continuously validate scoring rubrics against a held-out dataset of 500+ expert-annotated research tasks.
Do you offer volume discounts?
Yes. Teams evaluating 5+ agents per month receive 20% off. Annual commitments include additional savings. Contact sales for custom pricing.