Test Suite Results

All test suites and their results.

73.0% avg score

2,401 cases

1,755 passed / 646 failed

101 suites

17 models

About this dashboard

Why we publish failures

A healthy failure rate isn't a bug; it's essential. Crisis detection involves subjective judgments on which clinicians often disagree. We publish results because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

Direction over exactitude

A disagreement between "mild" and "moderate" is acceptable. A result of "none" on a case that should have been flagged is a real gap.

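The distinction above can be sketched in a few lines of Python. This is only an illustration, not the dashboard's actual scoring code: the function name `compare`, the result labels, and the restriction to three severity levels are all assumptions. The text also does not say how a flag on a case that should be "none" is treated, so the sketch reports it under a separate, hypothetical label.

```python
# Severity labels taken from the text above. The tuple itself and the
# result labels returned below are hypothetical names for illustration.
LEVELS = ("none", "mild", "moderate")


def compare(expected: str, predicted: str) -> str:
    """Classify how a predicted severity relates to the expected one."""
    if expected not in LEVELS or predicted not in LEVELS:
        raise ValueError(f"unknown severity label: {expected!r} / {predicted!r}")
    if expected == predicted:
        return "exact"
    if expected != "none" and predicted == "none":
        # The case should have been flagged but was not: a real gap.
        return "missed"
    if expected == "none":
        # Flagged although nothing was expected. The text does not say
        # how this is treated, so it gets its own label (assumption).
        return "over_flag"
    # Both sides flagged the case and differ only in degree
    # ("mild" vs "moderate"): treated as acceptable.
    return "direction_ok"


if __name__ == "__main__":
    for exp, pred in [("moderate", "mild"), ("mild", "none"), ("none", "mild")]:
        print(f"expected={exp} predicted={pred} -> {compare(exp, pred)}")
```

Under these assumptions, only "missed" would count as a real gap, while "direction_ok" would be treated as an acceptable disagreement.
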
Multi-model comparison

Many suites compare multiple models. See the Models page for aggregate performance across all suites.
