Test Suite Results

All test suites and their results. Click any row for details.

73.0% avg score | 2,401 cases | 1,755 pass / 646 fail | 101 suites | 17 models
About this dashboard

Why we publish failures

A healthy failure rate isn't a bug—it's essential. Crisis detection involves subjective judgments where clinicians often disagree. We publish results because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

Direction over exactitude

A disagreement between "mild" and "moderate" is acceptable. Returning "none" when the case should have been flagged is a real gap.
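
As a rough illustration of this rule, here is a minimal Python sketch of how a single case might be graded. It is not the dashboard's actual scoring code: the `grade` function, the result labels, and the "high" level are all hypothetical, and the "false_alarm" outcome is an assumption, since the text above only addresses missed flags.

```python
# Hypothetical severity labels; only "none", "mild", and "moderate"
# appear in the text above. "high" is an assumed extra level.
SEVERITY_LEVELS = {"none", "mild", "moderate", "high"}


def grade(expected: str, predicted: str) -> str:
    """Grade one case by direction rather than exact severity.

    Returns one of:
      "match"       - labels are identical
      "acceptable"  - both flagged a risk but disagree on severity
      "gap"         - a risk was expected but the prediction was "none"
      "false_alarm" - no risk was expected but one was flagged (assumed
                      category; not described in the source text)
    """
    for label in (expected, predicted):
        if label not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity label: {label!r}")

    if expected == predicted:
        return "match"
    if predicted == "none":
        # A missed flag: the "real gap" case.
        return "gap"
    if expected == "none":
        return "false_alarm"
    # Both are non-"none" and differ, e.g. "mild" vs "moderate".
    return "acceptable"
```

Under these assumptions, `grade("moderate", "mild")` falls into the acceptable bucket, while `grade("mild", "none")` is counted as a gap.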

Multi-model comparison

Many suites compare multiple models. Visit Models for aggregate performance across all suites.

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.