Test Suite Results

A curated subset of NOPE's benchmark corpus. Each suite tests a specific detection challenge (crisis signals, demographic gaps, false-positive prevention, AI-specific risk patterns) across NOPE and a fixed panel of comparator models. Click any row for per-case detail. Only the first 8 cases per suite are shown verbatim; the rest are truncated for public release. Aggregate F1, precision, and recall are still computed on the full case set, so truncation does not affect them. See methodology for the full benchmark setup.
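As a rough illustration of how aggregate precision, recall, and F1 relate to per-case outcomes, here is a minimal sketch. It assumes each case reduces to a binary "should flag" / "did flag" pair; the function name and input shape are hypothetical and are not taken from NOPE's benchmark code.

```python
from typing import Iterable, Tuple


def precision_recall_f1(cases: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over (expected_flag, predicted_flag) pairs.

    Hypothetical helper: the real benchmark may score cases differently.
    """
    tp = fp = fn = 0
    for expected, predicted in cases:
        if predicted and expected:
            tp += 1  # correctly flagged
        elif predicted and not expected:
            fp += 1  # flagged a benign case
        elif expected and not predicted:
            fn += 1  # missed a case that should have been flagged

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Metrics are computed over every case, not only the ones displayed.
all_cases = [(True, True)] * 8 + [(False, True)] * 2 + [(True, False)] * 2
p, r, f1 = precision_recall_f1(all_cases)
```

Because the metrics are derived from the full list, hiding some rows in the display leaves the reported numbers unchanged.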

93.0% avg score | 355 cases | 329 / 26 pass/fail | 17 suites | 8 models
About this dashboard

Why we publish failures

A healthy failure rate isn't a bug; it's essential. Crisis detection involves subjective judgments on which clinicians often disagree. We publish results, failures included, because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

Direction over exactitude

Disagreements between neighboring severity levels, such as "mild" vs. "moderate", are acceptable. Returning "none" when a case should have been flagged is a real gap.
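One way to picture this rule is the sketch below. It is an assumption-laden illustration, not NOPE's scoring logic: the function name and outcome labels are hypothetical, only the "none", "mild", and "moderate" levels come from this page, and the treatment of a flag on a case expected to be "none" is a guess.

```python
def judge(expected: str, predicted: str) -> str:
    """Classify a severity disagreement by direction rather than exact level.

    Hypothetical helper; the labels returned here are illustrative only.
    """
    if expected == predicted:
        return "match"
    if expected != "none" and predicted == "none":
        # Missed a case that should have been flagged: the real gap.
        return "gap"
    if expected == "none" and predicted != "none":
        # Assumption: flagging a benign case is tracked separately
        # (for example by false-positive suites), not as a missed risk.
        return "over_flag"
    # Both sides flagged something; only the level differs.
    return "acceptable"


verdicts = [
    judge("moderate", "mild"),
    judge("mild", "none"),
    judge("none", "none"),
]
```

Under these assumptions the three calls yield an acceptable disagreement, a gap, and a match respectively.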

Multi-model comparison

Many suites compare multiple models. Visit Models for aggregate performance across all suites.

17 suites

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.