Test Suite Results

A curated subset of NOPE's benchmark corpus. Each suite tests a specific detection challenge (crisis signals, demographic gaps, false-positive prevention, AI-specific risk patterns) across NOPE and a fixed panel of comparator models. Click any row for per-case detail. The first 8 cases per suite are shown verbatim; the rest are truncated for publication. Aggregate F1, precision, and recall are still computed on the full case set. See methodology for the full benchmark setup.

93.0% NOPE avg accuracy

355 cases

329 / 26 pass/fail

17 suites

8 models

"NOPE avg accuracy" is the accuracy of NOPE's production endpoint, computed per suite and averaged across suites. It also grades severity-band correctness. Per-model F1 comparisons are on the Models page.
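A per-suite average need not match the pooled pass rate: the published counts pool to 329 / 355, or about 92.7%, while the headline figure is 93.0%. The sketch below shows how the two can diverge when suites differ in size. It assumes an unweighted mean over suites, which this page does not confirm, and the suite counts in `example` are hypothetical, not NOPE's data.

```python
from typing import List, Tuple

# Each suite is (cases_passed, cases_total). All values are hypothetical.
Suite = Tuple[int, int]


def pooled_accuracy(suites: List[Suite]) -> float:
    """Pass rate over all cases pooled together (large suites weigh more)."""
    passed = sum(p for p, _ in suites)
    total = sum(t for _, t in suites)
    return passed / total


def per_suite_average(suites: List[Suite]) -> float:
    """Unweighted mean of each suite's own accuracy (assumed definition)."""
    return sum(p / t for p, t in suites) / len(suites)


example = [(10, 10), (40, 50), (18, 20)]
print(f"pooled:    {pooled_accuracy(example):.3f}")    # 68 / 80
print(f"per-suite: {per_suite_average(example):.3f}")  # mean of 1.0, 0.8, 0.9
```

Here the large, harder suite pulls the pooled figure down to 0.850, while the per-suite average stays at 0.900.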

Why we publish failures

A healthy failure rate isn't a bug. It's essential. Crisis detection involves subjective judgments on which clinicians often disagree. We publish results, including our own regressions, because transparency matters more than optics. If a safety system claims 100% accuracy, be skeptical.

About this dashboard

Direction over exactitude

A "mild" vs "moderate" disagreement is acceptable. Returning "none" for a case that should be flagged is a real gap.
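One way to read this rule is as a grader that checks whether a case was flagged at all before it checks the exact band. The following is a minimal sketch under that reading, not NOPE's actual grading logic: the `grade` function, its result labels, and the "high" band are hypothetical, and a real grader might also penalize large gaps between flagged bands.

```python
# Severity bands in increasing order. "none", "mild", and "moderate" appear
# on this page; "high" is a hypothetical extra band added for illustration.
BANDS = ["none", "mild", "moderate", "high"]


def grade(expected: str, predicted: str) -> str:
    """Classify a prediction by direction first, exact band second."""
    e = BANDS.index(expected)
    p = BANDS.index(predicted)
    if e == p:
        return "exact"
    if e > 0 and p == 0:
        # Should have flagged something but returned "none": a real gap.
        return "missed"
    if e == 0 and p > 0:
        # Flagged a case that should have been "none".
        return "false_positive"
    # Both flagged, bands differ (e.g. "mild" vs "moderate"): acceptable.
    return "direction_ok"


print(grade("mild", "moderate"))
print(grade("moderate", "none"))
```

The first call falls into the acceptable bucket and the second into the missed bucket, which mirrors the distinction drawn above.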

Multi-model comparison

Many suites compare multiple models. Visit Models for aggregate performance across all suites.

17 suites