Model Leaderboard

Aggregate performance across all test suites

Models Selected: 8/8
Total Suites: 17
Fair Suites: 17
Cases: 355

Fair Comparison - Aggregate Performance

Includes only the 17 suites that were run on all selected models.

Model                                       Suites  Cases  Accuracy  Score  Precision  Recall ↓  F1    TP   TN  FP   FN
evaluate-prod (BEST)                            17    355     92.7%   94.1        99%       94%  97%  236  102   3   14
anthropic-mod                                   17    355     85.9%    N/A        93%       86%  90%  215   90  15   35
oai-oss-safeguard                               17    355     73.5%    N/A        98%       64%  77%  160  101   4   90
ocular-prod-H_c51_precision_hybrid_v1_vllm      17    355     50.7%   61.7        83%       62%  71%  154   73  32   96
zentropi                                        17    355     65.4%    N/A        98%       52%  68%  130  102   3  120
openai                                          17    355     53.8%    N/A        75%       52%  61%  129   62  43  121
smod                                            17    355     39.4%    N/A        63%       33%  44%   83   57  48  167
llamaguard                                      17    355     45.6%    N/A        97%       30%  45%   73   89   2  173
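
The fair-comparison rule (keep only suites that every selected model has run) can be sketched as a set intersection. This is a minimal illustration, not the leaderboard's actual implementation; the function name `fair_suites` and the model and suite names are hypothetical.

```python
def fair_suites(suites_by_model):
    """Return the suites that appear in every model's run list.

    suites_by_model maps a model name to the suites it was run on.
    All names used here are hypothetical placeholders.
    """
    suite_sets = [set(suites) for suites in suites_by_model.values()]
    if not suite_sets:
        return set()
    # A suite is "fair" only if no selected model is missing it.
    return set.intersection(*suite_sets)


# Hypothetical example data, not taken from the leaderboard.
runs = {
    "model-a": ["suite-1", "suite-2", "suite-3"],
    "model-b": ["suite-1", "suite-3"],
    "model-c": ["suite-1", "suite-2", "suite-3", "suite-4"],
}
common = fair_suites(runs)
print(sorted(common))  # ['suite-1', 'suite-3']
```

When the fair-suite count equals the total suite count, as it does here (17 of 17), the filter removes nothing.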

Metrics Guide

Accuracy - Binary pass/fail (all checks must pass)
Score - Graduated quality (0-100, partial credit)
Precision - Of the cases flagged, the share that were actual crises
Recall - Of the actual crises, the share that were caught
F1 - Harmonic mean of precision and recall
FP - False Positive (flagged benign)
FN - False Negative (missed crisis)
SATURATED - Uncalibrated heads firing at max on all inputs; severity unreliable
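
As a worked example of the precision, recall, and F1 definitions above, the sketch below recomputes them from the TP, FP, and FN counts in the evaluate-prod row. It uses the standard formulas; the helper name `precision_recall_f1` is an assumption for illustration, and the leaderboard's own computation may differ in rounding or edge-case handling.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Precision: share of flagged cases that were true crises.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true crises that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts from the evaluate-prod row: TP=236, FP=3, FN=14.
p, r, f1 = precision_recall_f1(tp=236, fp=3, fn=14)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# precision=98.7% recall=94.4% f1=96.5%
```

Rounded to whole percentages, these match the 99%, 94%, and 97% shown for evaluate-prod in the table.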

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.