Model Leaderboard

Aggregate performance across all test suites

How to read this: the comparator-fair numbers are F1, precision, and recall on the flag decision. Accuracy additionally grades NOPE on severity-band correctness, so it is not comparable across models and can sit far below F1. These scores measure detection of crisis-shaped signal on our published corpus; they are not predictive, not diagnostic, and not a clinical validation. Full setup, comparator configs, and our corrections / right-of-reply policy are described in the methodology.

Fair comparison only (17 suites)

Models Selected: 8/8

Fair Suites: 17

Cases: 355

Fair Comparison - Aggregate Performance

Includes only the 17 suites that were run on all selected models

Model	Suites	Cases	Accuracy	Score	Precision	Recall ↓	F1	TP	TN	FP	FN
evaluate-prod BEST	17	355	92.7%	94.1	99%	94%	97%	236	102	3	14
anthropic-mod	17	355	85.9%	N/A	93%	86%	90%	215	90	15	35
oai-oss-safeguard	17	355	73.5%	N/A	98%	64%	77%	160	101	4	90
ocular-prod-H_c51_precision_hybrid_v1_vllm	17	355	50.7%	61.7	83%	62%	71%	154	73	32	96
zentropi	17	355	65.4%	N/A	98%	52%	68%	130	102	3	120
openai	17	355	53.8%	N/A	75%	52%	61%	129	62	43	121
smod	17	355	39.4%	N/A	63%	33%	44%	83	57	48	167
llamaguard	17	355	45.6%	N/A	97%	30%	45%	73	89	2	173
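The note above that Accuracy is not comparable across models can be checked against the table itself. The sketch below recomputes a flag-only accuracy, (TP + TN) / Cases, for three rows copied from the table. The helper name `flag_accuracy` and the reading that comparator Accuracy is plain flag-decision accuracy are assumptions made for illustration, not something the page states.

```python
# Minimal sketch: flag-only accuracy from the confusion counts in the table.
# Assumption: "flag-only accuracy" means (TP + TN) / Cases; the page does not
# publish this formula, so treat it as an illustrative reading.

TOTAL_CASES = 355  # "Cases" column, identical for every row


def flag_accuracy(tp: int, tn: int, total: int) -> float:
    """Share of cases where the flag decision alone was correct."""
    return (tp + tn) / total


# (TP, TN, reported Accuracy %) copied from the table rows.
rows = {
    "evaluate-prod": (236, 102, 92.7),
    "anthropic-mod": (215, 90, 85.9),
    "ocular-prod-H_c51_precision_hybrid_v1_vllm": (154, 73, 50.7),
}

for name, (tp, tn, reported) in rows.items():
    flag_only = 100 * flag_accuracy(tp, tn, TOTAL_CASES)
    print(f"{name}: flag-only {flag_only:.1f}% vs reported {reported}%")
```

For anthropic-mod the two figures coincide, while for evaluate-prod and ocular-prod the reported Accuracy sits below the flag-only figure, which is consistent with the extra severity-band check described above.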

Metrics Guide

Accuracy - Binary pass/fail (all checks must pass)

Score - Graduated quality (0-100, partial credit)

Precision - When flagged, how often correct?

Recall - Of actual crises, how many caught?

F1 - Harmonic mean of precision and recall

TP - True Positive (flagged crisis)

TN - True Negative (benign, not flagged)

FP - False Positive (flagged benign)

FN - False Negative (missed crisis)

SATURATED - Uncalibrated heads firing at max on all inputs; severity unreliable
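
As a worked example of the precision, recall, and F1 definitions in this guide, the sketch below recomputes them from the TP / FP / FN counts of the evaluate-prod row. The function name `flag_metrics` is hypothetical; the formulas are the standard ones, and the page's own tooling may round or aggregate differently.

```python
# Minimal sketch: precision, recall, and F1 from confusion counts.
# `flag_metrics` is a hypothetical helper, not part of the leaderboard's code.


def flag_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Return precision, recall, and F1 as fractions in [0, 1]."""
    # Precision: of everything flagged, how much was a real crisis?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real crises, how many were flagged?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts copied from the evaluate-prod row: TP=236, FP=3, FN=14.
metrics = flag_metrics(tp=236, fp=3, fn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, the results match the 99% / 94% / 97% shown for that row.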