Model Leaderboard
Aggregate performance across all test suites
Models Selected: 8/8
Total Suites: 17
Fair Suites: 17
Cases: 355
Fair Comparison - Aggregate Performance
Includes only the 17 suites that were run on all selected models
| Model | Suites | Cases | Accuracy | Score | Precision | Recall ↓ | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| evaluate-prod (BEST) | 17 | 355 | 92.7% | 94.1 | 99% | 94% | 97% | 236 | 102 | 3 | 14 |
| anthropic-mod | 17 | 355 | 85.9% | N/A | 93% | 86% | 90% | 215 | 90 | 15 | 35 |
| oai-oss-safeguard | 17 | 355 | 73.5% | N/A | 98% | 64% | 77% | 160 | 101 | 4 | 90 |
| ocular-prod-H_c51_precision_hybrid_v1_vllm | 17 | 355 | 50.7% | 61.7 | 83% | 62% | 71% | 154 | 73 | 32 | 96 |
| zentropi | 17 | 355 | 65.4% | N/A | 98% | 52% | 68% | 130 | 102 | 3 | 120 |
| openai | 17 | 355 | 53.8% | N/A | 75% | 52% | 61% | 129 | 62 | 43 | 121 |
| smod | 17 | 355 | 39.4% | N/A | 63% | 33% | 44% | 83 | 57 | 48 | 167 |
| llamaguard | 17 | 355 | 45.6% | N/A | 97% | 30% | 45% | 73 | 89 | 2 | 173 |
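The fair-comparison filter (only suites run on every selected model) amounts to a set intersection. The sketch below is an illustration, not the leaderboard's actual implementation: the function name `fair_suites`, the `runs` mapping, and the model and suite names are all hypothetical.

```python
def fair_suites(runs: dict[str, set[str]]) -> set[str]:
    """Return the suites that every model in `runs` has results for."""
    if not runs:
        return set()
    # A suite is "fair" only if it appears in every model's set of completed suites.
    return set.intersection(*runs.values())


# Hypothetical data: which suites each model has completed.
runs = {
    "model-a": {"suite-1", "suite-2", "suite-3"},
    "model-b": {"suite-1", "suite-3"},
    "model-c": {"suite-1", "suite-2", "suite-3", "suite-4"},
}

print(sorted(fair_suites(runs)))  # ['suite-1', 'suite-3']
```

When the fair-suite count equals the total suite count, as it does here (17 of 17), every selected model has been run on every suite.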
Metrics Guide
Accuracy - Binary pass/fail per case (all checks must pass)
Score - Graduated quality (0-100, partial credit)
Precision - Of the cases flagged, how many were actual crises?
Recall - Of the actual crises, how many were caught?
F1 - Harmonic mean of precision and recall
TP - True Positive (flagged crisis)
TN - True Negative (benign, not flagged)
FP - False Positive (flagged benign)
FN - False Negative (missed crisis)
SATURATED - Uncalibrated heads firing at max on all inputs; severity unreliable
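As a check on the guide above, Precision, Recall, and F1 can be recomputed from the TP/FP/FN columns. This is a minimal sketch assuming the standard definitions; the `Confusion` class is hypothetical and not part of the leaderboard tooling. Accuracy and Score are not derivable from these four counts, since they depend on per-case checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one model (hypothetical helper)."""
    tp: int
    tn: int
    fp: int
    fn: int

    def precision(self) -> float:
        # Of everything flagged, the fraction that was a real crisis.
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        # Of all real crises, the fraction that was flagged.
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in count form.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Counts taken from the evaluate-prod row of the table.
evaluate_prod = Confusion(tp=236, tn=102, fp=3, fn=14)
print(f"{evaluate_prod.precision():.0%} "
      f"{evaluate_prod.recall():.0%} "
      f"{evaluate_prod.f1():.0%}")  # prints 99% 94% 97%
```

The printed values match the Precision, Recall, and F1 columns reported for evaluate-prod.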