Model Leaderboard
Aggregate performance across all test suites
Models Selected: 17/17
Total Suites: 101
Fair Suites: 0
Cases: 0
Fair Comparison - Aggregate Performance
Only includes suites run on ALL selected models (currently 0)
| Model | Suites | Cases | Accuracy | Score | Precision | Recall ↓ | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic-mod | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| azure | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14e | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-cascade-v3 | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-dual-cascade | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-panel | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| legacy-v0-evaluate | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| legacy-v0-screen | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| llamaguard | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| minime-v14d | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| minime-v14d-mini | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| oai-oss-safeguard | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| openai | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| smod | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| zentropi | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
Metrics Guide
Accuracy - Binary pass/fail (all checks must pass)
Score - Graduated quality (0-100, partial credit)
Precision - Of items flagged, how many were correct?
Recall - Of actual crises, how many were caught?
F1 - Harmonic mean of precision and recall
TP - True Positive (correctly flagged crisis)
TN - True Negative (correctly passed benign)
FP - False Positive (flagged benign)
FN - False Negative (missed crisis)
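The metrics above follow the standard confusion-matrix definitions. As a minimal sketch (the helper name and example counts below are illustrative, not taken from the leaderboard's own code), this is how Accuracy, Precision, Recall, and F1 are derived from the TP/TN/FP/FN columns:

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from raw confusion counts.

    Each ratio guards against a zero denominator, which is exactly the
    situation in a leaderboard row with 0 cases: all metrics report 0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of items flagged, how many were correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of actual crises, how many were caught?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical run: 8 crises caught, 90 benign passed, 1 benign flagged, 1 crisis missed
print(confusion_metrics(tp=8, tn=90, fp=1, fn=1))
# → accuracy 0.98; precision, recall, and F1 all 8/9 ≈ 0.889
```

Note that with zero cases every ratio falls back to 0 rather than raising a division error, matching the all-zero rows in the table above.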