Model Leaderboard

Aggregate performance across all test suites

Models selected: 17/17 · Total suites: 101 · Fair suites: 0 · Cases: 0

Fair Comparison - Aggregate Performance

Includes only suites run on ALL selected models (currently 0).
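The "fair comparison" rule above can be sketched as a set intersection: a suite counts only if every selected model has results for it. This is an illustrative sketch, not the dashboard's actual implementation; the `suite_runs` mapping and its contents are hypothetical.

```python
# Hypothetical per-model suite coverage; names are placeholders.
suite_runs = {
    "model-a": {"suite-1", "suite-2"},
    "model-b": {"suite-1"},
    "model-c": {"suite-2"},
}

def fair_suites(runs):
    """Return the suites run on ALL selected models (set intersection)."""
    sets = list(runs.values())
    return set.intersection(*sets) if sets else set()

# No single suite covers all three models, so the fair set is empty --
# the same situation as the "0 Fair Suites" shown above.
print(fair_suites(suite_runs))
```

With no suite shared by every model, every aggregate metric in the table below degenerates to zero, which is why the leaderboard currently shows no usable comparison.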

| Model | Suites | Cases | Accuracy | Score | Precision | Recall ↓ | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic-mod (BEST) | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| azure | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14e | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-cascade-v3 | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-dual-cascade | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| baseten-v14f-mini-panel | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| legacy-v0-evaluate | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| legacy-v0-screen | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| llamaguard | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| minime-v14d | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| minime-v14d-mini | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| oai-oss-safeguard | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| openai | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| smod | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |
| zentropi | 0 | 0 | 0.0% | N/A | 0% | 0% | 0% | 0 | 0 | 0 | 0 |

Metrics Guide

Accuracy - Binary pass/fail (all checks must pass)
Score - Graduated quality (0-100, with partial credit)
Precision - Of the cases flagged, how many were actually crises?
Recall - Of the actual crises, how many were caught?
F1 - Harmonic mean of precision and recall
TP - True Positive (crisis correctly flagged)
TN - True Negative (benign content correctly left unflagged)
FP - False Positive (benign content flagged)
FN - False Negative (crisis missed)
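Accuracy, precision, recall, and F1 in the table all derive from the TP/TN/FP/FN counts using the standard binary-classification formulas. A minimal sketch (the confusion counts passed in below are made-up example numbers, not results from any suite; the graduated Score metric is the dashboard's own and is not reproduced here):

```python
def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion counts.

    Each ratio is guarded against a zero denominator, which is why a
    leaderboard with no cases shows 0% everywhere rather than an error.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0          # of flagged, how many correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0             # of crises, how many caught
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)                   # harmonic mean
    return accuracy, precision, recall, f1

# Example with hypothetical counts: 8 crises caught, 5 missed,
# 85 benign passed, 2 benign flagged.
print(metrics(tp=8, tn=85, fp=2, fn=5))
```

Note that F1 is driven entirely by the positive class: true negatives never enter the formula, so a model can have high accuracy on a benign-heavy test set while still scoring a low F1.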

Test suite results for NOPE Safety API

These results document our classification expectations and show what we consider an accurate risk assessment.