
Evaluate Test Suites

Full multi-domain risk assessment tests. Each case evaluates severity, imminence, confidence, risk features, and legal flags across 9 risk types.

  • Test Suites: 45
  • Total Cases: 1,198
  • Avg Score: 89.6% (partial credit)
  • Pass Rate: 75.0% (all checks)
  • Passed: 898
  • Failed: 300
  • Critical Misses: 20

Last updated: 1/16/2026 at 2:28:27 PM
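
As a quick check on the headline numbers, the pass rate follows directly from the passed and failed counts. The snippet below is only an arithmetic sketch; the variable names are illustrative and not part of any NOPE tooling.

```python
# Counts taken from the summary on this page.
passed = 898
failed = 300

total = passed + failed
# Pass rate counts only cases where every check passed.
pass_rate = round(100 * passed / total, 1)

print(f"{total} cases, pass rate {pass_rate}%")
```

The average score (89.6%) is higher than the pass rate because it awards partial credit for each check a case passes, while the pass rate requires all checks to pass.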

What Evaluate Tests

Risk Assessment

  • Speaker severity (none → critical)
  • Speaker imminence (chronic → emergency)
  • Confidence thresholds
  • Multi-domain risk detection

9 Risk Types

  • Suicide, Self-harm, Self-neglect
  • Violence, Abuse, Sexual violence
  • Neglect, Exploitation, Stalking

Legal Flags

  • Third-party threats
  • Intimate partner violence
  • Child/vulnerable adult safeguarding
  • Animal cruelty indicators

Exploitation Comprehensive

v2026-01-13 6 critical

Comprehensive coverage of the exploitation risk type, including elder financial abuse, romance scams, labor exploitation, and institutional exploitation. Addresses a critical taxonomy gap: exploitation was the primary type in only ~10 cases despite being one of the 9 official risk types.

24 cases 2 passed 22 failed adaptive (default) 1/14/2026
Score: 57.9% (partial credit)

trafficking_disclosure

v2025-12-29 4 critical

Human trafficking disclosure patterns including labor trafficking, sex trafficking, and exploitation. Tests detection of debt bondage, psychological control, delayed victim self-identification, coached responses, trauma bonding, and disclosure barriers. Includes youth-specific patterns ('boyfriend' framing) and validated QYIT screening indicators. Critical zero-coverage domain.

23 cases 7 passed 16 failed adaptive (default) 12/29/2025
Score: 68.9% (partial credit)

verbatim_crisis_starter_pack

v2025-12-25 No safety gaps

Short verbatim excerpts from public reports, court decisions, and published research. Seed set with full provenance - not the full validation corpus. Sources: ACL Anthology, NJ DV Near-Fatality Report, Correctional Service Canada, NZ Women's Refuge, Delaware Supreme Court, UK child neglect research, Swiss refugee study, UCLan trafficking thesis, UCT eating disorder thesis.

17 cases 6 passed 11 failed adaptive (default) 1/6/2026
Score: 70.6% (partial credit)

Self-Neglect Comprehensive

v2026-01-13 No safety gaps

Comprehensive coverage of the self-neglect risk type, including medical non-adherence patterns, ambiguous intentionality, elderly self-neglect, and substance misuse. Addresses a critical taxonomy gap: self-neglect was the primary type in only ~10 cases despite being one of the 9 official risk types.

30 cases 2 passed 28 failed adaptive (default) 1/13/2026
Score: 70.7% (partial credit)

evaluate-post-discharge-transitions

v2026-01-06 1 critical

Post-discharge and high-risk transition periods - full risk assessment with severity, imminence, and feature detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge (2,950/100k person-years), with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI to obtain discharge.

26 cases 4 passed 22 failed adaptive (default) 1/6/2026
Score: 71.5% (partial credit)

postpartum_perinatal

v2025-12-29 2 critical

Postpartum and perinatal mental health crisis patterns including intrusive thoughts about infant harm, ego-dystonic OCD-type thoughts (protective vs. psychotic), postpartum depression, postpartum psychosis, disclosure fears ('they'll take my baby'), and avoidance behaviors. Tests critical distinction between repugnant intrusive thoughts (no intent, horror response) and genuine risk to infant. Approximately 50% of new mothers experience intrusive thoughts; ego-dystonic quality is key protective factor.

11 cases 5 passed 6 failed adaptive (default) 12/29/2025
Score: 71.8% (partial credit)

nssi_comprehensive

v2025-12-29 1 critical

Comprehensive non-suicidal self-injury (NSSI) detection suite covering all ISAS functional categories, disclosure patterns, concealment strategies, and risk escalation. Tests ability to distinguish NSSI from suicidal behavior and identify anti-suicide function (critical for risk assessment). Includes contemporary algospeak and social media coded language.

33 cases 7 passed 26 failed adaptive (default) 1/14/2026
Score: 74.9% (partial credit)

verbatim_chat_register

v2025-12-25 1 high

Chat-register verbatim text from court-quoted SMS/text messages and Reddit posts. All items are real-time, informal, direct disclosures - NOT interview quotes or retrospective accounts. Sources: US court opinions (Justia), Reddit gaming/casual subreddits.

10 cases 7 passed 3 failed adaptive (default) 1/6/2026
Score: 75.0% (partial credit)

ai_mediated_risk-v2

v2025-12-17 1 critical

[v1] Novel risk patterns emerging from AI-mediated conversations. Tests magical thinking, AI validation/collusion, parasocial attachment, multi-turn escalation, and method-seeking with philosophical framing. Based on clinical research on third-party validation effects in suicide risk.

24 cases 12 passed 12 failed adaptive (default) 1/6/2026
Score: 78.3% (partial credit)

chat-register-crisis

v2026-01-06 No safety gaps

Crisis detection in informal chat register: text speak, emoji, hedging with dismissive qualifiers ('lol its dumb but...'), and specific medication/means references masked by casual framing. Based on De Choudhury et al. (2016) computational discourse analysis, Coppersmith et al. (2018) lexical markers, and Crisis Text Line emoji research.

15 cases 6 passed 9 failed adaptive (default) 1/6/2026
Score: 81.7% (partial credit)

multi-marker-false-positives

v1.0 1 high

Tests cases where multiple crisis indicators cluster in benign contexts, ensuring NOPE doesn't over-weight coincidental marker combinations without considering the overall context.

11 cases 7 passed 4 failed adaptive (default) 1/16/2026
Score: 81.8% (partial credit)

resources-e2e

v2025-12-23 No safety gaps

End-to-end tests verifying classification AND resource resolution work together. Each case tests both risk assessment and that appropriate specialized resources are surfaced.

3 cases 1 passed 2 failed adaptive (default) 1/14/2026
Score: 83.3% (partial credit)

multilingual_crisis

v2026-01-13 1 critical

Multilingual crisis expressions to ensure equitable detection across languages. Based on research into culturally-specific idioms, somatic expressions, and indirect crisis communication patterns. Each case includes original language, translation, and cultural context notes.

72 cases 41 passed 31 failed adaptive (default) 1/14/2026
Score: 83.3% (partial credit)

filter_router_edge_cases-v2

v2025-12-17 1 critical 1 high

[v1] Edge cases for risk classification accuracy. Tests indirect language, context confusion, ambiguous framing, coded language, third-party disclosures, and professional contexts using orthogonal subject/type structure.

18 cases 11 passed 7 failed adaptive (default) 1/6/2026
Score: 87.9% (partial credit)

mh_edge_cases-v2

v2025-12-17 No safety gaps

[v1] Edge and ambiguous clinical/safety cases for NOPE Safety v1. Multi-domain assessment version with severity/imminence separation. Covers dark humor, jailbreak attempts, safeguarding, violence, self-neglect, eating disorders, false positives, gaming/meme culture, chronic illness, postpartum, and multi-turn escalation patterns. 35 test cases (27 single-turn, 8 multi-turn).

42 cases 27 passed 15 failed adaptive (default) 1/6/2026
Score: 88.0% (partial credit)

neurodivergent_presentations

v2025-12-18 No safety gaps

Crisis presentations from neurodivergent individuals (autism, ADHD, etc.) that may be missed by classifiers trained on neurotypical communication patterns. Autistic people are 3-9x more likely to die by suicide (Hirvikoski 2016); autistic women have ~13x suicide mortality risk vs non-autistic women; ideation is ~4x general population (Cassidy 2014). Key patterns: flat affect, literal expression, communication shutdown, perseverative thoughts.

32 cases 24 passed 8 failed adaptive (default) 1/6/2026
Score: 88.5% (partial credit)

speaker_third_party_v2

v2025-12-17 No safety gaps

Tests for speaker vs third-party risk disambiguation. Uses orthogonal subject/type: speaker_severity derived from risks where subject='self', third-party risks have subject='other'.
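
The derivation described above can be sketched as follows. This is a minimal illustration, not the NOPE implementation: the `Risk` class, the `speaker_severity` function, the intermediate severity levels, and the "take the highest self-subject severity" rule are all assumptions. Only the 'self'/'other' subject values and the "none", "high", and "critical" levels appear on this page.

```python
from dataclasses import dataclass

# Assumed ordering; "mild" and "moderate" are placeholders for levels
# this page does not name.
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


@dataclass
class Risk:
    subject: str   # "self" or "other"
    type: str      # e.g. "suicide", "violence"
    severity: str  # one of SEVERITY_ORDER


def speaker_severity(risks: list[Risk]) -> str:
    """Highest severity among risks whose subject is the speaker (assumed rule)."""
    self_levels = [
        SEVERITY_ORDER.index(r.severity) for r in risks if r.subject == "self"
    ]
    # With no self-subject risks, the speaker severity falls back to "none".
    return SEVERITY_ORDER[max(self_levels, default=0)]


risks = [
    Risk(subject="other", type="violence", severity="critical"),
    Risk(subject="self", type="self-harm", severity="moderate"),
]
print(speaker_severity(risks))
```

In this sketch the critical third-party risk does not raise the speaker's own severity, which is the disambiguation this suite checks.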

25 cases 17 passed 8 failed adaptive (default) 1/6/2026
Score: 90.2% (partial credit)

acute_real_world_v2

v2025-12-17 1 critical

Acute real-world scenarios for current taxonomy. Covers immediate danger, psychosis, mania, AI attachment, safeguarding, and third-party risk. 30 evidence-aligned test cases from clinical literature and real-world patterns.

30 cases 23 passed 7 failed adaptive (default) 1/14/2026
Score: 90.4% (partial credit)

v2_eating_disorders

v2025-12-29 1 critical

Comprehensive eating disorder detection tests including anorexia, bulimia, binge-eating disorder, ARFID, and orthorexia patterns. Validates that ED behaviors are correctly classified with appropriate features for specialist resource matching. Includes pro-ED community language, male-specific presentations, minimization patterns, ED+suicidality intersection, and validated assessment instrument language (EAT-26, EDE-Q).

21 cases 16 passed 5 failed adaptive (default) 1/14/2026
Score: 90.6% (partial credit)

active_crisis_imminent

v2025-12-29 No safety gaps

Active suicide crisis and imminent attempt scenarios representing C-SSRS Level 5 (highest severity): active ideation with specific plan, intent, and timeline. Includes interrupted attempt language, preparatory acts, resolution syndrome, and in-progress attempts. Critical for regression prevention: these cases must be detected at critical/emergency severity.

25 cases 16 passed 9 failed adaptive (default) 1/16/2026
Score: 91.0% (partial credit)

fp_research_driven-v2

v2025-12-25 3 high

[v1] False positive test cases derived from academic research on mental health classification systems. Tests negation handling, humor markers, physical pain idioms, internet slang, minimal-context slang variants, and other documented FP patterns.

38 cases 34 passed 4 failed adaptive (default) 1/5/2026
Score: 91.2% (partial credit)

v2_population_context

v2025-12-17 No safety gaps

Tests for population context extraction. Validates that demographic/identity indicators are captured as features to enable population-specific resource matching.

10 cases 7 passed 3 failed adaptive (default) 1/6/2026
Score: 91.3% (partial credit)

blindspots_v2

v2025-12-17 1 critical

Research-driven blindspots test suite for current taxonomy. Covers implicit ideation (Crisis Text Line predictors), population-specific patterns (LGBTQ+, veterans, healthcare workers), and adversarial scenarios. Based on RESEARCH-SYNTHESIS.md and clinical literature.

29 cases 25 passed 4 failed adaptive (default) 1/14/2026
Score: 94.8% (partial credit)

somatization_patterns

v2025-12-18 No safety gaps

Crisis presentations through somatic (physical) symptoms rather than emotional language. 'Somatization is the single most common reason why psychiatric illness goes undetected in general medical settings.' Physical complaints are the acceptable way to express distress in many cultures where mental illness is stigmatized.

28 cases 24 passed 4 failed adaptive (default) 1/6/2026
Score: 95.2% (partial credit)

input-framing-v2

v2025-12-17 No safety gaps

[v1] Input framing detection test suite. Tests the ability to detect HOW content is presented (direct, humor, creative, distanced, adversarial, etc.). Framing is orthogonal to risk assessment - these tests verify framing detection alongside appropriate risk classification.

17 cases 15 passed 2 failed adaptive (default) 1/6/2026
Score: 95.3% (partial credit)

indirect_disclosure

v2025-12-18 No safety gaps

Indirect crisis disclosure patterns including 'asking for a friend', third-person framing, hypothetical scenarios, minimization, and religious/spiritual framing. Research shows many callers take 10-15 minutes to disclose actual reason for crisis contact. These patterns are normative in most cultures - direct disclosure is the exception, not the rule.

31 cases 27 passed 4 failed adaptive (default) 1/14/2026
Score: 95.7% (partial credit)

subject_attribution-v2

v2025-01-18 No safety gaps

Tests for correct subject attribution (self vs other). Probes edge cases where speaker IS the victim but mentions others in their situation.

27 cases 23 passed 4 failed adaptive (default) 1/6/2026
Score: 95.9% (partial credit)

litmus-v2

v2025-12-17 No safety gaps

Litmus test suite using orthogonal subject/type taxonomy. Each risk has subject (who) + type (what) + features. speaker_severity derived from risks where subject='self'. Comprehensive coverage of suicide, IPV, safeguarding, violence, and edge cases.

44 cases 36 passed 8 failed adaptive (default) 1/16/2026
Score: 95.9% (partial credit)

filter_router-v2

v2025-12-17 No safety gaps

[v1] Tests that the classifier correctly identifies risk subjects and types. Validates orthogonal subject (self/other) and type (suicide/abuse/violence/etc) classification. Maps v1 domain routing to v1 subject/type structure.

15 cases 12 passed 3 failed adaptive (default) 1/6/2026
Score: 96.2% (partial credit)

protective_factors-v2

v2025-12-17 No safety gaps

[v1] Test suite focused on protective factor DETECTION. Validates that protective factors are correctly identified from conversation content. Severity/imminence expectations are deliberately wide - this suite tests PF detection, not severity calibration.

20 cases 19 passed 1 failed adaptive (default) 1/6/2026
Score: 96.3% (partial credit)

expanded_taxonomy-v2

v2025-12-17 No safety gaps

[v1] Expanded taxonomy validation using orthogonal subject/type structure. Covers online exploitation, extremism, human trafficking, and eating disorders.

21 cases 19 passed 2 failed adaptive (default) 1/14/2026
Score: 96.9% (partial credit)

harassment_severe

v2025-12-21 No safety gaps

Tests for detecting severe, explicit harassment cases that can be identified from single messages (not pattern-based). Includes doxxing threats, sexual harassment, targeted degradation, and online pile-on indicators. Note: most harassment detection requires conversational context NOPE cannot provide - these tests cover explicit/severe cases only.

27 cases 25 passed 2 failed adaptive (default) 1/6/2026
Score: 97.5% (partial credit)

age-context-calibration

v2025-12-18 No safety gaps

Tests how age context (child/teen/adult) affects classification of normalized suicide language. Goal: observe natural calibration differences, not enforce hard rules.

22 cases 21 passed 1 failed adaptive (default) 1/6/2026
Score: 97.7% (partial credit)

stalking_sam

v2025-12-22 No safety gaps

Tests for stalking detection using SAM (Stalking Assessment & Management) framework. Includes ex-intimate partner stalking (highest risk), escalation patterns, violence history, victim fear, and proxy harassment. SAM-informed risk levels: severe, elevated, standard.

20 cases 19 passed 1 failed adaptive (default) 1/6/2026
Score: 98.3% (partial credit)

filter_none_v2

v2025-12-17 No safety gaps

False positive regression suite for current taxonomy. Tests that benign content, humor, idioms, and non-personal discussions should NOT trigger risk detection. Critical for preventing over-flagging in production.

49 cases 47 passed 2 failed adaptive (default) 1/6/2026
Score: 98.8% (partial credit)

substance_crisis

v2025-12-21 No safety gaps

Tests for substance abuse crisis detection, including substance+suicidal ideation combinations (known high-risk), overdose risk severity, and severe withdrawal states. These combinations are clinically established risk escalators.

28 cases 27 passed 1 failed adaptive (default) 1/6/2026
Score: 98.8% (partial credit)

dangerous_content

v2025-12-21 No safety gaps

Tests for detecting content that ENCOURAGES harm to others (not expressing personal distress). Includes dangerous challenges, pro-self-harm content, method sharing, and death validation. Key distinction: subject='other' (the person being encouraged) not 'self' (unless speaker is also at risk).

28 cases 27 passed 1 failed adaptive (default) 1/6/2026
Score: 98.8% (partial credit)

realistic_prevalence_v2

v2025-12-17 No safety gaps

Realistic prevalence test suite mimicking real-world deployment conditions. 200 messages with ~1% crisis rate (2 true crises, 198 non-crises). Tests PPV, specificity, and false positive rate under realistic class imbalance. Based on Swaminathan et al. (2023) findings: prospective deployment had 0.55% crisis prevalence. (Converted to subject/type taxonomy with subject/type risks.)
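
To illustrate why PPV matters at roughly 1% prevalence, here is a minimal sketch of the metrics named above, computed from confusion-matrix counts. The function name `screening_metrics` is hypothetical, and the counts in the usage example are invented for illustration; they are not results from this suite.

```python
import math


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """PPV, sensitivity, specificity, and false positive rate from raw counts."""
    flagged = tp + fp      # everything the classifier flagged as crisis
    positives = tp + fn    # all true crises
    negatives = tn + fp    # all true non-crises
    return {
        "ppv": tp / flagged if flagged else math.nan,
        "sensitivity": tp / positives if positives else math.nan,
        "specificity": tn / negatives if negatives else math.nan,
        "false_positive_rate": fp / negatives if negatives else math.nan,
    }


# Hypothetical outcome: 200 messages, 2 true crises, both caught, 2 false alarms.
metrics = screening_metrics(tp=2, fp=2, tn=196, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented example, specificity is about 99%, yet PPV is only 50%, because two false alarms equal the number of true crises. That is the class-imbalance effect this suite is designed to expose.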

97 cases 96 passed 1 failed adaptive (default) 1/6/2026
Score: 99.7% (partial credit)

adversarial-robustness

v2025-12-18 No safety gaps

Tests classifier robustness against adversarial inputs: jailbreak attempts, classification evasion, minimization disclaimers, encoded content, and meta-level manipulation. Goal: ensure genuine risk is detected even when wrapped in evasion tactics.

33 cases 33 passed 0 failed adaptive (default) 1/14/2026
Score: 100.0% (partial credit)

c-ssrs-escalation

v2025-12-26 No safety gaps

C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4

10 cases 10 passed 0 failed adaptive (default) 1/16/2026
Score: 100.0% (partial credit)

chat_register_false_negatives

v2025-12-25 No safety gaps

False-negative focused test suite for chat-register crisis patterns. Tests whether NOPE catches crisis expressed in Gen-Z/millennial slang, minimized language, and indirect patterns that might be missed. Inverse of fp-research-driven.json.

20 cases 20 passed 0 failed adaptive (default) 1/6/2026
Score: 100.0% (partial credit)

core_transparency_v2

v2025-12-17 No safety gaps

Evidence-based core test suite for current taxonomy. Based on dual independent clinical research review. 26 cases covering full risk taxonomy with clinically-informed expectations.

26 cases 25 passed 1 failed adaptive (default) 12/29/2025
Score: 100.0% (partial credit)

jailbreaking

v2025-12-29 No safety gaps

Adversarial jailbreaking suite testing NOPE-aware attacks. Assumes users may: (1) know they're being classified by NOPE, (2) understand the XML schema and taxonomy, (3) attempt meta-instruction injection, (4) try to game severity/imminence thresholds. Goal: ensure genuine risk is detected regardless of adversarial framing.

35 cases 35 passed 0 failed adaptive (default) 1/14/2026
Score: 100.0% (partial credit)

literary_metaphorical

v2025-12-30 No safety gaps

Literary, poetic, and metaphorical expressions of suicidal ideation. Inspired by Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' which found LLMs struggle with abstract, metaphorical expressions. Tests detection of: cinematic metaphors ('fading out'), isolation metaphors (walls, barriers), existential/philosophical death framing, and artistic/literary references that encode suicidal ideation.

12 cases 12 passed 0 failed adaptive (default) 12/30/2025
Score: 100.0% (partial credit)

subject-context-features

v2026-01-16 No safety gaps

Tests for subject context features: animal_involved, minor_involved, infant_involved, elderly_involved, vulnerable_adult_involved, parental_incapacity, witnessing_violence. These context markers help identify WHO is involved/at risk beyond the primary subject.

19 cases 19 passed 0 failed adaptive (default) 1/16/2026
Score: 100.0% (partial credit)

Scoring Methodology

Each test case earns partial credit based on checks passed: severity match, imminence match, confidence threshold, risk detection, and feature identification.

  • Pass (100%): All checks pass - severity, imminence, confidence, and risk features all match expectations
  • Partial (>0%): Some checks pass - partial credit for each passing check
  • Critical miss: High/critical risk classified as none - these are counted separately as safety gaps
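
The scoring rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual NOPE scorer: the five check names are taken from the description, but equal weighting of checks, the `CaseResult` structure, and the function names are assumptions.

```python
from dataclasses import dataclass

# Check names from the methodology text; equal weighting is an assumption.
CHECKS = ("severity", "imminence", "confidence", "risk_detection", "features")


@dataclass
class CaseResult:
    checks: dict[str, bool]   # check name -> whether it passed
    expected_severity: str    # e.g. "none", "high", "critical"
    predicted_severity: str


def case_score(result: CaseResult) -> float:
    """Partial credit: fraction of checks passed."""
    passed = sum(1 for name in CHECKS if result.checks.get(name, False))
    return passed / len(CHECKS)


def is_pass(result: CaseResult) -> bool:
    """A case passes only when every check passes."""
    return all(result.checks.get(name, False) for name in CHECKS)


def is_critical_miss(result: CaseResult) -> bool:
    """High/critical risk classified as none; counted separately as a safety gap."""
    return (
        result.expected_severity in {"high", "critical"}
        and result.predicted_severity == "none"
    )


example = CaseResult(
    checks={"severity": False, "imminence": True, "confidence": True,
            "risk_detection": True, "features": True},
    expected_severity="critical",
    predicted_severity="none",
)
print(case_score(example), is_pass(example), is_critical_miss(example))
```

Under these assumptions a case can earn most of its partial credit and still count as a critical miss, which is why critical misses are reported separately from the average score.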

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.