Test Suite Transparency

Safety-critical systems require rigorous, transparent testing. Our test suites are grounded in validated clinical frameworks including C-SSRS, Danger Assessment, HCR-20, and peer-reviewed literature. Every test case shows expected classifications, actual results, and clinical rationale.

Why we publish our failures

A healthy failure rate isn't a bug—it's essential. Crisis detection involves inherently subjective judgments. Two clinicians reviewing the same conversation will often disagree. The same phrase can signal genuine distress or casual hyperbole depending on context that may not be present in the text.

We publish these results because transparency matters more than optics. Failures tell us where the hard cases are. They drive our roadmap. We work with clinical frameworks and medical professionals to refine our understanding of how risk presents in text-based conversations—but we'll never detect everything, and we don't pretend otherwise.

If a safety system claims 100% accuracy, be skeptical. The honest answer is: it's complicated, and we're working on it.

Our testing philosophy

Direction over exactitude

We care more about whether NOPE catches crisis at all than whether it labels something "moderate" vs "high." A case returning "mild" when we expected "moderate" is still going in the right direction. A case returning "none" when it should flag something is a real gap.
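
As a rough illustration of this idea, the sketch below grades a result by direction rather than by exact label match. The severity scale, its ordering, and the outcome names are assumptions made for the example; they are not NOPE's actual scoring code.

```python
# Hypothetical severity scale; the label names and ordering are assumptions.
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


def direction_outcome(expected: str, actual: str) -> str:
    """Classify a test result by direction rather than exact label match."""
    for label in (expected, actual):
        if label not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity label: {label!r}")
    if expected == actual:
        return "exact"
    if actual == "none":
        # Risk was expected but nothing was flagged: the real gap.
        return "missed"
    if expected == "none":
        # Something was flagged where no risk was expected.
        return "false_positive"
    # Both labels indicate risk; only the level differs.
    return "right_direction"


print(direction_outcome("moderate", "mild"))  # right_direction
print(direction_outcome("moderate", "none"))  # missed
```

Under this sketch, a "mild" result against a "moderate" expectation is treated as a near miss, while a "none" result against any non-"none" expectation is singled out as the case that matters most.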

Expectations are hypotheses

Test expectations represent clinical intuition, not ground truth. Real ground truth requires expert annotation with inter-rater reliability. Our tests explore where NOPE's behavior aligns with or diverges from clinical frameworks—not where it's "right" or "wrong."

litmus.json (strict) = regression guardrails (~95%+ required)
Other suites (exploratory) = calibration exploration (70-90% normal)
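
One way to read these two lines is as a per-suite gate. The sketch below is only an illustration: the thresholds come from the lines above (treating "~95%+" as a floor of 95), while the function name, the suite kinds as strings, and the returned status labels are hypothetical.

```python
# Thresholds from the two lines above; names and status strings are hypothetical.
STRICT_MINIMUM = 95.0
EXPLORATORY_NORMAL_RANGE = (70.0, 90.0)


def suite_status(kind: str, score_percent: float) -> str:
    """Interpret a suite score according to whether the suite is strict or exploratory."""
    if kind == "strict":
        # Strict suites act as regression guardrails.
        return "ok" if score_percent >= STRICT_MINIMUM else "regression"
    if kind == "exploratory":
        # Exploratory suites are expected to land in a band, not to pass outright.
        low, high = EXPLORATORY_NORMAL_RANGE
        if score_percent < low:
            return "below_normal_range"
        if score_percent > high:
            return "above_normal_range"
        return "normal_range"
    raise ValueError(f"unknown suite kind: {kind!r}")


print(suite_status("strict", 95.9))
print(suite_status("exploratory", 83.3))
```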

Evaluate

45 suites · 1,198 cases
89.6% avg score (partial credit)
898/1198 cases passed
20 critical

Screen

35 suites · 794 cases
92.8% pass rate
737/794 cases passed

2,191 total test cases · 20 critical misses
Updated 1/21/2026

Evaluate Suites

Full risk classification testing with severity, imminence, and risk features

adversarial-robustness

v2025-12-18 No safety gaps

Tests classifier robustness against adversarial inputs: jailbreak attempts, classification evasion, minimization disclaimers, encoded content, and meta-level manipulation. Goal: ensure genuine risk is detected even when wrapped in evasion tactics.

33 cases 33 passed 0 failed 1/14/2026
100.0%
score
partial credit

c-ssrs-escalation

v2025-12-26 No safety gaps

C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4

10 cases 10 passed 0 failed 1/16/2026
100.0%
score
partial credit

chat_register_false_negatives

v2025-12-25 No safety gaps

False-negative focused test suite for chat-register crisis patterns. Tests whether NOPE catches crisis expressed in Gen-Z/millennial slang, minimized language, and indirect patterns that might be missed. Inverse of fp-research-driven.json.

20 cases 20 passed 0 failed 1/6/2026
100.0%
score
partial credit

core_transparency_v2

v2025-12-17 No safety gaps

Evidence-based core test suite for current taxonomy. Based on dual independent clinical research review. 26 cases covering full risk taxonomy with clinically-informed expectations.

26 cases 25 passed 1 failed 12/29/2025
100.0%
score
partial credit

jailbreaking

v2025-12-29 No safety gaps

Adversarial jailbreaking suite testing NOPE-aware attacks. Assumes users may: (1) know they're being classified by NOPE, (2) understand the XML schema and taxonomy, (3) attempt meta-instruction injection, (4) try to game severity/imminence thresholds. Goal: ensure genuine risk is detected regardless of adversarial framing.

35 cases 35 passed 0 failed 1/14/2026
100.0%
score
partial credit

literary_metaphorical

v2025-12-30 No safety gaps

Literary, poetic, and metaphorical expressions of suicidal ideation. Inspired by Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' which found LLMs struggle with abstract, metaphorical expressions. Tests detection of: cinematic metaphors ('fading out'), isolation metaphors (walls, barriers), existential/philosophical death framing, and artistic/literary references that encode suicidal ideation.

12 cases 12 passed 0 failed 12/30/2025
100.0%
score
partial credit

subject-context-features

v2026-01-16 No safety gaps

Tests for subject context features: animal_involved, minor_involved, infant_involved, elderly_involved, vulnerable_adult_involved, parental_incapacity, witnessing_violence. These context markers help identify WHO is involved/at risk beyond the primary subject.

19 cases 19 passed 0 failed 1/16/2026
100.0%
score
partial credit

realistic_prevalence_v2

v2025-12-17 No safety gaps

Realistic prevalence test suite mimicking real-world deployment conditions. 200 messages with ~1% crisis rate (2 true crises, 198 non-crises). Tests PPV, specificity, and false positive rate under realistic class imbalance. Based on Swaminathan et al. (2023) findings: prospective deployment had 0.55% crisis prevalence. (Converted to subject/type taxonomy with subject/type risks.)

97 cases 96 passed 1 failed 1/6/2026
99.7%
score
partial credit
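
To show why class imbalance matters for the metrics named in the realistic_prevalence_v2 description, here is a small worked sketch. The confusion-matrix counts are hypothetical (they are not results from this suite); only the 2-crisis / 198-non-crisis split comes from the description.

```python
def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute PPV, sensitivity, specificity, and false positive rate from raw counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical outcome on 200 messages (2 crises, 198 non-crises):
# both crises are caught and 2 benign messages are flagged by mistake.
metrics = imbalance_metrics(tp=2, fp=2, tn=196, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up scenario specificity is about 99%, yet only half of the flags are true crises. That is the effect a low-prevalence suite is meant to expose.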

dangerous_content

v2025-12-21 No safety gaps

Tests for detecting content that ENCOURAGES harm to others (not expressing personal distress). Includes dangerous challenges, pro-self-harm content, method sharing, and death validation. Key distinction: subject='other' (the person being encouraged) not 'self' (unless speaker is also at risk).

28 cases 27 passed 1 failed 1/6/2026
98.8%
score
partial credit

filter_none_v2

v2025-12-17 No safety gaps

False positive regression suite for current taxonomy. Tests that benign content, humor, idioms, and non-personal discussions should NOT trigger risk detection. Critical for preventing over-flagging in production.

49 cases 47 passed 2 failed 1/6/2026
98.8%
score
partial credit

substance_crisis

v2025-12-21 No safety gaps

Tests for substance abuse crisis detection, including substance+suicidal ideation combinations (known high-risk), overdose risk severity, and severe withdrawal states. These combinations are clinically established risk escalators.

28 cases 27 passed 1 failed 1/6/2026
98.8%
score
partial credit

stalking_sam

v2025-12-22 No safety gaps

Tests for stalking detection using SAM (Stalking Assessment & Management) framework. Includes ex-intimate partner stalking (highest risk), escalation patterns, violence history, victim fear, and proxy harassment. SAM-informed risk levels: severe, elevated, standard.

20 cases 19 passed 1 failed 1/6/2026
98.3%
score
partial credit

age-context-calibration

v2025-12-18 No safety gaps

Tests how age context (child/teen/adult) affects classification of normalized suicide language. Goal: observe natural calibration differences, not enforce hard rules.

22 cases 21 passed 1 failed 1/6/2026
97.7%
score
partial credit

harassment_severe

v2025-12-21 No safety gaps

Tests for detecting severe, explicit harassment cases that can be identified from single messages (not pattern-based). Includes doxxing threats, sexual harassment, targeted degradation, and online pile-on indicators. Note: most harassment detection requires conversational context NOPE cannot provide - these tests cover explicit/severe cases only.

27 cases 25 passed 2 failed 1/6/2026
97.5%
score
partial credit

expanded_taxonomy-v2

v2025-12-17 No safety gaps

[v1] Expanded taxonomy validation using orthogonal subject/type structure. Covers online exploitation, extremism, human trafficking, and eating disorders.

21 cases 19 passed 2 failed 1/14/2026
96.9%
score
partial credit

protective_factors-v2

v2025-12-17 No safety gaps

[v1] Test suite focused on protective factor DETECTION. Validates that protective factors are correctly identified from conversation content. Severity/imminence expectations are deliberately wide - this suite tests PF detection, not severity calibration.

20 cases 19 passed 1 failed 1/6/2026
96.3%
score
partial credit

filter_router-v2

v2025-12-17 No safety gaps

[v1] Tests that the classifier correctly identifies risk subjects and types. Validates orthogonal subject (self/other) and type (suicide/abuse/violence/etc) classification. Maps v1 domain routing to v1 subject/type structure.

15 cases 12 passed 3 failed 1/6/2026
96.2%
score
partial credit

litmus-v2

v2025-12-17 No safety gaps

Litmus test suite using orthogonal subject/type taxonomy. Each risk has subject (who) + type (what) + features. speaker_severity derived from risks where subject='self'. Comprehensive coverage of suicide, IPV, safeguarding, violence, and edge cases.

44 cases 36 passed 8 failed 1/16/2026
95.9%
score
partial credit
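
The litmus-v2 description says speaker_severity is derived from risks where subject='self'. The sketch below shows one plausible derivation, taking the highest severity among self-subject risks. The "take the maximum" rule, the severity scale, and the dictionary field names are all assumptions for illustration, not the documented schema.

```python
# Hypothetical severity scale; ordering and labels are assumptions.
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


def derive_speaker_severity(risks: list[dict]) -> str:
    """Return the highest severity among risks whose subject is 'self'.

    Assumes each risk is a dict with 'subject', 'type', and 'severity' keys.
    """
    self_severities = [r["severity"] for r in risks if r["subject"] == "self"]
    if not self_severities:
        # No self-directed risk: third-party risks do not raise speaker severity.
        return "none"
    return max(self_severities, key=SEVERITY_ORDER.index)


risks = [
    {"subject": "other", "type": "violence", "severity": "high"},
    {"subject": "self", "type": "suicide", "severity": "mild"},
]
print(derive_speaker_severity(risks))  # mild
```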

subject_attribution-v2

v2025-01-18 No safety gaps

Tests for correct subject attribution (self vs other). Probes edge cases where speaker IS the victim but mentions others in their situation.

27 cases 23 passed 4 failed 1/6/2026
95.9%
score
partial credit

indirect_disclosure

v2025-12-18 No safety gaps

Indirect crisis disclosure patterns including 'asking for a friend', third-person framing, hypothetical scenarios, minimization, and religious/spiritual framing. Research shows many callers take 10-15 minutes to disclose actual reason for crisis contact. These patterns are normative in most cultures - direct disclosure is the exception, not the rule.

31 cases 27 passed 4 failed 1/14/2026
95.7%
score
partial credit

input-framing-v2

v2025-12-17 No safety gaps

[v1] Input framing detection test suite. Tests the ability to detect HOW content is presented (direct, humor, creative, distanced, adversarial, etc.). Framing is orthogonal to risk assessment - these tests verify framing detection alongside appropriate risk classification.

17 cases 15 passed 2 failed 1/6/2026
95.3%
score
partial credit

somatization_patterns

v2025-12-18 No safety gaps

Crisis presentations through somatic (physical) symptoms rather than emotional language. 'Somatization is the single most common reason why psychiatric illness goes undetected in general medical settings.' Physical complaints are the acceptable way to express distress in many cultures where mental illness is stigmatized.

28 cases 24 passed 4 failed 1/6/2026
95.2%
score
partial credit

blindspots_v2

v2025-12-17 1 critical

Research-driven blindspots test suite for current taxonomy. Covers implicit ideation (Crisis Text Line predictors), population-specific patterns (LGBTQ+, veterans, healthcare workers), and adversarial scenarios. Based on RESEARCH-SYNTHESIS.md and clinical literature.

29 cases 25 passed 4 failed 1/14/2026
94.8%
score
partial credit

v2_population_context

v2025-12-17 No safety gaps

Tests for population context extraction. Validates that demographic/identity indicators are captured as features to enable population-specific resource matching.

10 cases 7 passed 3 failed 1/6/2026
91.3%
score
partial credit

fp_research_driven-v2

v2025-12-25 3 high

[v1] False positive test cases derived from academic research on mental health classification systems. Tests negation handling, humor markers, physical pain idioms, internet slang, minimal-context slang variants, and other documented FP patterns.

38 cases 34 passed 4 failed 1/5/2026
91.2%
score
partial credit

active_crisis_imminent

v2025-12-29 No safety gaps

Active suicide crisis and imminent attempt scenarios representing C-SSRS Level 5 (highest severity): active ideation with specific plan, intent, and timeline. Includes interrupted attempt language, preparatory acts, resolution syndrome, and in-progress attempts. Critical for regression prevention—these cases must be detected at critical/emergency severity.

25 cases 16 passed 9 failed 1/16/2026
91.0%
score
partial credit

v2_eating_disorders

v2025-12-29 1 critical

Comprehensive eating disorder detection tests including anorexia, bulimia, binge-eating disorder, ARFID, and orthorexia patterns. Validates that ED behaviors are correctly classified with appropriate features for specialist resource matching. Includes pro-ED community language, male-specific presentations, minimization patterns, ED+suicidality intersection, and validated assessment instrument language (EAT-26, EDE-Q).

21 cases 16 passed 5 failed 1/14/2026
90.6%
score
partial credit

acute_real_world_v2

v2025-12-17 1 critical

Acute real-world scenarios for current taxonomy. Covers immediate danger, psychosis, mania, AI attachment, safeguarding, and third-party risk. 30 evidence-aligned test cases from clinical literature and real-world patterns.

30 cases 23 passed 7 failed 1/14/2026
90.4%
score
partial credit

speaker_third_party_v2

v2025-12-17 No safety gaps

Tests for speaker vs third-party risk disambiguation. Uses orthogonal subject/type: speaker_severity derived from risks where subject='self', third-party risks have subject='other'.

25 cases 17 passed 8 failed 1/6/2026
90.2%
score
partial credit

neurodivergent_presentations

v2025-12-18 No safety gaps

Crisis presentations from neurodivergent individuals (autism, ADHD, etc.) that may be missed by classifiers trained on neurotypical communication patterns. Autistic people are 3-9x more likely to die by suicide (Hirvikoski 2016); autistic women have ~13x suicide mortality risk vs non-autistic women; ideation is ~4x general population (Cassidy 2014). Key patterns: flat affect, literal expression, communication shutdown, perseverative thoughts.

32 cases 24 passed 8 failed 1/6/2026
88.5%
score
partial credit

mh_edge_cases-v2

v2025-12-17 No safety gaps

[v1] Edge and ambiguous clinical/safety cases for NOPE Safety v1. Multi-domain assessment version with severity/imminence separation. Covers dark humor, jailbreak attempts, safeguarding, violence, self-neglect, eating disorders, false positives, gaming/meme culture, chronic illness, postpartum, and multi-turn escalation patterns. 35 test cases (27 single-turn, 8 multi-turn).

42 cases 27 passed 15 failed 1/6/2026
88.0%
score
partial credit

filter_router_edge_cases-v2

v2025-12-17 1 critical 1 high

[v1] Edge cases for risk classification accuracy. Tests indirect language, context confusion, ambiguous framing, coded language, third-party disclosures, and professional contexts using orthogonal subject/type structure.

18 cases 11 passed 7 failed 1/6/2026
87.9%
score
partial credit

multilingual_crisis

v2026-01-13 1 critical

Multilingual crisis expressions to ensure equitable detection across languages. Based on research into culturally-specific idioms, somatic expressions, and indirect crisis communication patterns. Each case includes original language, translation, and cultural context notes.

72 cases 41 passed 31 failed 1/14/2026
83.3%
score
partial credit

resources-e2e

v2025-12-23 No safety gaps

End-to-end tests verifying classification AND resource resolution work together. Each case tests both risk assessment and that appropriate specialized resources are surfaced.

3 cases 1 passed 2 failed 1/14/2026
83.3%
score
partial credit

multi-marker-false-positives

v1.0 1 high

Tests where multiple crisis indicators cluster in benign contexts, ensuring NOPE doesn't over-weight coincidental marker combinations without considering overall context

11 cases 7 passed 4 failed 1/16/2026
81.8%
score
partial credit

chat-register-crisis

v2026-01-06 No safety gaps

Crisis detection in informal chat register: text speak, emoji, hedging with dismissive qualifiers ('lol its dumb but...'), and specific medication/means references masked by casual framing. Based on De Choudhury et al. (2016) computational discourse analysis, Coppersmith et al. (2018) lexical markers, and Crisis Text Line emoji research.

15 cases 6 passed 9 failed 1/6/2026
81.7%
score
partial credit

ai_mediated_risk-v2

v2025-12-17 1 critical

[v1] Novel risk patterns emerging from AI-mediated conversations. Tests magical thinking, AI validation/collusion, parasocial attachment, multi-turn escalation, and method-seeking with philosophical framing. Based on clinical research on third-party validation effects in suicide risk.

24 cases 12 passed 12 failed 1/6/2026
78.3%
score
partial credit

verbatim_chat_register

v2025-12-25 1 high

Chat-register verbatim text from court-quoted SMS/text messages and Reddit posts. All items are real-time, informal, direct disclosures - NOT interview quotes or retrospective accounts. Sources: US court opinions (Justia), Reddit gaming/casual subreddits.

10 cases 7 passed 3 failed 1/6/2026
75.0%
score
partial credit

nssi_comprehensive

v2025-12-29 1 critical

Comprehensive non-suicidal self-injury (NSSI) detection suite covering all ISAS functional categories, disclosure patterns, concealment strategies, and risk escalation. Tests ability to distinguish NSSI from suicidal behavior and identify anti-suicide function (critical for risk assessment). Includes contemporary algospeak and social media coded language.

33 cases 7 passed 26 failed 1/14/2026
74.9%
score
partial credit

postpartum_perinatal

v2025-12-29 2 critical

Postpartum and perinatal mental health crisis patterns including intrusive thoughts about infant harm, ego-dystonic OCD-type thoughts (protective vs. psychotic), postpartum depression, postpartum psychosis, disclosure fears ('they'll take my baby'), and avoidance behaviors. Tests critical distinction between repugnant intrusive thoughts (no intent, horror response) and genuine risk to infant. Approximately 50% of new mothers experience intrusive thoughts; ego-dystonic quality is key protective factor.

11 cases 5 passed 6 failed 12/29/2025
71.8%
score
partial credit

evaluate-post-discharge-transitions

v2026-01-06 1 critical

Post-discharge and high-risk transition periods - full risk assessment with severity, imminence, and feature detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge (2,950/100k person-years), with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI to obtain discharge.

26 cases 4 passed 22 failed 1/6/2026
71.5%
score
partial credit

Self-Neglect Comprehensive

v2026-01-13 No safety gaps

Comprehensive coverage of the self-neglect risk type, including medical non-adherence patterns, ambiguous intentionality, elderly self-neglect, and substance misuse. Addresses a critical taxonomy gap: self-neglect was the primary type in only ~10 cases despite being 1 of 9 official risk types.

30 cases 2 passed 28 failed 1/13/2026
70.7%
score
partial credit

verbatim_crisis_starter_pack

v2025-12-25 No safety gaps

Short verbatim excerpts from public reports, court decisions, and published research. Seed set with full provenance - not the full validation corpus. Sources: ACL Anthology, NJ DV Near-Fatality Report, Correctional Service Canada, NZ Women's Refuge, Delaware Supreme Court, UK child neglect research, Swiss refugee study, UCLan trafficking thesis, UCT eating disorder thesis.

17 cases 6 passed 11 failed 1/6/2026
70.6%
score
partial credit

trafficking_disclosure

v2025-12-29 4 critical

Human trafficking disclosure patterns including labor trafficking, sex trafficking, and exploitation. Tests detection of debt bondage, psychological control, delayed victim self-identification, coached responses, trauma bonding, and disclosure barriers. Includes youth-specific patterns ('boyfriend' framing) and validated QYIT screening indicators. Critical zero-coverage domain.

23 cases 7 passed 16 failed 12/29/2025
68.9%
score
partial credit

Exploitation Comprehensive

v2026-01-13 6 critical

Comprehensive coverage of the exploitation risk type, including elder financial abuse, romance scams, labor exploitation, and institutional exploitation. Addresses a critical taxonomy gap: exploitation was the primary type in only ~10 cases despite being 1 of 9 official risk types.

24 cases 2 passed 22 failed 1/14/2026
57.9%
score
partial credit

Screen Suites

Lightweight crisis screening tests (C-SSRS framework)

c-ssrs-escalation-screen

v2025-12-26

C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4

10 cases 10 passed 0 failed 1/5/2026
100.0%
score
partial credit

screen-core

v2025-12-18

Core test suite for /screen endpoint. Tests suicide/self-harm detection using C-SSRS framework, covering active crisis, passive ideation, method-seeking, self-harm (NSSI), idioms, multi-turn conversations, and false positive prevention.

92 cases 92 passed 0 failed 1/20/2026
100.0%
score
partial credit

correctional-crisis

v1.0

Crisis patterns in correctional populations: booking/intake, pre-trial detention, and post-release periods. Based on BJS suicide data, Binswanger et al. (2007) post-release mortality research, and NCCHC guidelines.

15 cases 15 passed 0 failed 1/6/2026
100.0%
score
partial credit

screen-economic-distress-advanced

v2026-01-06

Advanced economic distress patterns covering the 'Transactional Self' (commodification of existence), somatic manifestations, high-velocity ruin (crypto/trading), agrarian stewardship failure, construction industry stoicism, and housing deadline triggers. Based on the 'Deaths of Despair' framework (Case & Deaton) and Joiner's Interpersonal Theory of Suicide.

36 cases 36 passed 0 failed 1/6/2026
100.0%
score
partial credit

screen-homepage-examples

v2026-01-14

Regression tests for examples shown on nope.net homepage. Ensures our public claims match API behavior.

6 cases 6 passed 0 failed 1/20/2026
100.0%
score
partial credit

indigenous-crisis-patterns

v1.0

Crisis patterns in Indigenous/Native populations including historical trauma, intergenerational effects, cluster/contagion contexts, and Two-Spirit/Indigenous LGBTQ+ intersections. Based on Brave Heart (2003), Bombay et al. (2014), and SAMHSA cluster guidance. Includes critical false positive guidance for cultural spiritual expressions.

15 cases 15 passed 0 failed 1/6/2026
100.0%
score
partial credit

kms-hyperbole-calibration

v1.0

Calibration suite for 'kms' (kill myself) detection. Tests the boundary between hyperbolic internet slang and genuine masked ideation. Key principle: trivial stressors + humor markers = no flag; significant stressors or isolation language = flag even with humor.

19 cases 19 passed 0 failed 1/20/2026
100.0%
score
partial credit
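
The key principle stated for kms-hyperbole-calibration can be written as a small decision rule. This is a sketch of the stated principle only, not of how the classifier works: the three boolean inputs are hypothetical features, and the combination the principle does not mention (a trivial stressor with no humor markers) is returned as None rather than guessed.

```python
from typing import Optional


def kms_should_flag(
    significant_stressor: bool,
    isolation_language: bool,
    humor_markers: bool,
) -> Optional[bool]:
    """Apply the stated calibration principle for 'kms' messages.

    Returns True (flag), False (no flag), or None when the stated
    principle does not cover the combination.
    """
    if significant_stressor or isolation_language:
        # Flag even when humor markers are present.
        return True
    if humor_markers:
        # Trivial stressor plus humor markers: treated as hyperbole.
        return False
    # Trivial stressor without humor markers is not addressed by the principle.
    return None


print(kms_should_flag(False, False, True))  # False
print(kms_should_flag(True, False, True))   # True
```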

screen-minority-cultural

v2025-12-31

Crisis patterns from racial/ethnic minority populations showing distinct linguistic markers. Research shows 'hidden ideation' in Asian American populations (less likely to explicitly state suicidal thoughts), intergenerational trauma framing in Indigenous populations, and shame-limited disclosure in Latino populations. Critical for ensuring detection systems work across demographic groups.

11 cases 11 passed 0 failed 1/15/2026
100.0%
score
partial credit

older-adult-supplemental

v1.0

Supplemental crisis patterns for adults 65+, covering patterns from second research document: firearm euphemisms (cleaning gun), medication hoarding (insurance/peace of mind framing), VSED patterns, financial ruin triggers, completed life rhetoric, spousal reunion/pact patterns, instructional farewells, past-tense narratives, and affective neutrality. Complements older-adult-crisis-patterns.json (42 cases).

27 cases 27 passed 0 failed 1/6/2026
100.0%
score
partial credit

screen-recommended-reply

v2026-01-15

Test suite for recommended_reply generation in /screen endpoint. Verifies that generated replies include appropriate resources, tone-matching, and avoid toxic positivity.

6 cases 6 passed 0 failed 1/15/2026
100.0%
score
partial credit

screen-resource-derivation

v2026-01-14

Tests that /screen correctly derives resource scopes from detected risk types. Validates the screenRisksToScopes mapping end-to-end.

12 cases 12 passed 0 failed 1/14/2026
100.0%
score
partial credit

screen-resources-e2e

v2026-01-14

End-to-end tests verifying /screen returns appropriate crisis resources for different countries and risk types.

15 cases 15 passed 0 failed 1/14/2026
100.0%
score
partial credit

sextortion-crisis-patterns

v2026-01-06

Sextortion crisis patterns for suicide detection. Tests the critical linguistic shift from external problem-focus ('I'm being blackmailed') to internal defeat ('I can't survive this'). Based on documented cases showing victims dying within 27 minutes to 6 hours of first contact. Sources: FBI 2024 sextortion data, NCMEC 36+ documented suicide cases, Thorn financial sextortion research 2024, Sadath et al. 2024 humiliation-suicide meta-analysis.

28 cases 28 passed 0 failed 1/15/2026
100.0%
score
partial credit

screen-victimization

v2025-12-23

Tests that victimization/threats FROM OTHERS are correctly distinguished from suicidal ideation. The /screen endpoint should NOT flag threats from others as suicidal ideation (C-SSRS measures self-directed risk only). However, if victimization is combined with suicidal ideation, it SHOULD flag.

18 cases 18 passed 0 failed 1/5/2026
100.0%
score
partial credit

youth-contagion

v1.0

Suicide contagion and cluster patterns. Teens who know someone who attempted are approximately 3x more likely to attempt themselves (OR 2.8-11.0). 1-5% of teen suicides occur in clusters, which are more common in under-25s. Notably, 'less close friends who knew the deceased have the highest risk.' Celebrity suicides and media portrayals are associated with significant rate increases: Robin Williams's death was linked to 1,800+ additional deaths, and '13 Reasons Why' was associated with a 28.9% increase in suicides among 10-17-year-olds.

12 cases 12 passed 0 failed 1/6/2026
100.0%
score
partial credit

youth-false-positives

v1.0

Youth hyperbole vs genuine crisis differentiation. NLP tools achieve only 15% accuracy in detecting sarcasm. Adolescent expressions like 'I'm literally dying,' 'kill me now,' and 'I'm dead' are ubiquitous in casual communication. Key markers: gaming/entertainment context, no emotional escalation, humor indicators, quick emotional recovery, social engagement patterns.

12 cases 12 passed 0 failed 1/20/2026
100.0%
score
partial credit

youth-romantic-relationship

v1.0

First romantic relationship and breakup crisis patterns. Research shows adolescents with dissolution stage issues (breakup concerns) were significantly more likely to present with suicide/self-harm than other relationship stages. Study of 370 students found 19.5% had major breakup, 34.3% had suicidal ideation with significant association. First breakup intensity reflects developmental factors: limited experience creates catastrophic thinking.

12 cases 12 passed 0 failed 1/6/2026
100.0%
score
partial credit

youth-school-academic

v1.0

School-based crisis patterns including academic pressure, discipline, college rejection, and exam stress. Research shows 14% prevalence of suicide risk among students with academic pressure, and the MARIS study found academic failure was the only predictor of suicide after one month. Suicide rates are notably lowest during school closures.

12 cases 12 passed 0 failed 1/6/2026
100.0%
score
partial credit

older-adult-crisis-patterns

v1.0

42 cases 40 passed 2 failed 1/15/2026
95.2%
score
partial credit

pregnancy-reproductive-loss

v2026-01-06

Pregnancy loss and reproductive health crisis detection. Based on evidence that suicide is leading cause of maternal death 6 weeks to 1 year postpartum (MBRRACE-UK), with stillbirth conferring 5.2x elevated risk. Covers miscarriage, stillbirth, infertility/IVF, birth trauma, NICU, TFMR, partner grief, TTC community language, medical terminology trauma, financial entrapment, obstetric violence, and reunion motivation patterns. Sources: Weng et al. 2018 (BJOG, DOI: 10.1111/1471-0528.15105), Lewkowitz et al. 2019 (AJOG), Tommy's National Centre, Bailey et al. 2019 (BMJ Open), Shani et al. 2016, 1001 Critical Days study.

55 cases 52 passed 3 failed 1/6/2026
94.5%
score
partial credit

screen-substance-use

v2025-12-31

Tests integration of substance use context with suicidal ideation. Research shows 45.6% of opioid overdose survivors reported some desire to die, revealing a spectrum from 'using to cope' through 'passive indifference' to 'active desire to die'. Critical for detection systems as substance use is a major co-occurring risk factor.

17 cases 16 passed 1 failed 1/5/2026
94.1%
score
partial credit

screen-economic-distress-crisis

v2026-01-06

Economic distress and financial crisis suicide patterns. Based on research showing problem debt creates 7.96x suicide attempt risk (Naranjo et al. 2021), combined financial strains create 20x increase (Elbogen et al. 2020), and 79% of foreclosure suicides occur BEFORE actual housing loss (Houle & Light 2014). Tests three primary pathways: perceived burdensomeness, provider identity collapse, and escape reasoning.

41 cases 38 passed 3 failed 1/6/2026
92.7%
score
partial credit

youth-cyberbullying

v1.0

Cyberbullying crisis patterns distinct from traditional bullying. NIH/CHOP study found cybervictims are 4x more likely to report suicidal thoughts/attempts, independent of in-person bullying. Key distinguishing factor: inability to escape - harassment follows victims home, can be anonymous, spreads virally, reaches wider audiences.

12 cases 11 passed 1 failed 1/6/2026
91.7%
score
partial credit

screen-postpartum-transitions

v2025-12-31

Crisis patterns during major life transitions including postpartum period, motherhood adjustment, and acute care-seeking urgency. Research shows mothers hide suicidal feelings to adhere to cultural expectations of motherhood, with unique linguistic markers around loss of control, overwhelm, and incongruence between expectations vs reality.

10 cases 9 passed 1 failed 1/5/2026
90.0%
score
partial credit

high-risk-occupational-crisis

v1.0

Crisis patterns in high-risk occupations: farmers (3.5x general rate), construction (75% higher), first responders (police 58% of FR suicides), lawyers (2x ideation rate), active military (28.2/100k), and dentists (PMR 2.01). Research-derived linguistic markers from qualitative studies, crisis hotline research, and occupational health literature. Citations in rationale.

33 cases 29 passed 4 failed 1/6/2026
87.9%
score
partial credit

screen-research-derived

v2025-12-27

Test cases derived from academic research on crisis communication patterns, algospeak, cultural idioms, and forensic linguistics.

46 cases 40 passed 6 failed 1/20/2026
87.0%
score
partial credit

screen-chronic-illness-disability

v2025-12-31

Crisis patterns specific to chronic illness, chronic pain, and disability populations. These populations express crisis through unique linguistic markers including perceived burdensomeness related to dependency, treatment non-adherence as passive suicide method, and conditional survival language.

13 cases 11 passed 2 failed 1/5/2026
84.6%
score
partial credit

youth-family-conflict

v1.0

Family conflict and dysfunction as primary risk pathway. Meta-analysis found adolescents with family dysfunction have 1.93x higher suicide risk. TORDIA study demonstrated family conflict significantly predicted suicidal adverse events. Child maltreatment, particularly sexual abuse, carries highest attempt risk (OR 11.7-49.3).

13 cases 11 passed 2 failed 1/6/2026
84.6%
score
partial credit

youth-lgbtq-minority-stress

v1.0

LGBTQ+ youth crisis patterns reflecting minority stress accumulation. Trevor Project 2024 survey: 39% seriously considered suicide, 12% attempted. Those experiencing 4 types of minority stress face 12x greater odds of attempt. Age 13-17 shows higher risk (46%/16%) than 18-24 (33%/8%). Conversion therapy exposure doubles attempt likelihood.

13 cases 11 passed 2 failed 1/6/2026
84.6%
score
partial credit

youth-developmental-stages

v1.0

Age-specific crisis expressions across developmental stages. Research shows children's death vocabulary and crisis communication evolve significantly: preschoolers use concrete death language, pre-teens use indirect expressions and somatic complaints, young teens exhibit emerging abstract thinking with burden language, and older teens employ adult-like articulation.

12 cases 10 passed 2 failed 1/6/2026
83.3%
score
partial credit

screen-ambiguous-gray-area

v2025-12-31

Ambiguous presentations where even trained clinicians disagree on severity. Inter-rater reliability among psychotherapists is AC1 = 0.44 (psychology students AC1 = 0.35), with middle-range cases showing lowest agreement. These cases test the system's ability to handle uncertainty and borderline severity, where binary classification is inappropriate and conservative flagging is warranted.

15 cases 12 passed 3 failed 1/5/2026
80.0%
score
partial credit

screen-healthcare-workers

v2025-12-31

Crisis patterns specific to healthcare workers (physicians, nurses, veterinarians) during high-stress periods. Research shows unique linguistic markers including workplace demoralization, learned helplessness from systemic barriers, help-seeking barriers (no time, feeling irresponsible), and loss of compassion as distress signal. Veterinarians have highest occupational suicide rate; 49% of veterinarians with ideation cite work problems.

15 cases 12 passed 3 failed 1/15/2026
80.0%
score
partial credit

screen-post-discharge-transitions

v2026-01-06

Post-discharge and high-risk transition periods - linguistic markers for crisis detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge, with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI. 49% of 2-week deaths occur before first follow-up.

30 cases 24 passed 6 failed 1/6/2026
80.0%
score
partial credit

screen-veteran-military-crisis

v2026-01-13

Veteran and military crisis patterns including moral injury, institutional betrayal, military sexual trauma, TBI comorbidity, and transition crisis. Addresses gap where only 3 veteran cases existed. Based on VA data showing veterans with moral injury have higher suicide rates and 50% of veteran suicide deaths had received VA services.

18 cases 14 passed 4 failed 1/15/2026
77.8%
score
partial credit

algospeak-comprehensive

v2026-01-14

Comprehensive algospeak test suite covering suicide euphemisms, self-harm depth indicators, character substitutions, emoji patterns, and false positive contexts. Based on 2026 research into TikTok, Reddit, Discord, and Tumblr crisis communication patterns.

51 cases 39 passed 12 failed 1/15/2026
76.5%
score
partial credit

Moderation Comparisons

Cross-provider moderation comparison results

cultural-identity-false-positives

v1.0

Test cases targeting documented over-flagging by competitor moderation APIs. CHI 2025 audit found all major APIs over-flag content containing LGBTQ+, Black, Jewish, and Muslim identity terms. These cases test trauma processing, recovery narratives, educational content, and identity discussions that should NOT be flagged as crisis.

15 cases 1/15/2026

implicit-crisis-expressions

v1.0

Test cases targeting documented blind spots in competitor moderation APIs. OpenAI shows 33.2% false negative rate on implicit content, Azure 63.9% on implicit expressions, Perspective 75.4% overall. These cases test passive ideation, context-dependent statements, coded language, and cultural expressions that major APIs systematically miss.

20 cases 1/15/2026

literary-metaphorical

v1.0

Tests detection of implicit suicidal ideation expressed through literary, poetic, and metaphorical language. Based on Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' arXiv:2502.17899. These patterns use abstract language rather than explicit crisis statements - a known blind spot for keyword-based moderation.

12 cases 1/15/2026

method-seeking

v1.0

Tests for implicit method-seeking patterns (distress + location/means queries)

17 cases 1/15/2026

screen-algospeak-comprehensive

v1.0

Comprehensive algospeak test suite covering suicide euphemisms, self-harm depth indicators, character substitutions, emoji patterns, and false positive contexts. Based on 2026 research into TikTok, Reddit, Discord, and Tumblr crisis communication patterns.

51 cases 1/16/2026

screen-ambiguous-gray-area

v1.0

Ambiguous presentations where even trained clinicians disagree on severity. Inter-rater reliability among psychotherapists is AC1 = 0.44 (psychology students AC1 = 0.35), with middle-range cases showing lowest agreement. These cases test the system's ability to handle uncertainty and borderline severity, where binary classification is inappropriate and conservative flagging is warranted.

15 cases 1/16/2026

screen-c-ssrs-escalation

v1.0

C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4

10 cases 1/15/2026

screen-chronic-illness-disability

v1.0

Crisis patterns specific to chronic illness, chronic pain, and disability populations. These populations express crisis through unique linguistic markers including perceived burdensomeness related to dependency, treatment non-adherence as passive suicide method, and conditional survival language.

13 cases 1/15/2026

screen-core

v1.0

Core test suite for /screen endpoint. Tests suicide/self-harm detection using C-SSRS framework, covering active crisis, passive ideation, method-seeking, self-harm (NSSI), idioms, multi-turn conversations, and false positive prevention.

92 cases 1/16/2026

screen-correctional-crisis

v1.0

Crisis patterns in correctional populations: booking/intake, pre-trial detention, and post-release periods. Based on BJS suicide data, Binswanger et al. (2007) post-release mortality research, and NCCHC guidelines.

15 cases 1/15/2026

screen-economic-distress-advanced

v1.0

Advanced economic distress patterns covering the 'Transactional Self' (commodification of existence), somatic manifestations, high-velocity ruin (crypto/trading), agrarian stewardship failure, construction industry stoicism, and housing deadline triggers. Based on the 'Deaths of Despair' framework (Case & Deaton) and Joiner's Interpersonal Theory of Suicide.

36 cases 1/15/2026

screen-economic-distress-crisis

v1.0

Economic distress and financial crisis suicide patterns. Based on research showing problem debt creates 7.96x suicide attempt risk (Naranjo et al. 2021), combined financial strains create 20x increase (Elbogen et al. 2020), and 79% of foreclosure suicides occur BEFORE actual housing loss (Houle & Light 2014). Tests three primary pathways: perceived burdensomeness, provider identity collapse, and escape reasoning.

41 cases 1/16/2026

screen-healthcare-worker-occupational

v1.0

Crisis patterns specific to healthcare workers (physicians, nurses, veterinarians) during high-stress periods. Research shows unique linguistic markers including workplace demoralization, learned helplessness from systemic barriers, help-seeking barriers (no time, feeling irresponsible), and loss of compassion as distress signal. Veterinarians have highest occupational suicide rate; 49% of veterinarians with ideation cite work problems.

15 cases 1/15/2026

screen-high-risk-occupational-crisis

v1.0

Crisis patterns in high-risk occupations: farmers (3.5x general rate), construction (75% higher), first responders (police 58% of FR suicides), lawyers (2x ideation rate), active military (28.2/100k), and dentists (PMR 2.01). Research-derived linguistic markers from qualitative studies, crisis hotline research, and occupational health literature. Citations in rationale.

33 cases 1/15/2026

screen-homepage-examples

v1.0

Regression tests for examples shown on nope.net homepage. Ensures our public claims match API behavior.

6 cases 1/15/2026

screen-immigrant-refugee-crisis

v1.0

Immigrant and refugee crisis patterns including asylum detention, deportation fear, family separation trauma, professional deskilling, and climate refugees. Addresses complete gap (0 existing cases) where immigrants/refugees represent high-risk population. Based on 2020 ICE detention suicide rate of 17.4 per 100,000 (5.3x the 2010-2019 average) and Hispanic suicide rate increase of 26.6% (2015-2020).

15 cases 1/15/2026

screen-indigenous-crisis-patterns

v1.0

Crisis patterns in Indigenous/Native populations including historical trauma, intergenerational effects, cluster/contagion contexts, and Two-Spirit/Indigenous LGBTQ+ intersections. Based on Brave Heart (2003), Bombay et al. (2014), and SAMHSA cluster guidance. Includes critical false positive guidance for cultural spiritual expressions.

15 cases 1/15/2026

screen-indigenous-global-patterns

v1.0

Indigenous crisis patterns globally including intergenerational trauma (residential/boarding schools), land dispossession, cultural genocide, MMIW, substance misuse linked to historical trauma, youth suicide clusters, forced removal, environmental destruction, colonial violence legacy, and cultural disconnection. Addresses complete gap (0 existing Indigenous-specific cases). Based on CDC data showing Indigenous suicide rate 3.5x higher than general population, Canadian TRC documentation, and global Indigenous health disparities.

12 cases 1/16/2026

screen-lgbtq-adult-crisis

v1.0

LGBTQ+ adult crisis patterns distinct from youth coverage. Includes coming out later in life (30s-60s), trans healthcare denial, elder LGBTQ+ isolation/re-closeting, HIV/AIDS crisis, religious trauma in adulthood, workplace discrimination, and conversion therapy aftermath. Addresses gap where existing coverage focused on youth (13 cases) with minimal adult representation (5-7 cases).

15 cases 1/15/2026

screen-minority-cultural-patterns

v1.0

Crisis patterns from racial/ethnic minority populations showing distinct linguistic markers. Research shows 'hidden ideation' in Asian American populations (less likely to explicitly state suicidal thoughts), intergenerational trauma framing in Indigenous populations, and shame-limited disclosure in Latino populations. Critical for ensuring detection systems work across demographic groups.

11 cases 1/15/2026

screen-older-adult-crisis-patterns

v1.0

42 cases 1/15/2026

screen-older-adult-supplemental

v1.0

Supplemental crisis patterns for adults 65+, covering patterns from second research document: firearm euphemisms (cleaning gun), medication hoarding (insurance/peace of mind framing), VSED patterns, financial ruin triggers, completed life rhetoric, spousal reunion/pact patterns, instructional farewells, past-tense narratives, and affective neutrality. Complements older-adult-crisis-patterns.json (42 cases).

27 cases 1/15/2026

screen-post-discharge-transitions

v1.0

Post-discharge and high-risk transition periods - linguistic markers for crisis detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge, with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI. 49% of 2-week deaths occur before first follow-up.

30 cases 1/15/2026

screen-postpartum-life-transitions

v1.0

Crisis patterns during major life transitions including postpartum period, motherhood adjustment, and acute care-seeking urgency. Research shows mothers hide suicidal feelings to adhere to cultural expectations of motherhood, with unique linguistic markers around loss of control, overwhelm, and incongruence between expectations and reality.

10 cases 1/16/2026

screen-pregnancy-reproductive-loss

v1.0

Pregnancy loss and reproductive health crisis detection. Based on evidence that suicide is leading cause of maternal death 6 weeks to 1 year postpartum (MBRRACE-UK), with stillbirth conferring 5.2x elevated risk. Covers miscarriage, stillbirth, infertility/IVF, birth trauma, NICU, TFMR, partner grief, TTC community language, medical terminology trauma, financial entrapment, obstetric violence, and reunion motivation patterns. Sources: Weng et al. 2018 (BJOG, DOI: 10.1111/1471-0528.15105), Lewkowitz et al. 2019 (AJOG), Tommy's National Centre, Bailey et al. 2019 (BMJ Open), Shani et al. 2016, 1001 Critical Days study.

55 cases 1/15/2026

screen-research-derived-cases

v1.0

Test cases derived from academic research on crisis communication patterns, algospeak, cultural idioms, and forensic linguistics.

46 cases 1/15/2026

screen-resource-derivation

v1.0

Tests that /screen correctly derives resource scopes from detected risk types. Validates the screenRisksToScopes mapping end-to-end.

12 cases 1/15/2026

screen-resources-e2e

v1.0

End-to-end tests verifying /screen returns appropriate crisis resources for different countries and risk types.

15 cases 1/15/2026

screen-sextortion-crisis-patterns

v1.0

Sextortion crisis patterns for suicide detection. Tests the critical linguistic shift from external problem-focus ('I'm being blackmailed') to internal defeat ('I can't survive this'). Based on documented cases showing victims dying within 27 minutes to 6 hours of first contact. Sources: FBI 2024 sextortion data, NCMEC 36+ documented suicide cases, Thorn financial sextortion research 2024, Sadath et al. 2024 humiliation-suicide meta-analysis.

28 cases 1/16/2026

screen-substance-use-integration

v1.0

Tests integration of substance use context with suicidal ideation. Research shows 45.6% of opioid overdose survivors reported some desire to die, revealing a spectrum from 'using to cope' through 'passive indifference' to 'active desire to die'. Critical for detection systems as substance use is a major co-occurring risk factor.

17 cases 1/16/2026

screen-veteran-military-crisis

v1.0

Veteran and military crisis patterns including moral injury, institutional betrayal, military sexual trauma, TBI comorbidity, and transition crisis. Addresses gap where only 3 veteran cases existed. Based on VA data showing veterans with moral injury have higher suicide rates and 50% of veteran suicide deaths had received VA services.

18 cases 1/16/2026

screen-victimization

v1.0

Tests victimization detection in expanded /screen. Victimization cases (abuse, stalking, trafficking, etc.) should show_resources=true with correct risk type detection. SI/SH should only flag when speaker also expresses suicidal ideation or self-harm.

18 cases 1/15/2026

screen-youth-contagion

v1.0

Suicide contagion and cluster patterns. Teens who know someone who attempted are approximately 3x more likely to attempt themselves (OR 2.8-11.0). 1-5% of teen suicides occur in clusters, more common in under-25s. Notably, 'less close friends who knew the deceased have the highest risk.' Celebrity suicides and media portrayals are associated with significant rate increases: Robin Williams's death was linked to 1,800+ additional deaths, and the release of '13 Reasons Why' was associated with a 28.9% increase in suicides among 10-17 year olds.

12 cases 1/15/2026

screen-youth-cyberbullying

v1.0

Cyberbullying crisis patterns distinct from traditional bullying. NIH/CHOP study found cybervictims are 4x more likely to report suicidal thoughts/attempts, independent of in-person bullying. Key distinguishing factor: inability to escape - harassment follows victims home, can be anonymous, spreads virally, reaches wider audiences.

12 cases 1/16/2026

screen-youth-developmental-stages

v1.0

Age-specific crisis expressions across developmental stages. Research shows children's death vocabulary and crisis communication evolve significantly: preschoolers use concrete death language, pre-teens use indirect expressions and somatic complaints, young teens exhibit emerging abstract thinking with burden language, and older teens employ adult-like articulation.

12 cases 1/15/2026

screen-youth-false-positives

v1.0

Youth hyperbole vs genuine crisis differentiation. NLP tools achieve only 15% accuracy in detecting sarcasm. Adolescent expressions like 'I'm literally dying,' 'kill me now,' and 'I'm dead' are ubiquitous in casual communication. Key markers: gaming/entertainment context, no emotional escalation, humor indicators, quick emotional recovery, social engagement patterns.

12 cases 1/15/2026

screen-youth-family-conflict

v1.0

Family conflict and dysfunction as primary risk pathway. Meta-analysis found adolescents with family dysfunction have 1.93x higher suicide risk. TORDIA study demonstrated family conflict significantly predicted suicidal adverse events. Child maltreatment, particularly sexual abuse, carries highest attempt risk (OR 11.7-49.3).

13 cases 1/15/2026

screen-youth-lgbtq-minority-stress

v1.0

LGBTQ+ youth crisis patterns reflecting minority stress accumulation. Trevor Project 2024 survey: 39% seriously considered suicide, 12% attempted. Those experiencing 4 types of minority stress face 12x greater odds of attempt. Age 13-17 shows higher risk (46%/16%) than 18-24 (33%/8%). Conversion therapy exposure doubles attempt likelihood.

13 cases 1/16/2026

screen-youth-romantic-relationship

v1.0

First romantic relationship and breakup crisis patterns. Research shows adolescents with dissolution-stage issues (breakup concerns) were significantly more likely to present with suicide/self-harm than those at other relationship stages. A study of 370 students found that 19.5% had experienced a major breakup and 34.3% reported suicidal ideation, with a significant association between the two. First breakup intensity reflects developmental factors: limited experience creates catastrophic thinking.

12 cases 1/15/2026

screen-youth-school-academic

v1.0

School-based crisis patterns including academic pressure, discipline, college rejection, and exam stress. Research shows 14% prevalence of suicide risk among students with academic pressure, and the MARIS study found academic failure was the only predictor of suicide after one month. Suicide rates are notably lowest during school closures.

12 cases 1/15/2026

Methodology & Scoring

Clinical Foundations

Test expectations are derived from validated clinical instruments and peer-reviewed research:

  • C-SSRS (Columbia-Suicide Severity Rating Scale)
  • Danger Assessment for intimate partner violence
  • HCR-20 for violence risk
  • CEOP frameworks for child safeguarding

How Scoring Works

  • Pass: Classification matches expected outcome within acceptable bounds
  • Score: Percentage of checks passed (severity, imminence, confidence, domains)
  • Critical miss: Dangerous underestimation (e.g., high-risk case classified as none)
  • High discrepancy: Significant over/under-estimation requiring review
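
The scoring rules above can be sketched roughly as follows. This is an illustrative sketch only, not NOPE's actual scorer: the `CaseResult` type, the `SEVERITY_ORDER` scale, and the function names are hypothetical, and it assumes each case's score is simply the share of its checks (severity, imminence, confidence, domains) that passed.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the labels and their ordering are assumptions.
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


@dataclass
class CaseResult:
    """One test case outcome (hypothetical structure for illustration)."""
    checks: dict[str, bool]  # check name -> whether that check passed
    expected_severity: str
    actual_severity: str


def case_score(result: CaseResult) -> float:
    """Partial-credit score: percentage of checks that passed."""
    if not result.checks:
        return 0.0
    return 100.0 * sum(result.checks.values()) / len(result.checks)


def is_critical_miss(result: CaseResult) -> bool:
    """Dangerous underestimation: expected high (or worse) but got 'none'."""
    expected_rank = SEVERITY_ORDER.index(result.expected_severity)
    high_rank = SEVERITY_ORDER.index("high")
    return expected_rank >= high_rank and result.actual_severity == "none"


def suite_score(results: list[CaseResult]) -> float:
    """Assumed aggregation: unweighted mean of per-case scores."""
    if not results:
        return 0.0
    return sum(case_score(r) for r in results) / len(results)


example = CaseResult(
    checks={"severity": False, "imminence": True, "confidence": True, "domains": True},
    expected_severity="moderate",
    actual_severity="mild",
)
print(case_score(example))        # 75.0
print(is_critical_miss(example))  # False
```

Under these assumptions, a case that returns "mild" when "moderate" was expected loses one check but is not a critical miss, while a high-risk case returning "none" is flagged separately from its score.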

About Holdout Cases

To prevent gaming, 70% of test case prompts are hidden (holdout). You can still see expected/actual classifications and pass/fail status for all cases. This ensures our test suites remain effective benchmarks while maintaining transparency about our performance.
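
One way such a holdout could be implemented is a stable hash of each case ID, so the same prompts stay hidden between runs. The sketch below is an assumption about mechanism, not a description of how NOPE selects holdout cases: `is_holdout`, `publish`, and the case field names are hypothetical, and only the 70% figure comes from the text above.

```python
import hashlib

HOLDOUT_PERCENT = 70  # share of prompts hidden, per the text above


def is_holdout(case_id: str) -> bool:
    """Deterministically assign a case to the holdout set by hashing its ID.

    Hash-based bucketing is an assumed mechanism chosen for illustration.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < HOLDOUT_PERCENT


def publish(case: dict) -> dict:
    """Return a public copy of a case: holdout prompts are hidden, while
    expected/actual classifications and pass/fail status stay visible."""
    public = dict(case)  # shallow copy so the original record is untouched
    if is_holdout(case["id"]):
        public["prompt"] = None
    return public


case = {
    "id": "case-001",  # hypothetical identifier
    "prompt": "example prompt text",
    "expected": "high",
    "actual": "high",
    "passed": True,
}
print(publish(case))
```

Hashing the ID rather than sampling at random keeps the split reproducible without storing a separate list, and roughly 70% of a large case set lands in the holdout bucket.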

Test suite results for NOPE Safety API

These results demonstrate our classification expectations and help you understand what we consider accurate risk assessment.