Test Suite Transparency
Safety-critical systems require rigorous, transparent testing. Our test suites are grounded in validated clinical frameworks (including C-SSRS, Danger Assessment, and HCR-20) and in peer-reviewed literature. Every test case shows expected classifications, actual results, and clinical rationale.
Why we publish our failures
A healthy failure rate isn't a bug—it's essential. Crisis detection involves inherently subjective judgments. Two clinicians reviewing the same conversation will often disagree. The same phrase can signal genuine distress or casual hyperbole depending on context that may not be present in the text.
We publish these results because transparency matters more than optics. Failures tell us where the hard cases are. They drive our roadmap. We work with clinical frameworks and medical professionals to refine our understanding of how risk presents in text-based conversations—but we'll never detect everything, and we don't pretend otherwise.
If a safety system claims 100% accuracy, be skeptical. The honest answer is: it's complicated, and we're working on it.
Our testing philosophy
Direction over exactitude
We care more about whether NOPE catches crisis at all than whether it labels something "moderate" vs "high." A case returning "mild" when we expected "moderate" is still going in the right direction. A case returning "none" when it should flag something is a real gap.
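The distinction above can be sketched as a small comparison helper. This is an illustrative sketch only: the five-level ordering (none, mild, moderate, high, critical) and the outcome labels are assumptions made for the example, not NOPE's published scoring logic.

```python
# Assumed severity scale, lowest to highest (hypothetical ordering).
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


def compare(expected: str, actual: str) -> str:
    """Classify how an actual severity label relates to the expected one."""
    e = SEVERITY_ORDER.index(expected)
    a = SEVERITY_ORDER.index(actual)
    if e == a:
        return "exact"
    if e > 0 and a == 0:
        # Risk was expected but nothing was flagged: a real gap.
        return "missed"
    if e == 0 and a > 0:
        # Nothing was expected but something was flagged.
        return "false_positive"
    # Both sides flag risk; only the level differs.
    return "right_direction"


if __name__ == "__main__":
    print(compare("moderate", "mild"))
    print(compare("moderate", "none"))
```

Under these assumptions, "mild" against an expected "moderate" counts as the right direction, while "none" against an expected "moderate" counts as a miss.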
Expectations are hypotheses
Test expectations represent clinical intuition, not ground truth. Real ground truth requires expert annotation with inter-rater reliability. Our tests explore where NOPE's behavior aligns with or diverges from clinical frameworks—not where it's "right" or "wrong."
Evaluate
45 suites · 1,198 cases
Screen
35 suites · 794 cases
Evaluate Suites
Full risk classification testing with severity, imminence, and risk features
adversarial-robustness
v2025-12-18 · No safety gaps
Tests classifier robustness against adversarial inputs: jailbreak attempts, classification evasion, minimization disclaimers, encoded content, and meta-level manipulation. Goal: ensure genuine risk is detected even when wrapped in evasion tactics.
c-ssrs-escalation
v2025-12-26 · No safety gaps
C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4
chat_register_false_negatives
v2025-12-25 · No safety gaps
False-negative focused test suite for chat-register crisis patterns. Tests whether NOPE catches crisis expressed in Gen-Z/millennial slang, minimized language, and indirect patterns that might be missed. Inverse of fp-research-driven.json.
core_transparency_v2
v2025-12-17 · No safety gaps
Evidence-based core test suite for current taxonomy. Based on dual independent clinical research review. 26 cases covering full risk taxonomy with clinically-informed expectations.
jailbreaking
v2025-12-29 · No safety gaps
Adversarial jailbreaking suite testing NOPE-aware attacks. Assumes users may: (1) know they're being classified by NOPE, (2) understand the XML schema and taxonomy, (3) attempt meta-instruction injection, (4) try to game severity/imminence thresholds. Goal: ensure genuine risk is detected regardless of adversarial framing.
literary_metaphorical
v2025-12-30 · No safety gaps
Literary, poetic, and metaphorical expressions of suicidal ideation. Inspired by Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' which found LLMs struggle with abstract, metaphorical expressions. Tests detection of: cinematic metaphors ('fading out'), isolation metaphors (walls, barriers), existential/philosophical death framing, and artistic/literary references that encode suicidal ideation.
subject-context-features
v2026-01-16 · No safety gaps
Tests for subject context features: animal_involved, minor_involved, infant_involved, elderly_involved, vulnerable_adult_involved, parental_incapacity, witnessing_violence. These context markers help identify WHO is involved/at risk beyond the primary subject.
realistic_prevalence_v2
v2025-12-17 · No safety gaps
Realistic prevalence test suite mimicking real-world deployment conditions. 200 messages with ~1% crisis rate (2 true crises, 198 non-crises). Tests PPV, specificity, and false positive rate under realistic class imbalance. Based on Swaminathan et al. (2023) findings: prospective deployment had 0.55% crisis prevalence. (Converted to the subject/type taxonomy.)
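To make the class-imbalance point concrete, here is a minimal worked sketch of the three metrics named above. The confusion-matrix counts are hypothetical (both crises caught, 4 of the 198 non-crises flagged), chosen only to show how PPV behaves at ~1% prevalence. They are not results from this suite.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute PPV, specificity, and false positive rate from raw counts."""
    flagged = tp + fp
    negatives = tn + fp
    return {
        # Of everything flagged, how much was a true crisis?
        "ppv": tp / flagged if flagged else float("nan"),
        # Of all non-crises, how many were correctly left unflagged?
        "specificity": tn / negatives if negatives else float("nan"),
        # Of all non-crises, how many were wrongly flagged?
        "false_positive_rate": fp / negatives if negatives else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical outcome on 200 messages: 2 crises, 198 non-crises.
    metrics = screening_metrics(tp=2, fp=4, tn=194, fn=0)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With these assumed counts, specificity is 194/198 (about 98%) yet PPV is only 2/6 (about 33%), which is why a suite with realistic prevalence reports PPV separately from specificity.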
dangerous_content
v2025-12-21 · No safety gaps
Tests for detecting content that ENCOURAGES harm to others (not expressing personal distress). Includes dangerous challenges, pro-self-harm content, method sharing, and death validation. Key distinction: subject='other' (the person being encouraged) not 'self' (unless speaker is also at risk).
filter_none_v2
v2025-12-17 · No safety gaps
False positive regression suite for current taxonomy. Tests that benign content, humor, idioms, and non-personal discussions should NOT trigger risk detection. Critical for preventing over-flagging in production.
substance_crisis
v2025-12-21 · No safety gaps
Tests for substance abuse crisis detection, including substance+suicidal ideation combinations (known high-risk), overdose risk severity, and severe withdrawal states. These combinations are clinically established risk escalators.
stalking_sam
v2025-12-22 · No safety gaps
Tests for stalking detection using SAM (Stalking Assessment & Management) framework. Includes ex-intimate partner stalking (highest risk), escalation patterns, violence history, victim fear, and proxy harassment. SAM-informed risk levels: severe, elevated, standard.
age-context-calibration
v2025-12-18 · No safety gaps
Tests how age context (child/teen/adult) affects classification of normalized suicide language. Goal: observe natural calibration differences, not enforce hard rules.
harassment_severe
v2025-12-21 · No safety gaps
Tests for detecting severe, explicit harassment cases that can be identified from single messages (not pattern-based). Includes doxxing threats, sexual harassment, targeted degradation, and online pile-on indicators. Note: most harassment detection requires conversational context NOPE cannot provide, so these tests cover explicit/severe cases only.
expanded_taxonomy-v2
v2025-12-17 · No safety gaps
[v1] Expanded taxonomy validation using orthogonal subject/type structure. Covers online exploitation, extremism, human trafficking, and eating disorders.
protective_factors-v2
v2025-12-17 · No safety gaps
[v1] Test suite focused on protective factor DETECTION. Validates that protective factors are correctly identified from conversation content. Severity/imminence expectations are deliberately wide: this suite tests PF detection, not severity calibration.
filter_router-v2
v2025-12-17 · No safety gaps
[v1] Tests that the classifier correctly identifies risk subjects and types. Validates orthogonal subject (self/other) and type (suicide/abuse/violence/etc) classification. Maps v1 domain routing to v1 subject/type structure.
litmus-v2
v2025-12-17 · No safety gaps
Litmus test suite using orthogonal subject/type taxonomy. Each risk has subject (who) + type (what) + features. speaker_severity derived from risks where subject='self'. Comprehensive coverage of suicide, IPV, safeguarding, violence, and edge cases.
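As a rough illustration of the subject/type structure described above, the sketch below models a risk as subject + type + severity and derives a speaker-level severity from the risks whose subject is 'self'. The field names, the severity scale, and the choice to take the highest self-directed severity are all assumptions for the example. The source only states that speaker_severity is derived from risks where subject='self'.

```python
from dataclasses import dataclass
from typing import List

# Assumed severity scale, lowest to highest (hypothetical ordering).
SEVERITY_ORDER = ["none", "mild", "moderate", "high", "critical"]


@dataclass
class Risk:
    subject: str   # who is at risk: "self" or "other"
    type: str      # what the risk is, e.g. "suicide" or "violence"
    severity: str  # one of SEVERITY_ORDER


def speaker_severity(risks: List[Risk]) -> str:
    """Return the highest severity among self-directed risks (assumed rule)."""
    self_levels = [
        SEVERITY_ORDER.index(r.severity) for r in risks if r.subject == "self"
    ]
    # With no self-directed risks, the speaker-level severity is "none".
    return SEVERITY_ORDER[max(self_levels)] if self_levels else "none"


if __name__ == "__main__":
    risks = [
        Risk(subject="other", type="violence", severity="high"),
        Risk(subject="self", type="suicide", severity="mild"),
    ]
    print(speaker_severity(risks))
```

In this sketch a high-severity third-party risk does not raise the speaker's own severity, which is the disambiguation the subject field is meant to support.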
subject_attribution-v2
v2025-01-18 · No safety gaps
Tests for correct subject attribution (self vs other). Probes edge cases where speaker IS the victim but mentions others in their situation.
indirect_disclosure
v2025-12-18 · No safety gaps
Indirect crisis disclosure patterns including 'asking for a friend', third-person framing, hypothetical scenarios, minimization, and religious/spiritual framing. Research shows many callers take 10-15 minutes to disclose actual reason for crisis contact. These patterns are normative in most cultures: direct disclosure is the exception, not the rule.
input-framing-v2
v2025-12-17 · No safety gaps
[v1] Input framing detection test suite. Tests the ability to detect HOW content is presented (direct, humor, creative, distanced, adversarial, etc.). Framing is orthogonal to risk assessment: these tests verify framing detection alongside appropriate risk classification.
somatization_patterns
v2025-12-18 · No safety gaps
Crisis presentations through somatic (physical) symptoms rather than emotional language. 'Somatization is the single most common reason why psychiatric illness goes undetected in general medical settings.' Physical complaints are the acceptable way to express distress in many cultures where mental illness is stigmatized.
blindspots_v2
v2025-12-17 · 1 critical
Research-driven blindspots test suite for current taxonomy. Covers implicit ideation (Crisis Text Line predictors), population-specific patterns (LGBTQ+, veterans, healthcare workers), and adversarial scenarios. Based on RESEARCH-SYNTHESIS.md and clinical literature.
v2_population_context
v2025-12-17 · No safety gaps
Tests for population context extraction. Validates that demographic/identity indicators are captured as features to enable population-specific resource matching.
fp_research_driven-v2
v2025-12-25 · 3 high
[v1] False positive test cases derived from academic research on mental health classification systems. Tests negation handling, humor markers, physical pain idioms, internet slang, minimal-context slang variants, and other documented FP patterns.
active_crisis_imminent
v2025-12-29 · No safety gaps
Active suicide crisis and imminent attempt scenarios representing C-SSRS Level 5 (highest severity): active ideation with specific plan, intent, and timeline. Includes interrupted attempt language, preparatory acts, resolution syndrome, and in-progress attempts. Critical for regression prevention: these cases must be detected at critical/emergency severity.
v2_eating_disorders
v2025-12-29 · 1 critical
Comprehensive eating disorder detection tests including anorexia, bulimia, binge-eating disorder, ARFID, and orthorexia patterns. Validates that ED behaviors are correctly classified with appropriate features for specialist resource matching. Includes pro-ED community language, male-specific presentations, minimization patterns, ED+suicidality intersection, and validated assessment instrument language (EAT-26, EDE-Q).
acute_real_world_v2
v2025-12-17 · 1 critical
Acute real-world scenarios for current taxonomy. Covers immediate danger, psychosis, mania, AI attachment, safeguarding, and third-party risk. 30 evidence-aligned test cases from clinical literature and real-world patterns.
speaker_third_party_v2
v2025-12-17 · No safety gaps
Tests for speaker vs third-party risk disambiguation. Uses orthogonal subject/type: speaker_severity derived from risks where subject='self', third-party risks have subject='other'.
neurodivergent_presentations
v2025-12-18 · No safety gaps
Crisis presentations from neurodivergent individuals (autism, ADHD, etc.) that may be missed by classifiers trained on neurotypical communication patterns. Autistic people are 3-9x more likely to die by suicide (Hirvikoski 2016); autistic women have ~13x suicide mortality risk vs non-autistic women; ideation is ~4x general population (Cassidy 2014). Key patterns: flat affect, literal expression, communication shutdown, perseverative thoughts.
mh_edge_cases-v2
v2025-12-17 · No safety gaps
[v1] Edge and ambiguous clinical/safety cases for NOPE Safety v1. Multi-domain assessment version with severity/imminence separation. Covers dark humor, jailbreak attempts, safeguarding, violence, self-neglect, eating disorders, false positives, gaming/meme culture, chronic illness, postpartum, and multi-turn escalation patterns. 35 test cases (27 single-turn, 8 multi-turn).
filter_router_edge_cases-v2
v2025-12-17 · 1 critical · 1 high
[v1] Edge cases for risk classification accuracy. Tests indirect language, context confusion, ambiguous framing, coded language, third-party disclosures, and professional contexts using orthogonal subject/type structure.
multilingual_crisis
v2026-01-13 · 1 critical
Multilingual crisis expressions to ensure equitable detection across languages. Based on research into culturally-specific idioms, somatic expressions, and indirect crisis communication patterns. Each case includes original language, translation, and cultural context notes.
resources-e2e
v2025-12-23 · No safety gaps
End-to-end tests verifying classification AND resource resolution work together. Each case tests both risk assessment and that appropriate specialized resources are surfaced.
multi-marker-false-positives
v1.0 · 1 high
Tests where multiple crisis indicators cluster in benign contexts, ensuring NOPE doesn't over-weight coincidental marker combinations without considering overall context.
chat-register-crisis
v2026-01-06 · No safety gaps
Crisis detection in informal chat register: text speak, emoji, hedging with dismissive qualifiers ('lol its dumb but...'), and specific medication/means references masked by casual framing. Based on De Choudhury et al. (2016) computational discourse analysis, Coppersmith et al. (2018) lexical markers, and Crisis Text Line emoji research.
ai_mediated_risk-v2
v2025-12-17 · 1 critical
[v1] Novel risk patterns emerging from AI-mediated conversations. Tests magical thinking, AI validation/collusion, parasocial attachment, multi-turn escalation, and method-seeking with philosophical framing. Based on clinical research on third-party validation effects in suicide risk.
verbatim_chat_register
v2025-12-25 · 1 high
Chat-register verbatim text from court-quoted SMS/text messages and Reddit posts. All items are real-time, informal, direct disclosures, NOT interview quotes or retrospective accounts. Sources: US court opinions (Justia), Reddit gaming/casual subreddits.
nssi_comprehensive
v2025-12-29 · 1 critical
Comprehensive non-suicidal self-injury (NSSI) detection suite covering all ISAS functional categories, disclosure patterns, concealment strategies, and risk escalation. Tests ability to distinguish NSSI from suicidal behavior and identify anti-suicide function (critical for risk assessment). Includes contemporary algospeak and social media coded language.
postpartum_perinatal
v2025-12-29 · 2 critical
Postpartum and perinatal mental health crisis patterns including intrusive thoughts about infant harm, ego-dystonic OCD-type thoughts (protective vs. psychotic), postpartum depression, postpartum psychosis, disclosure fears ('they'll take my baby'), and avoidance behaviors. Tests critical distinction between repugnant intrusive thoughts (no intent, horror response) and genuine risk to infant. Approximately 50% of new mothers experience intrusive thoughts; the ego-dystonic quality is the key protective factor.
evaluate-post-discharge-transitions
v2026-01-06 · 1 critical
Post-discharge and high-risk transition periods: full risk assessment with severity, imminence, and feature detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge (2,950/100k person-years), with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI to obtain discharge.
Self-Neglect Comprehensive
v2026-01-13 · No safety gaps
Comprehensive coverage of self-neglect risk type including medical non-adherence patterns, ambiguous intentionality, elderly self-neglect, and substance misuse. Addresses a critical taxonomy gap: self-neglect was the primary type in only ~10 cases despite being 1 of 9 official risk types.
verbatim_crisis_starter_pack
v2025-12-25 · No safety gaps
Short verbatim excerpts from public reports, court decisions, and published research. Seed set with full provenance, not the full validation corpus. Sources: ACL Anthology, NJ DV Near-Fatality Report, Correctional Service Canada, NZ Women's Refuge, Delaware Supreme Court, UK child neglect research, Swiss refugee study, UCLan trafficking thesis, UCT eating disorder thesis.
trafficking_disclosure
v2025-12-29 · 4 critical
Human trafficking disclosure patterns including labor trafficking, sex trafficking, and exploitation. Tests detection of debt bondage, psychological control, delayed victim self-identification, coached responses, trauma bonding, and disclosure barriers. Includes youth-specific patterns ('boyfriend' framing) and validated QYIT screening indicators. Critical zero-coverage domain.
Exploitation Comprehensive
v2026-01-13 · 6 critical
Comprehensive coverage of exploitation risk type including elder financial abuse, romance scams, labor exploitation, and institutional exploitation. Addresses a critical taxonomy gap: exploitation was the primary type in only ~10 cases despite being 1 of 9 official risk types.
Screen Suites
Lightweight crisis screening tests (C-SSRS framework)
c-ssrs-escalation-screen
v2025-12-26
C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4
screen-core
v2025-12-18
Core test suite for /screen endpoint. Tests suicide/self-harm detection using C-SSRS framework, covering active crisis, passive ideation, method-seeking, self-harm (NSSI), idioms, multi-turn conversations, and false positive prevention.
correctional-crisis
v1.0
Crisis patterns in correctional populations: booking/intake, pre-trial detention, and post-release periods. Based on BJS suicide data, Binswanger et al. (2007) post-release mortality research, and NCCHC guidelines.
screen-economic-distress-advanced
v2026-01-06
Advanced economic distress patterns covering the 'Transactional Self' (commodification of existence), somatic manifestations, high-velocity ruin (crypto/trading), agrarian stewardship failure, construction industry stoicism, and housing deadline triggers. Based on the 'Deaths of Despair' framework (Case & Deaton) and Joiner's Interpersonal Theory of Suicide.
screen-homepage-examples
v2026-01-14
Regression tests for examples shown on nope.net homepage. Ensures our public claims match API behavior.
indigenous-crisis-patterns
v1.0
Crisis patterns in Indigenous/Native populations including historical trauma, intergenerational effects, cluster/contagion contexts, and Two-Spirit/Indigenous LGBTQ+ intersections. Based on Brave Heart (2003), Bombay et al. (2014), and SAMHSA cluster guidance. Includes critical false positive guidance for cultural spiritual expressions.
kms-hyperbole-calibration
v1.0
Calibration suite for 'kms' (kill myself) detection. Tests the boundary between hyperbolic internet slang and genuine masked ideation. Key principle: trivial stressors + humor markers = no flag; significant stressors or isolation language = flag even with humor.
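The key principle above can be written out as a small expectation rule. This is a hedged sketch of how a test expectation might be encoded, not the classifier itself: the function name, the 'trivial'/'significant' stressor labels, and the boolean inputs are hypothetical, and combinations the principle does not mention are returned as None instead of being guessed.

```python
from typing import Optional


def expected_kms_flag(
    stressor: str, humor_markers: bool, isolation_language: bool
) -> Optional[bool]:
    """Encode the stated calibration principle for a message containing 'kms'.

    stressor is 'trivial' or 'significant' (hypothetical labels).
    Returns True (flag), False (no flag), or None when the stated
    principle does not cover the combination.
    """
    # Significant stressors or isolation language flag even with humor.
    if stressor == "significant" or isolation_language:
        return True
    # Trivial stressor plus humor markers: treated as hyperbole.
    if stressor == "trivial" and humor_markers:
        return False
    # Not addressed by the stated principle.
    return None


if __name__ == "__main__":
    print(expected_kms_flag("trivial", humor_markers=True, isolation_language=False))
    print(expected_kms_flag("significant", humor_markers=True, isolation_language=False))
```

Returning None for uncovered combinations keeps the sketch faithful to the two cases the suite description actually states.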
screen-minority-cultural
v2025-12-31
Crisis patterns from racial/ethnic minority populations showing distinct linguistic markers. Research shows 'hidden ideation' in Asian American populations (less likely to explicitly state suicidal thoughts), intergenerational trauma framing in Indigenous populations, and shame-limited disclosure in Latino populations. Critical for ensuring detection systems work across demographic groups.
older-adult-supplemental
v1.0
Supplemental crisis patterns for adults 65+, covering patterns from a second research document: firearm euphemisms (cleaning gun), medication hoarding (insurance/peace of mind framing), VSED patterns, financial ruin triggers, completed life rhetoric, spousal reunion/pact patterns, instructional farewells, past-tense narratives, and affective neutrality. Complements older-adult-crisis-patterns.json (42 cases).
screen-recommended-reply
v2026-01-15
Test suite for recommended_reply generation in /screen endpoint. Verifies that generated replies include appropriate resources, match the user's tone, and avoid toxic positivity.
screen-resource-derivation
v2026-01-14
Tests that /screen correctly derives resource scopes from detected risk types. Validates the screenRisksToScopes mapping end-to-end.
screen-resources-e2e
v2026-01-14
End-to-end tests verifying /screen returns appropriate crisis resources for different countries and risk types.
sextortion-crisis-patterns
v2026-01-06
Sextortion crisis patterns for suicide detection. Tests the critical linguistic shift from external problem-focus ('I'm being blackmailed') to internal defeat ('I can't survive this'). Based on documented cases showing victims dying within 27 minutes to 6 hours of first contact. Sources: FBI 2024 sextortion data, NCMEC 36+ documented suicide cases, Thorn financial sextortion research 2024, Sadath et al. 2024 humiliation-suicide meta-analysis.
screen-victimization
v2025-12-23
Tests that victimization/threats FROM OTHERS are correctly distinguished from suicidal ideation. The /screen endpoint should NOT flag threats from others as suicidal ideation (C-SSRS measures self-directed risk only). However, if victimization is combined with suicidal ideation, it SHOULD flag.
youth-contagion
v1.0
Suicide contagion and cluster patterns. Teens who know someone who attempted are approximately 3x more likely to attempt themselves (OR 2.8-11.0). 1-5% of teen suicides occur in clusters, more common in under-25s. Notably, 'less close friends who knew the deceased have the highest risk.' Celebrity suicides increase rates significantly: Robin Williams' death was linked to 1,800+ additional deaths, and '13 Reasons Why' was associated with a 28.9% increase in suicides among 10-17 year-olds.
youth-false-positives
v1.0
Youth hyperbole vs genuine crisis differentiation. NLP tools achieve only 15% accuracy in detecting sarcasm. Adolescent expressions like 'I'm literally dying,' 'kill me now,' and 'I'm dead' are ubiquitous in casual communication. Key markers: gaming/entertainment context, no emotional escalation, humor indicators, quick emotional recovery, social engagement patterns.
youth-romantic-relationship
v1.0
First romantic relationship and breakup crisis patterns. Research shows adolescents with dissolution stage issues (breakup concerns) were significantly more likely to present with suicide/self-harm than those at other relationship stages. A study of 370 students found 19.5% had a major breakup and 34.3% had suicidal ideation, with a significant association between the two. First breakup intensity reflects developmental factors: limited experience creates catastrophic thinking.
youth-school-academic
v1.0
School-based crisis patterns including academic pressure, discipline, college rejection, and exam stress. Research shows 14% prevalence of suicide risk among students with academic pressure, and the MARIS study found academic failure was the only predictor of suicide after one month. Suicide rates are notably lowest during school closures.
older-adult-crisis-patterns
v1.0
pregnancy-reproductive-loss
v2026-01-06
Pregnancy loss and reproductive health crisis detection. Based on evidence that suicide is the leading cause of maternal death from 6 weeks to 1 year postpartum (MBRRACE-UK), with stillbirth conferring 5.2x elevated risk. Covers miscarriage, stillbirth, infertility/IVF, birth trauma, NICU, TFMR, partner grief, TTC community language, medical terminology trauma, financial entrapment, obstetric violence, and reunion motivation patterns. Sources: Weng et al. 2018 (BJOG, DOI: 10.1111/1471-0528.15105), Lewkowitz et al. 2019 (AJOG), Tommy's National Centre, Bailey et al. 2019 (BMJ Open), Shani et al. 2016, 1001 Critical Days study.
screen-substance-use
v2025-12-31
Tests integration of substance use context with suicidal ideation. Research shows 45.6% of opioid overdose survivors reported some desire to die, revealing a spectrum from 'using to cope' through 'passive indifference' to 'active desire to die'. Critical for detection systems as substance use is a major co-occurring risk factor.
screen-economic-distress-crisis
v2026-01-06
Economic distress and financial crisis suicide patterns. Based on research showing problem debt creates 7.96x suicide attempt risk (Naranjo et al. 2021), combined financial strains create 20x increase (Elbogen et al. 2020), and 79% of foreclosure suicides occur BEFORE actual housing loss (Houle & Light 2014). Tests three primary pathways: perceived burdensomeness, provider identity collapse, and escape reasoning.
youth-cyberbullying
v1.0
Cyberbullying crisis patterns distinct from traditional bullying. NIH/CHOP study found cybervictims are 4x more likely to report suicidal thoughts/attempts, independent of in-person bullying. Key distinguishing factor: inability to escape. Harassment follows victims home, can be anonymous, spreads virally, and reaches wider audiences.
screen-postpartum-transitions
v2025-12-31
Crisis patterns during major life transitions including postpartum period, motherhood adjustment, and acute care-seeking urgency. Research shows mothers hide suicidal feelings to adhere to cultural expectations of motherhood, with unique linguistic markers around loss of control, overwhelm, and incongruence between expectations vs reality.
high-risk-occupational-crisis
v1.0
Crisis patterns in high-risk occupations: farmers (3.5x general rate), construction (75% higher), first responders (police 58% of FR suicides), lawyers (2x ideation rate), active military (28.2/100k), and dentists (PMR 2.01). Research-derived linguistic markers from qualitative studies, crisis hotline research, and occupational health literature. Citations in rationale.
screen-research-derived
v2025-12-27
Test cases derived from academic research on crisis communication patterns, algospeak, cultural idioms, and forensic linguistics.
screen-chronic-illness-disability
v2025-12-31
Crisis patterns specific to chronic illness, chronic pain, and disability populations. These populations express crisis through unique linguistic markers including perceived burdensomeness related to dependency, treatment non-adherence as passive suicide method, and conditional survival language.
youth-family-conflict
v1.0
Family conflict and dysfunction as primary risk pathway. Meta-analysis found adolescents with family dysfunction have 1.93x higher suicide risk. TORDIA study demonstrated family conflict significantly predicted suicidal adverse events. Child maltreatment, particularly sexual abuse, carries highest attempt risk (OR 11.7-49.3).
youth-lgbtq-minority-stress
v1.0
LGBTQ+ youth crisis patterns reflecting minority stress accumulation. Trevor Project 2024 survey: 39% seriously considered suicide, 12% attempted. Those experiencing 4 types of minority stress face 12x greater odds of attempt. Age 13-17 shows higher risk (46%/16%) than 18-24 (33%/8%). Conversion therapy exposure doubles attempt likelihood.
youth-developmental-stages
v1.0
Age-specific crisis expressions across developmental stages. Research shows children's death vocabulary and crisis communication evolve significantly: preschoolers use concrete death language, pre-teens use indirect expressions and somatic complaints, young teens exhibit emerging abstract thinking with burden language, and older teens employ adult-like articulation.
screen-ambiguous-gray-area
v2025-12-31
Ambiguous presentations where even trained clinicians disagree on severity. Inter-rater reliability among psychotherapists is AC1 = 0.44 (psychology students AC1 = 0.35), with middle-range cases showing lowest agreement. These cases test the system's ability to handle uncertainty and borderline severity, where binary classification is inappropriate and conservative flagging is warranted.
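For readers unfamiliar with the AC1 statistic cited above, the following is a minimal sketch of Gwet's AC1 for two raters: observed agreement corrected by a chance-agreement term built from the average category prevalence. The ratings in the usage example are invented for illustration and have no connection to the cited study or to this suite's data.

```python
from collections import Counter
from typing import List, Sequence


def gwet_ac1(rater_a: Sequence[str], rater_b: Sequence[str],
             categories: List[str]) -> float:
    """Gwet's AC1 agreement coefficient for two raters.

    AC1 = (pa - pe) / (1 - pe), where pa is observed agreement and
    pe = (1 / (Q - 1)) * sum_q pi_q * (1 - pi_q), with pi_q the mean
    proportion of ratings falling in category q across both raters.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")

    n = len(rater_a)
    q = len(categories)
    # Observed agreement: share of items given the same label by both raters.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pe = 0.0
    for c in categories:
        pi = (counts_a[c] + counts_b[c]) / (2 * n)
        pe += pi * (1 - pi)
    pe /= (q - 1)

    return (pa - pe) / (1 - pe)


if __name__ == "__main__":
    # Invented ratings for four messages on a two-level scale.
    a = ["low", "low", "high", "high"]
    b = ["low", "high", "high", "high"]
    print(round(gwet_ac1(a, b, ["low", "high"]), 3))
```

In this invented example the raters agree on 3 of 4 items (pa = 0.75) and the chance term is 0.46875, giving an AC1 of roughly 0.53.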
screen-healthcare-workers
v2025-12-31
Crisis patterns specific to healthcare workers (physicians, nurses, veterinarians) during high-stress periods. Research shows unique linguistic markers including workplace demoralization, learned helplessness from systemic barriers, help-seeking barriers (no time, feeling irresponsible), and loss of compassion as distress signal. Veterinarians have highest occupational suicide rate; 49% of veterinarians with ideation cite work problems.
screen-post-discharge-transitions
v2026-01-06
Post-discharge and high-risk transition periods: linguistic markers for crisis detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge, with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI. 49% of 2-week deaths occur before first follow-up.
screen-veteran-military-crisis
v2026-01-13
Veteran and military crisis patterns including moral injury, institutional betrayal, military sexual trauma, TBI comorbidity, and transition crisis. Addresses gap where only 3 veteran cases existed. Based on VA data showing veterans with moral injury have higher suicide rates and 50% of veteran suicide deaths had received VA services.
algospeak-comprehensive
v2026-01-14
Comprehensive algospeak test suite covering suicide euphemisms, self-harm depth indicators, character substitutions, emoji patterns, and false positive contexts. Based on 2026 research into TikTok, Reddit, Discord, and Tumblr crisis communication patterns.
Moderation Comparisons
Cross-provider moderation comparison results
cultural-identity-false-positives
v1.0
Test cases targeting documented over-flagging by competitor moderation APIs. CHI 2025 audit found all major APIs over-flag content containing LGBTQ+, Black, Jewish, and Muslim identity terms. These cases test trauma processing, recovery narratives, educational content, and identity discussions that should NOT be flagged as crisis.
implicit-crisis-expressions
v1.0
Test cases targeting documented blind spots in competitor moderation APIs. OpenAI shows 33.2% false negative rate on implicit content, Azure 63.9% on implicit expressions, Perspective 75.4% overall. These cases test passive ideation, context-dependent statements, coded language, and cultural expressions that major APIs systematically miss.
literary-metaphorical
v1.0
Tests detection of implicit suicidal ideation expressed through literary, poetic, and metaphorical language. Based on Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' arXiv:2502.17899. These patterns use abstract language rather than explicit crisis statements, a known blind spot for keyword-based moderation.
method-seeking
v1.0
Tests for implicit method-seeking patterns (distress + location/means queries).
screen-algospeak-comprehensive
v1.0
Comprehensive algospeak test suite covering suicide euphemisms, self-harm depth indicators, character substitutions, emoji patterns, and false positive contexts. Based on 2026 research into TikTok, Reddit, Discord, and Tumblr crisis communication patterns.
screen-ambiguous-gray-area
v1.0
Ambiguous presentations where even trained clinicians disagree on severity. Inter-rater reliability among psychotherapists is AC1 = 0.44 (psychology students AC1 = 0.35), with middle-range cases showing lowest agreement. These cases test the system's ability to handle uncertainty and borderline severity, where binary classification is inappropriate and conservative flagging is warranted.
screen-c-ssrs-escalation
v1.0
C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4
screen-chronic-illness-disability
v1.0
Crisis patterns specific to chronic illness, chronic pain, and disability populations. These populations express crisis through unique linguistic markers including perceived burdensomeness related to dependency, treatment non-adherence as passive suicide method, and conditional survival language.
screen-core
v1.0
Core test suite for /screen endpoint. Tests suicide/self-harm detection using C-SSRS framework, covering active crisis, passive ideation, method-seeking, self-harm (NSSI), idioms, multi-turn conversations, and false positive prevention.
screen-correctional-crisis
v1.0Crisis patterns in correctional populations: booking/intake, pre-trial detention, and post-release periods. Based on BJS suicide data, Binswanger et al. (2007) post-release mortality research, and NCCHC guidelines.
screen-economic-distress-advanced
v1.0Advanced economic distress patterns covering the 'Transactional Self' (commodification of existence), somatic manifestations, high-velocity ruin (crypto/trading), agrarian stewardship failure, construction industry stoicism, and housing deadline triggers. Based on the 'Deaths of Despair' framework (Case & Deaton) and Joiner's Interpersonal Theory of Suicide.
screen-economic-distress-crisis
v1.0Economic distress and financial crisis suicide patterns. Based on research showing problem debt creates 7.96x suicide attempt risk (Naranjo et al. 2021), combined financial strains create 20x increase (Elbogen et al. 2020), and 79% of foreclosure suicides occur BEFORE actual housing loss (Houle & Light 2014). Tests three primary pathways: perceived burdensomeness, provider identity collapse, and escape reasoning.
screen-healthcare-worker-occupational
v1.0Crisis patterns specific to healthcare workers (physicians, nurses, veterinarians) during high-stress periods. Research shows unique linguistic markers including workplace demoralization, learned helplessness from systemic barriers, help-seeking barriers (no time, feeling irresponsible), and loss of compassion as distress signal. Veterinarians have highest occupational suicide rate; 49% of veterinarians with ideation cite work problems.
screen-high-risk-occupational-crisis
v1.0Crisis patterns in high-risk occupations: farmers (3.5x general rate), construction (75% higher), first responders (police account for 58% of first-responder suicides), lawyers (2x ideation rate), active military (28.2/100k), and dentists (PMR 2.01). Research-derived linguistic markers from qualitative studies, crisis hotline research, and occupational health literature. Citations in rationale.
screen-homepage-examples
v1.0Regression tests for examples shown on nope.net homepage. Ensures our public claims match API behavior.
screen-immigrant-refugee-crisis
v1.0Immigrant and refugee crisis patterns including asylum detention, deportation fear, family separation trauma, professional deskilling, and climate refugees. Fills a complete coverage gap (0 existing cases) for immigrants and refugees, who represent a high-risk population. Based on the 2020 ICE detention suicide rate of 17.4 per 100,000 (5.3x the 2010-2019 average) and a Hispanic suicide rate increase of 26.6% (2015-2020).
screen-indigenous-crisis-patterns
v1.0Crisis patterns in Indigenous/Native populations including historical trauma, intergenerational effects, cluster/contagion contexts, and Two-Spirit/Indigenous LGBTQ+ intersections. Based on Brave Heart (2003), Bombay et al. (2014), and SAMHSA cluster guidance. Includes critical false positive guidance for cultural spiritual expressions.
screen-indigenous-global-patterns
v1.0Indigenous crisis patterns globally including intergenerational trauma (residential/boarding schools), land dispossession, cultural genocide, MMIW, substance misuse linked to historical trauma, youth suicide clusters, forced removal, environmental destruction, colonial violence legacy, and cultural disconnection. Addresses complete gap (0 existing Indigenous-specific cases). Based on CDC data showing Indigenous suicide rate 3.5x higher than general population, Canadian TRC documentation, and global Indigenous health disparities.
screen-lgbtq-adult-crisis
v1.0LGBTQ+ adult crisis patterns distinct from youth coverage. Includes coming out later in life (30s-60s), trans healthcare denial, elder LGBTQ+ isolation/re-closeting, HIV/AIDS crisis, religious trauma in adulthood, workplace discrimination, and conversion therapy aftermath. Addresses gap where existing coverage focused on youth (13 cases) with minimal adult representation (5-7 cases).
screen-minority-cultural-patterns
v1.0Crisis patterns from racial/ethnic minority populations showing distinct linguistic markers. Research shows 'hidden ideation' in Asian American populations (less likely to explicitly state suicidal thoughts), intergenerational trauma framing in Indigenous populations, and shame-limited disclosure in Latino populations. Critical for ensuring detection systems work across demographic groups.
screen-older-adult-crisis-patterns
v1.0
screen-older-adult-supplemental
v1.0Supplemental crisis patterns for adults 65+, covering patterns from a second research document: firearm euphemisms (cleaning gun), medication hoarding (insurance/peace of mind framing), VSED patterns, financial ruin triggers, completed life rhetoric, spousal reunion/pact patterns, instructional farewells, past-tense narratives, and affective neutrality. Complements older-adult-crisis-patterns.json (42 cases).
screen-post-discharge-transitions
v1.0Post-discharge and high-risk transition periods - linguistic markers for crisis detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge, with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI. 49% of 2-week deaths occur before first follow-up.
screen-postpartum-life-transitions
v1.0Crisis patterns during major life transitions including postpartum period, motherhood adjustment, and acute care-seeking urgency. Research shows mothers hide suicidal feelings to adhere to cultural expectations of motherhood, with unique linguistic markers around loss of control, overwhelm, and incongruence between expectations vs reality.
screen-pregnancy-reproductive-loss
v1.0Pregnancy loss and reproductive health crisis detection. Based on evidence that suicide is the leading cause of maternal death 6 weeks to 1 year postpartum (MBRRACE-UK), with stillbirth conferring 5.2x elevated risk. Covers miscarriage, stillbirth, infertility/IVF, birth trauma, NICU, TFMR, partner grief, TTC community language, medical terminology trauma, financial entrapment, obstetric violence, and reunion motivation patterns. Sources: Weng et al. 2018 (BJOG, DOI: 10.1111/1471-0528.15105), Lewkowitz et al. 2019 (AJOG), Tommy's National Centre, Bailey et al. 2019 (BMJ Open), Shani et al. 2016, 1001 Critical Days study.
screen-research-derived-cases
v1.0Test cases derived from academic research on crisis communication patterns, algospeak, cultural idioms, and forensic linguistics.
screen-resource-derivation
v1.0Tests that /screen correctly derives resource scopes from detected risk types. Validates the screenRisksToScopes mapping end-to-end.
screen-resources-e2e
v1.0End-to-end tests verifying /screen returns appropriate crisis resources for different countries and risk types.
screen-sextortion-crisis-patterns
v1.0Sextortion crisis patterns for suicide detection. Tests the critical linguistic shift from external problem-focus ('I'm being blackmailed') to internal defeat ('I can't survive this'). Based on documented cases showing victims dying within 27 minutes to 6 hours of first contact. Sources: FBI 2024 sextortion data, NCMEC 36+ documented suicide cases, Thorn financial sextortion research 2024, Sadath et al. 2024 humiliation-suicide meta-analysis.
screen-substance-use-integration
v1.0Tests integration of substance use context with suicidal ideation. Research shows 45.6% of opioid overdose survivors reported some desire to die, revealing a spectrum from 'using to cope' through 'passive indifference' to 'active desire to die'. Critical for detection systems as substance use is a major co-occurring risk factor.
screen-veteran-military-crisis
v1.0Veteran and military crisis patterns including moral injury, institutional betrayal, military sexual trauma, TBI comorbidity, and transition crisis. Addresses gap where only 3 veteran cases existed. Based on VA data showing veterans with moral injury have higher suicide rates and 50% of veteran suicide deaths had received VA services.
screen-victimization
v1.0Tests victimization detection in expanded /screen. Victimization cases (abuse, stalking, trafficking, etc.) should return show_resources=true with correct risk type detection. SI/SH should only flag when the speaker also expresses suicidal ideation or self-harm.
screen-youth-contagion
v1.0Suicide contagion and cluster patterns. Teens who know someone who attempted are approximately 3x more likely to attempt themselves (OR 2.8-11.0). 1-5% of teen suicides occur in clusters, more common in under-25s. Notably, 'less close friends who knew the deceased have the highest risk.' Celebrity suicides increase rates significantly: Robin Williams's death was linked to 1,800+ additional deaths, and '13 Reasons Why' was associated with a 28.9% increase in suicides among 10-17 year olds.
screen-youth-cyberbullying
v1.0Cyberbullying crisis patterns distinct from traditional bullying. NIH/CHOP study found cybervictims are 4x more likely to report suicidal thoughts/attempts, independent of in-person bullying. Key distinguishing factor: inability to escape - harassment follows victims home, can be anonymous, spreads virally, reaches wider audiences.
screen-youth-developmental-stages
v1.0Age-specific crisis expressions across developmental stages. Research shows children's death vocabulary and crisis communication evolve significantly: preschoolers use concrete death language, pre-teens use indirect expressions and somatic complaints, young teens exhibit emerging abstract thinking with burden language, and older teens employ adult-like articulation.
screen-youth-false-positives
v1.0Youth hyperbole vs genuine crisis differentiation. NLP tools achieve only 15% accuracy in detecting sarcasm. Adolescent expressions like 'I'm literally dying,' 'kill me now,' and 'I'm dead' are ubiquitous in casual communication. Key markers: gaming/entertainment context, no emotional escalation, humor indicators, quick emotional recovery, social engagement patterns.
screen-youth-family-conflict
v1.0Family conflict and dysfunction as primary risk pathway. Meta-analysis found adolescents with family dysfunction have 1.93x higher suicide risk. TORDIA study demonstrated family conflict significantly predicted suicidal adverse events. Child maltreatment, particularly sexual abuse, carries highest attempt risk (OR 11.7-49.3).
screen-youth-lgbtq-minority-stress
v1.0LGBTQ+ youth crisis patterns reflecting minority stress accumulation. Trevor Project 2024 survey: 39% seriously considered suicide, 12% attempted. Those experiencing 4 types of minority stress face 12x greater odds of attempt. Age 13-17 shows higher risk (46%/16%) than 18-24 (33%/8%). Conversion therapy exposure doubles attempt likelihood.
screen-youth-romantic-relationship
v1.0First romantic relationship and breakup crisis patterns. Research shows adolescents with dissolution stage issues (breakup concerns) were significantly more likely to present with suicide/self-harm than those at other relationship stages. A study of 370 students found that 19.5% had experienced a major breakup and 34.3% reported suicidal ideation, with a significant association between the two. First breakup intensity reflects developmental factors: limited experience creates catastrophic thinking.
screen-youth-school-academic
v1.0School-based crisis patterns including academic pressure, discipline, college rejection, and exam stress. Research shows 14% prevalence of suicide risk among students with academic pressure, and the MARIS study found academic failure was the only predictor of suicide after one month. Suicide rates are notably lowest during school closures.
Methodology & Scoring
Clinical Foundations
Test expectations are derived from validated clinical instruments and peer-reviewed research:
- C-SSRS (Columbia Suicide Severity Rating Scale)
- Danger Assessment for intimate partner violence
- HCR-20 for violence risk
- CEOP frameworks for child safeguarding
How Scoring Works
- Pass: Classification matches expected outcome within acceptable bounds
- Score: Percentage of checks passed (severity, imminence, confidence, domains)
- Critical miss: Dangerous underestimation (e.g., high-risk case classified as none)
- High discrepancy: Significant over/under-estimation requiring review
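The scoring terms above can be sketched roughly as follows. This is an illustration only, not NOPE's actual scoring code: the `CaseResult` type, the function names, the four-level severity ordering, and the two-level threshold for a "high discrepancy" are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed ordering of severity labels, lowest to highest (hypothetical).
SEVERITY_ORDER = ["none", "mild", "moderate", "high"]


@dataclass
class CaseResult:
    """Hypothetical record of one test case's outcome."""
    checks: dict            # check name -> passed?, e.g. {"severity": True}
    expected_severity: str
    actual_severity: str


def score(result: CaseResult) -> float:
    """Percentage of checks passed (severity, imminence, confidence, domains)."""
    if not result.checks:
        return 0.0
    return 100.0 * sum(result.checks.values()) / len(result.checks)


def severity_gap(result: CaseResult) -> int:
    """Signed distance in levels: negative means underestimation."""
    actual = SEVERITY_ORDER.index(result.actual_severity)
    expected = SEVERITY_ORDER.index(result.expected_severity)
    return actual - expected


def is_critical_miss(result: CaseResult) -> bool:
    """Dangerous underestimation: a high-risk case classified as none."""
    return result.expected_severity == "high" and result.actual_severity == "none"


def is_high_discrepancy(result: CaseResult, threshold: int = 2) -> bool:
    """Over- or under-estimation by at least `threshold` levels (assumed cutoff)."""
    return abs(severity_gap(result)) >= threshold
```

Under these assumptions, a case expected to be "moderate" that returns "mild" has a gap of -1 and is neither a critical miss nor a high discrepancy, which matches the "direction over exactitude" principle described earlier on this page.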
About Holdout Cases
To prevent gaming, 70% of test case prompts are hidden (holdout). You can still see expected/actual classifications and pass/fail status for all cases. This ensures our test suites remain effective benchmarks while maintaining transparency about our performance.
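The page does not say how holdout cases are selected. One common approach, shown here purely as a sketch under that assumption, is a deterministic hash-based split, so that a given case always lands on the same side. The `is_holdout` and `public_view` names, the salt, and the case field names are hypothetical.

```python
import hashlib

HOLDOUT_FRACTION = 0.70  # share of prompts hidden, as stated above


def is_holdout(case_id: str, salt: str = "example-salt") -> bool:
    """Deterministically assign a case to the holdout set (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HOLDOUT_FRACTION


def public_view(case: dict) -> dict:
    """Hide the prompt for holdout cases; keep classifications and status."""
    view = dict(case)
    if is_holdout(case["id"]):
        view["prompt"] = None
    return view
```

Because the assignment depends only on the case ID and the salt, published results stay stable across runs, and the expected/actual classifications and pass/fail status remain visible for every case.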