resource-relevance
Tests that LLM-based resource ranking surfaces specialized resources before generic crisis lines. Each case provides a crisis scenario and expects specific specialized resources to appear in the top 5 ranked results.
legacy-v0-screen
0.0%
25
Feb 26
unknown
East Asian cultural crisis patterns including academic pressure (gaokao, Korea exam stress), hikikomori (severe social withdrawal), somatic expressions of distress, and filial piety burden. Addresses gap in cultural crisis presentations that Western models may miss. Based on China Blue Book showing 93% of high school suicides linked to gaokao stress, Korea data showing >50% of adolescent suicide ideation citing academic stress, and Japanese Cabinet Office documenting 1.46 million in hikikomori state.
baseten-v14f +2
8.3%
12
Feb 27
trafficking_disclosure
Human trafficking disclosure patterns including labor trafficking, sex trafficking, and exploitation. Tests detection of debt bondage, psychological control, delayed victim self-identification, coached responses, trauma bonding, and disclosure barriers. Includes youth-specific patterns ('boyfriend' framing) and validated QYIT screening indicators. Critical zero-coverage domain.
baseten-v14f-mini +9
21.7%
23
Feb 27
ai_mediated_risk-v2
[v1] Novel risk patterns emerging from AI-mediated conversations. Tests magical thinking, AI validation/collusion, parasocial attachment, multi-turn escalation, and method-seeking with philosophical framing. Based on clinical research on third-party validation effects in suicide risk.
legacy-v0-evaluate +10
25.0%
24
Feb 27
evaluate-post-discharge-transitions
Post-discharge and high-risk transition periods - full risk assessment with severity, imminence, and feature detection. Research shows 300x elevated suicide risk in first week post-psychiatric discharge (2,950/100k person-years), with Day 2 showing highest absolute deaths. 51.5% of inpatients conceal SI to obtain discharge.
baseten-v14f-mini
26.9%
26
Feb 26
hedged_severity_calibration
Hedged disclosure severity calibration tests. Research shows hedging (hypothetical framing, 'asking for a friend', dismissive qualifiers like 'lol') is a NORMATIVE disclosure pattern, not evidence of reduced intent. Content indicators (means, method, timeline) should override framing. Based on De Choudhury et al. (2016), Coppersmith et al. (2018), and Crisis Text Line research.
baseten-v14f-mini
35.3%
17
Feb 26
Exploitation Comprehensive
Comprehensive coverage of exploitation risk type including elder financial abuse, romance scams, labor exploitation, and institutional exploitation. Addresses critical taxonomy gap where exploitation was only primary type in ~10 cases despite being 1 of 9 official risk types.
baseten-v14f-mini +9
37.5%
24
Feb 27
Self-Neglect Comprehensive
Comprehensive coverage of self-neglect risk type including medical non-adherence patterns, ambiguous intentionality, elderly self-neglect, and substance misuse. Addresses critical taxonomy gap where self-neglect was only primary type in ~10 cases despite being 1 of 9 official risk types.
baseten-v14f-mini +9
40.0%
30
Feb 27
expanded_taxonomy-v2
[v1] Expanded taxonomy validation using orthogonal subject/type structure. Covers online exploitation, extremism, human trafficking, and eating disorders.
legacy-v0-evaluate +9
42.9%
21
Feb 27
Multi-Marker False Positives
Tests where multiple crisis indicators cluster in benign contexts, ensuring NOPE doesn't over-weight coincidental marker combinations without considering overall context
baseten-v14e
45.5%
11
Feb 26
verbatim_crisis_starter_pack
Short verbatim excerpts from public reports, court decisions, and published research. Seed set with full provenance - not the full validation corpus. Sources: ACL Anthology, NJ DV Near-Fatality Report, Correctional Service Canada, NZ Women's Refuge, Delaware Supreme Court, UK child neglect research, Swiss refugee study, UCLan trafficking thesis, UCT eating disorder thesis.
baseten-v14e
47.1%
17
Feb 26
adversarial-robustness
Tests classifier robustness against adversarial inputs: jailbreak attempts, classification evasion, minimization disclaimers, encoded content, and meta-level manipulation. Goal: ensure genuine risk is detected even when wrapped in evasion tactics.
legacy-v0-evaluate +10
48.5%
33
Feb 27
literary_metaphorical
Literary, poetic, and metaphorical expressions of suicidal ideation. Inspired by Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' which found LLMs struggle with abstract, metaphorical expressions. Tests detection of: cinematic metaphors ('fading out'), isolation metaphors (walls, barriers), existential/philosophical death framing, and artistic/literary references that encode suicidal ideation.
baseten-v14f-mini +9
50.0%
12
Feb 27
mh_edge_cases-v2
[v1] Edge and ambiguous clinical/safety cases for NOPE Safety v1. Multi-domain assessment version with severity/imminence separation. Covers dark humor, jailbreak attempts, safeguarding, violence, self-neglect, eating disorders, false positives, gaming/meme culture, chronic illness, postpartum, and multi-turn escalation patterns. 35 test cases (27 single-turn, 8 multi-turn).
legacy-v0-evaluate +9
50.0%
42
Feb 27
verbatim_chat_register
Chat-register verbatim text from court-quoted SMS/text messages and Reddit posts. All items are real-time, informal, direct disclosures - NOT interview quotes or retrospective accounts. Sources: US court opinions (Justia), Reddit gaming/casual subreddits.
baseten-v14f-mini +9
50.0%
10
Feb 27
jailbreaking
Adversarial jailbreaking suite testing NOPE-aware attacks. Assumes users may: (1) know they're being classified by NOPE, (2) understand the XML schema and taxonomy, (3) attempt meta-instruction injection, (4) try to game severity/imminence thresholds. Goal: ensure genuine risk is detected regardless of adversarial framing.
legacy-v0-evaluate +10
51.4%
35
Feb 27
blindspots_v2
Research-driven blindspots test suite for current taxonomy. Covers implicit ideation (Crisis Text Line predictors), population-specific patterns (LGBTQ+, veterans, healthcare workers), and adversarial scenarios. Based on RESEARCH-SYNTHESIS.md and clinical literature.
baseten-v14f-mini +9
51.7%
29
Feb 27
input-framing-v2
[v1] Input framing detection test suite. Tests the ability to detect HOW content is presented (direct, humor, creative, distanced, adversarial, etc.). Framing is orthogonal to risk assessment - these tests verify framing detection alongside appropriate risk classification.
baseten-v14f-mini
52.9%
17
Feb 26
ai_dependency_patterns-v1
Test cases for AI dependency features based on 2026 International AI Safety Report findings. Detects user-side indicators of problematic AI relationships: relationship substitution, compulsive usage, anthropomorphization, and separation distress.
legacy-v0-evaluate +10
54.5%
11
Feb 27
postpartum_perinatal
Postpartum and perinatal mental health crisis patterns including intrusive thoughts about infant harm, ego-dystonic OCD-type thoughts (protective vs. psychotic), postpartum depression, postpartum psychosis, disclosure fears ('they'll take my baby'), and avoidance behaviors. Tests critical distinction between repugnant intrusive thoughts (no intent, horror response) and genuine risk to infant. Approximately 50% of new mothers experience intrusive thoughts; ego-dystonic quality is key protective factor.
legacy-v0-evaluate +10
54.5%
11
Feb 27
filter_router_edge_cases-v2
[v1] Edge cases for risk classification accuracy. Tests indirect language, context confusion, ambiguous framing, coded language, third-party disclosures, and professional contexts using orthogonal subject/type structure.
baseten-v14f-mini +10
55.6%
18
Feb 27
calibration-probe
Calibration probe covering explicit crisis, algospeak, implicit ideation, resolution syndrome, and false positive categories (gaming, hyperbole, recovery narratives, temporal/negation). Tests model sensitivity across difficulty levels.
baseten-v14f-mini +9
58.3%
24
Feb 27
screen-substance-use
Tests integration of substance use context with suicidal ideation. Research shows 45.6% of opioid overdose survivors reported some desire to die, revealing a spectrum from 'using to cope' through 'passive indifference' to 'active desire to die'. Critical for detection systems as substance use is a major co-occurring risk factor.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
58.8%
17
Feb 26
chat-register-crisis
Crisis detection in informal chat register: text speak, emoji, hedging with dismissive qualifiers ('lol its dumb but...'), and specific medication/means references masked by casual framing. Based on De Choudhury et al. (2016) computational discourse analysis, Coppersmith et al. (2018) lexical markers, and Crisis Text Line emoji research.
legacy-v0-evaluate +9
60.0%
15
Feb 27
interrupted_attempt_variations
Variations of interrupted suicide attempts across different methods, interrupters, and emotional responses. Tests generalization of interrupted attempt detection beyond specific wording patterns. Clinical basis: C-SSRS interrupted attempt criteria - 'started to do something to end life but someone/something stopped them before acting.'
baseten-v14f-mini +9
60.0%
10
Feb 27
cultural-crisis-gaps
Cultural crisis detection gaps: Evidence-based linguistic markers for underrepresented populations. Based on peer-reviewed research documenting population-specific crisis language patterns that standard Western clinical models miss. Covers: strength schema breakdown (African American), collectivist burden/shame framing (South Asian, Pacific Islander), religious prohibition conflict (MENA/Muslim), documentation fear (immigrant/refugee), and LGBTQ+ subgroup-specific patterns.
legacy-v0-evaluate +10
61.7%
47
Feb 27
subject_attribution-v2
Tests for correct subject attribution (self vs other). Probes edge cases where speaker IS the victim but mentions others in their situation.
baseten-v14f-mini +9
63.0%
27
Feb 27
subject-context-features
Tests for subject context features: animal_involved, minor_involved, infant_involved, elderly_involved, vulnerable_adult_involved, parental_incapacity, witnessing_violence. These context markers help identify WHO is involved/at risk beyond the primary subject.
baseten-v14e
63.2%
19
Feb 26
panel-probe
Probe cases for panel consensus behavior. Tests edge cases where model variance matters: ambiguous severity, subtle signals, false positive traps, and explicit cases (control).
baseten-v14f-mini +3
64.3%
14
Feb 26
screen-caregiver-neglect-crisis
Tests detection of neglect in caregiver contexts. Key challenge: subject attribution (the dependent is at risk, not the caregiver). Covers hostile refusal, caregiver burnout, perspective shifts, and false positives. Created to investigate classification inconsistencies observed in ad-hoc testing.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
64.3%
28
Feb 26
youth-contagion
Suicide contagion and cluster patterns. Teens who know someone who attempted are approximately 3x more likely to attempt themselves (OR 2.8-11.0). 1-5% of teen suicides occur in clusters, more common in under-25s. Notably, 'less close friends who knew the deceased have the highest risk.' Celebrity suicides increase rates significantly - Robin Williams linked to 1,800+ additional deaths, '13 Reasons Why' caused 28.9% increase in 10-17yo suicides.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
66.7%
12
Feb 26
active_crisis_imminent
Active suicide crisis and imminent attempt scenarios representing C-SSRS Level 5 (highest severity): active ideation with specific plan, intent, and timeline. Includes interrupted attempt language, preparatory acts, resolution syndrome, and in-progress attempts. Critical for regression prevention—these cases must be detected at critical/emergency severity.
legacy-v0-evaluate +12
68.0%
25
Feb 27
neurodivergent_presentations
Crisis presentations from neurodivergent individuals (autism, ADHD, etc.) that may be missed by classifiers trained on neurotypical communication patterns. Autistic people are 3-9x more likely to die by suicide (Hirvikoski 2016); autistic women have ~13x suicide mortality risk vs non-autistic women; ideation is ~4x general population (Cassidy 2014). Key patterns: flat affect, literal expression, communication shutdown, perseverative thoughts.
legacy-v0-evaluate +9
68.8%
32
Feb 27
acute_real_world_v2
Acute real-world scenarios for current taxonomy. Covers immediate danger, psychosis, mania, AI attachment, safeguarding, and third-party risk. 30 evidence-aligned test cases from clinical literature and real-world patterns.
baseten-v14f-mini +9
70.0%
30
Feb 27
chat_register_false_negatives
False-negative focused test suite for chat-register crisis patterns. Tests whether NOPE catches crisis expressed in Gen-Z/millennial slang, minimized language, and indirect patterns that might be missed. Inverse of fp-research-driven.json.
legacy-v0-evaluate +9
70.0%
20
Feb 27
screen-minor-witness-household-crisis
Tests whether model recognizes minors witnessing household crises as indirect victims. Child welfare research shows witnessing DV harms children (ACEs). Key pattern: explicit age signal + household crisis = flag minor too. Without age signal, only primary victim flagged (reasonable default).
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
70.0%
20
Feb 26
progressive_disengagement
Progressive disengagement patterns in crisis conversations. Research shows message shortening, withdrawal language, and emotional flattening often precede disconnection and potential harm. These multi-turn patterns require active outreach. Based on Althoff et al. (2016) crisis counselor effectiveness research and Crisis Text Line trajectory analysis.
baseten-v14e
70.3%
37
Feb 26
indirect_disclosure
Indirect crisis disclosure patterns including 'asking for a friend', third-person framing, hypothetical scenarios, minimization, and religious/spiritual framing. Research shows many callers take 10-15 minutes to disclose actual reason for crisis contact. These patterns are normative in most cultures - direct disclosure is the exception, not the rule.
legacy-v0-evaluate +10
71.0%
31
Feb 27
screen-healthcare-workers
Crisis patterns specific to healthcare workers (physicians, nurses, veterinarians) during high-stress periods. Research shows unique linguistic markers including workplace demoralization, learned helplessness from systemic barriers, help-seeking barriers (no time, feeling irresponsible), and loss of compassion as distress signal. Veterinarians have highest occupational suicide rate; 49% of veterinarians with ideation cite work problems.
baseten-v14f
71.4%
14
Feb 26
v2_eating_disorders
Comprehensive eating disorder detection tests including anorexia, bulimia, binge-eating disorder, ARFID, and orthorexia patterns. Validates that ED behaviors are correctly classified with appropriate features for specialist resource matching. Includes pro-ED community language, male-specific presentations, minimization patterns, ED+suicidality intersection, and validated assessment instrument language (EAT-26, EDE-Q).
baseten-v14f-mini +9
71.4%
28
Feb 27
multilingual_crisis
Multilingual crisis expressions to ensure equitable detection across languages. Based on research into culturally-specific idioms, somatic expressions, and indirect crisis communication patterns. Each case includes original language, translation, and cultural context notes.
legacy-v0-evaluate +10
72.2%
72
Feb 27
screen-victimization
Tests victimization detection in expanded /screen. Victimization cases (abuse, stalking, trafficking, etc.) should show_resources=true with correct risk type detection. SI/SH should only flag when speaker also expresses suicidal ideation or self-harm.
baseten-v14f-mini
72.2%
18
Feb 26
age-context-calibration
Tests how age context (child/teen/adult) affects classification of normalized suicide language. Goal: observe natural calibration differences, not enforce hard rules.
legacy-v0-evaluate +9
72.7%
22
Feb 27
nssi_comprehensive
Comprehensive non-suicidal self-injury (NSSI) detection suite covering all ISAS functional categories, disclosure patterns, concealment strategies, and risk escalation. Tests ability to distinguish NSSI from suicidal behavior and identify anti-suicide function (critical for risk assessment). Includes contemporary algospeak and social media coded language.
legacy-v0-evaluate +10
72.7%
33
Feb 27
algospeak-obfuscation
Obfuscated crisis terms using leetspeak, spacing, symbols - tests pattern matching sophistication
baseten-v14f-mini
73.3%
15
Feb 26
fp_research_driven-v2
[v1] False positive test cases derived from academic research on mental health classification systems. Tests negation handling, humor markers, physical pain idioms, internet slang, minimal-context slang variants, and other documented FP patterns.
legacy-v0-evaluate +10
73.7%
38
Feb 27
cultural-crisis-gaps-2
Cultural crisis detection gaps part 2: Additional evidence-based linguistic markers including Cool Pose masking, culture-bound syndromes (falling out, ataque de nervios, musu), theological lethality markers, asexual/trans-specific patterns, and intergenerational debt framing. Based on comprehensive research synthesis of 160+ sources.
legacy-v0-evaluate +9
75.0%
20
Feb 27
dangerous_content
Tests for detecting content that ENCOURAGES harm to others (not expressing personal distress). Includes dangerous challenges, pro-self-harm content, method sharing, and death validation. Key distinction: subject='other' (the person being encouraged) not 'self' (unless speaker is also at risk).
legacy-v0-evaluate +10
75.0%
28
Feb 27
literary-metaphorical
Tests detection of implicit suicidal ideation expressed through literary, poetic, and metaphorical language. Based on Li et al. (2025) 'Can Large Language Models Identify Implicit Suicidal Ideation?' arXiv:2502.17899. These patterns use abstract language rather than explicit crisis statements - a known blind spot for keyword-based moderation.
baseten-v14f
75.0%
12
Feb 26
protective_factors-v2
[v1] Test suite focused on protective factor DETECTION. Validates that protective factors are correctly identified from conversation content. Severity/imminence expectations are deliberately wide - this suite tests PF detection, not severity calibration.
legacy-v0-evaluate +10
75.0%
20
Feb 27
screen-economic-distress-crisis
Economic distress and financial crisis suicide patterns. Based on research showing problem debt creates 7.96x suicide attempt risk (Naranjo et al. 2021), combined financial strains create 20x increase (Elbogen et al. 2020), and 79% of foreclosure suicides occur BEFORE actual housing loss (Houle & Light 2014). Tests three primary pathways: perceived burdensomeness, provider identity collapse, and escape reasoning.
baseten-v14f-mini
75.0%
40
Feb 26
screen-flash-false-positives
Cases where flash (production model) incorrectly triggers on clearly benign content. High-value regression tests.
baseten-v14f
75.0%
8
Feb 27
method-seeking
Tests for implicit method-seeking patterns (distress + location/means queries)
baseten-v14f-mini
76.5%
17
Feb 26
core_transparency_v2
Evidence-based core test suite for current taxonomy. Based on dual independent clinical research review. 26 cases covering full risk taxonomy with clinically-informed expectations.
legacy-v0-evaluate +10
76.9%
26
Feb 27
screen-veteran-military-crisis
Veteran and military crisis patterns including moral injury, institutional betrayal, military sexual trauma, TBI comorbidity, and transition crisis. Addresses gap where only 3 veteran cases existed. Based on VA data showing veterans with moral injury have higher suicide rates and 50% of veteran suicide deaths had received VA services.
baseten-v14f-mini
77.8%
18
Feb 26
screen-reddit-false-positives
False positive regression tests from Reddit Suicidal Ideation Dataset. These are clearly benign posts (labeled Non-Suicidal) that should NOT trigger crisis resources. Covers: mundane daily life, accomplishments, celebrations, questions, humor, positive mental health recovery.
baseten-v14f
78.2%
55
Feb 26
somatization_patterns
Crisis presentations through somatic (physical) symptoms rather than emotional language. 'Somatization is the single most common reason why psychiatric illness goes undetected in general medical settings.' Physical complaints are the acceptable way to express distress in many cultures where mental illness is stigmatized.
legacy-v0-evaluate +10
78.6%
28
Feb 27
litmus-v2
Litmus test suite using orthogonal subject/type taxonomy. Each risk has subject (who) + type (what) + features. speaker_severity derived from risks where subject='self'. Comprehensive coverage of suicide, IPV, safeguarding, violence, and edge cases.
legacy-v0-evaluate +9
79.5%
44
Feb 27
eating-disorders
Pro-ED content, thinspo, and eating disorder terminology - Koko has 498 high-confidence patterns here
baseten-v14f
80.0%
15
Feb 26
speaker_third_party_v2
Tests for speaker vs third-party risk disambiguation. Uses orthogonal subject/type: speaker_severity derived from risks where subject='self', third-party risks have subject='other'.
legacy-v0-evaluate +10
80.0%
25
Feb 27
v2_population_context
Tests for population context extraction. Validates that demographic/identity indicators are captured as features to enable population-specific resource matching.
baseten-v14f-mini
80.0%
10
Feb 26
harassment_severe
Tests for detecting severe, explicit harassment cases that can be identified from single messages (not pattern-based). Includes doxxing threats, sexual harassment, targeted degradation, and online pile-on indicators. Note: most harassment detection requires conversational context NOPE cannot provide - these tests cover explicit/severe cases only.
legacy-v0-evaluate +10
81.5%
27
Feb 27
screen-core
Core test suite for /screen endpoint. Tests suicide/self-harm detection using C-SSRS framework, covering active crisis, passive ideation, method-seeking, self-harm (NSSI), idioms, multi-turn conversations, and false positive prevention.
baseten-v14f-mini
82.6%
92
Feb 26
response-quality
Tests quality of LLM-generated safe responses across different severity levels and communication styles
legacy-v0-screen
83.3%
6
Feb 26
youth-cyberbullying
Cyberbullying crisis patterns distinct from traditional bullying. NIH/CHOP study found cybervictims are 4x more likely to report suicidal thoughts/attempts, independent of in-person bullying. Key distinguishing factor: inability to escape - harassment follows victims home, can be anonymous, spreads virally, reaches wider audiences.
baseten-v14f-mini
83.3%
12
Feb 26
youth-romantic-relationship
First romantic relationship and breakup crisis patterns. Research shows adolescents with dissolution stage issues (breakup concerns) were significantly more likely to present with suicide/self-harm than other relationship stages. Study of 370 students found 19.5% had major breakup, 34.3% had suicidal ideation with significant association. First breakup intensity reflects developmental factors: limited experience creates catastrophic thinking.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
83.3%
12
Feb 26
high-risk-occupational-crisis
Crisis patterns in high-risk occupations: farmers (3.5x general rate), construction (75% higher), first responders (police 58% of FR suicides), lawyers (2x ideation rate), active military (28.2/100k), and dentists (PMR 2.01). Research-derived linguistic markers from qualitative studies, crisis hotline research, and occupational health literature. Citations in rationale.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
84.8%
33
Feb 26
Implicit Crisis Expressions
Test cases targeting documented blind spots in competitor moderation APIs. OpenAI shows 33.2% false negative rate on implicit content, Azure 63.9% on implicit expressions, Perspective 75.4% overall. These cases test passive ideation, context-dependent statements, coded language, and cultural expressions that major APIs systematically miss.
baseten-v14f-mini
85.0%
20
Feb 26
stalking_sam
Tests for stalking detection using SAM (Stalking Assessment & Management) framework. Includes ex-intimate partner stalking (highest risk), escalation patterns, violence history, victim fear, and proxy harassment. SAM-informed risk levels: severe, elevated, standard.
legacy-v0-evaluate +10
85.0%
20
Feb 27
filter_router-v2
[v1] Tests that the classifier correctly identifies risk subjects and types. Validates orthogonal subject (self/other) and type (suicide/abuse/violence/etc) classification. Maps v1 domain routing to v1 subject/type structure.
baseten-v14f-mini +10
86.7%
15
Feb 27
multilingual-crisis
Crisis terms in non-English languages - tests international keyword coverage
baseten-v14f
86.7%
15
Feb 26
keyword-blind-spots
Cases requiring semantic understanding - keyword matchers will struggle here, LLMs should excel
baseten-v14f
87.5%
16
Feb 26
screen-reddit-longform-narratives
Long-form (1500-5000 char) Reddit posts from Suicidal Ideation Detection Dataset. Focus on extended narratives that require understanding context across multiple paragraphs.
baseten-v14f-mini
88.0%
50
Feb 26
benzodiazepine-withdrawal-crisis
Benzodiazepine withdrawal crisis patterns with suicidal ideation and impulsive self-harm. Research shows 54.4% of individuals who discontinued benzodiazepines experienced suicidal thoughts or attempted suicide. Case report documents 62-year-old male who, during rapid taper (60mg→7mg diazepam equivalent), within 36 hours became agitated and twice inflicted serious stab wounds requiring emergency surgery (Neale et al., 2007). Short-acting benzodiazepines (alprazolam, lorazepam) carry highest risk due to abrupt offset. Withdrawal mechanisms include GABA receptor dysregulation, paradoxical disinhibition, and medical invalidation of protracted symptoms. Cases cover acute taper, protracted withdrawal, paradoxical reactions, and false positives.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
88.9%
18
Feb 26
pregnancy-reproductive-loss
Pregnancy loss and reproductive health crisis detection. Based on evidence that suicide is leading cause of maternal death 6 weeks to 1 year postpartum (MBRRACE-UK), with stillbirth conferring 5.2x elevated risk. Covers miscarriage, stillbirth, infertility/IVF, birth trauma, NICU, TFMR, partner grief, TTC community language, medical terminology trauma, financial entrapment, obstetric violence, and reunion motivation patterns. Sources: Weng et al. 2018 (BJOG, DOI: 10.1111/1471-0528.15105), Lewkowitz et al. 2019 (AJOG), Tommy's National Centre, Bailey et al. 2019 (BMJ Open), Shani et al. 2016, 1001 Critical Days study.
baseten-v14f-mini
89.1%
55
Feb 26
sextortion-crisis-patterns
Sextortion crisis patterns for suicide detection. Tests the critical linguistic shift from external problem-focus ('I'm being blackmailed') to internal defeat ('I can't survive this'). Based on documented cases showing victims dying within 27 minutes to 6 hours of first contact. Sources: FBI 2024 sextortion data, NCMEC 36+ documented suicide cases, Thorn financial sextortion research 2024, Sadath et al. 2024 humiliation-suicide meta-analysis.
baseten-v14f-mini
89.3%
28
Feb 26
substance_crisis
Tests for substance abuse crisis detection, including substance+suicidal ideation combinations (known high-risk), overdose risk severity, and severe withdrawal states. These combinations are clinically established risk escalators.
legacy-v0-evaluate +10
89.3%
28
Feb 27
kms-hyperbole-calibration
Calibration suite for 'kms' (kill myself) detection. Tests the boundary between hyperbolic internet slang and genuine masked ideation. Key principle: trivial stressors + humor markers = no flag; significant stressors or isolation language = flag even with humor.
baseten-v14f
89.5%
19
Feb 26
screen-postpartum-transitions
Crisis patterns during major life transitions including postpartum period, motherhood adjustment, and acute care-seeking urgency. Research shows mothers hide suicidal feelings to adhere to cultural expectations of motherhood, with unique linguistic markers around loss of control, overwhelm, and incongruence between expectations vs reality.
legacy-v0-screen
90.0%
10
Feb 26
screen-indigenous-global-patterns
Indigenous crisis patterns globally including intergenerational trauma (residential/boarding schools), land dispossession, cultural genocide, MMIW, substance misuse linked to historical trauma, youth suicide clusters, forced removal, environmental destruction, colonial violence legacy, and cultural disconnection. Addresses complete gap (0 existing Indigenous-specific cases). Based on CDC data showing Indigenous suicide rate 3.5x higher than general population, Canadian TRC documentation, and global Indigenous health disparities.
baseten-v14f
91.7%
12
Feb 26
screen-chronic-illness-disability
Crisis patterns specific to chronic illness, chronic pain, and disability populations. These populations express crisis through unique linguistic markers including perceived burdensomeness related to dependency, treatment non-adherence as passive suicide method, and conditional survival language.
baseten-v14f
92.3%
13
Feb 26
youth-family-conflict
Family conflict and dysfunction as primary risk pathway. Meta-analysis found adolescents with family dysfunction have 1.93x higher suicide risk. TORDIA study demonstrated family conflict significantly predicted suicidal adverse events. Child maltreatment, particularly sexual abuse, carries highest attempt risk (OR 11.7-49.3).
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
92.3%
13
Feb 26
realistic_prevalence_v2
Realistic prevalence test suite mimicking real-world deployment conditions. 200 messages with ~1% crisis rate (2 true crises, 198 non-crises). Tests PPV, specificity, and false positive rate under realistic class imbalance. Based on Swaminathan et al. (2023) findings: prospective deployment had 0.55% crisis prevalence. (Converted to subject/type taxonomy with subject/type risks.)
legacy-v0-evaluate +6
92.8%
97
Feb 27
validated-clinical-expressions-screen
Crisis language patterns from validated clinical instruments (C-SSRS, ASQ, PHQ-9), official public health warning sign lists (NIMH, AFSP, JED Foundation), and crisis service training materials. Organized by 7 thematic categories from suicidal ideation detection research.
legacy-v0-screen
92.9%
28
Feb 26
Cultural Identity False Positives
Test cases targeting documented over-flagging by competitor moderation APIs. CHI 2025 audit found all major APIs over-flag content containing LGBTQ+, Black, Jewish, and Muslim identity terms. These cases test trauma processing, recovery narratives, educational content, and identity discussions that should NOT be flagged as crisis.
baseten-v14f
93.3%
15
Feb 26
screen-immigrant-refugee-crisis
Immigrant and refugee crisis patterns including asylum detention, deportation fear, family separation trauma, professional deskilling, and climate refugees. Addresses complete gap (0 existing cases) where immigrants/refugees represent high-risk population. Based on 2020 ICE detention suicide rate of 17.4 per 100,000 (5.3x the 2010-2019 average) and Hispanic suicide rate increase of 26.6% (2015-2020).
baseten-v14f
93.3%
15
Feb 26
screen-lgbtq-adult-crisis
LGBTQ+ adult crisis patterns distinct from youth coverage. Includes coming out later in life (30s-60s), trans healthcare denial, elder LGBTQ+ isolation/re-closeting, HIV/AIDS crisis, religious trauma in adulthood, workplace discrimination, and conversion therapy aftermath. Addresses gap where existing coverage focused on youth (13 cases) with minimal adult representation (5-7 cases).
baseten-v14f
93.3%
15
Feb 26
opioid-overdose-suicidal-intent
Opioid overdose cases with varying levels of suicidal intent. Research shows 58.5% of opioid overdose patients had at least some desire to die, 36% had strong desire, and 21% reported 'I definitely wanted to die' (Connery et al., 2019). Yet few overdoses are coded as suicide attempts, creating critical detection gap. Uses C-SSRS framework to distinguish passive intent (wish to be dead) from active ideation (thoughts with method) and preparatory behaviors.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
93.8%
48
Feb 26
filter_none_v2
False positive regression suite for current taxonomy. Tests that benign content, humor, idioms, and non-personal discussions should NOT trigger risk detection. Critical for preventing over-flagging in production.
legacy-v0-evaluate +10
93.9%
49
Feb 27
alcohol-withdrawal-crisis
Alcohol withdrawal crisis patterns with suicidal ideation. Research shows individuals with AUD have 40% lifetime prevalence of suicide attempt and 10-14× higher suicide risk vs general population. Withdrawal symptoms peak at 24-72 hours post-cessation with depressive symptoms most severe during detoxification. Cognitive constriction (narrowing of attention to present emotional state) is validated mechanism linking acute alcohol use and withdrawal to suicidal behavior. Cases cover acute withdrawal, protracted withdrawal syndrome, comorbidity, relapse patterns, and false positives.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
95.0%
20
Feb 26
elder-self-neglect-crisis
Elder self-neglect crisis patterns with depression and suicidal ideation. Research shows 51% of older adults with self-neglect have depression (vs 28% without); depression more than doubles self-neglect risk (OR: 2.38). When both dementia and depression present, risk is 8.6× higher. Diogenes syndrome (severe self-neglect) has 46% 5-year mortality rate. Among those with suicidal ideation, hoarding is most prevalent self-neglect phenotype (33.3%). Executive dysfunction - capacity to make decisions remains intact but capacity to implement self-care is diminished. Elder Self-Neglect Assessment (ESNA) domains: personal care, nutrition, environment/housing, financial stewardship, socialization. Cases cover medical non-compliance, nutritional neglect, hygiene neglect, environmental squalor, executive dysfunction, hoarding, and false positives.
baseten-v14f-mini
95.7%
23
Feb 26
c-ssrs-escalation
C-SSRS (Columbia-Suicide Severity Rating Scale) escalation prompts from academic research on mental health chatbot safety. Based on: Pichowicz W, Kotas M, Piotrowski P. 'Performance of mental health chatbot agents in detecting and managing suicidal ideation.' Scientific Reports. 2025;15:31652. https://doi.org/10.1038/s41598-025-17242-4
legacy-v0-evaluate +10
100.0%
10
Feb 27
explicit-keywords
Direct crisis terminology - keyword matchers should perform well here
baseten-v14f-mini
100.0%
16
Feb 26
resources-e2e
End-to-end tests verifying classification AND resource resolution work together. Each case tests both risk assessment and that appropriate specialized resources are surfaced.
legacy-v0-screen
100.0%
7
Feb 26
screen-homepage-examples
Regression tests for examples shown on nope.net homepage. Ensures our public claims match API behavior.
legacy-v0-screen
100.0%
6
Feb 26
screen-perpetrator-disclosure-generalization
EXPLORATORY: Tests whether the model genuinely understands perpetrator disclosures (subject=other) vs. pattern-matching on prompt examples. Uses diverse scenarios NOT mentioned in training: different relationships, harm types, vulnerable populations, and framing styles. If model only learned 'mum + cancer = neglect:other', these will fail.
legacy-v0-screen
100.0%
18
Feb 26
screen-third-party-concern
Posts where speaker expresses concern about someone ELSE's suicidal crisis (family member, friend, etc). Tests that we correctly identify speaker is NOT at risk - they're reporting third-party concern.
nope-edge-minime-v14d, nope-edge-minime-v14d-mini
100.0%
3
Feb 26
stimulant-psychosis-crisis
Methamphetamine-induced psychosis crisis patterns with violence risk and suicidal ideation. Research shows in meth users with psychosis: 85.5% have delusions of persecution, 75.6% violence behavior, 51.3% auditory hallucinations, and 40% formication (delusional parasitosis/'meth mites'). Violence risk escalation documented: 37% obtained weapon, 11% used weapon, 15% attacked another person. Onset can occur 1-5 days after initiation or within less than a week with prior history. Hallucinations typically resolve 1-2 days, delusions 2-3 weeks, but 30% have symptoms persisting 6+ months. Cases cover paranoid delusions, formication self-injury, command hallucinations, disorganized states, and false positives.
legacy-v0-screen
100.0%
18
Feb 26
youth-false-positives
Youth hyperbole vs genuine crisis differentiation. NLP tools achieve only 15% accuracy in detecting sarcasm. Adolescent expressions like 'I'm literally dying,' 'kill me now,' and 'I'm dead' are ubiquitous in casual communication. Key markers: gaming/entertainment context, no emotional escalation, humor indicators, quick emotional recovery, social engagement patterns.
legacy-v0-screen
100.0%
12
Feb 26
youth-lgbtq-minority-stress
LGBTQ+ youth crisis patterns reflecting minority stress accumulation. Trevor Project 2024 survey: 39% seriously considered suicide, 12% attempted. Those experiencing 4 types of minority stress face 12x greater odds of attempt. Age 13-17 shows higher risk (46%/16%) than 18-24 (33%/8%). Conversion therapy exposure doubles attempt likelihood.
baseten-v14f-mini
100.0%
13
Feb 26
youth-school-academic
School-based crisis patterns including academic pressure, discipline, college rejection, and exam stress. Research shows 14% prevalence of suicide risk among students with academic pressure, and the MARIS study found academic failure was the only predictor of suicide after one month. Suicide rates are notably lowest during school closures.
baseten-v14f
100.0%
12
Feb 26