Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
Kiana Jafari, Paul Ulrich Nikolaus Rust, Duncan Eddy, Robbie Fraser, Nina Vasan, Darja Djordjevic, Akanksha Dadlani, Max Lamparth, Eugenia Kim, Mykel Kochenderfer
Don't assume expert labels create valid ground truth for mental health AI. An intraclass correlation coefficient (ICC) below 0.3 means your training signal is mostly noise. If you're building RLHF pipelines for clinical domains, you need either radically different aggregation methods or acceptance that expert disagreement reflects genuine ambiguity, not labeling error.
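A minimal sketch of why a low single-rater ICC leaves little usable signal after aggregation, assuming a simple signal-plus-noise rating model (not the paper's own analysis) and the Spearman-Brown projection for the reliability of a mean-of-k label:

```python
# Sketch, not the paper's method: under a signal-plus-noise model, the
# Spearman-Brown formula projects a single-rater ICC to the reliability of
# the label obtained by averaging k raters, i.e. the share of variance in
# that averaged label attributable to true response quality rather than
# rater noise.
def mean_label_reliability(single_rater_icc: float, k: int) -> float:
    """Reliability of the average of k raters' scores (Spearman-Brown)."""
    return k * single_rater_icc / (1 + (k - 1) * single_rater_icc)

for icc in (0.087, 0.295):  # the study's reported ICC range
    r = mean_label_reliability(icc, k=3)
    print(f"single-rater ICC={icc:.3f} -> mean-of-3 label reliability={r:.2f}")
```

Under this model, averaging three experts at the study's upper bound (ICC ≈ 0.30) yields a label that is barely more signal than noise (~0.56), and at the lower bound (ICC ≈ 0.09) one that is mostly noise (~0.22).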
Learning from human feedback assumes that expert judgments, when aggregated, yield valid ground truth for training AI. The stakes in mental health demand consensus, but do experts actually agree?
Method: The study systematically tested whether expert consensus, the foundation of RLHF pipelines, actually exists in a high-stakes domain. Three certified psychiatrists independently evaluated LLM-generated mental health responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below accepted thresholds.
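For readers unfamiliar with the statistic, a common variant is ICC(2,1) (two-way random effects, absolute agreement, single rater); whether the study used exactly this form is not stated here, so the following is an illustrative computation on hypothetical scores, not the paper's data:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_items, k_raters) matrix of rubric scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-item means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss (1979) ICC(2,1).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical example: six LLM responses scored 1-5 by three raters.
scores = np.array([
    [4, 2, 3],
    [5, 3, 4],
    [2, 4, 2],
    [3, 1, 4],
    [4, 5, 2],
    [1, 3, 3],
])
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")  # values below ~0.5 read as poor agreement
```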
Caveats: Three raters, one domain. Other clinical specialties may show different consensus patterns.
Reflections: Do other high-stakes domains (legal advice, financial planning) show similar expert disagreement? · Can structured disagreement (capturing why experts differ) produce better training signal than forced consensus? · Should safety-critical AI systems be trained on expert consensus at all, or on outcome data?