    Widely Used AI Models Can Produce Severely Harmful Medical Advice. NOHARM Benchmark from Stanford University.

    By Lidziya Tarasenka | December 10, 2025 | 2 min read

    Large language models (LLMs) that already assist physicians and patients with medical questions can still generate severely harmful advice in a sizable percentage of real cases, according to a new multi-center preprint introducing the NOHARM safety benchmark. 

    Across 31 models, the rate of at least one severely harmful recommendation ranged from about 9% to 22% of outpatient consultation cases.

    The NOHARM benchmark uses 100 real primary-care–to-specialist eConsults across 10 specialties and 4,249 possible management actions (tests, drugs, referrals, follow-up steps). Models were scored on how often they recommended harmful actions or failed to recommend necessary ones.
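
    A minimal sketch in Python of what that scoring amounts to per case: compare the model's recommended actions against annotations of which actions are harmful and which are necessary. All names below are hypothetical, and the preprint's actual rubric also grades the severity of each error, which this sketch omits:

        # Toy per-case scoring against annotated action sets.
        # All names are hypothetical; NOHARM's real rubric also grades
        # how severe each error is, which this sketch omits.

        def score_case(recommended: set[str],
                       harmful: set[str],
                       critical: set[str]) -> dict:
            commissions = recommended & harmful   # harmful actions the model proposed
            omissions = critical - recommended    # necessary actions the model missed
            return {
                "commission_errors": len(commissions),
                "omission_errors": len(omissions),
                "any_error": bool(commissions or omissions),
            }

        # A plan that orders a risky drug and skips an urgent referral
        # counts one error of each type.
        print(score_case(
            recommended={"order_mri", "start_drug_x"},
            harmful={"start_drug_x"},
            critical={"order_mri", "urgent_cardiology_referral"},
        ))
        # -> {'commission_errors': 1, 'omission_errors': 1, 'any_error': True}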

    Severe harm was driven mainly by omissions. About 77% of severe errors came from models failing to suggest critical actions, such as key diagnostic tests, urgent referrals, or essential follow-up, rather than from proposing clearly dangerous interventions. Even a “no intervention” strategy (reassurance only) produced substantial severe harm, underscoring that inaction is not a safe baseline.

    On composite safety metrics, the best models, including a leading commercial model and a clinically grounded RAG system, outperformed generalist internists who were limited to conventional online resources. However, traditional AI and medical-knowledge benchmarks only moderately correlated with NOHARM scores, and model size, recency, or “reasoning” modes did not reliably predict safety.

    Harm could be mitigated by orchestration. Multi-agent setups, in which one model’s plan was reviewed by other models prompted to look for errors, had nearly six-fold higher odds of landing in the top safety quartile. Heterogeneous ensembles that mixed open-source, proprietary, and RAG models performed best.

    The authors conclude that clinical safety is a distinct dimension of model performance that cannot be inferred from exam-style accuracy alone.
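
    As a rough illustration of the orchestration pattern described above: one model drafts a plan, reviewer models critique it, and the planner revises. The query_model function below is a hypothetical placeholder for a generic LLM API call, not the paper's actual harness:

        # Hedged sketch of the plan-then-review loop described above.
        # query_model is a hypothetical stand-in for an LLM API call.

        def query_model(model: str, prompt: str) -> str:
            raise NotImplementedError  # wire up a real LLM client here

        def reviewed_plan(case: str, planner: str, reviewers: list[str]) -> str:
            # One model drafts a management plan for the eConsult case.
            plan = query_model(planner, f"Propose a management plan for:\n{case}")
            # Heterogeneous reviewer models hunt for harmful suggestions
            # and, especially, omitted critical actions.
            critiques = [
                query_model(r, "Review this plan for harmful recommendations "
                               "and for critical tests, referrals, or "
                               f"follow-up it omits:\n{plan}")
                for r in reviewers
            ]
            # The planner revises its plan in light of the critiques.
            return query_model(planner, "Revise this plan to address the "
                               f"reviews.\nPlan:\n{plan}\nReviews:\n"
                               + "\n\n".join(critiques))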

    Photo: Pavel Danilyuk / Pexels
