    Widely Used AI Models Can Produce Severely Harmful Medical Advice. NOHARM Benchmark from Stanford University.

December 10, 2025 · 2 Mins Read

    Large language models (LLMs) that already assist physicians and patients with medical questions can still generate severely harmful advice in a sizable percentage of real cases, according to a new multi-center preprint introducing the NOHARM safety benchmark. 

    Across 31 models, the rate of at least one severely harmful recommendation ranged from about 9% to 22% of outpatient consultation cases.

    The NOHARM benchmark uses 100 real primary-care–to-specialist eConsults across 10 specialties and 4,249 possible management actions (tests, drugs, referrals, follow-up steps). Models were scored on how often they recommended harmful actions or failed to recommend necessary ones.
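To make that scoring idea concrete, here is a minimal, hypothetical sketch of how a model's recommended actions for a single eConsult case could be compared against annotations of harmful and required actions. The data structures and function names are illustrative assumptions, not the paper's actual pipeline, and severity grading is omitted.

```python
# Hypothetical scoring sketch for one eConsult case: count "commissions"
# (harmful actions the model recommended) and "omissions" (required actions
# the model failed to recommend). Illustrative only; not the NOHARM code.
from dataclasses import dataclass


@dataclass
class CaseAnnotation:
    harmful_actions: set[str]   # actions annotated as harmful for this case
    required_actions: set[str]  # actions a safe plan must include


def score_case(recommended: set[str], annotation: CaseAnnotation) -> dict:
    """Return per-case counts of harmful recommendations and missed actions."""
    commissions = recommended & annotation.harmful_actions
    omissions = annotation.required_actions - recommended
    return {
        "n_commissions": len(commissions),
        "n_omissions": len(omissions),
        "any_error": bool(commissions or omissions),
    }


# Example with made-up action labels: the model orders the right test
# but misses the urgent referral, so the case scores one omission.
annotation = CaseAnnotation(
    harmful_actions={"start_opioid_without_workup"},
    required_actions={"order_colonoscopy", "urgent_gi_referral"},
)
print(score_case({"order_cbc", "order_colonoscopy"}, annotation))
```

Under this framing, the study's headline finding that omissions drive most severe harm corresponds to omission counts, not commissions, dominating the error tallies.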

    Severe harm was driven mainly by omissions. About 77% of severe errors occurred when models failed to suggest critical actions – such as key diagnostic tests, urgent referrals, or essential follow-up – rather than from proposing clearly dangerous interventions. Even a “no intervention” strategy (reassurance only) produced substantial severe harm, underscoring that inaction is not a safe baseline.

    On composite safety metrics, the best models, including a leading commercial model and a clinically grounded RAG system, outperformed generalist internists who were limited to conventional online resources. However, traditional AI and medical-knowledge benchmarks only moderately correlated with NOHARM scores, and model size, recency, or “reasoning” modes did not reliably predict safety.

Harm could be mitigated by orchestration. Multi-agent setups, where one model’s plan was reviewed by others prompted to look for errors, had nearly six-fold higher odds of landing in the top safety quartile. Heterogeneous ensembles that mixed open-source, proprietary, and RAG models performed best (a minimal sketch of such a review loop follows below).

The authors conclude that clinical safety is a distinct dimension of model performance that cannot be inferred from exam-style accuracy alone.
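The sketch below illustrates the general shape of such a review loop, assuming a generic chat-completion client; the article does not describe the paper's actual orchestration, prompts, or number of review rounds, so everything here, including `call_llm` and `reviewed_plan`, is a placeholder.

```python
# Hypothetical multi-agent review sketch: a primary model drafts a management
# plan and reviewer models (ideally a heterogeneous mix of open-source,
# proprietary, and RAG systems) are asked to flag harmful or missing actions
# before the plan is finalized. Not the paper's actual setup.

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: plug in whatever LLM client you use.
    raise NotImplementedError("replace with a real chat-completion call")


def reviewed_plan(case_text: str, author: str, reviewers: list[str]) -> str:
    """Draft a plan, collect safety critiques, then revise once."""
    draft = call_llm(author, f"Propose a management plan for this eConsult:\n{case_text}")
    critiques = [
        call_llm(
            reviewer,
            "Review this plan for harmful recommendations and for critical "
            f"actions that are missing (tests, urgent referrals, follow-up):\n{draft}",
        )
        for reviewer in reviewers
    ]
    # Single revision pass for brevity; real orchestration may iterate.
    return call_llm(
        author,
        f"Revise the plan below to address these reviews.\nPlan:\n{draft}\n"
        + "\n".join(f"Review {i + 1}: {c}" for i, c in enumerate(critiques)),
    )
```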

    Photo: Pavel Danilyuk / Pexels

    Lidziya Tarasenka

    Healthcare professional with a strong background in medical journalism, media redaction, and fact-checking healthcare information. Medical advisor skilled in research, content creation, and policy analysis. Expertise in identifying systemic healthcare issues, drafting reports, and ensuring the accuracy of medical content for public and professional audiences.
