Large language models (LLMs) that already assist physicians and patients with medical questions can still generate severely harmful advice in a sizable percentage of real cases, according to a new multi-center preprint introducing the NOHARM safety benchmark.
Across the 31 models evaluated, the share of outpatient consultation cases containing at least one severely harmful recommendation ranged from roughly 9% to 22%.
The NOHARM benchmark uses 100 real primary-care–to-specialist eConsults across 10 specialties and 4,249 possible management actions (tests, drugs, referrals, follow-up steps). Models were scored on how often they recommended harmful actions or failed to recommend necessary ones.
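To make the distinction between harmful recommendations and harmful omissions concrete, here is a minimal Python sketch of how harm could be tallied against a labeled action set for a single case. The field names, severity scale, and scoring rule are illustrative assumptions for this article, not the preprint's actual NOHARM metric.

```python
# Hypothetical sketch of commission/omission scoring against a labeled action set.
# Field names and the severity scale are illustrative assumptions, not NOHARM's metric.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionLabel:
    action_id: str
    harm_if_done: int     # 0 = benign, 1 = moderate, 2 = severe harm if recommended
    harm_if_omitted: int  # 0 = optional, 1 = moderate, 2 = severe harm if NOT recommended

def score_case(labels: list[ActionLabel], recommended: set[str]) -> dict:
    """Count harmful commissions and omissions for one eConsult case."""
    commissions = [l for l in labels if l.action_id in recommended and l.harm_if_done > 0]
    omissions = [l for l in labels if l.action_id not in recommended and l.harm_if_omitted > 0]
    return {
        "severe_commissions": sum(1 for l in commissions if l.harm_if_done == 2),
        "severe_omissions": sum(1 for l in omissions if l.harm_if_omitted == 2),
        "any_severe_harm": any(l.harm_if_done == 2 for l in commissions)
                           or any(l.harm_if_omitted == 2 for l in omissions),
    }

# Example: a plan that adds a dangerous drug and skips a critical test.
labels = [
    ActionLabel("mri_brain", harm_if_done=0, harm_if_omitted=2),    # critical diagnostic test
    ActionLabel("opioid_start", harm_if_done=2, harm_if_omitted=0)  # clearly dangerous intervention
]
print(score_case(labels, recommended={"opioid_start"}))
# {'severe_commissions': 1, 'severe_omissions': 1, 'any_severe_harm': True}
```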
Severe harm was driven mainly by omissions. About 77% of severe errors stemmed from models failing to suggest critical actions – such as key diagnostic tests, urgent referrals, or essential follow-up – rather than from proposing clearly dangerous interventions. Even a “no intervention” strategy (reassurance only) produced substantial severe harm, underscoring that inaction is not a safe baseline.
On composite safety metrics, the best models, including a leading commercial model and a clinically grounded RAG system, outperformed generalist internists who were limited to conventional online resources. However, traditional AI and medical-knowledge benchmarks correlated only moderately with NOHARM scores, and neither model size, recency, nor “reasoning” modes reliably predicted safety.
Harm could be mitigated by orchestration. Multi-agent setups, in which one model’s plan was reviewed by other models prompted to look for errors, had nearly six-fold higher odds of landing in the top safety quartile, and heterogeneous ensembles mixing open-source, proprietary, and RAG models performed best (a rough sketch of this cross-review pattern appears below).

The authors conclude that clinical safety is a distinct dimension of model performance that cannot be inferred from exam-style accuracy alone.
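For readers curious what such cross-review orchestration looks like in practice, the following is a minimal sketch: one model drafts a plan, other models critique it for omissions and unsafe steps, and the drafter revises once. The prompts, the single revision pass, and the toy stand-in models are assumptions for illustration, not the study's actual pipeline.

```python
# Minimal sketch of a cross-review orchestration pattern: draft -> critiques -> revision.
# Prompts and the one-pass revision are illustrative assumptions, not the study's protocol.
from typing import Callable

LLM = Callable[[str], str]  # any function that maps a prompt to a completion

def cross_review(case_text: str, drafter: LLM, reviewers: list[LLM]) -> str:
    # One model proposes a management plan for the eConsult.
    draft = drafter(f"Propose a management plan for this eConsult:\n{case_text}")
    # Other models are prompted specifically to hunt for omissions and unsafe steps.
    critiques = [
        r(
            "Review this plan for missing critical tests, referrals, or follow-up, "
            f"and for any unsafe recommendations.\nCase:\n{case_text}\nPlan:\n{draft}"
        )
        for r in reviewers
    ]
    # The drafter revises its plan in light of the critiques.
    return drafter(
        "Revise the plan to address these critiques without adding unnecessary actions.\n"
        f"Case:\n{case_text}\nPlan:\n{draft}\nCritiques:\n" + "\n---\n".join(critiques)
    )

# Toy stand-ins so the sketch runs without API keys; swap in real model calls as needed.
if __name__ == "__main__":
    echo = lambda prompt: f"[model output for {len(prompt)}-char prompt]"
    print(cross_review("58-year-old with new headache and visual changes...", echo, [echo, echo]))
```

Mixing models from different families in the reviewer pool mirrors the heterogeneous ensembles the study found most effective, on the intuition that diverse models miss different things.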
Photo: Pavel Danilyuk / Pexels

