AI is becoming increasingly capable in medicine, and it can already handle some very complex problems. It is also increasingly positioned as a substitute for human workers, as the recent wave of AI-driven layoffs illustrates. A new study from a team at the University of Alberta, however, shows that doctors still outperform advanced language models on tasks that require flexible clinical reasoning.
The researchers evaluated how well AI models can analyze evolving clinical information, for example when a patient's symptoms change during an examination or when distracting elements appear. According to the results, published in the New England Journal of Medicine, AI scores very highly on multiple-choice exams, but in realistic clinical scenarios, which require ordinary human intuition, context, and adaptability, the models fall far behind physicians.
The research team used a tool called concor.dance, based on script-concordance testing, which measures the ability to recognize which pieces of new information are relevant to a treatment plan and which are merely distracting "false cues." Ten popular AI models from leading companies, including Google, OpenAI, and Anthropic, were tested. Overall, the models performed at the level of first- and second-year medical students, while senior residents and experienced physicians remained significantly more effective.
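The article does not describe how concor.dance scores answers, but script-concordance tests are commonly graded with the so-called aggregate method: a panel of experienced physicians rates how a new piece of information affects a diagnostic hypothesis, and a test-taker earns credit in proportion to how many panelists chose the same rating. Below is a minimal Python sketch of that scoring scheme, using an entirely hypothetical panel; it illustrates the general technique, not the study's actual implementation.

```python
from collections import Counter

def sct_score(response: int, panel_responses: list[int]) -> float:
    """Aggregate scoring for one script-concordance item.

    Each panelist rates, e.g. on a -2..+2 Likert scale, how a new
    finding affects a diagnostic hypothesis. The test-taker's credit
    is the number of panelists who gave the same rating, divided by
    the count of the most popular (modal) rating, so agreeing with
    the panel majority earns full credit and rarer answers earn
    partial or no credit.
    """
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(response, 0) / modal_count

# Hypothetical item: a panel of 10 physicians rated how a new finding
# changes the likelihood of the working diagnosis
# (-2 = rules it out, 0 = no effect, +2 = strongly supports it).
panel = [+1, +1, +1, +1, 0, 0, +2, +1, 0, +1]

print(sct_score(+1, panel))  # 1.0 (matches the modal panel answer)
print(sct_score(0, panel))   # 0.5 (partial credit)
print(sct_score(-2, panel))  # 0.0 (no panelist agreed)
```

Under this scheme there is no single "correct" answer key; a respondent is rewarded for matching the distribution of expert judgment, which is what makes the format sensitive to the flexible, context-dependent reasoning the study set out to measure.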
One of the key findings was that in roughly 30% of the tasks, the new information was irrelevant or intentionally misleading, and the AI models struggled to recognize this. They would often attempt to "explain" an irrelevant detail by integrating it into the diagnostic plan, which led to errors. "One of our biggest concerns about large language models is that they have been fine-tuned to be very helpful, frequently giving answers that inspire confidence," said Liam McCoy, a neurology resident at the University of Alberta and one of the study's authors.
The researchers emphasize that although medical AI is improving rapidly, for example at analyzing medical images, generating clinical notes, and retrieving data, it is still not ready to replace doctors autonomously in diagnosis and treatment. There is a long way to go before that becomes possible, if it ever does.

