The Illusion of Readiness: AI, LLMs, and Healthcare Hallucinations

Introduction

Following last week’s announcements about the use of Microsoft’s Copilot across the NHS and DHSC, several bold claims were made, including a monthly saving of 300,000 hours of staff time. Unfortunately, some key information was missing from that paper, which has prevented commentators from digging into the claims being made. What has received far less coverage is the release of Microsoft’s academic paper, “The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks” (Gu et al., 2025), in which the company’s own researchers question the true readiness of these technologies for clinical use. Are AI hallucinations being ignored by those selling AI in healthcare, or is there genuine potential waiting to be realised?

High Scores, Hidden Fragility

Microsoft’s paper reveals that while large frontier models such as GPT-5 achieve impressive scores on medical benchmarks, these results may be misleading. Stress tests show that models often rely on test-taking strategies rather than true medical understanding. For example, models can sometimes guess the correct answer even when crucial information, such as medical images, is missing. Furthermore, their responses can change dramatically with minor prompt adjustments.
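
To make the kind of stress test the paper describes concrete, here is a minimal sketch of an image-ablation check. The `toy_model` stub and the benchmark structure are illustrative assumptions, not the authors’ actual evaluation harness; a real harness would call a multimodal model API at the marked point.

```python
# Sketch of an image-ablation stress test, in the spirit of the paper's
# methodology. toy_model() is a stand-in for a real multimodal API call.

def toy_model(question: str, options: list[str], image: bytes | None) -> str:
    # A real harness would send the question (and image, if present) to a
    # multimodal model here; this stub always picks the first option so
    # the example runs end to end.
    return options[0]

def image_ablation_rate(cases: list[dict]) -> float:
    """Fraction of image-dependent questions still answered 'correctly'
    once the image is withheld. A high rate suggests the model exploits
    textual shortcuts rather than genuinely reading the scan."""
    hits = sum(
        toy_model(c["question"], c["options"], image=None) == c["answer"]
        for c in cases
    )
    return hits / len(cases)

cases = [
    {"question": "Which abnormality is shown?", "options": ["A", "B", "C", "D"], "answer": "A"},
    {"question": "What does the X-ray indicate?", "options": ["A", "B", "C", "D"], "answer": "C"},
]
print(f"Answered 'correctly' without the image: {image_ablation_rate(cases):.0%}")
```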

Shortcut Learning and Brittleness

Researchers evaluated six flagship models across six widely used medical benchmarks. The findings were sobering: models frequently exploit superficial cues and memorised data, rather than demonstrating robust, multimodal reasoning. Performance is brittle, with minor changes to input format or answer order causing significant shifts in predictions.
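
The answer-order sensitivity described here can be probed with an equally simple perturbation: re-ask the same question with the options shuffled and count how many distinct answers come back. The sketch below uses a deliberately position-biased stub to show what brittleness looks like; the names are illustrative, not the paper’s code.

```python
import random

def toy_model(question: str, options: list[str]) -> str:
    # Position-biased stub standing in for a real model call: it always
    # picks whichever option happens to appear first.
    return options[0]

def order_sensitivity(question: str, options: list[str], trials: int = 20) -> int:
    """Count distinct answers given when the same options are presented in
    different orders. A robust model should give exactly one answer; the
    position-biased stub gives several."""
    return len({
        toy_model(question, random.sample(options, k=len(options)))
        for _ in range(trials)
    })

options = ["pneumothorax", "pleural effusion", "cardiomegaly", "atelectasis"]
print(order_sensitivity("Which abnormality is shown?", options))  # likely 4
```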

Fabricated Reasoning

One of the most concerning discoveries is that these models often produce confident, step-by-step medical explanations that do not reflect their actual reasoning process. While these fabricated rationales may sound convincing, they could be dangerous if relied upon in clinical practice.

Benchmarks Are Not All Equal

The study also highlights that medical benchmarks vary widely in what they measure, yet are often treated as interchangeable. This masks important failure modes and can lead to an overstatement of a model’s real-world readiness.

Implications for Healthcare

The authors caution that high scores on medical benchmarks do not equate to readiness for clinical use. There is a credibility gap: models may excel in exams but falter when faced with tasks requiring genuine understanding and robustness. The paper advocates for more rigorous evaluation, including stress testing and clinician-guided rubrics, to ensure that health AI systems are robust, trustworthy, and aligned with actual medical demands.

Limitations and Future Directions

While the study focuses on specific medical benchmarks, its lessons are broadly applicable. Rubric-based evaluation depends on clinician expertise, which may introduce subjectivity. Future work should aim to develop standardised, transparent evaluation frameworks that combine benchmark scores with stress testing.

Microsoft’s Paper: A Critical Perspective

Microsoft’s 2025 paper cautions against overestimating AI’s readiness for clinical integration. The authors stress the need for rigorous evaluation and warn against relying solely on supplier claims. Their experiments revealed that when input data was progressively removed, LLMs sometimes jumped to conclusions, highlighting potential risks in clinical decision-making.
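
A progressive-removal probe of this kind might look like the sketch below: the question text is truncated in stages and the model’s commitment at each stage is recorded. The stub here abstains once its cue words disappear; the paper’s finding is that real frontier models often keep answering regardless. The helper names are assumptions for illustration.

```python
def toy_model(question: str, options: list[str]) -> str | None:
    # Stand-in for a real model call: answers while any clinical keyword
    # survives truncation, otherwise abstains. Per the paper, real models
    # often jump to a conclusion even when the evidence has been removed.
    keywords = ("fever", "cough", "x-ray")
    return options[0] if any(k in question.lower() for k in keywords) else None

def answers_under_ablation(question: str, options: list[str]) -> list[tuple[int, str | None]]:
    """Truncate the question to 100%, 75%, 50% and 25% of its length and
    record what the model commits to at each stage."""
    return [
        (int(frac * 100), toy_model(question[: int(len(question) * frac)], options))
        for frac in (1.0, 0.75, 0.5, 0.25)
    ]

q = "A patient presents with fever and cough; the chest x-ray shows consolidation."
for pct, answer in answers_under_ablation(q, ["pneumonia", "asthma"]):
    print(f"{pct:>3}% of input -> {answer}")
```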

However, it is important to note that current use cases, such as clinical summarisation and electronic discharge summaries, differ from direct clinical decision-making. The research itself could help improve models by introducing new rules and constraints, a process that need only be done once, unlike the continual retraining required for human staff.

Lessons from Past Technology Cycles

History shows that hype cycles often precede genuine transformation. The dot-com boom and bust gave way to the internet’s integration into daily life. Handheld devices evolved from the Psion and the Palm Pilot to the iPhone, which succeeded by fusing internet, cellular networks, sensors and software. AI, combined with quantum computing, IoT and robotics, is poised for a similar transformation.

In Conclusion

The experiments in Microsoft’s paper deliberately “blinded” the AI by removing more and more diagnostic data, then criticised its conclusions. This is akin to mocking a visually impaired person for failing to perform surgery. While the paper rightly calls out inappropriate uses of AI, dismissing its future potential is naïve. Key questions remain, including: if AI presents pre-processed information to a human, is it already making the decision in the ‘human in the loop’ scenario, with the human too busy, or simply unconcerned, to question how the information was derived?

Upcoming Event

David Newey will be speaking on Artificial Intelligence at the HTN Live event in Leeds on 12 November 2025.

References

Gu, Y., Fu, J., Liu, X., Valanarasu, J. M. J., Codella, N., Tan, R., Liu, Q., Jin, Y., Zhang, S., Wang, J., Wang, R., Song, L., Qin, G., Usuyama, N., Wong, C., Hao, C., Lee, H., Sanapathi, P., Hilado, S., Jiang, B., Alvarez-Valle, J., Wei, M., Gao, J., Horvitz, E., Lungren, M., Poon, H., & Vozila, P. (2025). The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks. arXiv preprint arXiv:2509.18234. Available at: https://arxiv.org/abs/2509.18234v1 [Accessed 1 October 2025]