Impact of Cognitive Bias on LLM Reasoning in Healthcare

Mehmet Nurullah Kurutkan, February 5, 2026

The rapid integration of large language models into clinical decision-support systems has generated an intense methodological and ethical debate within the health informatics literature. At the centre of this debate lies not only the diagnostic accuracy of these systems but also the extent to which they approximate authentic clinical reasoning and how resilient they are when confronted with cognitive bias. The study conducted by Kim and colleagues directly engages with this problem by empirically examining the performance of reasoning-enhanced large language models under bias-inducing clinical prompts, thereby questioning the substantive validity of the “reasoning capability” claim.

The theoretical foundation of the study draws on the long-established literature on cognitive bias in clinical decision-making. Biases such as confirmation bias, recency bias, frequency bias, and status quo bias have been repeatedly shown to distort diagnostic judgement, trigger unnecessary testing, and lead to suboptimal treatment decisions. Although large language models have demonstrated near-expert performance in medical examinations and clinical question-answering tasks, their susceptibility to the same cognitive distortions observed in human clinicians has raised critical concerns regarding their safe deployment in healthcare.

Against this background, the primary objective of the research is to determine whether reasoning-augmented model variants exhibit greater resistance to bias-inducing prompts. The empirical evaluation relies on the BiasMedQA dataset, a benchmark composed of 1,273 clinical case vignettes derived from United States Medical Licensing Examination items and systematically modified to embed seven clinically relevant cognitive biases. These include self-diagnosis bias, recency bias, confirmation bias, frequency bias, cultural bias, status quo bias, and false consensus bias. Each vignette presents a clinical scenario, a bias-inducing cue, and multiple diagnostic options, thereby simulating distorted clinical reasoning environments.
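To make the structure of such a vignette concrete, the following is a minimal Python sketch of how one item might be represented and how a model's answer could be sorted into correct, biased, or other error. The class name, field names, and the idea that each cue points at one specific wrong option (`bias_target`) are assumptions made for illustration; they are not taken from the BiasMedQA release, and the example case is a placeholder rather than a real dataset item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiasedVignette:
    """Hypothetical container for one bias-injected case (field names are assumptions)."""
    scenario: str            # clinical case text
    options: dict[str, str]  # option letter -> answer text
    correct: str             # letter of the correct option
    bias_type: str           # e.g. "recency", "confirmation"
    bias_cue: str            # sentence that nudges toward a wrong option
    bias_target: str         # letter the cue points at (assumed to be a wrong option)

    def prompt_text(self) -> str:
        """Join scenario, cue, and lettered options into one prompt string."""
        lines = [self.scenario, self.bias_cue, ""]
        lines += [f"{letter}. {text}" for letter, text in sorted(self.options.items())]
        return "\n".join(lines)


def classify(vignette: BiasedVignette, answer: str) -> str:
    """Label an answer as 'correct', 'biased' (follows the cue), or 'other_error'."""
    if answer == vignette.correct:
        return "correct"
    if answer == vignette.bias_target:
        return "biased"
    return "other_error"


# Placeholder item, not drawn from the dataset.
example = BiasedVignette(
    scenario="A patient presents with symptoms X and Y.",
    options={"A": "Diagnosis 1", "B": "Diagnosis 2", "C": "Diagnosis 3"},
    correct="A",
    bias_type="recency",
    bias_cue="You recently saw a similar patient who turned out to have Diagnosis 2.",
    bias_target="B",
)

print(classify(example, "B"))
```

Separating "biased" from "other_error" matters later in the analysis, because a mitigation strategy can lower one rate without touching the other.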

The study evaluates three state-of-the-art large language models alongside their reasoning-enhanced counterparts: Llama-3.3-70B, Qwen3-32B, and Gemini-2.5-Flash. Model selection intentionally includes both open-weight and proprietary architectures to strengthen generalisability. Reasoning capability is operationalised through different technical implementations, including knowledge distillation from larger reasoning models and activation of dedicated “thinking modes.” This architectural diversity enables the authors to examine not only performance outcomes but also how reasoning is computationally constructed across systems.

Model performance is tested under three prompting strategies: a base prompt, an explicit debiasing prompt instructing rigorous clinical reasoning, and a few-shot prompt providing exemplar biased cases. This layered prompting design allows the disentanglement of architectural reasoning effects from prompt-engineering interventions. The study design diagram illustrates the systematic crossing of model type, bias category, and prompting strategy, producing a comprehensive evaluation matrix.
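The crossing described above can be sketched as a Cartesian product. The snippet below assumes a fully factorial design in which every model, variant, bias, and prompting strategy is combined with every other; the summary does not state whether the study ran every cell, and the variant labels "base" and "reasoning" are naming choices made here.

```python
from itertools import product

MODELS = ["Llama-3.3-70B", "Qwen3-32B", "Gemini-2.5-Flash"]
VARIANTS = ["base", "reasoning"]  # labels chosen for this sketch
BIASES = [
    "self-diagnosis", "recency", "confirmation", "frequency",
    "cultural", "status quo", "false consensus",
]
PROMPT_STRATEGIES = ["base", "debiasing", "few-shot"]


def evaluation_grid() -> list[tuple[str, str, str, str]]:
    """Return every (model, variant, bias, prompt strategy) combination."""
    return list(product(MODELS, VARIANTS, BIASES, PROMPT_STRATEGIES))


grid = evaluation_grid()
print(len(grid))  # 3 models * 2 variants * 7 biases * 3 strategies = 126
```

Each cell of such a grid would then be evaluated over the vignette set, which is what lets the effect of the model variant be separated from the effect of the prompt.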

Findings initially suggest a performance advantage associated with reasoning augmentation. Across all models, reasoning-enhanced variants achieved higher diagnostic accuracy. For example, Llama-3.3-70B improved from approximately 61% accuracy to as high as 82%, while Qwen3-32B rose from roughly 55% to nearly 79%. Gemini-2.5-Flash also showed gains, albeit more modest ones. These results indicate that reasoning processes can enhance problem-solving performance on clinical benchmarks.

However, the study’s most consequential finding is that improved accuracy does not translate into improved bias resistance. In several instances, reasoning actually amplified vulnerability to bias-inducing prompts. Both Llama and Gemini exhibited increased susceptibility to multiple bias categories, whereas Qwen showed a reduction in only one bias type. Mixed-effects logistic regression analyses demonstrated that reasoning significantly increased the odds of correct responses (odds ratios ranging from 3.2 to 4.0) but did not consistently mitigate biased reasoning patterns.
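For readers less used to odds ratios, the short sketch below shows how an odds ratio rescales a baseline probability. It is purely illustrative arithmetic: odds ratios from a mixed-effects model are conditional on the random effects, so applying them to a raw accuracy figure does not reproduce the study's numbers. The 61% baseline is borrowed from the Llama figure quoted earlier only as a convenient starting point, and the function name is an assumption.

```python
def apply_odds_ratio(p_base: float, odds_ratio: float) -> float:
    """Convert a baseline probability to odds, scale by the odds ratio, convert back."""
    if not 0.0 < p_base < 1.0:
        raise ValueError("p_base must lie strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    odds = p_base / (1.0 - p_base)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Lower and upper ends of the reported odds-ratio range, applied to a 61% baseline.
low = apply_odds_ratio(0.61, 3.2)
high = apply_odds_ratio(0.61, 4.0)
print(round(low, 2), round(high, 2))  # 0.83 0.86
```

The point of the exercise is scale: an odds ratio above 3 is a large shift in accuracy, which makes the absence of a matching shift in bias resistance all the more notable.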

More strikingly, reasoning sometimes generated new biases. Analyses revealed that while a substantial proportion of biased answers produced by base models were corrected through reasoning, an equivalent proportion of previously unbiased responses became biased after reasoning augmentation. This phenomenon indicates that reasoning does not merely filter bias but can restructure decision pathways in ways that introduce new distortions.
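This kind of paired comparison can be sketched as a transition count over the same vignettes answered by a base model and by its reasoning variant. The function name, the label strings, and the toy data are all assumptions made for illustration and do not reflect the authors' code or results.

```python
from collections import Counter


def bias_transitions(base_labels: list[str], reasoning_labels: list[str]) -> Counter:
    """Count how each paired response moves between biased and not-biased states.

    Both lists hold one label per vignette ('correct', 'biased', or 'other_error'),
    aligned by index.
    """
    if len(base_labels) != len(reasoning_labels):
        raise ValueError("label lists must be the same length")
    counts: Counter = Counter()
    for base, reasoning in zip(base_labels, reasoning_labels):
        base_biased = base == "biased"
        reasoning_biased = reasoning == "biased"
        if base_biased and not reasoning_biased:
            counts["corrected"] += 1
        elif not base_biased and reasoning_biased:
            counts["newly_biased"] += 1
        elif base_biased and reasoning_biased:
            counts["persistently_biased"] += 1
        else:
            counts["never_biased"] += 1
    return counts


# Toy labels for six vignettes (made up for this sketch).
base = ["biased", "biased", "correct", "correct", "other_error", "biased"]
reasoning = ["correct", "biased", "biased", "correct", "biased", "correct"]
print(bias_transitions(base, reasoning))
```

In the toy data, two responses are corrected and two become newly biased, so the overall bias rate barely moves even though individual answers change. That is the pattern the paragraph above describes: a net rate can hide substantial churn underneath.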

Prompt-based mitigation strategies produced more stable improvements. Both debiasing instructions and few-shot exemplars significantly reduced biased responses across all models, with few-shot prompting demonstrating the strongest corrective effect. Importantly, these strategies specifically reduced bias without materially altering unbiased error rates, suggesting targeted cognitive alignment rather than general performance inflation.
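As a rough illustration of how the three prompting strategies differ mechanically, the sketch below wraps the same vignette text in each of them. The instruction wording, the exemplar format, and the function name are hypothetical; the summary does not reproduce the study's actual prompt text.

```python
# Hypothetical wording; the study's actual debiasing instruction is not quoted here.
DEBIAS_INSTRUCTION = (
    "Reason through the case step by step and base your answer on the clinical "
    "findings only. Disregard opinions or anecdotes mentioned in the case."
)


def build_prompt(vignette_text: str, strategy: str,
                 exemplars: tuple[tuple[str, str], ...] = ()) -> str:
    """Wrap a vignette according to one of three strategies.

    exemplars holds (case text, correct answer) pairs and is only used for 'few-shot'.
    """
    if strategy == "base":
        return vignette_text
    if strategy == "debiasing":
        return f"{DEBIAS_INSTRUCTION}\n\n{vignette_text}"
    if strategy == "few-shot":
        if not exemplars:
            raise ValueError("few-shot prompting needs at least one exemplar")
        shots = "\n\n".join(
            f"Example case:\n{case}\nCorrect answer: {answer}"
            for case, answer in exemplars
        )
        return f"{shots}\n\nNow answer this case:\n{vignette_text}"
    raise ValueError(f"unknown strategy: {strategy}")


case = "A patient presents with symptoms X and Y. A colleague is sure it is Diagnosis 2."
demo = (("Placeholder biased case text.", "A"),)
print(build_prompt(case, "few-shot", demo))
```

The vignette itself is identical in all three conditions; only the surrounding text changes, which is why any difference in bias rate can be attributed to the prompt rather than the case.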

A secondary methodological contribution involves testing for data contamination. Gemini-2.5-Flash initially demonstrated unusually low bias rates. When exposed to four newly constructed, unpublished bias prompts, biased responses increased markedly. This pattern suggests potential overlap between training data and evaluation benchmarks and underscores the necessity of non-public datasets for robust model validation.
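One simple way to quantify a jump of this kind is to compare the biased-response rate on published prompts with the rate on newly written ones. The sketch below uses a pooled two-proportion z statistic, which is an illustrative choice and not the analysis reported by the authors; the counts are invented for the example.

```python
import math


def two_proportion_z(biased_published: int, n_published: int,
                     biased_new: int, n_new: int) -> float:
    """Pooled two-proportion z statistic for (new rate - published rate)."""
    if n_published <= 0 or n_new <= 0:
        raise ValueError("sample sizes must be positive")
    p_published = biased_published / n_published
    p_new = biased_new / n_new
    pooled = (biased_published + biased_new) / (n_published + n_new)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_published + 1.0 / n_new))
    if std_err == 0.0:
        raise ValueError("rates are identical at 0 or 1; z is undefined")
    return (p_new - p_published) / std_err


# Invented counts: 10 of 100 biased on published prompts, 30 of 100 on new prompts.
z = two_proportion_z(10, 100, 30, 100)
print(z)
```

A large positive z would indicate that the model is noticeably more bias-prone on prompts it cannot have seen during training, which is the signature of contamination the paragraph above describes.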

From a theoretical standpoint, the findings challenge the analogy between reasoning LLMs and dual-process cognition. While reasoning outputs resemble system-2 analytical thinking through chain-of-thought articulation, their continued susceptibility to bias suggests that these outputs may reflect structured pattern reproduction rather than authentic logical inference.

Clinical implications are substantial. Bias-susceptible LLMs could distort triage prioritisation, diagnostic framing, or treatment recommendations. The apparent sophistication of reasoning narratives may also induce automation bias, encouraging clinicians to over-trust AI outputs without sufficient scrutiny.

The study further highlights an operational burden: reasoning outputs often extend to half-page narratives, imposing review demands that may be unrealistic in time-constrained clinical workflows.

Limitations include reliance on hypothetical scenarios, restricted model sampling, absence of downstream patient outcomes, and uncertainty regarding generalisability across architectures. Nevertheless, the work provides a methodological blueprint for bias auditing in clinical AI systems.

In conclusion, the article demonstrates that while reasoning augmentation improves diagnostic accuracy, it does not reliably reduce cognitive bias and may, in certain contexts, exacerbate it. Prompt design emerges as a more effective bias-mitigation lever than reasoning architecture alone, signalling the need for safety-centred deployment frameworks in clinical AI integration.

Reference: Kim, S. H., Ziegelmayer, S., Busch, F., Mertens, C. J., Keicher, M., Adams, L. C., Bressem, K. B., Braren, R., Makowski, M. R., Kirschke, J. S., Hedderich, D. M., & Wiestler, B. (2026). Exposing the fragility of LLM reasoning through bias-inducing prompts: Evidence from BiasMedQA. BMJ Digital Health & AI, 2, e000189. https://doi.org/10.1136/bmjdhai-2025-000189

Mini Dictionary – Key Concepts

Cognitive Bias
Systematic deviation from rational clinical judgement caused by heuristics, prior beliefs, or contextual framing effects.

Bias-Inducing Prompt
A deliberately structured input designed to trigger cognitive bias in model reasoning or decision outputs.

Reasoning-Enhanced LLM
A large language model augmented with chain-of-thought processing, thinking modes, or distilled reasoning capabilities.

Debiasing Prompt
An instruction framework explicitly directing the model to recognise, evaluate, and mitigate cognitive bias during decision-making.

Few-Shot Prompting
A prompting technique that provides exemplar cases to guide model reasoning patterns and improve bias resistance.
