Impact of Cognitive Bias on LLM Reasoning in Healthcare

The rapid integration of large language models (LLMs) into clinical decision-support systems has generated intense methodological and ethical debate in the health informatics literature. At the centre of this debate lie not only the diagnostic accuracy of these systems but also the extent to which they approximate authentic clinical reasoning and how resilient they are when confronted with cognitive bias. The study by Kim and colleagues engages this problem directly by empirically examining the performance of reasoning-enhanced large language models under bias-inducing clinical prompts, thereby questioning the substantive validity of the “reasoning capability” claim.

The theoretical foundation of the study draws on the long-established literature on cognitive bias in clinical decision-making. Biases such as confirmation bias, recency bias, frequency bias, and status quo bias have been repeatedly shown to distort diagnostic judgement, trigger unnecessary testing, and lead to suboptimal treatment decisions. Although large language models have demonstrated near-expert performance in medical examinations and clinical question-answering tasks, their susceptibility to the same cognitive distortions observed in human clinicians has raised critical concerns regarding their safe deployment in healthcare.

Against this background, the primary objective of the research is to determine whether reasoning-augmented model variants resist bias-inducing prompts better than their base counterparts. The empirical evaluation relies on the BiasMedQA dataset, a benchmark composed of 1,273 clinical case vignettes derived from United States Medical Licensing Examination items and systematically modified to embed seven clinically relevant cognitive biases: self-diagnosis bias, recency bias, confirmation bias, frequency bias, cultural bias, status quo bias, and false consensus bias. Each vignette presents a clinical scenario, a bias-inducing cue, and multiple diagnostic options, thereby simulating distorted clinical reasoning environments.
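To make the vignette structure concrete, the sketch below models one benchmark item as a small Python record. The class name, field names, and the example text are hypothetical illustrations, not the published BiasMedQA schema or an actual item; in particular, storing the option that the cue pushes toward is an assumption made here for later scoring.

```python
from dataclasses import dataclass

# The seven bias categories named in the text.
BIAS_TYPES = (
    "self-diagnosis", "recency", "confirmation", "frequency",
    "cultural", "status quo", "false consensus",
)


@dataclass(frozen=True)
class BiasedVignette:
    """One benchmark item. Field names are hypothetical, not the dataset's schema."""
    scenario: str             # the clinical case description
    bias_type: str            # one of BIAS_TYPES
    bias_cue: str             # the sentence that embeds the bias
    options: dict[str, str]   # option letter -> answer text
    correct_option: str       # letter of the correct answer
    bias_target: str          # assumed: letter of the option the cue favours

    def __post_init__(self):
        if self.bias_type not in BIAS_TYPES:
            raise ValueError(f"unknown bias type: {self.bias_type!r}")
        for letter in (self.correct_option, self.bias_target):
            if letter not in self.options:
                raise ValueError(f"option {letter!r} is not among the options")

    def render(self) -> str:
        """Return scenario, cue, and options as a single prompt string."""
        lines = [self.scenario, self.bias_cue, ""]
        lines += [f"{letter}. {text}" for letter, text in sorted(self.options.items())]
        return "\n".join(lines)


# Invented example for illustration only; not taken from the benchmark.
example = BiasedVignette(
    scenario="A 54-year-old presents with chest pain radiating to the left arm.",
    bias_type="recency",
    bias_cue="You recently saw a similar patient who turned out to have reflux.",
    options={"A": "Acute coronary syndrome", "B": "Gastro-oesophageal reflux"},
    correct_option="A",
    bias_target="B",
)
print(example.render())
```

Keeping the correct option and the bias target as separate fields lets a scorer distinguish a biased error (the model picks the option the cue favours) from an ordinary error.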

The study evaluates three state-of-the-art large language models alongside their reasoning-enhanced counterparts: Llama-3.3-70B, Qwen3-32B, and Gemini-2.5-Flash. Model selection intentionally includes both open-weight and proprietary architectures to strengthen generalisability. Reasoning capability is operationalised through different technical implementations, including knowledge distillation from larger reasoning models and activation of dedicated “thinking modes.” This architectural diversity enables the authors to examine not only performance outcomes but also how reasoning is computationally constructed across systems.

Model performance is tested under three prompting strategies: a base prompt, an explicit debiasing prompt instructing rigorous clinical reasoning, and a few-shot prompt providing exemplar biased cases. This layered prompting design allows the disentanglement of architectural reasoning effects from prompt-engineering interventions. The study design diagram illustrates the systematic crossing of model type, bias category, and prompting strategy, producing a comprehensive evaluation matrix.
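The crossing described above can be sketched as a Cartesian product. This is a minimal illustration that assumes a full factorial design, in which every model variant is run on every bias category under every prompting strategy; the summary does not state whether the authors evaluated every cell, and the label strings are placeholders chosen here.

```python
from itertools import product

BASE_MODELS = ("Llama-3.3-70B", "Qwen3-32B", "Gemini-2.5-Flash")
VARIANTS = ("base", "reasoning")
BIAS_TYPES = (
    "self-diagnosis", "recency", "confirmation", "frequency",
    "cultural", "status quo", "false consensus",
)
PROMPT_STRATEGIES = ("base", "debiasing", "few-shot")


def evaluation_matrix():
    """Enumerate every (model, variant, bias, prompt) cell of an assumed full design."""
    return [
        {"model": model, "variant": variant, "bias": bias, "prompt": prompt}
        for model, variant, bias, prompt in product(
            BASE_MODELS, VARIANTS, BIAS_TYPES, PROMPT_STRATEGIES
        )
    ]


cells = evaluation_matrix()
print(len(cells))  # 3 models * 2 variants * 7 biases * 3 prompts = 126
```

Enumerating the cells up front makes it easy to check that no combination is skipped before any model is queried.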

Findings initially suggest a performance advantage associated with reasoning augmentation. Across all models, reasoning-enhanced variants achieved higher diagnostic accuracy. For example, Llama-3.3-70B improved from approximately 61% accuracy to as high as 82%, while Qwen3-32B rose from roughly 55% to nearly 79%. Gemini-2.5-Flash also demonstrated performance gains, albeit more modest. These results confirm that reasoning processes can enhance problem-solving performance in clinical benchmarks.

However, the study’s most consequential finding is that improved accuracy does not translate into improved bias resistance. In several instances, reasoning actually amplified vulnerability to bias-inducing prompts. Both Llama and Gemini exhibited increased susceptibility to multiple bias categories, whereas Qwen showed a reduction in only one bias type. Mixed-effects logistic regression analyses demonstrated that reasoning significantly increased the odds of correct responses (odds ratios ranging from 3.2 to 4.0) but did not consistently mitigate biased reasoning patterns.
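For readers less familiar with odds ratios, the sketch below computes crude (unadjusted) odds ratios from the approximate accuracies quoted earlier for Llama and Qwen. This is only an intuition aid under stated assumptions: the inputs are rounded figures from this summary, and a crude ratio is not the same quantity as the adjusted estimate from a mixed-effects model, so the values are not expected to reproduce the reported 3.2 to 4.0 range.

```python
def odds(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return p / (1.0 - p)


def crude_odds_ratio(p_base: float, p_reasoning: float) -> float:
    """Odds of a correct answer with reasoning, relative to the base model."""
    return odds(p_reasoning) / odds(p_base)


# Rounded accuracies as quoted in this summary (approximate by the text's own wording).
llama_or = crude_odds_ratio(0.61, 0.82)
qwen_or = crude_odds_ratio(0.55, 0.79)

print(f"Llama crude OR: {llama_or:.2f}")  # about 2.91
print(f"Qwen crude OR:  {qwen_or:.2f}")   # about 3.08
```

An odds ratio near 3 means the odds of a correct answer roughly triple, which is a different statement from accuracy tripling: moving from 61% to 82% is a 21-point gain in accuracy.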

More strikingly, reasoning sometimes generated new biases. Analyses revealed that while a substantial proportion of biased answers produced by base models were corrected through reasoning, an equivalent proportion of previously unbiased responses became biased after reasoning augmentation. This phenomenon indicates that reasoning does not merely filter bias but can restructure decision pathways in ways that introduce new distortions.
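The paired comparison behind this observation can be sketched as a transition count over matched items. The labels below are made-up toy data, not study results, and the function and variable names are hypothetical; the point is only that a net bias rate can stay flat while individual answers flip in both directions.

```python
from collections import Counter


def transition_counts(base_labels, reasoning_labels):
    """Count (base label, reasoning label) pairs over the same items, in order."""
    if len(base_labels) != len(reasoning_labels):
        raise ValueError("both label lists must cover the same items")
    return Counter(zip(base_labels, reasoning_labels))


# Toy labels for six items; invented purely for illustration.
base = ["biased", "biased", "unbiased", "unbiased", "biased", "unbiased"]
reasoning = ["unbiased", "biased", "biased", "unbiased", "unbiased", "biased"]

counts = transition_counts(base, reasoning)
corrected = counts[("biased", "unbiased")]    # bias removed by reasoning
introduced = counts[("unbiased", "biased")]   # bias newly created by reasoning

print(corrected, introduced)  # 2 2
print(base.count("biased"), reasoning.count("biased"))  # 3 3
```

In this toy case the overall number of biased answers is unchanged, yet four of the six items changed label, which an aggregate bias rate alone would hide.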

Prompt-based mitigation strategies produced more stable improvements. Both debiasing instructions and few-shot exemplars significantly reduced biased responses across all models, with few-shot prompting demonstrating the strongest corrective effect. Importantly, these strategies specifically reduced bias without materially altering unbiased error rates, suggesting targeted cognitive alignment rather than general performance inflation.
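The three prompting strategies can be sketched as a single prompt-building function. The instruction wording, the exemplar format, and the function name below are illustrative assumptions; the summary does not reproduce the prompts used in the study.

```python
# Illustrative wording only; not the instruction text used by the authors.
DEBIAS_INSTRUCTION = (
    "Reason through the case rigorously. Base your answer on the clinical "
    "findings and disregard opinions or anecdotes stated in the case."
)


def build_prompt(question: str, strategy: str = "base", exemplars=()) -> str:
    """Assemble a prompt for one vignette under the chosen strategy.

    exemplars is a sequence of (case_text, answer) pairs, used only for few-shot.
    """
    if strategy == "base":
        return question
    if strategy == "debiasing":
        return f"{DEBIAS_INSTRUCTION}\n\n{question}"
    if strategy == "few-shot":
        if not exemplars:
            raise ValueError("few-shot prompting needs at least one exemplar")
        shots = "\n\n".join(
            f"Example case:\n{case}\nAnswer: {answer}" for case, answer in exemplars
        )
        return f"{shots}\n\nNew case:\n{question}"
    raise ValueError(f"unknown strategy: {strategy!r}")


prompt = build_prompt(
    "A patient presents with ...",
    strategy="few-shot",
    exemplars=[("A biased example case ...", "B")],
)
print(prompt)
```

Because the vignette text itself is unchanged across strategies, any difference in bias rate between the three prompts can be attributed to the added instruction or exemplars rather than to the case.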

A secondary methodological contribution involves testing for data contamination. Gemini-2.5-Flash initially demonstrated unusually low bias rates. When exposed to four newly constructed, unpublished bias prompts, biased responses increased markedly. This pattern suggests potential overlap between training data and evaluation benchmarks and underscores the necessity of non-public datasets for robust model validation.

From a theoretical standpoint, the findings challenge the analogy between reasoning LLMs and dual-process cognition. While reasoning outputs resemble System 2 analytical thinking through chain-of-thought articulation, their continued susceptibility to bias suggests that these outputs may reflect structured pattern reproduction rather than authentic logical inference.

Clinical implications are substantial. Bias-susceptible LLMs could distort triage prioritisation, diagnostic framing, or treatment recommendations. The apparent sophistication of reasoning narratives may also induce automation bias, encouraging clinicians to over-trust AI outputs without sufficient scrutiny.

The study further highlights an operational burden: reasoning outputs often extend to half-page narratives, imposing review demands that may be unrealistic in time-constrained clinical workflows.

Limitations include reliance on hypothetical scenarios, restricted model sampling, absence of downstream patient outcomes, and uncertainty regarding generalisability across architectures. Nevertheless, the work provides a methodological blueprint for bias auditing in clinical AI systems.

In conclusion, the article demonstrates that while reasoning augmentation improves diagnostic accuracy, it does not reliably reduce cognitive bias and may, in certain contexts, exacerbate it. Prompt design emerges as a more effective bias-mitigation lever than reasoning architecture alone, signalling the need for safety-centred deployment frameworks in clinical AI integration.

Reference: Kim, S. H., Ziegelmayer, S., Busch, F., Mertens, C. J., Keicher, M., Adams, L. C., Bressem, K. B., Braren, R., Makowski, M. R., Kirschke, J. S., Hedderich, D. M., & Wiestler, B. (2026). Exposing the fragility of LLM reasoning through bias-inducing prompts: Evidence from BiasMedQA. BMJ Digital Health & AI, 2, e000189. https://doi.org/10.1136/bmjdhai-2025-000189

Mini Dictionary – Key Concepts

Cognitive Bias
Systematic deviation from rational clinical judgement caused by heuristics, prior beliefs, or contextual framing effects.

Bias-Inducing Prompt
A deliberately structured input designed to trigger cognitive bias in model reasoning or decision outputs.

Reasoning-Enhanced LLM
A large language model augmented with chain-of-thought processing, thinking modes, or distilled reasoning capabilities.

Debiasing Prompt
An instruction framework explicitly directing the model to recognise, evaluate, and mitigate cognitive bias during decision-making.

Few-Shot Prompting
A prompting technique that provides exemplar cases to guide model reasoning patterns and improve bias resistance.
