Navigating LLMs in Healthcare: Pitfalls and Potentials

The dominant narrative in the field is that large language models (LLMs) have broad potential in healthcare, but should be treated as “limited yet useful assistive tools” that are safe only under careful use. Most studies show that LLMs can achieve adequate accuracy and speed on specific tasks such as coding, summarization, clinical decision support and patient education, yet conclude that they should not sit at the center of clinical decision-making without human oversight and additional technical safeguards, given concerns about hallucination, bias and opacity (Dormosh et al., 2025; Ong et al., 2025; Xu et al., 2025).

Tension 1 – General-purpose “miracle model” or narrow task-specific toolbox?

This tension combines methodological, evidentiary and practical conflicts. In marketing and popular discourse, LLMs are framed as general-purpose artificial intelligence capable of “zero-shot” generalization over medical knowledge. However, the synthesis of current studies shows that in their raw form these models often perform at chance level or make critical errors on many clinical tasks, while strong performance typically requires substantial task-specific fine-tuning, retrieval-augmented generation (RAG) or carefully engineered expert interfaces (Dormosh et al., 2025; Erden et al., 2025; Pesapane et al., 2025).

A first source of tension is task diversity. In ESC heart failure guideline adherence classification, six general-purpose LLMs achieve macro F1 scores below 0.40 in most settings, whereas ORPO-based task-specific fine-tuning raises F1 to at least 0.90 for four models (Dormosh et al., 2025). At the same time, GPT-4V is highly sensitive for edema on sacroiliitis MRI but fails on chronic structural changes (Erden et al., 2025), and GPT-4 reaches only around 50% sensitivity and 37.5% specificity on mammography (Pesapane et al., 2025). Locally fine-tuned Llama-3 for structuring BI-RADS reports matches the performance of the commercial Qwen-Max, suggesting that with appropriate data and schemas, local models can be highly competitive (Sheng et al., 2025).
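The figures above can be made more tangible with a short sketch. Assuming only the reported mammography sensitivity (50%) and specificity (37.5%), the snippet below derives balanced accuracy and Youden's J, two standard summaries for which 0.5 and 0 respectively correspond to chance. It also shows how a macro F1 score is computed as the unweighted mean of per-class F1. The function names and the toy labels in the usage note are illustrative assumptions, not taken from the cited studies.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity; 0.5 corresponds to chance."""
    return (sensitivity + specificity) / 2.0


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic; 0 corresponds to chance, negative is worse."""
    return sensitivity + specificity - 1.0


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Reported mammography figures (Pesapane et al., 2025).
print(balanced_accuracy(0.50, 0.375))  # 0.4375, below the chance level of 0.5
print(youden_j(0.50, 0.375))           # -0.125
```

On hypothetical toy labels such as macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), the per-class F1 scores are averaged without regard to class frequency, which is why macro F1 penalizes models that neglect rare guideline-adherence categories.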

A second source is input modality. LLMs perform much better on text-based, structurally well-defined tasks such as risk-of-bias (RoB) assessment or rewriting hospital discharge summaries, while multimodal models still struggle with pure image interpretation (Erden et al., 2025; Mine et al., 2025; Pesapane et al., 2025). A third is the role assigned to the model: in some studies, LLMs act as primary decision-makers (for example, automatic RoB assessment or image interpretation); in others, they serve as data generators or auxiliary tools, such as generating synthetic bone scintigraphy images to augment small datasets (Haberl et al., 2025). A fourth source is the unresolved question of what constitutes “sufficient” performance: many papers report improvement on F1 or AUC, but rarely specify whether these correspond to clinically acceptable error thresholds.

The evidentiary balance indicates that task-specific fine-tuning, RAG and dedicated data schemas can yield high accuracy for narrowly defined tasks, and this finding is consistent across multiple studies (Dormosh et al., 2025; Sheng et al., 2025; Choi et al., 2025; Haberl et al., 2025). By contrast, the claim that “a general-purpose, untuned model can solve medicine” is not supported; repeated evidence from sacroiliitis MRI, pediatric panoramic radiography and mammography shows serious performance limitations (Erden et al., 2025; Mine et al., 2025; Pesapane et al., 2025). The asymmetry arises because optimistic, “works everywhere” claims are often extrapolated from a small set of impressive success stories, whereas systematic evaluations tend to show raw models performing inadequately.

Current deadlocks include the absence of clear clinical acceptance thresholds, inconsistent use of accuracy metrics as surrogates for patient safety, and the near absence of systematic analysis of the costs of customization. Questions such as how many tasks a hospital would need to support with separate fine-tuned models, and how to manage model maintenance, drift and governance across this portfolio, remain largely unaddressed. Most data are single-center and short-term, with little testing of distributional shift across institutions or populations.

Potential research leverage points include multi-task, multi-center portfolio experiments comparing raw models, prompt-only strategies, RAG and task-specific fine-tuning across 10–15 heterogeneous tasks (coding, summarization, decision support, imaging), with outcomes such as accuracy, integration time, maintenance cost and model drift. A second area is real-world generalizability studies that follow models trained on one institution into other sites and populations, tracking performance decay, error types and changes in clinical behavior. A third is formal cost-effectiveness analysis contrasting “many narrow models” with “one large general-purpose model” in terms of long-term maintenance, harm from errors and infrastructure costs.

Tension 2 – Hallucination: solvable engineering problem or structural defect?

This tension involves conceptual, methodological, normative and practical conflicts. Hallucination is one of the primary safety concerns for LLM use in healthcare. Some studies report that RAG, multi-source evidence, example-based fine-tuning and safety prompts can reduce hallucinations to “acceptable” levels (Xu et al., 2025; Nishisako et al., 2025; Giuffre et al., 2025). Others show that errors remain stubborn and frequent in adversarial or complex clinical contexts: safety prompting reduces but does not eliminate hallucinations, and lowering temperature has little effect (Omar et al., 2025; Esmaeilzadeh, 2025; Zada et al., 2025; Hose et al., 2025).

A major source of conflict is definitional. Some studies define hallucination as “medically incorrect or fabricated information” (Nishisako et al., 2025), others as “acceptance of a false element introduced by an adversarial prompt” (Omar et al., 2025), and others as “an error that can affect patient safety” (Hose et al., 2025). These non-aligned definitions preclude direct comparison of reported rates. A second source is the focus of technical interventions. Frameworks such as MEGA-RAG use multi-source retrieval, re-ranking and contradiction-aware response correction, reducing hallucination by more than 40% in some settings (Xu et al., 2025). A Japanese cancer information chatbot that performs RAG over trusted sources exhibits dramatically fewer hallucinations than a conventional model but refuses to answer out-of-scope questions, reducing response coverage (Nishisako et al., 2025). Under adversarial hallucination attacks, six models show hallucination rates of 50–82%; safety prompting reduces but does not remove these, and lowering temperature has no meaningful effect (Omar et al., 2025).

A third source is clinical context and risk tolerance. In low-risk environments such as mental health support bots equipped with strong guardrails, hallucinations may be practically invisible (Campellone et al., 2025). In high-risk domains such as complex clinical decision-making or medication management, even a 10–15% rate of critical safety events is unacceptable (Esmaeilzadeh, 2025).

The evidence base suggests that RAG over reliable sources and strict guardrails can meaningfully reduce hallucinations (Xu et al., 2025; Nishisako et al., 2025; Campellone et al., 2025). Yet adversarial scenarios and free-text patient queries continue to yield persistent, high error rates (Omar et al., 2025; Zada et al., 2025; Hose et al., 2025). On this basis, claims that “the hallucination problem is solved” are untenable; a more defensible conclusion is that hallucinations can be managed in tightly controlled, narrow settings.

Key deadlocks include the absence of a unified metric capturing both the technical and normative dimensions of hallucination, the lack of real-world longitudinal data on patient harm, and unresolved trade-offs between non-response (“I don’t know”) and access. Hallucination is rarely weighted by harm severity, and long-term observational data on patient outcomes are essentially absent.

Research leverage points include the development of risk-weighted hallucination metrics based on expert panels that assign higher costs to life-threatening treatment errors compared to minor factual inaccuracies, followed by multi-center, multi-language comparisons of RAG and generic deployments using this scale. A second is embedded observational work in patient portals or information services, logging model outputs, human moderation, hallucinations and complaints or adverse events over time. A third is experimentation with hallucination–coverage trade-offs by varying guardrail aggressiveness and measuring changes in user satisfaction, knowledge gain and risk of harmful behavior.
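A risk-weighted hallucination metric of the kind proposed here could take many forms. The following is a minimal sketch, assuming each response has been rated by experts for the worst hallucination it contains and flagged as answered or refused. The severity categories, the numeric weights and the names RatedResponse and summarize are hypothetical placeholders; an expert panel would need to define and validate the real scale.

```python
from dataclasses import dataclass

# Hypothetical severity weights; an expert panel would set the real values.
SEVERITY_WEIGHTS = {
    "none": 0.0,
    "minor": 1.0,
    "moderate": 3.0,
    "life_threatening": 10.0,
}


@dataclass
class RatedResponse:
    answered: bool  # False if the model refused or said "I don't know"
    severity: str   # worst hallucination severity found in the response


def summarize(responses):
    """Return coverage, raw hallucination rate and severity-weighted harm."""
    n = len(responses)
    answered = [r for r in responses if r.answered]
    k = len(answered)
    coverage = k / n if n else 0.0
    raw_rate = sum(r.severity != "none" for r in answered) / k if k else 0.0
    weighted = sum(SEVERITY_WEIGHTS[r.severity] for r in answered) / k if k else 0.0
    return {"coverage": coverage, "raw_rate": raw_rate, "weighted_harm": weighted}


ratings = [
    RatedResponse(True, "none"),
    RatedResponse(True, "none"),
    RatedResponse(True, "minor"),
    RatedResponse(True, "life_threatening"),
    RatedResponse(False, "none"),  # refusal: lowers coverage, adds no harm
]
print(summarize(ratings))
```

Reporting coverage next to the weighted score keeps the non-response trade-off visible: a system can lower its weighted harm simply by refusing more often, and the two numbers together show when that is happening.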

Tension 3 – Patient education: readability, completeness and responsibility

This is primarily a practical, normative and methodological tension. LLMs appear highly effective at simplifying and personalizing patient education materials, yet precisely this simplification process risks omitting critical safety content and misdirecting patients. At the same time, newer models seem to include fewer medical warnings and disclaimers over time (Sharma et al., 2025), intensifying the trade-off between comprehensibility, safety and responsibility.

Task and topic differences are central. In relatively well-defined, frequent conditions such as cardiac rehabilitation, urogynecology or post-operative rhinoplasty, ChatGPT-4/4o achieves high accuracy and safety ratings (Coskun & Cetinkaya, 2025; Vurture et al., 2025; Ibas et al., 2025). By contrast, in more niche or complex domains such as ENT emergencies or rhinology, ChatGPT-3.5 provides incomplete or wrong answers for most questions, and experts can easily recognize these deficiencies (Soon & Perry, 2025; Huang et al., 2025).

A second source is the difference between “rewriting” and “de novo generation.” GPT-4 can reduce institutional heart failure education materials from 10th to 7th grade reading level while sometimes making them more comprehensive, with high accuracy scores (King et al., 2025). However, when transforming discharge letters into patient-friendly letters, it captures 78% of safety-related learning objectives but systematically omits 15% of them, particularly those regarding complication prevention and understanding, and makes medical errors in 3.9% of sentences (Eisinger et al., 2025).

Free-form self-diagnosis scenarios further expose limits. In the EvalPrompt study, only 31% of ChatGPT-4 responses to realistic, open-ended patient queries are judged both correct and clear; the remainder are ambiguous or erroneous (Zada et al., 2025). Concurrently, analysis of the TIMed-Q dataset shows that between 2022 and 2025, the proportion of medical responses containing disclaimers falls from around 20% to below 1%, with newer models often omitting warnings entirely (Sharma et al., 2025).

The balance of evidence indicates strong capabilities in simplifying institutional materials and answering certain patient questions accurately (Coskun & Cetinkaya, 2025; Ibas et al., 2025; King et al., 2025), but repeated limitations in free self-diagnosis and complex sub-specialty domains (Huang et al., 2025; Soon & Perry, 2025; Zada et al., 2025). Declining use of disclaimers and the lack of standardized safety messaging add a further risk layer (Sharma et al., 2025).

Current deadlocks include reliance on readability metrics such as FKGL without systematic mapping of which content is omitted. Critical topics like complication prevention, medication side effects and alarm symptoms are often missing, but these omissions have not been rigorously coded across conditions. No study has yet assessed the medium- or long-term behavioral impact of LLM-based patient education on adherence, emergency visits or self-care. Disclaimers are treated in a binary fashion (present/absent), with little attention to content, placement, tone or cultural appropriateness.

Research leverage points include safety-sensitive content analyses that define standard sets of “critical safety messages” for specific conditions and code LLM outputs against these lists at scale. Randomized trials in clinical portals or discharge settings could compare LLM-simplified versus conventional materials over 3–6 months in terms of adherence, emergency visits, complications and health literacy. Experimental work on disclaimer design could vary length, specificity and modality to examine effects on trust, warning neglect and intention to consult professionals.
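Coding outputs against a list of critical safety messages could be prototyped along the following lines. This is only a sketch: the heart failure topics and keywords in CRITICAL_MESSAGES are hypothetical examples rather than a validated list, and simple keyword matching is a crude stand-in for the expert coding that such a study would require.

```python
# Hypothetical checklist: topic -> phrases that signal the topic is covered.
CRITICAL_MESSAGES = {
    "alarm_symptoms": ["shortness of breath", "chest pain"],
    "weight_monitoring": ["weigh yourself", "daily weight"],
    "medication": ["do not stop", "side effect"],
}


def code_output(text, checklist):
    """Mark each critical topic as present (True) or omitted (False)."""
    lowered = text.lower()
    return {
        topic: any(keyword in lowered for keyword in keywords)
        for topic, keywords in checklist.items()
    }


def omission_rate(coded):
    """Share of critical topics missing from one output."""
    return sum(not present for present in coded.values()) / len(coded)


simplified = "Weigh yourself every morning. Call your doctor if you have chest pain."
coded = code_output(simplified, CRITICAL_MESSAGES)
print(coded)
print(omission_rate(coded))
```

Unlike a readability score, the per-topic result shows which content was lost during simplification, so omissions can be aggregated by topic across conditions.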

Tension 4 – Justice: do LLMs reduce or reproduce health inequities?

This tension is normative, political, evidentiary and methodological. Some studies show that LLMs express explicit or implicit biases based on race, gender, income, insurance status or LGBTQIA+ identity, generating discriminatory recommendations (Chang et al., 2025; Omar et al., 2025; Liu et al., 2025; Bouguettaya et al., 2025; Currie et al., 2025; Huang et al., 2025). Others find near-neutral behavior with respect to race or gender on certain tasks, and even suggest that models may be less biased than humans in some contexts (Hanna et al., 2025; Young et al., 2025; Wan et al., 2025). Frameworks such as FairMed promise technical fairness improvements (Lu et al., 2025). Consequently, the question “Are LLMs structurally biased?” has no single answer; bias appears highly task- and context-dependent.

Task and output heterogeneity drive much of the disagreement. Chang et al. (2025) assess anti-LGBTQIA+ medical bias using 38 prompts composed of clinical notes and questions, finding that 43–65% of responses from four models are inappropriate, often due to hallucinated or biased content. Omar et al. (2025) generate 1.7 million responses for 1,000 emergency department vignettes with 32 sociodemographic variations; some groups (e.g., Black, homeless, LGBTQIA+) are funneled toward unnecessary invasive tests, mental health evaluation or more aggressive triage, while high-income labels receive more advanced diagnostics. Liu et al. (2025) show that three Chinese LLMs systematically disadvantage women, low-income and uninsured patients in educational examples and treatment suggestions.

By contrast, studies on HIV discharge instructions find no meaningful linguistic differences across racial or ethnic groups in four LLMs (Hanna et al., 2025), and opioid recommendations do not show race or gender effects in another evaluation (Young et al., 2025). Wan et al. (2025) similarly report near-neutral behavior on certain standardized tasks.

Fairness metrics also diverge. Some work uses statistical parity difference, disparate impact ratio or action disparity indices and reports improvements under fairness-aware training (Lu et al., 2025). Others rely on expert clinical coders to rate “worse treatment” (Bouguettaya et al., 2025) or simple distributional comparisons of recommendations (Currie et al., 2025; Huang et al., 2025). These metrics are rarely linked to guideline concordance or health outcomes. Furthermore, there is evidence of a performance–fairness trade-off: larger models can generate higher-quality synthetic EHRs but also more polarized racial and gender patterns (Huang et al., 2025), while FairMed-like approaches improve fairness metrics with limited analysis of their impact on clinical correctness (Lu et al., 2025).
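For readers less familiar with the group-fairness metrics named above, the sketch below computes statistical parity difference and disparate impact ratio under their common definitions: the difference and the ratio, respectively, of favorable-outcome rates between an unprivileged and a privileged group. The binary outcome ("advanced diagnostics recommended") and the toy data are hypothetical, and sign conventions vary between toolkits.

```python
def positive_rate(outcomes):
    """Share of cases receiving the favorable outcome (coded as 1)."""
    if not outcomes:
        raise ValueError("group has no cases")
    return sum(outcomes) / len(outcomes)


def statistical_parity_difference(unprivileged, privileged):
    """Rate difference; 0 means parity, negative disfavors the unprivileged group."""
    return positive_rate(unprivileged) - positive_rate(privileged)


def disparate_impact_ratio(unprivileged, privileged):
    """Rate ratio; 1 means parity, values below 1 disfavor the unprivileged group."""
    reference = positive_rate(privileged)
    if reference == 0:
        raise ValueError("privileged group has no favorable outcomes")
    return positive_rate(unprivileged) / reference


# Hypothetical vignette results: 1 = advanced diagnostics recommended.
low_income = [1, 0, 0, 0]
high_income = [1, 1, 0, 0]
print(statistical_parity_difference(low_income, high_income))  # -0.25
print(disparate_impact_ratio(low_income, high_income))         # 0.5
```

Neither number says whether the recommended diagnostics were clinically indicated, which is the gap the text identifies between these metrics and guideline concordance.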

Overall, there is strong evidence of substantial injustices in certain high-risk scenarios (Chang et al., 2025; Omar et al., 2025; Liu et al., 2025; Bouguettaya et al., 2025; Currie et al., 2025), alongside examples of neutral performance in tasks such as standardized discharge instructions or analgesia recommendations (Hanna et al., 2025; Young et al., 2025; Wan et al., 2025). This pattern suggests that fairness issues are less intrinsic properties of models and more emergent properties of specific tasks, datasets and prompt designs.

Deadlocks include the proliferation of partially overlapping, sometimes conflicting fairness notions (statistical parity, equalized odds, perceived unfair treatment) and their weak connection to clinical guidelines or real-world care standards. Evidence is heavily concentrated in the United States and China, with minimal work in the Global South, rural contexts or minority languages. Cross-sectional designs dominate, leaving the real influence of model recommendations on clinician behavior underexplored.

Promising research directions include guideline-linked fairness assessments that simultaneously evaluate both fairness metrics and guideline adherence on the same vignette sets, for example in chest pain or hypertension scenarios with systematically varied race and income labels. Interface and context experiments could manipulate the salience of demographic information in prompts and measure how this influences recommendations. Finally, intersectional, multi-country benchmark datasets incorporating diverse health systems and identities could enable clearer evaluation of fairness-aware methods such as FairMed across contexts.

Tension 5 – Clinical decision support: co-pilot or autopilot?

This tension is ethical, theoretical and practical. At one extreme, LLMs function as “intelligent assistants” or co-pilots alongside clinicians; at the other, they automate specific decision tasks such as RoB assessment, prescription checking or triage. Evidence suggests clear benefits in co-pilot configurations but growing risks of overreliance, cognitive bias and blurred accountability as autonomy increases (Ong et al., 2025; Bentegeac et al., 2025; Degany et al., 2025; Esmaeilzadeh, 2025; Campellone et al., 2025; Abate et al., 2025).

One driver of this tension is model confidence and calibration. Nine models achieve 56–89% accuracy on USMLE questions, yet self-reported confidence is almost always above 90% and only weakly correlated with correctness (AUROC 0.52–0.68), whereas token-level probabilities predict errors much better (AUROC 0.71–0.87) (Bentegeac et al., 2025). This implies that verbal confidence cannot serve as a reliable basis for autonomous use.
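The AUROC figures can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. The sketch below computes this directly by pairwise comparison. The confidence values are invented for illustration and are not data from Bentegeac et al. (2025); they simply show why a verbal confidence that is uniformly high cannot discriminate correct from incorrect answers.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a correct answer > score of an incorrect one).

    labels: 1 for a correct answer, 0 for an incorrect one. Ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: three correct answers followed by two incorrect ones.
correct = [1, 1, 1, 0, 0]
verbal_confidence = [0.95, 0.95, 0.95, 0.95, 0.95]  # always "very sure"
token_probability = [0.90, 0.80, 0.70, 0.60, 0.85]

print(auroc(verbal_confidence, correct))  # 0.5, no discrimination at all
print(auroc(token_probability, correct))
```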

A second driver is cognitive bias and ethical consistency. A reasoning-oriented model such as o1 shows no measurable cognitive bias in 7 out of 10 clinical vignettes but exhibits bias in others, sometimes performing worse than GPT-4 or clinicians, particularly in cases with “uncertainty-closing” cues (Degany et al., 2025). GPT-4 scores highly on average in ethical dilemmas, yet adheres to ethical principles in only about 60% of certain scenarios (Xiong et al., 2025). Esmaeilzadeh (2025) finds that three advanced models, under varied prompting strategies, produce critical safety violations or severe empathy/communication failures in 12.2% of responses.

A third driver is human–machine collaboration design. In medication error detection, an LLM-based clinical decision support system (CDSS) has limited standalone accuracy, yet the combination of clinical pharmacologist plus LLM co-pilot achieves the best performance, particularly for high-harm potential errors, increasing human detection capability by a factor of 1.5 (Ong et al., 2025). In a randomized trial of a digital mental health intervention, a guardrail-equipped generative bot performs similarly to a rule-based bot in terms of user relationship and experience, with no major safety events but somewhat better empathy expression (Campellone et al., 2025).

In background safety tasks, performance is mixed. For WHO’s causality algorithm for adverse events following immunization (AEFI), ChatGPT aligns with human ratings at a moderate level but adheres to the algorithm’s steps in only 34% of cases, and Gemini performs worse (Abate et al., 2025), calling into question fully automated safety pipelines.

Evidence strongly supports guardrailed co-pilot configurations as outperforming either LLMs or humans alone (Ong et al., 2025; Campellone et al., 2025). However, for fully automated tasks such as RoB, causality assessment or ethical decision-making, the evidence for human replacement is weak; calibration and ethical compliance issues remain serious (Bentegeac et al., 2025; Esmaeilzadeh, 2025; Abate et al., 2025; Degany et al., 2025).

Deadlocks include the cliché that “LLMs should only assist and humans should make final decisions,” without clear answers to how responsibility is actually divided, whose recommendations dominate in interfaces, and to what extent clinicians can maintain genuine authority under time pressure. New cognitive biases in human–LLM joint decision-making, such as automation bias and anchoring, are scarcely measured. Legal and ethical frameworks for assigning responsibility in hybrid decisions remain indistinct.

Research priorities include simulated clinical decision experiments in which residents and specialists solve vignettes under different interface designs (hidden suggestions, explanatory suggestions, probabilistic outputs, “second opinion” labels) and outcomes such as accuracy, decision time, confidence and automation bias are tracked. Real-world staged pilots in tasks like prescription checking could compare human-only, invisible LLM and visible, explanatory LLM conditions with longitudinal monitoring of error rates, serious harm and workload. Interface experiments using internal uncertainty metrics (e.g., token probability–based indicators) could test how best to convey model uncertainty without inducing overtrust or underuse.

Tension 6 – Automating the scientific process: efficiency or epistemic erosion?

This tension spans methodological, epistemic and practical dimensions. LLMs can dramatically accelerate systematic reviews, RoB assessments, data extraction and even causal inference, but risk replacing already inconsistent human judgment with an opaque, partly hallucinating “black box” (Dobler et al., 2025; Huang et al., 2025; Kim et al., 2025; Di Pumpo et al., 2025; Forero et al., 2025; Abate et al., 2025).

A key source of ambiguity is that human gold standards are themselves inconsistent. In a GPT-4–assisted extraction study, inter-reviewer agreement on RoB ratings between human systematic reviews is very low (κ = 0.094), while GPT-4 shows only fair to moderate agreement with humans (Kim et al., 2025). This raises the question of what it means epistemically when LLMs match inconsistent human judgments.
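Cohen's kappa corrects raw agreement for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e), which is why a value such as 0.094 indicates agreement barely above chance. A minimal sketch follows; the two rating lists are hypothetical and serve only to show that 50% raw agreement can correspond to κ = 0.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical RoB ratings for four studies.
reviewer_1 = ["low", "low", "high", "high"]
reviewer_2 = ["low", "high", "low", "high"]
print(cohen_kappa(reviewer_1, reviewer_2))  # 0.0 despite 50% raw agreement
```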

Performance varies sharply by task. GPT-4 achieves 88.6% agreement with humans on study characteristics tables and sometimes surpasses them, and LLMs can reach 70–90% accuracy in RoB2 domain ratings when used to answer signaling questions that are then fed into algorithmic decision rules, while cutting assessment time from 31.5 to 1.9 minutes per study (Huang et al., 2025). However, in WHO’s vaccine causality algorithm, ChatGPT adheres to decision steps in only a third of cases, with even poorer performance from Gemini (Abate et al., 2025).
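The design in which an LLM answers signaling questions and a deterministic rule produces the domain rating can be sketched as follows. The answer codes (Y, PY, PN, N, NI) follow the RoB 2 response options, but the decision rule here is a deliberately simplified, hypothetical stand-in and not the actual RoB 2 algorithm; it assumes every question is phrased so that "yes" is the favorable answer, which is not true of the real tool.

```python
FAVORABLE = {"Y", "PY"}      # yes / probably yes
UNFAVORABLE = {"N", "PN"}    # no / probably no
NO_INFORMATION = {"NI"}


def domain_judgment(answers):
    """Map signaling-question answers to a domain rating (hypothetical rule)."""
    valid = FAVORABLE | UNFAVORABLE | NO_INFORMATION
    unknown = [a for a in answers if a not in valid]
    if unknown:
        raise ValueError(f"unrecognized answer codes: {unknown}")
    if any(a in UNFAVORABLE for a in answers):
        return "high"
    if any(a in NO_INFORMATION for a in answers):
        return "some concerns"
    return "low"


# In a real pipeline these codes would come from the LLM's answers to each
# signaling question; here they are written by hand for illustration.
print(domain_judgment(["Y", "PY"]))  # low
print(domain_judgment(["Y", "NI"]))  # some concerns
print(domain_judgment(["Y", "N"]))   # high
```

Keeping the final rating in a transparent rule means that any disagreement with human reviewers can be traced back to specific signaling-question answers rather than to an opaque overall judgment.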

Usage patterns further shape outcomes. Some groups propose LLMs as a “second reader” parallel to human reviewers (Rose et al., 2025; Di Pumpo et al., 2025), whereas others experiment with fully automated RoB or level-of-evidence ratings, in which certain risk domains are systematically underestimated (Forero et al., 2025; Di Pumpo et al., 2025).

The evidence suggests that LLM-assisted data extraction and signaling-question generation can offer substantial efficiency gains at acceptable accuracy when embedded in structured processes and under human oversight (Huang et al., 2025; Kim et al., 2025). For higher-order methodological judgments such as RoB synthesis or causality, the evidence is heterogeneous; some tools track human performance, others are systematically more lenient or stricter (Forero et al., 2025; Abate et al., 2025; Di Pumpo et al., 2025).

Deadlocks include the near-total absence of studies that follow LLM-assisted review processes through to their downstream clinical or policy decisions, such as guideline recommendations or regulatory decisions. The relative harms of different error patterns (e.g., underestimating some biases while overestimating others) have not been systematically mapped. The implications of LLM integration for transparency and reproducibility—in particular, how to resolve conflicts between different models’ outputs—remain largely untheorized.

Priority research includes randomized comparisons of fully human versus LLM-assisted systematic review workflows applied to the same question, comparing final meta-analytic estimates, recommendation directions, time and cost, and error profiles. Large-scale “error pathology” analyses could compare human and LLM RoB/causality ratings to identify domains where models are consistently optimistic or pessimistic and the safety implications of these discrepancies. Finally, standardized “LLM use reports” for systematic reviews and related work—detailing models, prompts, training data and oversight—could be developed and piloted to support transparency and appraisal.

Tension 7 – Evaluation frameworks: shared language or metric anarchy?

This final tension is conceptual, methodological and normative. Studies deploy a wide range of metrics for “safety,” “bias,” “empathy,” “ethical alignment” and “hallucination,” but there is limited standardization or cross-study comparability. Several groups propose comprehensive evaluation frameworks (Templin et al., 2025; Hose et al., 2025; Giuffre et al., 2025; Nishisako et al., 2025), yet the majority of the literature still relies on ad hoc criteria, small vignette sets and subjective ratings. This fragmentation impedes cumulative knowledge-building and policy-making.

Disciplinary differences contribute to this fragmentation. Patient safety work tends to classify LLM errors by input/output stage and clinical severity (Hose et al., 2025). Fairness studies focus on machine-learning metrics such as statistical parity difference (SPD), while ethics-centered work favors principle-based checklists and expert panel scores (Esmaeilzadeh, 2025; Chang et al., 2025). Educational studies evaluate learning outcomes, academic integrity and critical thinking (Rodger et al., 2025; Kim & Vajravelu, 2025).

Many studies remain fixated on single “performance” metrics (accuracy, AUC, F1), whereas frameworks such as EVAL integrate similarity measures, reward models and human ratings to produce richer but more complex assessments (Giuffre et al., 2025). Nishisako et al. (2025) highlight “refusal to answer” as an important meta-outcome alongside hallucination, which most benchmarks ignore.

The disconnect from regulation and policy is also notable. Templin et al. (2025) propose a five-step bias assessment framework for clinical deployment, but most empirical work continues to run isolated, single-institution tests with little reference to regulatory contexts. Infodemiology approaches, such as those by Bandeira et al. (2025), conceptualize LLMs within broader public health information ecosystems, yet technical benchmarks rarely engage with these wider concerns.

The current evidence base thus consists of many small, internally consistent but mutually incomparable experiments, punctuated by early but not yet dominant standardization efforts. Deadlocks include unarticulated conflicts between stakeholder perspectives—clinicians, patients, ethicists, developers and regulators may have incompatible definitions of a “good model”—and the prevalence of closed evaluation sets that hinder replication and longitudinal monitoring. Rapid model iteration quickly renders static benchmarks obsolete.

Research leverage points include multi-stakeholder Delphi processes to reach consensus on a core “metric basket” for health LLMs, spanning performance, calibration, safety, fairness, explainability and usability. Open, living benchmark platforms for medical LLMs, updated periodically and governed by independent consortia, could integrate error taxonomies (Hose et al., 2025), bias tests (Templin et al., 2025) and hallucination metrics (Nishisako et al., 2025). Finally, policy-linked composite indicators such as a “health information ecosystem risk score,” combining hallucination, misinformation potential, correction mechanisms and user trust, could inform regulation and procurement decisions.

References:

Abate, A., Poncato, E., Barbieri, M. A., Powell, G., Rossi, A., Peker, S., … Sessa, M. (2025). Off-the-shelf large language models for causality assessment of individual case safety reports: A proof-of-concept with COVID-19 vaccines. Drug Safety, 48(7), 805–820. https://doi.org/10.1007/s40264-025-01531-y

Bandeira, A., Goncalves, L. H., Holl, F., Shaibu, J. U., Goncalves, M. L., Payinda, R., … Mackey, T. (2025). Viewpoint on the intersection among health information, misinformation, and generative AI technologies. JMIR Infodemiology, 5. https://doi.org/10.2196/69474

Bentegeac, R., Le Guellec, B., Kuchcinski, G., Amouyel, P., & Hamroun, A. (2025). Token probabilities to mitigate large language models overconfidence in answering medical questions: Quantitative study. Journal of Medical Internet Research, 27. https://doi.org/10.2196/64348

Bouguettaya, A., Stuart, E. M., & Aboujaoude, E. (2025). Racial bias in AI-mediated psychiatric diagnosis and treatment: A qualitative comparison of four large language models. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01746-4

Campellone, T. R., Flom, M., Montgomery, R. M., Bullard, L., Pavez, A., Morales, M., … Darcy, A. (2025). Safety and user experience of a generative artificial intelligence digital mental health intervention: Exploratory randomized controlled trial. Journal of Medical Internet Research, 27. https://doi.org/10.2196/67365

Chang, C. T., Srivathsa, N., Bou-Khalil, C., Swaminathan, A., Lunn, M. R., Mishra, K., … Daneshjou, R. (2025). Evaluating anti-LGBTQIA+ medical bias in large language models. PLOS Digital Health, 4(9). https://doi.org/10.1371/journal.pdig.0001001

Choi, J. Y., Kim, D. E., Kim, S. J., Choi, H., & Yoo, T. K. (2025). Application of multimodal large language models for safety indicator calculation and contraindication prediction in laser vision correction. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01487-4

Coskun, C., & Cetinkaya, B. (2025). Evaluating the accuracy and readability of ChatGPT-4 responses about cardiac rehabilitation for heart failure patients. Acta Cardiologica. https://doi.org/10.1080/00015385.2025.2576451

Currie, G., Hewis, J., Hawk, E., & Rohren, E. (2025). Gender and ethnicity bias of text-to-image generative artificial intelligence in medical imaging, part 2: Analysis of DALL-E 3. Journal of Nuclear Medicine Technology, 53(2), 162–168. https://doi.org/10.2967/jnmt.124.268359

Degany, O., Laros, S., Idan, D., & Einav, S. (2025). Evaluating the o1 reasoning large language model for cognitive bias: A vignette study. Critical Care, 29(1). https://doi.org/10.1186/s13054-025-05591-5

Di Pumpo, M., Riccardi, M. T., De Vita, V., & Damiani, G. (2025). Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels. European Journal of Public Health. https://doi.org/10.1093/eurpub/ckaf072

Dobler, D., Binder, H., Boulesteix, A.-L., Igelmann, J.-B., Kohler, D., Mansmann, U., … Schmid, M. (2025). ChatGPT as a tool for biostatisticians: A tutorial on applications, opportunities, and limitations. Statistics in Medicine, 44(23–24). https://doi.org/10.1002/sim.70263

Dormosh, N., Boonstra, M., Abu-Hanna, A., Asselbergs, F. W., & Calixto, I. (2025). Exploring the potential of large language models for assessing medication adherence to the ESC heart failure guidelines. JAMIA Open, 8(6). https://doi.org/10.1093/jamiaopen/ooaf155

Eisinger, F., Holderried, F., Mahling, M., Stegemann-Philipps, C., Herrmann-Werner, A., Nazarenus, E., … Holderried, M. (2025). What’s going on with me and how can I better manage my health? The potential of GPT-4 to transform discharge letters into patient-centered letters. Journal of Medical Internet Research, 27. https://doi.org/10.2196/67143

Erden, Y., Dilek, G., Temel, M. H., Soylu, H. H., Kalfaoglu, M. E., & Bagcier, F. (2025). Evaluating the performance of ChatGPT-4V in detecting inflammatory MRI findings of sacroiliitis. Journal of Imaging Informatics in Medicine. https://doi.org/10.1007/s10278-025-01742-w

Esmaeilzadeh, P. (2025). Ethical implications of using general-purpose LLMs in clinical settings: A comparative analysis of prompt engineering strategies and their impact on patient safety. BMC Medical Informatics and Decision Making, 25(1). https://doi.org/10.1186/s12911-025-03182-6

Forero, D. A., Abreu, S. E., & Tovar, B. E. (2025). Automated analyses of risk of bias and critical appraisal of systematic reviews: A comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association, 32(9), 1471–1476. https://doi.org/10.1093/jamia/ocaf117

Giuffre, M., You, K., Pang, Z., Kresevic, S., Chung, S., Chen, R., … Shung, D. L. (2025). Expert of experts verification and alignment (EVAL) framework for large language models safety in gastroenterology. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01589-z

Haberl, D., Ning, J., Kluge, K., Kumpf, K., Yu, J., Jiang, Z., … Spielvogel, C. P. (2025). Generative artificial intelligence enables the generation of bone scintigraphy images and improves generalization of deep learning models. European Journal of Nuclear Medicine and Molecular Imaging, 52(7), 2355–2368. https://doi.org/10.1007/s00259-025-07091-8

Hanna, J. J., Wakene, A., Johnson, A., Lehmann, C. U., & Medford, R. J. (2025). Assessing racial and ethnic bias in text generation by large language models for health care-related tasks. Journal of Medical Internet Research, 27. https://doi.org/10.2196/57257

Hose, B.-Z., Handley, J. L., Biro, J., Reddy, S., Krevat, S., Hettinger, A. Z., & Ratwani, R. M. (2025). Development of a preliminary patient safety classification system for generative AI. BMJ Quality & Safety, 34(2), 130–132. https://doi.org/10.1136/bmjqs-2024-017918

Huang, A. E., Chang, M. T., Khanwalkar, A., Yan, C. H., Phillips, K. M., Yong, M. J., … Patel, Z. M. (2025). Utilization of ChatGPT for rhinology patient education: Limitations in a surgical sub-specialty. OTO Open, 9(1). https://doi.org/10.1002/oto2.70065

Huang, J., Lai, H., Zhao, W., Xia, D., Bai, C., Sun, M., … Ge, L. (2025). Large language model-assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool. Journal of Medical Internet Research, 27. https://doi.org/10.2196/70450

Huang, R., Wu, H., Yuan, Y., Xu, Y., Qian, H., Zhang, C., … Liu, Y. (2025). Evaluation and bias analysis of large language models in generating synthetic electronic health records. Journal of Medical Internet Research, 27. https://doi.org/10.2196/65317

Ibas, M., Dursun, S., Paksoy, M., Ocal, R., & Karatas, E. (2025). Accuracy and safety of ChatGPT-4o responses in rhinoplasty postoperative counseling. Acta Oto-Laryngologica, 145(9), 851–856. https://doi.org/10.1080/00016489.2025.2541612

Kim, J., & Vajravelu, B. N. (2025). Assessing the current limitations of large language models in advancing health care education. JMIR Formative Research, 9. https://doi.org/10.2196/51319

Kim, J. K., Chua, M. E., Li, T. G., Rickard, M., & Lorenzo, A. J. (2025). Novel AI applications in systematic review: GPT-4 assisted data extraction, analysis, review of bias. BMJ Evidence-Based Medicine, 30(5), 313–322. https://doi.org/10.1136/bmjebm-2024-113066

King, R. C., Samaan, J. S., Haquang, J., Bharani, V., Margolis, S., Srinivasan, N., … Ghashghaei, R. (2025). Improving the readability of institutional heart failure-related patient education materials using GPT-4. JMIR Cardio, 9. https://doi.org/10.2196/68817

Liu, C., Zheng, J., Liu, Y., Wang, X., Zhang, Y., Fu, Q., … Liu, C. (2025). Potential to perpetuate social biases in health care by Chinese large language models: A model evaluation study. International Journal for Equity in Health, 24(1). https://doi.org/10.1186/s12939-025-02581-5

Lu, H., Lin, Y., Li, Z., Yiu, M. L., Gao, Y., & Uddin, S. (2025). Toward fair medical advice: Addressing and mitigating bias in large language model-based healthcare applications. Artificial Intelligence in Medicine, 168. https://doi.org/10.1016/j.artmed.2025.103216

Mahajan, A., Obermeyer, Z., Daneshjou, R., Lester, J., & Powell, D. (2025). Cognitive bias in clinical large language models. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01790-0

Nishisako, S., Higashi, T., & Wakao, F. (2025). Reducing hallucinations and trade-offs in responses in generative AI chatbots for cancer information. JMIR Cancer, 11. https://doi.org/10.2196/70176

Omar, M., Soffer, S., Agbareia, R., Bragazzi, N. L., Apakama, D. U., Horowitz, C. R., … Klang, E. (2025). Sociodemographic biases in medical decision making by large language models. Nature Medicine, 31(6), 1873–1881. https://doi.org/10.1038/s41591-025-03626-6

Omar, M., Sorin, V., Collins, J. D., Reich, D., Freeman, R., Gavin, N., … Klang, E. (2025). Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support. Communications Medicine, 5(1). https://doi.org/10.1038/s43856-025-01021-3

Ong, J. C. L., Jin, L., Elangovan, K., Lim, G. Y. S., Lim, D. Y. Z., Sng, G. G. R., … Ting, D. S. W. (2025). Large language model as clinical decision support system augments medication safety in 16 clinical specialties. Cell Reports Medicine, 6(10). https://doi.org/10.1016/j.xcrm.2025.102323

Pesapane, F., Nicosia, L., Rotili, A., Penco, S., Dominelli, V., Trentin, C., … Cassano, E. (2025). A preliminary investigation into the potential, pitfalls, and limitations of large language models for mammography interpretation. Discover Oncology, 16(1). https://doi.org/10.1007/s12672-025-02005-4

Rose, C. J., Bidonde, J., Ringsten, M., Glanville, J., Berg, R. C., Cooper, C., … Potrebny, T. (2025). Using a large language model (ChatGPT) to assess risk of bias in randomized controlled trials: Protocol for a pilot study. BMC Medical Research Methodology, 25(1). https://doi.org/10.1186/s12874-025-02631-0

Sharma, S., Alaa, A. M., & Daneshjou, R. (2025). A longitudinal analysis of declining medical safety messaging in generative AI models. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01943-1

Sheng, W., Wang, Y., Xiao, L., Guo, B., Zhang, Y., Qiao, L., … Zhang, Y. (2025). BI-RADS-compliant structured mammography reporting using locally deployed large language models. European Radiology. https://doi.org/10.1007/s00330-025-12147-2

Soon, S., & Perry, B. (2025). Paging Dr. ChatGPT: Safety, accuracy and readability of ChatGPT in ENT emergencies. Australian Journal of Otolaryngology, 8. https://doi.org/10.21037/ajo-24-56

Templin, T., Fort, S., Padmanabham, P., Seshadri, P., Rimal, R., Oliva, J., … Sinnott-Armstrong, N. (2025). Framework for bias evaluation in large language models in healthcare settings. NPJ Digital Medicine, 8(1). https://doi.org/10.1038/s41746-025-01786-w

Vurture, G., Jenkins, N., Ross, J., Sansone, S., Conner, E., Jacobson, N., … Baum, J. (2025). Addressing commonly asked questions in urogynecology: Accuracy and limitations of ChatGPT. International Urogynecology Journal. https://doi.org/10.1007/s00192-025-06184-0

Wan, Z., Guo, Y., Bao, S., Wang, Q., & Malin, B. A. (2025). Evaluating sex and age biases in multimodal large language models for skin disease identification from dermatoscopic images. Health Data Science, 5. https://doi.org/10.34133/hds.0256

Xu, S., Yan, Z., Dai, C., & Wu, F. (2025). MEGA-RAG: A retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health. Frontiers in Public Health, 13. https://doi.org/10.3389/fpubh.2025.1635381

Young, C. C., Enichen, E., Rao, A., & Succi, M. D. (2025). Racial, ethnic, and sex bias in large language model opioid recommendations for pain management. Pain, 166(3), 511–517. https://doi.org/10.1097/j.pain.0000000000003388

Zada, T., Tam, N., Barnard, F., Van Sittert, M., Bhat, V., & Rambhatla, S. (2025). Medical misinformation in AI-assisted self-diagnosis: Development of a method (EvalPrompt) for analyzing large language models. JMIR Formative Research, 9. https://doi.org/10.2196/66207
