The source, titled “One shot at trust: building credible evidence for medical artificial intelligence”, authored by Ramez Kouzy, Julian C. Hong, and Danielle S. Bitterman and published in The Lancet Digital Health in 2025, addresses the challenges and responsibilities that accompany the rapid development and deployment of medical artificial intelligence (AI), particularly large language models (LLMs).
The article highlights an unprecedented surge in medical AI activity since January 2023, with LLM use cases generating widespread enthusiasm. While this enthusiasm reflects genuine technological advances, the authors caution that the swift pace of development demands thoughtful consideration of evaluation and communication strategies, especially given the technology's direct impact on human lives in medicine.
The central concern articulated in the article is the growing disconnect between technological promises and demonstrable real-world value, which risks creating a “trust deficit” that could undermine the sustainable adoption of AI in clinical practice. To prevent this, the authors advocate a rigorous pre-implementation research methodology for LLMs that parallels established paradigms of clinical investigation.
Key challenges and issues identified in the source include:
- Overstated Conclusions and Imprecise Communication: The article points out a troubling trend in which preliminary or simulation-based study results are described as “randomized controlled trials” (RCTs) or “randomized clinical trials”, even when the underlying methods are synthetic and bear little resemblance to real-world clinical practice. Without clear acknowledgment of the preclinical nature of such work, this labeling becomes a misnomer and risks misleading audiences.
- Superficial Engagement by Readers: Clinicians, policymakers, and the broader healthcare community often encounter these papers through high-impact journals and, owing to time constraints or limited methodological expertise, rely heavily on titles and abstracts. As a result, crucial methodological details and limitations never reach the intended audience, encouraging uncritical interpretation of claims and fueling a “hype cycle”.
- Reliance on Inadequate Outcome Measures: Evaluations lean heavily on computational metrics or surrogate endpoints that may not reflect meaningful clinical impact. This mirrors historical lessons from oncology, where surrogate endpoints did not consistently predict actual patient benefit. The authors emphasize that different stakeholders (healthcare providers, administrators, patients) require different evaluation criteria and success metrics, which current approaches often fail to capture.
- Erosion of Trust and Compromised Patient Outcomes: The lack of a clear distinction between exploratory studies and validated clinical trials threatens to erode trust and create confusion, and may lead to the adoption of technologies on the basis of inflated expectations rather than robust evidence, potentially resulting in misguided decisions and compromised patient outcomes.
To address these critical issues, the article proposes a path forward centered on establishing clear standards and fostering collaboration:
- Standards for Meaningful Clinical Evidence: There is a need for clear standards for meaningful clinical evidence in LLM medical research, including frameworks for distinguishing between simulation studies and RCTs, establishing appropriate comparators, and defining clinically relevant outcome measures.
- Collaborative Models: Such standards should emerge from collaborative efforts, drawing insights from clinical research methodology, implementation science, and AI evaluation frameworks. The authors suggest that cooperative groups in oncology, which have successfully established consensus frameworks, serve as a model for healthcare AI. Emerging initiatives like the Trustworthy and Responsible AI Network (TRAIN) are highlighted as promising models for coordinating efforts across institutional boundaries.
- Risk-Stratified Validation: A risk-stratified approach to validation is required (a minimal illustrative sketch follows this list). High-stakes implementations with potential patient-safety implications warrant rigorous clinical trials, while lower-risk applications could proceed through alternative validation pathways. Simulation studies, while valuable for efficient evaluation, must be clearly differentiated from clinical validation.
- Precision and Integrity in Communication: The article holds researchers, publishers, and technology developers responsible for communicating findings with precision and integrity, avoiding overstatement or “spin” that could erode trust and impede long-term progress. This includes clear acknowledgment of limitations and uncertainties.
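To make the risk-stratified triage logic concrete, the sketch below expresses a possible pathway assignment in Python. This is purely illustrative and not from the article: the risk criteria, tier names, and the `LLMApplication` and `triage_validation` identifiers are all hypothetical assumptions about how such a framework might be structured.

```python
from dataclasses import dataclass
from enum import Enum


class ValidationPathway(Enum):
    """Hypothetical validation tiers, loosely following the article's distinctions."""
    CLINICAL_TRIAL = "prospective clinical trial"        # high-stakes uses
    ALTERNATIVE_VALIDATION = "alternative validation"    # lower-risk uses
    SIMULATION_ONLY = "simulation study (preclinical)"   # exploratory work


@dataclass
class LLMApplication:
    """Minimal description of a proposed medical LLM use case (illustrative)."""
    name: str
    affects_patient_safety: bool      # e.g., treatment recommendations
    informs_clinical_decisions: bool  # e.g., triage support
    deployed_in_care_setting: bool    # vs. bench or simulation evaluation only


def triage_validation(app: LLMApplication) -> ValidationPathway:
    """Assign a validation pathway by risk tier (hypothetical criteria)."""
    if not app.deployed_in_care_setting:
        # Simulation studies are valuable but must be labeled as preclinical,
        # never as clinical trials.
        return ValidationPathway.SIMULATION_ONLY
    if app.affects_patient_safety or app.informs_clinical_decisions:
        # High-stakes implementations warrant rigorous clinical trials.
        return ValidationPathway.CLINICAL_TRIAL
    # Lower-risk applications may proceed through alternative pathways.
    return ValidationPathway.ALTERNATIVE_VALIDATION


if __name__ == "__main__":
    note_summarizer = LLMApplication(
        name="discharge-note summarizer",
        affects_patient_safety=False,
        informs_clinical_decisions=False,
        deployed_in_care_setting=True,
    )
    print(triage_validation(note_summarizer).value)  # alternative validation
```

In practice, the criteria and thresholds defining each tier would need to come from the kind of consensus frameworks the collaborative models above are intended to produce.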
The authors conclude by emphasizing that the medical community stands at a crucial juncture, where the primary obligation remains towards patient care and scientific truth. They assert that successful LLM integration requires methodical, rigorous evaluation and transparent communication to ensure these powerful tools genuinely advance the practice of medicine and maintain the foundational trust of the medical field.
Reference: Kouzy, R., Hong, J. C., & Bitterman, D. S. (2025). One shot at trust: building credible evidence for medical artificial intelligence. The Lancet Digital Health. https://doi.org/10.1016/j.landig.2025.100883
Research questions:
- How can a rigorous pre-implementation research methodology for medical LLMs be developed and implemented, paralleling established paradigms of clinical investigation? This includes establishing clear frameworks for distinguishing between simulation studies and actual RCTs.
- What are the most appropriate and meaningful clinical outcome measures for evaluating medical LLMs across different applications and stakeholder perspectives? This addresses the current heavy reliance on computational metrics or surrogate endpoints that may not reflect true clinical impact or patient benefit.
- How can collaborative models, similar to oncology cooperative groups or initiatives like the Trustworthy and Responsible AI Network (TRAIN), be effectively established to develop consensus frameworks for evaluating healthcare AI and prioritizing clinical trials?
- What specific standards and frameworks are needed to ensure precision and integrity in the communication of medical AI research findings, particularly to prevent the overstatement of preliminary or simulation-based results and avoid misleading audiences?
- How can a risk-stratified approach to validation be developed and applied for medical LLMs, such that high-stakes implementations with patient-safety implications undergo rigorous clinical trials, while lower-risk applications proceed through alternative, yet robust, validation pathways?
- What are the best practices for researchers, publishers, and technology developers to communicate findings with clarity, avoiding “spin” or overstatement that could erode trust and impede long-term progress in healthcare AI adoption?
- How can the medical community ensure that the rapid adoption of LLMs aligns with the primary obligation towards patient care and scientific truth, by maintaining trust through methodical, rigorous evaluation and transparent communication?
