AI’s Role in Systematic Reviews: The Elicit Example

This article, titled “Using artificial intelligence for systematic review: the example of elicit”, was authored by Nathan Bernard, Yoshimasa Sagawa Jr, Nathalie Bier, Thomas Lihoreau, Lionel Pazart, and Thomas Tannou. It was published in BMC Medical Research Methodology.

The article addresses the increasing use of artificial intelligence (AI) tools to assist researchers with various tasks, particularly in the systematic review process. Systematic reviews are described as rigorous and time-consuming processes requiring a high degree of completeness. While several AI tools have been developed to handle key milestones in this process, questions remain about their effectiveness compared with traditional, human-conducted searching.

Elicit is highlighted as one such AI tool, based on language models like GPT-3, which functions as a powerful research assistant. It distinguishes itself by its ability to generate a summary in response to the research question asked. Elicit aims to identify the most relevant papers using semantic similarity across multiple databases (though it primarily relies on Semantic Scholar) and then summarizes the question by analyzing each abstract. It also offers features like filters for keywords, article type, publication years, and organization of results by population, intervention, or study methodology. Elicit operates as a process-based system, retrieving, analyzing, and summarizing the eight articles most likely to answer a given question.

The aim of this study was to determine whether AI-assisted research using Elicit adds value to the systematic review process compared to traditional screening methods. To achieve this, the researchers compared results from an umbrella review conducted independently of AI with those obtained through AI-based searching using Elicit.

Elicit’s contribution was assessed based on three criteria:

  • Repeatability: Defined as Elicit’s ability to provide consistent results under identical conditions but at different times. The search process was replicated three times.
  • Accuracy: Defined as Elicit’s ability to retrieve relevant articles. Articles obtained via Elicit were reviewed using the same inclusion/exclusion criteria as the umbrella review.
  • Reliability: Defined as the agreement between Elicit-assisted screening and the classical screening method. This was assessed by comparing the number of publications found by both methods.
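The three criteria above can be operationalized as simple set comparisons over article identifiers. The sketch below is illustrative only (not the authors' code), and every article ID in it is a hypothetical placeholder:

```python
# Illustrative sketch: the study's three criteria expressed as set
# comparisons over article identifiers. All IDs are hypothetical.

def repeatability(trials: list[set[str]]) -> bool:
    """Consistent results under identical conditions: every trial
    returns the same set of articles."""
    return all(t == trials[0] for t in trials)

def accuracy(retrieved: set[str], relevant: set[str]) -> float:
    """Share of retrieved articles that satisfy the inclusion criteria."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def reliability(ai_included: set[str], classic_included: set[str]) -> float:
    """Agreement between AI-assisted and classical screening: the fraction
    of classically included studies that the AI method also found."""
    return len(ai_included & classic_included) / len(classic_included)

# Toy example with made-up IDs:
trials = [{"a1", "a2", "a3"}, {"a1", "a2"}, {"a1", "a2", "a3"}]
print(repeatability(trials))  # False: the second trial differs
print(accuracy({"a1", "a2", "a3"}, {"a1"}))        # 1 of 3 retrieved is relevant
print(reliability({"a1", "a2"}, {"a1", "a4", "a5"}))  # 1 of 3 classic studies found
```

Framing the criteria this way makes each one a single, checkable quantity, which is how the study's own counts (articles per trial, articles included, articles in common) are reported.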

Key Findings:

  • Repeatability: The repeatability test showed varied results across trials, with 246 results in trial 1, 169 in trial 2, and 172 in trial 3. After pooling and removing duplicates, 241 articles were identified. The study noted a lack of repeatability; trial 2, for example, found only 4 of the 6 included studies.
  • Accuracy: Six articles from Elicit’s search were included at the conclusion of the selection process. The study found that Elicit’s accuracy needs improvement, as many articles (206) were excluded due to inappropriate outcomes or interventions. Elicit’s accuracy is also influenced by the formulation of the research question.
  • Reliability: When comparing Elicit’s results with the AI-independent umbrella review, 3 common articles were identified, representing 17.6% of the studies finally included by the classic screening method. Additionally, 3 articles were exclusively identified by Elicit, while 17 articles were exclusively identified by the AI-independent umbrella review search. This indicated a lack of reliability when compared to the classical screening method.
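The pooling/deduplication and method-comparison steps behind these counts can be sketched as follows. This is a hypothetical illustration, not the authors' data or code; the article IDs and helper names are invented:

```python
# Illustrative sketch of (1) pooling repeated trials and removing
# duplicates, and (2) comparing two screening methods. Article IDs
# are hypothetical placeholders.

def pool_and_dedupe(trials: list[list[str]]) -> list[str]:
    """Merge results from repeated trials, keeping each article once
    (first occurrence wins) -- the step the study uses to reduce
    246 + 169 + 172 raw hits to 241 unique articles."""
    seen: set[str] = set()
    pooled: list[str] = []
    for trial in trials:
        for article_id in trial:
            if article_id not in seen:
                seen.add(article_id)
                pooled.append(article_id)
    return pooled

def compare_methods(elicit: set[str], classic: set[str]) -> tuple[int, int, int]:
    """Counts of (common, Elicit-only, classic-only) articles."""
    return (len(elicit & classic), len(elicit - classic), len(classic - elicit))

trials = [["a1", "a2"], ["a2", "a3"], ["a1", "a3", "a4"]]
print(pool_and_dedupe(trials))                      # ['a1', 'a2', 'a3', 'a4']
print(compare_methods({"a1", "a2", "a3"}, {"a1", "a5"}))  # (1, 2, 1)
```

The same three counts (common, Elicit-only, classic-only) are exactly what the reliability comparison above reports.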

The study highlighted several limitations of AI tools like Elicit in the systematic review process:

  • Lack of repeatability and reliability.
  • Limited search functionality: Elicit does not use a traditional keyword-based search, necessitating manual sorting by year for comprehensiveness.
  • Single database reliance: Elicit relies on articles referenced in a single database (Semantic Scholar), limiting comprehensiveness.
  • Reduced precision and nuance: The small number of articles found by Elicit reduced the level of precision and nuance, key objectives of systematic reviews.
  • Accuracy issues: Many articles were excluded by Elicit due to inappropriate outcomes or interventions.
  • Problematic referencing: Elicit incorrectly cited the umbrella review protocol, emphasizing the need for human control for referencing.
  • Research noise: Variations in the number of articles found per study across trials suggest “research noise” impacting accuracy and reliability.

Implications and Recommendations: Despite these limitations, the article concludes that AI research assistants, such as Elicit, can serve as valuable complementary tools for researchers during systematic literature review processes. They have not yet reached a level of development where they can fully replace traditional approaches. AI can support specific tasks like data screening, extraction, or risk of bias assessment to improve comprehensiveness and accuracy and to mitigate human biases. When combined with human analysis, AI tools can improve time efficiency and rigor.

However, the limitations underscore the critical need to maintain human oversight when using these tools. The authors propose several principles for methodological rigor and integrity when using AI:

  • Acknowledge that these tools lack evidence on their validity, reliability, and accuracy.
  • Use AI tools only at certain stages of the systematic review process, not for automating the entire process.
  • For transparency and reproducibility, it is crucial to mention the use of AI tools in the methodology section.
  • Implement the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) AI reporting guidelines, which are currently under development, to provide a framework for AI use.

In essence, while Elicit shows potential for enhancing comprehensiveness by identifying new articles that might otherwise be missed, its current performance demonstrates a lack of repeatability and reliability, indicating that AI tools are powerful assistants but require careful human supervision and integration into established systematic review methodologies.

Reference: Bernard, N., Sagawa, Y., Jr., Bier, N., Lihoreau, T., Pazart, L., & Tannou, T. (2025). Using artificial intelligence for systematic review: the example of elicit. BMC Medical Research Methodology, 25, Article 75. https://doi.org/10.1186/s12874-025-02528-y

Podcast Link

https://notebooklm.google.com/notebook/1ccc5b9a-dd89-4feb-9139-42a4cad96568/audio
