AI’s Role in Systematic Reviews: The Elicit Example

This article, titled “Using artificial intelligence for systematic review: the example of elicit”, was authored by Nathan Bernard, Yoshimasa Sagawa Jr, Nathalie Bier, Thomas Lihoreau, Lionel Pazart, and Thomas Tannou. It was published in BMC Medical Research Methodology.

The article addresses the increasing use of artificial intelligence (AI) tools to assist researchers with various tasks, particularly in the systematic review process. Systematic reviews are described as rigorous and time-consuming processes requiring a high degree of completeness. While several AI tools have been developed to handle key milestones in this process, questions remain about their effectiveness compared with traditional, human-conducted searching.

Elicit is highlighted as one such AI tool, based on language models like GPT-3, which functions as a powerful research assistant. It distinguishes itself by its ability to generate a summary in response to the research question asked. Elicit aims to identify the most relevant papers using semantic similarity across multiple databases (though it primarily relies on Semantic Scholar) and then summarizes the question by analyzing each abstract. It also offers features like filters for keywords, article type, publication years, and organization of results by population, intervention, or study methodology. Elicit operates as a process-based system, retrieving, analyzing, and summarizing the eight articles most likely to answer a given question.

The aim of this study was to determine whether AI-assisted research using Elicit adds value to the systematic review process compared to traditional screening methods. To achieve this, the researchers compared results from an umbrella review conducted independently of AI with those obtained through AI-based searching using Elicit.

Elicit’s contribution was assessed based on three criteria:

  • Repeatability: Defined as Elicit’s ability to provide consistent results under identical conditions but at different times. The search process was replicated three times.
  • Accuracy: Defined as Elicit’s ability to retrieve relevant articles. Articles obtained via Elicit were reviewed using the same inclusion/exclusion criteria as the umbrella review.
  • Reliability: Defined as the agreement between Elicit-assisted screening and the classical screening method. This was assessed by comparing the number of publications found by both methods.
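The three criteria above can be operationalized as simple set comparisons over article identifiers. The sketch below is illustrative only (not the authors' code), and every article ID in it is a hypothetical placeholder:

```python
# Illustrative sketch: the study's three criteria expressed as set
# comparisons over article identifiers. All IDs are hypothetical.

def repeatability(trials: list[set[str]]) -> bool:
    """Consistent results under identical conditions: every trial
    returns the same set of articles."""
    return all(t == trials[0] for t in trials)

def accuracy(retrieved: set[str], relevant: set[str]) -> float:
    """Share of retrieved articles that satisfy the inclusion criteria."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def reliability(ai_included: set[str], classic_included: set[str]) -> float:
    """Agreement between AI-assisted and classical screening: the fraction
    of classically included studies that the AI method also found."""
    return len(ai_included & classic_included) / len(classic_included)

# Toy example with made-up IDs:
trials = [{"a1", "a2", "a3"}, {"a1", "a2"}, {"a1", "a2", "a3"}]
print(repeatability(trials))  # False: the second trial differs
print(accuracy({"a1", "a2", "a3"}, {"a1"}))        # 1 of 3 retrieved is relevant
print(reliability({"a1", "a2"}, {"a1", "a4", "a5"}))  # 1 of 3 classic studies found
```

Framing the criteria this way makes each one a single, checkable quantity, which is how the study's own counts (articles per trial, articles included, articles in common) are reported.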

Key Findings:

  • Repeatability: The repeatability test showed varied results across trials, with 246 results in trial 1, 169 in trial 2, and 172 in trial 3. After pooling and removing duplicates, 241 articles were identified. The study noted a lack of repeatability; trial 2, for example, found only 4 of the 6 included studies.
  • Accuracy: Six articles from Elicit’s search were included at the conclusion of the selection process. The study found that Elicit’s accuracy needs improvement, as many articles (206) were excluded due to inappropriate outcomes or interventions. Elicit’s accuracy is also influenced by the formulation of the research question.
  • Reliability: When comparing Elicit’s results with the AI-independent umbrella review, 3 common articles were identified, representing 17.6% of the studies finally included by the classic screening method. Additionally, 3 articles were exclusively identified by Elicit, while 17 articles were exclusively identified by the AI-independent umbrella review search. This indicated a lack of reliability when compared to the classical screening method.
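The pooling/deduplication and method-comparison steps behind these counts can be sketched as follows. This is a hypothetical illustration, not the authors' data or code; the article IDs and helper names are invented:

```python
# Illustrative sketch of (1) pooling repeated trials and removing
# duplicates, and (2) comparing two screening methods. Article IDs
# are hypothetical placeholders.

def pool_and_dedupe(trials: list[list[str]]) -> list[str]:
    """Merge results from repeated trials, keeping each article once
    (first occurrence wins) -- the step the study uses to reduce
    246 + 169 + 172 raw hits to 241 unique articles."""
    seen: set[str] = set()
    pooled: list[str] = []
    for trial in trials:
        for article_id in trial:
            if article_id not in seen:
                seen.add(article_id)
                pooled.append(article_id)
    return pooled

def compare_methods(elicit: set[str], classic: set[str]) -> tuple[int, int, int]:
    """Counts of (common, Elicit-only, classic-only) articles."""
    return (len(elicit & classic), len(elicit - classic), len(classic - elicit))

trials = [["a1", "a2"], ["a2", "a3"], ["a1", "a3", "a4"]]
print(pool_and_dedupe(trials))                      # ['a1', 'a2', 'a3', 'a4']
print(compare_methods({"a1", "a2", "a3"}, {"a1", "a5"}))  # (1, 2, 1)
```

The same three counts (common, Elicit-only, classic-only) are exactly what the reliability comparison above reports.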

The study highlighted several limitations of AI tools like Elicit in the systematic review process:

  • Lack of repeatability and reliability.
  • Limited search functionality: Elicit does not use a traditional keyword-based search, necessitating manual sorting by year for comprehensiveness.
  • Single database reliance: Elicit relies on articles referenced in a single database (Semantic Scholar), limiting comprehensiveness.
  • Reduced precision and nuance: The small number of articles found by Elicit reduced the level of precision and nuance, key objectives of systematic reviews.
  • Accuracy issues: Many articles were excluded by Elicit due to inappropriate outcomes or interventions.
  • Problematic referencing: Elicit incorrectly cited the umbrella review protocol, emphasizing the need for human control for referencing.
  • Research noise: Variations in the number of articles found per study across trials suggest “research noise” impacting accuracy and reliability.

Implications and Recommendations: Despite these limitations, the article concludes that AI research assistants, such as Elicit, can serve as valuable complementary tools for researchers during systematic literature review processes. They have not yet reached a level of development where they can fully replace traditional approaches. AI can support specific tasks like data screening, extraction, or risk of bias assessment to improve comprehensiveness and accuracy and to mitigate human biases. When combined with human analysis, AI tools can improve time efficiency and rigor.

However, the limitations underscore the critical need to maintain human oversight when using these tools. The authors propose several principles for methodological rigor and integrity when using AI:

  • Acknowledge that these tools lack evidence on their validity, reliability, and accuracy.
  • Use AI tools only at certain stages of the systematic review process, not for automating the entire process.
  • For transparency and reproducibility, it is crucial to mention the use of AI tools in the methodology section.
  • Implement the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) AI reporting guidelines, which are currently under development, to provide a framework for AI use.

In essence, while Elicit shows potential for enhancing comprehensiveness by identifying new articles that might otherwise be missed, its current performance demonstrates a lack of repeatability and reliability, indicating that AI tools are powerful assistants but require careful human supervision and integration into established systematic review methodologies.

Reference: Bernard, N., Sagawa, Y., Jr., Bier, N., Lihoreau, T., Pazart, L., & Tannou, T. (2025). Using artificial intelligence for systematic review: the example of elicit. BMC Medical Research Methodology, 25, Article 75. https://doi.org/10.1186/s12874-025-02528-y

Podcast Link

https://notebooklm.google.com/notebook/1ccc5b9a-dd89-4feb-9139-42a4cad96568/audio
