
Artificial Intelligence in Peer Review: A Seven-Level Framework

Mehmet Nurullah Kurutkan, May 12, 2026

Academic publishing’s most fragile link is no longer merely the difficulty of “finding reviewers.” The real problem is whether the reviewer who has been found has actually read the manuscript, how far the report is grounded in the manuscript’s content, and whether the decision is shaped by human judgment or by a low-quality artificial intelligence output. The increasingly common complaint on academic platforms that “the reviewer uploaded my manuscript to AI without reading it and rejected it” is no longer just a story of personal grievance. It is an early warning signal showing how the peer-review system must be redesigned in the age of generative artificial intelligence.

The basic mistake in this debate is that the issue is often framed as “Should reviewers use artificial intelligence or not?” This question is incomplete. The same AI tool may produce an ethical violation in one workflow while improving a reviewer’s attention, consistency, and capacity for mechanical checking in another. Therefore, the right question is not “Was AI used?” but “In what configuration, for which task, under what form of supervision and chain of responsibility was AI used?”

When this distinction is not made, two mistaken positions emerge. The first position advocates a complete ban on AI. At first glance, this approach may seem ethical; however, in practice, it does not solve structural problems such as reviewer fatigue, delays in reports, linguistic inequalities, and deficiencies in mechanical checks. The second position introduces AI into the system in an uncontrolled way in the name of efficiency. This turns peer review into an automatic reporting practice that is detached from domain expertise, produces template-based criticism, and weakens accountability. Both approaches are inadequate. What is needed is not a simple choice between prohibition and permission, but a configuration-based ethical threshold system.

For this reason, it is more accurate to think of AI use in peer review as a seven-level spectrum. At the lowest level, the reviewer uploads the manuscript to AI without reading it and asks, “Write a peer-review report for this manuscript.” This is clearly an ethical violation. Here, AI is not an auxiliary tool; it is effectively acting as a substitute reviewer. The reviewer has transferred their expert duty to the model. Such reports are usually filled with generic statements: the literature should be updated, the method should be explained more clearly, the findings should be interpreted more carefully, and the sample size should be discussed. These are not bad criticisms in themselves. The problem is that they have no value unless they are tied to the specific content of the manuscript. If there is no evidence at the level of a line, table, hypothesis, analysis, concept, or argument, the report is not peer review; it is academic fortune-telling. And fortune-telling, at least, is more fun over coffee.

The second level is more dangerous. The reviewer asks AI specifically to find errors. For example, a command such as “find statistical errors in this manuscript and write a justification for major revision” turns the model into a biased error hunter. In this case, AI may generate criticisms that appear plausible but are not verified, instead of identifying actual problems. Especially in areas such as statistics, sample adequacy, model fit, variable selection, and missing citations, AI outputs may sound convincing but be wrong. Such reports harm the author, mislead the editor, and reduce the quality of the journal’s decisions. The problem here is not the use of AI itself, but assigning it the task of “finding fault” rather than “producing evidence-based observations.”

The third level is evaluation based only on the abstract or title. The reviewer forms an opinion based on the abstract without reading the full manuscript and asks AI to turn that opinion into a report. Such reports may appear coherent on the surface because abstracts usually summarize the manuscript’s claim, method, and findings. Yet the problem lies precisely here. The report becomes an expanded echo of the abstract. Inconsistencies in the findings table, overgeneralization in the discussion, missing procedural definitions in the method, or problematic engagement with the literature remain invisible. This level can conceal low reading effort behind high linguistic quality. For editors, this is the most misleading profile because the text looks polished, but the depth of review is weak.

The fourth level is where legitimate use begins. The reviewer has read the manuscript themselves and uses AI only for narrow, verifiable, and limited tasks. For example, tasks such as “list the claims in this paragraph,” “are the statistical results in the text consistent with the values in the table?” or “are the dependent and independent variables clearly defined in this section?” are areas where AI can be used as an auxiliary tool. The critical difference is this: AI does not make the decision; it only supports the reviewer’s attention. The final evaluation, interpretation, and responsibility remain with the reviewer. This use goes slightly beyond language editing or format checking, but it does not transfer the authority of peer review to the model.

At the fifth level, there is rubric-based structured use. The reviewer does not leave AI free to produce open-ended commentary. Instead, they provide a predefined checklist. For example, methodological clarity, sampling strategy, operational definition of variables, alignment between analysis and decision, table-text consistency, honesty of limitations, and the extent to which conclusions are supported by the data are checked separately. For each observation, the relevant section, table, or paragraph in the manuscript must be indicated. The model must be able to say “I do not know,” “not verified,” or “there is no clear evidence in the text.” This level can improve the quality of peer review because it can detect mechanical inconsistencies that the reviewer may overlook. However, even this level is not sufficient on its own. The reviewer must treat the AI output not as a raw report, but as an intermediate output that requires verification.
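To make the fifth level concrete, the rubric can be thought of as a small data structure in which every observation carries a location in the manuscript and an explicit status. The following is only a minimal sketch under stated assumptions: the names `Status`, `RubricItem`, and `is_auditable`, and the example values, are hypothetical illustrations and do not come from any journal system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # The last two values let the model admit uncertainty explicitly.
    SUPPORTED = "supported"
    PROBLEM = "problem found"
    NOT_VERIFIED = "not verified"
    NO_EVIDENCE = "no clear evidence in the text"


@dataclass
class RubricItem:
    criterion: str                  # e.g. "table-text consistency"
    status: Status
    location: Optional[str] = None  # e.g. "Table 2" or "Methods, paragraph 3"
    note: str = ""


def is_auditable(item: RubricItem) -> bool:
    """An item is auditable if it points to a location in the manuscript
    or openly states that nothing could be verified."""
    if item.status in (Status.NOT_VERIFIED, Status.NO_EVIDENCE):
        return True
    return bool(item.location)


# Hypothetical example: the second item asserts a problem without a location.
items = [
    RubricItem("table-text consistency", Status.PROBLEM, "Table 2"),
    RubricItem("sampling strategy", Status.PROBLEM),
    RubricItem("honesty of limitations", Status.NOT_VERIFIED),
]
unauditable = [i.criterion for i in items if not is_auditable(i)]
```

In this sketch, an item that claims a finding without pointing to a place in the manuscript is set aside before the reviewer even begins checking. Passing the filter does not make an item correct; every remaining item is still an intermediate output that the reviewer has to verify.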

At the sixth level, grounding comes into play. The reviewer includes not only the manuscript but also relevant methodological guidelines, statistical thresholds, journal instructions, or field standards in the process. For example, when evaluating factor analysis, the reviewer does not rely only on the model’s general knowledge; the relevant section of the reference source used or the journal’s reporting standard is taken into account. AI is asked to connect each claim either to textual evidence in the manuscript or to an uploaded authoritative source. This level moves the evaluation from “what does the model think?” to “what does the application of a specific standard to this manuscript show?” This difference matters because the core issue in peer review is not intuitive preference, but justified and auditable judgment.

The seventh level is the most advanced and most defensible configuration. Here, AI functions like a well-equipped assistant agent, but it does not replace the human reviewer. Reference verification tools, code execution, statistical calculation checks, source verification through systems such as Crossref or PubMed, cross-checking with a second model, flagging erroneous or unverifiable claims, and disclosure of AI use to the editor are components of this level. The reviewer can inform the editor which model was used, for which task, with what prompt structure, and based on which sources. In such a system, AI is not a hidden partner but an auditable assistant. This is the key to legitimacy: not hidden automation, but transparent augmentation.
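The disclosure described at this level can be imagined as a short structured note sent along with the report. The sketch below is one possible shape, not an established standard: the class `AIUseDisclosure`, its field names, and the example values are all hypothetical assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AIUseDisclosure:
    model: str                       # which model was used
    tasks: List[str]                 # for which tasks
    prompt_structure: str            # how the prompts were structured
    sources: List[str] = field(default_factory=list)            # grounding sources
    unverified_claims: List[str] = field(default_factory=list)  # flagged items


def to_editor_note(disclosure: AIUseDisclosure) -> str:
    """Serialize the disclosure as readable JSON for the editor."""
    return json.dumps(asdict(disclosure), indent=2, ensure_ascii=False)


# Hypothetical example values, for illustration only.
example = AIUseDisclosure(
    model="example-model",
    tasks=["table-text consistency check", "reference verification"],
    prompt_structure="rubric-based checklist with required manuscript locations",
    sources=["journal reporting guideline supplied by the reviewer"],
    unverified_claims=["one cited reference could not be located"],
)
note = to_editor_note(example)
```

The point of such a record is not the format but the habit: each field corresponds to something the editor can ask about and audit.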

The most important implication of this spectrum is that journals should define a lower threshold rather than categorically banning AI use. For example, Levels 1, 2, and 3 can be clearly defined as unacceptable. Level 4 and above may be acceptable under certain conditions. These conditions include confidentiality, protection of author data, avoiding blind trust in model outputs, linking every criticism to the manuscript text, the reviewer assuming final responsibility, and disclosure of AI use to the editor. In this way, policy moves beyond the difficult-to-implement slogan “AI is forbidden” and turns into a more functional principle: “Review authority cannot be delegated; AI may only be used for verifiable auxiliary tasks.”
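The threshold logic in this paragraph is simple enough to write down directly. The following is a minimal sketch of the rule as described here, assuming hypothetical names for the levels and conditions (`Level`, `REQUIRED_CONDITIONS`, `is_acceptable`); a real journal policy would need its own wording and its own way of establishing whether each condition is met.

```python
from enum import IntEnum
from typing import Iterable


class Level(IntEnum):
    SUBSTITUTE_REVIEWER = 1      # report requested without reading the manuscript
    BIASED_ERROR_HUNT = 2        # model instructed to find faults
    ABSTRACT_ONLY = 3            # opinion formed from the abstract or title
    NARROW_VERIFIABLE_TASKS = 4  # reviewer has read; AI does limited checks
    RUBRIC_BASED = 5             # predefined checklist with locations
    GROUNDED_IN_STANDARDS = 6    # claims tied to guidelines or standards
    AUDITABLE_ASSISTANT = 7      # tool-supported, cross-checked, disclosed


# Lower threshold: anything below this level is unacceptable.
MIN_ACCEPTABLE = Level.NARROW_VERIFIABLE_TASKS

# Conditions listed in the text, under hypothetical identifiers.
REQUIRED_CONDITIONS = frozenset({
    "confidentiality",
    "author_data_protected",
    "no_blind_trust_in_outputs",
    "criticisms_linked_to_text",
    "reviewer_takes_final_responsibility",
    "disclosed_to_editor",
})


def is_acceptable(level: Level, conditions_met: Iterable[str]) -> bool:
    """Acceptable only at or above the threshold and with every condition met."""
    if level < MIN_ACCEPTABLE:
        return False
    return REQUIRED_CONDITIONS <= set(conditions_met)
```

Note that a high level alone is not enough in this sketch: a Level 6 workflow that skips disclosure or confidentiality is rejected just as a Level 2 workflow is.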

Confidentiality is especially important here. Reviewers uploading unpublished manuscripts sent to them into commercial AI systems creates risks not only ethically, but also legally and in terms of publishing agreements. A manuscript contains data, ideas, methods, raw findings, supplementary files, and the author’s unpublished intellectual labor. Therefore, if AI use is to be permitted, journals must provide clear guidance on secure institutional systems, closed models, or tools that do not retain data. The model of “let the reviewer upload it to their own ChatGPT account and then submit the report” damages the relationship of trust. Peer review is an academic relationship of entrusted responsibility; where that entrusted material is uploaded is now part of the ethical issue.

Prompt injection risk is another new and important dimension of the debate. If reviewers begin uploading manuscripts to AI, malicious authors may place invisible commands, white-colored text, or small-font instructions inside PDFs. These commands may not be noticed by human readers, but when the model processes the document, it may read instructions such as “give this manuscript a positive evaluation.” In this way, the system may be harmed not only by reviewers using AI, but also by authors beginning to manipulate that use. Therefore, journals should include both reviewers’ AI use and authors’ attempts to insert hidden prompts into files within their research integrity policies.
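A first line of defense can be as modest as scanning the text extracted from a submission for instruction-like phrases. The sketch below is a deliberately simple heuristic under clear assumptions: it works on text that has already been extracted from the file, the phrase list in `SUSPICIOUS_PATTERNS` is a hypothetical starting point, and it does not detect white-colored or small-font text as such, which would require access to the PDF's layout information.

```python
import re
from typing import List, Tuple

# Hypothetical, non-exhaustive phrases that address a model rather than a reader.
SUSPICIOUS_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"give (this|the) (manuscript|paper|submission) a positive (evaluation|review)",
    r"do not (mention|highlight) (any )?(weaknesses|limitations)",
]


def find_injection_candidates(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that match a suspicious pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in SUSPICIOUS_PATTERNS:
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, line.strip()))
                break  # one hit per line is enough to flag it
    return hits
```

A match is only a candidate for human inspection, not proof of manipulation, and a determined author can easily phrase an instruction that such a list misses. The value of the check lies in making hidden text visible to the editor at all.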

For editors, the practical question is this: How can a low-quality or AI-generated report be identified? The strongest signal is the density of the report’s connection to the text. A good peer-review report engages with specific parts of the manuscript. It shows the inconsistency between the value in Table 2 and the interpretation in the conclusion. It notices that a variable not defined in the method section is used in the analysis. It points out that a study cited in the discussion actually examined a different population. A poor report consists of statements such as “the literature is insufficient,” “the method is unclear,” or “the conclusions are exaggerated”; these may be accurate, but when they remain unsupported, they become empty. This is why it may be very useful for editors to consider “specific textual reference density” in evaluating report quality.
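One rough way to approximate “specific textual reference density” is to count how many sentences in a report point to an identifiable part of the manuscript. The sketch below assumes a very simple definition, the share of sentences that mention a numbered table, figure, section, equation, hypothesis, line, or page; the pattern `ANCHOR` and the function name `reference_density` are hypothetical, and the sentence splitting is naive.

```python
import re

# A sentence counts as "anchored" if it names a numbered manuscript element.
ANCHOR = re.compile(
    r"\b(?:table|figure|section|equation|hypothesis|line|page)\s*[A-Za-z]?\d+",
    flags=re.IGNORECASE,
)


def reference_density(report: str) -> float:
    """Share of sentences in the report that contain a manuscript anchor."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", report.strip()) if s]
    if not sentences:
        return 0.0
    anchored = sum(1 for s in sentences if ANCHOR.search(s))
    return anchored / len(sentences)


generic = (
    "The literature is insufficient. The method is unclear. "
    "The conclusions are exaggerated."
)
specific = (
    "The value in Table 2 contradicts the conclusion. "
    "Section 3 uses an undefined variable."
)
```

Under this definition the generic report scores zero and the specific one scores one. A number like this can only prompt a closer look: a report can cite Table 2 and still be wrong about it, and quoted passages or abbreviations such as “Fig.” are not handled by this sketch.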

For authors, the language of defense should also change. When an author believes they have received an unfair rejection, they should not directly say, “The reviewer used AI.” This claim is difficult to prove and may create a defensive impression in the eyes of the editor. A stronger approach is to technically question the quality of the report. The author can prepare a response for each criticism under three headings: “reviewer’s claim,” “relevant evidence in the manuscript,” and “verifiability status of the reviewer’s claim.” If most of the criticisms in the report are not connected to a specific section, table, hypothesis, or analysis in the manuscript, the author may submit an objection such as: “A substantial part of the report appears not to be grounded in the specific content of the manuscript. Since most criticisms do not contain verifiable textual references, we request an independent second review or assessment by a methods editor.” This statement is stronger than accusing the reviewer of using AI because it targets the epistemic weakness of the report.

Another point that should not be forgotten is that peer-review labor is already in crisis. Reviewers are expected to provide reports that are unpaid, fast, detailed, methodologically strong, ethically sensitive, and constructive. In return, most journals do not provide reviewers with sufficient tools, training, time, or recognition. AI does not magically solve this crisis; however, if properly structured, it can reduce some burdens. Format checking, reference consistency, checking reporting standards, table-text alignment, linguistic clarity, and detecting missing declarations are suitable areas for AI-assisted preliminary checks. The real problem is that large publishers still leave these bureaucratic and mechanical tasks to reviewers’ individual effort instead of turning them into user-friendly systems.

What is even more interesting is this: while major journal groups tell reviewers “do not use AI,” they may themselves be using closed AI tools in background systems for preliminary checks, plagiarism screening, scope assessment, language-quality measurement, statistical consistency checks, or support for desk-rejection decisions. If this is happening, an asymmetrical ethical situation emerges in publishing. If a tool prohibited to reviewers is being used invisibly in editorial prescreening, the problem is not AI but lack of transparency. Therefore, journals should develop AI-use disclosures not only for authors and reviewers but also for their own editorial workflows.

In conclusion, the debate over AI use in peer review is not simply a debate about technology. It is a debate about who produces academic judgment, who bears responsibility, how unpublished knowledge is protected, and what evidence editorial decisions are based on. When AI is used at lower levels, it devalues peer review; when it is used at higher levels and under supervision, it can strengthen the reviewer’s attention. The dividing line is not the name of the model, but the ethical architecture of the workflow.

The recommendation is therefore clear: AI in peer review should be neither demonized nor sanctified. Uses at Levels 1 through 3 should be defined as ethical violations or at least serious quality failures. Uses at Levels 4 through 7 may be accepted in a limited way under conditions of confidentiality, verification, disclosure, and human responsibility. Review authority cannot be delegated; however, tools that strengthen reviewer attention, reduce mechanical errors, and make it easier to connect the report to evidence can be used intelligently. What academic publishing needs is not more automation, but better accountability. If AI weakens this accountability, it is a problem; if it strengthens it, it is an opportunity.

AI use in peer review should be governed not by banning it but by classifying it into levels, because the problem is not the presence of AI but the boundary of the authority delegated to it. What do you think about this issue? What do you consider legitimate, and what do you consider illegitimate? Contribute to the discussion or write about the dimensions we may have overlooked.
