This pioneering study, “Can a Large Language Model Judge a Child’s Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment,” addresses a critical challenge in legal proceedings: assessing the credibility of witness statements, particularly those of children who have experienced sexual abuse. Authored by Zeki Karataş of Recep Tayyip Erdoğan University and published in the Journal of Evidence-Based Social Work, it investigates the inter-rater reliability (IRR) between human experts and an advanced large language model (LLM), ChatGPT (GPT-4o Plus), when both evaluate such statements within the established Criteria-Based Content Analysis (CBCA) framework. Child sexual abuse cases frequently lack physical evidence and often hinge on a “statement-versus-statement” dilemma, so accurate and objective assessment of a child’s testimony is essential to avoid false positive or false negative decisions, either of which can be traumatic.
The study emphasizes that while “credibility” can refer to truthfulness, in this context, it signifies the believability of a statement, acknowledging that children’s accounts may not be entirely accurate due to factors like suggestibility, developmental limitations, or inadequate narrative skills. Therefore, the imperative is to move beyond intuitive assessments towards evidence-based, objective, and valid methods.
The CBCA Framework and its Challenges
The Criteria-Based Content Analysis (CBCA) is one of the most widely accepted and empirically studied methods for assessing statement credibility, particularly in alleged child sexual abuse cases. It forms the analytical core of the broader Statement Validity Assessment (SVA) process, which holistically evaluates testimony by examining case documentation, interview conditions, and contextual factors. CBCA’s theoretical foundation is the “Undeutsch Hypothesis,” positing that genuine personal experiences yield qualitatively and quantitatively richer and more complex narratives than fabricated accounts.
The method involves coding an interview transcript for the presence of 19 specific criteria, standardized by Steller and Koehnken (1989), grouped into categories such as General Characteristics, Specific Contents, and Motivation-Related Contents. The presence of these criteria supports a statement’s credibility, although their absence does not automatically imply deception, as narrative quality can be affected by a child’s age, trauma, or developmental stage.
Despite its widespread use, CBCA is not without significant challenges, most notably inter-rater reliability (IRR). While overall CBCA scores often show satisfactory IRR, this consistency frequently masks poor agreement on individual criteria, especially those requiring subjective interpretation like “unusual details” or “details characteristic of the offense”. This issue is compounded in multidisciplinary settings, highlighting a critical need for refining the tool or exploring novel approaches like AI to enhance consistency.
The Role of Artificial Intelligence and the Study’s Contribution
Recent advancements in artificial intelligence (AI), particularly Natural Language Processing (NLP) and Large Language Models (LLMs) like ChatGPT, offer potential for analyzing large data volumes and conducting complex evaluations. This study’s unique contribution is its pioneering incorporation of an advanced LLM as a rater alongside traditional human evaluators (a forensic psychologist and a social worker) in the CBCA of child sexual abuse statements. The research seeks to determine if AI can help resolve consistency issues among human raters or add a standardized dimension to the evaluation process, while also acknowledging risks such as model biases, lack of transparency, and inability to process non-verbal cues. The study positions AI not as a replacement for human experts, but as a powerful cognitive tool to augment their capabilities.
Methodology
This quantitative study employed a comparative research design to examine inter-rater reliability.
- Sample: The study utilized 65 anonymized transcripts of forensic interviews with child sexual abuse victims (48 girls, 17 boys, aged 7-17, M=14.6, SD=2.37) from judicial interview units in the Black Sea region of Turkey. Cases with intellectual disabilities, psychiatric conditions, substantial statement changes, or insufficient audio/video quality were excluded.
- Raters: Three independent raters evaluated each statement. The human raters evaluated statements independently, without additional case information or discussion.
  - A forensic psychologist specializing in children and adolescents, trained in CBCA.
  - A social worker with experience in forensic social work and CBCA methodology.
  - An artificial intelligence model, ChatGPT (GPT-4o Plus), an LLM developed by OpenAI.
- AI Model Training and Implementation: The methodology relied on prompt engineering (rather than fine-tuning) with the publicly available ChatGPT. The model was supplied with foundational CBCA literature, detailed explanations of the 19 criteria, application principles, and scoring guidelines. To enhance accuracy, preliminary testing involved corrective feedback for misinterpretations, emphasizing contextual sensitivity. ChatGPT was specifically instructed to provide direct excerpts from transcripts to support its scoring, ensuring transparency. The model’s reliability was cross-validated using Google AI Studio, yielding similar results.
- CBCA Scoring: Each of the 19 criteria was scored on a three-point scale: 0 (absent), 1 (partially/weakly present), or 2 (strongly/clearly present).
- Statistical Analysis: Inter-rater reliability was rigorously assessed using multiple methods, including Intraclass Correlation Coefficient (ICC), Cohen’s Kappa (κ), Kendall’s W Coefficient of Concordance, Krippendorff’s Alpha (α), and Percent Agreement.
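Two of the agreement statistics listed above, percent agreement and Cohen's kappa, are simple enough to sketch directly. The example below is a minimal illustration, not the study's analysis code: the summary does not say which software was used or whether kappa was weighted, so the unweighted form is an assumption, and the scores and function names are hypothetical.

```python
from collections import Counter

# Three-point CBCA scale: 0 = absent, 1 = partially present, 2 = clearly present.
SCALE = (0, 1, 2)


def percent_agreement(a, b):
    """Share of transcripts on which two raters gave the identical score."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of transcripts")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance,
    given how often each rater used each score.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[c] * freq_b[c] for c in SCALE) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical score throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores for one criterion across eight transcripts (not study data).
psychologist = [2, 2, 1, 0, 2, 1, 0, 2]
social_worker = [2, 2, 1, 0, 2, 1, 1, 2]
ai_model = [0, 1, 2, 2, 0, 2, 1, 1]

print("human-human agreement:", percent_agreement(psychologist, social_worker))
print("human-human kappa:", cohens_kappa(psychologist, social_worker))
print("human-AI kappa:", cohens_kappa(psychologist, ai_model))
```

With these made-up scores the two human raters agree on seven of eight transcripts (kappa of about 0.8), while the human-AI pair never agrees and its kappa is negative. Because the scale is ordinal, a weighted kappa that gives partial credit for near misses would also be a reasonable choice.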
Key Findings
The results highlight a stark contrast between human-human and human-AI reliability.
- Human-Human Reliability: A high degree of inter-rater reliability was found between the forensic psychologist and the social worker. For 15 of 19 criteria, ICC values were > .75, indicating “good” to “excellent” agreement. High agreement was observed for criteria like “Admitting lack of memory” (ICC = .959). However, “Logical structure” (ICC = .342) and “Details characteristic of the offense” (ICC = .351) showed poor to non-existent agreement, with negative Krippendorff’s Alpha for the latter, indicating systematic disagreement on these subjective criteria.
- Human-AI Reliability (Main Finding): Agreement between the AI model and the human experts was markedly lower than agreement between the two human experts.
  - For the majority of CBCA criteria, reliability coefficients fell into the “poor” range.
  - Negative Kappa and ICC values were observed for several criteria, such as “Logical structure” (CBCA 1), “Unstructured production” (CBCA 2), “Quantity of details” (CBCA 3), “Superfluous details” (CBCA 9), “Details misunderstood” (CBCA 10), and “Self-deprecation” (CBCA 17). These values indicate that the AI’s agreement with human experts was often worse than chance, reflecting a fundamental difference in evaluation logic.
  - The AI showed moderate success with a few more concrete criteria, such as “Contextual embedding” (CBCA 4), “Reproduction of conversation” (CBCA 6), “Accounts of subjective mental state” (CBCA 12), and “Attribution of perpetrator’s mental state” (CBCA 13). The study interprets this as sophisticated pattern recognition and lexical classification rather than genuine contextual or psychological understanding.
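A negative ICC, as reported for several criteria above, can look surprising, so a small worked sketch may help. The code below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from the usual ANOVA mean squares. The summary does not state which ICC form the study used, so that choice is an assumption, as are the interpretation bands (a common convention: below .50 poor, .50 to .75 moderate, .75 to .90 good, above .90 excellent). All scores and names are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per transcript and one column per rater."""
    n = len(ratings)      # number of transcripts
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 transcripts and 2 raters, no missing scores")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for transcripts (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        return float("nan")  # no variance at all, the ICC is undefined
    return (msr - mse) / denom


def interpret_icc(value):
    """Label an ICC using the assumed bands described in the lead-in."""
    if value != value:  # NaN check
        return "undefined"
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value < 0.9:
        return "good"
    return "excellent"


# Hypothetical scores for one criterion across eight transcripts (not study data).
psychologist = [2, 2, 1, 0, 2, 1, 0, 2]
social_worker = [2, 2, 1, 0, 2, 1, 1, 2]
ai_model = [0, 1, 2, 2, 0, 2, 1, 1]

human_pair = list(zip(psychologist, social_worker))
human_vs_ai = list(zip(psychologist, ai_model))

for name, table in (("human-human", human_pair), ("human-AI", human_vs_ai)):
    value = icc_2_1(table)
    print(name, round(value, 3), interpret_icc(value))
```

The ICC turns negative when the disagreement between raters on the same transcript (the error mean square) exceeds the variation between transcripts (the row mean square). In the made-up human-AI table the two raters tend to score the same transcript in opposite directions, which is the pattern a worse-than-chance coefficient reflects.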
Discussion and Implications
The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM. The study concludes that the current prompt-engineered LLM cannot reliably replicate expert judgment in the complex task of credibility assessment. Implicit meanings, cultural references, and psychological dynamics, which human experts intuitively interpret, appear beyond the current capabilities of such models.
Therefore, the study advocates for a realistic repositioning of AI’s role in forensic evaluation: not as an autonomous “judge,” but as a “cognitive assistant” to support expert workflows. In this assistive capacity, AI could serve as an efficiency tool by rapidly highlighting relevant sections in lengthy transcripts. It could also function as a standardization aid or “second reader,” flagging potential contradictions or overlooked details to reduce individual rater variance and encourage thoroughness. Furthermore, AI holds promise as an interactive training platform, providing clear examples of CBCA criteria for novice practitioners.
Crucially, the study emphasizes that the ultimate responsibility for interpretation, contextualization, and final decision-making must remain firmly with human experts. The professional’s ability to weigh the broader case context, non-verbal cues (absent from transcripts), and the human dimensions of a child’s testimony remains irreplaceable.
Limitations and Future Research
Several limitations should be considered:
- The study focused on a single, prompt-engineered LLM (ChatGPT), meaning findings may not generalize to other AI architectures or fine-tuned models.
- The sample size (N=65), while adequate, may limit statistical power for rarely observed criteria.
- Exclusive reliance on written transcripts meant vital non-verbal cues were excluded.
- The cultural context of Turkey and the applicability of a Western-developed CBCA to Turkish children’s narrative styles remain unexamined.
- The use of a commercial, cloud-based LLM without a secure “sandbox” environment presents ethical and data security concerns.
Future research should prioritize testing fine-tuned or purpose-built AI models trained on large, diverse datasets, incorporating multimodal AI systems capable of analyzing text, audio, and video, and investigating AI’s ability to adapt to cultural nuances. Furthermore, the development of robust ethical frameworks and technical safeguards (e.g., data governance, algorithmic transparency, secure “sandbox” environments) must be a collaborative effort between AI experts, legal professionals, and social work practitioners to ensure responsible exploration of these powerful technologies.
In conclusion, while prompt-engineered LLMs can recognize surface-level patterns, they currently struggle with the deep psychological and contextual nuances essential for interpreting child testimony. The assessment of such sensitive testimony is a profoundly human endeavor requiring expert judgment, empathy, and contextual awareness, capabilities that remain beyond current accessible AI tools. AI’s future promise in this field lies in wisely and responsibly supporting, not replacing, human expertise.
Reference: Karataş, Z. (2025). Can a Large Language Model Judge a Child’s Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment. Journal of Evidence-Based Social Work. https://doi.org/10.1080/26408066.2025.2547211
