This article investigates whether AI-based large language models provide the same quality of patient education across all languages, specifically testing Turkish as a language with relatively low representation in medical datasets. The research focuses on two chronic, immune-mediated diseases, psoriasis and psoriatic arthritis, which require lifelong management, treatment adherence, and lifestyle adjustments. Because these factors make clear and patient-centered education clinically critical, the authors, Atilan and Cetin (2026), argue that while these models appear to be fast and scalable tools, they should not be deployed in low-resource medical languages like Turkish without being systematically evaluated for readability, scientific accuracy, and patient-centered communication.
The study establishes its problem by highlighting a significant gap in the existing literature: most publications on the performance of large language models are based on English texts. Turkish, by contrast, is an agglutinative and morphologically rich language in which tokenization, the parsing of medical terms, and contextual interpretation are more error-prone, so a model that appears effective in English may not achieve the same reliability or alignment with patient literacy in Turkish. Given that psoriasis and psoriatic arthritis involve visible lesions, stigma, and significant psychosocial impacts such as anxiety and depression, the authors emphasize that educational materials must provide more than medical information; they must also incorporate psychosocial frameworks and the language of shared decision-making.
The methodology is designed as a comparative mapping of current capabilities rather than a simple generation of evidence. Seven models were selected: ChatGPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, Grok 3, Qwen 2.5, DeepSeek R1, and Mistral Large 2, each tested through its official web interface with default settings and a zero-shot approach. To ensure a fair comparison, a structured prompt set of 40 questions was used to generate brochures for both diseases. The brochures were then scored with the Ateşman index for Turkish readability and the DISCERN tool for the reliability and quality of treatment information, and the average word count per question was calculated as a measure of information density. The evaluations were performed blind by two dermatologists and then compared statistically.
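The article reports Ateşman scores without reproducing the formula, so a brief sketch may help. The coefficients below are the published Ateşman values; the vowel-counting syllable heuristic and the regex tokenization are simplifying assumptions for illustration, not the authors' implementation:

```python
import re

# Turkish vowels: in standard Turkish orthography every syllable contains
# exactly one vowel, so counting vowels approximates the syllable count.
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_score(text: str) -> float:
    """Ateşman readability index for Turkish text.

    score = 198.825 - 40.175 * (syllables / words) - 2.610 * (words / sentences)
    Higher means easier: 70-89 reads as easy, 50-69 as medium difficulty.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(ch in TURKISH_VOWELS for ch in "".join(words))
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))
```

The same tokenization also yields the study's information-density metric: `len(words)` averaged over the 40 questions.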
The findings indicate that there is no single best model; each occupies a different balance point between readability and scientific reliability. Readability scores ranged from 61.6 to 80.2, meaning some models produced text that reads as easy while others remained at a medium difficulty level (on the Ateşman scale, 50-69 corresponds to medium difficulty and 70-89 to easy text). ChatGPT-4o and Qwen 2.5 stood out for readability, whereas Gemini and Mistral produced more complex structures. Notably, shorter texts were not necessarily more understandable: Mistral produced relatively short responses yet remained weak in readability, suggesting that the issue lies more in syntax and structural clarity than in word count alone.
Information density also revealed distinct communication styles among the models. Gemini and Claude produced the most detailed, information-heavy texts, while ChatGPT and Mistral produced much shorter responses. This points to a core tension in patient education: high-density texts may lead patients with lower health literacy to abandon the material, while overly brief texts may omit clinical details and the necessary balance of risks and benefits. Quantifying this gap between depth and accessibility is one of the article's most valuable contributions.
When scientific reliability and the quality of treatment information were evaluated with DISCERN, overall quality was moderate but variable. Claude 3.7 Sonnet and Gemini 2.0 Flash achieved the highest quality scores, followed by Grok in the upper-middle range, with ChatGPT, DeepSeek, and Qwen in the middle; Mistral performed significantly worse than the others. A major weakness across nearly all models was the failure to cite specific sources or to date the information, both of which are essential for building trust in health education. While the models were skilled at stating clear goals and producing relevant content, they struggled when asked for the evidence behind their claims.
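The paper does not publish a scoring pipeline, but the instrument's structure is standard: DISCERN comprises 15 questions plus an overall rating, each scored 1 to 5. A minimal sketch of how two blinded raters' scores might be combined, with hypothetical variable names, could look like this:

```python
from statistics import mean

def discern_total(item_scores: list[int]) -> int:
    """Sum one rater's 16 DISCERN items (Q1-8 reliability, Q9-15 treatment
    information, Q16 overall quality), each scored 1-5, giving 16-80."""
    if len(item_scores) != 16 or any(not 1 <= s <= 5 for s in item_scores):
        raise ValueError("expected 16 item scores, each between 1 and 5")
    return sum(item_scores)

def brochure_quality(rater_a: list[int], rater_b: list[int]) -> float:
    """Average the two blinded raters' totals for one brochure."""
    return mean([discern_total(rater_a), discern_total(rater_b)])
```

Reliability items such as "Is it clear what sources of information were used?" and "Is it clear when the information was produced?" are exactly where the models lost points, which explains why the absence of citations and dates depressed the overall DISCERN totals.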
The study also specifically examined patient-centered communication. Claude and ChatGPT performed relatively well here, as they were able to move beyond listing symptoms and treatments to discuss quality of life, emotional burdens, and the importance of patient participation in decision-making. Models like Qwen, Grok, and Mistral remained focused on biomedical symptoms, largely neglecting the psychosocial dimension. This suggests that scientific accuracy does not automatically equate to patient-centeredness; a scientifically strong text may still fail to support behavioral change if it ignores the patient’s perspective.
In the discussion, the authors suggest that the linguistic features of Turkish likely cause fluctuations in how these models parse medical terms and context. Their clinical recommendation is clear: none of these models should be used to produce patient brochures without supervision. Instead, they propose a workflow built around clinician verification and quality assurance, with model selection driven by the specific goal, as sketched below: if readability is the priority, ChatGPT-style models should be paired with factual verification; if scientific rigor is the priority, Claude or Gemini should be used and the language subsequently simplified.
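Purely as an illustration of that selection rule, and not as anything the paper implements, the workflow could be encoded like this:

```python
def recommend_pipeline(priority: str) -> dict[str, list[str]]:
    """Toy encoding of the goal-dependent workflow: every path ends with
    clinician sign-off; only generation and post-processing steps differ."""
    if priority == "readability":
        return {"generate_with": ["ChatGPT-4o", "Qwen 2.5"],
                "post_process": ["verify facts against clinical guidelines",
                                 "add sources and dates"],
                "final_step": ["clinician sign-off"]}
    if priority == "scientific_rigor":
        return {"generate_with": ["Claude 3.7 Sonnet", "Gemini 2.0 Flash"],
                "post_process": ["simplify language for patient literacy",
                                 "add sources and dates"],
                "final_step": ["clinician sign-off"]}
    raise ValueError("priority must be 'readability' or 'scientific_rigor'")
```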
The authors acknowledge several limitations, noting that only one output was generated per model and the evaluation occurred at a single point in time, which may not capture the inherent randomness or the rapid updates of these AI systems. Furthermore, the outputs were not directly compared to human-written official Turkish patient materials. Despite these limitations, the study serves as an important early warning and framework. It suggests that in high-stakes environments like Turkish medical education, these models act as talented storytellers but inconsistent carriers of evidence. Without a professional pipeline to verify and simplify their output, using them for health communication remains a significant risk.
Reference: Atilan, A. U., & Cetin, N. (2026). An old disease, a new linguistic challenge for large language models: Patient education on psoriasis and psoriatic arthritis in an underrepresented medical language. International Journal of Medical Informatics, 209, 106246. https://doi.org/10.1016/j.ijmedinf.2025.106246
