This article investigates whether AI-based large language models provide the same quality of patient education across all languages, specifically testing Turkish as a language with relatively low representation in medical datasets. The research focuses on two chronic, immune-mediated diseases, psoriasis and psoriatic arthritis, which require lifelong management, treatment adherence, and lifestyle adjustments. Because these factors make clear and patient-centered education clinically critical, the authors, Atilan and Cetin (2026), argue that while these models appear to be fast and scalable tools, they should not be deployed in low-resource medical languages like Turkish without being systematically evaluated for readability, scientific accuracy, and patient-centered communication.
The study establishes its problem by highlighting a significant gap in the existing literature: most publications on the performance of large language models are based on English texts. Turkish, by contrast, is an agglutinative and morphologically rich language in which tokenization, the parsing of medical terms, and contextual interpretation are more error-prone, so a model that appears effective in English may not achieve the same reliability or alignment with patient literacy in Turkish. Given that psoriasis and psoriatic arthritis involve visible lesions, stigma, and significant psychosocial impacts such as anxiety and depression, the authors emphasize that educational materials must provide more than medical information; they must also incorporate psychosocial frameworks and the language of shared decision-making.
The methodology is designed as a comparative mapping of current capabilities rather than a simple generation of evidence. Seven models were selected: ChatGPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, Grok 3, Qwen 2.5, DeepSeek R1, and Mistral Large 2, each tested through its official web interface with default settings and a zero-shot approach. To ensure a fair comparison, a structured prompt set of 40 questions was used to generate brochures for both diseases. The brochures were then scored with the Ateşman index for Turkish readability and the DISCERN tool for the reliability and quality of treatment information, and the average word count per question was calculated as a measure of information density. The evaluations were performed blind by two dermatologists and then compared statistically.
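The article reports Ateşman scores without reproducing the formula, so a brief sketch may help. The coefficients below are the published Ateşman values; the vowel-counting syllable heuristic and the regex tokenization are simplifying assumptions for illustration, not the authors' implementation:

```python
import re

# Turkish vowels: in standard Turkish orthography every syllable contains
# exactly one vowel, so counting vowels approximates the syllable count.
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_score(text: str) -> float:
    """Ateşman readability index for Turkish text.

    score = 198.825 - 40.175 * (syllables / words) - 2.610 * (words / sentences)
    Higher means easier: 70-89 reads as easy, 50-69 as medium difficulty.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(ch in TURKISH_VOWELS for ch in "".join(words))
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))
```

The same tokenization also yields the study's information-density metric: `len(words)` averaged over the 40 questions.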
The findings indicate that there is no single best model; each occupies a different balance point between readability and scientific reliability. Readability scores ranged from 61.6 to 80.2, meaning some models produced text that reads as easy while others remained at a medium difficulty level (on the Ateşman scale, 50-69 corresponds to medium difficulty and 70-89 to easy text). ChatGPT-4o and Qwen 2.5 stood out for readability, whereas Gemini and Mistral produced more complex structures. Notably, shorter texts were not necessarily more understandable: Mistral produced relatively short responses yet remained weak in readability, suggesting that the issue lies more in syntax and structural clarity than in word count alone.
Information density also revealed distinct communication styles among the models. Gemini and Claude produced the most detailed, information-heavy texts, while ChatGPT and Mistral produced much shorter responses. This points to a core tension in patient education: high-density texts may lead patients with lower health literacy to abandon the material, while overly brief texts may omit clinical details and the necessary balance of risks and benefits. Quantifying this gap between depth and accessibility is one of the article's most valuable contributions.
When scientific reliability and the quality of treatment information were evaluated with DISCERN, overall quality was moderate but variable. Claude 3.7 Sonnet and Gemini 2.0 Flash achieved the highest quality scores, followed by Grok in the upper-middle range, with ChatGPT, DeepSeek, and Qwen in the middle; Mistral performed significantly worse than the others. A major weakness across nearly all models was the failure to cite specific sources or to date the information, both of which are essential for building trust in health education. While the models were skilled at stating clear goals and producing relevant content, they struggled when asked for the evidence behind their claims.
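The paper does not publish a scoring pipeline, but the instrument's structure is standard: DISCERN comprises 15 questions plus an overall rating, each scored 1 to 5. A minimal sketch of how two blinded raters' scores might be combined, with hypothetical variable names, could look like this:

```python
from statistics import mean

def discern_total(item_scores: list[int]) -> int:
    """Sum one rater's 16 DISCERN items (Q1-8 reliability, Q9-15 treatment
    information, Q16 overall quality), each scored 1-5, giving 16-80."""
    if len(item_scores) != 16 or any(not 1 <= s <= 5 for s in item_scores):
        raise ValueError("expected 16 item scores, each between 1 and 5")
    return sum(item_scores)

def brochure_quality(rater_a: list[int], rater_b: list[int]) -> float:
    """Average the two blinded raters' totals for one brochure."""
    return mean([discern_total(rater_a), discern_total(rater_b)])
```

Reliability items such as "Is it clear what sources of information were used?" and "Is it clear when the information was produced?" are exactly where the models lost points, which explains why the absence of citations and dates depressed the overall DISCERN totals.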
The study also specifically examined patient-centered communication. Claude and ChatGPT performed relatively well here, as they were able to move beyond listing symptoms and treatments to discuss quality of life, emotional burdens, and the importance of patient participation in decision-making. Models like Qwen, Grok, and Mistral remained focused on biomedical symptoms, largely neglecting the psychosocial dimension. This suggests that scientific accuracy does not automatically equate to patient-centeredness; a scientifically strong text may still fail to support behavioral change if it ignores the patient’s perspective.
In the discussion, the authors suggest that the linguistic features of Turkish likely cause fluctuations in how these models parse medical terms and context. Their clinical recommendation is clear: none of these models should be used to produce patient brochures without supervision. Instead, they propose a workflow built around clinician verification and quality assurance, with model selection driven by the specific goal, as sketched below: if readability is the priority, ChatGPT-style models should be paired with factual verification; if scientific rigor is the priority, Claude or Gemini should be used and the language subsequently simplified.
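Purely as an illustration of that selection rule, and not as anything the paper implements, the workflow could be encoded like this:

```python
def recommend_pipeline(priority: str) -> dict[str, list[str]]:
    """Toy encoding of the goal-dependent workflow: every path ends with
    clinician sign-off; only generation and post-processing steps differ."""
    if priority == "readability":
        return {"generate_with": ["ChatGPT-4o", "Qwen 2.5"],
                "post_process": ["verify facts against clinical guidelines",
                                 "add sources and dates"],
                "final_step": ["clinician sign-off"]}
    if priority == "scientific_rigor":
        return {"generate_with": ["Claude 3.7 Sonnet", "Gemini 2.0 Flash"],
                "post_process": ["simplify language for patient literacy",
                                 "add sources and dates"],
                "final_step": ["clinician sign-off"]}
    raise ValueError("priority must be 'readability' or 'scientific_rigor'")
```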
The authors acknowledge several limitations, noting that only one output was generated per model and the evaluation occurred at a single point in time, which may not capture the inherent randomness or the rapid updates of these AI systems. Furthermore, the outputs were not directly compared to human-written official Turkish patient materials. Despite these limitations, the study serves as an important early warning and framework. It suggests that in high-stakes environments like Turkish medical education, these models act as talented storytellers but inconsistent carriers of evidence. Without a professional pipeline to verify and simplify their output, using them for health communication remains a significant risk.
Reference: Atilan, A. U., & Cetin, N. (2026). An old disease, a new linguistic challenge for large language models: Patient education on psoriasis and psoriatic arthritis in an underrepresented medical language. International Journal of Medical Informatics, 209, 106246. https://doi.org/10.1016/j.ijmedinf.2025.106246
