This paper introduces LM-RRG, a novel method for Radiology Report Generation (RRG), which aims to automatically create accurate and comprehensive radiology reports from given chest X-ray images. RRG has garnered significant attention from both the radiology and computer science communities due to its potential to reduce radiologists' workload. While existing approaches have made advances, they often fall short of clinical standards, particularly in their ability to account for the clinically significant and insignificant errors that radiologists would normally weigh when writing a report. Unlike general image captioning, RRG requires a keen focus on image details, especially in medically significant regions, and the ability to reason about disease-related symptoms, including the use of negations such as “no pneumothorax”.
To address these challenges, the authors propose LM-RRG, a method inspired by recent developments in large models (LMs) and reinforcement learning from human feedback (RLHF). The LM-RRG framework comprises three key components. First, an LLM-driven visual feature extractor is designed to analyze and interpret different regions of the chest X-ray image, highlighting areas of medical importance. This extractor uniquely leverages text descriptions of various anatomical regions, generated by a large language model (LLM) such as GPT-4, to guide the visual feature extraction process, thereby avoiding the pitfalls of explicit region detection. It extracts both global and region-specific visual features from the image. Second, a multimodal report generator is built on the decoder of a large multimodal model (LMM), such as BLIP-2. This generator produces radiology reports auto-regressively from multimodal prompts constructed out of the extracted visual features and textual instructions. This component is fine-tuned within a multitask learning framework, incorporating losses for both report generation and disease classification so that the visual features are enriched with disease-related information.
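The multitask objective described above can be sketched as a weighted sum of a report-generation loss (token-level negative log-likelihood under teacher forcing) and a multi-label disease-classification loss (binary cross-entropy). The weighting factor `lam` and the function signatures below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def report_nll(token_logprobs):
    """Mean negative log-likelihood of the reference report tokens
    under the generator (teacher forcing)."""
    return -sum(token_logprobs) / len(token_logprobs)

def disease_bce(probs, labels):
    """Binary cross-entropy over multi-label disease predictions
    (e.g., one probability per CheXbert disease category)."""
    eps = 1e-12  # numerical guard against log(0)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)

def multitask_loss(token_logprobs, disease_probs, disease_labels, lam=0.5):
    """Combined fine-tuning loss: report generation + disease classification.
    `lam` is a hypothetical trade-off weight; the paper's value may differ."""
    return report_nll(token_logprobs) + lam * disease_bce(disease_probs, disease_labels)
```

In a real system both terms would be computed from model logits inside the training loop; the point here is only that the classification loss pushes the shared visual features to encode disease evidence.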
Finally, to further refine the clinical quality of the generated reports, LM-RRG incorporates a novel clinical quality reinforcement learning (CQRL) strategy. This strategy uses the Radiology Report Clinical Quality (RadCliQ) metric as the reward function. RadCliQ is a composite metric specifically designed to evaluate clinically significant and insignificant errors in reports, combining scores from metrics such as BLEU, BERTScore, CheXbert vector similarity, and RadGraph F1. The model's parameters are updated using proximal policy optimization (PPO), mirroring techniques used in RLHF. Extensive experiments on the MIMIC-CXR and IU-Xray datasets demonstrate the superiority of LM-RRG over state-of-the-art methods, showing significant improvements in both Natural Language Generation (NLG) and Clinical Efficacy (CE) metrics, as well as a substantial improvement in the RadCliQ score, particularly on MIMIC-CXR. Ablation studies further validate the effectiveness and necessity of each proposed module within the LM-RRG framework.
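The reward construction and PPO update above can be sketched as follows. RadCliQ estimates the number of clinically (in)significant errors via a combination of the listed metrics, so lower is better and the reward is its negation; the weight values, the dictionary keys, and the helper names here are hypothetical placeholders rather than the fitted RadCliQ coefficients:

```python
def radcliq_style_score(metrics, weights, bias=0.0):
    """Linear combination of metric scores in the spirit of RadCliQ:
    an estimate of clinically (in)significant errors built from BLEU,
    BERTScore, CheXbert vector similarity, and RadGraph F1.
    Weights here are illustrative, not the published coefficients."""
    return bias + sum(weights[k] * metrics[k] for k in weights)

def reward(metrics, weights, bias=0.0):
    """RadCliQ predicts error counts (lower is better), so the RL
    reward is its negation."""
    return -radcliq_style_score(metrics, weights, bias)

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Under this sign convention, a generated report that raises, say, its RadGraph F1 while the (negative) weights are held fixed receives a strictly larger reward, and the clipped surrogate keeps each PPO step close to the previous policy.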
Reference: Zhou, Z., Shi, M., Wei, M., Alabi, O., Yue, Z., & Vercauteren, T. (2024). Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning [Preprint]. arXiv.
