This article, “Real-world data: a brief review of the methods, applications, challenges and opportunities,” by Fang Liu and Demosthenes Panagiotakos, provides a concise overview of Real-World Data (RWD) in the medical and healthcare fields. Published in BMC Medical Research Methodology, this review serves as a primer for readers interested in RWD, outlining its types, sources, common analytical approaches, and the challenges and opportunities associated with its use.
The increased adoption of technology, such as the internet, social media, wearable devices, and e-health services, has led to the rapid generation of diverse digital data outside traditional clinical trials. The US FDA defines Real-World Data (RWD) as data relating to patient health status and/or the delivery of healthcare that are routinely collected from a variety of sources. These sources include claims and billing activities, electronic health records (EHRs), product and disease registries, and data from mobile and wearable devices.
The growing accessibility of RWD, coupled with advancements in Artificial Intelligence (AI) and Machine Learning (ML) techniques, has sparked significant interest in leveraging RWD to enhance clinical research efficiency and bridge evidence gaps between research and practice. For instance, during the COVID-19 pandemic, RWD was used to generate real-world evidence (RWE) on vaccine effectiveness, model control strategies, characterize illnesses, and study behavioral and mental health changes.
However, RWD possesses several unique characteristics that differentiate it from data collected in controlled trials, posing both challenges and opportunities:
- Observational Nature: RWD are collected observationally during routine care and daily life rather than in a controlled setting, so treatments are not randomly assigned.
- Messiness and Heterogeneity: Many RWD types are unstructured (e.g., texts, imaging) and can be inconsistent due to variations in data entry across providers and health systems. The article characterizes RWD overall as messy, incomplete, heterogeneous, and prone to various measurement errors and biases.
- Voluminous and Dynamic: RWD can be generated at high frequencies (e.g., millisecond-level measurements from wearables), leading to voluminous and dynamic datasets.
- Incompleteness: RWD may lack key endpoints for analysis because it is often collected for purposes other than research.
- Bias and Errors: RWD can be subject to selection bias and measurement errors, potentially making a dataset unrepresentative of the target population. Data quality of RWD is not consistently high, making quality assessments challenging.
The article details several common types of RWD, including:
- Electronic Health Records (EHRs): Collected as part of routine care, these are typically noisy, heterogeneous, and dynamic, offering opportunities for data-driven discoveries and improved predictions, especially when linked with other data.
- Registry Data: This includes product, health services, and disease registries, providing valuable data for understanding disease courses, supporting regulatory decisions, and aiding clinical trial design, particularly for rare diseases.
- Claims Data: Generated during healthcare claim processing, these data are used to understand patient/prescriber behavior, estimate disease prevalence, and study medication usage, though they can contain fraudulent entries.
- Patient-Reported Outcome (PRO) Data: Reported directly by patients about their own health status, these data provide RWE on intervention effectiveness and support symptom monitoring, but they are subject to recall bias.
- Wearable Device Data: These devices generate continuous data streams, enabling large-scale research studies that would be infeasible in controlled trial settings.
The review highlights several methods for utilizing and analyzing RWD:
- Pragmatic Clinical Trials: These test interventions in real-world settings and leverage data from EHRs and claims.
- Target Trial Emulation: This applies randomized trial design principles to observational data in order to support valid causal inferences.
- Historical Controls: Because RWD are voluminous, they can supply historical control data for controlled trials, which is especially useful when studying rare events.
- Machine Learning (ML): ML techniques are increasingly popular for predictive modeling with RWD because they can handle voluminous, messy, multi-modal, and unstructured data without strong distributional assumptions. Combining statistical inference with ML is considered more effective for generating RWE and learning causal relationships.
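To illustrate why observational RWD call for trial-like design principles before causal claims are made, here is a minimal Python sketch. It is not the article's method: it uses stratified standardization, one simple adjustment technique chosen for illustration, and the patient counts are hypothetical toy numbers in which older patients are both more likely to be treated and less likely to recover. The sketch assumes the confounder (age group) is measured and that every stratum contains both treated and untreated patients.

```python
from collections import defaultdict


def make_records():
    """Build hypothetical toy records as (age_group, treated, recovered) tuples."""
    counts = [
        # (age_group, treated, n_patients, n_recovered) -- made-up numbers
        ("young", True, 10, 9),
        ("young", False, 40, 32),
        ("old", True, 40, 24),
        ("old", False, 10, 5),
    ]
    records = []
    for group, treated, n, n_recovered in counts:
        records += [(group, treated, True)] * n_recovered
        records += [(group, treated, False)] * (n - n_recovered)
    return records


def naive_difference(records):
    """Recovery rate of treated minus untreated, ignoring the confounder."""
    def rate(flag):
        outcomes = [rec for (_, treated, rec) in records if treated == flag]
        return sum(outcomes) / len(outcomes)
    return rate(True) - rate(False)


def standardized_difference(records):
    """Average the within-stratum differences, weighted by stratum size.

    Assumes each stratum has at least one treated and one untreated patient.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for group, treated, recovered in records:
        strata[group][treated].append(recovered)
    total = len(records)
    diff = 0.0
    for arms in strata.values():
        n_stratum = len(arms[True]) + len(arms[False])
        rate_treated = sum(arms[True]) / len(arms[True])
        rate_untreated = sum(arms[False]) / len(arms[False])
        diff += (n_stratum / total) * (rate_treated - rate_untreated)
    return diff


records = make_records()
print(round(naive_difference(records), 2))         # -0.08
print(round(standardized_difference(records), 2))  # 0.1
```

In this toy data the naive comparison suggests the treatment is harmful, while the age-adjusted comparison suggests a benefit within each age group. Real analyses of RWD would typically need to handle many confounders, missing data, and time-varying treatment, which this sketch ignores.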
Despite the transformative potential, significant challenges exist across the RWD lifecycle:
- Data Quality: RWD often lacks critical information and is messy, impacting the accuracy and precision of results. Addressing this requires consistent documentation, pre-processing (e.g., imputation, denoising), and early stakeholder engagement to establish standards.
- Efficient and Practical ML and Statistical Procedures: The noisy and heterogeneous nature of RWD can lead to under-performance of existing procedures, necessitating new RWD-specific methods. The availability of open-source tools also increases the risk of misuse without proper training.
- Explainability and Interpretability: Modern ML models are often “black-box,” making it difficult to understand input-output relationships and causal effects. Interpretability is crucial in healthcare for building trust.
- Reproducibility and Replicability: RWD analyses should be reproducible (the same data and code yield the same results) and replicable (new data lead to the same findings). Both are critical for scientific rigor but are hard to achieve given the nature of RWD. Sharing data and code, together with detailed documentation, is vital.
- Privacy: RWD often contains sensitive information, and privacy risks increase when linking disparate databases. Adhering to privacy principles (lawfulness, fairness, purpose limitation, data minimization) and deploying privacy-enhancing technologies like differential privacy and federated learning are essential.
- Diversity, Equity, Algorithmic Fairness, and Transparency (DEAT): RWD can improve generalizability, but certain types may be biased or unbalanced, exacerbating health disparities. Efforts are needed to access data from under-represented groups and ensure algorithmic fairness to prevent systematic disadvantages. Transparency in data processing is critical for building trust among all stakeholders.
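As a concrete illustration of one privacy-enhancing technology named in the list above, the following Python sketch releases a patient count with differential privacy using the Laplace mechanism. The function names, the example cohort, and the choice of epsilon are assumptions made for illustration; the article does not prescribe an implementation. The sketch relies on the fact that a counting query has sensitivity 1, since adding or removing one patient changes the count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(flags, epsilon, rng):
    """Release a noisy count that satisfies epsilon-differential privacy.

    Noise scale = sensitivity / epsilon, and the sensitivity of a count is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for flag in flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical cohort: 30 of 100 patients have the condition of interest.
has_condition = [True] * 30 + [False] * 70
rng = random.Random(0)  # seeded only so the demo is repeatable
print(private_count(has_condition, epsilon=1.0, rng=rng))
```

Smaller values of epsilon add more noise and give stronger privacy at the cost of accuracy. A production system would also need to track the cumulative privacy budget across queries and use a cryptographically secure noise source, neither of which is shown here.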
These challenges are interconnected, with data quality affecting analytical procedures, and privacy/DEAT relating to how data is analyzed and shared. In conclusion, RWD offers a valuable and cost-effective data source with the potential to generate valid and unbiased RWE, enhancing medical and health-related research and decision-making when used appropriately.
Reference: Liu, F., & Panagiotakos, D. (2022). Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Medical Research Methodology, 22(1), 287. https://doi.org/10.1186/s12874-022-01768-6

Note: Five key concepts from the article, briefly defined:
Electronic Health Records (EHRs): These are a common and significant type of Real-World Data, collected as part of routine care across clinics, hospitals, and healthcare institutions. EHR data are typically noisy, heterogeneous, and dynamic, encompassing both structured and unstructured information (e.g., text, imaging). They offer unprecedented opportunities for data-driven discoveries, improved predictions (especially when linked with administrative and claims data), and the validation of findings from clinical trials. EHRs have been used in pragmatic clinical trials, such as the ADAPTABLE trial, to identify patients for recruitment and collect patient-reported outcomes.
Real-World Data (RWD): According to the US FDA, RWD in the medical and healthcare field are “data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources”. These sources are diverse, including the internet, social media, wearable devices, mobile devices, claims and billing activities, various registries (product, disease, health services), and electronic health records (EHRs). RWD are characterized as observational, often unstructured, voluminous and dynamic, potentially incomplete, and subject to various biases and measurement errors, making them “messy, incomplete, heterogeneous”.
Real-World Evidence (RWE): This refers to the evidence generated from the analysis of Real-World Data. RWE holds significant potential for designing and conducting confirmatory trials and for addressing medical questions that traditional research methods might not cover. For instance, during the COVID-19 pandemic, RWD was used to generate RWE on vaccine effectiveness, to model control strategies, and to characterize illnesses. When used appropriately, RWD has the potential to generate valid and unbiased RWE, enhancing the efficiency of medical and health-related research and decision-making.
Challenges of Real-World Data (RWD): Despite its potential, RWD presents several challenges across its lifecycle. Key challenges include data quality issues, as RWD often lacks critical information, is messy, heterogeneous, and prone to measurement errors, negatively impacting accuracy and precision. There’s also a need for efficient and practical ML and statistical procedures that can effectively handle RWD’s noisy and heterogeneous nature, as existing methods may underperform. Other challenges involve explainability and interpretability of modern ML models (often “black-box” in nature), ensuring reproducibility and replicability of RWD analyses, safeguarding privacy due to sensitive information, and addressing Diversity, Equity, Algorithmic Fairness, and Transparency (DEAT) to prevent biases and ensure data from under-represented groups is included. These challenges are interconnected.
Machine Learning (ML) / Artificial Intelligence (AI): These are powerful and increasingly popular tools for predictive modeling and analysis of RWD. ML techniques are particularly adept at handling voluminous, messy, multi-modal, and unstructured data types without requiring strong assumptions about data distribution. Examples include deep learning for abstract data representations and natural language processing (NLP) for EHR texts. ML has seen a rapid surge in applications for RWD, including health informatics, personalized healthcare, and understanding diseases like COVID-19. While traditionally used for predictions and classification, their role in generating regulatory-level RWE is evolving. Combining statistical inference with ML is considered more effective for generating RWE and learning causal relationships.
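The pre-processing steps that the article mentions as remedies for data-quality problems (imputation and denoising) can be sketched in a few lines of Python. This is only an illustrative sketch: median imputation and a centered moving average are simple stand-ins chosen here, not techniques the article singles out, and the heart-rate readings are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def moving_average(values, window=3):
    """Smooth a series with a centered moving average.

    Points near the edges are averaged over the shorter window available.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical wearable heart-rate readings with gaps and one spike.
heart_rate = [72, None, 75, 140, 74, None, 73]
filled = impute_median(heart_rate)
print(filled)  # [72, 74, 75, 140, 74, 74, 73]
print(moving_average(filled))
```

Median imputation assumes values are missing at random and can distort variability, and a moving average blurs genuine short-lived events along with noise, so both choices would need to be justified and documented in a real RWD analysis.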
