How to Choose the Right Variables in Health Services Machine Learning

As healthcare datasets grow more complex, researchers face a critical challenge: how to distill thousands of variables into a model that is both accurate and interpretable. A recent paper by Dong et al. (2025), published in Health Services and Outcomes Research Methodology, addresses this problem directly. The authors present a structured and transparent framework for variable selection in machine learning models, with a focus on health services research, especially work involving social determinants of health (SDOH).

In modern healthcare research, choosing the “right” variables is not just about predictive accuracy—it’s also about policy relevance, interpretability, and fairness. The authors argue that combining expert judgment with algorithmic tools like Random Forests and LASSO yields models that are more robust and informative.

Highlights from the Framework

The paper dissects multiple variable selection methods:

  • Manual Selection: Useful for preserving theoretical relevance but prone to bias.
  • Correlation Matrix: Helps manage multicollinearity but ignores non-linear relationships.
  • PCA: Great for dimensionality reduction, poor for interpretability.
  • Random Forest & Boruta: Good at capturing complex patterns and ranking variable importance.
  • LASSO & Stepwise Regression: Efficient at building parsimonious models, but can discard subtly important predictors.
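
To make the correlation-matrix caveat concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the data are invented and the `pearson` helper is written only for this illustration. It shows that a variable can depend entirely on another and still have a Pearson correlation of zero when the relationship is non-linear.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented toy values, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]      # straight-line function of x
quadratic = [v ** 2 for v in x]      # fully determined by x, but not linearly

r_linear = pearson(x, linear)        # perfect linear association
r_quadratic = pearson(x, quadratic)  # no linear association at all

print(f"linear: r = {r_linear:.2f}, quadratic: r = {r_quadratic:.2f}")
```

A filter based only on pairwise correlation would treat the quadratic variable as unrelated to `x`, which is one reason to pair it with tree-based methods that can pick up non-linear patterns.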

What sets this paper apart is its insistence on interpretability. For instance, while PCA reduces complexity, it also makes the model’s logic harder to understand—unacceptable when decisions affect real patients.

A Case Study: Cancer Care Disparities

To ground their framework, the authors apply it to the LexisNexis SDOH dataset (442 variables). Their goal? Identify patients at high risk for inequitable cancer treatment. They walk through a multi-stage process: removing highly missing or homogeneous variables, using correlation matrices to filter out redundancy, applying Boruta for feature importance, and validating with stepwise regression and CART.
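
The first screening stages of such a process (missingness, homogeneity, redundancy) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the variable names, the toy values, the `screen_variables` helper, and all three thresholds are assumptions chosen for the example, and the later Boruta, stepwise regression, and CART stages are left out.

```python
import math

def missing_fraction(values):
    """Share of observations that are missing (encoded here as None)."""
    return sum(v is None for v in values) / len(values)

def dominant_share(values):
    """Share of non-missing observations taken by the most common value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 1.0
    counts = {}
    for v in observed:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(observed)

def pearson_pairwise(x, y):
    """Pearson r over the rows where both variables are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

def screen_variables(data, max_missing=0.5, max_dominant=0.95, max_abs_r=0.9):
    """Keep variables that pass the missingness, homogeneity, and redundancy checks.

    Thresholds are illustrative defaults, not values reported in the paper.
    """
    kept = []
    for name, values in data.items():
        if missing_fraction(values) > max_missing:
            continue  # too many missing observations
        if dominant_share(values) > max_dominant:
            continue  # nearly constant, so it carries little information
        if any(abs(pearson_pairwise(values, data[k])) > max_abs_r for k in kept):
            continue  # redundant with a variable that was already kept
        kept.append(name)
    return kept

# Hypothetical toy data; None marks a missing observation.
toy = {
    "median_income":  [42, 55, 61, 38, 70, 49, 58, 66],
    "income_rank":    [84, 110, 122, 76, 140, 98, 116, 132],  # exactly 2x income
    "mostly_missing": [1, None, None, None, None, None, 0, None],
    "all_same":       [1, 1, 1, 1, 1, 1, 1, 1],
    "household_size": [3, 1, 4, 2, 2, 5, 1, 3],
}

kept = screen_variables(toy)
print(kept)
```

The redundancy check here is greedy, so the variable listed first wins when two are highly correlated. In practice that choice is where expert judgment comes in: a domain expert would decide which of the two redundant variables is more meaningful to keep.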

The result is a lean, interpretable model that highlights specific social risk factors—exactly the kind of tool policymakers and clinicians need to address healthcare inequities.

This study isn’t just about variable selection—it’s about meaningful machine learning in healthcare. By balancing statistical rigor with domain knowledge, Dong et al. provide a blueprint for researchers navigating the messy, nuanced world of health services data.

Reference: Dong, W., Lal, T., Liu, F., Pronovost, P., Bora, S., & Hoehn, R. S. (2025). Methodological considerations for optimal variable selection in machine learning for health services research. Health Services and Outcomes Research Methodology. https://doi.org/10.1007/s10742-025-00347-8
