Scaling Methods for Measuring Health-State Preferences

This paper, authored by Debra G. Froberg and Robert L. Kane and published in the Journal of Clinical Epidemiology in 1989, is titled “Methodology for Measuring Health-State Preferences—II: Scaling Methods.” It represents the second part of a series dedicated to the intricate process of quantifying preferences for various health states. Following an initial discussion in Part I regarding the choice between holistic and decomposed strategies for data gathering, this installment shifts its focus to the critical decision among different scaling methods. The authors highlight that this choice has garnered considerable attention in the literature, with various investigators advocating for the superiority of specific techniques.

The fundamental problem addressed in this paper is the quantification or measurement of preferences for health states. To achieve this, the authors draw upon a long tradition of psychometric research, which involves adapting the principles of psychophysics. Psychophysics is the study of how people perceive and make judgments about physical phenomena, such as the brightness of lights or the loudness of sounds. While psychophysics has shown that humans can make consistent numerical estimates of sensory stimuli, even when the relationship between stimulus intensity and sensation is not linear, psychometrics extends these methods to measure subjective judgments for which no physical scale exists, including preferences and values for abstract concepts like health.

The paper systematically describes and compares six prominent scaling methods:

  • Standard Gamble (SG)
  • Time Trade-Off (TTO)
  • Rating Scale (RS), which includes category ratings
  • Magnitude Estimation (ME)
  • Equivalence (EQ)
  • Willingness-to-Pay (WTP)

Before delving into these methods, the authors introduce four crucial measurement concepts and definitions relevant to quantifying health-state preferences:

  • Scaling stimuli versus scaling persons: The paper clarifies that measuring the desirability of health states is a stimulus-scaling task, distinct from scaling people based on their responses to a health status instrument. This distinction is vital for selecting appropriate scaling techniques and determining how variability in preferences is handled.
  • Verifiable versus nonverifiable stimuli: Unlike in psychophysics, where subjective scales can be compared to an external standard of accuracy, health-state preferences lack such a factual standard. Therefore, the validation process for health preference scaling methods relies on the incremental accumulation of evidence rather than a single definitive comparison.
  • Levels of measurement (nominal, ordinal, interval, ratio): The authors explain that scaling methods produce different levels of measurement. Ratio scales are the highest level: they have a true zero, so absolute magnitudes are meaningful and all four arithmetic operations (addition, subtraction, multiplication, division) apply, which means one can say that one health state is “twice as desirable” as another. Interval scales convey rank order and the distance between health states but not absolute magnitude. An interval scale is sufficient for most powerful statistical analyses, health status indexes, and cost-effectiveness analyses. Ordinal scales convey rank order only, yet they are often (incorrectly) treated as interval data in practice.
  • Direct versus indirect scaling methods: In direct scaling, respondents are explicitly asked to make judgments at a given level of measurement, and the data are treated as being at that level (e.g., interval or ratio). In indirect scaling, respondents make judgments at one level (e.g., ordinal), and the investigator converts them to a different level using theoretical assumptions. The scaling models used in the health preference literature at the time of the paper all employ direct scaling, which assumes that subjects can generate interval or ratio scales directly. The authors attribute this to the ease of use of direct methods and to recent evidence supporting their validity.
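
The practical difference between interval and ratio scales can be illustrated with a small Python sketch. The preference values and the rescaling constants below are hypothetical and do not come from the paper. An interval scale is defined only up to a change of unit and origin, so a statement such as “twice as desirable” does not survive rescaling, while a ratio of differences does.

```python
def rescale(values, slope, intercept):
    """Apply a linear rescaling (new unit and new origin) to every scale value."""
    return {state: slope * v + intercept for state, v in values.items()}


def ratio(values, a, b):
    """Ratio of two scale values; meaningful only on a ratio scale."""
    return values[a] / values[b]


def difference_ratio(values, a, b, c, d):
    """Ratio of two differences; meaningful on interval and ratio scales."""
    return (values[a] - values[b]) / (values[c] - values[d])


# Hypothetical preference values for four health states (illustration only).
prefs = {"mild": 0.8, "moderate": 0.6, "severe": 0.4, "very_severe": 0.2}

# The same judgments expressed with a different unit and origin.
rescaled = rescale(prefs, slope=50, intercept=10)

# The simple ratio changes under rescaling.
print(ratio(prefs, "mild", "severe"))
print(ratio(rescaled, "mild", "severe"))

# The ratio of differences is unchanged (up to floating-point rounding).
print(difference_ratio(prefs, "mild", "moderate", "severe", "very_severe"))
print(difference_ratio(rescaled, "mild", "moderate", "severe", "very_severe"))
```

If the data support only interval-level measurement, comparisons of differences between states remain valid, but “twice as desirable” claims require a scale with a defensible zero point.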

The paper provides a detailed description of each scaling method:

  • Standard Gamble (SG): Originating in decision theory, it is based on the axioms of utility theory and involves decision-making under uncertainty. Respondents choose between remaining in a health state for certain and taking a gamble (e.g., a treatment with a probability p of returning to normal health and 1-p of immediate death). The probability p is varied until the respondent is indifferent between the two options, and the value of p at that point is taken as the preference value for the health state. Variations exist for states considered worse than death (which yield negative preference values) and for temporary states. Administration is often aided by probability wheels and a “back and forth” questioning technique.
  • Time Trade-Off (TTO): Developed as a simpler alternative to the standard gamble for health research. It presents a choice between two certain alternatives, asking respondents how much time (years of life) they would give up to be in a healthier state. For example, a person might choose between living t years in a chronic state or X years in perfect health, where X < t. X is varied until the respondent is indifferent between the two alternatives, and the preference value for the chronic state is then X/t. This method can also be adapted for states worse than death and temporary states, frequently using visual aids.
  • Rating Scale (RS) / Category Ratings: A psychometric method where respondents place health states on a line anchored by the best (e.g., “perfect health”) and worst (e.g., “death”) states. To achieve an interval scale, respondents are instructed to reflect perceived differences in their placements. A common variation, category ratings, involves sorting health states into a specified number of categories (often 10), assuming equal preference changes between adjacent categories. This is the most frequently used method and is versatile for chronic, acute, or worse-than-death states, often with visual aids.
  • Magnitude Estimation (ME): Proposed to achieve ratio-level measurement, it asks respondents to provide a number or ratio indicating how much better or worse other health states are compared to a given standard. In practice, studies using this method have shown inconsistency in selecting the standard health state and defining the zero point.
  • Equivalence (EQ): An adaptation of the psychometric “method of adjustment,” where respondents decide how many people in one health state are equivalent to a specified number in another health state. It is conceptually similar to magnitude estimation.
  • Willingness-to-Pay (WTP): Has been recommended for measuring health preferences and is often used in cost-benefit and cost-effectiveness analyses. It asks respondents what proportion of their income they would be willing to pay for a complete cure for a condition. Proportion of income is considered more useful than a dollar amount because it is less influenced by income level.
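
The arithmetic behind the two decision-oriented methods can be sketched in Python. Under expected utility theory, with normal health assigned a utility of 1 and immediate death a utility of 0, indifference between the certain state and the gamble implies that the utility of the state equals p * 1 + (1 - p) * 0 = p. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the respondent is simulated by a callable, and a simple bisection search stands in for the “back and forth” questioning technique, which also closes in on the indifference point but is not the same algorithm.

```python
def time_trade_off_value(healthy_years, state_years):
    """Preference value X/t for a chronic state that is preferred to death.

    healthy_years: X, the years in perfect health judged equivalent to
    state_years: t, the years in the chronic state (0 <= X <= t, t > 0).
    """
    if state_years <= 0 or not 0 <= healthy_years <= state_years:
        raise ValueError("expected 0 <= healthy_years <= state_years, state_years > 0")
    return healthy_years / state_years


def standard_gamble_value(prefers_gamble, tolerance=0.01):
    """Estimate the indifference probability p by bisection.

    prefers_gamble: hypothetical callable that takes the probability p of
    returning to normal health and returns True if the respondent would
    take the gamble. Assumes the respondent accepts the gamble for every p
    above some threshold and refuses it for every p below that threshold.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        p = (low + high) / 2
        if prefers_gamble(p):
            high = p  # gamble accepted: indifference point is at or below p
        else:
            low = p   # gamble refused: indifference point is at or above p
    return (low + high) / 2


def simulated_respondent(p):
    """Made-up respondent whose true indifference point is p = 0.70."""
    return p > 0.70


print(time_trade_off_value(healthy_years=6, state_years=10))
print(standard_gamble_value(simulated_respondent))
```

Both functions cover only chronic states preferred to death. The variants for temporary states and for states worse than death use different formulas, which are not shown here.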

The scaling methods are then rigorously evaluated based on their reliability, validity, and feasibility.

  • Reliability: This refers to the consistency of results. The paper discusses intra-rater reliability (single rater’s consistency), test-retest reliability (stability over time), and inter-rater reliability (consistency among judges). Most methods show acceptable intra-rater reliability and satisfactory test-retest reliability over short periods (up to 6 weeks). However, lower coefficients for measurements taken a year apart likely reflect actual changes in preferences rather than just measurement error. Inter-rater reliability generally appears acceptable, with some exceptions. The authors note that data for all three types of reliability are often missing across methods, indicating a need for further research.
  • Validity: This assesses whether a method accurately measures what it intends to. The paper primarily focuses on construct validation, which involves two approaches: examining the convergence of results across different scaling methods and testing hypothetical relationships between preferences and other variables.
    • Convergence of methods: Studies comparing various methods have yielded mixed results. While correlations among methods are often moderately high, they do not consistently produce equivalent scale values. This discrepancy is attributed to the different cognitive tasks involved in each method, which can influence how respondents attend to stimuli, recall experiences, and select reference points. For example, the standard gamble and willingness-to-pay methods were shown to focus on different aspects of a disease (pain vs. activities of daily living impairments).
    • Challenges to Standard Gamble Validity: Despite its theoretical grounding in expected utility theory, empirical evidence suggests that people often violate its axioms, leading to potential biases like risk aversion. This conservatism towards risk can inflate utilities derived from the standard gamble compared to non-gamble methods.
    • Category Ratings vs. Magnitude Estimation: The relative merits of the two methods have long been debated. Category ratings have been shown to meet empirical criteria for equal intervals in some experiments, while the validity of magnitude estimation can depend on an appropriate definition of its zero point.
    • Testing predictions: A few studies have successfully demonstrated predicted relationships, such as Time Trade-Off scores aligning with the severity of end-stage renal disease and magnitude estimation scales correlating with court awards for injuries.
  • Feasibility: This criterion considers the economy and acceptability of the methods to respondents.
    • Standard Gamble and Time Trade-Off are inherently expensive because they rely on lengthy interviews with trained personnel and complex branching procedures. The standard gamble, in particular, can be challenging for respondents because of its probabilistic nature and their potential aversion to risk, which can lead to inconsistent answers, especially in large population studies. The Time Trade-Off method is more feasible in clinical settings and is generally considered easier for respondents than the Standard Gamble.
    • Category Ratings and Magnitude Estimation are generally the least expensive and easiest to understand and administer.
    • The Equivalence method is noted as too complex for routine use outside a laboratory and can confuse or even offend respondents.
    • The Willingness-to-Pay method has historically suffered from low response rates, with respondents finding the task difficult to understand, feeling hostile to the question, or lacking knowledge about healthcare costs. However, improvements in questionnaire design and interviewer training have demonstrated that high response rates and plausible answers are achievable, particularly with educated respondents and younger populations.
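
Test-retest reliability is commonly summarized with a correlation coefficient between two administrations of the same instrument. The sketch below computes a Pearson product-moment correlation in plain Python. The choice of Pearson's r is an assumption made for illustration (the studies reviewed in the paper may have used other coefficients), and the ratings are invented.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equally long sequences of scale values."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_a * var_b)


# Invented preference values given by one respondent to the same five
# health states on two occasions (illustration only).
time_1 = [0.90, 0.75, 0.60, 0.40, 0.15]
time_2 = [0.85, 0.80, 0.55, 0.45, 0.20]

print(round(pearson_r(time_1, time_2), 3))
```

The same function can be applied to inter-rater reliability by passing the values that two different raters assigned to the same set of health states.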

Based on this comprehensive evaluation, Froberg and Kane conclude with specific recommendations:

  • The category ratings method is recommended for large-sample studies due to its ease of administration and apparent validity.
  • Magnitude estimation is suggested when a ratio-level scale is required (e.g., to state that one health state is twice as desirable as another), provided the zero point is correctly defined as the absence of disease and disability.
  • Decision-oriented methods, particularly the time trade-off and standard gamble, may be more effective in small-scale investigations and individual decision-making because they encourage thoughtful consideration by requiring respondents to make a decision. However, the standard gamble is not recommended for population studies due to its complexity, cost, and administration difficulties.
  • The equivalence method is not recommended due to its complexity and potential to offend respondents.
  • The willingness-to-pay method requires further research to establish its psychometric qualities before it can be broadly endorsed for health preference studies.

In summary, this paper provides a robust framework for understanding the theoretical foundations, practical applications, and psychometric considerations of various scaling methods used in health-state preference measurement. It offers critical insights and specific recommendations for researchers and practitioners, emphasizing the importance of aligning method selection with study objectives, respondent characteristics, and the desired level of measurement.

Reference: Froberg, D. G., & Kane, R. L. (1989). Methodology for measuring health-state preferences—II: Scaling methods. J Clin Epidemiol, 42(5), 459–471.

Podcast Link

https://notebooklm.google.com/notebook/9f42ca74-0e3d-4c10-ad88-c84af3ed9f99?artifactId=0d9e51c8-53f5-4df6-ab9f-0ac5f2aeddfb
