Concept, Scope, Hazards, Blind Spots, and Limitations
1. The Core of the Concept
Item-level content overlap analysis is a psychometric audit method from the scale development and evaluation literature that interrogates what a construct actually measures, independently of its label, through the substantive content of its items. Its underlying premise is straightforward: what a scale measures is determined not by its name but by the concrete meaning of the items it contains. Consequently, the question of whether two scales measure the “same” or “different” things must be answered by examining how much of their item content is shared (Larsen & Bong, 2016).
This framework was designed to expose two classic fallacies. The first is the jingle fallacy, named by Truman Kelley in 1927: the assumption that two scales measure the same thing because they bear the same name (Kelley, 1927). The second is the jangle fallacy, formulated by Edward Thorndike in 1904: the assumption that two scales measure different things because they bear different names (Thorndike, 1904). Item-level content overlap analysis renders these assumptions empirically testable; once the substantive content of the items is laid bare, the extent to which two scales actually share the same conceptual territory becomes visible—regardless of what their labels claim.
The fundamental mathematical tool of this analysis is the Jaccard index: |A ∩ B| / |A ∪ B|, that is, the ratio of items shared between two scales to the total number of unique items across both (Fried, 2017). This simple ratio summarizes, on a 0-to-1 scale, how much two instruments overlap in content. A value of 1 indicates complete overlap and 0 indicates complete divergence; in practice, values that destabilize scientific inference typically fall in the 0.20–0.50 band. More advanced approaches include expert-judgment-based construct identity analysis (Larsen & Bong, 2016), retest-corrected item correlations, EGA (Exploratory Graph Analysis) and UVA (Unique Variable Analysis), dependent-correlation external-criterion tests (Gonzalez et al., 2020; Blötner, 2025), and sentence-embedding-based cosine similarity. All of these share a common starting point: looking at the substantive content of items, not at the scale’s label.
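As a concrete illustration, the sketch below computes the Jaccard index for two hypothetical scales whose items have already been coded into standardized content categories. In real audits that coding step, deciding that two differently worded items express the same content, is the substantive work; the scale names and item codes here are invented for illustration.

```python
# Minimal Jaccard overlap between two scales, assuming each scale has already
# been reduced to a set of standardized content codes. All codes are invented.

def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two item-content sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

scale_x = {"sad_mood", "insomnia", "fatigue", "appetite_loss", "guilt"}
scale_y = {"sad_mood", "fatigue", "anhedonia", "concentration", "suicidality"}

print(f"Jaccard overlap: {jaccard(scale_x, scale_y):.2f}")  # 2 shared / 8 unique = 0.25
```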
2. Scope
The scope of item-level content overlap analysis extends along four axes.
The first axis is construct validity. Conventional construct validity audits do not systematically detect item-level overlap; internal consistency coefficients (Cronbach’s α, McDonald’s ω), factor loadings, and confirmatory factor analysis (CFA) fit indices, taken alone, cannot tell us how much a scale overlaps with other scales. Item-level content overlap analysis surfaces this hidden dimension of construct validity—whether scales circulating under the same construct label genuinely share the same content.
The second axis is construct proliferation. The social sciences and the health-measurement literature generate new constructs decade after decade, yet many of their items overlap substantially with those of pre-existing constructs. The classic study by Le, Schmidt, Harter, and Lauver (2010) showed that “distinct” organizational behavior constructs—job satisfaction, organizational commitment, job involvement, and work engagement—shared between 70% and 86% of their reliable variance. Item-level content overlap analysis distinguishes which portion of this proliferation reflects genuine conceptual innovation and which portion reflects mere relabeling.
The third axis is cumulative science. Meta-analyses, systematic reviews, evidence syntheses, and clinical guidelines rely on the assumption that different scales measuring the “same” construct can have their results pooled. For example, under the heading of “depression,” seven different scales—PHQ-9, BDI-II, CES-D, HADS-D, MADRS, HRSD, and IDS—enter meta-analyses. Without item-level content overlap analysis, the legitimacy of such pooling cannot be known. Fried’s (2017) findings are unambiguous: the seven depression scales together encompass 52 distinct symptoms; the mean Jaccard overlap is only 0.36; 40% of the symptoms appear in only a single scale; and only 12% appear across all seven instruments.
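These three summary statistics (mean pairwise overlap, the proportion of single-scale symptoms, the proportion of universal symptoms) are simple to compute once a symptom-by-scale coding exists. The sketch below reproduces them on a small invented example; the scale contents are hypothetical, not the actual depression instruments.

```python
# Sketch of the three summary statistics of a content-overlap audit, computed
# on invented item-content sets (not the real scales analyzed by Fried, 2017).
from itertools import combinations

scales = {
    "S1": {"sad_mood", "insomnia", "fatigue", "guilt"},
    "S2": {"sad_mood", "fatigue", "anhedonia", "irritability"},
    "S3": {"sad_mood", "fatigue", "appetite_loss", "crying"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

pairs = list(combinations(scales.values(), 2))
mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

all_items = set().union(*scales.values())
counts = {item: sum(item in s for s in scales.values()) for item in all_items}
idiosyncratic = sum(c == 1 for c in counts.values()) / len(all_items)
core = sum(c == len(scales) for c in counts.values()) / len(all_items)

print(f"mean pairwise Jaccard = {mean_jaccard:.2f}")   # 0.33
print(f"idiosyncratic items   = {idiosyncratic:.0%}")  # appear in one scale only
print(f"core items            = {core:.0%}")           # appear in every scale
```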
The fourth axis is clinical decision-making. The question of whether a patient “has depression,” “is experiencing burnout,” or “has a low quality of life” can yield different answers depending on which scale is used. In cervical dystonia, seven scales contain 91 distinct symptoms with a mean overlap of only 0.17, and no symptom appears across all seven instruments (Chrobak et al., 2024). This means that the same patient may face different treatment thresholds in different clinics.
3. The Hazards Generated by This Problem
Leaving item-level overlap invisible produces six concrete and interconnected hazards at the scientific and clinical levels.
The first hazard is spurious convergent validity. A high correlation between two scales is typically interpreted as “both measure the same construct.” However, if two scales share 30% of their items verbatim or near-verbatim, the correlation between them is not evidence of conceptual proximity but the mathematically inevitable consequence of content identity (Gonzalez et al., 2020). This does not constitute evidence of genuine construct validity; it systematically inflates the standard “we provided convergent validity evidence” claim of scale development papers.
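A minimal simulation makes this mechanism explicit. In the sketch below every item is independent noise, so there is no common construct at all; yet two scale totals that share items correlate substantially. With two 10-item scales sharing three items, the expected correlation is 0.30. All numbers are illustrative.

```python
# Simulation of spurious convergent validity: no shared construct exists, yet
# shared items mechanically produce a "convergent validity" correlation.
import numpy as np

rng = np.random.default_rng(42)
n_respondents, n_shared, n_unique = 10_000, 3, 7

shared = rng.normal(size=(n_respondents, n_shared))    # 3 items in both scales
unique_a = rng.normal(size=(n_respondents, n_unique))  # 7 items only in scale A
unique_b = rng.normal(size=(n_respondents, n_unique))  # 7 items only in scale B

total_a = shared.sum(axis=1) + unique_a.sum(axis=1)    # 10-item total, scale A
total_b = shared.sum(axis=1) + unique_b.sum(axis=1)    # 10-item total, scale B

r = np.corrcoef(total_a, total_b)[0, 1]
print(f"observed r = {r:.2f}  (expected: {n_shared / 10:.2f})")  # ≈ 0.30
```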
The second hazard is spurious incremental validity. The claim that a newly developed scale “explains unique variance beyond scale X” is, in most cases, simply variance produced by the non-overlapping items. That is, the new scale does not surpass X; items that differ from X explain variance that differs from X. This is not the discovery of a new construct—it is merely a difference in the item pool. This mechanism is the principal fuel of the proliferating construct ecosystem (Le et al., 2010).
The third hazard is meta-analytic distortion. If seven depression scales overlap with a mean Jaccard of only 0.36, then a meta-analysis that pools effect sizes from these seven scales is technically combining different concepts in the same pool (Fried, 2017). The resulting summary effect size is neither the true effect of any single scale nor the general effect of “depression.” The same problem has been documented for the trauma exposure literature (Karstoft & Armour, 2022), for psychosis risk (Bernardin et al., 2023), and for cancer-related fatigue (Muench et al., 2024).
The fourth hazard is replication failure. One researcher reports a finding using scale X; a second researcher tests the same construct using scale Y; the results disagree. This is interpreted as a genuine replication failure. But if the Jaccard overlap between X and Y is 0.30, the second study was never a replication in the first place: the two studies measured different symptom clusters (Bernardin et al., 2023). An important and largely unrecognized contributor to the replication crisis is precisely this gap in item overlap.
The fifth hazard is the bifurcation of research domains. When the same concept is studied under two different names, each name generates its own literature; the two literatures advance independently and rediscover the same findings in different vocabularies. The Dark Triad versus the Dark Factor D, burnout versus burnout disorder, work engagement versus job satisfaction—these literatures are concrete examples of this hazard (Le et al., 2010). The result is a literature that is economically inefficient and conceptually inflated.
The sixth hazard is inconsistency in clinical and policy decisions. The choice of patient-safety culture scale determines whether a hospital is judged to have a “mature safety culture” or one “in need of improvement.” Fatigue assessment influences whether a cancer patient is assigned an intervention (Muench et al., 2024). Dystonia severity scores determine botulinum toxin dosing (Chrobak et al., 2024). Low item-level overlap among scales causes the same patient to navigate the same system with different outcomes—what may be termed scale-dependent clinical decision variance.
4. What Is Neglected When This Analysis Is Not Performed
When item-level content overlap analysis is not performed, a series of elements that do not appear on the dashboard of scale evaluation—but whose effects persist—remain neglected.
The first is the substantive content of items. Conventional CFA and reliability approaches tell us that items “load well” but do not evaluate what the items are about. A sleep item, an appetite item, a concentration item, and a hopelessness item can cluster under a single factor with high loadings, presenting a statistically homogeneous construct. Yet the structure is heterogeneous in content; clinical intervention must respond to different items differently. Without an item-level overlap audit, this content heterogeneity remains invisible (Fried, 2017).
The second is idiosyncratic (single-scale-specific) items. In Fried’s (2017) study of 52 depression symptoms, 40% of symptoms appear in only one of the seven scales; in the mania study, this rate is 36% (Chrobak et al., 2018); for neurological soft signs scales, a similar pattern is observed, with mean overlap of only 0.27 (Chrobak et al., 2021); in psychosis risk, overlap is 0.19 ± 0.50, with a very high proportion of idiosyncratic content (Bernardin et al., 2023). These items appear as “trivial particularities” in scale-to-scale comparison but may be of substantial clinical importance. Conventional scale evaluation does not detect these idiosyncratic items; overlap analysis renders this systematic.
The third is universal (core) items. Which symptoms or indicators are common across all scales? This defines the core shared territory of the construct. In cervical dystonia, no symptom is common across all seven scales (Chrobak et al., 2024); for depression, the proportion of common items is only 12% (Fried, 2017). Without this core-item map, the question “what is the essence of the construct?” remains unanswered.
The fourth is contamination (content contamination). When the items of one scale are structurally entangled with the items of another construct, the scale cannot measure the nominal construct in pure form. In antenatal depression, the sleep, appetite, weight, and somatic items of the Zung SDS overlap with the physiology of normal pregnancy; without overlap analysis, these items are interpreted as depression symptoms (Chen et al., 2020). Contamination between burnout and depression, between anxiety and depression, and between moral distress and burnout can only be surfaced through item-level audit.
The fifth is loss of cross-cultural equivalence. When a scale is translated into another language, translation can alter inter-item overlap: two nuanced English items may collapse into a single phrasing in Turkish (spurious overlap inflation), or two overlapping English items may diverge in Turkish (cross-cultural equivalence loss). These mechanisms are invisible if overlap analysis is performed only on the English originals; an additional overlap analysis is required on the post-translation item pool.
The sixth is redundancy load. Within a single scale, items may also meaningfully overlap with one another (intra-scale redundancy); this is not a cross-scale comparison question but rather a question about the justification of the scale’s own item count. The EGA + UVA method reveals which items within a scale are too similar to one another (wTO > 0.20)—that is, which are effectively “asking the same item again.” Without this, scales are artificially inflated in length, respondents bear unnecessary burden, and reliability coefficients appear misleadingly high.
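The reference implementation of UVA is the UVA() function in the R package EGAnet; the Python sketch below reproduces only the core wTO computation on a toy network (an absolute correlation matrix with a zeroed diagonal) and the 0.20 flagging rule, not the full redundancy-resolution procedure. The network values are invented.

```python
# Minimal weighted topological overlap (wTO) sketch, assuming the item network
# is an absolute (partial) correlation matrix. Not the full EGAnet::UVA method.
import numpy as np

def wto(adj: np.ndarray) -> np.ndarray:
    """wTO_ij = (sum_k a_ik * a_jk + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    a = np.abs(adj).astype(float)
    np.fill_diagonal(a, 0.0)
    strength = a.sum(axis=1)            # node strength k_i
    shared = a @ a                      # shared neighborhood sum_k a_ik * a_jk
    denom = np.minimum.outer(strength, strength) + 1.0 - a
    w = (shared + a) / denom
    np.fill_diagonal(w, 0.0)
    return w

# Toy network: items 0 and 1 are near-duplicates; items 2 and 3 are distinct.
adj = np.array([
    [0.0, 0.6, 0.1, 0.0],
    [0.6, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.0],
])
redundant_pairs = np.argwhere(np.triu(wto(adj), k=1) > 0.20)  # common UVA cutoff
print(redundant_pairs)  # [[0 1]] -> only the near-duplicate pair is flagged
```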
The seventh is mapping the construct. As a by-product, item-level content overlap analysis produces the actual internal geography of the construct: which sub-dimensions exist, which symptom clusters move together, which peripheral items sit in isolated regions (Karstoft & Armour, 2022). This map differs from what conventional factor analysis provides, because factor analysis only sees the covariance among items that exist and cannot show items that do not exist (i.e., items that exist in other scales but are absent in this one). Overlap analysis, by contrast, makes the absent visible.
5. The Limitations of Item-Level Content Overlap Analysis Itself
The method is powerful, but it cannot see everything. Its limits should be known just as its hazards are.
The first limitation is its inability to capture psychometric weight. The Jaccard and expert-judgment approaches render a 0/1 (present/absent) or categorical decision about whether two items share “the same content.” Yet one item may represent the construct more strongly or more weakly than another; factor loadings, discrimination parameters, and item-total correlations may differ. Item-content identity is not the same as item-psychometric equivalence.
The second limitation is context dependence. The same item may behave differently in different scales: response format (3-point vs. 5-point Likert), reference period (“past week” vs. “past month”), order effects, and social desirability context all vary. Two scales may share the same item wording, yet the surrounding context can produce different responses. Overlap analysis flattens this context.
The third limitation is its possible blindness to translation and language equivalence. The fact that two English items are deemed “the same” does not guarantee that their Turkish equivalents will be. Conversely, two English items deemed different may translate to the same expression in Turkish. Sentence-embedding-based methods can mitigate this problem with multilingual models but cannot eliminate it; idiomatic and culturally specific shades of meaning cannot be fully captured.
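For readers who want to experiment, the sketch below uses the sentence-transformers library with one commonly used multilingual checkpoint. The model name and the item wordings are illustrative assumptions, not recommendations, and embedding similarity is only a proxy for the expert content judgment discussed above.

```python
# Hedged sketch of embedding-based item similarity across languages, using the
# sentence-transformers library. Model choice and item wordings are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

items = [
    "I feel tired most of the day.",               # scale X (invented wording)
    "I lack energy for my daily activities.",      # scale Y (invented wording)
    "Günün büyük bölümünde yorgun hissediyorum.",  # Turkish rendering of item 1
]
emb = model.encode(items, normalize_embeddings=True)
sim = util.cos_sim(emb, emb)  # pairwise cosine similarity matrix

print(f"X vs Y (within English):  {float(sim[0][1]):.2f}")
print(f"X vs its Turkish version: {float(sim[0][2]):.2f}")
```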
The fourth limitation is expert bias. In Larsen-and-Bong-style expert construct identity evaluations, experts’ own paradigms push them to see certain similarities and miss others (Larsen & Bong, 2016). The same item pair can yield different overlap rates across different expert pools; this can be partly controlled with ICC but not entirely eliminated.
The fifth limitation is its inability to see response process. Two respondents may answer the same item via different mental processes: one retrieves an autobiographical memory, the other gives a general judgment. This difference in response process means that the aspects of the construct actually being measured may differ even though the item content appears identical. Cognitive interview methods can close this gap, but item overlap analysis alone cannot detect it.
The sixth limitation is its inability to see structural and causal relationships. Item overlap analysis compares item contents horizontally; it does not consider causal relationships among items (situations in which one item precedes and influences another). The contribution of network psychometrics over the past decade has been to show that causal relationships among items also define the nature of a construct; pure overlap analysis cannot reach this dimension.
The seventh limitation is the incompleteness of external validity on its own. Even at 30% overlap, two scales may exhibit identical correlations with external criteria (jangle); even at 70% overlap, they may exhibit meaningfully different correlations with external criteria (latent jingle). Item overlap analysis alone cannot deliver a jingle/jangle verdict; it must be complemented by external-criterion tests such as dependent correlations (Gonzalez et al., 2020; Blötner, 2025).
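One standard choice for such a test is the Hotelling-Williams t-test for two dependent correlations that share a variable: does scale X correlate with an external criterion C differently than scale Y does, given that X and Y are themselves correlated? (Reference implementations exist, e.g., in the R package cocor.) The sketch below is a minimal Python version with hypothetical numbers.

```python
# Hotelling-Williams t-test for comparing r(X,C) and r(Y,C) measured on the
# same sample, where X and Y are correlated scales and C an external criterion.
from math import sqrt
from scipy import stats

def williams_t(r_xc: float, r_yc: float, r_xy: float, n: int):
    """Return (t, p) for the difference between two dependent correlations."""
    det = 1 - r_xc**2 - r_yc**2 - r_xy**2 + 2 * r_xc * r_yc * r_xy
    r_bar = (r_xc + r_yc) / 2
    t = (r_xc - r_yc) * sqrt(
        ((n - 1) * (1 + r_xy))
        / (2 * det * (n - 1) / (n - 3) + r_bar**2 * (1 - r_xy) ** 3)
    )
    p = 2 * stats.t.sf(abs(t), df=n - 3)
    return t, p

# Hypothetical numbers: X and Y overlap heavily (r = .80) yet relate
# differently to a diagnostic criterion, a "latent jingle" pattern.
t, p = williams_t(r_xc=0.55, r_yc=0.40, r_xy=0.80, n=300)
print(f"t({300 - 3}) = {t:.2f}, p = {p:.4f}")
```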
The eighth limitation is its disregard for the dynamic nature of constructs. Constructs change over time, across populations, across developmental stages. The “depression” item may not carry the same content for adolescents as for adults. Item overlap analysis is typically a static cross-sectional snapshot; it does not produce a lifespan-evolution map of the construct.
The ninth limitation is the representativeness problem. The scales being compared consist of those already published; how well the sample of scales represents the universe of the construct is unknown. “Seven depression scales” do not exhaust the universe of the depression construct; “seven burnout scales” do not cover all possible manifestations of the burnout concept. The findings of the analysis carry forward this sample-dependent structure.
The tenth limitation is its inability to generate alternative constructs. Item overlap analysis audits what already exists; but it does not say which construct has been mislabeled or whether a new construct is needed. That is a conceptual/theoretical decision, not a methodological output.
6. Conclusion
Item-level content overlap analysis is the most direct, most transparent empirical tool we have against the century-old jingle and jangle fallacies (Kelley, 1927; Thorndike, 1904; Larsen & Bong, 2016). When added to the conventional scale-evaluation apparatus (Cronbach’s α, confirmatory factor analysis, convergent/discriminant validity), it lets us see what scales actually measure—independently of their labels—and renders visible such hazards as spurious convergent validity, spurious incremental validity, meta-analytic distortion, replication failure, literature bifurcation, and clinical decision inconsistency (Fried, 2017; Le et al., 2010; Gonzalez et al., 2020). It also has its own limits: it cannot, on its own, capture psychometric weight, context, translation equivalence, response processes, causal inter-item relationships, or the dynamic nature of constructs. Item overlap analysis, therefore, does not deliver a validity verdict on its own; but a validity verdict rendered without it is incomplete. For every domain of health measurement—patient-reported outcomes, healthcare-worker scales, health management, clinical assessment—item-level content overlap analysis is becoming a mandatory minimum standard.
Mini Glossary
The following table contains brief definitions of the technical and semi-technical concepts used throughout the article. Where a Turkish equivalent is not yet firmly established, the English original is preserved as the primary term.
| Term | Definition |
| --- | --- |
| Construct | An abstract psychological or social entity not directly observable, but measured through its indicators (e.g., items). Examples: depression, burnout, patient safety culture. |
| Construct validity | The body of evidence showing that a scale actually measures the construct it claims to measure. Comprises convergent, discriminant, content, and criterion validity. |
| Construct proliferation | The phenomenon by which most scales marketed under new names actually reiterate items from existing constructs. |
| Jingle fallacy | The assumption that two scales measure the same construct because they bear the same name (Kelley, 1927). Different concepts may circulate under the same label. |
| Jangle fallacy | The assumption that two scales measure different constructs because they bear different names (Thorndike, 1904). The same concept may circulate under different labels. |
| Jaccard index | The content-overlap ratio computed by |A ∩ B| / |A ∪ B|. The number of items shared between two scales divided by the total number of unique items across both scales. Ranges from 0 to 1. |
| Psychometrics | The statistical discipline concerned with the theory and methods of measuring psychological and social constructs. |
| Item | A single question, statement, or indicator within a scale. The smallest building block of a construct measure. |
| Factor analysis | The family of statistical methods that estimate a small number of latent dimensions (factors) underlying a large number of items. |
| CFA (Confirmatory Factor Analysis) | Tests whether a pre-specified factor structure is supported by the data. |
| EGA (Exploratory Graph Analysis) | A network-psychometrics method that models items as a graph and uncovers their dimensional structure. |
| UVA (Unique Variable Analysis) | A method applied to EGA output that identifies and removes redundantly similar (locally dependent) items. |
| wTO (weighted Topological Overlap) | A network metric capturing the extent to which two nodes share neighborhood structure. Typical UVA threshold: 0.20–0.25. |
| Cronbach’s alpha (α) | The classical reliability coefficient measuring the internal consistency of a scale’s items. Values range from 0 to 1. |
| McDonald’s omega (ω) | A modern reliability coefficient based on a factor model, offered as an alternative to Cronbach’s alpha. More accurate when items’ factor loadings are unequal (i.e., when tau-equivalence does not hold). |
| Factor loading | The standardized coefficient indicating an item’s correlation with a latent factor. Higher values indicate stronger representation of the factor. |
| Item-total correlation | The correlation between an item’s score and the scale’s total score. A simple indicator of the item’s contribution to the measure. |
| Discrimination parameter | In Item Response Theory (IRT), the parameter indicating how well an item discriminates between levels of the latent ability. |
| Convergent validity | The expectation that two scales presumed to measure the same construct will be highly correlated. |
| Discriminant validity | The expectation that scales measuring different constructs will exhibit relatively low correlations. |
| Incremental validity | The capacity of a new scale to explain statistically significant additional variance beyond the variance explained by existing scales. |
| Distortion | The deviation of a measurement, computation, or inference from its true value; in this literature, “meta-analytic distortion” specifically refers to systematic bias produced by pooling different scales in the same meta-analysis. |
| Replication | The process by which a finding is reproduced by another researcher, in another sample, using a similar or identical method. |
| Idiosyncratic item | An item that appears in only one scale and in none of the rival scales. Represents a scale-specific component outside the construct’s core. |
| Contamination (content contamination) | The mixing of a scale’s items with the content of another structure (e.g., normal pregnancy physiology or anxiety items); prevents the scale from measuring the pure target construct. |
| Heterogeneity | Internal diversity within a group, scale, or construct; the absence of uniformity. |
| Nomological network | Cronbach and Meehl’s (1955) concept. The map of a construct’s expected theoretical relationships with other constructs. Provides the theoretical infrastructure of validity. |
| Likert scale | The ordinal response format introduced by Rensis Likert in 1932, in which respondents rate their level of agreement with a statement (e.g., on a 1–5 scale; Likert, 1932). |
| Cognitive interview | A structured interview technique used to surface the actual mental processes a respondent goes through when answering an item. |
| Sentence embedding | A machine-learning method that converts a sentence into a high-dimensional numerical vector (using models such as BERT, MPNet, BERTurk). Enables measurement of semantic similarity between sentences. |
| Cosine similarity | The cosine of the angle between two vectors. The standard metric for semantic proximity between sentences in embedding space. Ranges from −1 to 1. |
| ICC (Intraclass Correlation Coefficient) | Evaluates inter-rater agreement or the consistency of repeated measurements. |
| Dependent-correlation test | A test of whether two correlations from the same sample are statistically different (Gonzalez et al., 2020). Used in jingle/jangle decisions. |
| Meta-analysis | The secondary research method that statistically combines the results of different studies addressing the same question. |
| Autobiographical memory | The recall of specific events from one’s own life. An important mechanism shaping response process in cognitive interviewing. |
| Causal relationship | A directed link between two variables in which one directly affects the other; distinct from mere correlation. |
| Network psychometrics | A measurement approach that models items as nodes and partial correlations among them as edges. Offers an alternative to the classical latent-factor approach. |
| Extrinsic criterion | An independent variable external to the scale (e.g., a true medical diagnosis, a behavioral outcome) against which the scale’s correlation is tested. |
| Bifactor ESEM | Bifactor Exploratory Structural Equation Modeling. A flexible factor analytic approach that estimates a general factor and specific sub-factors within the same model. |
| MTMM (Multitrait-Multimethod) | A classical approach in which the same construct is measured with different methods and different constructs with the same method, organized in a matrix to evaluate validity (Campbell & Fiske, 1959). |
References
Bernardin, F., Gauld, C., Martin, V. P., Laprévote, V., & Dondé, C. (2023). The 68 symptoms of the clinical high risk for psychosis: Low similarity among fourteen screening questionnaires. Psychiatry Research, 330, 115592. https://doi.org/10.1016/j.psychres.2023.115592
Blötner, C. (2025). Extending Lawson and Robins’ (2021) guideline for the evaluation of jingle and jangle fallacies. Behavior Research Methods, 57(6), 177. https://doi.org/10.3758/s13428-025-02691-6
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.
Chen, X., Hu, W., Hu, Y., Xia, X., & Li, X. (2020). Discrimination and structural validity evaluation of Zung self-rating depression scale for pregnant women in China. Journal of Psychosomatic Obstetrics & Gynaecology, 43(1), 26–34. https://doi.org/10.1080/0167482X.2020.1770221
Chrobak, A. A., Siwek, M., Dudek, D., & Rybakowski, J. K. (2018). Content overlap analysis of 64 (hypo)mania symptoms among seven common rating scales. International Journal of Methods in Psychiatric Research, 27(3), e1737. https://doi.org/10.1002/mpr.1737
Chrobak, A. A., Krupa, A., Dudek, D., & Siwek, M. (2021). How soft are neurological soft signs? Content overlap analysis of 71 symptoms among seven most commonly used neurological soft signs scales. Journal of Psychiatric Research, 138, 404–412. https://doi.org/10.1016/j.jpsychires.2021.04.020
Chrobak, A. A., Rusinek, J., Dec-Ćwiek, M., Porębska, K., & Siwek, M. (2024). Content overlap of 91 dystonia symptoms among the seven most commonly used cervical dystonia scales. Neurological Sciences, 45(4), 1507–1514. https://doi.org/10.1007/s10072-023-07157-1
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.
Fried, E. I. (2017). The 52 symptoms of major depression: Lack of content overlap among seven common depression scales. Journal of Affective Disorders, 208, 191–197. https://doi.org/10.1016/j.jad.2016.10.019
Gonzalez, O., MacKinnon, D. P., & Muniz, F. B. (2020). Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research, 56(1), 3–19. https://doi.org/10.1080/00273171.2019.1707061
Karstoft, K.-I., & Armour, C. (2022). What we talk about when we talk about trauma: Content overlap and heterogeneity in the assessment of trauma exposure. Journal of Traumatic Stress, 36(1), 71–82. https://doi.org/10.1002/jts.22880
Kelley, T. L. (1927). Interpretation of educational measurements. World Book Company.
Larsen, K. R., & Bong, C. H. (2016). A tool for addressing construct identity in literature reviews and meta-analyses. MIS Quarterly, 40(3), 529–551.
Le, H., Schmidt, F. L., Harter, J. K., & Lauver, K. J. (2010). The problem of empirical redundancy of constructs in organizational research: An empirical investigation. Organizational Behavior and Human Decision Processes, 112(2), 112–125. https://doi.org/10.1016/j.obhdp.2010.02.003
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 5–55.
Muench, A., Lampe, E. W., Garland, S. N., Dhaliwal, S., & Perlis, M. L. (2024). Constructing a picture of fatigue in the context of cancer: Assessment of construct overlap in common fatigue scales. Supportive Care in Cancer, 32(11), 737. https://doi.org/10.1007/s00520-024-08930-4
Thorndike, E. L. (1904). An introduction to the theory of mental and social measurements. Science Press.
