Evaluating GP Training Selection Methodologies

Mehmet Nurullah Kurutkan, September 1, 2025

The paper “Evaluation of three short-listing methodologies for selection into postgraduate training in general practice” by Patterson, Baron, Carr, Plint, and Lane (2009) compares the effectiveness and efficiency of methods used to short-list candidates for postgraduate General Practice (GP) training in the UK.

1. Primary Objective of the Study: The primary objective of this study was to evaluate the effectiveness and efficiency of three distinct short-listing methodologies for selecting trainees into postgraduate general practice training in the UK. This research was spurred by significant changes in postgraduate medical training in the UK, brought about by the Modernising Medical Careers initiative, which increased the emphasis on robust selection methodologies. The ultimate goal was to inform optimal selection practices for postgraduate training across all medical specialties, a topic of considerable debate.

2. The Three Short-Listing Methodologies Evaluated: The study compared three short-listing methodologies: the Clinical Problem-Solving Test (CPST), Structured Application Form Questions (AFQs), and a new Situational Judgement Test (SJT).

Clinical Problem-Solving Test (CPST): This is a machine-marked test that evolved from an existing test of clinical knowledge. Operational papers typically contain around 100 questions and require 90 minutes to complete. It uses a mix of extended-match and single best-answer formats, covering various clinical areas defined by the UK training curriculum. Applicants apply clinical knowledge to solve problems, reflecting diagnostic processes or the development of patient management strategies. The CPST showed satisfactory reliability (Cronbach’s α = 0.89).
Structured Application Form Questions (AFQs): These comprise open-ended questions designed to target non-clinical competency domains specified in the person specification, such as empathy, integrity, and the ability to cope with pressure. Applicants are given 2 hours to complete seven questions, each with a 250-word limit, under invigilated conditions. Unlike the other two methods, AFQs are hand-marked by two independently trained assessors using a validated scoring framework, a process that requires a 1-day training workshop and calibration to enhance reliability. The AFQs showed satisfactory reliability (Cronbach’s α = 0.78).
Situational Judgement Test (SJT): This was a newly designed, machine-marked instrument for this context, consisting of 50 questions that depict work-related scenarios. Applicants are asked to identify appropriate responses from a list of alternatives, using formats such as ranking or multiple best-answer. The SJT specifically targeted non-clinical domains, including empathy, professional integrity, and the ability to cope with pressure. This study represents the first application of an SJT in postgraduate specialty selection, although the SJT format had already been validated for medical school admissions. The SJT items were developed by experienced general practitioners and psychologists, and scoring relied on an agreed-upon response key with substantial expert agreement. The estimated reliability for operational length SJT forms ranged from 0.80 to 0.83.
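The reliability figures quoted above can be made more concrete with a small sketch. The code below computes Cronbach’s α from an applicants-by-items score table and projects reliability to a different test length with the Spearman-Brown formula. This is an illustration only: the summary does not say how the study estimated reliability for operational-length SJT forms, so the use of Spearman-Brown here is an assumption, and the function names and toy data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one per applicant; each row holds that
    applicant's score on every item (all rows the same length).
    """
    n_items = len(scores[0])
    # Sample variance of each item (column) across applicants.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each applicant's total score.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical toy data: 4 applicants x 3 items (not taken from the study).
toy_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
]
print(round(cronbach_alpha(toy_scores), 3))
# Hypothetical: a pilot form with reliability 0.70, doubled in length.
print(round(spearman_brown(0.70, 2), 3))
```

Both functions use only the standard library, so real item-level data can be checked without extra dependencies.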

3. Main Results Regarding Validity and Effectiveness: The study found that all three short-listing methodologies (AFQs, CPST, and SJT) were valid predictors of performance at the selection centre. Performance at the selection centre (SC) was used as the outcome measure, as it has been shown to predict performance three months into GP training.

Individual Validity: After correcting for restriction of range, the SJT showed the strongest association with SC performance (r = 0.56), followed by the CPST (r = 0.44), and then AFQs (r = 0.40). In other words, the SJT was the strongest single predictor of later selection centre performance.
Incremental Validity: Both the structured AFQs and the SJT demonstrated incremental validity over the use of the CPST alone, meaning that each adds significant predictive power when used alongside the CPST. Hierarchical regression analyses confirmed that each of the three measures had significant incremental validity over the other measures (P < 0.001). The SJT, in particular, offered the most incremental validity.
Optimum Combination: The results indicated that optimum validity and efficiency are achieved using a combination of the CPST and SJT. While the strongest overall prediction of SC performance was a combination of all three measures (uncorrected r = 0.51), the CPST and SJT combination was highlighted for its efficiency benefits.
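Two ideas in the results above, correction for restriction of range and incremental validity, can be sketched in a few lines. The summary does not state which correction formula the authors applied, so the sketch assumes the widely used Thorndike Case II formula for direct range restriction. The incremental-validity function uses the standard two-predictor R² identity rather than the paper’s full hierarchical regression. All function names are hypothetical, and the input values in the usage lines are illustrative, not figures from the study.

```python
import math


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    sd_ratio: predictor SD in the full applicant pool divided by the
    predictor SD in the restricted (short-listed) sample.
    """
    r = r_restricted
    u = sd_ratio
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))


def incremental_r2(r_base, r_added, r_between):
    """Gain in R^2 from adding a second predictor to a first.

    r_base, r_added: each predictor's correlation with the outcome.
    r_between: correlation between the two predictors.
    """
    r2_both = (r_base**2 + r_added**2 - 2 * r_base * r_added * r_between) / (
        1 - r_between**2
    )
    return r2_both - r_base**2


# Illustrative values only (not taken from the paper).
print(round(correct_range_restriction(0.30, 1.5), 3))
print(round(incremental_r2(0.30, 0.40, 0.20), 3))
```

With an SD ratio of 1 (no restriction) the correction leaves the correlation unchanged, which is a quick sanity check on the formula.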

4. Efficiency and Cost Implications: Resource-efficiency is a critical consideration in developing selection methodologies, especially for large-volume recruitment.

The AFQ method was found to be relatively costly to implement, with estimated costs of £50 (€59, $74) per applicant due to the need for substantial marking resources (two trained assessors per response) and invigilation. The hand-marking of AFQs requires approximately 30 minutes of assessor time per applicant.
In contrast, machine-marked tests, such as the CPST and SJT, had actual costs of £20 (€24, $30) per applicant.
Therefore, the combination of CPST and SJT represents the most effective and efficient battery of instruments because, unlike AFQs, both tests are machine-marked, thereby significantly reducing administrative and assessor time costs. It is acknowledged, however, that SJTs (and CPSTs) require initial investment in development costs, but they become highly cost-beneficial when processing large numbers of applicants.
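The trade-off between up-front development costs and lower per-applicant costs can be illustrated with simple arithmetic. The sketch below uses the per-applicant figures quoted above (£50 for AFQs, £20 for machine-marked tests) and treats them as the per-applicant cost of each approach. The development cost is a hypothetical placeholder, since the summary gives no figure for it, and the function names are also hypothetical.

```python
import math

AFQ_COST_PER_APPLICANT = 50      # GBP, estimate quoted in the summary
MACHINE_COST_PER_APPLICANT = 20  # GBP, figure quoted in the summary


def total_cost(n_applicants, cost_per_applicant, development_cost=0):
    """Total cost: one-off development cost plus per-applicant cost."""
    return development_cost + n_applicants * cost_per_applicant


def break_even_applicants(
    development_cost,
    hand_cost=AFQ_COST_PER_APPLICANT,
    machine_cost=MACHINE_COST_PER_APPLICANT,
):
    """Smallest applicant count at which machine marking is no more
    expensive than hand marking, given a one-off development cost."""
    return math.ceil(development_cost / (hand_cost - machine_cost))


# Hypothetical development cost; the source does not report one.
assumed_development_cost = 30_000
n = break_even_applicants(assumed_development_cost)
print(n)
print(total_cost(n, AFQ_COST_PER_APPLICANT))
print(total_cost(n, MACHINE_COST_PER_APPLICANT, assumed_development_cost))
```

Beyond the break-even point, every additional applicant saves the difference between the two per-applicant costs, which is why machine-marked tests favour large-volume recruitment.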

5. Significance of the SJT’s Evaluation: The evaluation of the SJT in this study is particularly significant because it is the first study to evaluate a machine-marked SJT specifically for assessing non-clinical domains for postgraduate selection in the UK. SJTs are well-suited to assessing non-clinical domains such as empathy, integrity, and coping with pressure, which are crucial for success in medical training. The study demonstrated that this methodology is effective in this specific postgraduate context. Examples provided in the source illustrate the SJT’s focus on these non-clinical skills through scenario-based questions requiring ranking or multiple best-answer responses.

6. Broader Implications for Selection Practices: The results have important implications for developing selection systems for large-volume recruitment in medical specialties in the UK.

The findings directly inform optimal selection practices for postgraduate training across all medical specialties.
The study supports current proposals for selection into UK postgraduate training for all medical specialties regarding the use of machine-marked tests for short-listing purposes.
The emphasis on valid and efficient methodologies is crucial given the unprecedented changes in postgraduate medical training in the UK due to the Modernising Medical Careers initiative.

7. Identified Limitations and Suggestions for Future Research: The study acknowledges several limitations and proposes areas for future research:

Long-Term Predictive Validity: While the study clearly demonstrated the validity of the measures in predicting performance in the final stage of selection (the Selection Centre), further research is needed to examine whether these measures also predict actual performance in training and long-term work-based assessment once trainees are in post. This is currently being explored in a separate validation study.
Data Loss: Approximately 12% of SJT data was unavailable due to an administrative error. However, the sample size remained adequate (463 applicants), and there was no evidence of systematic bias in the lost data.
Pilot Nature of SJT: The SJT was a pilot, and applicants knew that their scores would not affect selection decisions. Candidates may therefore have taken the test less seriously than they would an operational one, which would attenuate its observed reliability and validity coefficients.
Generalizability to Other Specialties: Although there are common competencies across specialties, selection criteria can differ, particularly in non-clinical domains. Research is required to determine if an SJT is an appropriate methodology for use in other medical specialties.
Applicant Reactions and Perceptions of Fairness: Future research should explore applicant reactions and perceptions of fairness, which are crucial evaluative standards for selection methodologies.
Coaching or Practice Effects: Further research should investigate whether these assessments are prone to coaching or practice effects.

Reference: Patterson, F., Baron, H., Carr, V., Plint, S., & Lane, P. (2009). Evaluation of three short-listing methodologies for selection into postgraduate training in general practice. Medical Education, 43(1), 50–57. https://doi.org/10.1111/j.1365-2923.2008.03238.x
