Can AI Grade OSCEs from Video Recordings?
A medical school experiment tested whether AI can assess clinical skills as accurately as human experts
Dear Medical Educator,
A valuable new study published just last week put AI to the test in a high-stakes arena: can it evaluate Objective Structured Clinical Examinations (OSCEs) as accurately as human experts?
Researchers compared two AI models against human experts in assessing video recordings of 196 medical students as they performed four essential clinical tasks:
Intramuscular injection
Square knot tying
Basic life support
Urinary catheterization
Each student was evaluated by five independent assessors:
One real-time human examiner
Two expert video reviewers
Two AI models (ChatGPT-4o and Gemini 1.5 Flash)
All evaluators used standardized checklists.
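For readers curious about the mechanics, here is a minimal sketch of how a multimodal model can be asked to rate sampled video frames against a checklist. This is not the study's pipeline: the checklist items, frame-sampling interval, prompt wording, and the gpt-4o call are illustrative assumptions, and the Gemini arm would follow the same pattern with its own SDK.

```python
# Minimal sketch (not the study's protocol): sample frames from an OSCE video
# and ask a multimodal model to score them against a checklist.
# Assumptions: OPENAI_API_KEY is set; the checklist items are illustrative only.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

CHECKLIST = [
    "Identifies the injection site using anatomical landmarks",
    "Cleans the site with an antiseptic swab",
    "Inserts the needle at a 90-degree angle",
]  # hypothetical items, not the study's checklist

def sample_frames(video_path: str, every_n: int = 60) -> list[str]:
    """Return base64-encoded JPEG frames, keeping one frame every `every_n`."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        i += 1
    cap.release()
    return frames

def score_video(video_path: str) -> str:
    """Send sampled frames plus the checklist to the model and return its answer."""
    client = OpenAI()
    frames = sample_frames(video_path)[:20]  # keep the request small
    prompt = (
        "You are an OSCE examiner. For each checklist item, answer "
        "'performed' or 'not performed' based on the frames:\n"
        + "\n".join(f"- {item}" for item in CHECKLIST)
    )
    content = [{"type": "text", "text": prompt}] + [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for b64 in frames
    ]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(score_video("osce_injection_station.mp4"))  # hypothetical file name
```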
AI vs. Human: The Scoring Gap
Across all skills, AI tended to score students higher than human evaluators—but the degree varied by task.
Intramuscular Injection
AI: 28.23 | Human: 25.25
Best agreement on visual steps like identifying landmarks; less so on verbal explanations.
Knot Tying
AI: 16.07 | Human: 10.44
Largest scoring gap; AI likely overestimated fine motor skill execution.
Basic Life Support
AI: 17.05 | Human: 16.48
Smallest difference; strong visual alignment, but mixed results on more nuanced steps.
Urinary Catheterization
AI: 26.68 | Human: 27.02
Most balanced evaluation; verbal tasks still showed variability, but AI held steady across visual components.
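To put the gaps in one place, here is a small Python snippet that simply tabulates the mean checklist scores reported above and prints the AI-minus-human difference for each station. The differences are plain arithmetic on those means; any formal agreement statistic (ICC, kappa) would require the per-student, item-level ratings from the original paper.

```python
# Mean checklist scores as reported in the summary above.
MEAN_SCORES = {
    "Intramuscular injection": {"ai": 28.23, "human": 25.25},
    "Square knot tying":       {"ai": 16.07, "human": 10.44},
    "Basic life support":      {"ai": 17.05, "human": 16.48},
    "Urinary catheterization": {"ai": 26.68, "human": 27.02},
}

for station, s in MEAN_SCORES.items():
    gap = s["ai"] - s["human"]
    direction = "higher" if gap > 0 else "lower"
    print(f"{station}: AI scored {abs(gap):.2f} points {direction} on average")

# Note: proper agreement statistics would need per-student, per-item ratings,
# which are not reproduced in this newsletter.
```

Run as written, this prints a gap of 2.98 points for injection, 5.63 for knot tying, 0.57 for basic life support, and 0.34 (in the human direction) for catheterization.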
Where AI Excelled and Where It Fell Short
The effectiveness of AI varied by the nature of the task:
Visual Tasks: AI aligned well with human scores when tasks were predominantly visual, such as inserting needles or positioning equipment.
Auditory Tasks: AI struggled with tasks involving verbal interaction, such as explaining procedures. These steps, rich in nuance and context, often led to discrepancies between AI and human evaluators.
As this study showed, the risk of overscoring is real. The most important limitation is uncertainty about how far we can rely on AI outputs. The best path forward, for now, is a hybrid model with humans always in the loop: the study supports a collaborative future in which AI serves as an assistant.
That said, more advanced models like Google's Gemini 2.5 Pro and OpenAI's o3 are already here. As future models improve, they may overcome current limitations. Human judgment remains essential for now, but its indispensability may not be permanent.
Yavuz Selim Kıyak, MD, PhD (aka MedEdFlamingo)
Follow the flamingo on X (Twitter) at @MedEdFlamingo for daily content.
Subscribe to the flamingo’s YouTube channel.
LinkedIn is another option to follow.
Who is the flamingo?
Related #MedEd reading:
Çiçek, F. E., Ülker, M., Özer, M., & Kıyak, Y. S. (2025). ChatGPT versus expert feedback on clinical reasoning questions and their effect on learning: a randomized controlled trial. Postgraduate Medical Journal, 101(1195), 458-463. https://academic.oup.com/pmj/article/101/1195/458/7917102
Kıyak, Y. S., & Kononowicz, A. A. (2024). Case-based MCQ generator: a custom ChatGPT based on published prompts in the literature for automatic item generation. Medical Teacher, 46(8), 1018-1020. https://www.tandfonline.com/doi/full/10.1080/0142159X.2024.2314723
Kıyak, Y. S., & Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgraduate Medical Journal, 100(1189), 858-865. https://academic.oup.com/pmj/advance-article/doi/10.1093/postmj/qgae065/7688383
Kıyak, Y. S., & Emekli, E. (2024). A Prompt for Generating Script Concordance Test Using ChatGPT, Claude, and Llama Large Language Model Chatbots. Revista Española de Educación Médica, 5(3). https://revistas.um.es/edumed/article/view/612381