The End of Manual Scoring? How LLMs Hit 100% Accuracy in Grading Open-Text Assessments
Study finds LLMs can score written assessments with 100% precision if prompted right
Dear Medical Educator,
If you ever need to score open-text responses, and you have a rubric for it, LLMs can now do it with surprising accuracy.
A recent study, published in Medical Teacher last week, tested this with post-encounter notes (PNs) from OSCEs. (PNs = the written summaries students complete after a standardized patient encounter, often scored with analytic rubrics.)
The findings? With the right prompt strategy and some simple tweaks, you can use GPTs and other LLMs to automatically score them, WITH 100% RELIABILITY.
Here’s how.
The setup and findings
The study used 5 sample PNs and had 7 LLMs score them 100 times each.
The goal: Apply an analytic rubric (discrete scoring for each clinical feature) to see how reliably the LLMs could extract points.
Four prompting methods were tested:
Simple prompt
Just ask the LLM to read the PN and assign points. It's easy but has low reliability: 12% exact matches (GPT-3.5), 18% (GPT-4o)
Chain-of-thought prompt
Add a phrase like “think step by step” to guide reasoning. (Chain-of-thought = Encouraging the model to reason aloud before scoring.)
Still inconsistent: 14–35% exact matches (GPT-4o)
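To make these concrete, here’s roughly what the first two prompt styles can look like in code. The rubric lines and wording below are made up for illustration; they are not the study’s actual materials.

```python
# Illustrative prompt templates only -- the rubric items and wording are
# invented for this sketch, not taken from the study.
RUBRIC = """\
1. Character and duration of chest pain (2 pts)
2. Cardiac risk factors (1 pt)
3. ECG ordered (1 pt)"""

SIMPLE_PROMPT = (
    "You are grading a post-encounter note with this analytic rubric:\n"
    f"{RUBRIC}\n\n"
    "Read the note below and report the total score.\n\n"
    "Note:\n{note}"
)
# usage: SIMPLE_PROMPT.format(note=pn_text)

# Chain-of-thought: same task, plus an instruction to reason before scoring.
COT_PROMPT = SIMPLE_PROMPT + "\n\nThink step by step before giving the final score."
```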
Two-step prompt chaining
Split the task:
Step 1: Ask LLM to identify rubric elements found in the note.
Step 2: Ask LLM to assign scores based on those identified elements.
Major jump in consistency: 78% exact matches (GPT-4o), 96–100% (Claude 3.5 Sonnet)
Two-step + external calculation
Same as above, but the LLM only lists the point values for each identified element (e.g. [2, 1, 1, 3...]) and the total score is calculated outside the LLM, deterministically. Why? Because LLMs struggle with precise arithmetic due to their architecture. With external calculation, there are no more AI math errors: 100% SUCCESS in grading across ALL models.
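If you want to try this yourself, here’s a minimal sketch of the two-step chain with external calculation. It assumes the OpenAI Python SDK (v1+) and an API key in your environment; the prompt wording is my own shorthand, not the study’s exact prompts.

```python
# A minimal sketch of two-step prompt chaining plus external score calculation.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment;
# prompt wording is illustrative, not the study's exact prompts.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.05,  # low temperature for more consistent output
    )
    return response.choices[0].message.content

def score_note(note: str, rubric: str) -> int:
    # Step 1: identify which rubric elements are documented in the note (no math yet).
    found = ask(
        f"Rubric:\n{rubric}\n\nPost-encounter note:\n{note}\n\n"
        "List only the rubric elements that are documented in this note."
    )
    # Step 2: ask for the point value of each identified element as a JSON list.
    points_raw = ask(
        f"Rubric:\n{rubric}\n\nElements identified in the note:\n{found}\n\n"
        "Return the point value of each identified element as a JSON list of "
        "integers, e.g. [2, 1, 1], and nothing else."
    )
    # External calculation: the total is summed in code, not by the LLM.
    return sum(json.loads(points_raw))
```

In practice, you’d also want to check that step 2 really returns clean JSON before summing.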
A previous study I briefly covered showed that LLMs aren’t very reliable at grading OSCE videos, but now we see that, with well-designed prompts, they can score text-based inputs with high accuracy.
Pro tip for implementation
Use a low temperature (e.g., 0.05). “Temperature” controls how random the model’s output is: lower means more consistent, near-deterministic output, but less “delicious” if you use it for writing or feedback. By default, it’s set between 0.7 and 1.0 in OpenAI’s products. You can’t adjust this in ChatGPT, but it is configurable via the API (see the sketch below).
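If you’re calling the API directly, temperature is a single request parameter, the same temperature=0.05 used in the sketch above; the model name here is just an example.

```python
# Temperature is a single request parameter when you call the API directly
# (the ChatGPT web app does not expose it). OpenAI Python SDK assumed.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Score this note against the rubric ..."}],
    temperature=0.05,  # low = more consistent; the API default is 1.0
)
print(response.choices[0].message.content)
```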
You can also use this setup to refine your rubric: the spots where LLMs score inconsistently may reveal ambiguous scoring criteria.
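Here’s a small sketch that covers both points, running multiple trials and letting the spread of scores flag ambiguous rubric items; score_fn can be the score_note() function from the earlier sketch.

```python
# A small sketch for repeated-trial consistency checking. `score_fn` is any
# scorer that returns an integer total, e.g. the score_note() sketch above.
from collections import Counter
from typing import Callable

def consistency_check(score_fn: Callable[[str], int], note: str,
                      expert_score: int, n_trials: int = 100) -> None:
    scores = [score_fn(note) for _ in range(n_trials)]
    counts = Counter(scores)
    exact_match_rate = counts[expert_score] / n_trials
    print("Score distribution:", dict(counts))
    print(f"Exact-match rate vs. expert score: {exact_match_rate:.0%}")
    # A wide spread of totals on the same note often points to an ambiguous
    # rubric criterion rather than (or in addition to) model error.

# Hypothetical usage, reusing names from the earlier sketches:
# consistency_check(lambda n: score_note(n, RUBRIC), pn_text, expert_score=8)
```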
All in All
If you’re scoring text with an analytic rubric:
Skip simple prompts
Use two-step chaining
Offload scoring math
Run multiple trials
You’ll get AI-assisted scoring that’s fast, scalable, and, more importantly, trustworthy.
These efforts aim to produce consistent outputs from non-reasoning LLMs. However, reasoning models like Google's Gemini 2.5 Pro, OpenAI's o3, and Anthropic's Claude 4 handle math better and reason more effectively. With these, we may not need elaborate prompt engineering to achieve reliable results.
Longtime subscribers to this newsletter already know this, but if you haven’t read it yet, this post explains why reasoning models are important:
I don’t believe prompt engineering has a long-term future. As reasoning models improve, AI will eventually handle it better than humans: you’ll just specify the desired outcome, and it’ll create the necessary prompt.
Until that not-so-distant future arrives, we still have to suffer through prompt gymnastics.
Yavuz Selim Kıyak, MD, PhD (aka MedEdFlamingo)
Follow the flamingo on X (Twitter) at @MedEdFlamingo for daily content.
Subscribe to the flamingo’s YouTube channel.
LinkedIn is another option to follow.
Who is the flamingo?
Related #MedEd reading:
Çiçek, F. E., Ülker, M., Özer, M., & Kıyak, Y. S. (2025). ChatGPT versus expert feedback on clinical reasoning questions and their effect on learning: a randomized controlled trial. Postgraduate Medical Journal, 101(1195), 458–463. https://academic.oup.com/pmj/advance-article/doi/10.1093/postmj/qgae170/7917102
Kıyak, Y. S., & Kononowicz, A. A. (2024). Case-based MCQ generator: a custom ChatGPT based on published prompts in the literature for automatic item generation. Medical Teacher, 46(8), 1018–1020. https://www.tandfonline.com/doi/full/10.1080/0142159X.2024.2314723
Kıyak, Y. S., & Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgraduate Medical Journal, 100(1189), 858–865. https://academic.oup.com/pmj/advance-article/doi/10.1093/postmj/qgae065/7688383
Kıyak, Y. S., & Emekli, E. (2024). A Prompt for Generating Script Concordance Test Using ChatGPT, Claude, and Llama Large Language Model Chatbots. Revista Española de Educación Médica, 5(3). https://revistas.um.es/edumed/article/view/612381