AI Performed Worse Because We Tried to Help
Surprise: GPT-4o wrote better questions without extra help
Dear Medical Educator,
What if I told you that AI models could perform worse when we try to help them by providing extra information?
A recent study (published last week) showed exactly that, and it could change how you think about using AI for content generation.
The Setup
Researchers tested ChatGPT-4o’s ability to write USMLE-style pharmacology questions from a set of learning objectives. They ran two versions of the AI:
Standard GPT-4o: Given the learning objective, the model generated a multiple-choice question directly.
"Enhanced" GPT-4o (RAG): The model was given the learning objective plus extra information pulled from top pharmacology textbooks using a method called retrieval-augmented generation (RAG).
RAG is supposed to reduce AI hallucinations by feeding it relevant external content before it writes.
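The paper's pipeline code isn't reproduced here, so the snippet below is only a minimal Python sketch of the idea: retrieve a few "relevant" textbook passages and prepend them to the learning objective before asking the model to write a question. The toy retrieve function and the prompt wording are my illustration, not the study's implementation.

```python
# Minimal sketch of RAG-style prompt assembly (illustrative only, not the study's code).
# The "retrieve" step stands in for whatever retrieval engine a real pipeline would use.

def retrieve(objective: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank textbook passages by word overlap with the objective."""
    obj_words = set(objective.lower().split())
    scored = sorted(corpus, key=lambda p: len(obj_words & set(p.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(objective: str, corpus: list[str]) -> str:
    """Plain prompting uses only the objective; RAG prepends retrieved passages as context."""
    context = "\n\n".join(retrieve(objective, corpus))
    return (
        "Use the following textbook excerpts as context:\n"
        f"{context}\n\n"
        f"Write one USMLE-style multiple-choice question for this learning objective:\n{objective}"
    )

corpus = [
    "Beta-blockers reduce heart rate by antagonizing beta-1 adrenergic receptors.",
    "Beta-blockers are contraindicated in severe asthma due to bronchospasm risk.",
]
print(build_prompt("List contraindications of beta-blockers", corpus))
```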
Sounds great in theory.
The Results
Turns out, GPT-4o did better without the extra "help":
Standard GPT-4o: 88% question accuracy
RAG-enhanced GPT-4o: 69.2% accuracy
That’s not a small difference, and it underscores a key insight: sometimes, giving the model more information can actually confuse it.
Why Did RAG Make It Worse?
The high baseline (88%) from GPT-4o reflects something echoed in other studies: very large models often have strong enough internal knowledge of clinical pharmacology to generate good MCQs on their own. When you inject redundant or loosely related textbook content, it can override that concise internal process with a more error-prone synthesis of the external material.
RAG systems can search documents (like textbooks) to find "relevant" passages based on word overlap or statistical similarity (cosine similarity). But this method sometimes retrieves text that sounds related yet is subtly off-target.
Example: If the objective is about β-blocker contraindications, RAG might pull a paragraph on β-blocker mechanisms. Similar vocabulary, but not the right concept.
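To make that concrete, here is a toy demonstration (my own illustration, not the study's retrieval code) of how bag-of-words cosine similarity can rank a mechanism passage above a contraindication passage, purely because it shares more surface vocabulary with the query:

```python
# Toy illustration: lexical cosine similarity rewards shared vocabulary, not shared concepts.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    return dot / (sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values())))

query = "contraindications of beta blockers"
mechanism = "beta blockers block beta adrenergic receptors slowing the heart"  # wrong concept
contraindication = "avoid in asthma because of bronchospasm"                   # right concept

print(f"mechanism passage:        {cosine(query, mechanism):.2f}")         # ~0.45 (ranked first)
print(f"contraindication passage: {cosine(query, contraindication):.2f}")  # ~0.20
```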
This mismatch probably confuses the model. It gets distracted or misled, leading to more errors in the final question.
So What Now?
I think that, instead of relying on “blind” search tools, we should simply ask ChatGPT (or another AI) to find and explain the information most relevant to the learning objective at hand.
Prompted properly, GPT-4o can understand not just words, but intent and context. That’s a kind of semantic search that can outperform lexical RAG methods. This opens a research question: Can AI not only generate assessment items, but also curate its own sources?
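As a rough sketch of what that could look like in practice (the prompt wording and workflow below are my illustration, not a published method), the model itself can be asked to judge and justify which candidate passages actually serve the objective before writing anything:

```python
# Hedged sketch of "LLM as curator": the model selects and justifies sources
# for a learning objective, instead of a lexical retriever picking them blindly.
# (Prompt wording and workflow are illustrative assumptions, not the study's method.)

def curation_prompt(objective: str, candidates: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(candidates))
    return (
        f"Learning objective: {objective}\n\n"
        "Candidate source passages:\n"
        f"{numbered}\n\n"
        "Task: Select only the passages that directly support this objective, "
        "briefly justify each selection, and flag any passage that merely shares "
        "vocabulary without the right concept. Then, using only the selected "
        "passages, write one USMLE-style multiple-choice question."
    )

candidates = [
    "Beta-blockers reduce heart rate by antagonizing beta-1 adrenergic receptors.",
    "Beta-blockers are contraindicated in severe asthma due to bronchospasm risk.",
]
print(curation_prompt("List contraindications of beta-blockers", candidates))
```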
Actually, if we use reasoning models (such as OpenAI's o3, Google's Gemini 2.5 Pro, or Claude Opus 4) rather than GPT-4o (a non-reasoning model), we might achieve even higher-quality content generation with self-directed curation and justification of sources (assuming the sources are accessible). This is an exciting avenue for the future of AI-assisted MCQ generation. It may also mean that prompt quality becomes less critical, since the model can find the necessary reasoning paths as long as the ultimate objective is clearly defined.
Until then, this study tells us what to do:
Use GPT-4o with carefully written prompts and trusted objectives.
RAG isn't useless; it can still help in some tasks, but it must be applied carefully.
Always keep a human in the loop (even 88% is less than 100%).
Yavuz Selim Kıyak, MD, PhD (aka MedEdFlamingo)
Follow the flamingo on X (Twitter) at @MedEdFlamingo for daily content.
Subscribe to the flamingo’s YouTube channel.
LinkedIn is another option to follow.
Who is the flamingo?
Related #MedEd reading:
Çiçek, F. E., Ülker, M., Özer, M., & Kıyak, Y. S. (2025). ChatGPT versus expert feedback on clinical reasoning questions and their effect on learning: a randomized controlled trial. Postgraduate Medical Journal, 101(1195), 458–463. https://academic.oup.com/pmj/advance-article/doi/10.1093/postmj/qgae170/7917102
Kıyak, Y. S., & Kononowicz, A. A. (2024). Case-based MCQ generator: a custom ChatGPT based on published prompts in the literature for automatic item generation. Medical Teacher, 46(8), 1018–1020. https://www.tandfonline.com/doi/full/10.1080/0142159X.2024.2314723
Kıyak, Y. S., & Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgraduate Medical Journal, 100(1189), 858–865. https://academic.oup.com/pmj/advance-article/doi/10.1093/postmj/qgae065/7688383
Kıyak, Y. S., & Emekli, E. (2024). A Prompt for Generating Script Concordance Test Using ChatGPT, Claude, and Llama Large Language Model Chatbots. Revista Española de Educación Médica, 5(3). https://revistas.um.es/edumed/article/view/612381