Comprehensive Summary
This study analyzed the performance of ChatGPT-4 on the 2023 American Society for Surgery of the Hand (ASSH) Maintenance of Certification Self-Assessment Examination (SAE) before and after exposure to five previous versions of the test. In March 2024, the 195 questions from the 2023 ASSH SAE were used to evaluate GPT-4's performance with no prior exposure to previous exams. On a separate system, completed ASSH SAEs from 2014, 2015, 2017, 2018, and 2020 were uploaded to GPT-4, including all questions, correct answers, and explanations. This second instance of GPT-4 then took the 2023 ASSH SAE, with all questions administered exactly as they were for the system with no prior exposure.

Before prompting, GPT-4 correctly answered 131 of 195 questions (67.2%); after prompting, it correctly answered 138 of 195 (70.8%). There was no statistically significant difference in performance between pre- and post-prompting. The results were further broken down by question format (image vs text), anatomical distribution (finger, hand, wrist, forearm, elbow, miscellaneous), and question sub-category (anatomy, basic science, diagnostic, management); no statistically significant difference between pre- and post-prompting was found in any of these categories. The passing threshold for the ASSH SAE is 50%, which GPT-4 exceeded comfortably both before and after prompting. The absence of a statistically significant improvement suggests that the hand pathology sources GPT-4 draws on are adequate on their own, without exposure to prior examinations. This demonstrates a notable capacity of GPT-4 to accurately diagnose patient pathology and to score well on an examination designed for orthopedic surgery residents.
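As an illustrative check of the headline comparison (the summary does not state which statistical test the authors used), the sketch below applies a two-proportion chi-square test to the reported counts, 131/195 pre-prompting versus 138/195 post-prompting, using Python with scipy. The variable names and the choice of test are assumptions for illustration only, not the study's analysis.

from scipy.stats import chi2_contingency

pre_correct, post_correct, total = 131, 138, 195

# 2x2 contingency table: rows = pre-/post-prompting, columns = correct/incorrect
table = [
    [pre_correct, total - pre_correct],    # 131 correct, 64 incorrect
    [post_correct, total - post_correct],  # 138 correct, 57 incorrect
]

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"Pre-prompting:  {pre_correct}/{total} = {pre_correct / total:.1%}")    # 67.2%
print(f"Post-prompting: {post_correct}/{total} = {post_correct / total:.1%}")  # 70.8%
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")  # p > 0.05: difference not significant

A p-value above 0.05 on these counts is consistent with the study's finding that exposure to prior examinations did not significantly change performance.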
Outcomes and Implications
While GPT models have been a viable source of at-home patient aid and question fielding for some time, this study shows that GPT-4 can serve as a tool for orthopedic surgeons and non-orthopedic providers to accurately diagnose and treat hand problems. Additionally, its capacity to correctly answer ASSH SAE questions could make GPT-4 a helpful study resource for orthopedic residents learning complex concepts. GPT-4 could also be used in a clinical setting to take notes and assist providers from an administrative perspective. Given the rapid progression of AI models, this research could be repeated in the future to keep pace with LLM developments.