Comprehensive Summary
Lockhart et al. evaluated whether ChatGPT could perform at the same level as urology trainees on the Australian written fellowship exam, which is taken by trainees near the end of specialty training and prior to beginning independent practice. The authors compared 10 trainee exams with 10 AI-generated exams produced by ChatGPT (5 each from the GPT-3.5 and GPT-4.0 models), all marked by five urology consultants. Nine of the 10 trainees passed, whereas only 6 of the 10 ChatGPT-generated exams did, with GPT-3.5 responsible for 3 of the 4 failures. The trainees passed a higher percentage of individual questions (~89% vs ~81%) and achieved higher aggregate scores (~79% vs 78%). Most importantly, the consultants accurately identified every AI-generated paper, noting their lack of detail and inappropriate clinical decision-making.
Outcomes and Implications
Although the study’s small sample size and exclusion of image-based questions limit its generalizability, the results clearly showed that ChatGPT lacked the detail and consistency of its trainee counterparts, highlighting the need for human oversight of clinical judgement. It is still notable that 4 of the 5 papers generated by GPT-4.0 passed the exam, suggesting that AI could support medical education and surgical tasks in the future. For now, AI’s role in the clinical context remains that of a supplementary tool for workflow assistance rather than a replacement for human clinical expertise in medical assessments. Continued development and testing of advanced AI models will be important to ensure safe and accurate integration of AI into real-world clinical practice.