Comprehensive Summary
This systematic review of 60 studies evaluates current applications of Large Language Models (LLMs) in orthopaedics, focusing on standardized exam questions and common patient questions. Studies were identified through multiple databases and screened for relevance. All studies assessed ChatGPT; fewer also evaluated Bard, PerplexityAI, and Bing. In the 31 studies on standardized exam questions, ChatGPT 4.0 consistently outperformed other models, with accuracy ranging from 47% to 74% on text-only questions and 36% to 66% on questions with images. However, orthopaedic residents achieved higher scores (74%-75%) on the same questions, highlighting the gap between LLMs and clinical training. Twenty-two studies examined LLM responses to common patient questions, which were generally satisfactory: Likert and DISCERN scores fell in the upper ranges, and readability ranged from high school to post-graduate levels. Comparative studies demonstrated that ChatGPT outperformed Bard, though findings on other LLMs remain limited. Overall, current research on LLMs in orthopaedics concentrates on patient communication and exam-style assessments rather than clinical decision-making.
Outcomes and Implications
LLMs can provide accessible and generally satisfactory answers to common patient questions, potentially improving patient education and reducing physician workload. With increasing accuracy on exam-style questions, LLMs could also serve as a study aid for orthopaedic residents and medical students. While there is potential for clinical implementation, such as documentation, triage, and decision support, this review highlights the gap between LLMs and experienced clinicians, as well as the need for further research and model improvement before a wider scope of use.