Comprehensive Summary
This study evaluates the capacity and accuracy of several large language models (LLMs) in providing clinical recommendations for people experiencing postmenopausal osteoporosis. The LLMs tested were ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Gemini, and Gemini Advanced; Microsoft Copilot was excluded due to international restrictions. Each model was given 42 questions spanning 12 categories drawn from the AACE Clinical Practice Guidelines. Independent reviewers judged each response as accurate or inaccurate, where "accurate" meant the response contained the key points stated in the AACE recommendations. Inaccurate responses were further classified as either over-conclusive or insufficient. The best-performing LLM was ChatGPT-4o, which answered 88.1% of questions (37/42) accurately according to AACE guidelines. It was followed by ChatGPT-4.0 (64.3%), a tie between ChatGPT-3.5 and Gemini Advanced (57.1%), and finally Gemini (45.2%). The main reason for inaccurate responses was insufficient information (66.6–82.6% of cases). The study found no statistically significant difference when grouping LLMs by payment model or company affiliation. However, the ChatGPT group did differ significantly from the other LLMs, producing a significantly greater number of insufficient answers.
Outcomes and Implications
The accuracy differences among LLMs imply that considerable development is required before these models can play a role in clinical settings involving postmenopausal osteoporosis. Nevertheless, they can still benefit patients who use them as one of multiple sources of information about their condition. Additionally, LLMs that displayed high accuracy, such as ChatGPT-4o, could help busy physicians stay up to date on guideline changes and new research on postmenopausal osteoporosis. As the field of AI continues to grow, these LLMs will likely only become more accurate.