Comprehensive Summary
Niset et al. evaluated the diagnostic accuracy of open- and closed-source large language models (LLMs) in generating early diagnostic predictions in an emergency medicine setting. The study was conducted at a tertiary emergency department (ED) in Belgium. Using data from 79 ED cases, the authors generated 2,370 AI diagnostic predictions by running six model pipelines, each combining one of 2 embedding models with one of 3 foundational models through a uniform retrieval-augmented generation (RAG) architecture. Each pipeline’s top 5 predictions were compared against final diagnoses established by three emergency physicians. The pipelines achieved similar diagnostic performance, with accurate predictions in 62-72% of cases. Performance depended more on the individual case than on the type of model: results were highly correlated across pipelines on each case, with specific and surgical diagnoses showing higher match rates than nonspecific and medical ones. Additionally, missed cases were often attributable to nonspecific diagnoses rather than to limitations of the models. Open-source models significantly outperformed GPT-4-based models in verifiable sourcing. Overall, open-source LLMs provided the best prediction performance with high transparency, demonstrating a viable alternative for aiding clinical decision making in emergency medicine settings.
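The prediction count reported above follows directly from the study design: 79 cases, six pipelines (2 embedding models times 3 foundational models), and 5 predictions per pipeline per case. The short Python sketch below checks that arithmetic and illustrates one way a top-5 match rate could be computed. The model labels, the `top_k_match_rate` helper, the exact-string matching rule, and the two toy cases are all assumptions made for illustration; the summary does not name the models or describe how a match was judged, and the toy cases are not study data.

```python
from itertools import product

# Counts stated in the summary.
N_CASES = 79
TOP_K = 5

# Placeholder labels (assumptions): the summary gives only the counts,
# 2 embedding models and 3 foundational models, not their names.
embedding_models = ["embedding_A", "embedding_B"]
foundation_models = ["foundation_1", "foundation_2", "foundation_3"]

# Every embedding/foundational pairing is one pipeline.
pipelines = list(product(embedding_models, foundation_models))
total_predictions = N_CASES * len(pipelines) * TOP_K

print(len(pipelines))       # number of pipelines
print(total_predictions)    # number of predictions across all cases


def top_k_match_rate(predictions, references, k=TOP_K):
    """Fraction of cases whose reference diagnosis is among the top k predictions.

    Exact string matching is a simplifying assumption for this sketch;
    how the study judged a match is not described in the summary.
    """
    hits = sum(ref in preds[:k] for preds, ref in zip(predictions, references))
    return hits / len(references)


# Toy illustration only (invented cases, not study data).
toy_predictions = [
    ["appendicitis", "gastroenteritis"],
    ["migraine", "tension headache"],
]
toy_references = ["appendicitis", "stroke"]
toy_rate = top_k_match_rate(toy_predictions, toy_references)
print(toy_rate)
```

Here 2 × 3 = 6 pipelines and 79 × 6 × 5 = 2,370 predictions, consistent with the figures in the summary. In the toy example, one of the two invented cases has its reference diagnosis among the predictions, giving a match rate of 0.5.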
Outcomes and Implications
With medical staff shortages and rising patient numbers, emergency departments often operate under critical time constraints. Niset et al. highlight the potential of LLMs to support fast-paced clinical decisions by generating early diagnostic predictions from patient information under physician oversight. Their findings suggest that open-source LLMs, with high diagnostic match rates and transparency, can serve as diagnostic support tools in emergency departments without compromising patient privacy or interpretability. However, the strong influence of case specificity on model performance shows that physician oversight remains essential, especially for ambiguous or vague cases. The study demonstrates how LLMs can meaningfully support clinical judgment in emergency settings marked by high patient volume, limited time, and immense pressure on providers.