Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?

Urology

BJU International

Research Authors: Omer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genc, Kader Ada Dogan, Tarik Emre Sener, Bahadir Sahi

AIIM Authors: Alexander Duval, Ethan Lowder

Approved by President Reda Riffi

Publication Date:

Comprehensive Summary

The paper by Asker et al. investigates the use of LLM chatbots in urology and evaluates their accuracy, calibration error, readability, and understandability using objective measurements. The authors took 35 multiple-choice questions from the European Board of Urology in-service examination booklets and had five LLMs answer them: ChatGPT-4o, DeepSeek-R1, Gemini-2.0, Grok-2, and Claude 3.5. In addition to answering each question, the chatbots were asked to explain their answer in one paragraph that a junior doctor could understand and to give a confidence score from 0 to 100.

ChatGPT-4o, DeepSeek-R1, and Grok-2 tied for the highest accuracy with 28 of 35 questions correct, while Claude 3.5 and Gemini-2.0 answered 27 and 24 correctly, respectively. Performance also varied by subspecialty; for example, in Paediatric Urology, DeepSeek-R1 and Grok-2 outperformed ChatGPT-4o.

Calibration error measures how closely each LLM's stated confidence matched its actual accuracy, so a lower number indicates better calibration. The reported calibration errors were 19.2 for ChatGPT-4o, 30.3 for Grok-2, 19.7 for Claude 3.5, 21.7 for DeepSeek-R1, and 38.4 for Gemini-2.0, making ChatGPT-4o the best calibrated and Gemini-2.0 the worst.

For readability, different metrics yielded different results, but the researchers concluded that DeepSeek-R1 performed best on most metrics, with ChatGPT-4o also performing well on some. Understandability, from best to worst, was Claude 3.5, Gemini-2.0, DeepSeek-R1, Grok-2, and ChatGPT-4o.

The discussion states that each LLM has its own strengths and weaknesses in urology, and that continued improvement of LLMs could contribute greatly to the medical field, provided further testing shows they are accurate enough.
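To make the accuracy and calibration figures more concrete, the sketch below shows one common way such numbers can be computed. It is a minimal illustration, not the authors' method: the summary does not say which calibration formula the paper used, so the binned expected calibration error here is an assumption, and the function names `accuracy_percent` and `expected_calibration_error`, the bin count, and the example confidence values are all hypothetical. Only the correct-answer counts (28, 27, and 24 out of 35) come from the summary above.

```python
from typing import Sequence

# Correct-answer counts out of 35 questions, as reported in the summary.
REPORTED_CORRECT = {
    "ChatGPT-4o": 28,
    "DeepSeek-R1": 28,
    "Grok-2": 28,
    "Claude 3.5": 27,
    "Gemini-2.0": 24,
}


def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned expected calibration error, in percentage points.

    Assumption: confidences are on the 0-100 scale the chatbots were
    asked to use. Answers are grouped into equal-width confidence bins,
    and the gap between mean confidence and observed accuracy in each
    bin is averaged, weighted by the number of answers in the bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Integer-style binning; a confidence of 100 falls in the last bin.
        idx = min(int(conf * n_bins // 100), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_accuracy = 100 * sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_conf - bin_accuracy)
    return ece


if __name__ == "__main__":
    for model, n_correct in REPORTED_CORRECT.items():
        print(model, accuracy_percent(n_correct, 35))

    # Hypothetical answers: an overconfident model (not data from the paper).
    example_conf = [90, 90, 90, 90, 60, 60]
    example_ok = [True, True, True, False, True, False]
    print(expected_calibration_error(example_conf, example_ok))
```

In the hypothetical example, the model claims 90% confidence on four answers but gets only three right, and claims 60% on two answers but gets one right, so its stated confidence exceeds its accuracy in both bins. A perfectly calibrated model would score 0 on this measure.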

Outcomes and Implications

Ensuring that physicians in training can understand and digest research quickly is crucial to easing the workload of junior doctors. Efficient and accurate LLMs that can break down highly complex texts into more easily understandable information would give medical students and resident doctors more free time to hone their skills, relieve stress, and maintain a work-life balance. While there is no set timetable for when LLMs will be officially integrated into healthcare, an easy-to-access pseudo-assistant available free of charge would be an invaluable asset to most healthcare teams. More research is needed on the reliability, readability, and understandability of these systems, but integration seems inevitable.

Our mission is to

Connect medicine with AI innovation.

No spam. Only the latest AI breakthroughs, simplified and relevant to your field.
