Public Health

Comprehensive Summary

This study, conducted by Bentegeac et al., analyzes how well AI chatbots gauge their own confidence when answering clinical questions. The researchers compared two confidence signals, the models' self-reported confidence and the token probabilities of their responses, as predictors of response accuracy on medical questions. Nine large language models, both commercial and open source, were prompted to answer a set of 2,522 questions from the United States Medical Licensing Examination. For each answer, the models also rated their confidence on a scale from 0 to 100, and the token probability of the response was extracted. The researchers then measured each model's accuracy and evaluated how well expressed confidence and response token probability each predicted whether an answer was correct. Mean accuracy ranged from 56.5% for Phi-3-Mini (open source) to 89% for GPT-4o (commercial). However, all chatbots consistently expressed high confidence in their responses regardless of correctness, so expressed confidence failed to discriminate correct from incorrect answers. In contrast, response token probability consistently outperformed expressed confidence at predicting response accuracy (AUROCs between 0.71 and 0.95). Overall, all models demonstrated imperfect calibration, with a general trend toward overconfidence. The researchers argue that clinicians and patients should not rely on the confidence an LLM expresses when it responds to medical queries; token probabilities are emerging as a promising alternative for gauging an AI model's uncertainty.
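The evaluation idea described above can be sketched in a few lines: given per-question correctness labels and two confidence signals (the model's self-reported score and the response token probability), compute the AUROC of each signal, i.e., the probability that a correct answer receives a higher score than an incorrect one. The data below is purely illustrative, not taken from the study.

```python
def auroc(scores, labels):
    """Rank-based AUROC: fraction of (correct, incorrect) pairs where
    the correct answer scores higher; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical results for 8 questions: 1 = model answered correctly.
correct           = [1, 1, 1, 0, 1, 0, 1, 0]
stated_confidence = [92, 95, 90, 96, 93, 94, 97, 91]          # uniformly high
token_probability = [0.97, 0.88, 0.99, 0.41, 0.93, 0.90, 0.95, 0.52]

print(f"AUROC (stated confidence): {auroc(stated_confidence, correct):.2f}")
print(f"AUROC (token probability): {auroc(token_probability, correct):.2f}")
# → AUROC (stated confidence): 0.47
# → AUROC (token probability): 0.93
```

In this toy example the uniformly high stated confidence is no better than chance (AUROC near 0.5), while the token probabilities separate correct from incorrect answers well, mirroring the pattern the study reports.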

Outcomes and Implications

This research matters because it highlights how chatbots can misrepresent the reliability of their answers by presenting them in an overly confident manner. The researchers argue that chatbot output must be double-checked, especially in high-stakes clinical situations. Because current chatbots remain confident even when their answers are wrong, clinicians who need an uncertainty signal should look to token probability rather than the model's stated confidence.

Our mission is to

Connect medicine with AI innovation.

No spam. Only the latest AI breakthroughs, simplified and relevant to your field.

AIIM Research

Articles

© 2025 AIIM. Created by AIIM IT Team