Comprehensive Summary
This study evaluates the accuracy and relevance of AI chatbot guidance on acne management against National Institute for Health and Care Excellence (NICE) guidelines. In a double-blind setup, six AI models (DeepSeek-R1, ChatGPT-4o, Gemini 2.0 Flash, Kimi K1.5, Claude 3.7 Sonnet, and Grok 3) were tested with 10 standardized acne questions, and eight dermatology experts rated each response for accuracy and relevance on a 5-point Likert scale. Claude 3.7 Sonnet consistently achieved the highest accuracy and relevance scores; ChatGPT-4o and DeepSeek-R1 performed well but lagged slightly behind, while Gemini 2.0 Flash, Kimi K1.5, and Grok 3 showed the most uneven performance. A regression analysis found only a weak correlation between accuracy and relevance, indicating that factual correctness did not necessarily translate into a clinically meaningful response. The authors conclude that while AI chatbots show potential, considerable refinement is still needed, and the systems should be used only as supplemental tools under clinical supervision.
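The summary does not reproduce the study's statistical procedure. The sketch below is a minimal illustration, using entirely hypothetical mean ratings and an ordinary least-squares fit, of how a weak accuracy-relevance correlation of this kind could be quantified.

```python
# Minimal sketch (hypothetical data, not the study's): quantifying the
# relationship between accuracy and relevance ratings with a simple
# linear regression.
import numpy as np
from scipy import stats

# Hypothetical mean expert ratings (1-5 Likert) per chatbot response
accuracy  = np.array([4.6, 4.2, 3.8, 4.5, 3.1, 3.9, 4.8, 3.5, 4.0, 4.3])
relevance = np.array([4.1, 3.3, 4.0, 3.6, 3.8, 3.2, 4.4, 3.9, 3.4, 4.2])

# Ordinary least-squares regression of relevance on accuracy
result = stats.linregress(accuracy, relevance)
print(f"slope = {result.slope:.2f}, r = {result.rvalue:.2f}, "
      f"R^2 = {result.rvalue**2:.2f}, p = {result.pvalue:.3f}")

# A small |r| (and hence R^2) would indicate that accuracy explains
# little of the variation in relevance, i.e. factually correct answers
# are not automatically the most clinically meaningful ones.
```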
Outcomes and Implications
Accurate acne guidance is essential to prevent medication misuse and inappropriate treatment. The strong performance of certain AI models suggests they could help dermatologists provide fast, guideline-aligned advice. For now, however, these models should be used only as supplemental tools: dermatologist judgement remains the cornerstone of safe and effective treatment.