Comprehensive Summary
Chen et al. conducted a meta-analysis of data from 17 PubMed articles to determine which factors contribute to ChatGPT’s ability to accurately diagnose skin lesions. They performed the statistical analysis in R, assessing variables including the diagnosed dermatological condition, the lesion classification, the Fitzpatrick skin phototype, the ChatGPT model used, and its accession date. They found that ChatGPT’s diagnostic accuracy was about 70% lower (p=0.01) when using visual data than when using textual data. It was also less accurate with darker skin tones, and it performed significantly worse (p=0.004) on public data sets than on private ones. Finally, diagnostic accuracy increased almost fourfold per year (p=0.003). Chen et al. caution that the study’s conclusions are hard to generalize given the substantial heterogeneity of the data.
Outcomes and Implications
AI is increasingly being incorporated into diagnostic tools, particularly for distinguishing benign from malignant lesions. As such, it is critical to test its reliability in clinical settings. Although the results of this study are not widely generalizable, they do support the notion that ChatGPT cannot be used as an independent diagnostic tool. Chen et al. note that the algorithm must be refined and more inclusive data sets must be used to address gaps in performance before clinicians can move forward with using ChatGPT in clinical practice.