Comprehensive Summary
Porkaew et al. studied whether a sentence completion test (SCT) scored by large language models (LLMs) can screen for depression risk level in place of current screening practices, which rely on self-report measures such as the PHQ-9 and CES-D that are susceptible to bias. Participants were recruited online and completed an SCT in which they finished sentences related to the domains of family, social, health, and self-concept based on their personal experiences within the past two weeks. The responses were scored by either a zero-shot or a fine-tuned LLM supported by AI-driven natural language processing (NLP), and performance was evaluated using area under the curve (AUC), sensitivity, specificity, and correlations. The LLM scores were compared against the PHQ-9, the current gold standard for assessing depression severity. The fine-tuned LLM achieved an AUC between 0.85 and 0.90, indicating strong accuracy in distinguishing between depressed and nondepressed individuals. The LLM distinguished the negative-affect terminology used by depressed individuals from the neutral or positive terminology used by nondepressed individuals. The LLM scores also correlated strongly with the PHQ-9 depression severity scores, suggesting that the LLM could assess depression severity as well as detect depression. The fine-tuned LLM also showed high sensitivity and specificity, with very few false-negative cases. Porkaew et al. concluded that LLMs can accurately screen for depression using SCTs, but they noted that the study was limited by a small dataset and by a lack of consideration of language variation across cultures and regions.
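To make the evaluation metrics named above concrete, the following is a minimal sketch of how AUC, sensitivity, specificity, and a Pearson correlation could be computed when model scores are compared against PHQ-9 totals. It is not the authors' code. The data are made-up illustrative values, not results from the study, and both the PHQ-9 cutoff of 10 used to label a participant as depressed and the 0.5 decision threshold on the model score are assumptions chosen for illustration.

```python
from math import sqrt

def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical data: PHQ-9 totals (0-27) and model risk scores (0-1) for 8 participants.
phq9_totals = [2, 4, 7, 9, 11, 13, 16, 21]
llm_scores = [0.10, 0.25, 0.55, 0.30, 0.45, 0.70, 0.80, 0.95]

# Assumption: a PHQ-9 total of 10 or more is treated as the "depressed" reference label.
labels = [1 if total >= 10 else 0 for total in phq9_totals]

auc = roc_auc(labels, llm_scores)
# Assumption: a model score of 0.5 or more is treated as a positive screen.
sens, spec = sensitivity_specificity(labels, llm_scores, threshold=0.5)
r = pearson_r(phq9_totals, llm_scores)

print(f"AUC={auc:.3f} sensitivity={sens:.2f} specificity={spec:.2f} r={r:.2f}")
```

AUC summarizes discrimination across every possible threshold, whereas sensitivity and specificity depend on the single threshold chosen. A false negative, meaning a depressed participant whom the screen misses, lowers sensitivity, which is why a low false-negative count matters for a screening tool.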
Outcomes and Implications
Because the approach is less susceptible to personal bias, Porkaew et al.’s proposal to use SCTs analyzed by an LLM and NLP for depression screening suggests a reliable and scalable method to detect depression and assess its severity in patients. AI modeling could also allow depression screening to be implemented in a widespread, standardized manner, avoiding discrepancies between the various screening methods currently used across the globe. Additionally, as the dataset for the LLM developed by Porkaew et al. expands, the model can continue to be fine-tuned to various languages and language styles to improve the accuracy of screening and diagnosis. However, prior to clinical implementation, the LLM should be assessed at a larger scale with a wider demographic, and ethical concerns regarding patient privacy and misuse of data should be addressed.