Comprehensive Summary
This multicenter prospective study evaluated the quality of generative AI chatbot responses to suicide-related queries across two phases over one year. Copilot, ChatGPT 3.5, and Gemini were assessed for improvement; Claude was included only in Phase 2 due to limited access. Responses were analyzed by two coders using 5 and 7 predetermined prompts in Phases 1 and 2, respectively. Linguistic analysis with LIWC-22 evaluated word count, authenticity, and emotional tone, with scores above 50% indicating stronger authenticity and a more positive emotional tone. The model class is transformer-based natural language processing; the comparator was each chatbot's prior model version. No external validation was performed. Linguistic analysis revealed a 32.68% increase in authenticity for ChatGPT 3.5 between phases, while Copilot and Gemini declined slightly (-6.15% and -1.18%, respectively). Gemini's emotional tone rose by 3.45%, maintaining positive scores near 76%, while Copilot's remained below the 50% threshold, the lowest among Phase 2 models. Claude showed neutral authenticity and tone in Phase 2. The researchers also tracked accuracy indicators, such as mention of the 988 lifeline, and observed overall improvement in response quality across models in Phase 2. It should be noted that excluding chatbots from other regions limited the cultural diversity of responses shaped by differing mental health norms. Furthermore, the prompts simulated concerned friends, limiting insight into how chatbots respond to individuals in direct crisis. Finally, no statistical analysis was disclosed.
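The reported between-phase differences are percent changes in LIWC-22 summary scores. The study did not publish its analysis code, so the following is only a minimal Python sketch of how such a comparison might be computed; the model names mirror those in the study, but every score value is a hypothetical placeholder rather than the study's data.

```python
# Minimal sketch (not the study's code): computing between-phase percent change
# in LIWC-22 summary scores. All score values below are hypothetical
# placeholders, not data from the study.

def percent_change(phase1_score: float, phase2_score: float) -> float:
    """Relative change from Phase 1 to Phase 2, expressed as a percentage."""
    return (phase2_score - phase1_score) / phase1_score * 100.0

# Hypothetical LIWC-22 scores on a 0-100 scale; values above 50 are read as
# stronger authenticity or a more positive emotional tone.
scores = {
    "ChatGPT 3.5": {"authenticity": (40.0, 53.1), "tone": (58.0, 60.0)},
    "Copilot":     {"authenticity": (55.0, 51.6), "tone": (46.0, 44.5)},
    "Gemini":      {"authenticity": (70.0, 69.2), "tone": (73.5, 76.0)},
}

for model, metrics in scores.items():
    for metric, (p1, p2) in metrics.items():
        change = percent_change(p1, p2)
        status = "above" if p2 > 50 else "below"
        print(f"{model:12s} {metric:12s} {change:+6.2f}%  (Phase 2 score {status} 50)")
```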
Outcomes and Implications
This study highlights the surprising emotionality of chatbot responses to suicide-related queries, which may be invaluable to those supporting vulnerable individuals and may serve as an adjunct to professional care. However, it also affirms that clinicians and individuals should treat AI tools as supplementary, prioritizing human oversight when navigating suicide-related conversations. Future research should analyze AI chatbot responses to first-person queries related to suicide and crisis to determine how AI would respond if the prompt writer were the individual in crisis (rather than a friend or loved one).