Comprehensive Summary
This study evaluates the effectiveness of a large language model (LLM) chatbot at providing breast, prostate, and lung cancer treatment recommendations concordant with National Comprehensive Cancer Network (NCCN) guidelines. The authors defined 26 diagnosis descriptions (i.e., cancer types with or without relevant extent-of-disease modifiers) and supplied each description with four prompt variations, creating a total of 104 prompts that were then submitted to the GPT-3.5-turbo-0301 model via the ChatGPT (OpenAI) interface. Concordance of the chatbot output with the 2021 NCCN guidelines was determined by the consensus of three board-certified oncologists. The chatbot provided a recommendation for 102 of the 104 prompts. All 102 of those outputs included at least one NCCN-concordant treatment, while 35 of the 102 (34.3%) also recommended one or more non-concordant treatments. In 13 of the 104 outputs (12.5%), the chatbot hallucinated, recommending treatments that were not part of any recommended treatment protocol. The researchers found that disagreements between the chatbot output and the NCCN guidelines arose mainly when the output was unclear, such as when it recommended combination therapy without specifying which treatments to combine.
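The rates reported above can be reproduced directly from the counts in the summary; a minimal sketch (all counts taken from the study as summarized, variable names are illustrative only):

```python
# Counts reported in the summary of the study
total_prompts = 104
with_recommendation = 102   # outputs that contained a recommendation
with_nonconcordant = 35     # outputs with >= 1 non-concordant treatment
hallucinated = 13           # outputs recommending treatments outside any protocol

# Non-concordance is computed over outputs that gave a recommendation;
# hallucination is computed over all prompts, matching the summary's denominators.
nonconcordant_rate = with_nonconcordant / with_recommendation
hallucination_rate = hallucinated / total_prompts

print(f"non-concordant: {nonconcordant_rate:.1%}")  # non-concordant: 34.3%
print(f"hallucinated:   {hallucination_rate:.1%}")  # hallucinated:   12.5%
```

Note the two different denominators: 34.3% is relative to the 102 outputs that contained a recommendation, while 12.5% is relative to all 104 prompts.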
Outcomes and Implications
Many individuals now use LLM chatbots such as ChatGPT to find information and inform their decisions on a wide range of topics, and patients will increasingly use LLMs to educate themselves about their own medical treatment. It is therefore important for both clinicians and patients to understand the limitations of LLM chatbots when it comes to recommending treatments: at present, LLM chatbots are not a reliable source of treatment information.