Comprehensive Summary
This study by Yao et al. investigated whether large language models (LLMs) can improve the accuracy of preoperative esophageal cancer staging from free-text radiology reports. Accurate preoperative staging is critical for prognosis and for guiding treatment decisions, yet manual interpretation of radiology reports by clinicians is often inconsistent, leading to staging errors and variability. The retrospective dataset comprised 200 patients from Shanghai Chest Hospital (May-December 2024) with 1,134 Chinese-language radiology reports, and postoperative pathological staging served as the reference standard for evaluating model performance. Three LLMs (INF-72B, Qwen2.5-72B, and LLaMA3.1-70B) were used to classify tumor stage (T1-T4), nodal stage (N0-N3), and overall cancer stage (I-IV). Three prompting strategies were compared: zero-shot prompting, chain-of-thought prompting, and a novel interpretable reasoning (IR) method. Model performance was assessed using accuracy, F1-score, and statistical tests (McNemar and Pearson chi-square) against clinician performance. The INF-72B+IR combination outperformed all other models and the clinicians, achieving 61.5% overall staging accuracy and an F1-score of 0.60, versus clinician accuracy of 39.5% and an F1-score of 0.39 (P < 0.001). Qwen2.5-72B+IR also surpassed clinicians, with 46% accuracy and an F1-score of 0.51, whereas LLaMA3.1-70B showed no significant improvement over clinicians. Importantly, the IR prompting method also enhanced transparency: the reasoning behind each prediction can be traced, which is crucial for clinical adoption and trust.
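To make the distinction among the three prompting strategies concrete, the Python sketch below contrasts them as hypothetical templates. The actual prompt wording, output schema, and staging criteria used by Yao et al. are not reproduced in this summary, so every string below is an assumption, not the authors' prompts.

```python
# Hypothetical prompt templates illustrating how the three strategies differ.
# None of this wording comes from the paper; report_text is a placeholder
# for one of the free-text radiology reports.

ZERO_SHOT = (
    "You are a thoracic oncologist. Based on the radiology report below, "
    "assign the clinical T stage (T1-T4), N stage (N0-N3), and overall "
    "stage (I-IV) of the esophageal cancer. Answer with the stages only.\n\n"
    "Report:\n{report_text}"
)

# Chain-of-thought: same task, but the model is asked to reason first.
CHAIN_OF_THOUGHT = (
    ZERO_SHOT + "\n\nThink step by step before giving your final answer."
)

# Interpretable reasoning (IR), as we understand it from the summary: each
# staging decision must be tied to a verbatim quote from the report and an
# explicit staging criterion, so the prediction can be audited afterwards.
INTERPRETABLE_REASONING = (
    "You are a thoracic oncologist. Stage the esophageal cancer described "
    "in the report below. For the T, N, and overall stage, output a JSON "
    "object of the form:\n"
    '{{"evidence": "<verbatim quote from the report>", '
    '"criterion": "<the staging rule the quote satisfies>", '
    '"stage": "<your answer>"}}\n\n'
    "Report:\n{report_text}"
)

prompt = INTERPRETABLE_REASONING.format(report_text="...")  # send to the LLM
```

The point of the IR template is that it couples every stage label to quoted report evidence and a named criterion, which is what makes the model's predictions traceable in the way the study emphasizes.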
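The evaluation can be sketched similarly. Assuming paired per-patient predictions (one from a model, one from the clinicians) scored against the pathological reference standard, a minimal version of the reported accuracy, F1, and McNemar comparisons might look like the following; the F1 averaging mode and the exact (binomial) form of McNemar's test are assumptions, as the summary does not specify them.

```python
# Minimal evaluation sketch, assuming paired per-patient string labels
# (e.g., "T2", "N1", "III") for truth, model, and clinician predictions.
from sklearn.metrics import accuracy_score, f1_score
from statsmodels.stats.contingency_tables import mcnemar

def evaluate(truth, model_pred, clinician_pred):
    print(f"model accuracy:     {accuracy_score(truth, model_pred):.3f}")
    print(f"clinician accuracy: {accuracy_score(truth, clinician_pred):.3f}")
    # The paper's F1 averaging is not specified here; macro-F1 is assumed.
    print(f"model macro-F1:     {f1_score(truth, model_pred, average='macro'):.3f}")

    # McNemar's test on paired correctness: does the model get cases right
    # that the clinicians get wrong more often than the reverse?
    m_ok = [p == t for p, t in zip(model_pred, truth)]
    c_ok = [p == t for p, t in zip(clinician_pred, truth)]
    table = [
        [sum(a and b for a, b in zip(m_ok, c_ok)),        # both correct
         sum(a and not b for a, b in zip(m_ok, c_ok))],   # model only
        [sum(b and not a for a, b in zip(m_ok, c_ok)),    # clinician only
         sum(not (a or b) for a, b in zip(m_ok, c_ok))],  # both wrong
    ]
    result = mcnemar(table, exact=True)
    print(f"McNemar P = {result.pvalue:.4f}")
```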
Outcomes and Implications
This study demonstrates that LLMs, particularly when combined with interpretable reasoning prompting, can significantly improve preoperative esophageal cancer staging from radiology reports. By outperforming clinicians, models such as INF-72B+IR offer a pathway to reduce variability and errors in staging, which could directly influence treatment planning, surgical decisions, and prognosis estimation. The ability to generate transparent, verifiable reasoning steps makes the approach more clinically trustworthy and more likely to be adopted in practice. In high-volume hospitals, or in settings where experienced radiologists are scarce, LLM-based tools like these could serve as decision-support systems. Moreover, integrating LLMs could streamline workflows, reduce time spent on manual chart review, and potentially surface subtle features in reports that humans may overlook.