Comprehensive Summary
This study, presented by Jiang and colleagues, investigates how the open-source large language model DeepSeek is being used in Chinese hospitals. The authors conducted a scoping review of both gray literature, drawn from the websites and WeChat disclosures of 100 Chinese hospitals, and white literature, drawn from peer-reviewed studies indexed in PubMed and Web of Science. They extracted information on how DeepSeek is used, how its performance is assessed, which risks are identified, and what regulatory or oversight measures hospitals report. The review identified 58 DeepSeek use cases across 48 hospitals and 27 relevant research studies, with hospital adoption expanding rapidly. Hospitals primarily used DeepSeek for clinical decision support, including diagnosis, treatment recommendations, documentation, and appointment coordination. However, only 36% of hospital disclosures mentioned pre-deployment evaluation, and none provided methodological details. Accuracy reported in hospital posts was consistently high, whereas academic studies showed wider and often lower accuracy ranges as well as notable deficits in comprehensiveness, factuality, and fairness. Hospitals reported far fewer risks than research studies, which frequently identified inappropriate recommendations, hallucinations, and misalignment with patient needs. The discussion emphasizes that rapid, under-evaluated deployment poses safety concerns and that hospitals need more rigorous validation procedures, transparent reporting, and stronger regulatory oversight.
Outcomes and Implications
Large language models are being introduced into clinical workflows at unprecedented speed, often before their real-world performance and safety are fully understood, making rigorous research into their efficacy and actual patterns of use critical. By comparing hospital claims with independent research, the study reveals a mismatch between enthusiastic adoption and incomplete evaluation, highlighting a critical risk to patient safety in environments where LLMs may influence diagnoses, treatment plans, and triage. Clinically, the findings show that DeepSeek has genuine potential, but its reliability varies widely with context, evaluation method, and task complexity, underscoring the need for clinically grounded validation frameworks before LLM outputs are incorporated into care. The work also reinforces the necessity of monitoring hallucinations, inappropriate recommendations, and equity concerns, as these directly affect patient outcomes. The article suggests that, in its current form, DeepSeek should augment rather than replace clinicians, with physicians retaining full decision authority. The authors call for regulatory expansion to cover downstream users such as hospitals and recommend mandatory disclosure of datasets, evaluation procedures, and risk-mitigation strategies. Although the article does not predict a precise timeline for safe, large-scale clinical integration, it implies that meaningful clinical adoption will depend on the development of standardized evaluation frameworks and regulatory reforms.