Comprehensive Summary
This study, presented by Lee et al., develops an automated system that converts free-text eligibility criteria from ClinicalTrials.gov into Structured Query Language (SQL) queries compatible with the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), and analyzes hallucination patterns across 8 large language models (LLMs). A three-stage preprocessing pipeline (segmentation, filtering, and simplification) reduced input tokens by 58.2% before concept-mapping performance was compared across the LLMs. The llama3:8b model achieved the highest effective SQL rate (75.8%), well above GPT-4 (45.3%), primarily due to a lower hallucination rate (21.1% vs 33.7%). A classification of 235 hallucinations across models found a high overall hallucination rate (32.7%), with wrong domain assignments (34.2%) and placeholder insertions (28.7%) being the most common types. Clinical validation revealed mixed results: high concordance for type 1 diabetes (Jaccard=0.81) but complete failure for pregnancy (Jaccard=0.00). Given the high hallucination rates, these results challenge assumptions about model superiority, demonstrating that smaller, cost-effective models such as llama3:8b can outperform larger models like GPT-4.
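The clinical validation metric reported above is the Jaccard index over retrieved patient cohorts. A minimal sketch of that comparison, assuming each cohort is a set of patient IDs (the cohorts and IDs here are illustrative, not the study's data):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two patient-ID sets."""
    if not a and not b:
        return 0.0  # convention: two empty cohorts give no measurable overlap
    return len(a & b) / len(a | b)

# Hypothetical cohorts: patients retrieved by the LLM-generated SQL
# versus a manually authored reference query.
generated = {101, 102, 103, 104}
reference = {102, 103, 104, 105}

print(round(jaccard(generated, reference), 2))  # 3 shared / 5 total = 0.6
```

A value of 1.0 corresponds to perfect concordance (as approached for type 1 diabetes, 0.81), while 0.0 means the generated query retrieved none of the reference patients (as for pregnancy).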
Outcomes and Implications
Clinical trial participant recruitment faces significant challenges: many trials are terminated because of insufficient enrollment, which delays new drug approvals and increases research costs. The ability to query real-world data with clinical trial criteria at low hallucination rates therefore represents not only a technical advance but a fundamental shift in how clinical trials are conducted. Specifically, the authors discuss how this technology could enable integration with community health centers and safety-net hospitals by reducing the burden of trial participation. However, given the high variability in accuracy and the substantial hallucination rates (21-50%) across LLMs, a purely LLM-based system is not yet ready for clinical implementation. The authors emphasize the need for further research: a comprehensive assessment of generalizability across diverse clinical datasets and institutions, and hybrid evaluation strategies that combine LLM capabilities with rule-based methods to handle complex clinical data accurately.
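One way such a hybrid strategy could look is a rule-based check applied to LLM output before execution, targeting the two most common hallucination types from the study's taxonomy. This is a sketch under assumptions (the table whitelist is a small illustrative subset of the OMOP CDM, and the placeholder regex is a guess at how leftover template tokens might appear), not the authors' method:

```python
import re

# Illustrative subset of OMOP CDM clinical-event tables (not exhaustive).
OMOP_TABLES = {"person", "condition_occurrence", "drug_exposure",
               "procedure_occurrence", "measurement", "observation"}

# Placeholder patterns an LLM might leave behind, e.g. <concept_id> or {value}.
PLACEHOLDER = re.compile(r"<[^>]+>|\{[^}]+\}")

def rule_based_flags(sql: str) -> list:
    """Flag two hallucination types from the study's taxonomy:
    wrong domain assignment (unknown table) and placeholder insertion."""
    flags = []
    for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        if table.lower() not in OMOP_TABLES:
            flags.append(f"unknown table: {table}")
    if PLACEHOLDER.search(sql):
        flags.append("placeholder left in query")
    return flags

sql = "SELECT person_id FROM diagnosis WHERE concept_id = <concept_id>"
print(rule_based_flags(sql))
# ['unknown table: diagnosis', 'placeholder left in query']
```

Queries that trigger any flag would be routed to human review or regenerated, letting the deterministic layer catch exactly the failure modes the LLM layer is known to produce.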