Comprehensive Summary
This study introduces the Multimodal Transformer with Unified Masked Modeling (MUSK), a vision–language foundation model that integrates pathology images and clinical text to improve cancer diagnosis and prognosis. MUSK was pretrained on 50 million pathology image patches, extracted from 32,898 slides from 11,577 patients and covering 33 tumor types, and on one billion text tokens drawn from pathology-related articles. It then underwent additional pretraining on one million paired image–text samples to align vision and language features. MUSK was evaluated on 23 patch-level and slide-level benchmarks, where it outperformed other models in image-to-text and text-to-image retrieval, visual question answering, and molecular biomarker prediction. For image-to-text retrieval on the PathMMU dataset, MUSK achieved a Recall@50 of 34.4%, ahead of the second-best model, CONCH, at 27.3%. On the BookSet dataset, MUSK achieved a Recall@50 of 74.8%, compared with CONCH's 71.3%. In melanoma relapse prediction on the VisioMel dataset, MUSK achieved an AUC of 0.833, significantly outperforming existing models. In pan-cancer prognosis prediction across 16 cancer types, MUSK achieved a concordance index (c-index) of 0.747, with its highest performance in renal cell carcinoma (c-index = 0.887). For immunotherapy response prediction in lung cancer, MUSK achieved an AUC of 0.768, compared with 0.606 for the tumor PD-L1 expression biomarker. It also predicted progression-free survival (PFS) more accurately, with a c-index of 0.705 compared with 0.580 to 0.599 for existing pathology models.
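To make the retrieval metric concrete, the following is a minimal sketch of how Recall@K can be computed from a matrix of query-to-candidate similarity scores. It is not the study's evaluation code. The function name recall_at_k, the toy scores, and the assumption that query i has exactly one correct candidate (candidate i) are all illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, for illustration only).
toy = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked 1st
    [0.8, 0.2, 0.5],  # query 1: correct candidate ranked 3rd
    [0.2, 0.7, 0.6],  # query 2: correct candidate ranked 2nd
]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(toy, k):.3f}")
```

Under this definition, a Recall@50 of 34.4% would mean that for about a third of the query images the matching text appeared among the 50 highest-scoring candidates.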
Outcomes and Implications
MUSK has the potential to enhance precision oncology by combining image and text data to improve diagnostic accuracy and treatment planning. Its performance in melanoma relapse prediction (AUC = 0.833) suggests that it could identify high-risk patients more accurately than traditional methods, enabling earlier intervention. The pan-cancer prognosis results, with a c-index of 0.747 across 16 cancer types, indicate that MUSK could provide more reliable risk stratification for personalized treatment. In lung cancer immunotherapy, its AUC of 0.768 for response prediction suggests that it could help identify the patients most likely to benefit from treatment, reducing unnecessary exposure to toxic therapies. The model's capacity to process large datasets with minimal additional training makes it a scalable option for clinical settings and an efficient tool for integrating diverse data sources. However, clinical implementation requires further validation across different populations and healthcare environments to confirm its robustness and generalizability.
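The c-index cited above measures how often a model ranks patients' risks in the same order as their observed outcomes. The sketch below computes Harrell's concordance index on a toy cohort. It is an illustration under stated assumptions, not the study's evaluation code: the function name concordance_index, the made-up patient data, and the simple handling of censoring (tied event times are ignored) are all hypothetical choices.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: share of comparable pairs ordered correctly by risk.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means an earlier event is expected)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


# Toy cohort of four patients (made-up numbers, for illustration only).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]          # the third patient is censored
risks = [0.9, 0.4, 0.5, 0.1]

print(f"c-index = {concordance_index(times, events, risks):.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.747 should be read.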