Comprehensive Summary
Petmezas et al. evaluated a spatiotemporal Transformer framework for automated detection of heart failure with reduced ejection fraction (HFrEF) from apical four-chamber echocardiographic videos. The authors adapted the TimeSformer architecture and combined it with transfer learning from the EchoNet-Dynamic dataset (10,030 patients) and with anatomically guided left-ventricular (LV) masking. When trained directly on a small, imbalanced institutional cohort of 219 patients, model performance was limited (TimeSformer accuracy: 65.8%, AUC: 49.3%, i.e., chance-level discrimination), underscoring the difficulty of HFrEF classification in advanced cardiomyopathy populations. On the much larger EchoNet-Dynamic benchmark, TimeSformer achieved strong discrimination (AUC: 92.9%). Fine-tuning the pretrained model on the institutional cohort improved performance relative to direct training (accuracy: 69.8%, AUC: 74.2%), with additional gains observed following LV masking (accuracy: 73.0%, AUC: 79.8%). Attention analyses showed that LV masking shifted model focus toward the LV cavity and myocardium, supporting the physiologic plausibility of the learned features.
Outcomes and Implications
These findings suggest that Transformer-based video models could support automated HFrEF detection when combined with transfer learning and anatomy-aware preprocessing. However, discrimination in the small clinical cohort remained moderate (best AUC: 79.8%), and the approach should be viewed as a decision-support tool rather than a replacement for expert interpretation or quantitative LVEF assessment. Prospective, multi-center validation and integration of multi-view echocardiography and clinical variables are necessary before routine clinical use.