A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria

Archives of Gynecology and Obstetrics

Research Authors: Iason Psilopatis, Cécile Monod, Valeria Filippi, Rebecca Tschudin, Olaf Lapaire, Julius Emons, Beatrice Mosimann, Tibor A. Zwimpfer

AIIM Authors: Neha Parthasarathy, Victoria Bulkowski

Approved by President Reda Riffi

Publication Date: Sep 13, 2025

Comprehensive Summary

The study by Psilopatis et al. examines how accurately various large language models (LLMs) interpret cardiotocography (CTG) traces according to the International Federation of Gynecology and Obstetrics (FIGO) criteria. The authors analyzed sixty CTG traces, as classified by University Hospital Basel, by presenting screenshots to four LLMs (Chat-GPT-4.0, Google Gemini, Bing Copilot, and DeepSeek), which were asked to classify each trace as normal or abnormal. The LLM that provided a satisfactory interpretation of the normal CTGs was then tasked with classifying 30 abnormal CTG traces. DeepSeek was unable to evaluate any CTG traces because it lacks the features needed to process images. Google Gemini misclassified 93.3% of the CTG traces, most likely because it flagged variable decelerations that are considered normal fluctuations. Chat-GPT-4.0 misclassified 53.3% of the CTG traces as suspicious, again because of variable decelerations. Finally, Bing Copilot correctly identified 96.6% of the CTG traces as normal, showing high agreement with the obstetricians’ classifications, but failed to identify the abnormal CTG traces. These findings indicate that general-purpose LLMs face substantial limitations in accurately evaluating CTG traces based on FIGO criteria, with Bing Copilot being the only one of the four showing some promise, though it still could not identify the abnormal traces.

Outcomes and Implications

Cardiotocography (CTG) is an important tool for fetal monitoring, but its interpretation is subject to noticeable variability. Large language model (LLM) tools offer the chance to improve diagnostic accuracy and reduce clinicians’ workload in this field. Based on this preliminary research, however, LLMs appear unable to integrate the multiple parameters needed to distinguish normal physiological variations from abnormalities in CTGs. Another limitation is the lack of CTG-specific training data for the general-purpose LLMs used, which limits their knowledge of the field. Together, these factors point to the potential of specialized AI models designed specifically for processing CTG traces. Current findings do not support integrating LLMs into CTG analysis; instead, they call for continued clinical evaluation and more research into specialized AI models.

Our mission is to

Connect medicine with AI innovation.

No spam. Only the latest AI breakthroughs, simplified and relevant to your field.
