Poster Session 3
Category: Intrapartum Fetal Assessment
Juliette Vitrou
Maternité Port-Royal, groupe hospitalier Paris Centre, AP-HP, Paris, France; Institut interdisciplinaire santé des femmes, iWISH, Université Paris Cité, Paris, France
Paris, Ile-de-France, France
Charles Garabedian, MD, PhD (he/him/his)
CHU Lille
Lille, Nord-Pas-de-Calais, France
Aude Girault, MD, PhD (she/her/hers)
Department of Obstetrics and Gynecology, Port-Royal Maternity Hospital, AP-HP, Cochin Hospital, FHU PREMA, Paris, France
Paris, Ile-de-France, France
Mathieu Hivert, MD
CHU Lille, Department of Obstetrics, Lille, France; Univ Lille, ULR 2694-METRICS, Lille, France.
Lille, Nord-Pas-de-Calais, France
Objective:
To compare the accuracy of fetal scalp pH predictions made by midwives, residents, and a large language model (ChatGPT) against measured values.
Study Design:
Prospective single-center study of term laboring women undergoing fetal scalp blood sampling for category II fetal heart rate (FHR) tracings. For each case, three pH predictions were obtained independently from a resident, a midwife, and ChatGPT, each based on standardized clinical data and cardiotocographic tracings. Correlation with actual pH was assessed with Spearman’s ρ, and accuracy with mean absolute error (MAE) and the rate of correct classification into predefined categories (<7.20; 7.20–7.24; >7.24). Based on prediction performance, we estimated the proportion of avoidable pH tests and the potential clinical consequences, including avoidable cesarean deliveries and missed severe acidosis.
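The three metrics named above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names and the four example pH values are invented for illustration, and the handling of ties in the rank correlation (average ranks) is an assumption, since the abstract does not specify it.

```python
from statistics import mean


def ph_category(ph):
    """Map a pH value to one of the three categories given in the abstract."""
    if ph < 7.20:
        return "<7.20"
    if ph <= 7.24:
        return "7.20-7.24"
    return ">7.24"


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and measured pH."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def categorical_accuracy(predicted, actual):
    """Share of cases where prediction and measurement fall in the same category."""
    hits = sum(ph_category(p) == ph_category(a) for p, a in zip(predicted, actual))
    return hits / len(actual)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tie_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tie_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented example values, not study data.
actual = [7.31, 7.18, 7.22, 7.35]
predicted = [7.28, 7.25, 7.21, 7.30]

print(round(mean_absolute_error(predicted, actual), 3))  # 0.04
print(categorical_accuracy(predicted, actual))           # 0.75
print(round(spearman_rho(predicted, actual), 3))         # 0.8
```

In this toy example the second case is misclassified (measured 7.18, predicted 7.25) even though the absolute error is only 0.07, which shows why MAE and categorical accuracy are reported separately.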
Results:
A total of 95 fetal scalp pH measurements were analyzed. Correlation with actual pH values was weak for all predictors and was highest for midwives (ρ = 0.26, p = 0.011). MAE was lower for midwives (0.042, 95% CI 0.036–0.050) and residents (0.047, 95% CI 0.038–0.056) than for ChatGPT (0.098, 95% CI 0.087–0.110). Correct categorical prediction rates were 61.0% for midwives, 59.7% for residents, and 24.7% for ChatGPT. ChatGPT systematically underestimated fetal pH (71.4% of cases), whereas midwives and residents showed more balanced under- and overestimation. Compared with decisions based on predictions alone, fetal scalp pH testing avoided between 1.3% (ChatGPT) and 5.2% (midwives and residents) of neonatal acidosis cases. It also prevented unnecessary cesarean deliveries in 19.5% of cases when compared with midwife or resident predictions, and in up to 62.3% when compared with ChatGPT-based decisions.
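The direction of prediction error reported above can be sketched as follows. This assumes that "underestimation" means a predicted pH strictly below the measured pH, which the abstract does not define explicitly. The function name and the example values are hypothetical.

```python
def error_direction(predicted, actual):
    """Return the share of under-, over-, and exactly matching predictions.

    Assumption: a prediction "underestimates" when it is strictly below
    the measured pH, and "overestimates" when it is strictly above it.
    """
    n = len(actual)
    under = sum(p < a for p, a in zip(predicted, actual))
    over = sum(p > a for p, a in zip(predicted, actual))
    return {
        "under": under / n,
        "over": over / n,
        "exact": (n - under - over) / n,
    }


# Invented example values, not study data.
shares = error_direction(
    predicted=[7.10, 7.15, 7.30, 7.25],
    actual=[7.20, 7.25, 7.25, 7.25],
)
print(shares)  # {'under': 0.5, 'over': 0.25, 'exact': 0.25}
```

A predictor whose "under" share is far from its "over" share is biased in one direction, which is distinct from having a large MAE.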
Conclusion:
Midwives and residents demonstrated comparable accuracy in predicting fetal scalp pH, both markedly outperforming ChatGPT. While professional clinical judgment can potentially reduce unnecessary fetal blood sampling and cesarean deliveries, reliance on large language models in their current state would increase misclassification risk and unnecessary interventions.