Poster Session 4
Category: Digital Health Technologies (DHT)
Avish Arora, MD, PhD
Fellow
Montefiore/Albert Einstein College of Medicine
New York, New York, United States
Nawras Zayat, MD
MFM Fellow
Montefiore-Einstein
Bronx, New York, United States
Georgios Doulaveris, MD
Maternal Fetal Medicine Attending
Montefiore Medical Center/Albert Einstein College of Medicine
New York, United States
Objective:
To quantify the accuracy, completeness, readability, and inter-reviewer agreement of GPT-4o answers to the most-searched pregnancy questions.
Study Design:
The 50 highest-volume U.S. Google Trends pregnancy questions (April 2025) were each submitted twice to GPT-4o (temperature 0, identical prompt). Two maternal-fetal medicine clinicians, blinded to each other's ratings, graded the 100 answers on a four-tier rubric: comprehensive and accurate (C-A), accurate but incomplete, mixed correct/incorrect, or completely incorrect; they also recorded whether the two answers to each question differed materially. Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores were calculated for every answer, and Cohen's κ quantified inter-rater agreement.
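The readability metrics above follow the standard Flesch formulas. A minimal sketch of how they could be computed (using a simple vowel-group heuristic for syllable counting, which dedicated readability tools refine with dictionaries and exception rules):

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels; subtract a
    # trailing silent "e". Production readability tools are more precise.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    return fkgl, fre
```

An FKGL of 8 corresponds to an 8th-grade reading level, the common benchmark for patient education materials; longer sentences and polysyllabic words drive FKGL up and FRE down.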
Results:
Of 200 individual ratings, 147 (73.5%) were C-A, 38 (19%) incomplete, and 15 (7.5%) mixed; none were completely incorrect. Reviewers assigned the identical grade to 67 of 100 answers (67%), yielding κ = 0.21 (fair agreement). Unanimous judgments comprised 57 C-A, 7 incomplete, and 3 mixed answers. The two GPT-4o answers to a question differed materially in 37 of 50 cases (74%), with the divergence reflecting added detail rather than contradictions. Mean FKGL was 13.1 (FRE 36.5); only 4% of answers met the ≤8th-grade benchmark for patient education.
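The reported κ is consistent with the rating distribution above. A quick check, assuming both reviewers share the pooled marginal distribution (an approximation, since per-rater marginals are not reported here):

```python
# Observed agreement: 67 of 100 answers received identical grades.
p_o = 67 / 100

# Pooled marginal proportions across all 200 ratings: 73.5% C-A,
# 19% incomplete, 7.5% mixed (assumed identical for both raters).
marginals = [0.735, 0.19, 0.075]

# Chance agreement is the sum of squared marginal proportions.
p_e = sum(p * p for p in marginals)

# Cohen's kappa corrects observed agreement for chance.
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.21, "fair" on the Landis-Koch scale
```

Because nearly three-quarters of all ratings fall in one category, chance agreement is high (≈0.58), which is why 67% raw agreement still yields only fair κ.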
Conclusion:
GPT-4o is a safe, time-saving starting point for obstetric FAQs: it produced zero outright errors and was rated comprehensive and accurate in nearly three-quarters of clinician reviews. Yet meaningful gaps remain: one answer in three did not receive the same grade from both specialists, and 96% of responses exceed an 8th-grade reading level. The most effective workflow is therefore "clinician-in-the-loop": generate two iterations, merge their complementary content, embed institution-specific guidance and red-flag warnings, then translate the text into patient-friendly language before dissemination. Deployed this way, GPT-4o can standardize counseling scripts, shorten drafting time, and let MFM providers focus on nuanced care rather than repetitive education, while periodic re-benchmarking keeps pace with evolving LLM capabilities and clinical guidelines.