Poster Session 4
Category: Digital Health Technologies (DHT)
Avish Arora, MD, PhD
Fellow
Montefiore/Albert Einstein College of Medicine
New York, New York, United States
Nawras Zayat, MD
MFM Fellow
Montefiore-Einstein
Bronx, New York, United States
Georgios Doulaveris, MD
Maternal Fetal Medicine Attending
Montefiore Medical Center/Albert Einstein College of Medicine
New York, United States
Objective:
To quantify the accuracy, completeness, readability, and inter-reviewer agreement of GPT-4o answers to the most-searched pregnancy questions.
Study Design:
The 50 highest-volume U.S. Google Trends pregnancy questions (April 2025) were each submitted twice to GPT-4o (temperature 0, identical prompt). Two maternal-fetal medicine clinicians, blinded to each other's ratings, graded the 100 answers on a four-tier rubric: comprehensive and accurate (C-A), accurate but incomplete, mixed correct/incorrect, or completely incorrect; they also recorded whether the two answers to each question differed materially. Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores were calculated for every answer, and Cohen's κ quantified inter-rater agreement.
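The readability metrics above follow the standard Flesch formulas. A minimal sketch of how they could be computed (using a simple vowel-group heuristic for syllable counting, which dedicated readability tools refine with dictionaries and exception rules):

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels; subtract a
    # trailing silent "e". Production readability tools are more precise.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    return fkgl, fre
```

An FKGL of 8 corresponds to an 8th-grade reading level, the common benchmark for patient education materials; longer sentences and polysyllabic words drive FKGL up and FRE down.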
Results:
Of 200 individual ratings, 147 (73.5%) were C-A, 38 (19%) incomplete, and 15 (7.5%) mixed; none were completely incorrect. Reviewers assigned the identical grade to 67 of 100 answers (67%), yielding κ = 0.21 (fair agreement). Unanimous judgments comprised 57 C-A, 7 incomplete, and 3 mixed answers. The two GPT-4o answers to a question differed materially in 37 of 50 cases (74%), with the divergence reflecting added detail rather than contradictions. Mean FKGL was 13.1 (FRE 36.5); only 4% of answers met the ≤8th-grade benchmark for patient education.
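The reported κ is consistent with the rating distribution above. A quick check, assuming both reviewers share the pooled marginal distribution (an approximation, since per-rater marginals are not reported here):

```python
# Observed agreement: 67 of 100 answers received identical grades.
p_o = 67 / 100

# Pooled marginal proportions across all 200 ratings: 73.5% C-A,
# 19% incomplete, 7.5% mixed (assumed identical for both raters).
marginals = [0.735, 0.19, 0.075]

# Chance agreement is the sum of squared marginal proportions.
p_e = sum(p * p for p in marginals)

# Cohen's kappa corrects observed agreement for chance.
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.21, "fair" on the Landis-Koch scale
```

Because nearly three-quarters of all ratings fall in one category, chance agreement is high (≈0.58), which is why 67% raw agreement still yields only fair κ.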
Conclusion:
GPT-4o is a safe, time-saving starting point for obstetric FAQs: it produced zero outright errors and was rated comprehensive and accurate in nearly three-quarters of clinician reviews. Yet meaningful gaps remain: one answer in three did not receive the same grade from both specialists, and 96% of responses exceed an 8th-grade reading level. The most effective workflow is therefore "clinician-in-the-loop": generate two iterations, merge their complementary content, embed institution-specific guidance and red-flag warnings, then translate the text into patient-friendly language before dissemination. Deployed this way, GPT-4o can standardize counseling scripts, shorten drafting time, and let MFM providers focus on nuanced care rather than repetitive education, while periodic re-benchmarking keeps pace with evolving LLM capabilities and clinical guidelines.