Poster Session 4
Category: Digital Health Technologies (DHT)
Joe Haydamous, MD (he/him/his)
PGY1
Department of Obstetrics, Gynecology and Reproductive Sciences, McGovern Medical School at UTHealth Houston
Houston, Texas, United States
Laura Diab, MD (she/her/hers)
Division of Maternal-Fetal Medicine, Department of Obstetrics, Gynecology and Reproductive Sciences, McGovern Medical School at UTHealth Houston
Houston, Texas, United States
Analuisa C. Mosqueda, MD
Division of Maternal-Fetal Medicine, Department of Obstetrics, Gynecology and Reproductive Sciences, McGovern Medical School at UTHealth Houston
Houston, Texas, United States
Sabrina C. DaCosta, MD
Division of Maternal-Fetal Medicine, Department of Obstetrics, Gynecology and Reproductive Sciences, McGovern Medical School at UTHealth Houston
Houston, Texas, United States
Irene A. Stafford, MD, MPH, MS
Associate Professor
Division of Maternal-Fetal Medicine, Department of Obstetrics, Gynecology and Reproductive Sciences, McGovern Medical School at UTHealth Houston
Houston, Texas, United States
Methods:
Clinically relevant questions were compiled using CDC, ACOG, and WHO guidelines, as well as topics frequently discussed in patient forums. Responses were generated using GPT-4 and evaluated by six independent experts in obstetrics, maternal-fetal medicine, and infectious disease. Reviewers rated each response on a 5-point Likert scale across three domains: accuracy, completeness, and safety. Each question was submitted using one of two standardized prompts simulating either patient or provider communication styles. The complete list of questions and prompts appears in Figure 1. Questions were grouped into four categories: General Knowledge, Public Health/Prevention, Treatment, and Diagnostic Interpretation.
Results:
ChatGPT responses showed high performance across all domains, with mean scores of 4.38 for accuracy, 4.49 for completeness, and 4.47 for safety. Between 88% and 92% of evaluations received scores of 4 or higher. Public Health/Prevention questions achieved the highest overall scores, averaging 4.67 or above in each domain. General Knowledge items also performed well, especially in safety (4.79) and completeness (4.75). Treatment-related responses maintained strong ratings across all domains (≥4.33) and aligned well with current guidelines. Although Diagnostic Interpretation responses were accurate (4.33) and safe (4.50), completeness was lower (4.17), with reviewers noting missing nuance in complex cases. Importantly, no unsafe or misleading content was identified by any reviewer.
Conclusion:
ChatGPT produced safe, accurate, and generally complete responses to questions about syphilis in pregnancy. Its strong performance on public health and general education questions supports cautious integration into prenatal counseling workflows. Improving the depth of diagnostic interpretation should be prioritized while maintaining the observed high safety profile.