DURHAM, N.C. — Duke University School of Medicine researchers have developed two pioneering frameworks designed to evaluate the performance, safety, and reliability of large language models in health care.
Published in npj Digital Medicine and the Journal of the American Medical Informatics Association (JAMIA), these studies offer a new approach to ensuring that AI systems used in clinical settings meet the highest standards of quality and accountability.
As large language models become increasingly embedded in medical practice — generating clinical notes, summarizing conversations, and assisting with patient communications — health systems are grappling with how to assess these technologies in ways that are both rigorous and scalable. The Duke-led studies, under the direction of Chuan Hong, Ph.D., assistant professor in Duke's Department of Biostatistics and Bioinformatics, aim to fill that gap.
The npj Digital Medicine study introduces SCRIBE, a structured evaluation framework for Ambient Digital Scribing tools. These AI systems generate clinical documentation from real-time patient-provider conversations. SCRIBE draws on expert clinical reviews, automated scoring methods, and simulated edge-case testing to evaluate how well these tools perform across dimensions like accuracy, fairness, coherence, and resilience.
“Ambient AI holds real promise in reducing documentation workload for clinicians,” Hong said. “But thoughtful evaluation is essential. Without it, we risk implementing tools that might unintentionally introduce bias, omit critical information, or diminish the quality of care. SCRIBE is designed to help prevent that.”
A second, related study in JAMIA applies a complementary framework to assess large language models used by the Epic electronic medical record platform to draft replies to patient messages. The research compares clinician feedback with automated metrics to evaluate aspects such as clarity, completeness, and safety. While the study found strong performance in tone and readability, it also revealed gaps in the completeness of responses — emphasizing the importance of continuous evaluation in practice.
“This work helps close the distance between innovative algorithms and real-world clinical value,” said Michael Pencina, Ph.D., chief data scientist at Duke Health and co-author of both studies. “We are showing what it takes to implement AI responsibly, and how rigorous evaluation must be part of the technology’s life cycle, not an afterthought.”
Together, these frameworks form a foundation for responsible AI adoption in health care. They give clinical leaders, developers, and regulators the tools to assess AI models before deployment and monitor their performance over time — ensuring they support care delivery without compromising safety or trust.
In addition to Pencina and Hong, study authors include Haoyuan Wang, Rui Yang, Mahmoud Alwakeel, Ankit Kayastha, Anand Chowdhury, Joshua M. Biro, Anthony D. Sorrentino, Jessica L. Handley, Sarah Hantzmon, Sophia Bessias, Nicoleta J. Economou-Zavlanos, Armando Bedoya, Monica Agrawal, Raj M. Ratwani, Eric G. Poon, and Kathryn I. Pollak.
The npj Digital Medicine study received funding support from the Agency for Healthcare Research and Quality (1R03HS030307-01).