Chakib Battioui, PhD1, Pavel Brodskiy, PhD2, Klaus Gottlieb, MD, PhD, JD1, Mohammad Haft-Javaherian, PhD2, William J. Eastman1, Julian Lehrer2, Evan Yu, PhD2, Derek Onken, PhD1, Darren Thomason, MBA2, Walter Reinisch, MD, PhD3, Daniel Colucci4, Shrujal Baxi, MD4
1Eli Lilly and Company, Indianapolis, IN; 2Iterative Health Inc, Cambridge, MA; 3Medical University of Vienna, Department of Internal Medicine III, Division of Gastroenterology and Hepatology, Spitalgasse, Wien, Austria; 4Iterative Health Inc, New York, NY
Introduction: Regulatory guidance recommends the endoscopy subscore as the index to assess the endoscopic component of the primary endpoint in ulcerative colitis (UC) trials. Inter-reader variability in assessments may impact the reliability of trial results. Currently, there is no metric in place to assess the certainty with which a reader assigns an endoscopy subscore. Machine learning (ML) provides an opportunity to assess the endoscopy subscore and provide a measurement of its certainty in a standardized manner. Artificial Intelligence Assessment of Endoscopic Severity (AI-ES) accurately assesses the endoscopy subscore. The objective of this study is to evaluate the calibration of AI-ES (how well its predicted probabilities reflect true likelihoods) in order to assess the reliability of its measurement of certainty in endoscopy subscore assessments in UC trials.
Methods: AI-ES is a deep learning algorithm that assesses the endoscopy subscore in UC endoscopic videos. AI-ES outputs a probability for each of the four ordinal endoscopy subscore classes, and the subscore with the highest probability is assigned as the final score. We assessed calibration on a holdout test set of 639 videos (~25%) from the Phase 3 induction trial of mirikizumab in UC (NCT03518086). Videos had a 2+1 centrally read endoscopy subscore and were randomly selected from weeks 0 and 12, with a distribution of endoscopic severity similar to that of the overall study population. Calibration plots were generated across endoscopy subscore classes, with probabilities grouped into septiles (~100 videos per group) for the primary analysis and deciles for confirmation. Brier scores, ranging from 0 (perfect calibration) to 1 (worst calibration), were calculated, with values < 0.25 considered informative.
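The steps described in Methods can be illustrated with a minimal sketch. This is not the AI-ES implementation; it assumes a one-vs-rest binary Brier score per subscore class and equal-count grouping of videos by predicted probability, and it runs on synthetic data. All function names (`assign_score`, `brier_score`, `calibration_bins`) and the simulated probabilities are hypothetical.

```python
import numpy as np

def assign_score(probs):
    """Final score = class with the highest predicted probability."""
    return np.argmax(np.asarray(probs, dtype=float), axis=1)

def brier_score(p, y):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def calibration_bins(p, y, n_bins=7):
    """Sort videos by predicted probability, split them into n_bins groups of
    (nearly) equal size, and return the mean predicted probability and the
    observed event frequency in each group."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(p, kind="stable")
    groups = np.array_split(order, n_bins)
    mean_pred = np.array([p[g].mean() for g in groups])
    obs_freq = np.array([y[g].mean() for g in groups])
    return mean_pred, obs_freq

# Synthetic stand-in for model output: 639 videos, 4 subscore classes.
# ASSUMPTION: these numbers are simulated and are not trial data.
rng = np.random.default_rng(0)
n_videos, n_classes = 639, 4
probs = rng.dirichlet(np.ones(n_classes), size=n_videos)
# Labels are drawn from the predicted distribution, so this toy model is
# calibrated by construction.
labels = np.array([rng.choice(n_classes, p=row) for row in probs])

scores = assign_score(probs)
for k in range(n_classes):
    y_k = (labels == k).astype(float)  # one-vs-rest outcome for class k
    bs = brier_score(probs[:, k], y_k)
    mean_pred, obs_freq = calibration_bins(probs[:, k], y_k, n_bins=7)
    print(f"subscore {k}: Brier = {bs:.3f}")
```

Plotting `obs_freq` against `mean_pred` for each class gives a calibration plot; points near the diagonal indicate good calibration. Note that a constant prediction of 0.5 gives a binary Brier score of exactly 0.25, which is one common rationale for treating values below 0.25 as informative.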
Results: AI-ES demonstrated strong calibration, with Brier scores < 0.25 for each endoscopy subscore class (0: 0.037, 1: 0.082, 2: 0.162, 3: 0.112). The Brier score for evaluation of endoscopic improvement (0,1 vs 2,3) also showed excellent calibration (0.066). Findings were consistent when assessing probabilities by deciles.
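The abstract does not state how the probability of endoscopic improvement was derived from the four class probabilities. One plausible approach, shown here purely as an assumption, is to sum the predicted probabilities for subscores 0 and 1 and score that sum against the dichotomized central read. The function names and the three example rows are hypothetical.

```python
import numpy as np

def improvement_probability(probs):
    """ASSUMPTION: P(endoscopic improvement) = P(subscore 0) + P(subscore 1)."""
    probs = np.asarray(probs, dtype=float)
    return probs[:, 0] + probs[:, 1]

def binary_brier(p, y):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

# Hypothetical class probabilities for three videos (columns: subscores 0-3).
probs = np.array([
    [0.70, 0.20, 0.08, 0.02],
    [0.05, 0.15, 0.50, 0.30],
    [0.10, 0.45, 0.40, 0.05],
])
central_read = np.array([0, 3, 2])  # hypothetical centrally read subscores

p_improved = improvement_probability(probs)
improved = (central_read <= 1).astype(float)  # 1 if subscore is 0 or 1
print(f"Brier (0,1 vs 2,3) = {binary_brier(p_improved, improved):.4f}")
```

In this toy example the third video contributes most of the error: the summed probability of improvement is 0.55, but the central read is 2.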
Discussion: Whereas data on the certainty of human readers in endoscopy subscore assessments are elusive, AI-ES is calibrated across all endoscopy subscore classes, providing reliable data on score probabilities. This novel measurement of certainty, added to the AI-ES score assessment, may enable AI-based multi-reader or consensus workflows in trials, potentially improving the reliability of UC endpoint assessments.
Figure: Figure 1. Calibration plots measuring the reliability of the model's predicted probabilities for endoscopy subscores 0 or 1 (A) and 2 or 3 (B). Data are based on septiles of predicted probabilities.
Disclosures:
Chakib Battioui: Eli Lilly – Employee.
Pavel Brodskiy: Iterative Health Inc – Employee.
Klaus Gottlieb: Eli Lilly – Employee.
Mohammad Haft-Javaherian: Iterative Health Inc – Employee.
William Eastman: Eli Lilly and Company – Employee, Stock Options.
Chakib Battioui, PhD1, Pavel Brodskiy, PhD2, Klaus Gottlieb, MD, PhD, JD1, Mohammad Haft-Javaherian, PhD2, William J. Eastman1, Julian Lehrer2, Evan Yu, PhD2, Derek Onken, PhD1, Darren Thomason, MBA2, Walter Reinisch, MD, PhD3, Daniel Colucci4, Shrujal Baxi, MD4. P3295 - Validating Calibration of an Artificial Intelligence Assessment of Endoscopic Severity in Ulcerative Colitis, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.