Scientists at Toronto General Hospital Research Institute (TGHRI) have developed an improved method for evaluating the performance of Artificial Intelligence (AI) models across various health care settings.
As health care datasets become larger and more complex, the use of AI for the analysis of these datasets is gaining traction. Medical information can take the form of unstructured data such as medical images, electrocardiograms (ECGs), and text from clinical notes. Despite advancements in AI that have produced tools capable of analyzing medical images and clinical language, it remains challenging to predict their effectiveness in different health care settings without testing on new and varied data from each setting.
“For AI tools to be truly safe and effective for patient care, they must perform reliably across different situations and patient groups, a concept known as generalizability, which requires accurate performance estimation,” says Cathy Ong Ly, doctoral student at TGHRI and co-first author of the study. “We sought to address this challenge of estimating AI model accuracy by analyzing 13 datasets across different modalities such as X-rays, CT scans, ECGs, clinical notes, and lung sound recordings.”
When the team tested various AI models on these datasets, they found that model performance was overestimated by about 20% on average. “We propose that this overestimation is due to data acquisition bias (DAB), a natural occurrence when data for these studies is retrospectively collected from regular medical care,” says Dr. Chris McIntosh, Scientist at TGHRI and senior author of the study.
“Generally speaking, AI might focus on irrelevant patterns in the data instead of what really matters for the task,” adds Dr. McIntosh. “Different hospital departments may use different equipment or settings and have different patient acquisition conditions. These variations, which might be imperceptible to researchers and clinicians, can be detected by AI algorithms. When models are trained on this data, they might rely on these subtle differences—like how a medical image was taken—rather than the actual medical content, to make predictions.”
An example of this bias is that patients suspected of having interstitial lung disease are often directed towards specific imaging techniques meant to confirm the diagnosis, while patients who are not suspected of having the disease get more general scans. An algorithm trained on such data will appear highly accurate at the hospital where its training data was collected, but when it is deployed for clinical care at another hospital with different scanners, its accuracy will drop, potentially putting patients at risk.
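The effect can be illustrated with a toy simulation. The sketch below is not the study's method and not PEst; it is a minimal, hypothetical example in which every number (the 70% reliability of the medical signal, the 95% link between scanner protocol and diagnosis) is an assumption chosen for illustration. A very simple model picks whichever single feature best predicts the label in its training hospital, and is then scored both at that hospital and at a second hospital where the scanner protocol is unrelated to the disease.

```python
import random

def make_hospital(n, shortcut_strength, rng):
    """Simulate n patients as ((signal, scanner), label) rows.

    shortcut_strength is the probability that the scanner flag matches the
    disease label; 0.5 means the scanner carries no information about disease.
    All probabilities here are illustrative assumptions, not study data.
    """
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        # Genuine medical signal: agrees with the label 70% of the time.
        signal = label if rng.random() < 0.7 else not label
        # Acquisition shortcut: which scanner protocol the patient received.
        scanner = label if rng.random() < shortcut_strength else not label
        rows.append(((signal, scanner), label))
    return rows

def accuracy(rows, feature_index):
    """Fraction of rows where the chosen feature equals the label."""
    return sum(x[feature_index] == y for x, y in rows) / len(rows)

def fit_stump(rows):
    """Pick the single feature that best predicts the label on training data."""
    return max(range(2), key=lambda i: accuracy(rows, i))

rng = random.Random(0)
train = make_hospital(2000, 0.95, rng)          # scanner tracks diagnosis
internal_test = make_hospital(2000, 0.95, rng)  # same hospital, held-out patients
external_test = make_hospital(2000, 0.5, rng)   # new hospital, scanner is uninformative

chosen = fit_stump(train)  # 0 = medical signal, 1 = scanner shortcut
internal_acc = accuracy(internal_test, chosen)
external_acc = accuracy(external_test, chosen)

print("feature chosen:", "scanner" if chosen == 1 else "signal")
print(f"internal accuracy: {internal_acc:.2f}")
print(f"external accuracy: {external_acc:.2f}")
```

In this toy setup the model latches onto the scanner flag because it is the stronger predictor in the training hospital, so held-out data from the same hospital gives a much more optimistic accuracy than data from the second hospital.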
To address this issue, the researchers developed an open-source accuracy estimation method called PEst that corrects for this bias and provides more accurate estimates of a model’s external performance.
“Our method, which corrects for hidden patterns and biases in the data, predicts a model’s performance on new datasets with an accuracy margin within 4% of the actual results,” says Balagopal Unnikrishnan, doctoral student at TGHRI and co-first author of the study.
Given how crucial the accuracy of AI models is in health care, where recommendations can significantly impact patient outcomes, these findings will help enable safer and more widespread use of AI and support the development of new medical AI technology. This study was a truly multidisciplinary effort across University Health Network (UHN) to measure the impact of these biases in a diverse array of modalities and diseases.
Cathy Ong Ly (left) and Balagopal Unnikrishnan (middle) are doctoral students in the lab of Dr. Chris McIntosh (right).
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), The Princess Margaret Cancer Foundation, and UHN Foundation. Data for this study was supported by foundation investments in the Digital Cardiovascular Health Platform, including the Peter Munk Cardiac Centre and Ted Rogers Centre for Heart Research, and in MIRA through Cancer Digital Intelligence.
Dr. Chris McIntosh is an Assistant Professor in the Department of Medical Biophysics at the University of Toronto (U of T). He holds the Chair in Artificial Intelligence and Medical Imaging at the Joint Department of Medical Imaging at UHN and the Department of Medical Imaging at U of T.
Ong Ly C, Unnikrishnan B, Tadic T, Patel T, Duhamel J, Kandel S, Moayedi Y, Brudno M, Hope A, Ross H, McIntosh C. Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data. NPJ Digit Med. 2024 May 14;7(1):124. doi: 10.1038/s41746-024-01118-4.