International evaluation of an AI system for breast cancer screening
Screening mammography aims to identify breast cancer before symptoms appear, enabling earlier therapy for more treatable disease. Despite the existence of screening programs worldwide, interpretation of these images suffers from suboptimal rates of false positives and false negatives. Here we present an AI system capable of surpassing a single expert reader in breast cancer prediction performance. Using two large data sets representative of clinical practice in the United States (US) and the United Kingdom (UK), we show an absolute reduction of 5.7%/1.2% (US/UK) in false positives and 9.4%/2.7% (US/UK) in false negatives. We show evidence of the system's ability to generalize from the UK sites to the US site. In an independently conducted reader study, the AI system outperformed all six radiologists, with an area under the receiver operating characteristic curve (AUC-ROC) greater than that of the average radiologist by an absolute margin of 12.1%. By simulating the AI system's role in the double-reading process, we maintain noninferior performance while reducing the second reader's workload by 88%. This robust assessment of the AI system paves the way for prospective clinical trials to improve the accuracy and efficiency of breast cancer screening.
Breast cancer is the second leading cause of death from cancer in women, but outcomes have been shown to improve if the disease is caught and treated early. This is why many countries around the world have set up breast cancer screening programmes, which aim to identify breast cancer at earlier stages of the disease, when treatment can be more successful.
However, interpreting mammograms (breast x-rays) remains challenging, as evidenced by the high variability of experts’ performance in detecting cancer. In collaborative research with Google Health, Cancer Research UK Imperial Centre, Northwestern University and Royal Surrey County Hospital, now published in Nature, we developed an AI system capable of surpassing clinical specialists from the UK and US in predicting biopsy-confirmed breast cancer from mammograms.
Breast cancer screening datasets
Breast cancer screening programmes vary from country to country. In the US, women are typically screened every one to two years, and their mammograms are interpreted by a single radiologist. In the UK, women are screened every three years, but each mammogram is interpreted by two radiologists, with an arbitration process in case of disagreement. We utilised large datasets collected in both countries to develop and evaluate this AI system.
The UK evaluation dataset consisted of a random sample of 10% of all women with screening mammograms at two sites in London between 2012 and 2015. It included 25,856 women, 785 of whom had a biopsy and 414 of whom had cancer that was diagnosed within three years of imaging. These de-identified data were collected as part of the OPTIMAM database effort by Cancer Research UK, and are subject to strict privacy constraints.
The US evaluation dataset consisted of de-identified screening mammograms of 3,097 women collected between 2001 and 2018 from one academic medical centre. We included images from all 1,511 women who were biopsied during this time period and a random subset of women who never underwent biopsy. Among the women who received a biopsy, 686 were diagnosed with cancer within two years of imaging.
Assessing the performance of the AI system
We compared the performance of the AI system against the decisions made by individual human specialists at the original screening visit. In this evaluation, the AI achieved an absolute reduction in false positives (women without cancer who were incorrectly referred for further investigation) of 5.7% for US subjects and 1.2% for UK subjects, and an absolute reduction in false negatives (women with cancer who were not referred for further investigation) of 9.4% for US subjects and 2.7% for UK subjects, compared to human experts. See the paper for more extensive results.
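To make the word "absolute" concrete, the short sketch below reads these reductions as percentage-point differences between the human reader's error rates and the AI system's error rates on the same cases. The `Confusion` class, the `absolute_reduction` function and all of the counts are hypothetical and chosen only for illustration. They are not taken from the study.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Confusion:
    """Confusion-matrix counts for one reader on one set of screening cases."""
    tp: int  # cancers correctly recalled
    fp: int  # women without cancer who were recalled
    tn: int  # women without cancer who were not recalled
    fn: int  # cancers that were not recalled

    @property
    def false_positive_rate(self) -> float:
        # Share of cancer-free women who were recalled (1 - specificity).
        return self.fp / (self.fp + self.tn)

    @property
    def false_negative_rate(self) -> float:
        # Share of cancers that were missed (1 - sensitivity).
        return self.fn / (self.fn + self.tp)


def absolute_reduction(reader: Confusion, ai: Confusion) -> Tuple[float, float]:
    """Return (false-positive, false-negative) reductions in percentage points."""
    fp_points = 100 * (reader.false_positive_rate - ai.false_positive_rate)
    fn_points = 100 * (reader.false_negative_rate - ai.false_negative_rate)
    return fp_points, fn_points


# Hypothetical counts for illustration only (not data from the study).
reader = Confusion(tp=70, fp=100, tn=900, fn=30)
ai = Confusion(tp=75, fp=80, tn=920, fn=25)

fp_red, fn_red = absolute_reduction(reader, ai)
print(f"False positive reduction: {fp_red:.1f} percentage points")
print(f"False negative reduction: {fn_red:.1f} percentage points")
```

With these toy counts the reader's false positive rate is 10% and the AI's is 8%, so the absolute reduction is 2 percentage points. Expressed as a relative reduction, the same change would be 20%, which is why the distinction matters when reading the reported figures.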
Generalization across populations
To evaluate whether the AI system was able to generalize across populations and screening settings, we ran an experiment in which the AI was only allowed to learn from data from UK subjects, and then evaluated it on data from US subjects. This experiment showed that the AI system still surpassed human expert performance on US data.
This is an encouraging avenue for future research and gives more confidence in the robustness of the AI system. It suggests that an AI diagnostic system might be beneficial even in areas where there is not a significant history of screening mammography on which to train it.
Future research & potential applications
We’ve yet to determine how best to deploy an AI system for clinical use in mammography. However, we investigated one possible scenario by using the AI system as a “second reader”. We simulated this by treating the prediction of the AI system as an independent second opinion for every mammogram, taking the place of the second reader in the UK “double reading” system. When the AI and the clinician disagreed, the existing arbitration process would take place. In these simulated experiments, we showed that an AI-aided double-reading system could achieve non-inferior performance to the UK system with only 12% of the current second reader workload.
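The logic of this simulated workflow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the code used in the study. The function `simulate_ai_second_reader`, the `arbitrate` callback and the eight example cases are all hypothetical, and the sketch assumes that further human reading is only needed when the AI and the first reader disagree.

```python
from typing import Callable, List, Sequence, Tuple


def simulate_ai_second_reader(
    first_reader: Sequence[bool],
    ai_recall: Sequence[bool],
    arbitrate: Callable[[int], bool],
) -> Tuple[List[bool], float]:
    """Combine first-reader and AI recall decisions, case by case.

    If the two agree, their shared decision stands and no further human
    reading is needed. If they disagree, the case is escalated to the
    (hypothetical) `arbitrate` callback. Returns the final decisions and
    the fraction of cases that needed escalation.
    """
    if len(first_reader) == 0 or len(first_reader) != len(ai_recall):
        raise ValueError("inputs must be non-empty and of equal length")

    decisions: List[bool] = []
    escalated = 0
    for case, (reader, ai) in enumerate(zip(first_reader, ai_recall)):
        if reader == ai:
            decisions.append(reader)
        else:
            escalated += 1
            decisions.append(arbitrate(case))
    return decisions, escalated / len(first_reader)


# Hypothetical recall decisions for eight mammograms (True = recall).
first_reader = [False, False, True, False, True, False, False, False]
ai_recall = [False, True, True, False, True, False, False, False]

# Hypothetical arbitration outcomes, keyed by case index.
arbitration_outcome = {1: True}

decisions, workload = simulate_ai_second_reader(
    first_reader, ai_recall, lambda case: arbitration_outcome[case]
)
print(decisions)
print(f"Cases needing further human reading: {workload:.1%}")
```

In this toy example the AI and the first reader disagree on one case out of eight, so 12.5% of cases need further human reading. The study's 12% figure comes from its own simulation on the UK data, not from a calculation like this one.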
Further research, including prospective clinical studies, will be required to understand the full extent to which this technology can benefit breast cancer screening programmes.