Medical AI Safety: We have a problem.

This is the start of a new blog series on the most important topic I will probably ever write about: medical AI safety. This is a timely discussion, because we are approaching a tipping point. For the first time ever AI systems are replacing human judgement in actual clinics. For the first time ever AI systems could directly harm patients, with unsafe systems leading to overspending, injury, or even death.

For the first time ever, AI systems could actually be responsible for medical disasters.

If that sounds ominous, it should. I don’t think we are being nearly careful enough.

Medical safety

When considering the issue of near-term* AI safety, we need to recognise that AI applied to human medicine is different from AI in almost all other areas. In the tech world performance is often valued above all else, and the risks are treated as secondary. This is exemplified in the unofficial motto of many Silicon Valley software companies – “move fast and break things”. In contrast to software companies, the semi-official creed of many doctors (popularly attributed to the Hippocratic Oath) is “first, do no harm”. The risk to life and health in medical research requires us to put safety first. This is baked into medical research: for example, drug companies are required to demonstrate safety before they are allowed to test performance.

We have seen the problems with the Silicon Valley ethos in AI outside of medicine already. Biased or flawed systems released to the wild have caused harm: Google’s photo tagging system identified users with dark skin as gorillas, Microsoft’s chat-bot “Tay” devolved into racism and ugliness, and Uber’s and Tesla’s self-driving cars were unable to cope with fairly unexceptional traffic conditions, leading to fatalities.

A useful shorthand is to think of medical AI as operating on a spectrum of risk, ranging between human applied tools (i.e., traditional medical software) and fully autonomous decision makers. At the very far end of the spectrum are systems that not only operate independently, but also perform a task humans are incapable of themselves (and thus cannot even evaluate, no matter how transparent the decision making process).

If we put typical medical AI tasks on this spectrum, it might look like this:

The dotted line is a tipping point, where we transition from systems that supply information to human experts, and into systems that can make medical decisions independently.

We are at that tipping point, right now.

Move fast …

There have been a number of deep learning based AI systems approved by regulatory bodies, including detection and measurement systems from Arterys, and a stroke triage system from Viz.AI. These systems all had something in common: a human expert is still required to make every decision. The system simply provides additional information for that expert to base their decisions on. While it is not impossible that application of these systems will harm patients, it is unlikely.

There are also systems where the line gets a bit blurry. An FDA approved system from Cardiologs to detect atrial fibrillation in ECG Holter monitor recordings highlights possible areas of concern to doctors, but the final judgement rests with them. The concern here is that if this system is mostly accurate, are doctors really going to spend time painstakingly looking through hours of ECG traces? The experience from mammography is that computer advisers might even worsen patient outcomes, as unexpected as that may be. Here is a pertinent quote from Kohli and Jha, reflecting on decades of follow-up studies for systems that appeared to perform well in multi-reader testing:

Not only did CAD increase the recalls without improving cancer detection, but, in some cases, even decreased sensitivity by missing some cancers, particularly non-calcified lesions. CAD could lull the novice reader into a false sense of security. Thus, CAD had both lower sensitivity and lower specificity, a non-redeeming quality for an imaging test.

These sorts of systems can clearly have unintended and unexpected consequences, but the differences in outcomes are often small enough that they take years to become apparent. This doesn’t mean we ignore these risks, just that the risk of disaster is fairly low.
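To make the quote concrete, here is a minimal sketch of how a reading aid can end up with both lower sensitivity and lower specificity. The counts are invented purely for illustration (they are not taken from Kohli and Jha or any other study), and the names `ReadingResults`, `without_cad` and `with_cad` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadingResults:
    """Confusion-matrix counts for one screening setup (hypothetical data)."""
    true_pos: int   # cancers that were recalled
    false_neg: int  # cancers that were missed
    false_pos: int  # healthy patients recalled unnecessarily
    true_neg: int   # healthy patients correctly not recalled

    def sensitivity(self) -> float:
        # Fraction of actual cancers that were detected.
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Fraction of healthy patients who were correctly not recalled.
        return self.true_neg / (self.true_neg + self.false_pos)

    def recalls(self) -> int:
        # Everyone called back for further testing, rightly or wrongly.
        return self.true_pos + self.false_pos


# Invented numbers: 10,000 screens containing 50 cancers in both scenarios.
without_cad = ReadingResults(true_pos=42, false_neg=8, false_pos=900, true_neg=9050)
with_cad = ReadingResults(true_pos=40, false_neg=10, false_pos=1200, true_neg=8750)

for label, r in [("without CAD", without_cad), ("with CAD", with_cad)]:
    print(f"{label}: recalls={r.recalls()}, "
          f"sensitivity={r.sensitivity():.2f}, specificity={r.specificity():.2f}")
```

In this made-up scenario the aided readers recall more patients while finding fewer cancers, which is the pattern the quote describes. Differences of this size are easy to miss without large follow-up studies.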

Now we come to the tipping point.

A few months ago the FDA approved a new AI system by IDx, and it makes independent medical decisions. This system can operate in a family doctor’s office, analysing the photographs of patients’ retinas, and deciding whether that patient needs a referral to an ophthalmologist. The FDA explicitly says:

IDx-DR is the first device authorized for marketing that provides a screening decision without the need for a clinician to also interpret the image or results, which makes it usable by health care providers who may not normally be involved in eye care.

As far as autonomous decision making goes, this seems fairly benign. Currently the decision to refer to an ophthalmologist is made based on several factors (see this pdf for the Australian guidelines), but retinal image assessment does play a big role. It is possible that this automated referral system can work well in practice. But while there is a big potential upside here (about 50% of people with diabetes are not screened regularly enough), and the decision to “refer or not” is rarely immediately vision-threatening, approving a system like this without clinical testing raises some concerns.

I’ll explain more about what I mean by “clinical testing” in a coming piece, but it should be made clear I am not saying this system was not tested. IDx have done thorough performance testing (see this pdf link), and as far as I can tell this testing has been best-in-class (they use prospective patient selection and data from multiple geographic locations, for example). But as with mammography CAD, performance is only one part of the story and can be misleading. The true proof of any system is in clinical outcomes, and no group anywhere in the world has demonstrated how AI systems affect patients.

Further along the risk spectrum, medical twitter has been up in arms because the NHS is already using an automated smart-phone triage system “powered by” babylonhealth AI. This one is definitely capable of leading to serious harm, since it recommends when to go (or not to go) to hospital. Several medicos are highly concerned that it appears to recommend staying at home for classic heart attack symptoms, resting for meningitis, and taking pain relief for a stroke.

PE (pulmonary embolism) has an untreated mortality rate of between 5% and 50%. Costochondritis is a sore joint in your chest, usually from overuse. It is not a long bow to draw that this kind of advice could kill people.

As far as I know, several NHS bodies are now investigating the use of this technology, in response to concerns from doctors.

Finally, several companies (e.g., Google and Enlitic, among others) have been practising “off-shoring” – they are taking their AI products to jurisdictions with more lax regulatory environments. While this practice is widespread in drug testing (and comes with serious risks in that setting), these companies don’t appear to be using these settings for testing, but rather are off-shoring to avoid testing altogether. The justification that these populations are desperately under-serviced and that any treatment is better than no treatment is highly suspect. Many promising treatments have been abandoned because they caused more harm than they prevented.

The point here isn’t to criticise these companies specifically. Many of them are great companies full of people I personally know are wonderful, thoughtful, and careful. But we have to recognise that even the most cautious people can make decisions that lead to tragedy, which is why we have regulation in the first place.

The real point is that neither the FDA, the NHS, nor the various regulatory agencies in other nations appear to be concerned about the specific risks of autonomous decision making AI. They are treating these systems as simple medical devices, allowing decision-making AI systems to be used clinically with only modest evidence backing them and no investigation of the possible unintended consequences.

… and break things

While the FDA is certainly on a trajectory to reduce regulation under the current government, it is unclear if the current laissez-faire approach is completely intentional or if it also reflects the regulatory inertia that comes with any new technology. Either way, we need to remember our history. Lax medical regulation has resulted in tragedy in the past.

A few important examples:

Elixir sulfanilamide (1930s). Sulfanilamide was known to be a fairly safe antibiotic, but the chemist in charge of producing it at the S.E. Massengill Company decided to use a solvent known as diethylene glycol (DEG) to make his “elixir”. This compound is highly toxic, and caused 107 deaths. This event and the resultant public outcry led to new laws that laid the groundwork for the FDA as we know it today. In further evidence of the need for a strong regulatory environment, DEG has been responsible for over 1000 deaths worldwide since the 1980s, the majority in young children. The most recent event was in Nigeria in 2009, where at least 84 children under the age of 7 died.
Thalidomide (1950s). Thalidomide is actually something of a success story for regulation. The FDA, strengthened after the sulfanilamide disaster, prevented sale of this medication in the USA for control of nausea during pregnancy, due to a lack of evidence for its safety. Most jurisdictions had not yet developed similar authorities, and thalidomide was available over the counter in many countries. Use of the medication led to around 10,000 cases of birth malformations and 2,000 infant deaths worldwide, which the USA avoided almost completely thanks to Frances Kelsey** and the FDA.

Are we potentially racing towards an AI event on the scale of elixir sulfanilamide or thalidomide? Over the next few articles, I will be addressing several key problems I see related to how these systems are being tested. For now, let’s just say I have serious concerns with the level of evidence required to get risky AI systems approved.

To really drive this home, a quote from Samuel Massengill himself, who testified in court about the 107 deaths caused by his product:

“We have been supplying a legitimate professional demand and not once could have foreseen the unlooked-for results.”

It is likely that the head chemist of his company disagreed.

A cautionary tale for all of us, about the personal cost of being careless. If we have learned nothing else from the history of medical regulation, we need to remember that unlooked-for and unforeseeable are not the same thing.