The three phases of medical AI trials

In a recent blogpost I explored how to critically read medical artificial intelligence research, focusing on the relevance of these experiments to clinical practice. It has since struck me that we don’t have a simple, clear way to discuss the idea that some studies are still a long way from use in the clinic, while others have progressed much closer to translation into practice.

The medical researchers in the audience might recognise this concept, because this is a case where medicine has already solved the problem.

See, clinical trials are grouped into categories based on how useful the results are going to be to clinical practice. These groups are called the phases of medical research, and they reflect the common path from preliminary work to clinical translation. This is pretty much the required path for clinical innovations to take if they want to be accepted by doctors and regulators. Broadly speaking, most research that involves humans (I will use drug trials to illustrate the concept) falls into one of three categories.

Phase I is the first safety checks. A drug is tested in a small group of people to make sure nothing terrible happens. At this stage we barely even consider efficacy (how well the drug works). We just want to know it doesn’t kill people. If we get hints that it works really well, great, but that isn’t the primary motivation of the study.

Phase II assesses safety more thoroughly. This requires a larger group, to identify rarer side effects. Because of the larger sample, we can start finding some evidence of efficacy but it will never be enough to justify clinical use.

Phase III is the difficult, expensive, important stage. The study is designed with the express purpose of finding out how useful the drug is. This usually means a large number of people using it for a long period. The methods and analysis need to be able to hold up under heavy scrutiny from the FDA or similar regulatory authorities.

Technically there are also pre-clinical trials (animal models), and phase IV trials (follow-up once the drug is available), but phases I-III are where ideas become treatments.

This diagram doesn’t add anything useful except a splash of colour

I think we have a very similar progression in medical AI research, as almost all studies I have seen can fit into a few well defined categories. I highly doubt that the system I present below is rigorous or covers many of the edge cases, but it should form a useful framework when designing and reading about research in the field.

ARTIFICIAL INTELLIGENCE TRIALS

It makes sense for a framework of AI trials to mirror the structure of other clinical trials. The three phase concept is familiar, intuitive and would possibly go some distance in bridging the gap in understanding between medical and artificial intelligence researchers. It might even make it easier to convince doctors and regulators that your new state of the art medical AI system is ready for patients.

The key difference between clinical trials and AI trials is that in phase I and phase II, safety is not a concern for AI systems. These systems will not be applied to patient care at all until phase III, so there is no risk to humans. This is called “negligible risk research” among the ethics boards I usually interact with.

Note that the framework below is intended for use with software systems, not physical systems like surgical robots. A similar framework would exist for these systems, but the details would differ significantly.

Phase I:

Overview: This phase will try to identify tasks which are unfeasible, where the intended model is not promising enough to warrant further research. In tasks that seem promising at this stage, it will guide model design choices and inform cohort selection in the next phase of research.

Study design: The AI system is trained and tested on a small retrospective cohort. This means the data was collected in the past for other reasons, and the researchers simply use it to try to identify factors relevant to the task they want to solve. The classic example in ML research is using a public dataset.

Usually the cohort size will number between twenty and a few hundred, and is not expected to be large enough to accurately characterise model performance or make claims about efficacy.
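To see why a cohort of this size cannot pin down performance, here is a minimal sketch that computes an approximate 95% confidence interval (the Wilson score interval) for an observed accuracy. The cohort sizes and the 90% accuracy figure are made up for illustration, and the function name is my own.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical cohorts, each with 90% of cases classified correctly.
for n in (20, 200, 10_000):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n:>6}: observed 90%, plausible range {low:.3f} to {high:.3f}")
```

The interval for the smallest cohort is very wide, which is the sense in which a phase I result cannot support claims about efficacy.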

The cohort is similar to the population the model is targeted at, but it is rarely exactly the same. Choices are often made to simplify the experiments, and these choices limit the ability of researchers to generalise the results more broadly. For example, a dataset of hospital patients is often used because it is readily available, even though the goal is to apply the system to the general (non-hospitalised) population. These design choices will often be made by researchers not specifically trained in cohort selection (i.e. by computer scientists instead of biostatisticians, epidemiologists or medical researchers).

The task itself will often be simplified as well to aid the analysis. Proxy tasks are often targeted (we call these surrogate endpoints), instead of attempting to measure the ultimate goal of the research. An example of a surrogate endpoint from my previous blog would be the study that measured the precision and regularity of stitch placement ex-vivo with a surgical robot, rather than the effect on the patient complication rate. While good performance at the former task is not direct evidence of a system doing human tasks, the latter is an experiment that could never get past an ethics board using an untested system, since it would need to be applied to patients.

Costs: The majority of the cost of phase I trials is in the researcher time, designing and training the models.

Time to translation: In clinical trials, we might expect around ten years between a successful phase I trial and a consumer-ready product.

Examples: every medical deep learning trial ever (except one). These are published at a rate of several per week, by groups ranging from high end researchers to undergraduate students. Even Kaggle competitions with medical data and a clinical target would count.

Phase II:

Overview: This phase will identify the ideas that are worth pursuing in phase III studies. Since phase III trials are expensive and time consuming, phase II experiments aim to discover the most promising model architectures, goals and patient cohorts.

Study design: The AI system is tested on a big cohort, large enough that the performance is representative of the expected maximum performance for the model design. The cohort should reflect the target population closely, although some significant differences are still likely. The major confounding variables should be accounted for, or explicitly recognised and acknowledged where they are not controlled. Cohort selection for phase II studies will often require the assistance of study design experts (biostatisticians, epidemiologists).

Cohorts in phase II AI trials are likely to number in the tens of thousands or more. This is much larger than is common in phase II clinical trials, accounting for the need in machine learning research for both training and testing cohorts. If you don’t know what this means, just accept that it will double your required cohort size at minimum compared to a similar clinical trial.
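For readers who have not met the training/testing distinction, here is a minimal sketch of the idea, assuming each case carries a patient identifier. The function name, the 50/50 split and the cohort of 20,000 are illustrative assumptions, not a description of any particular study.

```python
import random


def split_cohort(patient_ids, test_fraction=0.5, seed=0):
    """Split unique patient IDs into disjoint training and testing cohorts."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    # The model never sees the testing cohort during training.
    return unique_ids[n_test:], unique_ids[:n_test]


# Hypothetical cohort of 20,000 patients.
train_ids, test_ids = split_cohort(range(20_000), test_fraction=0.5)
print(len(train_ids), "patients for training,", len(test_ids), "for testing")
```

If the testing cohort has to be as large as the cohort of a comparable clinical trial, the training cohort comes on top of that, which is where the doubling comes from.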

The data will almost always still be retrospective, but the task itself will be very similar to the clinical task that the researchers seek to automate.

Costs: The majority of cost in phase II trials will be in gathering, labeling and processing the large training dataset. The costs for model design at this stage will vary, depending on the novelty of the machine learning methods.

Time to translation: In clinical trials, we might expect around five to eight years between a successful phase II trial and a consumer-ready product.

Examples: the Google study on diabetic retinopathy. This study is the only one I have seen that could be called phase II in this framework. It used over 10,000 cases to test the system, which was trained on 130,000 images. This system performs on par with medical specialists and should accurately reflect the clinical performance (within a margin of error), and thus could legitimately form the basis for a phase III clinical trial.

Phase III:

Overview: Phase III trials are for proving clinical utility. The goal is to show how effective the system is at the clinical task in a controlled environment.

Study design: The AI system is tested on a large prospective cohort that accurately reflects the target population. Prospective means the patients are gathered prior to application of the system, and then followed up for long enough to assess the effects. The study aims to demonstrate change in a medical metric, such as improvement in patient outcomes or a reduction in the costs of clinical care (without increased harm).

Cohort selection is critical in this phase, as the system will only be accepted in clinical practice for populations that match the study cohort. A significant amount of effort is spent on study design, often requiring multiple experts working for several months.

Cohort size is more variable in phase III, and will be guided by the size of the effects identified during phase II studies (a statistical power calculation). It is possible that a phase III trial for a particularly efficacious system could be smaller than the phase II study that created the AI model. That said, I personally expect that the first phase III AI system trials will have to “overpower” their cohorts to overcome the conservative bias* of medical research.
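As a rough illustration of what a power calculation looks like, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The event rates, the 5% significance level and the 80% power are assumptions chosen for the example, and a real trial would have a biostatistician do this properly.

```python
import math
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.8):
    """Approximate patients needed per arm to detect a difference in event rates."""
    if p_control == p_treated:
        raise ValueError("the two event rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect**2)


# Hypothetical complication rates: a large effect versus a small one.
print("10% -> 5%:", patients_per_arm(0.10, 0.05), "patients per arm")
print("10% -> 8%:", patients_per_arm(0.10, 0.08), "patients per arm")
```

The required cohort grows quickly as the expected effect shrinks, which is why a particularly efficacious system could get away with a comparatively small phase III trial.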

Task selection will reflect the use case of the system. Clinical and regulatory acceptance will require proof on the same task the system will be deployed to perform (a regulatory endpoint). Again, this will require extensive planning and discussion with domain experts.

Costs: The majority of the cost during phase III trials is in the study design, cohort enrollment and management, data analysis and publication expenses. As the computer system design is largely finalised during phase II experiments, the machine learning cost during phase III should be small, although engineering costs may be much higher.

Since these studies are prospective, follow-up periods must be long enough to capture the clinical outcomes in question. For events like heart attacks, this often means several years of follow up. The costs of running studies like this can be enormous.

Time to translation: In clinical trials, we might expect around two to five years between a successful phase III trial and a consumer-ready product. The regulatory approval process can take a really long time!

Examples: No phase III trials have ever been performed using deep learning systems.

It could be argued that Computer Aided Diagnosis (CAD) systems for radiology have undergone phase III trials in the past, particularly in mammography. These systems were an older (and less performant) style of machine learning. This history could actually make the translation of deep learning systems harder, because phase IV (post-deployment) experience with CAD systems has been disappointing.

PHASES SET TO THRILL

Vale, Leonard

It seems to me that this kind of framework could help solve some of the problems I have written about previously, particularly regarding science communication with the public and the media. Simple categories like those I have described can identify up-front how close (or far) to clinical translation an AI system is, and that will make understanding the research much easier for everyone.

They might also help to calibrate our expectations. Almost no clinical research ever makes it through the whole system, and it would be reasonable to expect a similar culling process. Since we try to keep track of the more advanced clinical trials, we know that only 18% of phase II trials reach phase III, and probably less than 50% of phase III trials succeed.
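Putting those two figures together gives a rough upper bound on how many phase II ideas end up as successful phase III trials. This is a back-of-the-envelope sketch that simply multiplies the rates quoted above and treats "less than 50%" as exactly 50%.

```python
import math

# Rates quoted in the text; the phase III figure is an upper bound.
phase2_reaches_phase3 = 0.18
phase3_succeeds_at_most = 0.50

# Chance that a phase II idea goes on to a successful phase III trial.
upper_bound = math.prod([phase2_reaches_phase3, phase3_succeeds_at_most])
print(f"At most about {upper_bound:.0%} of phase II trials lead to a successful phase III")
```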

It is probably even worse for AI systems, since the barrier to performing a phase I study (particularly with public data) is so low. It might be fair to estimate that fewer than one in a thousand AI trials will ever progress past the first phase. We see publication of five to ten medical AI papers per week, but we have only ever seen a single phase II trial.

As a little bit of further cold water, it is estimated that the average drug takes more than ten years and almost a billion dollars to get from lab to market. AI systems might be easier and cheaper than that, but we don’t really have any evidence to justify this view. No AI trial has made it to phase III or beyond to find out.

Finally, a framework like this could also provide a clear road-map for researchers. Start with this sort of experiment, then move up to something like this, and by the end you will have a system that will (hopefully) address the concerns of doctors and regulators. In my experience computer scientists and engineers often find these kinds of study design choices non-obvious, and having a rough guide for how to get from idea to medical product could be helpful.

One of my New Year’s resolutions is to try to limit the length of my blog posts so they are more digestible, so I will end this piece before it climbs too far over two thousand words 🙂

Thanks for reading and sharing.

* the conservative bias is a feature, not a bug. The first example of a new medical innovation faces a higher barrier to acceptance than subsequent implementations. This is because of the precautionary principle. The more we test a method of medical science, the better we understand it and the better we can predict the risks. For largely untested methods, we err on the side of caution.

9 thoughts on “The three phases of medical AI trials”

Heri says:

December 28, 2016 at 8:07 pm

Interesting ideas. I can’t help thinking of misclassification problems, especially in early stages. This can snowball into wrong findings. Maybe we can have an AI working in parallel with the usual team of epidemiologists and biostatisticians to validate findings by AI.

I’m more excited about turning the data from the trials into a learning dataset and then using it for healthcare applications. For example, a diabetic retinopathy study with classified images can be used to detect symptoms quickly and accurately in the general population via a simple mobile app. This means the time between clinical research and usage can be dramatically shortened.

Another opportunity is going through the huge datasets and then having AI find correlations researchers haven’t found yet. Imagine we could have AI find new adverse effects for a food or inversely, find protective effects based on where you live etc.

mattwescott says:

December 29, 2016 at 8:32 pm

This is thought-provoking, thanks Luke. I’d be curious to hear how you expect adoption to unfold for particular applications in radiology. Will this sensible process that you’ve outlined emerge from the boiling stew of sometimes conflicting interests and agendas from research, sales, clinics, administrators and regulators?

1. lukeoakdenrayner says:
  
  December 31, 2016 at 7:02 am
  
  I have to admit I haven’t had to deal with “the system” directly in most of my work. I will definitely write more on this in the future, but most of it will rely on second-hand accounts.
  
Elías Eyþ says:

December 30, 2016 at 10:47 am

Very interesting. Due to the high-cost of gathering and labeling DICOM data, the only logical way forward is to set up a multi-center, pre-registered database – using crowd sourcing principles and open-source ideology, I think it’s within reason to assume radiologists would participate. There would have to be stringent validation of cases going in, i.e. some kind of predefined selection criteria so that “shitty quality” images also get included. The problem with current databases is that they are meticulously curated and only “representative” images are usually included, while we would want the algorithm to do well on an unselected population. Radiopaedia would have a tactical advantage in setting up this kind of idea. What are your thoughts?

1. lukeoakdenrayner says:
  
  December 31, 2016 at 7:01 am
  
  There is a constant tension between easy to use data and representative data. I cover this a bit more in my recent piece on the phases of clinical trials (https://lukeoakdenrayner.wordpress.com/2016/12/27/the-three-phases-of-medical-ai-trials/).
  
  I think you only need truly representative data at phase III, as long as you understand the major ways your preliminary data is non-representative.
  
  I have a few ideas about how to reduce the cost of gathering these datasets, which I have been employing in my own work. Hopefully I will write something in the future.
  
Joeri Nicolaes says:

January 4, 2017 at 9:50 am

Great post Luke, I really appreciate how you blend the clinical validation practice with the AI technology context.
