Ten controversial opinions about medical AI

So, would you believe that in over 3 years of blogging, I have never done a low-effort, low-information, clickbait post? Bizarre, I know, but that changes today.

Honestly though, I’ve wanted to put together a list like this for a while. I’ve had a lot of Twitter discussions around these topics, and these are all things I believe strongly but other clever people disagree with. So here you are, a list of opinions I hold that are more or less outside the common consensus.

Non-standard disclaimer: this post is all my personal opinion, with very little evidence. Take with salt for best results.

  1. Open data is not necessarily good. Data is the main competitive advantage a company has to bring a product to market (which costs millions of dollars). If they don’t have this advantage, they have a much less certain return on investment. Why spend millions on a product that anyone can build? Open data could actually slow the pace of progress, where we end up with lots of research papers but no products. Open data is also a terrible thing for generalisability, as everyone overfits massively to “be the best” on public datasets.
    I’m not giving up my first mover advantage.
  2. “Normal vs abnormal” is a terrible task to train a model for. The abnormal class is so broad and diverse that your data will never cover it well, and your ability to notice rare subgroup errors will be very low (since you won’t have any cases). I expect a huge spike in the rate of missed bone tumours if anyone brings a “normal chest xray” detector to market.
    Exactly Ford. Without the darkness, how would we recognise the light?
  3. “Artificial intelligence” is a great term. We all know what it means, it brings in interest and money to the field, and frankly what we do is magic* so let’s just run with it.
    Which is more magical, magitech or technomagic?
  4. Deep learning is pretty useless for EHR data. Not only is deep learning meh with unstructured** data like EHR records, but I don’t see any reason to expect breakthroughs. Deep learning works in images, text, sound, and so on because it looks for a very narrow subset of possible features (i.e., those with spatial relationships). EHR data has no internal structure^, so DL is no better than simpler ML models.
  5. End-user interpretability is over-rated. If your model works, most doctors will gleefully and immediately cede all related decision making to the AI, without the need for interpretability tools. At best, interpretability methods will provide a (false?) sense of security to clinicians^^. That said, faux-interpretable systems will probably sell better to CIOs trying to look “safety first”, so the current practice of adding heatmaps to everything makes a certain amount of cynical sense.

    I can bill for it as a separate item.
  6. No medical advance is going to be achieved by a team who has designed a fancy new model for the task. Anyone using some home-spun model instead of an off-the-shelf dense/res/u-/inception network etc. is doing machine learning research, not medical research. The very process of building and tuning your own model means you will almost certainly overfit to your particular data, which is anathema to good medical systems. I’m actively skeptical of results in medical data where a novel architecture is used.
  7. Releasing public code is not particularly relevant in medical AI research. It doesn’t improve reproducibility for high performance systems, because without an equally good (but different) dataset we can’t actually validate the results. Even with shared data, running the same code on the same data only proves they didn’t make the results up.
    I mean, I’d sell my soul for an AUC like that.
  8. Vision is done and dusted. Computer vision models aren’t going to get a lot better in terms of performance. We will slowly see improvements in data efficiency and semi-supervised learning, but pretty much any visual task can be performed at human or superhuman level given enough effort and data. We are at Bayes error.


    The end of computer vision. So sad.
  9. Unsupervised learning isn’t clinically relevant. Currently, all AI that seems likely to add clinical value is supervised, because human performance is very close to the best achievable, given the inputs. Unsupervised learning is getting better, but you will always take a performance hit, which will always make it worse than human. There are undoubtedly some situations where unsupervised learning can play a supplementary role to supervised learning, but we won’t be solving medicine with our huge stores of unlabelled data anytime soon.
  10. Distrust any system with an AUC below 0.8, because this is roughly how well medical AI systems work when they overfit on non-pathological image features, like the model of x-ray scanner used or which technician took the image (all of which can be identified in the image to some degree). These systems will mostly fail as clinical AI because they can’t generalise. Obviously the cut-off of 0.8 is a huge oversimplification, but it tends to be a good rule of thumb for many common medical tasks.
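
To make the intuition behind #10 concrete, here is a minimal sketch of a "model" that never looks at pathology and scores each image purely by scanner type. Everything in it is an assumption for illustration: the 50% prevalence, and the made-up figures that 80% of sick patients and 20% of well patients are imaged on a portable scanner. Under those numbers a binary score has an expected AUC of 0.5 + (0.8 - 0.2)/2 = 0.8, so the shortcut alone lands right on the rule-of-thumb cut-off.

```python
import random


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative.

    Ties count as half a win, which matters here because the score is binary.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_shortcut(n=2000, p_portable_if_sick=0.8,
                      p_portable_if_well=0.2, seed=0):
    """Simulate labels and scores for a model that only sees scanner type.

    The probabilities are hypothetical, not taken from any real dataset.
    """
    rng = random.Random(seed)
    labels, scores = [], []
    for _ in range(n):
        sick = rng.random() < 0.5
        p_portable = p_portable_if_sick if sick else p_portable_if_well
        portable = rng.random() < p_portable
        labels.append(int(sick))
        scores.append(float(portable))  # the score ignores pathology entirely
    return labels, scores


labels, scores = simulate_shortcut()
shortcut_auc = auc(labels, scores)
print(f"AUC from scanner type alone: {shortcut_auc:.2f}")
```

At a second site where scanner choice is unrelated to disease, the same score would be expected to fall back towards an AUC of 0.5, which is the failure to generalise described above.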

There you go. Definitely not a format I will make regular use of, but holy smokes, I’ve written a blogpost with less than 1000 words! My only regret is I couldn’t find more gifs.

I’ll try to respond to any comments here or on social media, so disagree away (or suggest other controversial opinions). I’ll probably do a follow-up post with the best responses, especially if anyone can convince me my opinions are wrong 🙂

* everyone who argues that “AI isn’t magic” needs to have an infusion of childlike wonder, stat. We use maths to transform sounds into meaning, and images into decisions.
** it really bothers me that this is often called “structured data” simply because it is in rows and columns. There is no exploitable internal structure!
^ the exception is with time-series data from EHRs, which have a temporal structure that deep learning might be able to exploit.
^^ interpretability methods are actually very important, IMO, but not for clinicians. They are probably the tools we quality assurance nerds will use for AI monitoring and troubleshooting to ensure ongoing system safety.

20 thoughts on “Ten controversial opinions about medical AI”

  1. Boldest post so far – congratulations! The only point that I have a real problem with is #8, ie “pretty much any visual task can be performed at human or superhuman level given enough effort and data”. Might change my mind about that one though 🙂


      1. I really liked this opinion on this topic although it’s more commenting on #6,#7 and the first half of #8: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (5min read)
        TLDR: If you can, kill it with compute and it will eventually be better than humans if it “can discover like we can, not merely contain what we have discovered”. By adding more data and compute (i.e. effort and data as in #8) you can discover novel solutions to the same problem even in a supervised setting (see the legendary move to the 5th line by AlphaGo) which then might result in a real medical advance much more valuable than the mere automation of the state of the art medical performance. Great post btw and sad you didn’t find more gifs! Gifs are important.


      2. TLDR: It depends on what “enough data and effort” and “superhuman level” precisely mean. However, I think that we currently do not have enough data to confidently answer even a very carefully formulated statement similar to your #8. The situation is very annoying yet interesting 🙂

        Sidenote: I have tried to find a list of computer vision tasks where computers are outperforming humans. Did not manage to find such a list. If somebody has it – please share, I will appreciate it very much.

        Let us consider ImageNet for concreteness. The labels (both train and test) come from multiple-human annotators via Amazon Mechanical Turk in that dataset. There is a league table at https://paperswithcode.com/sota/image-classification-on-imagenet , from which it seems that top 1 accuracy is 84% and top 5 accuracy is 97% at the moment. If you look at the plot in that page, it definitively shows saturation. Is the saturation due to bad labels? Does the saturation effect only exist with respect to effort? Would more data help? Is there a saturation effect when it comes to more data as well? How about a saturation effect for big data AND big effort? I am not convinced that we know the answers to these questions even in the particular case of ImageNet.

        A. Karpathy has famously demonstrated that his performance for top 5 was 95% http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ and his top 1 performance was miserable. There were constraints to his efforts though: he spent at most 1 minute per image and 1 week for the whole exercise. From this I have concluded that current deep learning algorithms are better than a human with 1 week of training.
        If we are to believe that the labels are correct, I think one could make a case that given more training time, a single human could outperform the top deep learning model. Learning to tell apart these 120 species of dogs seems doable for a human, and 84% seems to be a pretty low number. I would love to change my mind given more data though!

        A similar argument certainly does not work for some other computer vision tasks such as detection of faces. I concede that superhuman accuracy has been achieved in that domain.

        My general takeaway is that given reasonably huge amounts of well-labeled data (millions of examples) and effort (several decades of full time top data scientist working time), it is often possible to produce superhuman performance for certain tasks with current deep learning tech.

        I still wonder if we can comfortably cover all the edge cases (such as telling bone tumors from lung masses in chest X-rays) given that (1) some conditions are exceedingly rare, and (2) medical imaging data is far from being available in a global centralized repository.


      3. I agree. The point was really trying to say “in cases where we can gather enough supervised data for the task.” If we don’t have enough bone tumors, we shouldn’t expect human level or greater performance.

        In medicine, this could mean that rare diseases are beyond current technology, although I suspect this isn’t true for most tasks. For example, in radiology most rare conditions are characterised by combinations of non-rare image features. For example, rare lung infections and tumors can cause reticulonodular patterns, elements of which can be seen in more common diseases such as interstitial oedema and sarcoidosis. A model trained on more common diseases will have a representation of these features, so it isn’t unreasonable to think it may be able to differentiate between the classes with only small volumes of rare class training data.

        There will still be edge cases, but I’m finding more and more that broadly trained, high performance models learn surprisingly widely useful features.

        So while a chest x-ray may have ten thousand possible diseases, it might only have 100 useful image features, which might be learnable with a few dozen classes.


  2. #4 True, usually deep learning is not required for the relatively low feature count in EHR data. But there’s no compulsion that we have to use deep learning when a simpler neural network, or even a discrete mathematical formula can do the job with much less compute power, and faster too. Deep learning is one tool in the tool box. We should use the simplest tool needed to do the job well; an elegant solution for the task at hand.

    And I agree that a lot of EHR data can be less than well structured, particularly if it is free-form text, but this is not always the case. EHR data can be well structured, discrete and relational if the EHR was designed well to reflect the true nature of the domain. This is possible with good EHR design.

    #6,#7 both state the difficulty of making a generalizable model which works in all circumstances. But it reminds me of the eternal conflict all scientists face when trying to design and interpret any experiment. We want to sample the whole universe of the population and cater for real world behavior of the full system in its natural state. But the complexity becomes uninterpretable and so we scientists reduce the model to a simpler form such that we can observe a few discrete dependent and independent variables and see how they are associated. This can yield precise knowledge. But then the critics say that this is an artificial circumstance which would be irrelevant to how nature works in the wild. So we have traded relevance for precision in our simplification of the model from the real world to a discrete small model. Both approaches have their place to add knowledge, and so neither should be refuted. Transpose this idea to machine learning and I would argue that a discrete closed data model attaining high precision but perhaps less generalization in the wider population still has its use, if the constraints in its data and design are understood and accepted in its use. Similarly a more generalizable but less precise model using greater data from multiple centres also has its use, as more of a screening tool, or population assessment rather than for individual case diagnosis. No single model is likely to be possible which caters for all scales of use. All models will be inadequate to some degree and we have to be intelligent about how and where we use them, and how the results are interpreted.

    My 2c worth. What does anyone else think?


  3. Hi Rong,

    Great visuals! This liberal arts major could only understand a small part of your blog, but I could understand enough to know that it comes from a great deal of thought and a great analytical mind. I’m just glad that you’re on our side!! See you at work!

    Elena McHugh


  4. I respectfully disagree with points 1 and 8.

    1. Researchers research and product developers develop product. It is what they do. Allowing researchers to research is fundamentally good, and they stake the quality of their reputation on the quality of their output. Even if there is a surplus of bad research, that’s ok because that has been true for 3000 years. It makes science neither better nor worse, and rather more impressive when high quality research is produced.

    On the business side, there is no guarantee that the ubiquity of intellectual property will prevent a company from being successful. Consider McDonalds or Coca Cola. It is the duty of the product manager to create high quality applications for physicians to use. More competition is good as that means the best rise to the top and doctors are afforded more choice. We should encourage natural selection to choose who is weak and who is strong. Companies that quiver in their boots if someone takes their work are not ones that will survive. Anyone in any business, eager to make a quick buck deserves to fail.

    8. I would not underestimate the power of human creativity and intelligence. Consider one of the hardest mathematical problems in Computer Vision – fuzzy edges in image masking. A “traditional” method of masking is Otsu thresholding – plot the histogram of pixel intensities and pick the threshold that best splits it into two classes (formally, the one that maximises the between-class variance). However, for blurry images Otsu thresholding has difficulty finding appropriate masks. Why? Because if you plotted the pixel values over an n*m grid, you would see only small changes in the derivative, i.e. in pixel value from one pixel to the next. Where have we seen this before? Boundary Value Problems for differential equations. This is not fully understood, and if anyone does understand it, tell me and let’s solve the Riemann Hypothesis. However, I bet someone alive right now will solve it, just as I am confident that we can do better than stitching together a few SVMs, matrix multiplications, and random search. Or maybe not, but we should try.


  5. The clearest example of the limitation of Machine Learning is Graph Isomorphism. (Graph matching, construct a bijection f such that if uv is an edge then f(u)f(v) is an edge) No ML is better, faster, or more accurate than Weisfeiler-Leman (nauty and variants) yet any undergrad can stare hard enough and match two presentations of the Petersen graph.
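
For readers who haven't met the problem in the last comment, here is a minimal brute-force sketch of the definition given there: search for a bijection f on the vertices such that every edge uv maps to an edge f(u)f(v). This is not Weisfeiler-Leman and not how nauty works; it simply tries all n! bijections, so it is only usable on tiny graphs. The function names and the example graphs (two drawings of a 5-cycle, rather than the 10-vertex Petersen graph) are chosen purely for illustration.

```python
from itertools import permutations


def normalise(edges):
    """Store undirected edges as frozensets so (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


def find_isomorphism(n, edges_g, edges_h):
    """Return a dict mapping vertices of G to vertices of H, or None.

    Vertices are 0..n-1 in both graphs. Brute force over all n! bijections.
    """
    g, h = normalise(edges_g), normalise(edges_h)
    if len(g) != len(h):
        return None
    for perm in permutations(range(n)):
        # perm[u] plays the role of f(u). With equal edge counts, sending
        # every edge of G onto an edge of H is enough for an isomorphism.
        if all(frozenset((perm[u], perm[v])) in h for u, v in g):
            return dict(enumerate(perm))
    return None


# Two drawings of the same 5-cycle: a pentagon and a pentagram.
pentagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
pentagram = [(0, 2), (2, 4), (4, 1), (1, 3), (3, 0)]
# Same number of edges, but it contains a triangle, so it is not a 5-cycle.
triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]

print(find_isomorphism(5, pentagon, pentagram))           # some valid mapping
print(find_isomorphism(5, pentagon, triangle_with_tail))  # None
```

The factorial blow-up is the reason practical tools rely on refinement and pruning rather than enumeration.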


