Imagine that there is a model that predicts an individual’s risk of thousands of diseases based solely on their medical history. What product would you like to create based on this model? This is not a hypothetical question – such a model already exists. It is described in a recent article in Nature, is publicly available, and is waiting for a team that can put it into practice.
Researchers led by oncologist and bioinformatics specialist Moritz Gerstung from Heidelberg University have developed a mathematical model called Delphi-2M that does just that. Using the historical medical records of about half a million Britons stored in the UK Biobank, the group was able to predict personal risk of future disease with accuracy significantly higher than that of traditional methods based on population risks associated with gender and age. Moreover, for the vast majority of diseases (though not all), Delphi-2M's predictions were even more accurate than biomarker-based predictions – despite the fact that the latter draw on much richer input data. The new model's predictions cover the entire spectrum of ICD-10 morbidity, and its prediction horizon extends to almost 20 years.
All this was made possible by the transformer architecture – the very same architecture that revolutionized machine learning in recent years and made modern LLMs possible. It would not be too much of a stretch to say that the researchers from Germany have built a "ChatGPT that predicts diseases." Putting such a predictive system into practice could significantly change the personal medical risk scoring industry and might even become the basis for a distinct type of personalized medicine. However, to understand what the emergence of Delphi-2M means for the medical technology community, we first need to understand how it works, what its limitations are, and what improvements are required before it can be applied in practice.
What’s inside Delphi-2M?
The idea behind the new model is quite simple – and that in itself is good news for two reasons.
- First, it clearly has significant potential for improvement and customization, which will obviously be needed when and if such models enter medical practice.
- Second, it is surprising that, even without such enhancements, it appears to perform its task "out of the box" better than anything else on the market.
In other words, this is undoubtedly a foundation model that any business can adapt to its needs – and, importantly, the open source code provided by the model's authors makes this task much easier.
Delphi-2M is based on a transformer model, but instead of word fragments, as in large language models, its tokens are individual diseases in the chronological order unique to each person. The prompt for prediction is the medical history, i.e. a chain of time intervals during which the person was diagnosed with particular diseases. If no disease was diagnosed during a given period, this is encoded with a special token. This small trick, introduced by the authors, is important because it allows the model to output not only the "most likely next disease" but also the numerical probability of its occurrence.
Similar to how an LLM transformer predicts the next token in the chain fed into the model, Delphi-2M predicts what disease poses the greatest risk to a given individual – i.e. it enables a personalized disease prognosis.
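To make this concrete, here is a minimal sketch (not the authors' released code) of how a medical history could be encoded as a token sequence with an explicit "no event" token, and how next-token logits translate into per-disease probabilities. The vocabulary, codes, and logit values are illustrative assumptions, not values from the paper.

```python
import math

# Toy vocabulary: a handful of ICD-10 codes plus the special "no event" token.
VOCAB = ["NO_EVENT", "I10", "E11", "J45", "C34"]

# One person's history as (age in years, token) pairs in chronological order.
# Healthy stretches are encoded explicitly with NO_EVENT, which is what lets
# the model assign a probability to "nothing happens" alongside disease risks.
history = [(40.1, "J45"), (45.0, "NO_EVENT"), (51.3, "I10")]

def softmax(logits):
    """Convert raw next-token logits into probabilities summing to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for the transformer's output: logits for the next token given `history`.
dummy_logits = [2.1, 0.3, -0.5, -1.2, -2.0]

risks = dict(zip(VOCAB, softmax(dummy_logits)))
for token, p in sorted(risks.items(), key=lambda kv: -kv[1]):
    print(f"{token:8s} p={p:.3f}")
```

In a real Delphi-2M-style setup, the dummy logits would come from a trained transformer conditioned on the whole token sequence, but the output format – a probability for every ICD-10 token plus "no event" – is what turns next-token prediction into a personalized risk estimate.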
How accurate is the Delphi-2M prediction? And how does this accuracy compare to other models on the market?
In their work, the authors used three different approaches to assess the accuracy of predictions (a schematic code sketch follows the list below).
- First, they tested the predictions on a held-out group of people whose data was not used to train the model – the so-called validation set, which constitutes around 20% of the whole dataset.
- Second, because the authors withheld the last two years of UK Biobank data when training the model, they could check the model's forecasts against what actually happened – so-called longitudinal testing.
- Third, and perhaps most importantly, the model was tested on a completely different type of data collected in another country, Denmark. The authors managed to obtain medical data on nearly two million Danes, which became the hardest test of the model's ability to generalize its predictions – and of the entire approach.
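As a rough illustration of these three regimes, the sketch below splits a toy record set by person, by time, and into an external cohort. The record layout, field names, and the exact cutoff date are assumptions made for the example, not details taken from the paper.

```python
import random
from datetime import date

# Toy records: (person_id, event_date, icd10_code). The real data are far richer.
records = [
    ("p1", date(2012, 3, 1), "I10"),
    ("p2", date(2016, 6, 5), "E11"),
    ("p3", date(2020, 8, 9), "J45"),
    ("p4", date(2019, 2, 2), "C34"),
    ("p5", date(2014, 11, 20), "J45"),
]

# 1) Person-level hold-out: ~20% of individuals are never seen during training.
people = sorted({pid for pid, _, _ in records})
random.seed(0)
val_people = set(random.sample(people, k=max(1, len(people) // 5)))
train = [r for r in records if r[0] not in val_people]
valid = [r for r in records if r[0] in val_people]

# 2) Temporal hold-out: the last ~two years of follow-up are withheld from
#    training so that forecasts can be checked against what actually happened.
CUTOFF = date(2019, 1, 1)  # assumed cutoff, for illustration only
future = [r for r in train if r[1] >= CUTOFF]
train = [r for r in train if r[1] < CUTOFF]

# 3) External validation: the trained model is scored on a separate cohort
#    (in the paper, registry data on nearly two million Danes).
external_cohort = []  # placeholder for the second country's records

print(len(train), len(valid), len(future), len(external_cohort))
```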
In short, the result of testing the model is as follows: Delphi-2M turned out to be the most accurate of all existing predictive models covering the entire spectrum of ICD-10 diseases. And although the accuracy of the prediction decreases slightly (by an average of a couple of percent) when the model is transferred to data from another country, it still remains quite high.
In some nosologies, Delphi-2M loses out to more specialized models. For example, for cardiovascular disease the QRISK3 model achieves 71% accuracy versus Delphi-2M's 70%. It should be noted, however, that QRISK3 takes into account data such as BMI and blood cholesterol levels – indicators that Delphi-2M, relying solely on medical history, cannot know. Surprisingly, this costs the latter only one percentage point of accuracy.
The accuracy of the model’s predictions also depends on the type of disease: it is highest in the ICD-10 chapters covering the most prevalent conditions (Chapter V, mental disorders; Chapter XV, pregnancy and childbirth), for which there is accordingly far more training data. For rarer nosologies (e.g. Chapter XVII, congenital abnormalities), predictions become more "random" because they rest on a much smaller data set.
It is important to understand what this number means. Accuracy can be quantified in different ways, most commonly as precision or recall, and the two are linked: without changing the model at all, precision can be raised at the cost of recall (and vice versa) simply by moving the decision threshold. An objective but much less intuitive metric is the AUC – the area under the ROC curve, which traces how the true-positive rate grows as the false-positive rate is allowed to increase. For a perfect classifier the AUC equals one, and the higher the AUC, the more accurate the model. Since different kinds of errors carry different costs, the model can be tuned to different precision/recall operating points. For example, if the goal is to flag people at high risk of a disease for whom some form of prevention is indicated, false positives matter less, and precision can be sacrificed in the hope of gaining recall.
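Here is a short, self-contained illustration (not taken from the paper) of why the AUC is reported as the headline metric while precision and recall remain tunable: moving the decision threshold trades one against the other but leaves the AUC unchanged. The labels and risk scores below are made up.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]                          # 1 = disease occurred
y_score = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]  # predicted risk

# Threshold-independent: how well the risk scores rank cases above non-cases.
print("AUC:", roc_auc_score(y_true, y_score))

# Threshold-dependent: screening (low threshold) vs. conservative use (high).
for thr in (0.3, 0.5, 0.7):
    y_pred = [int(s >= thr) for s in y_score]
    print(f"threshold={thr:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
```

Lowering the threshold in the loop is exactly the trade-off described above: more at-risk people are flagged (higher recall) at the price of more false alarms (lower precision), while the AUC stays the same.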
In any case, the AUC of Delphi-2M's predictions ranges from 69% to 80%. This is far above the 50% of random guessing and significantly higher than epidemiological predictions based on the prevalence of diseases in a particular social group.
The very fact that a transformer model can predict incidence better than conventional epidemiology is remarkable and unusual. The key difference between the transformer and a human epidemiologist is that the model takes into account temporal interactions between diseases that are not obvious or visible to the naked eye. This is what gives Delphi-2M its increased prediction accuracy compared to classical epidemiological prevalence models, and it is in itself a separate field of analysis for researchers of specific diseases.
Nevertheless, 70–80% is still not 95% or 99%, and a fairly large share of the model's predictions turn out to be wrong. Whether this level of accuracy is sufficient to use the model's predictions in clinical practice, for example in private medicine, is an open question. The transition to real-world application may prove difficult, as it carries legal risks associated with possible errors and all the complexities of handling private medical information.
A clear advantage of Delphi-2M is the openness of the model and its license, which permits refinement and the creation of derivative works. One can expect that sooner or later this approach to processing medical information will be used in one form or another – perhaps not as direct-to-consumer applications that predict disease risk, but at least in public health tasks where planning for and predicting future morbidity is important.