KDD 2018: Deep Learning for Computational Health

The Deep Learning for Computational Health tutorial was presented by Jimeng Sun, Edward Choi and Cao Xiao. With electronic health record (EHR) data comprising free text, structured codes, time series and images, a wide range of deep learning techniques has been successfully applied to its analysis. The presenters focused on capturing information about the sequence of a patient's visits, interpretability of models, and representation learning for textual and structured data. Image processing was not discussed, except for a mention of last year's skin cancer Nature article.


A few recent results spanning different types of data were presented as evidence of the usefulness of deep learning in EHR classification tasks.


Assignment of clinical codes (for example, ICD-10) is a laborious and error-prone process performed manually by clinical “coders”, who, in the case of the ICD-10 coding system, need to assign a subset of 68,000 codes to patient records. Inaccuracies in the assignment can lead to lost revenue or charges of fraud. Mullenbach et al. (including Sun) in Explainable Prediction of Medical Codes from Clinical Text try to address this issue with a machine learning model that assigns multiple ICD codes to clinical notes. The architecture is relatively simple: the text of the note is scanned sequentially with a CNN that constructs representations of short phrases. An attention mechanism is then employed to weigh those representations and select phrases that are important for coding.
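In outline, the CNN-plus-per-code-attention idea can be sketched in numpy as below. All weights are random stand-ins for learned parameters, and the sizes and function names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_phrases(X, W):
    """Slide a width-k convolution over the note to get phrase vectors."""
    k = W.shape[0]
    H = np.stack([np.tensordot(X[i:i + k], W, axes=([0, 1], [0, 1]))
                  for i in range(X.shape[0] - k + 1)])
    return np.tanh(H)                      # (n_phrases, n_filters)

def predict_codes(X, W_conv, U, beta):
    H = conv_phrases(X, W_conv)            # phrase representations
    A = softmax(U @ H.T, axis=1)           # per-code attention over phrases
    V = A @ H                              # code-specific document vectors
    logits = (V * beta).sum(axis=1)
    return 1 / (1 + np.exp(-logits)), A    # per-code probability + attention

rng = np.random.default_rng(0)
tokens, emb, filters, codes = 30, 8, 16, 5   # toy sizes
probs, attn = predict_codes(rng.normal(size=(tokens, emb)),
                            0.1 * rng.normal(size=(3, emb, filters)),
                            rng.normal(size=(codes, filters)),
                            rng.normal(size=(codes, filters)))
```

The attention matrix is what makes the model explainable: for each predicted code, the rows of `attn` point back at the phrases that drove the decision.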


Classification using structured data was exemplified by Zhang et al.'s Leap: learning to prescribe effective and safe treatment combinations for multimorbidity. Multimorbidity, or multiple diseases co-occurring in a single patient, presents a challenge when it comes to prescribing medication, as different drugs can interact with each other in complex ways. The LEAP model uses a sequence-to-sequence RNN that takes patient conditions as input and outputs a sequence of medications to prescribe, so there is no limit on the number of drugs predicted. While the initial model did not take known adverse drug interactions into consideration, that knowledge was later imparted through reinforcement fine-tuning.
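A minimal sketch of the sequence-decoding idea follows. The vocabulary, weights and greedy decoding loop are illustrative stand-ins, not LEAP's actual components:

```python
import numpy as np

# Hypothetical toy vocabularies; LEAP learns over real condition/drug codes.
CONDS = ["hypertension", "diabetes", "asthma"]
DRUGS = ["drug_A", "drug_B", "drug_C", "<end>"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def leap_decode(cond_ids, E_c, E_d, W_h, W_x, W_o, max_len=5):
    # encoder: mean of condition embeddings (a stand-in for LEAP's encoder)
    h = np.tanh(E_c[cond_ids].mean(axis=0))
    out = []
    x = np.zeros_like(E_d[0])              # start-token embedding
    for _ in range(max_len):               # emit drugs one at a time
        h = np.tanh(W_h @ h + W_x @ x)
        d = int(np.argmax(softmax(W_o @ h)))
        if DRUGS[d] == "<end>":            # open-ended: stops when it decides to
            break
        out.append(DRUGS[d])
        x = E_d[d]
    return out

rng = np.random.default_rng(1)
dim = 8
meds = leap_decode([0, 1],
                   rng.normal(size=(len(CONDS), dim)),
                   rng.normal(size=(len(DRUGS), dim)),
                   0.3 * rng.normal(size=(dim, dim)),
                   0.3 * rng.normal(size=(dim, dim)),
                   rng.normal(size=(len(DRUGS), dim)))
```

The end-token mechanism is what removes the fixed limit on the number of drugs: the decoder keeps emitting until it generates `<end>`.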

Skin cancer

In Dermatologist-level classification of skin cancer with deep neural networks, published in Nature last year, Esteva et al. present a complex CNN model based on the Inception v3 architecture that, in conjunction with an expert-curated hierarchy of skin diseases, distinguishes skin cancers from benign conditions with accuracy higher than the mean accuracy of dermatologists.

Sequence processing

The first topic discussed in depth was that of sequence modelling, where the sequences considered were patient visits or hospital admissions. All papers in this section have been co-authored by the presenters.

Doctor AI

In Doctor AI: Predicting Clinical Events via Recurrent Neural Networks, Choi (the presenter) et al. attempted to model disease progression using patient visits, i.e. predict what will happen on visit 3 given visits 1 and 2. Each visit was represented using a sparse vector encoding the conditions present on that visit, then an RNN was trained on the sequence of visits in a manner similar to how language models are built: by having the network at each step predict the next input. The dataset, obtained from Sutter Health, contained data from 260,000 patients captured over ten years. To make the dimensionality manageable, the ICD-9 codes used as output were truncated to their first three digits, resulting in 1,183 distinct values. The resulting model outperformed simpler approaches; interestingly, predicting the most frequent reason for a given patient worked better than logistic regression on this task.
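The next-visit prediction loop can be sketched with a plain RNN in numpy. The weights below are random and untrained, and the paper uses GRUs rather than this vanilla recurrence; the 1,183 outputs correspond to the truncated ICD-9 groups:

```python
import numpy as np

def doctor_ai_step(h, visit, W_in, W_rec, W_out):
    # visit: multi-hot vector of truncated ICD-9 codes for one visit
    h = np.tanh(W_in @ visit + W_rec @ h)      # recurrent state update
    probs = 1 / (1 + np.exp(-(W_out @ h)))     # per-code probability for next visit
    return h, probs

n_codes, hidden = 1183, 64   # 1,183 three-digit ICD-9 groups, as in the paper
rng = np.random.default_rng(2)
W_in = 0.01 * rng.normal(size=(hidden, n_codes))
W_rec = 0.01 * rng.normal(size=(hidden, hidden))
W_out = 0.01 * rng.normal(size=(n_codes, hidden))

h = np.zeros(hidden)
for _ in range(3):                             # a short sequence of visits
    visit = np.zeros(n_codes)
    visit[rng.choice(n_codes, size=4, replace=False)] = 1.0
    h, next_visit_probs = doctor_ai_step(h, visit, W_in, W_rec, W_out)
```

Exactly as in language modelling, training would push `next_visit_probs` at each step towards the codes actually observed on the following visit.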

An additional interesting result was that this model is suitable for transfer learning: significantly better results were obtained on MIMIC-III when starting with a model pretrained on Sutter Health data than when the model was trained from scratch, even though the two datasets contained different types of data, with Sutter covering primary care and MIMIC intensive care units (ICU).


Another paper by Choi et al., RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism, tackles the question of interpretability, which the Doctor AI RNN lacked. The RETAIN model uses a linear embedding of the sparsely-encoded input to generate an interpretable hidden state, which in turn is attended to by two RNNs: one that determines the weight of a specific visit, and another that assigns weights to the individual variables (coordinates) within a visit. The results are combined into a prediction by linear weighting of the features within subsequent visits, which then allows drawing timeline plots that illustrate the importance of findings over time:

Illustration from the paper: contribution of diagnoses over time.

I particularly liked this paper for achieving close to state-of-the-art performance with an algorithm that provides a clean mapping to how humans assess EHRs.
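The two levels of attention and the per-code contribution scores can be sketched as follows. The paper computes the attention weights with two reverse-time RNNs; plain linear maps stand in here, and all weights are random:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def retain(visits, W_emb, w_alpha, W_beta, w_out):
    V = visits @ W_emb.T                        # (T, m) linear visit embeddings
    # RETAIN derives these weights from two RNNs run in reverse time;
    # linear maps keep the sketch short
    alpha = softmax(V @ w_alpha)                # (T,) visit-level attention
    beta = np.tanh(V @ W_beta.T)                # (T, m) variable-level weights
    context = (alpha[:, None] * beta * V).sum(axis=0)
    y = 1 / (1 + np.exp(-(w_out @ context)))
    # contribution of code j at visit t: linear, so it decomposes exactly
    contrib = alpha[:, None] * ((beta * w_out) @ W_emb) * visits
    return y, contrib

rng = np.random.default_rng(3)
T, n_codes, m = 4, 20, 8
visits = (rng.random((T, n_codes)) < 0.15).astype(float)
y, contrib = retain(visits,
                    0.3 * rng.normal(size=(m, n_codes)),
                    rng.normal(size=m),
                    rng.normal(size=(m, m)),
                    rng.normal(size=m))
```

Because every step is linear in the inputs, the `contrib` matrix sums exactly to the prediction logit, which is what makes the timeline plots honest rather than post-hoc.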


RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data by Xu et al. is another attention model, this time for predicting patient outcomes (length of stay, mortality) based on data from ICU sensors, tests and notes. Aggregating this multi-channel, multimodal (observed at different frequencies) data in a way that allows interpretable prediction (a recurring theme in healthcare models!) presented a challenge. It was addressed using a CNN to summarise short-term, dense signals such as sensor readings over larger time windows, followed by an RNN to model the long-term sequential pattern, with lab tests and medications used to guide which time windows are most relevant. Interpretability could then be achieved by plotting the channels over time, highlighting the time windows and channels that contributed most to the final prediction. The accuracy of length of stay prediction was very impressive.

Illustration from the paper: channels and significant time steps.
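The summarise-windows-then-recur idea can be sketched as below. The mean/max window summary and the additive attention floor are crude stand-ins for RAIM's learned CNN and guided attention:

```python
import numpy as np

def raim_sketch(signal, events, W_rec, W_in, window=50):
    # signal: (T,) dense waveform; events: (T,) 0/1 flags for labs/medications
    n_win = signal.shape[0] // window
    feats, guidance = [], []
    for i in range(n_win):
        seg = signal[i * window:(i + 1) * window]
        feats.append([seg.mean(), seg.max()])   # crude "CNN" window summary
        # windows containing a clinical event get more attention
        guidance.append(events[i * window:(i + 1) * window].max())
    feats = np.asarray(feats)
    attn = np.asarray(guidance, dtype=float) + 0.1
    attn /= attn.sum()                          # event-guided attention weights
    h = np.zeros(W_rec.shape[0])
    for t in range(n_win):                      # RNN over the window summaries
        h = np.tanh(W_rec @ h + (W_in @ feats[t]) * attn[t])
    return h, attn

rng = np.random.default_rng(4)
hidden = 8
signal = rng.normal(size=500)
events = (rng.random(500) < 0.01).astype(float)
h, attn = raim_sketch(signal, events,
                      0.3 * rng.normal(size=(hidden, hidden)),
                      0.3 * rng.normal(size=(hidden, 2)))
```

Plotting `attn` against the channels over time is the kind of view the paper uses to show which windows drove the prediction.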


Readmission – a patient being admitted to a hospital within a short time, typically less than 30 days after discharge – affects almost 18% of patients in the US. It is estimated that up to 76% of those readmissions are potentially avoidable, and their cost adds up to nearly $18 billion a year. Readmission prediction via deep contextual embedding of clinical concepts by Xiao (the presenter) et al. is similar to RAIM in that it combines different techniques to capture long-term context and near-term dependencies. An RNN is used for the latter, but LSTM and GRU units, claimed to be capable of remembering long-distance dependencies, have been found not to perform very well over long sequences, and pose difficulties in optimisation. Topic modelling was therefore used to model the global context, i.e. the long-term history of admissions. The readmission indicator is then computed as a weighted average of the long- and near-term contexts. The optimisation of the model is performed using an inference network.
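The combination of the two contexts can be sketched as below. In the paper the topic proportions come from a topic model fitted with an inference network; here they are simply given, and the fixed gate and random weights are stand-ins:

```python
import numpy as np

def readmission_risk(visits, topic_props, W_rec, W_in, w_local, w_global,
                     gate=0.5):
    # near-term dependencies: RNN over the recent visit vectors
    h = np.zeros(W_rec.shape[0])
    for v in visits:
        h = np.tanh(W_rec @ h + W_in @ v)
    # global context: topic proportions aggregated over the whole history
    theta = topic_props.mean(axis=0)
    # readmission indicator as a weighted average of the two contexts
    score = gate * (w_local @ h) + (1 - gate) * (w_global @ theta)
    return 1 / (1 + np.exp(-score))

rng = np.random.default_rng(5)
n_codes, hidden, n_topics, T = 30, 8, 5, 6
visits = (rng.random((T, n_codes)) < 0.1).astype(float)
topic_props = rng.dirichlet(np.ones(n_topics), size=T)  # per-visit topic mixes
risk = readmission_risk(visits, topic_props,
                        0.3 * rng.normal(size=(hidden, hidden)),
                        0.3 * rng.normal(size=(hidden, n_codes)),
                        rng.normal(size=hidden),
                        rng.normal(size=n_topics))
```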

Representation learning

Concept embedding was the second area of application for deep learning in healthcare that was discussed at length. Again, the presenters summarised their own work.


The word2vec algorithm (in its skip-gram version) learns representation vectors of words by predicting the context in which a word appears. Edward Choi and his co-authors on Multi-layer Representation Learning for Medical Concepts wanted to apply this idea to structured EHR data, where each patient visit is represented by a sparse binary vector of codes, and neighbouring visits form the context in which a given visit is evaluated. Demographic data about the patient is incorporated into training through concatenation with the embedded visit vector. The model was trained on data from 3.3 million visits by 500,000 patients obtained from Children's Healthcare of Atlanta. The trained embeddings of medical codes can then be used as the feature representation in classification tasks, for example by attaching a logistic regression model.
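A sketch of the two-level visit embedding with demographics appended is below. The weights are random, and a softmax over the code vocabulary stands in for the paper's exact training objective of predicting the codes of neighbouring visits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def embed_visit(codes, demo, W_code, W_visit):
    # codes -> code-level vector, then append demographics, then visit-level
    code_level = np.maximum(W_code @ codes, 0)                        # ReLU
    return np.maximum(W_visit @ np.concatenate([code_level, demo]), 0)

def context_code_probs(visit_emb, W_out):
    # skip-gram-style target: codes expected in neighbouring visits
    return softmax(W_out @ visit_emb)

rng = np.random.default_rng(6)
n_codes, code_dim, demo_dim, visit_dim = 50, 16, 2, 12
codes = np.zeros(n_codes)
codes[[3, 17, 30]] = 1.0                     # one visit's codes (multi-hot)
demo = np.array([0.42, 1.0])                 # e.g. scaled age and sex
emb = embed_visit(codes, demo,
                  0.1 * rng.normal(size=(code_dim, n_codes)),
                  0.1 * rng.normal(size=(visit_dim, code_dim + demo_dim)))
probs = context_code_probs(emb, rng.normal(size=(n_codes, visit_dim)))
```

Once trained, the columns of the code-level weight matrix serve as the reusable medical-code embeddings mentioned above.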


Data-driven representation learning, as exemplified by Med2Vec, is appropriate when there is a lot of data. In other cases the performance of machine learning models can be boosted by incorporating expert domain knowledge. GRAM: Graph-based Attention Model for Healthcare Representation Learning (Choi et al.) describes such an approach, where an attention mechanism is applied to an ontology graph of medical concepts in order to construct representations of each of the leaf nodes. The leaf nodes correspond to medical concepts used in the EHRs, while the inner nodes represent more general concepts and are used as context for the representation vector. This process produces an embedding matrix which can then be used to embed a patient visit vector. The embeddings were trained end-to-end, together with the final classifier. When compared to other embedding techniques, GRAM provided the most advantage with little training data. Another interesting result was the comparison with embeddings learned from a random ontology graph (fake knowledge) – the classifiers based on those performed the worst among the group analysed, leading to the conclusion that if we are to incorporate domain knowledge, we had better make sure the knowledge is reliable!
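The attention over ontology ancestors can be sketched as below. The paper scores each leaf-ancestor pair with a small MLP; a single tanh layer with random weights stands in here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gram_embedding(leaf, ancestors, E, u, W):
    # final embedding of a leaf code = attention-weighted combination of its
    # own basic embedding and those of its ontology ancestors
    ids = [leaf] + ancestors
    scores = np.array([u @ np.tanh(W @ np.concatenate([E[leaf], E[i]]))
                       for i in ids])
    alpha = softmax(scores)
    return alpha @ E[ids], alpha       # convex combination of basic embeddings

rng = np.random.default_rng(8)
n_nodes, dim = 10, 6
E = rng.normal(size=(n_nodes, dim))    # basic embeddings, learned end-to-end
g, alpha = gram_embedding(leaf=0, ancestors=[7, 8, 9], E=E,
                          u=rng.normal(size=dim),
                          W=rng.normal(size=(dim, 2 * dim)))
```

With little training data the attention can lean on the ancestors' embeddings; with plenty of data it can concentrate on the leaf itself, which matches the result reported above.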


Patient subtyping involves seeking groups of patients with similar disease progression pathways based on longitudinal EHR data. Time elapsed between visits or admissions is significant in clinical decision making, yet LSTMs work on an implicit assumption of uniform time steps. To remove this restriction, Baytas et al. in Patient Subtyping via Time-Aware LSTM Networks devised a modified LSTM unit, the time-aware LSTM (T-LSTM). These units were then used in an autoencoder that learned patient representations, which could then be clustered to form the patient groups sought.
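A sketch of one T-LSTM step is below: the memory is split into short- and long-term parts, and only the short-term part is decayed by the elapsed time before the standard LSTM gating is applied. The `1/log(e + Δt)` decay is one heuristic choice, and the weights here are random:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def t_lstm_step(x, h, c, dt, P):
    # decompose the memory and decay only its short-term component
    c_short = np.tanh(P["Wd"] @ c)
    c_short_hat = c_short / np.log(np.e + dt)   # longer gap -> stronger decay
    c_adj = (c - c_short) + c_short_hat         # long-term part kept intact
    z = np.concatenate([x, h])                  # standard LSTM gates follow
    f, i, o = (sigmoid(P[k] @ z) for k in ("Wf", "Wi", "Wo"))
    g = np.tanh(P["Wg"] @ z)
    c_new = f * c_adj + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(9)
d_in, d_h = 5, 4
P = {"Wd": 0.3 * rng.normal(size=(d_h, d_h))}
for k in ("Wf", "Wi", "Wo", "Wg"):
    P[k] = 0.3 * rng.normal(size=(d_h, d_in + d_h))
h, c = np.zeros(d_h), np.zeros(d_h)
for dt in (1.0, 30.0, 400.0):           # irregular gaps between visits, in days
    h, c = t_lstm_step(rng.normal(size=d_in), h, c, dt, P)
```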

Drug Similarity and Graph CNN

Drug-drug interactions are rarely observed in clinical trials, since the number of drug combinations that can be tested in this setting is limited. Nor can the interactions be inferred directly from molecular structure, yet they can increase or decrease the action of a drug and have adverse effects. It is assumed that similar drugs may interact with the same drug, so similarity metrics can be useful in identifying unwanted interactions. Drug Similarity Integration Through Attentive Multi-view Graph Auto-Encoders by Ma et al. attempts to help by modelling drug similarity as a graph, using data from multiple sources: label side effects, off-label side effects as observed by clinicians, molecular structure, drug indications and others. The drugs are represented as nodes in a graph, with edges representing similarity. A graph CNN (GCN) is then used to construct a representation of each drug that includes information about its neighbourhood.
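A single graph-convolution layer over a drug-similarity graph might look like the generic GCN layer below (symmetric normalisation with self-loops); the paper's multi-view, attentive version is more involved, and the graph and features here are made up:

```python
import numpy as np

def gcn_layer(A, X, W):
    # one graph convolution: each drug aggregates its neighbours' features
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0)           # ReLU

rng = np.random.default_rng(10)
n_drugs, feat, hidden = 6, 8, 4
A = np.zeros((n_drugs, n_drugs))                   # similarity edges
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
H = gcn_layer(A, rng.normal(size=(n_drugs, feat)),
              rng.normal(size=(feat, hidden)))
```

Stacking such layers lets information propagate beyond immediate neighbours, so each drug's representation reflects a widening neighbourhood of similar drugs.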

Data Augmentation

We are talking generative adversarial networks, of course. As deep learning models require lots of data, which in healthcare is hard to obtain and comes with severe restrictions related to sensitive information, generation of medical data might open up new realms of applicability for deep learning models.


In Generating Multi-label Discrete Patient Records using Generative Adversarial Networks, Choi et al. took on the task of synthesising structured patient records, with the objectives that the data be statistically similar to the real records and that it not divulge individual patient information. For simplicity, instead of generating a sequence of records, the generator produced an aggregated patient vector, with counts of diagnoses stored in the dimensions corresponding to conditions. A pretrained autoencoder decoder was then used to convert the generator's continuous output into discrete vectors. In addition to statistical tests to confirm a reasonable distribution of the generated variables, a physician was asked to act as the discriminator and assess the “fakeness” of 50 real and 50 generated samples. The distribution of scores for the two sets was almost identical, with just a few outliers, where e.g. both male and female conditions were generated in a single record. As far as privacy preservation was concerned, an attribute disclosure attack model was assumed: someone knows certain facts about an acquaintance; how much more can they learn from the set? Statistical analysis showed that if someone knows 1% of a target patient's attributes, they can correctly estimate at most 10% of the positive unknown attributes, and the size of the synthetic dataset has little influence on the effectiveness of the attack.
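The generator-plus-pretrained-decoder pipeline can be sketched as below. The weights are random stand-ins; in the paper the autoencoder is pretrained on real records and the generator is trained adversarially against a discriminator:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def generate_records(n, W_gen, W_dec, rng):
    # the generator maps noise into the latent space of a pretrained autoencoder
    Z = rng.normal(size=(n, W_gen.shape[1]))
    latent = np.tanh(Z @ W_gen.T)
    # the autoencoder's decoder maps latent vectors to per-code probabilities,
    # which are then rounded into discrete patient records
    probs = sigmoid(latent @ W_dec.T)
    return (probs > 0.5).astype(int)

rng = np.random.default_rng(11)
noise_dim, latent_dim, n_codes = 16, 8, 40
records = generate_records(5,
                           rng.normal(size=(latent_dim, noise_dim)),
                           rng.normal(size=(n_codes, latent_dim)),
                           rng)
```

Pushing the discreteness into a pretrained decoder is the trick that lets the GAN itself stay fully continuous and differentiable.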


The final paper reviewed was Real-Valued (Medical) Time Series Generation with Recurrent Conditional GANs by Hyland, Esteban and Rätsch, which describes the use of an LSTM GAN for regularly timestamped time series – for example those seen in ICU sensor readings. The authors trained a random forest classifier on both synthetic and real data, and evaluated its performance on real test data, observing that the generated data could be used for training with only a minor performance penalty. They could not, however, reject the null hypothesis that the GAN had not memorised the training data.
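The train-on-synthetic, test-on-real evaluation can be sketched as below. A nearest-centroid classifier on made-up Gaussian features stands in for the paper's random forest on generated series:

```python
import numpy as np

def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.asarray(classes)[dists.argmin(axis=0)]

rng = np.random.default_rng(12)
# hypothetical "real" and GAN-"synthetic" features for a two-class task
real_X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
real_y = np.repeat([0, 1], 50)
synth_X = np.vstack([rng.normal(0.2, 1, (50, 4)), rng.normal(2.8, 1, (50, 4))])
synth_y = real_y.copy()

# train on synthetic data, test on held-back real data
tstr_acc = (predict(fit_centroids(synth_X, synth_y), real_X) == real_y).mean()
```

If the synthetic data captures the real distribution, accuracy under this protocol stays close to that of a classifier trained on real data, which is the "minor performance penalty" observed in the paper.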

* * *

Tutorial website: http://dl4health.org/