Medical imaging: playing with the ChestX-ray14 dataset

I recently had the chance to work with the ChestX-ray14 image dataset [1], consisting of 112,120 frontal X-ray images from 30,805 unique patients and 14 different thoracic disease labels. The dataset is imbalanced (e.g. 60,361 examples are associated with “No Findings”). Such imbalances appear to be common in the medical imaging domain and have driven research into addressing the issue through augmentation techniques, most recently using GANs [2].

The classification task is multi-label with each X-ray image labeled with 0 or more diseases, as opposed to a multi-class task where labels are mutually exclusive. One can learn more about multi-label classification in this tutorial.
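To make the distinction concrete, here is a minimal NumPy sketch. The logits and the four-label setup are made up for illustration and do not come from the exercise. A multi-label head scores every label independently with a sigmoid, while a multi-class head makes the labels compete through a softmax.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; each label is scored independently."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax over the last axis; scores compete and sum to 1."""
    shifted = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for one image over four hypothetical labels.
logits = np.array([2.0, -1.0, 1.5, -3.0])

multi_label_probs = sigmoid(logits)  # any number of labels can be "on"
multi_class_probs = softmax(logits)  # probabilities sum to 1, one winner

predicted_labels = (multi_label_probs >= 0.5).astype(int)  # multi-hot prediction
predicted_class = int(np.argmax(multi_class_probs))        # single class index
```

With these logits the multi-label head switches on the first and third labels at a 0.5 threshold, whereas the multi-class head can only pick the first one.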

source code

Objective

The objective of the exercise was to train a number of multi-label classifiers on the entire ChestX-ray14 dataset and compare the results with those presented in Wang et al. 2017.

Data Analysis

For EDA on the ChestX-ray14 dataset, check out the good work done in a Kaggle kernel.

For this exercise I used GCP (Google Cloud Platform) for storage and training. Data and metadata can be found here. In addition to storage, appropriate compute was necessary, so a GCP Deep Learning VM was used for pre-processing and training. \(\textbf{Note}\): don't forget to check “Install NVIDIA GPU driver automatically on first startup?” and to select the appropriate image. (When using TensorFlow you may run into CUDA version issues.) I ended up using an image with CUDA 9.0. Training was done on 1 NVIDIA Tesla P100 and 16 CPUs with 104GB of memory in total.

Preprocess

Converted string labels, e.g. Effusion Emphysema Infiltration Pneumothorax, to multi-hot encodings.
Converted raw images and their associated labels into TFRecords.
Standardized the images, subtracting mean and dividing by the standard deviation on a per image basis.
Resized images to dimensions 224x224x3.
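
The label encoding and per-image standardization steps above can be sketched in NumPy as follows. This is an illustration, not the code used in the exercise. The label ordering, the `|` separator, the `Pleural_Thickening` spelling and the helper names `multi_hot` and `standardize` are all assumptions. TFRecord serialization and resizing are left out.

```python
import numpy as np

# The 14 disease labels; this ordering is an arbitrary, assumed choice.
DISEASES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax",
]
LABEL_INDEX = {name: i for i, name in enumerate(DISEASES)}

def multi_hot(label_string, sep="|"):
    """Turn a separator-delimited label string into a 14-dim 0/1 vector.

    Names that are not in DISEASES (e.g. a "no finding" marker) are skipped,
    so an image without any disease maps to the all-zeros vector.
    The separator is an assumption; pass whatever the metadata file uses.
    """
    vec = np.zeros(len(DISEASES), dtype=np.float32)
    for name in label_string.split(sep):
        idx = LABEL_INDEX.get(name.strip())
        if idx is not None:
            vec[idx] = 1.0
    return vec

def standardize(image, eps=1e-8):
    """Per-image standardization: subtract the image mean, divide by its std."""
    image = image.astype(np.float32)
    return (image - image.mean()) / (image.std() + eps)

# Usage with a made-up label string and a random stand-in for an X-ray.
labels = multi_hot("Effusion|Emphysema|Infiltration|Pneumothorax")
fake_image = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3))
normalized = standardize(fake_image)
```

Skipping unknown names keeps the encoding at exactly 14 dimensions, with images that carry no disease label represented by all zeros.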

Architectures

A base case simple CNN.
A pre-trained ResNet-v2-50 used as fixed feature extractor, with outputs fed into 2 fully connected layers. (Backprop only through the FCs)
An ensemble of feature extractors with outputs put through a transition layer before applying the add operator. Resulting vectors are passed through 2 fully connected layers.

Feature extraction was done using pre-trained models found at TensorFlow Hub.
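
As a rough illustration of the ensemble's forward pass, the sketch below uses plain NumPy with random arrays standing in for the outputs of the frozen feature extractors. The number of extractors, all layer widths, the ReLU activations and the weight initialisation are assumptions made for the example. The source only specifies transition layers, an add operator and 2 fully connected layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """Affine layer: x @ w + b."""
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init(in_dim, out_dim):
    """Small random weights and zero bias (illustrative initialisation)."""
    return rng.normal(0.0, 0.01, size=(in_dim, out_dim)), np.zeros(out_dim)

# Hypothetical sizes: two extractors with different output widths.
FEATURE_DIMS = [2048, 1024]
COMMON_DIM, HIDDEN_DIM, NUM_LABELS = 512, 256, 14

transitions = [init(d, COMMON_DIM) for d in FEATURE_DIMS]
fc1 = init(COMMON_DIM, HIDDEN_DIM)
fc2 = init(HIDDEN_DIM, NUM_LABELS)

def ensemble_forward(features):
    """features: list of (batch, dim_i) arrays, one per frozen extractor."""
    # Transition layers map every extractor's output to a common width ...
    projected = [relu(dense(f, w, b)) for f, (w, b) in zip(features, transitions)]
    # ... so the vectors can be combined with an element-wise add.
    merged = np.sum(projected, axis=0)
    hidden = relu(dense(merged, *fc1))
    return sigmoid(dense(hidden, *fc2))  # one independent probability per label

# Random stand-ins for extracted features of a batch of 4 images.
batch = [rng.normal(size=(4, d)) for d in FEATURE_DIMS]
probs = ensemble_forward(batch)  # shape (4, 14)
```

The transition layers are what make the add possible: extractors with different output widths are first projected to the same dimension.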

Evaluation

For evaluation, the AUC ROC metric was used, as in Wang et al. Google's machine-learning crash course does a good job of explaining ROC and AUC ROC. Further, though not applied in this exercise, measuring accuracy for multi-label classification requires a different set of metrics. [4,5,6]
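
For reference, a per-label AUC ROC can be computed as in the NumPy sketch below. It uses the pairwise definition of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The function names and toy data are illustrative, not the evaluation code from the exercise.

```python
import numpy as np

def auc_roc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when only one class is present
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

def per_label_auc(Y_true, Y_score):
    """One AUC per label column of (num_examples, num_labels) arrays."""
    Y_true, Y_score = np.asarray(Y_true), np.asarray(Y_score)
    return [auc_roc(Y_true[:, j], Y_score[:, j]) for j in range(Y_true.shape[1])]

# Toy example: 4 images, 2 labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
aucs = per_label_auc(Y_true, Y_score)  # [1.0, 0.75]
```

The pairwise form is easy to verify by hand but needs memory proportional to positives times negatives. On the full dataset a rank-based implementation, such as scikit-learn's `roc_auc_score`, would be the more practical choice.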

Results

\[\begin{array}{lrrr} \hline \textbf{Disease} & \textbf{ResNetv2-50 FE AUC} & \textbf{Ensemble AUC} & \textbf{Wang et al.} \\ \hline Cardiomegaly & 0.6770 & 0.7980 & \textbf{0.8100} \\ Emphysema & 0.7300 & 0.7950 & \textbf{0.8330} \\ Effusion & 0.5710 & 0.6550 & \textbf{0.7585} \\ Hernia & 0.6590 & 0.6930 & \textbf{0.8717} \\ Nodule & 0.7210 & \textbf{0.7510} & 0.6687 \\ Pneumothorax & 0.5210 & 0.6960 & \textbf{0.7993} \\ Atelectasis & 0.6090 & \textbf{0.7920} & 0.7003 \\ Pleural Thickening & 0.6440 & 0.6660 & \textbf{0.6835} \\ Mass & 0.7720 & \textbf{0.8420} & 0.6933 \\ Edema & 0.6390 & 0.6820 & \textbf{0.8052} \\ Consolidation & 0.7630 & \textbf{0.8210} & 0.7032 \\ Infiltration & 0.6130 & \textbf{0.7060} & 0.6614 \\ Fibrosis & 0.6950 & 0.7480 & \textbf{0.7859} \\ Pneumonia & 0.6640 & \textbf{0.7200} & 0.6580 \\ \hline \end{array}\]

Training with ResNet-v2-50 as a feature extractor takes about 50 minutes.
Training the ensemble of feature extractors takes about 90 minutes for 10 epochs. The reported results use early stopping after 5 epochs.

Further Studies

Address class imbalance using example-weighted neural network training.
Use data augmentation to increase sample size as well as address class imbalances.
Integrate more features (e.g. age, gender, etc.) as an embedding and concatenate it with the encoded images after feature extraction.
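
As one possible starting point for the weighting idea, the sketch below up-weights the positive entries of each label in a binary cross-entropy loss by that label's negative-to-positive ratio. This per-label positive weighting is a swapped-in scheme chosen for illustration, not necessarily the example weighting intended above. The function names and toy labels are hypothetical.

```python
import numpy as np

def positive_weights(Y_true, eps=1.0):
    """Per-label weight = #negatives / #positives, so rare diseases count more."""
    Y_true = np.asarray(Y_true, dtype=float)
    n_pos = Y_true.sum(axis=0)
    n_neg = Y_true.shape[0] - n_pos
    return n_neg / np.maximum(n_pos, eps)  # eps avoids division by zero

def weighted_bce(Y_true, probs, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with positive entries up-weighted per label."""
    Y_true = np.asarray(Y_true, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    loss = -(pos_weight * Y_true * np.log(probs)
             + (1.0 - Y_true) * np.log(1.0 - probs))
    return float(loss.mean())

# Toy labels: each of the 2 labels is positive in 1 of 4 examples.
Y_true = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
weights = positive_weights(Y_true)          # 3 negatives per positive
uninformed = np.full(Y_true.shape, 0.5)     # a model that always predicts 0.5
loss = weighted_bce(Y_true, uninformed, weights)
```

With these weights, a missed positive for a rare label costs as much as several misclassified negatives, which counteracts the pull towards always predicting "no disease".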

Acknowledgements

The training environment was based on code examples found at cs230-stanford. This is one of the better starting points I have come across, and it also walks through best practices on data pipelines and reproducibility. Note that build_dataset.py was heavily modified in my use case for use with GCP storage and TFRecords, as were input_fn.py and train.py to work with the multi-label task.

References

[1] https://nihcc.app.box.com/v/ChestXray-NIHCC/file/256057377774
[2] https://github.com/xinario/awesome-gan-for-medical-imaging
[3] http://lpis.csd.auth.gr/publications/tsoumakas-ijdwm.pdf
[4] https://stats.stackexchange.com/questions/12702/what-are-the-measure-for-accuracy-of-multilabel-data
[5] https://stackoverflow.com/questions/37746670/tensorflow-multi-label-accuracy-calculation
[6] https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff