S.Dasgupta Big Data Energy:

Diabetic Retinopathy Disease Detection

Background

Diabetic retinopathy (DR) is a disease that impacts millions of people every year. About one-third of people with diabetes will develop diabetic retinopathy, and the disease can present with a variety of phenotypes depending on severity.

There are two forms of diabetic retinopathy: non-proliferative and proliferative. The non-proliferative form is characterized by tiny vessel leaks that result in macular edema and ischemia. The disease can progress to its more advanced, proliferative stage, in which new vessels form and can leak into the vitreous humor, impairing vision. More details about DR can be found here

Diabetic Retinopathy

Objective

An important part of DR diagnosis is the evaluation of fundus images, or pictures of the patient’s retina. A physician examines these images for phenotypes that can indicate disease.

The objective for this project was to develop a machine learning model to classify a patient as having or not having DR based on features from the fundus image. This would automate the diagnosis process and make the evaluation less prone to subjectivity, based on the thresholds set.

Methodology

The data for this project was acquired from the UCI Machine Learning Repository

PostgreSQL on AWS was used to store data for the project. Code for database creation can be found here

Initial Analyses

Numerous classification models were assessed for this dataset, with the target variable being “Class”: 1 if the patient has DR, 0 if the patient does not. Code for data cleaning and exploratory analyses can be found here
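As a rough illustration of comparing several classifiers on a binary target, the sketch below scores three scikit-learn models by cross-validated ROC-AUC. It is not the project's code: the data is synthetic (generated with `make_classification`), and the choice of candidate models, the 5-fold split, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real feature table; y plays the role of "Class".
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical shortlist of candidate models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ROC-AUC over 5 cross-validation folds for each candidate.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

best_model = max(auc_by_model, key=auc_by_model.get)
for name, auc in auc_by_model.items():
    print(f"{name}: ROC-AUC = {auc:.3f}")
print("highest ROC-AUC:", best_model)
```

Cross-validation is used here so that the comparison does not hinge on a single train/test split; a held-out test set would be an equally reasonable choice.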

Results

The model selection process involved comparing ROC-AUC scores across multiple supervised learning models, with the Random Forest classifier yielding the highest value. Accuracy for the chosen model was around 63%, and the F-score was calculated with beta=1.5 to weight recall more heavily than precision: from a physician's perspective, falsely predicting DR is preferable to missing a diagnosis. Although false positives mean more resources are spent on re-testing, missing a patient with potential disease could be the difference between visual impairment and preserved vision.
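To make the effect of beta concrete, the F-beta score is (1 + beta²) · precision · recall / (beta² · precision + recall), so any beta above 1 gives recall more weight. The sketch below computes it in plain Python on a small made-up set of labels; the function name and the example labels are illustrative only and are not taken from the project.

```python
def fbeta(y_true, y_pred, beta=1.5):
    """F-beta score for binary labels (1 = DR, 0 = no DR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up example: 3 true positives, 2 false positives, 1 false negative,
# so precision = 0.60 and recall = 0.75.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

f1 = fbeta(y_true, y_pred, beta=1.0)
f15 = fbeta(y_true, y_pred, beta=1.5)
print(f"F1 = {f1:.3f}, F1.5 = {f15:.3f}")
```

Because recall exceeds precision in this example, the beta=1.5 score comes out higher than the F1 score. scikit-learn's `fbeta_score` provides the same metric for real use.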

ROC/AUC

Code for the chosen model and metrics can be found here

Feature importance graphs showed the relative contribution of each feature to the model's predictions.
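For readers unfamiliar with how such a ranking is obtained, a fitted scikit-learn `RandomForestClassifier` exposes impurity-based scores through its `feature_importances_` attribute. The sketch below is a minimal example on synthetic data; the feature names are placeholders, not the dataset's actual columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the fundus-image features.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores describe how much each feature contributed to the forest's splits, which is not the same as its correlation with the target; permutation importance is a common cross-check.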

Feature Importance

Conclusions

The presence of microaneurysms and exudates was predictive of diabetic retinopathy, and the model categorized a majority of the samples correctly. Models like this can improve greatly with larger data sets, and adjusting the classification thresholds can also affect accuracy. In the future, incorporating other measurements such as BMI and blood glucose, or factors like gender and age, could help create a more comprehensive and predictive model.
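The role of the categorization threshold can be illustrated with a small sketch: a classifier outputs a probability of DR, and the cutoff applied to that probability determines how many true cases are caught. The probabilities, labels, and function names below are invented for the example and do not come from the project's model.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample as DR (1) when its predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def recall(y_true, y_pred):
    """Fraction of actual DR cases that were predicted as DR."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented labels and predicted probabilities of DR.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.20, 0.10]

recall_default = recall(y_true, predict_with_threshold(probabilities, 0.5))
recall_lowered = recall(y_true, predict_with_threshold(probabilities, 0.3))
print(f"recall at 0.5: {recall_default:.2f}")
print(f"recall at 0.3: {recall_lowered:.2f}")
```

In this toy example, lowering the threshold catches more DR cases at the cost of more false positives, which mirrors the recall-over-precision preference described in the results.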

Review presentation slides here