This project was primarily part of my final-year undergraduate thesis.
The aim of this project was to predict diseases from clinical reports.
We used an artificial dataset from http://www.emrbots.org/.
The main part of the project was cleaning and structuring the data so it could be fed into Azure ML, which we used to train our initial models.
Once we were confident in the results on the test data, we moved the solution to the scikit-learn package in Python.
The dataset contained around 30 features, with the ICD-10 disease code as the prediction target.
We aimed to predict the domain of the disease rather than the disease itself, since the data was artificial and the dataset was not balanced across all classes.
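
As a rough illustration of collapsing disease codes into domains, here is a minimal sketch that assumes a "domain" corresponds to an ICD-10 chapter. That assumption, the `ICD10_CHAPTERS` table, and the `icd10_domain` helper are hypothetical and not taken from the project. The ranges follow the WHO ICD-10 chapter headings, and national variants such as ICD-10-CM differ slightly at some boundaries.

```python
# Hypothetical lookup table: (first category, last category, domain label).
# Ranges follow the WHO ICD-10 chapter headings; the labels are shortened.
ICD10_CHAPTERS = [
    ("A00", "B99", "Infectious and parasitic diseases"),
    ("C00", "D48", "Neoplasms"),
    ("D50", "D89", "Blood and immune disorders"),
    ("E00", "E90", "Endocrine, nutritional and metabolic diseases"),
    ("F00", "F99", "Mental and behavioural disorders"),
    ("G00", "G99", "Nervous system"),
    ("H00", "H59", "Eye and adnexa"),
    ("H60", "H95", "Ear and mastoid process"),
    ("I00", "I99", "Circulatory system"),
    ("J00", "J99", "Respiratory system"),
    ("K00", "K93", "Digestive system"),
    ("L00", "L99", "Skin and subcutaneous tissue"),
    ("M00", "M99", "Musculoskeletal system"),
    ("N00", "N99", "Genitourinary system"),
    ("O00", "O99", "Pregnancy and childbirth"),
    ("P00", "P96", "Perinatal conditions"),
    ("Q00", "Q99", "Congenital malformations"),
    ("R00", "R99", "Symptoms and abnormal findings"),
    ("S00", "T98", "Injury and poisoning"),
    ("V01", "Y98", "External causes"),
    ("Z00", "Z99", "Factors influencing health status"),
]


def icd10_domain(code: str) -> str:
    """Map an ICD-10 code such as 'E11.9' to a coarse domain label."""
    # The first three characters (letter + two digits) identify the category.
    category = code.strip().upper()[:3]
    for first, last, domain in ICD10_CHAPTERS:
        # Plain string comparison works because categories share one format.
        if first <= category <= last:
            return domain
    return "Unknown"


if __name__ == "__main__":
    for example in ["E11.9", "I10", "J45.909"]:
        print(example, "->", icd10_domain(example))
```

Grouping by chapter reduces hundreds of sparse disease labels to about twenty broader classes, which is one way to ease the class imbalance mentioned above.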
In the end, we predicted disease domains from the clinical data with around 78% accuracy using a random forest classifier.
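
The following is a minimal sketch of this kind of model in scikit-learn, not the project's actual pipeline. It runs on synthetic data from `make_classification`, with 30 features and 5 imbalanced classes standing in for the cleaned clinical features and disease domains. The hyperparameters (`n_estimators=200`, `class_weight="balanced"`) and the 80/20 split are assumptions chosen for illustration, and the accuracy it prints says nothing about the 78% figure reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset: 30 features, 5 imbalanced classes.
X, y = make_classification(
    n_samples=2000,
    n_features=30,
    n_informative=12,
    n_classes=5,
    weights=[0.4, 0.25, 0.15, 0.12, 0.08],
    random_state=0,
)

# Stratify so every class appears in both splits despite the imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

With imbalanced classes, overall accuracy can hide weak performance on rare domains, so per-class metrics such as those from `sklearn.metrics.classification_report` are worth checking alongside it.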
Further, we used the existing timestamped data to find similarities in the dataset and predict potential disease occurrences for a patient 1, 3, and 5 years later.