Predicting Covid-19 Cases and Deaths with Machine Learning
For my first-year introductory Machine Learning course, my team and I built several machine learning models to predict
Covid-19-related cases and deaths for any given country.
Link to Project Report
Overview:
To accomplish our task, we collected time-series data from the Johns Hopkins Coronavirus
Resource Center (direct link to data
here), Oxford University's Covid-19 Government Response Tracker (direct link to data
here), and selected Covid-19 relevant
health and economic datasets from the World Bank Open
Data platform. The Johns Hopkins data provided daily national-level statistics on confirmed Covid-19 cases, recoveries, and deaths.
The Oxford Government Response Tracker provided daily statistics on countries' policy measures to contain
the spread of the virus. These policy measures comprise eight containment and closure measures, four economic measures,
and five health system measures. Each is either categorical or continuous, and they include school and workplace closures, stay-at-home
requirements, Covid testing, and contact tracing. The World Bank datasets are national-level statistics from the latest year available and
include GDP, life expectancy at birth, physicians per 1,000 individuals, diabetes prevalence, and health expenditure per capita.
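As a rough illustration of how the three sources fit together, the sketch below attaches static country-level indicators to each daily country-day row. It is not the project's actual pipeline: the field names, the helper `attach_static_indicators`, and all values are made up for the example, and dropping countries without indicator data is an assumption.

```python
# Toy illustration only: every value and field name below is made up.
daily_rows = [
    {"country": "A", "date": "2020-04-01", "confirmed": 100, "deaths": 2, "school_closing": 3},
    {"country": "A", "date": "2020-04-02", "confirmed": 130, "deaths": 3, "school_closing": 3},
    {"country": "B", "date": "2020-04-01", "confirmed": 40, "deaths": 0, "school_closing": 1},
]

# One record per country, standing in for the "latest year available" indicators.
static_indicators = {
    "A": {"gdp_per_capita": 30000.0, "physicians_per_1000": 3.9},
    "B": {"gdp_per_capita": 12000.0, "physicians_per_1000": 1.4},
}


def attach_static_indicators(rows, indicators):
    """Return new rows with the country's static indicators copied onto each day."""
    merged = []
    for row in rows:
        country_stats = indicators.get(row["country"])
        if country_stats is None:
            continue  # assumption: skip countries that have no indicator data
        merged.append({**row, **country_stats})
    return merged


country_days = attach_static_indicators(daily_rows, static_indicators)
```

Each resulting row is one country-day observation carrying both the daily measurements and the static indicators, which is the shape a tabular regression model expects.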
Methodology:
We used regression models to predict new cases and deaths 12 days into the future, given government policy responses and
health and economic indicators. The regression models were linear, ridge, lasso, decision tree, and random forest regression.
We also tested treating the problem as classification by fitting a Multilayer Perceptron neural network that assigns each future
country-day observation to one of the discrete case counts previously seen in the data. Because we are working with time-series data,
we used temporal holdouts to train our models and Time Series Nested Cross Validation to validate them. For more details about our
methodology, see the Machine Learning and Details of Solution section in our project report.
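The two time-series ideas above, a 12-day prediction horizon and validation folds that never look into the future, can be sketched in plain Python. This is a minimal sketch under assumptions, not the code from the project: the helper names, the fold sizes, and the toy series are hypothetical, and only the outer expanding-window loop is shown. A nested scheme would repeat the same kind of split inside each training window to choose hyperparameters.

```python
HORIZON = 12  # days ahead, as stated in the methodology


def make_supervised_pairs(series, horizon=HORIZON):
    """Pair each day index t with the value observed at day t + horizon."""
    return [(t, series[t + horizon]) for t in range(len(series) - horizon)]


def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs.

    Every test block lies strictly after all of its training data, and the
    training window grows by one test block per fold.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Made-up daily case counts for one country (40 days).
cases = [10 * day for day in range(40)]

pairs = make_supervised_pairs(cases)
splits = list(expanding_window_splits(len(pairs), n_folds=3, test_size=5))
```

Splitting by time rather than at random keeps future observations out of the training data, which is the point of a temporal holdout.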
Evaluation and Results:
We used the Root Mean Squared Error (RMSE) metric to evaluate our models for both outputs, confirmed cases and deaths. As is clear from both charts
below, Decision Tree Regression and Random Forest Regression outperformed all other models by a significant margin.
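For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target (cases or deaths). The sketch below computes it with the standard library only. The function name `rmse` and the sample numbers are illustrative and are not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: every prediction is off by exactly 10 cases.
actual = [100, 150, 210]
predicted = [110, 140, 200]
score = rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, so a lower score means predictions that stay closer to the observed counts.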
Project code can be found here.
Project teammates: Diego Diaz and Piyush Tank
Data Viz & Extras:
Government Response Index vs Confirmed Cases
Predicting Confirmed Cases in Spain using Linear Regression and Neural Network
Model predictions sometimes varied greatly within a single country, as with Austria
Example countries that successfully contained the virus with varying Policy Stringency Indices