Predicting Covid-19 cases and deaths with machine learning
Using government policy measures and national health & economic indicators to forecast Covid-19 cases and deaths across countries, 12 days out.
For my first-year introductory Machine Learning course, my team and I built several machine learning models to predict Covid-19-related cases and deaths for any given country. Project Report (PDF) →
Overview
We collected time-series data from the Johns Hopkins Coronavirus Resource Center (direct link to data here) and Oxford University's Covid-19 Government Response Tracker (direct link here), plus a selection of Covid-19-relevant health and economic datasets from the World Bank Open Data platform.
The Johns Hopkins data provided us with daily national-level statistics on confirmed Covid-19 cases, recoveries, and deaths. The Oxford Government Response Tracker provided daily data on countries' policy measures to contain the spread of the virus: eight containment and closure measures, four economic measures, and five health-system measures. Each measure is either categorical or continuous; examples include school and workplace closing, stay-at-home requirements, Covid testing, and contact tracing. The World Bank datasets are national-level statistics from the latest year available: GDP, Life Expectancy at Birth, Physicians per 1,000 individuals, Diabetes Prevalence, and Health Expenditure per capita.
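To make the shape of the combined dataset concrete, here is a minimal sketch of how the three sources could be joined into one country-day table. It is not the project's actual code: the column names (`country`, `date`, `school_closing`, and so on) and every value in the toy frames are placeholders invented for illustration, and the real files use their own schemas.

```python
import pandas as pd

days = pd.to_datetime(["2020-04-01", "2020-04-02"])

# Toy stand-ins for the three sources; all values are made up.
cases = pd.DataFrame({
    "country": ["Spain", "Spain", "Austria", "Austria"],
    "date": list(days) * 2,
    "confirmed": [100, 120, 50, 55],
    "deaths": [5, 7, 1, 1],
})
policy = pd.DataFrame({
    "country": ["Spain", "Spain", "Austria", "Austria"],
    "date": list(days) * 2,
    "school_closing": [3, 3, 2, 2],   # categorical policy level
    "stay_at_home": [2, 2, 1, 1],
})
indicators = pd.DataFrame({
    "country": ["Spain", "Austria"],
    "life_expectancy": [1.0, 2.0],     # placeholder numbers, not real statistics
    "physicians_per_1000": [3.0, 4.0],
})


def build_panel(cases, policy, indicators):
    """Join daily sources on (country, date) and static indicators on country."""
    panel = cases.merge(policy, on=["country", "date"], how="inner")
    # The World Bank figures are one row per country, so they repeat on every day.
    panel = panel.merge(indicators, on="country", how="left")
    return panel.sort_values(["country", "date"]).reset_index(drop=True)


panel = build_panel(cases, policy, indicators)
print(panel)
```

The inner join keeps only country-days present in both daily sources, while the left join repeats each country's static indicators on every one of its rows.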
Methodology
We used regression models to predict new cases and deaths 12 days into the future, given each country's government policy response and its health and economic indicators. The regression models were linear, ridge, lasso, decision tree, and random forest regression. We also tested treating the problem as one of classification: we fit a Multilayer Perceptron neural network to the data, assigning each future country-day observation to one of the discrete case counts already observed in the data. Because we are working with time-series data, we trained on temporal holdouts (fitting on earlier dates and holding out later ones) and validated our models with time-series nested cross-validation.
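The two time-series ideas in this section, a 12-day-ahead target and date-ordered holdouts, can be sketched as follows. This is an illustrative sketch under assumptions, not the project's implementation: the helper names `add_future_target` and `forward_chaining_folds`, the column names, and the synthetic two-country panel are all hypothetical. It also covers only the outer, forward-chaining loop; a fully nested scheme would additionally tune hyperparameters inside each training portion.

```python
import pandas as pd

HORIZON = 12  # forecast horizon in days, as stated in the write-up


def add_future_target(df, column, horizon=HORIZON):
    """Attach to each country-day row the value of `column` `horizon` days later."""
    out = df.sort_values(["country", "date"]).copy()
    # Shift within each country so one country's future never fills another's rows.
    out["target"] = out.groupby("country")[column].shift(-horizon)
    # The last `horizon` days of each country have no known future value.
    return out.dropna(subset=["target"]).reset_index(drop=True)


def forward_chaining_folds(df, n_folds=3):
    """Yield (train, test) frames; every test date is later than every train date."""
    dates = sorted(df["date"].unique())
    block = len(dates) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = dates[k * block - 1]
        test_end = dates[min((k + 1) * block, len(dates)) - 1]
        train = df[df["date"] <= train_end]
        test = df[(df["date"] > train_end) & (df["date"] <= test_end)]
        yield train, test


# Synthetic two-country panel, purely for illustration.
dates = pd.date_range("2020-03-01", periods=40, freq="D")
panel = pd.concat(
    [
        pd.DataFrame({
            "country": name,
            "date": dates,
            "stringency": [i % 4 for i in range(40)],
            "new_cases": [scale * i for i in range(40)],
        })
        for name, scale in [("A", 10.0), ("B", 3.0)]
    ],
    ignore_index=True,
)

data = add_future_target(panel, "new_cases")
folds = list(forward_chaining_folds(data))
for train, test in folds:
    print(len(train), "training rows up to", train["date"].max().date(),
          "|", len(test), "test rows from", test["date"].min().date())
```

Splitting on dates rather than row positions matters for a multi-country panel: it keeps every country's later observations out of the training portion of each fold. Any of the regressors named above could then be fit on `train` and scored on `test` within the loop.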
Evaluation & Results
We used the Root Mean Squared Error (RMSE) metric to evaluate our models on both outputs, confirmed cases and deaths. As the chart below shows, Decision Tree Regression and Random Forest Regression outperformed all other models by a wide margin, with markedly lower RMSE.
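For reference, RMSE is the square root of the mean squared difference between predicted and observed values, so it is expressed in the same units as the target (cases or deaths). Below is a minimal standard-library sketch with made-up numbers; the function name `rmse` and the sample values are illustrative only and are not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: the errors are 1 and 7, so the mean squared error is (1 + 49) / 2 = 25.
observed = [10, 20]
predicted = [11, 27]
score = rmse(observed, predicted)
print(score)  # 5.0
```

Because errors are squared before averaging, the single error of 7 dominates the score here. This is worth keeping in mind when comparing models across countries with very different case counts.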
Project code can be found here. Project teammates: Diego Diaz and Piyush Tank.
Data viz & extras
Government response index vs. confirmed cases
Predicting confirmed cases in Spain (LR vs. NN)
When models diverge — Austria