Project 06 June 2020 Machine Learning · Time Series

Predicting Covid-19 cases and deaths with machine learning

Using government policy measures and national health & economic indicators to forecast Covid-19 cases and deaths across countries, 12 days out.


For my first-year introductory machine learning course, my team and I built several machine learning models to predict Covid-19-related cases and deaths for any given country. Project Report (PDF) →

Overview

We collected time-series data from the Johns Hopkins Coronavirus Resource Center (direct link to data here) and Oxford University's Covid-19 Government Response Tracker (direct link here), along with selected Covid-19-relevant health and economic datasets from the World Bank Open Data platform.

The Johns Hopkins data provided daily national-level statistics on confirmed Covid-19 cases, recoveries, and deaths. The Oxford Government Response Tracker provided daily statistics on countries' policy measures to contain the spread of the virus: eight containment and closure measures, four economic measures, and five health-system measures. Each measure is either categorical or continuous; examples include school and workplace closing, stay-at-home requirements, Covid testing, and contact tracing. The World Bank datasets are national-level statistics from the latest year available: GDP, Life Expectancy at Birth, Physicians per 1,000 individuals, Diabetes Prevalence, and Health Expenditure per capita.
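As a rough illustration of how these three sources could be combined into one country-day table, here is a minimal sketch in plain Python. The field names (`new_cases`, `school_closing`, `life_expectancy`) and all values are placeholders made up for the example, not the project's actual schema or data.

```python
# Hypothetical daily records keyed by (country, date); all values are made up.
daily_cases = {
    ("Country A", "2020-04-01"): {"new_cases": 100, "new_deaths": 5},
    ("Country A", "2020-04-02"): {"new_cases": 120, "new_deaths": 6},
    ("Country B", "2020-04-01"): {"new_cases": 40, "new_deaths": 1},
}
policy = {
    ("Country A", "2020-04-01"): {"school_closing": 3, "stay_at_home": 2},
    ("Country A", "2020-04-02"): {"school_closing": 3, "stay_at_home": 2},
}
# Static national indicators are keyed by country only (one value per country).
indicators = {"Country A": {"life_expectancy": 80.0, "physicians_per_1000": 3.0}}


def build_panel(daily_cases, policy, indicators):
    """Inner-join the two daily sources on (country, date), then attach the
    static per-country indicators to every resulting row."""
    rows = []
    for key in sorted(daily_cases.keys() & policy.keys()):
        country, date = key
        if country not in indicators:
            continue  # skip countries with no national indicators
        row = {"country": country, "date": date}
        row.update(daily_cases[key])
        row.update(policy[key])
        row.update(indicators[country])
        rows.append(row)
    return rows


panel = build_panel(daily_cases, policy, indicators)
```

Because the World Bank figures are one value per country, they repeat on every daily row; only the case counts and policy measures vary over time.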

Methodology

We used regression models to predict new cases and deaths 12 days into the future, given government policy responses and health and economic indicators. The regression models were linear, ridge, lasso, decision tree, and random forest regression. We also tested framing the problem as classification: we fit a multilayer perceptron neural network that assigns each future country-day observation to one of the discrete case counts already observed in the data. Because the data are time series, we trained on temporal holdouts, so each model is fit on earlier dates and tested on later ones, and we validated the models with time-series nested cross-validation.
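The 12-day horizon and the temporal splitting can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's code: the helper names `make_supervised` and `expanding_window_splits`, the toy series, and the split sizes are all hypothetical.

```python
HORIZON = 12  # days ahead, as described in the text


def make_supervised(series, horizon=HORIZON):
    """Pair each day's features with the target observed `horizon` days later.

    `series` is a date-ordered list of (features, target) tuples for ONE
    country; the last `horizon` days have no future target and are dropped.
    """
    pairs = []
    for t in range(len(series) - horizon):
        features_today, _ = series[t]
        _, target_future = series[t + horizon]
        pairs.append((features_today, target_future))
    return pairs


def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in which every test block lies
    strictly after its training block, so no future data leaks into training."""
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Toy series: 30 days, feature = [day index], target = day index * 10.
series = [([day], day * 10) for day in range(30)]
pairs = make_supervised(series)
splits = list(expanding_window_splits(len(pairs), n_splits=3, test_size=4))
```

In a nested scheme, the same splitter would be applied a second time inside each training block to choose hyperparameters, leaving the outer test blocks for the final error estimate. scikit-learn's `TimeSeriesSplit` provides comparable expanding-window splits.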

Evaluation & Results

We evaluated the models with Root Mean Squared Error (RMSE) for both outputs: confirmed cases and deaths. As the chart below shows, decision tree regression and random forest regression outperformed all other models by a wide margin.
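For reference, RMSE is the square root of the mean squared difference between predicted and actual values, so it is expressed in the same units as the target (cases or deaths). A minimal sketch, with made-up numbers that are not project results:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
actual = [100, 150, 200]
predicted = [110, 140, 230]
score = rmse(actual, predicted)  # sqrt((100 + 100 + 900) / 3)
```

Because errors are squared before averaging, a few large misses, such as a missed outbreak peak, weigh more heavily on RMSE than many small ones.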

Fig. 1 — Model RMSE comparison

Project code can be found here. Project teammates: Diego Diaz and Piyush Tank.

Data viz & extras

Government response index vs. confirmed cases

Fig. 2 — Government response vs. cases

Predicting confirmed cases in Spain (LR vs. NN)

Fig. 3 — Linear regression vs. neural net, Spain

When models diverge — Austria

Fig. 4 — Model variance within a single country