Exercise - Health Analysis

Introduction

In this exercise you will practice using Pandas, NumPy, Matplotlib, and Jupyter Notebooks.

Problem Description

You’re interested in the health habits and outcomes in developed nations, particularly the EU and US.

Solution Description

Download euro-health.csv, which contains health-related information derived from data in the Eurostat database (expand the tree to find the data tables). For comparison, I added roughly equivalent statistics for the US. The US data come from the following sources:

The data file has the following columns:

In the same directory as this data file, write a Jupyter Notebook named health-analysis with the contents described below.

Jupyter Notebook Sections

Your notebook should have the following parts:

Part 1 - Basic Metrics

Make the necessary imports and read the CSV data file into a Pandas DataFrame named health. Use the first column from the data file (the countries) as the index column for the DataFrame. Answer the questions below using the DataFrame.
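As a rough sketch of the loading step, the snippet below reads CSV text into a DataFrame and uses the first column as the index. The column names and numbers are made-up placeholders, not the contents of euro-health.csv; an in-memory string stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Placeholder stand-in for euro-health.csv: the column names and values
# here are invented for illustration and are NOT the real data.
csv_text = """country,life_expectancy,smoking_rate
Country A,80.0,20.0
Country B,82.0,10.0
US,78.0,25.0
"""

# index_col=0 makes the first column (the countries) the row index.
# With the real file this would be:
#     health = pd.read_csv("euro-health.csv", index_col=0)
health = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(health.index.name)   # name of the index column
print(health.loc["US"])    # look up one country's row by its label
```

With the countries as the index, a row can be selected by name with health.loc[...], which is convenient for the US comparisons later in the exercise.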

For each of the following, include a Markdown cell with the question followed by a code cell that computes and displays the answer. For example, for the question “What is 2 * 3?”, you would have a Markdown cell with:

What is 2 * 3?

followed by a code cell with

2 * 3

For each of the data columns, what is the average, which country is “best”, which is “worst”, and how does the US compare (where would the US rank among the EU countries)?

What do these results tell you about health outcomes in Europe and in the US?
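One possible way to approach the per-column questions in Part 1 is sketched below for a single column. The DataFrame contents, the column name life_expectancy, the index label "US", and the higher_is_better flag are all assumptions made for illustration; whether a high or low value counts as “best” has to be decided for each real column.

```python
import pandas as pd

# Made-up placeholder values; the real data set has different columns and numbers.
health = pd.DataFrame(
    {"life_expectancy": [80.0, 82.0, 79.0, 78.0]},
    index=["Country A", "Country B", "Country C", "US"],
)

column = "life_expectancy"
higher_is_better = True  # assumption: set to False where a lower value is "better"

values = health[column]
average = values.mean()

# idxmax/idxmin return the index label (the country) of the extreme value.
best = values.idxmax() if higher_is_better else values.idxmin()
worst = values.idxmin() if higher_is_better else values.idxmax()

# Rank 1 is "best": rank descending when higher values are better.
ranks = values.rank(ascending=not higher_is_better, method="min")
us_rank = int(ranks["US"])

print(average, best, worst, us_rank)
```

This ranks the US alongside every other row. To average over the EU countries only, the US row could first be dropped with health.drop(index="US").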

Part 2 - Visualization

For each of the following, include a Markdown cell with the question/description and a code cell that produces the visualization. Choose a graphic display that would clearly present the information.

One variable:

What do these plots tell you about the gaps between countries in the data set?
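For a single variable, a sorted horizontal bar chart is one reasonable choice, since it makes gaps between countries easy to see. The sketch below uses made-up values and a hypothetical column name; it is only one of several displays that could work.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values; the column name is hypothetical.
health = pd.DataFrame(
    {"life_expectancy": [80.0, 82.0, 79.0, 78.0]},
    index=["Country A", "Country B", "Country C", "US"],
)

# Sorting first puts the bars in order, so differences between neighbours stand out.
ordered = health["life_expectancy"].sort_values()

fig, ax = plt.subplots()
ordered.plot.barh(ax=ax)
ax.set_xlabel("Life expectancy")
ax.set_title("Life expectancy by country (placeholder data)")
fig.tight_layout()
```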

Two variables:

Rather than looking at the measurements by country, look at the measurements relative to other measurements. The measurements in the data set are paired by country, so each country contributes one point when two columns are plotted against each other. Plot life expectancy against each of the following variables (life expectancy should be the dependent variable).

What do these plots tell you about these risk factors?
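For two variables, a scatter plot with life expectancy on the vertical axis is a natural fit, because the dependent variable conventionally goes on the y-axis. The sketch below uses made-up values and hypothetical column names, and also computes a Pearson correlation as a rough numeric summary of the relationship.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values; both column names are hypothetical.
health = pd.DataFrame(
    {
        "smoking_rate": [10.0, 20.0, 30.0, 25.0],
        "life_expectancy": [82.0, 80.0, 77.0, 79.0],
    },
    index=["Country A", "Country B", "Country C", "US"],
)

fig, ax = plt.subplots()
# Risk factor on the x-axis, life expectancy (the dependent variable) on the y-axis.
ax.scatter(health["smoking_rate"], health["life_expectancy"])

# Label each point with its country so individual countries can be identified.
for country, row in health.iterrows():
    ax.annotate(country, (row["smoking_rate"], row["life_expectancy"]))

ax.set_xlabel("Smoking rate")
ax.set_ylabel("Life expectancy")

# Pearson correlation between the two columns.
r = health["smoking_rate"].corr(health["life_expectancy"])
print(r)
```

A correlation computed this way describes association only; it does not by itself show that the risk factor causes the difference in life expectancy.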

Tips and Considerations

Discussion

I wrote a script (create_health_summary.py) to create the relatively clean data set for this exercise from these source files, all of which are gzip archives as downloaded from Eurostat. I then added US data from WHO, CDC, and NIH. The script is an example of creating a simpler data set from a collection of more complicated data sets.
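The general idea of combining several gzipped tables into one summary can be sketched as follows. This is not the actual create_health_summary.py; the table layout, the column names geo and value, and the numbers are all hypothetical, and small in-memory archives stand in for the downloaded files.

```python
import gzip
import io

import pandas as pd


def read_gzipped_table(raw: bytes) -> pd.DataFrame:
    """Decompress gzip bytes holding a tab-separated table and parse it."""
    text = gzip.decompress(raw).decode("utf-8")
    return pd.read_csv(io.StringIO(text), sep="\t", index_col=0)


# In-memory stand-ins for downloaded archives (hypothetical layout, made-up values).
sources = {
    "life_expectancy": gzip.compress(b"geo\tvalue\nAT\t1.0\nBE\t2.0\n"),
    "smoking_rate": gzip.compress(b"geo\tvalue\nAT\t3.0\nBE\t4.0\n"),
}

# Take one column from each source table; pandas aligns the rows on the shared index.
summary = pd.DataFrame(
    {name: read_gzipped_table(raw)["value"] for name, raw in sources.items()}
)

print(summary)
```

For files on disk, pd.read_csv can usually read a path ending in .gz directly, since it infers gzip compression from the file extension by default.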

Sample Solution