Properly Structuring the Worldbank Data Set

Introduction

Here I will give you a concrete example of how to structure a messy data set with the Pandas package. As you will see properly doing this is key to efficiently performing analysis.

You can view the original jupyter notebook here: https://github.com/JBed/Worldbank-Health-and-Nutrition-Analysis/blob/master/Worldbank-Health-and-Nutrition-Analysis.ipynb

The Data

Get the data from here: http://data.worldbank.org/data-catalog/health-nutrition-and-population-statistics

In [1]:
import pandas as pd
%matplotlib inline
In [2]:
pd.__version__
Out[2]:
'0.22.0'
In [3]:
df = pd.read_csv('HNP_Data.csv', index_col=False)

Let's take a look at the data as is

In [4]:
df.head()
Out[4]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 Arab World ARB % of females ages 15-49 having comprehensive c... SH.HIV.KNOW.FE.ZS NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Arab World ARB % of males ages 15-49 having comprehensive cor... SH.HIV.KNOW.MA.ZS NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Arab World ARB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 133.555013 134.159119 134.857912 134.504576 134.105211 133.569626 ... 50.329135 49.999851 49.887046 49.781207 49.672975 49.536047 49.383745 48.796558 48.196418 47.550194
3 Arab World ARB Adults (ages 15+) and children (0-14 years) li... SH.HIV.TOTL NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Arab World ARB Adults (ages 15+) and children (ages 0-14) new... SH.HIV.INCD.TL NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 60 columns

It looks like the rows are indicator names for a specific country measured over many years. In this format it's hard to think of anything to do with this dataset.

In [5]:
df.columns
Out[5]:
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015'],
      dtype='object')

Goal 1: To compare the change in a given Indicator between two or more countries over the years for which data is provided.

To accomplish this we would need a dataframe with row index being years and the columns countries. But how do we deal with the fact that we have a large number of indicators? Do we make one dataframe for each indicator? No, what we should do is use Multi-Indexed column names. Check it out.

In [6]:
df = df.drop(['Country Code', 'Indicator Code'], axis=1).\
set_index(['Indicator Name','Country Name']).sort_index().T

Now we can access an indicator just like a column. but instead of returning a pandas series object a pandas dataframe is returned. That is the magic of Multi Indexing. (you can also check out the docs here: http://pandas.pydata.org/pandas-docs/stable/advanced.html)

In [7]:
df['Mortality rate, under-5 (per 1,000)'].head()
Out[7]:
Country Name Afghanistan Albania Algeria American Samoa Andorra Angola Antigua and Barbuda Arab World Argentina Armenia ... Uzbekistan Vanuatu Venezuela, RB Vietnam Virgin Islands (U.S.) West Bank and Gaza World Yemen, Rep. Zambia Zimbabwe
1960 NaN NaN 246.3 NaN NaN NaN NaN 249.360225 NaN NaN ... NaN 155.2 81.1 NaN NaN NaN 182.7 NaN 207.3 150.7
1961 356.5 NaN 246.1 NaN NaN NaN NaN 241.992944 NaN NaN ... NaN 151.0 78.3 NaN NaN NaN 180.4 NaN 203.0 146.1
1962 350.6 NaN 246.2 NaN NaN NaN NaN 236.496536 NaN NaN ... NaN 147.0 75.9 NaN NaN NaN 177.1 NaN 199.0 141.6
1963 345.0 NaN 246.8 NaN NaN NaN NaN 241.306521 NaN NaN ... NaN 142.4 73.8 NaN NaN NaN 174.1 406.1 195.4 137.2
1964 339.7 NaN 247.4 NaN NaN NaN NaN 234.443051 NaN NaN ... NaN 137.7 72.0 95.0 NaN NaN 175.2 397.7 192.3 132.8

5 rows × 258 columns

In [8]:
df['Mortality rate, under-5 (per 1,000)']\
[['High income', 'Middle income', 'Low income']].plot()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f35e358>

Nice! It looks like High, Middle and Low income countries are all converging on zero under-5 mortality rate. one can also see that high income countries are decreasing less quickly than middle and low income courts especially for more recent years.

In [ ]:
 

Goal 2: compare two indicators (over all countries) over the years for which data is provided.

The indicators that we are interested in here are: Mortality rate, under-5, male (per 1,000) and Mortality rate, under-5, female (per 1,000). We're going to make a boxplot using data from all of the countries for each of the years with available data.

In [9]:
df['Mortality rate, under-5, male (per 1,000)'].dropna(axis=[0,1], how='all')
Out[9]:
Country Name Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Arab World Argentina Armenia Australia ... Uruguay Uzbekistan Vanuatu Venezuela, RB Vietnam West Bank and Gaza World Yemen, Rep. Zambia Zimbabwe
1990 185.5 44.5 50.8 9.4 236.3 28.2 83.418292 30.5 54.6 10.2 ... 25.5 79.5 38.5 32.9 56.5 47.0 92.9 131.7 199.3 82.0
2000 141.1 28.8 42.5 5.0 226.7 17.1 63.832238 22.3 33.3 6.9 ... 18.7 70.7 30.8 24.1 38.1 32.0 77.8 99.9 171.8 113.0
2010 108.4 18.3 28.8 3.6 190.9 10.9 45.805552 16.0 20.0 5.3 ... 13.5 51.8 30.8 18.5 28.2 25.7 53.3 58.1 87.8 95.7
2015 94.6 15.3 26.8 3.1 164.6 8.8 39.345160 13.7 15.6 4.1 ... 11.1 44.1 29.8 16.6 24.6 22.9 44.4 45.5 68.9 76.1

4 rows × 234 columns

In [10]:
df['Mortality rate, under-5, female (per 1,000)'].dropna(axis=[0,1], how='all')
Out[10]:
Country Name Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Arab World Argentina Armenia Australia ... Uruguay Uzbekistan Vanuatu Venezuela, RB Vietnam West Bank and Gaza World Yemen, Rep. Zambia Zimbabwe
1990 176.2 36.5 42.7 7.6 215.2 22.7 76.439642 24.5 44.8 8.0 ... 20.5 63.0 32.6 26.1 44.7 41.6 88.2 120.7 181.6 69.2
2000 132.7 23.4 36.7 4.2 206.2 13.7 56.807549 18.0 26.8 5.6 ... 14.9 55.5 25.9 19.0 29.2 27.4 73.8 90.6 154.0 98.4
2010 101.4 14.8 25.8 3.0 173.2 8.9 40.064303 13.1 16.0 4.3 ... 11.1 40.0 26.1 14.6 21.3 21.7 49.7 50.2 76.2 83.0
2015 87.4 12.5 24.0 2.6 148.9 7.3 34.137936 11.3 12.5 3.4 ... 9.1 33.9 25.2 13.2 18.7 19.3 40.5 38.3 58.9 64.9

4 rows × 234 columns

So there are only 4 unique years for which we can make this comparison.

Now for the boxplot we are not interested in the name of the country we want all of the data (regardless of which country it came from). But we are interested in the year and the gender (both of which are not columns in the df as it is now). We can make year and gender a column with with following

In [11]:
male = df['Mortality rate, under-5, male (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
male['gender'] = 'male'
male.head()
Out[11]:
Country Name index Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Arab World Argentina Armenia ... Uzbekistan Vanuatu Venezuela, RB Vietnam West Bank and Gaza World Yemen, Rep. Zambia Zimbabwe gender
0 1990 185.5 44.5 50.8 9.4 236.3 28.2 83.418292 30.5 54.6 ... 79.5 38.5 32.9 56.5 47.0 92.9 131.7 199.3 82.0 male
1 2000 141.1 28.8 42.5 5.0 226.7 17.1 63.832238 22.3 33.3 ... 70.7 30.8 24.1 38.1 32.0 77.8 99.9 171.8 113.0 male
2 2010 108.4 18.3 28.8 3.6 190.9 10.9 45.805552 16.0 20.0 ... 51.8 30.8 18.5 28.2 25.7 53.3 58.1 87.8 95.7 male
3 2015 94.6 15.3 26.8 3.1 164.6 8.8 39.345160 13.7 15.6 ... 44.1 29.8 16.6 24.6 22.9 44.4 45.5 68.9 76.1 male

4 rows × 236 columns

In [12]:
female = df['Mortality rate, under-5, female (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
female['gender'] = 'female'
female.head()
Out[12]:
Country Name index Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Arab World Argentina Armenia ... Uzbekistan Vanuatu Venezuela, RB Vietnam West Bank and Gaza World Yemen, Rep. Zambia Zimbabwe gender
0 1990 176.2 36.5 42.7 7.6 215.2 22.7 76.439642 24.5 44.8 ... 63.0 32.6 26.1 44.7 41.6 88.2 120.7 181.6 69.2 female
1 2000 132.7 23.4 36.7 4.2 206.2 13.7 56.807549 18.0 26.8 ... 55.5 25.9 19.0 29.2 27.4 73.8 90.6 154.0 98.4 female
2 2010 101.4 14.8 25.8 3.0 173.2 8.9 40.064303 13.1 16.0 ... 40.0 26.1 14.6 21.3 21.7 49.7 50.2 76.2 83.0 female
3 2015 87.4 12.5 24.0 2.6 148.9 7.3 34.137936 11.3 12.5 ... 33.9 25.2 13.2 18.7 19.3 40.5 38.3 58.9 64.9 female

4 rows × 236 columns

We're making progress but what we really want is a dataframe that is just 3 columns (year, value, gender) not 236 as it is currently. We're going to accomplish this by using the magic of pandas melt function.

In [13]:
male = pd.melt(male, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
male.head()
Out[13]:
index gender value
0 1990 male 185.5
1 2000 male 141.1
2 2010 male 108.4
3 2015 male 94.6
4 1990 male 44.5
In [14]:
female = pd.melt(female, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
female.head()
Out[14]:
index gender value
0 1990 female 176.2
1 2000 female 132.7
2 2010 female 101.4
3 2015 female 87.4
4 1990 female 36.5

Ok now we can stack these two dataframes on top of each other to make the final plot.

In [15]:
male_and_female_data=pd.concat([male,female])
In [16]:
male_and_female_data.boxplot(by=['index', 'gender'], column='value',rot=90)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x113cac198>

We see that the under-5 mortality is decreasing and that the gender difference is shreking (with males consistently higher).

Conclusion

As you can see manipulating a Pandas Dataframe is key to analysing data with Pandas. I hope that this write up was helpful and that you can make use of this in your own projects.

In [ ]:
 

Postscriptum

You can make the above box plot without making the intermediary datafrmes like so.

In [17]:
male = 'Mortality rate, under-5, male (per 1,000)'
female = 'Mortality rate, under-5, female (per 1,000)'

df[[male, female]]\
.dropna(axis=[0,1], how='all').reset_index()\
.pipe(pd.melt, id_vars=['index'])\
.replace(to_replace={'Indicator Name':{male:'male', female:'female'} })\
.drop(['Country Name'], axis=1)\
.boxplot(by=['index', 'Indicator Name'], column='value',rot=90)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e2b908>
In [ ]: