Properly Structuring the Worldbank Data Set¶

Introduction¶

Here I will give you a concrete example of how to structure a messy data set with the Pandas package. As you will see properly doing this is key to efficiently performing analysis.

You can view the original jupyter notebook here: https://github.com/JBed/Worldbank-Health-and-Nutrition-Analysis/blob/master/Worldbank-Health-and-Nutrition-Analysis.ipynb

The Data¶

Get the data from here: http://data.worldbank.org/data-catalog/health-nutrition-and-population-statistics

In [1]:

import pandas as pd
%matplotlib inline

In [2]:

pd.__version__

Out[2]:

'0.22.0'

In [3]:

df = pd.read_csv('HNP_Data.csv', index_col=False)

Let's take a look at the data as is

In [4]:

df.head()

Out[4]:

	Country Name	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	...	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
0	Arab World	ARB	% of females ages 15-49 having comprehensive c...	SH.HIV.KNOW.FE.ZS	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Arab World	ARB	% of males ages 15-49 having comprehensive cor...	SH.HIV.KNOW.MA.ZS	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Arab World	ARB	Adolescent fertility rate (births per 1,000 wo...	SP.ADO.TFRT	133.555013	134.159119	134.857912	134.504576	134.105211	133.569626	...	50.329135	49.999851	49.887046	49.781207	49.672975	49.536047	49.383745	48.796558	48.196418	47.550194
3	Arab World	ARB	Adults (ages 15+) and children (0-14 years) li...	SH.HIV.TOTL	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Arab World	ARB	Adults (ages 15+) and children (ages 0-14) new...	SH.HIV.INCD.TL	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 60 columns

It looks like the rows are indicator names for a specific country measured over many years. In this format it's hard to think of anything to do with this dataset.

In [5]:

df.columns

Out[5]:

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015'],
      dtype='object')

Goal 1: To compare the change in a given Indicator between two or more countries over the years for which data is provided.

To accomplish this we would need a dataframe with row index being years and the columns countries. But how do we deal with the fact that we have a large number of indicators? Do we make one dataframe for each indicator? No, what we should do is use Multi-Indexed column names. Check it out.

In [6]:

df = df.drop(['Country Code', 'Indicator Code'], axis=1).\
set_index(['Indicator Name','Country Name']).sort_index().T

Now we can access an indicator just like a column. but instead of returning a pandas series object a pandas dataframe is returned. That is the magic of Multi Indexing. (you can also check out the docs here: http://pandas.pydata.org/pandas-docs/stable/advanced.html)

In [7]:

df['Mortality rate, under-5 (per 1,000)'].head()

Out[7]:

Country Name	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Antigua and Barbuda	Arab World	Argentina	Armenia	...	Uzbekistan	Vanuatu	Venezuela, RB	Vietnam	Virgin Islands (U.S.)	West Bank and Gaza	World	Yemen, Rep.	Zambia	Zimbabwe
1960	NaN	NaN	246.3	NaN	NaN	NaN	NaN	249.360225	NaN	NaN	...	NaN	155.2	81.1	NaN	NaN	NaN	182.7	NaN	207.3	150.7
1961	356.5	NaN	246.1	NaN	NaN	NaN	NaN	241.992944	NaN	NaN	...	NaN	151.0	78.3	NaN	NaN	NaN	180.4	NaN	203.0	146.1
1962	350.6	NaN	246.2	NaN	NaN	NaN	NaN	236.496536	NaN	NaN	...	NaN	147.0	75.9	NaN	NaN	NaN	177.1	NaN	199.0	141.6
1963	345.0	NaN	246.8	NaN	NaN	NaN	NaN	241.306521	NaN	NaN	...	NaN	142.4	73.8	NaN	NaN	NaN	174.1	406.1	195.4	137.2
1964	339.7	NaN	247.4	NaN	NaN	NaN	NaN	234.443051	NaN	NaN	...	NaN	137.7	72.0	95.0	NaN	NaN	175.2	397.7	192.3	132.8

5 rows × 258 columns

In [8]:

df['Mortality rate, under-5 (per 1,000)']\
[['High income', 'Middle income', 'Low income']].plot()

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x10f35e358>

Nice! It looks like High, Middle and Low income countries are all converging on zero under-5 mortality rate. one can also see that high income countries are decreasing less quickly than middle and low income courts especially for more recent years.

In [ ]:

Goal 2: compare two indicators (over all countries) over the years for which data is provided.

The indicators that we are interested in here are: Mortality rate, under-5, male (per 1,000) and Mortality rate, under-5, female (per 1,000). We're going to make a boxplot using data from all of the countries for each of the years with available data.

In [9]:

df['Mortality rate, under-5, male (per 1,000)'].dropna(axis=[0,1], how='all')

Out[9]:

Country Name	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Arab World	Argentina	Armenia	Australia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela, RB	Vietnam	West Bank and Gaza	World	Yemen, Rep.	Zambia	Zimbabwe
1990	185.5	44.5	50.8	9.4	236.3	28.2	83.418292	30.5	54.6	10.2	...	25.5	79.5	38.5	32.9	56.5	47.0	92.9	131.7	199.3	82.0
2000	141.1	28.8	42.5	5.0	226.7	17.1	63.832238	22.3	33.3	6.9	...	18.7	70.7	30.8	24.1	38.1	32.0	77.8	99.9	171.8	113.0
2010	108.4	18.3	28.8	3.6	190.9	10.9	45.805552	16.0	20.0	5.3	...	13.5	51.8	30.8	18.5	28.2	25.7	53.3	58.1	87.8	95.7
2015	94.6	15.3	26.8	3.1	164.6	8.8	39.345160	13.7	15.6	4.1	...	11.1	44.1	29.8	16.6	24.6	22.9	44.4	45.5	68.9	76.1

4 rows × 234 columns

In [10]:

df['Mortality rate, under-5, female (per 1,000)'].dropna(axis=[0,1], how='all')

Out[10]:

Country Name	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Arab World	Argentina	Armenia	Australia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela, RB	Vietnam	West Bank and Gaza	World	Yemen, Rep.	Zambia	Zimbabwe
1990	176.2	36.5	42.7	7.6	215.2	22.7	76.439642	24.5	44.8	8.0	...	20.5	63.0	32.6	26.1	44.7	41.6	88.2	120.7	181.6	69.2
2000	132.7	23.4	36.7	4.2	206.2	13.7	56.807549	18.0	26.8	5.6	...	14.9	55.5	25.9	19.0	29.2	27.4	73.8	90.6	154.0	98.4
2010	101.4	14.8	25.8	3.0	173.2	8.9	40.064303	13.1	16.0	4.3	...	11.1	40.0	26.1	14.6	21.3	21.7	49.7	50.2	76.2	83.0
2015	87.4	12.5	24.0	2.6	148.9	7.3	34.137936	11.3	12.5	3.4	...	9.1	33.9	25.2	13.2	18.7	19.3	40.5	38.3	58.9	64.9

4 rows × 234 columns

So there are only 4 unique years for which we can make this comparison.

Now for the boxplot we are not interested in the name of the country we want all of the data (regardless of which country it came from). But we are interested in the year and the gender (both of which are not columns in the df as it is now). We can make year and gender a column with with following

In [11]:

male = df['Mortality rate, under-5, male (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
male['gender'] = 'male'
male.head()

Out[11]:

Country Name	index	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Arab World	Argentina	Armenia	...	Uzbekistan	Vanuatu	Venezuela, RB	Vietnam	West Bank and Gaza	World	Yemen, Rep.	Zambia	Zimbabwe	gender
0	1990	185.5	44.5	50.8	9.4	236.3	28.2	83.418292	30.5	54.6	...	79.5	38.5	32.9	56.5	47.0	92.9	131.7	199.3	82.0	male
1	2000	141.1	28.8	42.5	5.0	226.7	17.1	63.832238	22.3	33.3	...	70.7	30.8	24.1	38.1	32.0	77.8	99.9	171.8	113.0	male
2	2010	108.4	18.3	28.8	3.6	190.9	10.9	45.805552	16.0	20.0	...	51.8	30.8	18.5	28.2	25.7	53.3	58.1	87.8	95.7	male
3	2015	94.6	15.3	26.8	3.1	164.6	8.8	39.345160	13.7	15.6	...	44.1	29.8	16.6	24.6	22.9	44.4	45.5	68.9	76.1	male

4 rows × 236 columns

In [12]:

female = df['Mortality rate, under-5, female (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
female['gender'] = 'female'
female.head()

Out[12]:

Country Name	index	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Arab World	Argentina	Armenia	...	Uzbekistan	Vanuatu	Venezuela, RB	Vietnam	West Bank and Gaza	World	Yemen, Rep.	Zambia	Zimbabwe	gender
0	1990	176.2	36.5	42.7	7.6	215.2	22.7	76.439642	24.5	44.8	...	63.0	32.6	26.1	44.7	41.6	88.2	120.7	181.6	69.2	female
1	2000	132.7	23.4	36.7	4.2	206.2	13.7	56.807549	18.0	26.8	...	55.5	25.9	19.0	29.2	27.4	73.8	90.6	154.0	98.4	female
2	2010	101.4	14.8	25.8	3.0	173.2	8.9	40.064303	13.1	16.0	...	40.0	26.1	14.6	21.3	21.7	49.7	50.2	76.2	83.0	female
3	2015	87.4	12.5	24.0	2.6	148.9	7.3	34.137936	11.3	12.5	...	33.9	25.2	13.2	18.7	19.3	40.5	38.3	58.9	64.9	female

4 rows × 236 columns

We're making progress but what we really want is a dataframe that is just 3 columns (year, value, gender) not 236 as it is currently. We're going to accomplish this by using the magic of pandas melt function.

In [13]:

male = pd.melt(male, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
male.head()

Out[13]:

	index	gender	value
0	1990	male	185.5
1	2000	male	141.1
2	2010	male	108.4
3	2015	male	94.6
4	1990	male	44.5

In [14]:

female = pd.melt(female, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
female.head()

Out[14]:

	index	gender	value
0	1990	female	176.2
1	2000	female	132.7
2	2010	female	101.4
3	2015	female	87.4
4	1990	female	36.5

Ok now we can stack these two dataframes on top of each other to make the final plot.

In [15]:

male_and_female_data=pd.concat([male,female])

In [16]:

male_and_female_data.boxplot(by=['index', 'gender'], column='value',rot=90)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x113cac198>

We see that the under-5 mortality is decreasing and that the gender difference is shreking (with males consistently higher).

Conclusion¶

As you can see manipulating a Pandas Dataframe is key to analysing data with Pandas. I hope that this write up was helpful and that you can make use of this in your own projects.

In [ ]:

Postscriptum¶

You can make the above box plot without making the intermediary datafrmes like so.

In [17]:

male = 'Mortality rate, under-5, male (per 1,000)'
female = 'Mortality rate, under-5, female (per 1,000)'

df[[male, female]]\
.dropna(axis=[0,1], how='all').reset_index()\
.pipe(pd.melt, id_vars=['index'])\
.replace(to_replace={'Indicator Name':{male:'male', female:'female'} })\
.drop(['Country Name'], axis=1)\
.boxplot(by=['index', 'Indicator Name'], column='value',rot=90)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x113e2b908>

In [ ]: