Here I will give you a concrete example of how to structure a messy data set with the Pandas package. As you will see properly doing this is key to efficiently performing analysis.
You can view the original jupyter notebook here: https://github.com/JBed/Worldbank-Health-and-Nutrition-Analysis/blob/master/Worldbank-Health-and-Nutrition-Analysis.ipynb
Get the data from here: http://data.worldbank.org/data-catalog/health-nutrition-and-population-statistics
import pandas as pd
%matplotlib inline
pd.__version__
df = pd.read_csv('HNP_Data.csv', index_col=False)
Let's take a look at the data as is
df.head()
It looks like the rows are indicator names for a specific country measured over many years. In this format it's hard to think of anything to do with this dataset.
df.columns
Goal 1: To compare the change in a given Indicator between two or more countries over the years for which data is provided.
To accomplish this we would need a dataframe with row index being years and the columns countries. But how do we deal with the fact that we have a large number of indicators? Do we make one dataframe for each indicator? No, what we should do is use Multi-Indexed column names. Check it out.
df = df.drop(['Country Code', 'Indicator Code'], axis=1).\
set_index(['Indicator Name','Country Name']).sort_index().T
Now we can access an indicator just like a column. but instead of returning a pandas series object a pandas dataframe is returned. That is the magic of Multi Indexing. (you can also check out the docs here: http://pandas.pydata.org/pandas-docs/stable/advanced.html)
df['Mortality rate, under-5 (per 1,000)'].head()
df['Mortality rate, under-5 (per 1,000)']\
[['High income', 'Middle income', 'Low income']].plot()
Nice! It looks like High, Middle and Low income countries are all converging on zero under-5 mortality rate. one can also see that high income countries are decreasing less quickly than middle and low income courts especially for more recent years.
Goal 2: compare two indicators (over all countries) over the years for which data is provided.
The indicators that we are interested in here are: Mortality rate, under-5, male (per 1,000) and Mortality rate, under-5, female (per 1,000). We're going to make a boxplot using data from all of the countries for each of the years with available data.
df['Mortality rate, under-5, male (per 1,000)'].dropna(axis=[0,1], how='all')
df['Mortality rate, under-5, female (per 1,000)'].dropna(axis=[0,1], how='all')
So there are only 4 unique years for which we can make this comparison.
Now for the boxplot we are not interested in the name of the country we want all of the data (regardless of which country it came from). But we are interested in the year and the gender (both of which are not columns in the df as it is now). We can make year and gender a column with with following
male = df['Mortality rate, under-5, male (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
male['gender'] = 'male'
male.head()
female = df['Mortality rate, under-5, female (per 1,000)']\
.dropna(axis=[0,1], how='all').reset_index()
female['gender'] = 'female'
female.head()
We're making progress but what we really want is a dataframe that is just 3 columns (year, value, gender) not 236 as it is currently. We're going to accomplish this by using the magic of pandas melt function.
male = pd.melt(male, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
male.head()
female = pd.melt(female, id_vars=['index', 'gender'])\
.drop('Country Name',axis=1)
female.head()
Ok now we can stack these two dataframes on top of each other to make the final plot.
male_and_female_data=pd.concat([male,female])
male_and_female_data.boxplot(by=['index', 'gender'], column='value',rot=90)
We see that the under-5 mortality is decreasing and that the gender difference is shreking (with males consistently higher).
As you can see manipulating a Pandas Dataframe is key to analysing data with Pandas. I hope that this write up was helpful and that you can make use of this in your own projects.
You can make the above box plot without making the intermediary datafrmes like so.
male = 'Mortality rate, under-5, male (per 1,000)'
female = 'Mortality rate, under-5, female (per 1,000)'
df[[male, female]]\
.dropna(axis=[0,1], how='all').reset_index()\
.pipe(pd.melt, id_vars=['index'])\
.replace(to_replace={'Indicator Name':{male:'male', female:'female'} })\
.drop(['Country Name'], axis=1)\
.boxplot(by=['index', 'Indicator Name'], column='value',rot=90)