Pandas Data Analysis Example Lab

By Jason Bedford (http://jbedford.net/)

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [13]:
df = pd.read_csv('./data/cdystonia.csv')
In [17]:
df.drop('id', axis=1 ,inplace=True)
In [18]:
df.head()
Out[18]:
patient obs week site treat age sex twstrs
0 1 1 0 1 5000U 65 F 32
1 1 2 2 1 5000U 65 F 30
2 1 3 4 1 5000U 65 F 24
3 1 4 8 1 5000U 65 F 37
4 1 5 12 1 5000U 65 F 39

These data are from Statistical Methods for the Analysis of Repeated Measurements by Charles S. Davis, pp. 161-163 (Springer, 2002). These data are from a multicenter, randomized controlled trial of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites.

  • "treat" tells us what the patient was treated with. 5000 units of BotB 10,000 units of BotB or placebo.

  • "twstrs" tells us total score on Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and disability of cervical dystonia (high scores mean less pain.)

  • The other columns are straightforward. Note that some patients missed weeks and other dropped out before week 12.

Restructuring DataFrame objects

The first thing that we notice is that twstrs is unique. so a single patient appears in many rows. There is nothing inherently wrong with structuring the dataset like this but it does make certain types of analysis more difficult. What other ways could we structure this dataset? And what sort of analysis would be made possible by that structure?

The answer is that we could make each patent a single row. which would mean that we could have have columns for each of the weeks of the twstrs scores. This can be accomplished with a simple pivot shown below.

In [20]:
df_pivot = df.pivot(index='patient', columns='obs', values='twstrs')
In [22]:
df_pivot.head()
Out[22]:
obs 1 2 3 4 5 6
patient
1 32.0 30.0 24.0 37.0 39.0 36.0
2 60.0 26.0 27.0 41.0 65.0 67.0
3 44.0 20.0 23.0 26.0 35.0 35.0
4 53.0 61.0 64.0 62.0 NaN NaN
5 53.0 35.0 48.0 49.0 41.0 51.0

We can add the site, treat, age, sex columns into the pivoted dataframe.

In [41]:
df_wide = pd.merge(df_pivot,df.set_index('patient')\
                        [['site','treat','age','sex']].drop_duplicates(),\
         left_index=True, right_index=True, how='left')
In [42]:
df_wide.head()
Out[42]:
1 2 3 4 5 6 site treat age sex
patient
1 32.0 30.0 24.0 37.0 39.0 36.0 1.0 5000U 65.0 F
2 60.0 26.0 27.0 41.0 65.0 67.0 1.0 10000U 70.0 F
3 44.0 20.0 23.0 26.0 35.0 35.0 1.0 5000U 64.0 F
4 53.0 61.0 64.0 62.0 NaN NaN 1.0 Placebo 59.0 F
5 53.0 35.0 48.0 49.0 41.0 51.0 1.0 10000U 76.0 F

This illustrates the two formats for repeated measures data: long and wide formats. Its typically better to store data in long format because additional data can be included as additional rows in the database, while wide format requires that the entire database schema be altered by adding columns to every row as data are collected.

The preferable format for analysis depends entirely on what is planned for the data, so it is important to be able to move easily between the wide and long format as shown above.

Data aggregation and GroupBy operations

After getting familiar with the data and the two ways we can structure it we might want to start asking questions. For example, is there any change in twstrs for the different treat. We can figure this out using groupby.

In [48]:
df.groupby('treat')['twstrs'].describe()
Out[48]:
count mean std min 25% 50% 75% max
treat
10000U 213.0 41.563380 12.546215 9.0 33.0 44.0 51.0 67.0
5000U 211.0 41.364929 13.531779 7.0 32.0 43.0 51.0 71.0
Placebo 207.0 41.507246 12.078573 6.0 33.0 43.0 50.5 71.0

Does not look like much of difference but let's also look at some other features. We could use a statistical significance test but it’s just going to tell us what we can see from glancing at the data.

In [49]:
df.groupby('treat')[['twstrs', 'age']].mean()
Out[49]:
twstrs age
treat
10000U 41.563380 55.774648
5000U 41.364929 56.957346
Placebo 41.507246 54.086957

Same story

We can look at discrete variables with value counts.

In [51]:
df.groupby('treat')['sex'].value_counts().unstack()
Out[51]:
sex F M
treat
10000U 165 48
5000U 106 105
Placebo 124 83

seems to be some sex imbalance in the 10kU group

In [ ]:
 

As I was saying the wide structure of our data can be useful. Specifically, if we wanted to know how twstrs changes over time for the different treat classes.

In [58]:
df_wide.groupby('treat')[[1, 2, 3, 4, 5, 6]].mean()
Out[58]:
1 2 3 4 5 6
treat
10000U 47.055556 36.628571 35.542857 39.393939 44.939394 49.342857
5000U 46.228571 37.000000 37.114286 39.676471 43.114286 44.735294
Placebo 43.583333 39.969697 39.342857 41.400000 41.735294 42.911765

This is a lot of data it would be nice if we could view this as a plot.

In [60]:
df_wide.groupby('treat')[[1, 2, 3, 4, 5, 6]].mean().T.plot()
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x107ec96d8>

Between weeks 3 and 6 the groups receiving treatment seems to be improving over the placebo

Finally, we can see if patents at different hospitals were more likely to receive the treatment or a placebo. Note this could influence the results.

In [62]:
df.groupby('site')['treat'].value_counts().unstack()
Out[62]:
treat 10000U 5000U Placebo
site
1 24 24 22
2 23 29 30
3 24 24 24
4 18 12 18
5 11 11 11
6 27 30 30
7 24 24 24
8 41 35 30
9 21 22 18

since this is a lot of dat we can plot it like this.

In [63]:
df.groupby('site')['treat'].value_counts().unstack().plot(kind='bar')
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x107f063c8>