By Jason Bedford (http://jbedford.net/)
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('./data/cdystonia.csv')
df.drop('id', axis=1, inplace=True)
df.head()
These data are from Statistical Methods for the Analysis of Repeated Measurements by Charles S. Davis, pp. 161-163 (Springer, 2002). They come from a multicenter, randomized controlled trial of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites.
"treat" tells us what the patient was treated with: 5000 units of BotB, 10,000 units of BotB, or placebo.
"twstrs" is the total score on the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which measures the severity, pain, and disability of cervical dystonia (higher scores indicate more severe impairment).
The other columns are straightforward. Note that some patients missed weeks and others dropped out before week 12.
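A quick way to see which patients have incomplete follow-up is to count observations per patient. The sketch below uses a small made-up frame (`toy`, with invented values) that only mimics the `patient`, `obs` and `twstrs` columns, so it runs without the CSV; it assumes that a complete patient has one row for every observation number present in the data.

```python
import pandas as pd

# Hypothetical toy data in the same long layout as df (values are made up).
toy = pd.DataFrame({
    'patient': [1, 1, 1, 2, 2, 3],
    'obs':     [1, 2, 3, 1, 3, 1],   # patient 2 missed obs 2, patient 3 dropped out
    'twstrs':  [32, 30, 24, 60, 41, 55],
})

# Number of recorded observations for each patient.
n_obs = toy.groupby('patient')['obs'].count()

# Patients with fewer rows than there are distinct observation numbers.
incomplete = n_obs[n_obs < toy['obs'].nunique()]
print(incomplete)
```

The same two lines should work on `df` itself, provided the column names match.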
The first thing that we notice is that each row holds a single twstrs measurement, so a single patient appears in many rows. There is nothing inherently wrong with structuring the dataset like this, but it does make certain types of analysis more difficult. What other ways could we structure this dataset? And what sort of analysis would be made possible by that structure?
The answer is that we could make each patient a single row, with a separate column for the twstrs score at each observation. This can be accomplished with a simple pivot, shown below.
df_pivot = df.pivot(index='patient', columns='obs', values='twstrs')
df_pivot.head()
We can add the site, treat, age, sex columns into the pivoted dataframe.
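# Build one row of patient-level attributes per patient, then attach it to the pivot.
# Deduplicate before moving 'patient' into the index: drop_duplicates ignores the index.
patient_info = df[['patient', 'site', 'treat', 'age', 'sex']].drop_duplicates().set_index('patient')
df_wide = pd.merge(df_pivot, patient_info, left_index=True, right_index=True, how='left')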
df_wide.head()
This illustrates the two formats for repeated measures data: long and wide. It's typically better to store data in long format because new measurements can be added as additional rows, while wide format requires altering the schema by adding a column to every row each time new data are collected.
The preferable format for analysis depends entirely on what is planned for the data, so it is important to be able to move easily between the wide and long formats.
After getting familiar with the data and the two ways we can structure it, we might want to start asking questions. For example, does twstrs differ between the treat groups? We can figure this out using groupby.
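Going the other way, from wide back to long, can be sketched with `melt`. The frame below is made-up toy data (`long_df`, with invented values) in the same layout as `df`; the `dropna` step assumes that any NaN in the wide table comes from a missed visit rather than from a recorded missing score.

```python
import pandas as pd

# Hypothetical toy data in the same long layout as df (values are made up).
long_df = pd.DataFrame({
    'patient': [1, 1, 1, 2, 2],
    'obs':     [1, 2, 3, 1, 3],   # patient 2 missed observation 2
    'twstrs':  [32, 30, 24, 60, 41],
})

# Long -> wide: one row per patient, one column per observation.
wide = long_df.pivot(index='patient', columns='obs', values='twstrs')

# Wide -> long: melt the observation columns back into rows, then drop
# the NaN rows that the pivot created for missed visits.
round_trip = (
    wide.reset_index()
        .melt(id_vars='patient', var_name='obs', value_name='twstrs')
        .dropna(subset=['twstrs'])
        .sort_values(['patient', 'obs'])
        .reset_index(drop=True)
)
print(round_trip)
```

Note that the round trip turns `twstrs` into floats, because the wide table needed NaN to mark the missing visit.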
df.groupby('treat')['twstrs'].describe()
There does not appear to be much of a difference, but let's also look at some other features. We could run a statistical significance test, but it would likely just confirm what we can already see from glancing at the data.
df.groupby('treat')[['twstrs', 'age']].mean()
Same story: the group means are similar.
We can look at discrete variables with value counts.
df.groupby('treat')['sex'].value_counts().unstack()
There seems to be some sex imbalance in the 10kU group.
As noted earlier, the wide structure of our data can be useful, specifically if we want to know how twstrs changes over time for the different treat classes.
df_wide.groupby('treat')[[1, 2, 3, 4, 5, 6]].mean()
This is a lot of data, so it would be nice to view it as a plot.
df_wide.groupby('treat')[[1, 2, 3, 4, 5, 6]].mean().T.plot()
Between observations 3 and 6, the groups receiving treatment seem to be improving relative to the placebo group.
Finally, we can see if patients at different hospitals were more likely to receive the treatment or a placebo. Note that this could influence the results.
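The wide format also makes it easy to compare each patient with their own starting point. As a rough sketch, the code below subtracts the first observation from later ones and averages the change by treatment group. The frame `toy_wide` and its values are made up for illustration, and treating observation 1 as the baseline is an assumption.

```python
import pandas as pd

# Hypothetical wide-format toy data (values are made up): one row per patient,
# integer columns holding the twstrs score at observations 1, 3 and 6.
toy_wide = pd.DataFrame({
    'treat': ['Placebo', 'Placebo', '5000U', '5000U'],
    1: [50, 40, 52, 44],
    3: [49, 41, 40, 36],
    6: [51, 39, 48, 42],
}, index=pd.Index([1, 2, 3, 4], name='patient'))

# Change from the first observation (assumed baseline) for each patient.
change = toy_wide[[3, 6]].sub(toy_wide[1], axis=0)

# Average change per treatment group.
mean_change = change.groupby(toy_wide['treat']).mean()
print(mean_change)
```

On the real data the same pattern would be `df_wide[[2, 3, 4, 5, 6]].sub(df_wide[1], axis=0)`, grouped by `df_wide['treat']`.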
df.groupby('site')['treat'].value_counts().unstack()
Since this is a lot of data, we can plot it like this.
df.groupby('site')['treat'].value_counts().unstack().plot(kind='bar')
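One caveat: counting on the long frame counts rows, so a patient with six observations weighs six times as much as one who dropped out after the first visit. A possible alternative, sketched here on made-up toy data (`toy`, with invented values), is to keep one row per patient and use `pd.crosstab` with `normalize='index'` to get each site's share of patients per treatment.

```python
import pandas as pd

# Hypothetical toy data in long format (values are made up).
toy = pd.DataFrame({
    'patient': [1, 1, 2, 3, 3, 4],
    'site':    [1, 1, 1, 2, 2, 2],
    'treat':   ['Placebo', 'Placebo', '5000U', '5000U', '5000U', '5000U'],
})

# Keep one row per patient so each patient is counted once.
per_patient = toy.drop_duplicates(subset='patient')

# Share of each site's patients assigned to each treatment (rows sum to 1).
shares = pd.crosstab(per_patient['site'], per_patient['treat'], normalize='index')
print(shares)
```

Calling `shares.plot(kind='bar', stacked=True)` would give a bar chart that is comparable across sites of different sizes.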