import pandas as pd
%matplotlib inline
pd.__version__
The goal of this worksheet is to provide practical examples of aggregating (with group by), plotting, and pivoting data with the Pandas Python package.
This worksheet is available as a jupyter notebook on github here: https://github.com/JBed/Pandas_Analysis_Worksheet
Get the data here: https://www.kaggle.com/dansbecker/nba-shot-logs
Finally, if you have any questions, comments, or believe that I did anything incorrectly (or suboptimally) feel free to email me here: jason@jbedford.net
df = pd.read_csv('nba-shot-logs.zip')
The data is structured so that each row corresponds to one shot taking during the 2014-2015 NBA season (We exclude free throws).
df.head(2)
df.columns
Most of the column names are self-explanatory. One thing that initially confused me was that there is no column indicating the team of the player taking the shot. It turns out that that information is hidden in the MATCHUP column.
df.set_index('GAME_ID').loc[21400899]['MATCHUP'].unique()
We see that the name of the team of the player taking the shot is the first team listed after the date. It turns out that having things structured this way is actually very convenient.
Here It makes sense restructure our data so that each row is referring to one team in one game and the columns give us the number of shots made or missed. This can be done with either DataFrame.groupby or pandas.pivot_table which I show below.
shot_result_by_matchup = df.groupby(['MATCHUP','SHOT_RESULT']).size().unstack()
shot_result_by_matchup.head()
df.pivot_table(index='MATCHUP', columns='SHOT_RESULT', \
values='W',aggfunc=lambda x: len(x)).head()
Personally I find the groupby operation to be more expressive.
This is done by sorting the above on made
shot_result_by_matchup.sort_values(by='made').head()
shot_result_by_matchup.sort_values(by='made', ascending=False).head()
We'll make a new column called total and sort on that.
shot_result_by_matchup['total'] = shot_result_by_matchup.sum(axis=1)
shot_result_by_matchup.sort_values(by='total').head()
The first few are likely cancelled games
shot_result_by_matchup.sort_values(by='total',ascending=False).head()
We'll make a derived column called make_percent and sort on that.
shot_result_by_matchup['make_percent'] = \
round((shot_result_by_matchup['made'] / shot_result_by_matchup['total'])*100,1)
shot_result_by_matchup.sort_values(by='make_percent').head()
shot_result_by_matchup.sort_values(by='make_percent', ascending=False).head()
Here we’ll make a dataframe similar to the above but with 'W' as a column
shot_and_game_result_by_matchup = df.groupby(['MATCHUP','W','SHOT_RESULT']).size().unstack()
shot_and_game_result_by_matchup.head()
We want 'W' to be a column not part of the index as it is now
shot_and_game_result_by_matchup = \
shot_and_game_result_by_matchup.reset_index().set_index('MATCHUP')
shot_and_game_result_by_matchup.head()
Again we'll make a derived column called make_percent
shot_and_game_result_by_matchup['make_percent'] = \
round((shot_and_game_result_by_matchup['made']/\
shot_and_game_result_by_matchup.sum(axis=1))*100,1)
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='W'].sort_values(by='make_percent').head()
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='L'].sort_values(by='make_percent', \
ascending=False).head()
We'll explore this question with a aggrigation and a plot
shot_and_game_result_by_matchup.groupby('W')['make_percent'].describe().unstack()
shot_and_game_result_by_matchup.boxplot(column='make_percent', by='W',figsize=(12,8))
Here it looks like we need to compare two rows in the shot_and_game_result_by_matchup dataframe. Don't do that! Remember that columns are for comparing not rows. Let me say that again
ok so how do we turn two rows into two columns?
Answer: Pivot the data
shot_result_by_gameid_and_w = \
df.groupby(['GAME_ID','W','SHOT_RESULT']).size().unstack()
shot_result_by_gameid_and_w.head(4)
shot_result_by_gameid_and_w['make_percent'] = \
round((shot_result_by_gameid_and_w['made']/\
shot_result_by_gameid_and_w.sum(axis=1))*100,1)
shot_result_by_gameid_and_w.head(4)
shot_result_by_gameid_and_w = \
shot_result_by_gameid_and_w.reset_index()
shot_result_by_gameid_and_w.head(4)
make_percent_by_gameid = \
shot_result_by_gameid_and_w.pivot(index='GAME_ID', columns='W')
make_percent_by_gameid.head()
Now we can make a comparison between the two columns of make_percent (note this is called a MultiIndex more info here: https://pandas.pydata.org/pandas-docs/stable/advanced.html)
(make_percent_by_gameid['make_percent']['W']<\
make_percent_by_gameid['make_percent']['L']).sum()
(197./904)*100
So about 22% of the time the game is won by the team that has the lower make percentage.