Pandas Worksheet Solutions

In [2]:
import pandas as pd
%matplotlib inline
In [3]:
pd.__version__
Out[3]:
'0.22.0'

The goal of this worksheet is to provide practical examples of aggregating (with group by), plotting, and pivoting data with the Pandas Python package.

This worksheet is available as a jupyter notebook on github here: https://github.com/JBed/Pandas_Analysis_Worksheet

Get the data here: https://www.kaggle.com/dansbecker/nba-shot-logs

Finally, if you have any questions, comments, or believe that I did anything incorrectly (or suboptimally) feel free to email me here: jason@jbedford.net

In [2]:
df = pd.read_csv('nba-shot-logs.zip')

The data is structured so that each row corresponds to one shot taking during the 2014-2015 NBA season (We exclude free throws).

In [18]:
df.head(2)
Out[18]:
GAME_ID MATCHUP LOCATION W FINAL_MARGIN SHOT_NUMBER PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES ... SHOT_DIST PTS_TYPE SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS player_name player_id
0 21400899 MAR 04, 2015 - CHA @ BKN A W 24 1 1 1:09 10.8 2 ... 7.7 2 made Anderson, Alan 101187 1.3 1 2 brian roberts 203148
1 21400899 MAR 04, 2015 - CHA @ BKN A W 24 2 1 0:14 3.4 0 ... 28.2 3 missed Bogdanovic, Bojan 202711 6.1 0 0 brian roberts 203148

2 rows × 21 columns

In [4]:
df.columns
Out[4]:
Index(['GAME_ID', 'MATCHUP', 'LOCATION', 'W', 'FINAL_MARGIN', 'SHOT_NUMBER',
       'PERIOD', 'GAME_CLOCK', 'SHOT_CLOCK', 'DRIBBLES', 'TOUCH_TIME',
       'SHOT_DIST', 'PTS_TYPE', 'SHOT_RESULT', 'CLOSEST_DEFENDER',
       'CLOSEST_DEFENDER_PLAYER_ID', 'CLOSE_DEF_DIST', 'FGM', 'PTS',
       'player_name', 'player_id'],
      dtype='object')

Most of the column names are self-explanatory. One thing that initially confused me was that there is no column indicating the team of the player taking the shot. It turns out that that information is hidden in the MATCHUP column.

In [24]:
df.set_index('GAME_ID').loc[21400899]['MATCHUP'].unique()
Out[24]:
array(['MAR 04, 2015 - CHA @ BKN', 'MAR 04, 2015 - BKN vs. CHA'], dtype=object)

We see that the name of the team of the player taking the shot is the first team listed after the date. It turns out that having things structured this way is actually very convenient.

Part1: Questions about SHOT_RESULT for one team in a one game

Here It makes sense restructure our data so that each row is referring to one team in one game and the columns give us the number of shots made or missed. This can be done with either DataFrame.groupby or pandas.pivot_table which I show below.

In [27]:
shot_result_by_matchup = df.groupby(['MATCHUP','SHOT_RESULT']).size().unstack()
In [28]:
shot_result_by_matchup.head()
Out[28]:
SHOT_RESULT made missed
MATCHUP
DEC 01, 2014 - DEN @ UTA 39 41
DEC 01, 2014 - LAC vs. MIN 36 33
DEC 01, 2014 - MIA @ WAS 31 35
DEC 01, 2014 - MIN @ LAC 30 38
DEC 01, 2014 - PHI vs. SAS 28 37
In [33]:
df.pivot_table(index='MATCHUP', columns='SHOT_RESULT', \
               values='W',aggfunc=lambda x: len(x)).head()
Out[33]:
SHOT_RESULT made missed
MATCHUP
DEC 01, 2014 - DEN @ UTA 39 41
DEC 01, 2014 - LAC vs. MIN 36 33
DEC 01, 2014 - MIA @ WAS 31 35
DEC 01, 2014 - MIN @ LAC 30 38
DEC 01, 2014 - PHI vs. SAS 28 37

Personally I find the groupby operation to be more expressive.

Q1.1 which team made the most and least shots in a game

This is done by sorting the above on made

In [37]:
shot_result_by_matchup.sort_values(by='made').head()
Out[37]:
SHOT_RESULT made missed total
MATCHUP
FEB 09, 2015 - GSW @ PHI 2 9 11
FEB 09, 2015 - PHI vs. GSW 3 7 10
MAR 04, 2015 - NYK @ IND 7 18 25
OCT 29, 2014 - PHI @ IND 9 25 34
FEB 11, 2015 - MIN vs. GSW 11 27 38
In [38]:
shot_result_by_matchup.sort_values(by='made', ascending=False).head()
Out[38]:
SHOT_RESULT made missed total
MATCHUP
NOV 28, 2014 - LAL vs. MIN 48 41 89
DEC 27, 2014 - SAC vs. NYK 48 39 87
DEC 08, 2014 - WAS vs. BOS 48 39 87
NOV 07, 2014 - CHA vs. ATL 47 46 93
JAN 12, 2015 - TOR vs. DET 47 39 86

Q1.2: which team took the most and least shots in a game.

We'll make a new column called total and sort on that.

In [34]:
shot_result_by_matchup['total'] = shot_result_by_matchup.sum(axis=1)
In [35]:
shot_result_by_matchup.sort_values(by='total').head()
Out[35]:
SHOT_RESULT made missed total
MATCHUP
FEB 09, 2015 - PHI vs. GSW 3 7 10
FEB 09, 2015 - GSW @ PHI 2 9 11
MAR 04, 2015 - NYK @ IND 7 18 25
FEB 28, 2015 - NYK vs. TOR 13 14 27
FEB 23, 2015 - MIN @ HOU 14 19 33

The first few are likely cancelled games

In [36]:
shot_result_by_matchup.sort_values(by='total',ascending=False).head()
Out[36]:
SHOT_RESULT made missed total
MATCHUP
FEB 07, 2015 - DAL vs. POR 43 62 105
JAN 29, 2015 - CHI @ LAL 42 61 103
NOV 30, 2014 - TOR @ LAL 44 58 102
DEC 03, 2014 - SAS @ BKN 36 65 101
OCT 29, 2014 - CHA vs. MIL 41 60 101

Q1.3 Which team made the most shots as a percentage of all shots taken in a game

We'll make a derived column called make_percent and sort on that.

In [39]:
shot_result_by_matchup['make_percent'] = \
round((shot_result_by_matchup['made'] / shot_result_by_matchup['total'])*100,1)
In [40]:
shot_result_by_matchup.sort_values(by='make_percent').head()
Out[40]:
SHOT_RESULT made missed total make_percent
MATCHUP
FEB 09, 2015 - GSW @ PHI 2 9 11 18.2
OCT 29, 2014 - PHI @ IND 9 25 34 26.5
MAR 04, 2015 - NYK @ IND 7 18 25 28.0
NOV 16, 2014 - OKC vs. HOU 13 33 46 28.3
JAN 13, 2015 - LAL vs. MIA 23 58 81 28.4
In [41]:
shot_result_by_matchup.sort_values(by='make_percent', ascending=False).head()
Out[41]:
SHOT_RESULT made missed total make_percent
MATCHUP
DEC 22, 2014 - SAS vs. LAC 44 25 69 63.8
JAN 16, 2015 - ATL @ TOR 40 23 63 63.5
DEC 07, 2014 - MEM vs. MIA 40 23 63 63.5
NOV 15, 2014 - CLE vs. ATL 39 23 62 62.9
NOV 14, 2014 - NOP vs. MIN 42 25 67 62.7

Part 2: Questions about SHOT_RESULT and W for one team in a one game

Here we’ll make a dataframe similar to the above but with 'W' as a column

In [53]:
shot_and_game_result_by_matchup = df.groupby(['MATCHUP','W','SHOT_RESULT']).size().unstack()
In [54]:
shot_and_game_result_by_matchup.head()
Out[54]:
SHOT_RESULT made missed
MATCHUP W
DEC 01, 2014 - DEN @ UTA W 39 41
DEC 01, 2014 - LAC vs. MIN W 36 33
DEC 01, 2014 - MIA @ WAS L 31 35
DEC 01, 2014 - MIN @ LAC L 30 38
DEC 01, 2014 - PHI vs. SAS L 28 37

We want 'W' to be a column not part of the index as it is now

In [55]:
shot_and_game_result_by_matchup = \
shot_and_game_result_by_matchup.reset_index().set_index('MATCHUP')
In [56]:
shot_and_game_result_by_matchup.head()
Out[56]:
SHOT_RESULT W made missed
MATCHUP
DEC 01, 2014 - DEN @ UTA W 39 41
DEC 01, 2014 - LAC vs. MIN W 36 33
DEC 01, 2014 - MIA @ WAS L 31 35
DEC 01, 2014 - MIN @ LAC L 30 38
DEC 01, 2014 - PHI vs. SAS L 28 37

Q2.1 which team had the lowest make percentage (in a single game) but still won that game.

Again we'll make a derived column called make_percent

In [57]:
shot_and_game_result_by_matchup['make_percent'] = \
round((shot_and_game_result_by_matchup['made']/\
       shot_and_game_result_by_matchup.sum(axis=1))*100,1)
In [58]:
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='W'].sort_values(by='make_percent').head()
Out[58]:
SHOT_RESULT W made missed make_percent
MATCHUP
FEB 09, 2015 - GSW @ PHI W 2 9 18.2
NOV 16, 2014 - HOU @ OKC W 20 49 29.0
JAN 24, 2015 - CHA vs. NYK W 23 50 31.5
DEC 31, 2014 - OKC vs. PHX W 21 44 32.3
JAN 17, 2015 - CHA vs. IND W 31 65 32.3
In [59]:
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='L'].sort_values(by='make_percent', \
                                                       ascending=False).head()
Out[59]:
SHOT_RESULT W made missed make_percent
MATCHUP
JAN 14, 2015 - HOU @ ORL L 36 22 62.1
NOV 18, 2014 - NYK @ MIL L 31 21 59.6
DEC 03, 2014 - MIA vs. ATL L 33 23 58.9
NOV 14, 2014 - BOS vs. CLE L 45 33 57.7
NOV 19, 2014 - NYK @ MIN L 31 23 57.4

Q2.2 Did winning teams have a higher make percentage on average than losing teams?

We'll explore this question with a aggrigation and a plot

In [60]:
shot_and_game_result_by_matchup.groupby('W')['make_percent'].describe().unstack()
Out[60]:
count mean std min 25% 50% 75% max
W
L 904.0 42.667146 5.415448 26.5 39.2 42.95 46.2 62.1
W 904.0 47.793363 5.583023 18.2 44.2 47.70 51.5 63.8
In [61]:
shot_and_game_result_by_matchup.boxplot(column='make_percent', by='W',figsize=(12,8))
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x122e253c8>

Part3: Questions comparing two teams in one game.

Q 3.1: How often did the winning team have a lower make percentage than the losing team

Here it looks like we need to compare two rows in the shot_and_game_result_by_matchup dataframe. Don't do that! Remember that columns are for comparing not rows. Let me say that again

Compare columns not rows!!!

ok so how do we turn two rows into two columns?

Answer: Pivot the data

In [3]:
shot_result_by_gameid_and_w = \
df.groupby(['GAME_ID','W','SHOT_RESULT']).size().unstack()
In [4]:
shot_result_by_gameid_and_w.head(4)
Out[4]:
SHOT_RESULT made missed
GAME_ID W
21400001 L 29 46
W 36 51
21400002 L 31 36
W 37 33
In [5]:
shot_result_by_gameid_and_w['make_percent'] = \
round((shot_result_by_gameid_and_w['made']/\
       shot_result_by_gameid_and_w.sum(axis=1))*100,1)
In [6]:
shot_result_by_gameid_and_w.head(4)
Out[6]:
SHOT_RESULT made missed make_percent
GAME_ID W
21400001 L 29 46 38.7
W 36 51 41.4
21400002 L 31 36 46.3
W 37 33 52.9
In [7]:
shot_result_by_gameid_and_w = \
shot_result_by_gameid_and_w.reset_index()
In [8]:
shot_result_by_gameid_and_w.head(4)
Out[8]:
SHOT_RESULT GAME_ID W made missed make_percent
0 21400001 L 29 46 38.7
1 21400001 W 36 51 41.4
2 21400002 L 31 36 46.3
3 21400002 W 37 33 52.9
In [9]:
make_percent_by_gameid = \
shot_result_by_gameid_and_w.pivot(index='GAME_ID', columns='W')
In [10]:
make_percent_by_gameid.head()
Out[10]:
SHOT_RESULT made missed make_percent
W L W L W L W
GAME_ID
21400001 29 36 46 51 38.7 41.4
21400002 31 37 36 33 46.3 52.9
21400003 26 22 42 27 38.2 44.9
21400004 28 41 32 60 46.7 40.6
21400005 9 36 25 41 26.5 46.8

Now we can make a comparison between the two columns of make_percent (note this is called a MultiIndex more info here: https://pandas.pydata.org/pandas-docs/stable/advanced.html)

In [11]:
(make_percent_by_gameid['make_percent']['W']<\
make_percent_by_gameid['make_percent']['L']).sum()
Out[11]:
197
In [110]:
(197./904)*100
Out[110]:
21.792035398230087

So about 22% of the time the game is won by the team that has the lower make percentage.