import pandas as pd %matplotlib inline
The goal of this worksheet is to provide practical examples of aggregating (with group by), plotting, and pivoting data with the Pandas Python package.
This worksheet is available as a jupyter notebook on github here: https://github.com/JBed/Pandas_Analysis_Worksheet
Get the data here: https://www.kaggle.com/dansbecker/nba-shot-logs
Finally, if you have any questions, comments, or believe that I did anything incorrectly feel free to email me here: email@example.com
df = pd.read_csv('nba-shot-logs.zip')
The data is structured so that each row corresponds to one shot taking during the 2014-2015 NBA season (We exclude free throws).
Index(['GAME_ID', 'MATCHUP', 'LOCATION', 'W', 'FINAL_MARGIN', 'SHOT_NUMBER', 'PERIOD', 'GAME_CLOCK', 'SHOT_CLOCK', 'DRIBBLES', 'TOUCH_TIME', 'SHOT_DIST', 'PTS_TYPE', 'SHOT_RESULT', 'CLOSEST_DEFENDER', 'CLOSEST_DEFENDER_PLAYER_ID', 'CLOSE_DEF_DIST', 'FGM', 'PTS', 'player_name', 'player_id'], dtype='object')
Most of the column names are self-explanatory. One thing that initially confused me was that there is no column telling us the team of the player taking the shot. It turns out that that information is hidden in the MATCHUP column.
array(['MAR 04, 2015 - CHA @ BKN', 'MAR 04, 2015 - BKN vs. CHA'], dtype=object)
We see that the name of the team of the player taking the shot is the first team listed after the date. It turns out that having things structured this way is actually very convenient.