Two-Dimensional VLOOKUP in Pandas

One thing that I particularly enjoy about teaching Pandas is showing people how much better it is than Excel. Every so often I get a question about how to do something in Pandas (that is easy in Excel) that requires me to think for a few minutes here is an example of such a question. My hope is that you find this tutorial informative if you have any questions or spot any mistakes feel free to reachout at jason@jbedford.net

Question: Given two tables as shown below

title

Fill in the column ‘Tax Rate’ given the data in columns 'Salary' and 'Dep' (short for number of dependence) using the the data in the Tax Rate Table. Note Salary should be classified as the nearest number below the actual salary and Dep should be an exact match.

Solution in Excell

In excel you would click on cell D2 and enter the following

=vlookup(B2,F:M,C2+2)

Then drag this down the rows of column 'D' to fill in the values.

Solution in Pandas

I'll cut to the chase and show you how I ended up solving this problem. At the end I'll discuss some other ideas I had. Let's get started by importing the data which I have made into two CSVs.

In [21]:
import pandas as pd
In [22]:
df_employees = pd.read_csv('hr_employee_data.csv')
In [24]:
df_employees.head()
Out[24]:
Name Salary Dep
0 Booth Ron 55303 1
1 Brennan Tito 67330 3
2 Caballero Rich 82076 2
3 Carmichael Jim 58406 0
4 Cristinelli Dave 82096 2
In [6]:
df_tax = pd.read_csv('tax_table.csv')
In [7]:
df_tax.head()
Out[7]:
Salary 0 1 2 3 4 5 6
0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 35000 1 0.8 0.6 0.5 0.4 0.3 0.2
2 40000 3 2.4 1.9 1.5 1.2 1.0 0.8
3 45000 5 4.0 3.2 2.6 2.1 1.7 1.4
4 50000 7 5.6 4.5 3.6 2.9 2.3 1.8

First I defied a tax bracket from the salary column in df_tax (This may not make sense at the moment but hang on I'll explain at the end)

In [8]:
df_tax_braket = pd.DataFrame()
df_tax_braket['Salary'] = df_tax['Salary']

And assign each bracket a number 0 to n

In [15]:
df_tax_braket['Tax_Braket'] = df_tax['Salary'].rank().astype('int')
In [44]:
df_tax_braket.iloc[::4]
Out[44]:
Salary Tax_Braket
0 0 1
4 50000 5
8 70000 9
12 90000 13
16 110000 17

Now we can join this with df_employees using pd.merge_asof on column 'Salary'

In [26]:
df_employees = pd.merge_asof(df_employees.sort_values('Salary'),\
                             df_tax_braket,on='Salary')
In [30]:
df_employees.iloc[::4]
Out[30]:
Name Salary Dep Tax_Braket
0 Malone Carl 21635 3 1
4 Keller Lynda G 31450 2 1
8 Sebastian Julie 38288 3 2
12 Kopfe Siobhan 44206 3 3
16 Carmichael Jim 58406 0 6
20 Nicolaus Gary 80300 1 11
24 Rodriguez Michael 83516 3 11
28 Sonnenberg Diane 84809 1 11
32 Hendricks Ann 99530 1 14

Second I took df_tax and added Tax_Braket as a column and dropped the Salary column.

In [32]:
df_tax['Tax_Braket'] = df_tax_braket['Tax_Braket']
In [33]:
df_tax = df_tax.drop('Salary',axis=1)
In [34]:
df_tax.head()
Out[34]:
0 1 2 3 4 5 6 Tax_Braket
0 0 0.0 0.0 0.0 0.0 0.0 0.0 1
1 1 0.8 0.6 0.5 0.4 0.3 0.2 2
2 3 2.4 1.9 1.5 1.2 1.0 0.8 3
3 5 4.0 3.2 2.6 2.1 1.7 1.4 4
4 7 5.6 4.5 3.6 2.9 2.3 1.8 5

Finally we can merge df_tax with df_employees using the regular merge. We need to merge on two columns one is going to be Tax_Braket but the other ... well the other is not a column it's a list of columns... Fear not! We can restructure df_tax to make this work.

In [35]:
df_tax = df_tax.set_index('Tax_Braket').T.unstack().to_frame().reset_index()
In [38]:
df_tax.columns = ['Tax_Braket', 'Dep', 'Tax_Rate']
In [39]:
df_tax['Dep'] = df_tax['Dep'].astype('int')
In [43]:
df_tax.iloc[::12]
Out[43]:
Tax_Braket Dep Tax_Rate
0 1 0 0.0
12 2 5 0.3
24 4 3 2.6
36 6 1 7.2
48 7 6 2.9
60 9 4 6.2
72 11 2 12.2
84 13 0 23.0
96 14 5 8.2
108 16 3 14.9

ok now we can merge on 'Tax_Braket' and 'Dep'

In [45]:
df_employees = pd.merge(df_employees,df_tax,on=['Tax_Braket', 'Dep'])
In [47]:
df_employees.iloc[::4]
Out[47]:
Name Salary Dep Tax_Braket Tax_Rate
0 Malone Carl 21635 3 1 0.0
4 Keller Lynda G 31450 2 1 0.0
8 Jacobs Greg 35879 1 2 0.8
12 Kopfe Siobhan 44206 3 3 1.5
16 Carmichael Jim 58406 0 6 9.0
20 Nicolaus Gary 80300 1 11 15.2
24 Stenquist Allen 84793 2 11 12.2
28 Martyr Angela 84672 5 11 6.2
32 Hendricks Ann 99530 1 14 20.0

Conclusion

So, there you have it... not as simple as in Excel, you're thinking? I would agree with you it is a little more work but once you understand it you can package these steps into a function to help abstract away the suffering.

A (maybe impossible) more efficient solution: the first thing that came to mind was to restructure df_tax as it is (before adding Tax_Rate) and merging with merge_asof on the Salary column and exact match merging on the Dep column all in one function call. I dont think it's possible as of v0.22 (please contact me if I'm wrong).

In [ ]: