Two-Dimensional VLOOKUP in Pandas¶

One thing that I particularly enjoy about teaching Pandas is showing people how much better it is than Excel. Every so often I get a question about how to do something in Pandas (that is easy in Excel) that requires me to think for a few minutes here is an example of such a question. My hope is that you find this tutorial informative if you have any questions or spot any mistakes feel free to reachout at jason@jbedford.net

Question: Given two tables as shown below

title

Fill in the column ‘Tax Rate’ given the data in columns 'Salary' and 'Dep' (short for number of dependence) using the the data in the Tax Rate Table. Note Salary should be classified as the nearest number below the actual salary and Dep should be an exact match.

Solution in Excell¶

In excel you would click on cell D2 and enter the following

=vlookup(B2,F:M,C2+2)

Then drag this down the rows of column 'D' to fill in the values.

Solution in Pandas¶

I'll cut to the chase and show you how I ended up solving this problem. At the end I'll discuss some other ideas I had. Let's get started by importing the data which I have made into two CSVs.

In [21]:

import pandas as pd

In [22]:

df_employees = pd.read_csv('hr_employee_data.csv')

In [24]:

df_employees.head()

Out[24]:

	Name	Salary	Dep
0	Booth Ron	55303	1
1	Brennan Tito	67330	3
2	Caballero Rich	82076	2
3	Carmichael Jim	58406	0
4	Cristinelli Dave	82096	2

In [6]:

df_tax = pd.read_csv('tax_table.csv')

In [7]:

df_tax.head()

Out[7]:

	Salary	0	1	2	3	4	5	6
0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
1	35000	1	0.8	0.6	0.5	0.4	0.3	0.2
2	40000	3	2.4	1.9	1.5	1.2	1.0	0.8
3	45000	5	4.0	3.2	2.6	2.1	1.7	1.4
4	50000	7	5.6	4.5	3.6	2.9	2.3	1.8

First I defied a tax bracket from the salary column in df_tax (This may not make sense at the moment but hang on I'll explain at the end)

In [8]:

df_tax_braket = pd.DataFrame()
df_tax_braket['Salary'] = df_tax['Salary']

And assign each bracket a number 0 to n

In [15]:

df_tax_braket['Tax_Braket'] = df_tax['Salary'].rank().astype('int')

In [44]:

df_tax_braket.iloc[::4]

Out[44]:

	Salary	Tax_Braket
0	0	1
4	50000	5
8	70000	9
12	90000	13
16	110000	17

Now we can join this with df_employees using pd.merge_asof on column 'Salary'

In [26]:

df_employees = pd.merge_asof(df_employees.sort_values('Salary'),\
                             df_tax_braket,on='Salary')

In [30]:

df_employees.iloc[::4]

Out[30]:

	Name	Salary	Dep	Tax_Braket
0	Malone Carl	21635	3	1
4	Keller Lynda G	31450	2	1
8	Sebastian Julie	38288	3	2
12	Kopfe Siobhan	44206	3	3
16	Carmichael Jim	58406	0	6
20	Nicolaus Gary	80300	1	11
24	Rodriguez Michael	83516	3	11
28	Sonnenberg Diane	84809	1	11
32	Hendricks Ann	99530	1	14

Second I took df_tax and added Tax_Braket as a column and dropped the Salary column.

In [32]:

df_tax['Tax_Braket'] = df_tax_braket['Tax_Braket']

In [33]:

df_tax = df_tax.drop('Salary',axis=1)

In [34]:

df_tax.head()

Out[34]:

	0	1	2	3	4	5	6	Tax_Braket
0	0	0.0	0.0	0.0	0.0	0.0	0.0	1
1	1	0.8	0.6	0.5	0.4	0.3	0.2	2
2	3	2.4	1.9	1.5	1.2	1.0	0.8	3
3	5	4.0	3.2	2.6	2.1	1.7	1.4	4
4	7	5.6	4.5	3.6	2.9	2.3	1.8	5

Finally we can merge df_tax with df_employees using the regular merge. We need to merge on two columns one is going to be Tax_Braket but the other ... well the other is not a column it's a list of columns... Fear not! We can restructure df_tax to make this work.

In [35]:

df_tax = df_tax.set_index('Tax_Braket').T.unstack().to_frame().reset_index()

In [38]:

df_tax.columns = ['Tax_Braket', 'Dep', 'Tax_Rate']

In [39]:

df_tax['Dep'] = df_tax['Dep'].astype('int')

In [43]:

df_tax.iloc[::12]

Out[43]:

	Tax_Braket	Dep	Tax_Rate
0	1	0	0.0
12	2	5	0.3
24	4	3	2.6
36	6	1	7.2
48	7	6	2.9
60	9	4	6.2
72	11	2	12.2
84	13	0	23.0
96	14	5	8.2
108	16	3	14.9

ok now we can merge on 'Tax_Braket' and 'Dep'

In [45]:

df_employees = pd.merge(df_employees,df_tax,on=['Tax_Braket', 'Dep'])

In [47]:

df_employees.iloc[::4]

Out[47]:

	Name	Salary	Dep	Tax_Braket	Tax_Rate
0	Malone Carl	21635	3	1	0.0
4	Keller Lynda G	31450	2	1	0.0
8	Jacobs Greg	35879	1	2	0.8
12	Kopfe Siobhan	44206	3	3	1.5
16	Carmichael Jim	58406	0	6	9.0
20	Nicolaus Gary	80300	1	11	15.2
24	Stenquist Allen	84793	2	11	12.2
28	Martyr Angela	84672	5	11	6.2
32	Hendricks Ann	99530	1	14	20.0

Conclusion¶

So, there you have it... not as simple as in Excel, you're thinking? I would agree with you it is a little more work but once you understand it you can package these steps into a function to help abstract away the suffering.

A (maybe impossible) more efficient solution: the first thing that came to mind was to restructure df_tax as it is (before adding Tax_Rate) and merging with merge_asof on the Salary column and exact match merging on the Dep column all in one function call. I dont think it's possible as of v0.22 (please contact me if I'm wrong).

In [ ]:

	0	1	2	3	4	5	6	Tax_Braket
0	0	0.0	0.0	0.0	0.0	0.0	0.0	1
1	1	0.8	0.6	0.5	0.4	0.3	0.2	2
2	3	2.4	1.9	1.5	1.2	1.0	0.8	3
3	5	4.0	3.2	2.6	2.1	1.7	1.4	4
4	7	5.6	4.5	3.6	2.9	2.3	1.8	5

	0	1	2	3	4	5	6	Tax_Braket
0	0	0.0	0.0	0.0	0.0	0.0	0.0	1
1	1	0.8	0.6	0.5	0.4	0.3	0.2	2
2	3	2.4	1.9	1.5	1.2	1.0	0.8	3
3	5	4.0	3.2	2.6	2.1	1.7	1.4	4
4	7	5.6	4.5	3.6	2.9	2.3	1.8	5

	0	1	2	3	4	5	6	Tax_Braket
0	0	0.0	0.0	0.0	0.0	0.0	0.0	1
1	1	0.8	0.6	0.5	0.4	0.3	0.2	2
2	3	2.4	1.9	1.5	1.2	1.0	0.8	3
3	5	4.0	3.2	2.6	2.1	1.7	1.4	4
4	7	5.6	4.5	3.6	2.9	2.3	1.8	5