python pandas: Remove duplicates by column A, keeping the row with the highest value in column B


Question

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it's designed for exact duplicates, so there's no mention of criteria for selecting which rows get kept.

I'm guessing there's probably an easy way to do this---maybe as easy as sorting the dataframe before dropping duplicates---but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
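
For reference, a minimal sketch that builds the example frame (the variable name df is my own choice, matching the answers below):

import pandas as pd

# Example data from the question: duplicate keys in A, varying values in B
df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 10]})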


Accepted Answer

This keeps the last occurrence for each A, which is not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10
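
If you sort by B before dropping, keep="last" does pick the maximum; a minimal sketch (the row order changes, so add sort_index() afterwards if the original order matters):

# Sort ascending by B so the last occurrence of each A is also its maximum
df.sort_values('B').drop_duplicates(subset='A', keep='last')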

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
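
A vectorized variant of the same idea, without apply, is to use the group-wise idxmax to select whole rows (a sketch, assuming B has no all-NaN groups):

# idxmax returns the index label of the max B within each A group;
# loc then selects those rows from the original frame
df.loc[df.groupby('A')['B'].idxmax()]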

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible, and the .ix indexer used in earlier revisions of that answer is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
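
Note that with more than two columns, groupby(...).max() takes each column's maximum independently, so it can mix values from different rows, unlike the drop_duplicates approach, which keeps whole rows. A small sketch of the difference (the extra column C is purely illustrative):

import pandas as pd

df2 = pd.DataFrame({'A': [1, 1], 'B': [10, 20], 'C': [5, 1]})

# Column-wise maxima per group: A=1, B=20, C=5 (values taken from different rows)
df2.groupby('A', as_index=False).max()

# Whole-row selection: keeps B=20 together with its own C=1
df2.sort_values('B', ascending=False).drop_duplicates('A')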


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow