Python Pandas: remove entries based on the number of occurrences


Question

I'm trying to remove entries from a DataFrame whose tag occurs fewer than 100 times. The DataFrame, data, looks like this:

pid   tag
1     23    
1     45
1     62
2     24
2     45
3     34
3     25
3     62

Now I count the number of tag occurrences like this:

bytag = data.groupby('tag').aggregate(np.count_nonzero)

But then I can't figure out how to remove the entries whose count is low.

11/19/2012 1:20:07 AM

Accepted Answer

Edit: Thanks to @WesMcKinney for showing this much more direct way (my original approach follows it):

data[data.groupby('tag').pid.transform(len) > 1]
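A minimal, self-contained sketch of the one-liner above, assuming the sample data from the question. The name `min_count` is introduced here for illustration only; for the real data set you would presumably set it to 100.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

# Hypothetical threshold name; the question asks for 100 on the real data.
min_count = 2

# transform(len) returns a Series aligned with `data`: each row holds the
# size of the group (tag) that the row belongs to.
group_sizes = data.groupby('tag').pid.transform(len)

# Keep only the rows whose tag occurs at least `min_count` times.
result = data[group_sizes >= min_count]
print(result)
```

Because the transformed Series has the same index as `data`, it can be used directly as a boolean mask, with no intermediate list of tags.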

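import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

# Count the entries per tag. np.count_nonzero counts nonzero pid values,
# which equals the number of rows here because no pid is 0.
bytag = data.groupby('tag').aggregate(np.count_nonzero)
# Collect the tags that occur at least twice.
tags = bytag[bytag.pid >= 2].index
# Keep only the rows whose tag is in that set.
print(data[data['tag'].isin(tags)])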

yields

   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62
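A variant of the same idea, offered as a sketch rather than as part of the original answer: `groupby(...).size()` counts rows directly, so the result does not depend on the values in pid being nonzero. The variable names `counts` and `frequent_tags` are chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

# size() gives a Series mapping each tag to its number of rows.
counts = data.groupby('tag').size()

# Tags that occur at least twice (use 100 for the case in the question).
frequent_tags = counts[counts >= 2].index

# Keep only the rows whose tag is frequent enough.
result = data[data['tag'].isin(frequent_tags)]
print(result)
```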
11/25/2012 12:31:33 PM

New in pandas 0.12: groupby objects have a filter method, which lets you do this type of operation directly:

In [11]: g = data.groupby('tag')

In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
Out[12]:
   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62

The function (the first argument of filter) is applied to each group (a sub-DataFrame) and must return True or False. The result contains the rows of the original DataFrame that belong to groups for which the function returned True.
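To make the mechanics concrete, here is a small sketch using a named function instead of a lambda. The helper `occurs_at_least` and its `min_count` parameter are hypothetical names introduced for illustration; they are not part of pandas.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

def occurs_at_least(group, min_count):
    # `group` is the sub-DataFrame holding all rows that share one tag.
    # Returning a single boolean tells filter whether to keep those rows.
    return len(group) >= min_count

# Wrap the helper in a lambda so that filter receives a one-argument callable.
result = data.groupby('tag').filter(lambda g: occurs_at_least(g, 2))
print(result)
```

For the case in the question, passing 100 instead of 2 should drop every tag that occurs fewer than 100 times.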

Note: in 0.12 the rows come back grouped by tag rather than in the order of the original DataFrame. This was fixed in 0.13+:

In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
Out[21]: 
   pid  tag
1    1   45
4    2   45
2    1   62
7    3   62

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow