In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
How should I go about this instead?
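For reference, here is a minimal reproduction of the slow approach on synthetic data (the column names follow the question; the data itself is made up):

```python
import pandas as pd

# Tiny stand-ins for the question's frames (data is invented)
male_trips = pd.DataFrame({'start_station_id': [1, 2, 2, 3, 3, 3]})
stations = pd.DataFrame({'id': [1, 2, 3, 4]})

# The list comprehension from the question: one full scan of
# male_trips per station, which is what makes it slow
mc = [sum(male_trips['start_station_id'] == id) for id in stations['id']]
```

Each iteration builds a full boolean mask over `male_trips`, so the cost scales with `len(stations) * len(male_trips)`.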
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer

setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"

Timer(a, setup).timeit(100)
Timer(b, setup).timeit(100)
and here is the result:
In : Timer(a,setup).timeit(100)  # <- this is value_counts
Out: 9.709594964981079

In : Timer(b,setup).timeit(100)  # <- this is groupby / size
Out: 1.5574288368225098
Note that, at these speeds, for exploring data interactively, value_counts is marginally quicker to type and easier to remember!
I'd do it like Vishal, but instead of sum() I'd use size() to get a count of the number of rows in each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
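On toy data (values invented for illustration), this returns a Series indexed by station id. One caveat worth knowing: stations with zero trips are simply absent from the result, and a reindex against stations['id'] can fill them back in:

```python
import pandas as pd

# Toy frames; the real ones come from the question
male_trips = pd.DataFrame({'start_station_id': [1, 2, 2, 3, 3, 3]})
stations = pd.DataFrame({'id': [1, 2, 3, 4]})

# One pass over the data: group rows by station, count rows per group
df = male_trips.groupby('start_station_id').size()

# Station 4 has no trips, so it is missing from df;
# reindex aligns the counts to stations['id'] and fills gaps with 0
counts = df.reindex(stations['id'], fill_value=0)
```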
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
    male_trips[male_trips.start_station_id.isin(stations.id.values)]
    .start_station_id
    .value_counts()
)
and this will only give counts for station IDs actually found in stations.
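Sketched on toy data (values invented), the filtered version behaves like this:

```python
import pandas as pd

# Toy frames: station 5 appears in trips but not in stations,
# and station 3 appears in stations but has no trips
male_trips = pd.DataFrame({'start_station_id': [1, 2, 2, 5]})
stations = pd.DataFrame({'id': [1, 2, 3]})

# Keep only trips whose start station is a known station id,
# then count occurrences of each remaining id
count_series = (
    male_trips[male_trips.start_station_id.isin(stations.id.values)]
    .start_station_id
    .value_counts()
)
# Station 5 is dropped by the isin filter; station 3 simply
# does not appear in count_series because it has no trips
```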