I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure if I'm supposed to look at
pivot/crosstab/indexing - whether I should have a
DataFrames etc...). Could someone give me some pointers on how to do the following:
I have two CSV files (one for year 2010, one for year 2011 - simple transactional data) - The columns are category and amount
AB,100.00 AB,200.00 AC,150.00 AD,500.00
AB,500.00 AC,250.00 AX,900.00
These are loaded into separate DataFrame objects.
What I'd like to do is get the category, the sum of the category, and the frequency of the category, eg:
AB,300.00,2 AC,150.00,1 AD,500.00,1
AB,500.00,1 AC,250.00,1 AX,900.00,1
I can't work out whether I should be using
etc... I can get either the sum or the frequency - I can't seem to get both... It gets a bit more complex because I would like to do it on a month by month basis, but I think if someone would be so kind to point me to the right technique/direction I'll be able to go from there.
Assuming that you have a file called 2010.csv with contents
category,value AB,100.00 AB,200.00 AC,150.00 AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandas data_2010 = pandas.read_csv("/path/to/2010.csv") data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like
value len sum category AB 2 300 AC 1 150 AD 1 500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
pivot_table with the
df.pivot_table(index='category', aggfunc=[len, sum]) len sum value value category AB 2 300 AC 1 150 AD 1 500
It is possible to do this using
pivot_table for those interested:
In : df Out: category value 0 AB 100 1 AB 200 2 AC 150 3 AD 500 In : df.pivot_table(rows='category', aggfunc=[len, np.sum]) Out: len sum value value category AB 2 300 AC 1 150 AD 1 500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In : df Out: category value value2 0 AB 100 5 1 AB 200 5 2 AC 150 5 3 AD 500 5 In : df.pivot_table(rows='category', aggfunc=[len, np.sum]) Out: len sum value value2 value value2 category AB 2 2 300 10 AC 1 1 150 5 AD 1 1 500 5
The main reason to use
np.sum is that you get NA-handling from the latter. Probably could intercept the Python built-in, will make a note about that now.