I stumbled across pandas and it looks ideal for the simple calculations I'd like to do. I have a SAS background and was thinking it'd replace `proc freq` -- it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure whether I should be looking at `pivot`/`crosstab`/indexing, whether I should have a `Panel` or `DataFrame`s, etc.). Could someone give me some pointers on how to do the following?

I have two CSV files (one for year 2010, one for year 2011 -- simple transactional data). The columns are category and amount.

2010:

```
AB,100.00
AB,200.00
AC,150.00
AD,500.00
```

2011:

```
AB,500.00
AC,250.00
AX,900.00
```

These are loaded into separate DataFrame objects.

What I'd like to do is get the category, the sum for the category, and the frequency of the category, e.g.:

2010:

```
AB,300.00,2
AC,150.00,1
AD,500.00,1
```

2011:

```
AB,500.00,1
AC,250.00,1
AX,900.00,1
```

I can't work out whether I should be using `pivot`/`crosstab`/`groupby`/an index, etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would like to do it on a month-by-month basis, but I think if someone would be so kind as to point me to the right technique/direction, I'll be able to take it from there.
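For reference, a month-by-month version could look like the following sketch. This assumes each row also carries a date column, which the sample data above does not have, so the data here is purely hypothetical:

```python
import io

import pandas as pd

# Hypothetical data: the question's layout with an added date column.
df = pd.read_csv(io.StringIO(
    "date,category,value\n"
    "2010-01-15,AB,100.00\n"
    "2010-01-20,AB,200.00\n"
    "2010-02-05,AC,150.00\n"
    "2010-02-10,AD,500.00\n"
), parse_dates=["date"])

# Group by calendar month and category; aggregate the sum and the count.
monthly = (
    df.groupby([df["date"].dt.to_period("M"), "category"])["value"]
      .agg(["sum", "count"])
)
print(monthly)
```

This produces one row per (month, category) pair, with both the total and the frequency as columns.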

Assuming that you have a file called 2010.csv with contents

```
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
```

Then, using the ability to apply multiple aggregation functions following a groupby, you can say:

```
import pandas

data_2010 = pandas.read_csv("/path/to/2010.csv")
# Apply both aggregations to every value column, per category.
data_2010.groupby("category").agg([len, sum])
```

You should get a result that looks something like

```
         value
           len  sum
category
AB           2  300
AC           1  150
AD           1  500
```

Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
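As an aside (not part of the answer above): in recent pandas versions (0.25+), named aggregation produces the flat `category, sum, frequency` layout the question asks for directly, without hierarchical columns. A minimal sketch, with the CSV inlined for self-containment:

```python
import io

import pandas as pd

# Inline stand-in for the 2010.csv file from the question.
csv_2010 = io.StringIO("category,value\nAB,100.00\nAB,200.00\nAC,150.00\nAD,500.00\n")
df = pd.read_csv(csv_2010)

# Named aggregation (pandas >= 0.25) gives flat, readable column names.
summary = (
    df.groupby("category")["value"]
      .agg(total="sum", frequency="count")
      .reset_index()
)
print(summary)
```

The result has one row per category with `total` and `frequency` columns, matching the desired output.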

v0.21 answer: use `pivot_table` with the `index` parameter:

```
df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category
AB           2   300
AC           1   150
AD           1   500
```

For `<= v0.12`, it is possible to do this using `pivot_table` (with the older `rows` parameter) for those interested:

```
In [7]: import numpy as np

In [8]: df
Out[8]:
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
           len   sum
         value value
category
AB           2   300
AC           1   150
AD           1   500
```

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

```
In [12]: df
Out[12]:
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
           len           sum
         value value2  value value2
category
AB           2      2    300     10
AC           1      1    150      5
AD           1      1    500      5
```
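Since the result's columns are hierarchically indexed, a single block can be pulled out by its outer label. A small sketch, using string aggregation names (which newer pandas also accepts) and the modern `index` parameter in place of `rows`:

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO(
    "category,value,value2\nAB,100,5\nAB,200,5\nAC,150,5\nAD,500,5\n"
))

# String function names behave like the callables shown above.
result = df.pivot_table(index="category", aggfunc=["count", "sum"])

# Outer level selects the aggregation, inner level the data column.
print(result["sum"]["value"])
```

Selecting `result["sum"]` first narrows to the sums of every data column; chaining `["value"]` then yields a plain Series indexed by category.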

The main reason to use `np.sum` vs. `__builtin__.sum` is that you get NA-handling from the former. The Python built-in could probably be intercepted as well; will make a note about that now.
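That NA-handling difference can be seen directly; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# pandas' own sum (which np.sum dispatches to here) skips missing values.
print(s.sum())

# Iterating with the Python builtin propagates NaN instead.
print(sum(s))
```

The first line prints `3.0`, while the builtin `sum` yields `nan` because `NaN` poisons ordinary addition.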
