I am working with the pandas library and I want to add two new columns to a dataframe `df`

with n columns (n > 0).

These new columns result from the application of a function to one of the columns in the dataframe.

The function to apply is like:

```
def calculate(x):
...operate...
return z, y
```

One method for creating a new column for a function returning only a value is:

```
df['new_col']) = df['column_A'].map(a_function)
```

So, what I want, and tried unsuccesfully (*), is something like:

```
(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)
```

What the best way to accomplish this could be ? I scanned the documentation with no clue.

**`df['column_A'].map(calculate)`

returns a pandas Series each item consisting of a tuple z, y. And trying to assign this to two dataframe columns produces a ValueError.*

I'd just use `zip`

:

```
In [1]: from pandas import *
In [2]: def calculate(x):
...: return x*2, x*3
...:
In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
In [4]: df
Out[4]:
a b
0 1 2
1 2 3
2 3 4
In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))
In [6]: df
Out[6]:
a b A1 A2
0 1 2 2 3
1 2 3 4 6
2 3 4 6 9
```

The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with `from pandas import *`

. Also, the `map`

method should be reserved for those times when passing it a dictionary or Series. It can take a function but this is what `apply`

is used for.

So, if you must use the above approach, I would write it like this

```
df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
```

There's actually no reason to use zip here. You can simply do this:

```
df["A1"], df["A2"] = calculate(df['a'])
```

This second method is also much faster on larger DataFrames

```
df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})
```

DataFrame created with 300,000 rows

```
%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

60x faster than zip

Apply is generally not much faster than iterating over a Python list. Let's test the performance of a for-loop to do the same thing as above

```
%%timeit
A1, A2 = [], []
for val in df['a']:
A1.append(val**2)
A2.append(val**3)
df['A1'] = A1
df['A2'] = A2
298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

So this is twice as slow which isn't a terrible performance regression, but if we cythonize the above, we get much better performance. Assuming, you are using ipython:

```
%load_ext cython
%%cython
cpdef power(vals):
A1, A2 = [], []
cdef double val
for val in vals:
A1.append(val**2)
A2.append(val**3)
return A1, A2
%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

You can get even greater speed improvements if you use the direct vectorized operations.

```
%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

This takes advantage of NumPy's extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original.

`apply`

The above example should clearly show how slow `apply`

can be, but just so its extra clear let's look at the most basic example. Let's square a Series of 10 million numbers with and without apply

```
s = pd.Series(np.random.rand(10000000))
%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Without apply is 50x faster

```
%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Licensed under: CC-BY-SA with attribution

Not affiliated with: Stack Overflow