Python Pandas For Your Grandpa | Section 2.5 | Series Apply
March 18, 2020

.apply()

Suppose you have some cool, complicated function like this one, which takes in a scalar value, x, subtracts 1 if it’s less than 1.5, and adds 1 otherwise…

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

You want to apply that function to each element of a series like this.

import pandas as pd
foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])

Lucky for you, Series has an apply() method that lets you do exactly this. In this case, you’d call foo.apply(), passing the function callable, my_func. The output is a new Series with the results of my_func applied to each element of the original Series.

foo.apply(my_func)
## 0    0.3
## 1    2.9
## 2    0.2
## 3    0.0
## 4    2.7
## 5    0.3
## dtype: float64
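As a small sketch of what "a new Series" means here, the example below checks that apply() keeps the input's index labels and leaves the input itself untouched. The series bar and its labels a, b, c are hypothetical, chosen only for illustration.

```python
import pandas as pd

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

# Hypothetical labels, not part of the original example
bar = pd.Series([1.3, 1.9, 1.0], index=["a", "b", "c"])
result = bar.apply(my_func)

# The result carries the same index labels as the input
print(result.index.tolist())  # ['a', 'b', 'c']

# The original Series is not modified
print(bar.tolist())  # [1.3, 1.9, 1.0]
```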

Now suppose you generalize your function, giving it some parameters like this.

def my_func_with_params(x, s=1.5, a=1):
    return x - a if x < s else x + a

How do you apply this function to each element of the series, also specifying the parameters you want to use? In this case you just call .apply() with your parameters as named arguments. Any extra keyword arguments you give to .apply() are forwarded to your function.

foo.apply(my_func_with_params, s=1.1, a=10)
## 0    11.3
## 1    11.9
## 2    11.2
## 3    -9.0
## 4    11.7
## 5    11.3
## dtype: float64
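If you'd rather pass the parameters by position, Series.apply() also accepts an args tuple. The sketch below assumes the same foo and my_func_with_params as above and simply checks that both call styles give the same result.

```python
import pandas as pd

def my_func_with_params(x, s=1.5, a=1):
    return x - a if x < s else x + a

foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])

# Parameters passed by name, as in the example above
by_keyword = foo.apply(my_func_with_params, s=1.1, a=10)

# args supplies the positional arguments that follow x, in order: s, then a
by_position = foo.apply(my_func_with_params, args=(1.1, 10))

print(by_keyword.equals(by_position))
```

Named arguments are usually easier to read, since the reader doesn't need to remember the order of s and a.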

Performance

The apply() method is great, because it’s easy to use and it generalizes well, but sometimes it’s slow. If we apply my_func() from the first example to a series with 10M values, it takes over 2 seconds to execute on my laptop.

import timeit
import numpy as np

# Create a series of 10M values
bigfoo = pd.Series(np.random.uniform(low=1, high=2, size=10000000))

# Apply my_func() to bigfoo
timeit.timeit(lambda: bigfoo.apply(my_func), number=1)

## 2.11 seconds

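This is one of those cases where the function is simple enough that it’d be better to build it using pure NumPy. For example, the function below applies the same logic as my_func() (returning a NumPy array rather than a Series) and runs far faster. Note that the timing below covers ten calls, not one, so each call takes roughly a tenth of a second.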

def my_numpy_func(x):
    a = x.to_numpy()
    return np.where(a < 1.5, a - 1, a + 1)
    
timeit.timeit(lambda: my_numpy_func(bigfoo), number = 10)

## 1.05 seconds
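The two timings above use different number values (1 and 10), which makes them awkward to compare directly. One way to put them on equal footing, sketched here under the assumption that a smaller series is representative enough, is to run both with the same number and divide by it. The names smallfoo and runs are made up for this sketch, and the timings will vary by machine, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

def my_numpy_func(x):
    a = x.to_numpy()
    return np.where(a < 1.5, a - 1, a + 1)

# A smaller series (100k values) so the sketch runs quickly
smallfoo = pd.Series(np.random.uniform(low=1, high=2, size=100_000))

# Same number of runs for both, then average to get seconds per call
runs = 5
apply_per_call = timeit.timeit(lambda: smallfoo.apply(my_func), number=runs) / runs
numpy_per_call = timeit.timeit(lambda: my_numpy_func(smallfoo), number=runs) / runs

print(f"apply: {apply_per_call:.4f} s per call")
print(f"numpy: {numpy_per_call:.4f} s per call")
```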

The NumPy solution is faster because it’s vectorized. Without going into too much detail, it hands the entire computation to compiled C code, which is fast, whereas the apply() solution calls a Python function once per element, which is slow.
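One difference worth knowing: my_numpy_func() returns a plain NumPy array, so the index is lost. If you want a Series back, a minimal sketch (my_vectorized_func is a hypothetical name, not something from pandas) is to wrap the result and reuse the input's index, then compare it against the apply() version on the small series from earlier.

```python
import numpy as np
import pandas as pd

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

def my_vectorized_func(s):
    # Hypothetical helper: same logic as my_func, applied to the whole Series at once
    a = s.to_numpy()
    return pd.Series(np.where(a < 1.5, a - 1, a + 1), index=s.index)

foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])

vectorized = my_vectorized_func(foo)
applied = foo.apply(my_func)

# Check that the two approaches agree element by element
print(np.allclose(vectorized, applied))
```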

