Python Pandas For Your Grandpa | Section 2.5 | Series Apply
import numpy as np
import pandas as pd
Suppose you have some cool, complicated function like this one, which takes in a scalar value,
x, subtracts 1 if it’s less than 1.5, and adds 1 otherwise…
def my_func(x):
    return x - 1 if x < 1.5 else x + 1
You want to apply that function to each element of a series like this.
foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])
Lucky for you, Series has an apply() method that lets you do exactly this. In this case, you’d call foo.apply(), passing in the function callable, my_func. The output is a new Series with the results of my_func applied to each element of the original Series.
foo.apply(my_func)
## 0    0.3
## 1    2.9
## 2    0.2
## 3    0.0
## 4    2.7
## 5    0.3
## dtype: float64
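Two details are easy to miss here: apply() accepts any callable (so a lambda works just as well as a named function), and it hands back a new Series rather than changing foo in place. The snippet below is a minimal sketch of both points; the names result and result_lambda are made up for illustration.

```python
import pandas as pd

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])

# apply() returns a new Series; foo itself is left untouched
result = foo.apply(my_func)

# Any callable works, including an anonymous function with the same logic
result_lambda = foo.apply(lambda x: x - 1 if x < 1.5 else x + 1)

# Both calls produce the same values, and foo still holds the original data
print(result.equals(result_lambda))
print(foo.tolist())
```

If you want to keep the transformed values, assign the result to a name (or back to foo to overwrite the original).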
Now suppose you generalize your function, giving it some parameters like this.
def my_func_with_params(x, s=1.5, a=1):
    return x - a if x < s else x + a
How do you apply this function to each element of the series, also specifying the parameters you want to use? In this case you just call .apply(), passing in named arguments, which get forwarded to your function.
foo.apply(my_func_with_params, s=1.1, a=10)
## 0    11.3
## 1    11.9
## 2    11.2
## 3    -9.0
## 4    11.7
## 5    11.3
## dtype: float64
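Named arguments aren’t the only option. Series.apply() also takes an args tuple, whose values are passed positionally after the element itself, and any parameter you leave out keeps its default. Here is a small sketch comparing the styles; the variable names by_keyword, by_position and only_a are just for illustration.

```python
import pandas as pd

def my_func_with_params(x, s=1.5, a=1):
    return x - a if x < s else x + a

foo = pd.Series([1.3, 1.9, 1.2, 1.0, 1.7, 1.3])

# Extra parameters passed by name, as in the example above
by_keyword = foo.apply(my_func_with_params, s=1.1, a=10)

# The same parameters passed positionally via args: s=1.1, a=10
by_position = foo.apply(my_func_with_params, args=(1.1, 10))

# Override only a; s keeps its default of 1.5
only_a = foo.apply(my_func_with_params, a=10)

print(by_keyword.equals(by_position))
print(only_a)
```

Named arguments are usually the safer choice, since they don’t depend on the order of the function’s parameters.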
The apply() method is great, because it’s easy to use and it generalizes well, but sometimes it’s slow. If we apply my_func() from the first example to a series with 10M values, it takes over 2 seconds to execute on my laptop.
import timeit

# Create a series of 10M values
bigfoo = pd.Series(np.random.uniform(low=1, high=2, size=10000000))

# Apply my_func() to bigfoo
timeit.timeit(lambda: bigfoo.apply(my_func), number=1)
## 2.11 seconds
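This is one of those cases where the function is simple enough that it’d be better to build it using pure NumPy. For example, the function below computes the same values as my_func() (returned as a NumPy array rather than a Series) and runs much faster. Note that the timing below covers 10 runs rather than 1, so the 1.05 seconds works out to roughly 0.1 seconds per call, about 20 times faster than the apply() version.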
def my_numpy_func(x):
    a = x.to_numpy()
    return np.where(a < 1.5, a - 1, a + 1)

timeit.timeit(lambda: my_numpy_func(bigfoo), number=10)
## 1.05 seconds
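One thing to watch: np.where() returns a plain NumPy array, so the Series index is lost along the way. If you need a Series back, you can wrap the result yourself. The sketch below uses a hypothetical variant, my_numpy_func_series (not part of the original example), and checks it against apply() on a small random series.

```python
import numpy as np
import pandas as pd

def my_func(x):
    return x - 1 if x < 1.5 else x + 1

def my_numpy_func_series(x):
    # Hypothetical variant of my_numpy_func that keeps the original index
    a = x.to_numpy()
    return pd.Series(np.where(a < 1.5, a - 1, a + 1), index=x.index)

# A small series is enough for a correctness check
rng = np.random.default_rng(0)
small = pd.Series(rng.uniform(low=1, high=2, size=1000))

expected = small.apply(my_func)
actual = my_numpy_func_series(small)

# Same values, same index, computed two different ways
print(np.allclose(expected.to_numpy(), actual.to_numpy()))
print(actual.index.equals(small.index))
```

Checking the fast version against the slow one on a small sample like this is a cheap way to make sure the rewrite didn’t change the logic.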
The NumPy solution is faster because it’s vectorized. Without going into too much detail, it basically outsources the entire computation to compiled C code, which is fast, whereas the apply() solution calls a Python function once for every element, which is slow.