Python Pandas For Your Grandpa - 2.7 Series apply()

In this section, we’ll see how you can use the apply() method of a Series to apply a function to each element in the Series, and then we’ll see why apply() is usually inferior to a vectorized solution.

Suppose you have a Series called foo with 6 elements, like this:

import numpy as np
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])
print(foo)
## 0    3
## 1    9
## 2    2
## 3    2
## 4    8
## 5    7
## dtype: int64

And you want to apply some complicated function like this one to each element in the Series.

def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3

Here, my_func() takes in a scalar, x, and returns x - 1 if x is even; otherwise it returns x + 3. Okay, maybe this one-liner isn’t that complicated, but for the sake of argument, pretend this function has hundreds of lines of cryptic code. In cases like this, you can use the apply() method of the Series object to apply my_func() to each element of foo.

In this case, you’d call foo.apply(), passing in the function itself, my_func (without parentheses), to get back a new Series of values.

foo.apply(my_func)
## 0     6
## 1    12
## 2     1
## 3     1
## 4     7
## 5    10
## dtype: int64
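
To make the element-by-element behavior concrete, here is a small sketch showing that apply() produces the same values as calling my_func on each element in a plain Python loop, that it keeps the original index, and that foo itself is left unchanged. The string index labels are made up for illustration; foo and my_func are redefined so the block runs on its own.

```python
import pandas as pd

def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3

# Same values as foo, but with made-up string labels to show the index is kept
foo = pd.Series([3, 9, 2, 2, 8, 7], index=list("abcdef"))

applied = foo.apply(my_func)

# Hand-written equivalent: call my_func on each element and reuse foo's index
looped = pd.Series([my_func(x) for x in foo], index=foo.index)

print(applied.tolist())                     # [6, 12, 1, 1, 7, 10]
print(applied.tolist() == looped.tolist())  # True
print(list(applied.index))                  # ['a', 'b', 'c', 'd', 'e', 'f']
print(foo.tolist())                         # [3, 9, 2, 2, 8, 7]  (foo is unchanged)
```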

We could even generalize my_func, giving it some parameters, a and b, like this:

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

And this time, we can call foo.apply(my_func) and pass in values for a and b as trailing keyword arguments.

foo.apply(my_func, a=2, b=4)
## 0     7
## 1    13
## 2     0
## 3     0
## 4     6
## 5    11
## dtype: int64
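
As a side note, the extra parameters can also be passed positionally through the args parameter of Series.apply(), and any parameter you leave out keeps its default. The sketch below redefines foo and my_func so it runs on its own; the variable names by_keyword, by_position, and only_b are just for illustration.

```python
import pandas as pd

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

foo = pd.Series([3, 9, 2, 2, 8, 7])

# Keyword form, as in the example above
by_keyword = foo.apply(my_func, a=2, b=4)

# Positional form: the tuple is passed to my_func after each element, so a=2, b=4
by_position = foo.apply(my_func, args=(2, 4))

# Only override b; a keeps its default of 1
only_b = foo.apply(my_func, b=4)

print(by_position.tolist())  # [7, 13, 0, 0, 6, 11]
print(only_b.tolist())       # [7, 13, 1, 1, 7, 11]
```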

The apply() method is great because it’s easy to use and it generalizes well, but it’s slow because it’s not vectorized: it calls my_func once per element, in Python. If we apply my_func to a Series with 10M values, it takes about 3 seconds to execute on Google Colab.

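# Create a Series of 10M values (high is exclusive, so values range from 0 to 8)
bigfoo = pd.Series(np.random.randint(low=0, high=9, size=10**7))

%%timeit
# apply() based method (%%timeit must be the first line of its own notebook cell)
y1 = bigfoo.apply(my_func)  # about 3 seconds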

By contrast, here’s a NumPy-based solution that achieves the same thing in about 100 milliseconds, roughly 30 times faster.

%%timeit
# vectorized NumPy method (%%timeit must be the first line of its own notebook cell)
a = bigfoo.to_numpy()
y2 = pd.Series(np.where(a % 2 == 0, a - 1, a + 3))  # about 100 milliseconds
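
Before trusting a vectorized rewrite, it’s worth checking that it returns the same values as the apply() version. Here is a minimal sketch of that check on a smaller Series. The helper my_func_vectorized and the name smallfoo are hypothetical (they aren’t part of pandas or of the examples above), and the helper also carries over the a and b parameters from the generalized my_func.

```python
import numpy as np
import pandas as pd

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

def my_func_vectorized(s, a=1, b=3):
    """Hypothetical vectorized counterpart of my_func that takes a whole Series."""
    values = s.to_numpy()
    # Pick values - a where the element is even, values + b where it is odd
    result = np.where(values % 2 == 0, values - a, values + b)
    return pd.Series(result, index=s.index)

# A smaller random Series (seeded for repeatability) keeps the check fast
rng = np.random.default_rng(0)
smallfoo = pd.Series(rng.integers(low=0, high=9, size=1000))

y1 = smallfoo.apply(my_func, a=2, b=4)
y2 = my_func_vectorized(smallfoo, a=2, b=4)

# Compare the underlying values element by element
print(np.array_equal(y1.to_numpy(), y2.to_numpy()))  # True
```

The comparison goes through to_numpy() so that it checks values only; the two approaches may not always produce the same integer dtype on every platform.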

With that said, the apply() method is designed for convenience and code clarity, not speed. Keep in mind that sometimes my_func might actually be a function imported from another package, or maybe it makes HTTP requests to some API, so refactoring it into a vectorized solution just isn’t feasible.

