Python Pandas For Your Grandpa - 2.7 Series apply()
In this section, we’ll see how you can use the apply() method of a Series to apply a function to each element in the Series, and then we’ll see why apply() is usually inferior to a vectorized solution.
Suppose you have a Series called foo with 6 elements like this:
import numpy as np
import pandas as pd
foo = pd.Series([3, 9, 2, 2, 8, 7])
print(foo)
## 0 3
## 1 9
## 2 2
## 3 2
## 4 8
## 5 7
## dtype: int64
And you want to apply some complicated function like this one to each element in the Series:
def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3
Here, my_func() takes in a scalar, x, and returns x - 1 if x is even; otherwise it returns x + 3. Okay, maybe this one-liner isn’t that complicated, but for the sake of argument, pretend this function has hundreds of lines of cryptic code. In cases like this, you can use the apply() method of the Series object to apply my_func() to each element of foo.
In this case, you’d call foo.apply(), passing in the function callable, my_func, to get back a new Series of values.
foo.apply(my_func)
## 0 6
## 1 12
## 2 1
## 3 1
## 4 7
## 5 10
## dtype: int64
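Conceptually, apply() calls my_func() once per element. The sketch below redefines foo and my_func so that it runs on its own, and checks that a plain list comprehension gives the same values. It illustrates the behavior, not how pandas implements apply() internally.

```python
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])

def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3

# Element-wise application via apply()
applied = foo.apply(my_func)

# Roughly equivalent: call my_func() on each value and keep the original index
looped = pd.Series([my_func(x) for x in foo], index=foo.index)

print(applied.tolist())  # [6, 12, 1, 1, 7, 10]
print(looped.tolist())   # [6, 12, 1, 1, 7, 10]
```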
We could even generalize my_func, giving it some parameters, a and b, like this:
def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b
And this time, we can call foo.apply(my_func) and pass in values for a and b as keyword arguments, which apply() forwards to my_func.
foo.apply(my_func, a=2, b=4)
## 0 7
## 1 13
## 2 0
## 3 0
## 4 6
## 5 11
## dtype: int64
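Extra arguments can also be passed positionally through the args parameter of Series.apply(), which takes a tuple. Here is a minimal sketch that redefines foo and my_func so that it runs on its own; the names by_keyword and by_position are just illustrative.

```python
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

# Keyword arguments are forwarded to my_func by name
by_keyword = foo.apply(my_func, a=2, b=4)

# args is a tuple of positional arguments placed after the element itself
by_position = foo.apply(my_func, args=(2, 4))

print(by_keyword.tolist())   # [7, 13, 0, 0, 6, 11]
print(by_position.tolist())  # [7, 13, 0, 0, 6, 11]
```

Keyword arguments tend to read more clearly, since the call site shows which value goes to which parameter.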
The apply() method is great because it’s easy to use and it generalizes well, but it’s slow because it’s not vectorized: it calls my_func once per element in Python. If we apply my_func to a Series with 10M values, it takes about 3 seconds to execute on Google Colab.
# Create a Series of 10M values
bigfoo = pd.Series(np.random.randint(low=0, high=9, size=10**7))
%%timeit
# apply()-based method (%%timeit must be the first line of its notebook cell)
y1 = bigfoo.apply(my_func) # 3 seconds
By contrast, here’s a NumPy based solution that achieves the same thing in about 100 milliseconds, roughly 30 times faster.
%%timeit
# vectorized NumPy method (%%timeit must be the first line of its notebook cell)
a = bigfoo.to_numpy()
y2 = pd.Series(np.where(a % 2 == 0, a - 1, a + 3)) # 100 milliseconds
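Outside a notebook, or to confirm that the two approaches really agree, something like the following sketch can be used. It assumes a smaller Series (the hypothetical smallfoo, with 100K values) so that it finishes quickly, and it uses time.perf_counter() for a rough single-run timing rather than %%timeit. The timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

# Smaller than the 10M-value Series in the text, to keep the check quick
rng = np.random.default_rng(0)
smallfoo = pd.Series(rng.integers(low=0, high=9, size=10**5))

# apply()-based method
start = time.perf_counter()
y1 = smallfoo.apply(my_func)
apply_seconds = time.perf_counter() - start

# vectorized NumPy method
start = time.perf_counter()
arr = smallfoo.to_numpy()
y2 = pd.Series(np.where(arr % 2 == 0, arr - 1, arr + 3))
numpy_seconds = time.perf_counter() - start

# Both methods should produce identical values
print(np.array_equal(y1.to_numpy(), y2.to_numpy()))
print(f"apply: {apply_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")
```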
With that said, the apply() method is designed for convenience and code clarity, not speed. Keep in mind that sometimes my_func might actually be a function imported from another package, or maybe it makes HTTP requests to some API, and so refactoring it into a vectorized solution just isn’t feasible.