Contents

Python Pandas For Your Grandpa - 2.6 Series Vectorization

In this section, we’ll learn about vectorization and why using natively built Series methods is usually better than rolling your own custom methods.

Suppose you have a series x with 1M random uniform values between 1 and 2, and you want to calculate its mean.

import numpy as np
import pandas as pd

x = pd.Series(np.random.uniform(low=1, high=2, size=10**6))

You’re a confident, competent coder, so forget about using somebody else’s code - you’re gonna write your own function to calculate the average. So you come up with something like this.

def average(x):
    avg = 0.0
    for i in range(x.size):
        avg += x.iloc[i]/x.size

    return avg

Now you give it a whirl, only to find out.. it’s slow as hell!

average(x)
## 1.5003341637695278

By contrast, take a look at the built-in Series method, mean().

x.mean()
## 1.500334163769572

It’s blazing fast. So, why is the built-in method so much faster than your custom function? The answer is vectorization.

It’s important to understand that Python loops are slow. With each iteration of a Python loop, you’re essentially giving your computer a new set of instructions to perform some task, so in this case it’s like you’re giving your computer 1M different sets of instructions.

By contrast, the Series mean() method is basically an alias for the NumPy mean() method, which is vectorized. In this case, NumPy hands off the entire array of data to a lower level function in C with a single set of instructions for what to do. C then executes those instructions on the entire array, calculating the mean, before handing the result back to Python.

To recap, Python loops are slow and you shouldn’t use them, except in some rare circumstances. Instead, opt for a vectorized solution using one of the many Pandas and NumPy methods that implement fast algorithms using C.


Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts