Python NumPy For Your Grandma | Section 4.1 | where()
This video explains vectorization by introducing NumPy’s where() function, which executes “if-then-else” logic at the C level.
import numpy as np

# Make two 1-D arrays, each of length 5
foo = np.array([1, 2, 3, 4, 5])
bar = np.array([0, 1, 0, 0, 1])

# Create a third array called baz such that, where bar is 0, you double the
# corresponding value of foo; otherwise, you take half the corresponding value of foo.
# You might be inclined to do this with a for loop
baz = np.zeros(foo.shape)
for i in range(foo.shape[0]):  # foo.shape is a tuple, so take its first entry
    if bar[i] == 0:
        baz[i] = 2 * foo[i]
    else:
        baz[i] = foo[i] / 2

# Use where() (vectorized solution)
baz = np.where(bar == 0, foo * 2.0, foo / 2.0)
Suppose you have two 1-D arrays, foo and bar, each of length 5. You want to create a third array called baz such that, where bar is 0, you double the corresponding value of foo; otherwise, you take half the corresponding value of foo. You might be inclined to do this with a for loop like the one above.
It turns out that if foo is large, say a million or more elements, you’ll start to notice how long this takes to execute. This is because, on every iteration of the loop, Python has to call a lower-level function in C, and then C has to hand the result back to Python. This back-and-forth communication between Python and C is the main bottleneck in our implementation.
NumPy is designed to address this problem. Instead of going back and forth between C and Python, NumPy basically says, “Hey C, here’s my array and here’s what I want to do with it. Now carry out the entire operation and come back to me with the result.” This is called vectorization, where C operates on a batch, or vector, of data before giving the result back to Python. In general, if you’re writing a for loop to process a NumPy array, you’re probably doing it wrong.
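To make the idea concrete, here is a minimal sketch of the same operation (doubling every element) written both ways. The names doubled_loop and doubled_vec are made up for this illustration; the point is only that the vectorized line hands the whole array over in one call.

```python
import numpy as np

foo = np.array([1, 2, 3, 4, 5])

# Loop version: Python handles one element per iteration.
doubled_loop = np.zeros(foo.shape, dtype=foo.dtype)
for i in range(foo.shape[0]):
    doubled_loop[i] = 2 * foo[i]

# Vectorized version: a single expression; the per-element loop
# runs inside NumPy's compiled code instead of in Python.
doubled_vec = 2 * foo

print(doubled_loop)
print(doubled_vec)
```

Both arrays hold the same values; only the place where the looping happens differs.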
Going back to our example, the vectorized way to solve this problem is to use NumPy’s where() function. where() takes three main parameters:
- a boolean array
- values to use when the boolean array is true
- values to use when the boolean array is false
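A small sketch of those three parameters in action, using made-up arrays (cond, x, and y are names chosen just for this example). It also shows that either set of values can be a plain scalar, which NumPy broadcasts across the array.

```python
import numpy as np

cond = np.array([True, False, True, False])  # the boolean array
x = np.array([10, 20, 30, 40])               # values used where cond is True
y = np.array([-1, -2, -3, -4])               # values used where cond is False

picked = np.where(cond, x, y)
print(picked)   # [10 -2 30 -4]

# A scalar works too: cap every value above 25 at 25.
capped = np.where(x > 25, 25, x)
print(capped)   # [10 20 25 25]
```

One thing to keep in mind: the expressions you pass for both sets of values are computed in full before where() picks between them, so where() selects results rather than skipping work.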
So, we can solve our original problem using np.where(bar == 0, foo * 2, foo / 2).
If you test this out on big arrays, you’ll notice major runtime improvements compared to using a Python for loop.
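If you want to try this yourself, here is one rough way to time the two approaches. This is only a sketch: the function names loop_version and where_version, the random test data, and the array size of one million are all choices made for this example, and the actual timings will vary from machine to machine.

```python
import time
import numpy as np

# Random test data (an assumption for this sketch): a million elements each.
rng = np.random.default_rng(0)
n = 1_000_000
foo = rng.integers(1, 100, size=n)
bar = rng.integers(0, 2, size=n)

def loop_version(foo, bar):
    """Element-by-element Python loop."""
    baz = np.zeros(foo.shape)
    for i in range(foo.shape[0]):
        if bar[i] == 0:
            baz[i] = 2 * foo[i]
        else:
            baz[i] = foo[i] / 2
    return baz

def where_version(foo, bar):
    """Vectorized solution using np.where()."""
    return np.where(bar == 0, foo * 2.0, foo / 2.0)

start = time.perf_counter()
baz_loop = loop_version(foo, bar)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
baz_where = where_version(foo, bar)
where_seconds = time.perf_counter() - start

print(f"for loop: {loop_seconds:.3f} s")
print(f"where():  {where_seconds:.3f} s")
```

A single run with time.perf_counter() is a crude measurement; for steadier numbers, Python's timeit module repeats the call several times.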