Python NumPy For Your Grandma - 2.1 NumPy Array Motivation

The primary data structure in NumPy is the N-dimensional array, so that’s gonna be the focus of this course. But before we start using arrays, let’s motivate their existence.

Suppose you have a lot of data, like the price of a stock measured every second for a year. That’s about 32 million values. To represent this data in native Python, the most obvious data structure we could use is a list. Let’s give that a go, using arbitrary made up stock prices.

prices = []
N = 60*60*24*365
for i in range(N):
    prices.append(100 + i/100)

## [100.0, 100.01, 100.02, 100.03, 100.04]

Now let’s calculate the average stock price.

avg = 0.0
for p in prices:
    avg += p/len(prices)
## 157779.995

Hopefully you noticed that populating our list of prices took a few seconds and so did calculating the average price. It wasn’t agonizingly slow, but it would be if we had to do the same thing for hundreds of stocks. Now let’s see how you could do this same sort of thing in NumPy.

import numpy as np

prices = 100 + np.arange(N)/100
avg = np.mean(prices)
## 157779.995

Pretty easy and fast, right? The takeaway here is that NumPy arrays are fast, and that’s the main reason they exist. All the convenient math functions, random number generators, and other things provided by the NumPy package are great, but secondary to the main benefit - the speed of the NumPy array data structure.

So, why are arrays faster than lists, and if that’s true why do lists even exist? First, you need to understand that NumPy arrays are intended to store objects with the same size and type. That’s because arrays store data in contiguous, fixed sized memory blocks.

For example, an array of 32-bit integers like [3,0,1] would internally be stored in binary like this.

# [00000000000000000000000000000011|00000000000000000000000000000000|00000000000000000000000000000001]

If you wanted to access the 3rd element in the array, you can essentially say “Hey computer, give me the 3rd element of this array!” and then your computer starts at the beginning of the array and knows exactly how many bits to jump across to get to the 3rd element - in this case, 64 bits. Knowing exactly how many bits to jump across to get to some requested element is what makes arrays fast for data access.

Unlike integers, strings are objects that vary in size. For example, consider the strings, ['hello', 'i', 'am', 'a', 'banana'].

The string “hello” is 40 bits, “i” is 8 bits, “am” is 16 bits, “a” is 8 bits, and “banana” is 48 bits. So, unless we agree on a maximum possible string size, we can’t easily represent these strings with contiguous, fixed-sized memory blocks because the strings are different sizes. And if you just stored them in contiguous memory blocks, as soon as you start asking your computer to fetch specific elements, you’ll run into problems because your computer doesn’t know how far to jump from the beginning of the array to get to some requested element. Not to mention a host of other problems, like what happens if we try to swap two elements with different sizes?

To get around this complexity, we could store the strings in some random location in memory and then build an array of fixed size pointers to the strings. A pointer is just an integer that identifies some location in your computer’s memory, i.e. a memory address. Now if we want to fetch, say, the third element, assuming our pointers are 32-bit integers, just like before we can quickly “jump” across 64 bits to get to the third pointer, and then make one extra step to retrieve our desired string element. The benefit to this architecture is that pointers can point to anything, so we could actually store a combination of strings, integers, and other objects all together. The downside is that you have to make that extra step each time you want to fetch an element. Plus the lack of a guarantee that all your data is the same type can cause some bugs, for example if you try to sum the elements and one of them turns out to be a string. This architecture I’m describing is one implementation of a list.

To recap, NumPy arrays are faster than lists, and that’s why we like them. But that speed comes at the cost of being flexible. Specifically, if we have a NumPy array, each element of the array should be of the same size and type. That means we can have an array of ints, an array of floats, or an array of booleans, but not an array of ints, floats, and booleans together.

Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Basic Array Stuff
    2.1 NumPy Array Motivation
    2.2 NumPy Array Basics
    2.3 Creating NumPy Arrays
    2.4 Indexing 1-D Arrays
    2.5 Indexing Multidimensional Arrays
    2.6 Basic Math On Arrays
    2.7 Challenge: High School Reunion
    2.8 Challenge: Gold Miner
    2.9 Challenge: Chic-fil-A
  3. Intermediate Array Stuff
    3.1 Broadcasting
    3.2 newaxis
    3.3 reshape()
    3.4 Boolean Indexing
    3.5 nan
    3.6 infinity
    3.7 random
    3.8 Challenge: Love Distance
    3.9 Challenge: Professor Prick
    3.10 Challenge: Psycho Parent
  4. Common Operations
    4.1 where()
    4.2 Math Functions
    4.3 all() and any()
    4.4 concatenate()
    4.5 Stacking
    4.6 Sorting
    4.7 unique()
    4.8 Challenge: Movie Ratings
    4.9 Challenge: Big Fish
    4.10 Challenge: Taco Truck
  5. Advanced Array Stuff
    5.1 Advanced Array Indexing
    5.2 View vs Copy
    5.3 Challenge: Population Verification
    5.4 Challenge: Prime Locations
    5.5 Challenge: The Game of Doors
    5.6 Challenge: Peanut Butter
  6. Final Boss
    6.1 as_strided()
    6.2 einsum()
    6.3 Challenge: One-Hot-Encoding
    6.4 Challenge: Cumulative Rainfall
    6.5 Challenge: Table Tennis
    6.6 Challenge: Where’s Waldo
    6.7 Challenge: Outer Product