Python Pandas For Your Grandpa - 2.2 Series Basic Indexing

In this video, we’ll see how to use the index of a Series to access and modify its elements in meaningful ways.

Suppose we have the following Series, x:

import numpy as np
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])
print(x)
## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## dtype: int64

If you wanted to access the ith element of the Series, you might be inclined to use square-bracket indexing notation, just like accessing elements from a Python list or a NumPy array. And this would work. x[0] returns the 1st element, x[1] returns the 2nd element, and so on.

x[0]
## 5
x[1]
## 10

But looks can be deceiving, and what’s happening here probably isn’t what you think. x[0] actually returns the element or elements of the Series with index label 0. In this case, that element happens to be the 1st element in the Series, but if we mix up the index, which we can do by setting

x.index = [3,1,4,0,2]
print(x)
## 3     5
## 1    10
## 4    15
## 0    20
## 2    25
## dtype: int64

Now x[0] returns something totally different.

x[0]
## 20

So, if you want to access the ith value of a Series, you should use the Series.iloc property. For example, x.iloc[0] returns the 1st element by position, and x.iloc[1] returns the 2nd element by position.

x.iloc[0]
## 5
x.iloc[1]
## 10
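
To make the label-versus-position distinction concrete, here's a minimal sketch that rebuilds the shuffled Series and looks up 0 both ways. The variable names by_label and by_position are just for illustration.

```python
import pandas as pd

# Same values as above, with the shuffled index from the example
x = pd.Series([5, 10, 15, 20, 25], index=[3, 1, 4, 0, 2])

by_label = x[0]          # integer index, so 0 is treated as a label
by_position = x.iloc[0]  # always the element at position 0

print(by_label)
## 20
print(by_position)
## 5
```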

Throughout this course, I’ll explicitly refer to the Series index values as index labels and Series positions as index positions.

With iloc, you can also use negative indexing. So x.iloc[-1] returns the last element, and x.iloc[-3] returns the third to last element.

x.iloc[-1]
## 25
x.iloc[-3]
## 15

Another trick you can do is use slicing to return a subSeries. For example, x.iloc[1:4:2] returns this 2-row subSeries.

x.iloc[1:4:2]
## 1    10
## 0    20
## dtype: int64

In pseudocode you could describe this as picking out “the rows from position 1 (inclusive), up to position 4 (exclusive), stepping by 2”. Notice here we get back a Series object whereas in the previous examples we got back scalars.
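
Here's a small sketch of that scalar-versus-Series difference, assuming the same shuffled x as before. The names single and sub are made up for the example.

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25], index=[3, 1, 4, 0, 2])

single = x.iloc[1]    # one position -> a scalar value
sub = x.iloc[1:4:2]   # positions 1 and 3 -> a Series

print(isinstance(sub, pd.Series))
## True
print(sub.index.tolist())   # the labels travel with the values
## [1, 0]
print(sub.tolist())
## [10, 20]
```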

And lastly, you could pass in a list, array, or Series of integers to pick out specific rows of x.

x.iloc[[0, 2, 3]]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[np.array([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[pd.Series([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64

Let’s take a step back and talk about the index. Every Series has an index and its purpose is to provide a label for each element in the Series. As we’ve seen, when you make a Series from scratch, it automatically gets an index of sequential values starting from 0.

For example, here we make a Series to represent the test grades of 6 students, and you can see how the index 0 1 2 3 4 5 automatically gets created.

grades = pd.Series([82, 94, 77, 89, 91, 54])
print(grades)
## 0    82
## 1    94
## 2    77
## 3    89
## 4    91
## 5    54
## dtype: int64

We can change the index pretty easily, just by setting it equal to another array, list, or Series of values with the proper length. The index values don’t even need to be integers, and in fact, they’re often represented as strings since strings are more descriptive.

For example, here we might say

grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
print(grades)
## homer      82
## maggie     94
## grandpa    77
## bart       89
## lisa       91
## marge      54
## dtype: int64
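
The "proper length" part matters. As a quick sketch, here's what I'd expect if we hand pandas one label too few: the assignment is rejected with a ValueError. The flag mismatch_rejected is just a name I made up to record that.

```python
import pandas as pd

grades = pd.Series([82, 94, 77, 89, 91, 54])

try:
    # Only 5 labels for 6 values, so pandas refuses the assignment
    grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected)
## True
```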

Now if we wanted to know what grade bart got, we could do grades['bart'] and we’d get back 89.

grades['bart']
## 89

Now, remember earlier when I said x[0] returns the value with index label 0, not the value at position 0? That’s not entirely true. If you test it out on our grades Series, grades[0] actually does give us the value at position 0.

grades[0]
## 82

So, what gives? The reason this works is that, in this case, our index consists of strings but we requested the element with index 0, an integer. Pandas basically tries to be smart, and figures, “Hey, this guy passed in an integer but the index consists of strings, therefore he must be requesting the element at position 0”. In the earlier example, our index datatype was int, so when we requested x[0], Pandas assumed we were searching for the value with label 0.

While this behavior can be convenient and some people like it, I think it’s a little bit confusing and obfuscates what’s actually going on, so I highly recommend avoiding this square bracket notation. I think it leads to more problems than benefits. Newer pandas versions have also deprecated this positional fallback, which is one more reason to avoid it. Instead, be explicit and use .iloc for positional indexing and .loc for label indexing.

So for example, if we want to pick out bart’s grade, we could do grades['bart'] but it’s better if we explicitly do grades.loc['bart']. And if we wanted to get the 2nd value in the Series, we could do grades[1] but it’s better if we explicitly do grades.iloc[1].
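
As a minimal sketch of the explicit style, here are both lookups on the grades Series. The names barts_grade and second_grade are only for illustration.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge'],
)

barts_grade = grades.loc['bart']   # lookup by index label
second_grade = grades.iloc[1]      # lookup by index position

print(barts_grade)
## 89
print(second_grade)
## 94
```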

And just like positional indexing, we can use slicing to access a range of elements by index label (which is really cool). For example, we can say

grades.loc['homer':'grandpa']
## homer      82
## maggie     94
## grandpa    77
## dtype: int64

to get back every person’s grade between homer and grandpa, including both endpoints. Note that this type of slicing is slightly different from positional slicing which excludes the right boundary.

For example, if we do grades.iloc[0:2], we get back two rows, not three.

grades.iloc[0:2]
## homer     82
## maggie    94
## dtype: int64
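
Putting the two kinds of slice side by side makes the endpoint rule easier to remember. This is just a sketch; by_label and by_position are illustrative names.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge'],
)

by_label = grades.loc['homer':'grandpa']   # right endpoint included
by_position = grades.iloc[0:2]             # right endpoint excluded

print(by_label.index.tolist())
## ['homer', 'maggie', 'grandpa']
print(by_position.index.tolist())
## ['homer', 'maggie']
```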

And just like with iloc, we can pass a list, array, or Series of labels into loc to retrieve multiple rows.

grades.loc[['homer', 'grandpa', 'bart']]
## homer      82
## grandpa    77
## bart       89
## dtype: int64

Before we move on, we need to address a few more things about the index. You may have noticed that when you make a Series from scratch, Pandas automatically gives you something called a RangeIndex. For example, if we make a Series of 1M random normal values like this, and then we print x’s index, you can see it’s a RangeIndex with start 0, stop 1M, and step 1.

x = pd.Series(np.random.normal(size = 10**6))
print(x.index)
## RangeIndex(start=0, stop=1000000, step=1)

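If we make a second Series, y, but this time specify the index as an integer array from 0 to 1M, then printing y’s index shows it’s something called an Int64Index. (In pandas 2.0 and later, the Int64Index class is gone and this prints as a plain Index with dtype='int64', but the idea is the same.)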

y = pd.Series(np.random.normal(size = 10**6), index=np.arange(10**6))
print(y.index)
## Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
##                  8,      9,
##             ...
##             999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997,
##             999998, 999999],
##            dtype='int64', length=1000000)

From a high level user’s perspective, you can pretty much ignore this subtle difference. On the surface, these Series behave the same way. But, internally these things are significantly different.

Perhaps the most obvious difference is that the Int64Index consumes more memory than the RangeIndex since it literally stores all 1 million index values whereas the range index basically stores 3 values: start, stop, and step.

You can see this pretty clearly using the sys.getsizeof() function, noting that y is about twice the size of x.

import sys
sys.getsizeof(x)
## 8000144
sys.getsizeof(y)
## 16000016
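
If you want to look at the index on its own rather than the whole Series, one option is the index’s memory_usage() method. This is a rough sketch: I’m using zeros instead of random values, and I’m not showing exact byte counts since I’d expect them to vary by pandas version and platform.

```python
import numpy as np
import pandas as pd

n = 10**6
x = pd.Series(np.zeros(n))                      # default RangeIndex
y = pd.Series(np.zeros(n), index=np.arange(n))  # explicit integer index

range_bytes = x.index.memory_usage()     # roughly start, stop, step
explicit_bytes = y.index.memory_usage()  # one stored integer per element

print(range_bytes < explicit_bytes)
## True
```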

Less obvious is the fact that RangeIndex actually provides a performance boost over Int64Index. When you ask for element 342, RangeIndex knows exactly where to go to fetch that data just by using start and step, but for Int64Index it’s not that simple since there’s no guarantee that the index values jump by a fixed size, or that they’re in order, or that there are no duplicates.
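
To illustrate the idea (not pandas’ actual internals), here’s a sketch of how a label can be turned into a position with plain arithmetic when the index is a range. The range 10, 15, 20, ... and the name computed_position are made up for the example.

```python
import pandas as pd

r = pd.RangeIndex(start=10, stop=100, step=5)   # labels 10, 15, 20, ...

label = 35
# With a fixed start and step, the position is simple arithmetic
computed_position = (label - r.start) // r.step

print(computed_position)
## 5
print(r.get_loc(label))   # pandas agrees on the position
## 5
```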

One advantage Int64Index has over RangeIndex is that it allows for duplicate index values. For example, you can make a Series, alpha, like this, which has some repeated index values.

alpha = pd.Series([2, 3, 5, 7, 11], index = [0, 0, 1, 1, 2])
print(alpha)
## 0     2
## 0     3
## 1     5
## 1     7
## 2    11
## dtype: int64

And then when you do things like ask for the element with label 0, you’ll get back every element with label 0.

alpha.loc[0]
## 0    2
## 0    3
## dtype: int64
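
One consequence worth seeing in a quick sketch: with duplicate labels, the type you get back from loc depends on how many times the label appears. The is_unique attribute on the index tells you whether you’re in this situation.

```python
import pandas as pd

alpha = pd.Series([2, 3, 5, 7, 11], index=[0, 0, 1, 1, 2])

print(alpha.index.is_unique)
## False
print(alpha.loc[0].tolist())   # label 0 appears twice -> a Series
## [2, 3]
print(alpha.loc[2])            # label 2 appears once -> a scalar
## 11
```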

Now that we know how to access data from a Series using an index, overwriting data is pretty straightforward.

For example, if you have the Series foo with values [10, 20, 30, 40, 50] with index labels ['a', 'b', 'c', 'd', 'e'].

foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(foo)
## a    10
## b    20
## c    30
## d    40
## e    50
## dtype: int64

and you want to change the 2nd element to 200, you can do

foo.iloc[1] = 200

If you want to set the 1st, 2nd and 3rd elements to 99 you can do

foo.iloc[[0, 1, 2]] = 99

or use slicing

foo.iloc[:3] = 999

This is like saying, “select every element from the start of the Series up to but excluding position 3” and then update those values to 999.

And obviously, you can do the same exact operations using the index labels with foo.loc.

foo.iloc[1] = 200          # same as: foo.loc['b'] = 200
foo.iloc[[0, 1, 2]] = 99   # same as: foo.loc[['a', 'b', 'c']] = 99
foo.iloc[:3] = 999         # same as: foo.loc['a':'c'] = 999
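
Here’s a sketch that checks the position-based and label-based versions really do land on the same result. I’m using two copies of the Series, by_position and by_label (names made up for the example), and a shorter slice so the inclusive right endpoint of the label slice stands out.

```python
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
by_position = pd.Series([10, 20, 30, 40, 50], index=labels)
by_label = by_position.copy()

by_position.iloc[1] = 200
by_label.loc['b'] = 200

by_position.iloc[:2] = 99    # positions 0 and 1 (right end excluded)
by_label.loc['a':'b'] = 99   # labels 'a' through 'b' (right end included)

print(by_position.equals(by_label))
## True
print(by_position.tolist())
## [99, 99, 30, 40, 50]
```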

What if you wanted to overwrite the entire Series with a new set of values like the ones in this array?

new_vals = np.array([5, 10, 15, 20, 25])

Your first instinct might be to overwrite the entire foo variable like foo = pd.Series(new_vals), but then you’d lose foo’s index. Instead, use slicing to select and overwrite all of foo’s values without overwriting its index.

foo.iloc[:] = new_vals
print(foo)
## a     5
## b    10
## c    15
## d    20
## e    25
## dtype: int64
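
To see the difference between the two approaches, here’s a small sketch that does both. The name rebuilt is just for illustration.

```python
import numpy as np
import pandas as pd

foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
new_vals = np.array([5, 10, 15, 20, 25])

rebuilt = pd.Series(new_vals)   # brand-new Series with a default index
foo.iloc[:] = new_vals          # same Series, same labels, new values

print(rebuilt.index.tolist())
## [0, 1, 2, 3, 4]
print(foo.index.tolist())
## ['a', 'b', 'c', 'd', 'e']
print(foo.tolist())
## [5, 10, 15, 20, 25]
```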

Now suppose you have these two Series, x and y, each with four values, whose indices are different but share a few common values, namely 0, 2, and 3.

x = pd.Series([10, 20, 30, 40])
print(x)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64
y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
print(y)
## 7       1
## 3      11
## 2     111
## 0    1111
## dtype: int64

What do you think the result of x.loc[[0, 1]] = y will be? This one’s a bit strange to get used to, but when you see the result, it’s pretty clear what’s happening.

x.loc[[0, 1]] = y
print(x)
## 0    1111.0
## 1       NaN
## 2      30.0
## 3      40.0
## dtype: float64

Pandas starts by searching x for the values with index labels 0 and 1. Then it looks for matching labels in y to use to overwrite x.

Since x’s label 1 doesn’t match any elements in y, Pandas assigns it the value NaN. And since NaN only exists as a floating point value in NumPy, Pandas casts the entire Series from ints to floats. We’ll talk more about NaN in a future video, but basically it’s a special value to represent missing or invalid data.

Also note that we could do the same exact thing using slicing. For example, if we do x.iloc[:2] = y, Pandas selects the first two values from x, and then searches y for replacement values with matching index labels.
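
Here’s a minimal sketch of the label-matching assignment with .loc that you can check yourself. One assumption on my part: I start x off as floats, so the NaN doesn’t force a dtype change along the way.

```python
import numpy as np
import pandas as pd

# Floats from the start, so the NaN doesn't force a dtype change
x = pd.Series([10.0, 20.0, 30.0, 40.0])
y = pd.Series([1, 11, 111, 1111], index=[7, 3, 2, 0])

x.loc[[0, 1]] = y   # values are matched by label, not by position

print(x.loc[0])             # y has label 0, so its value is used
## 1111.0
print(np.isnan(x.loc[1]))   # y has no label 1, so we get NaN
## True
```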

If we try to do this using a NumPy array on the right hand side, we’ll get an error. When the right hand side is a NumPy array, Pandas tries to assign each element of the right hand side to the left hand side on an element-by-element basis, and in this case we’re trying to replace 2 values with 4 values.

x.loc[[0, 1]] = y.to_numpy()  # ERROR

If the NumPy array on the right hand side is the same length as the Series subset on the left hand side, this would work, but note that the array elements get assigned to the Series subset by position, not index label.

x.loc[[0, 1]] = y.to_numpy()[:2]
print(x)
## 0     1.0
## 1    11.0
## 2    30.0
## 3    40.0
## dtype: float64
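
Both cases can be checked in one short sketch. I’m assuming the length mismatch surfaces as a ValueError, and I’m using a float Series and a made-up array named vals to keep dtypes out of the picture.

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0])
vals = np.array([1.0, 11.0, 111.0, 1111.0])

try:
    x.loc[[0, 1]] = vals    # 4 values for 2 slots
    length_mismatch = False
except ValueError:
    length_mismatch = True

x.loc[[0, 1]] = vals[:2]    # 2 values for 2 slots, assigned in order

print(length_mismatch)
## True
print(x.tolist())
## [1.0, 11.0, 30.0, 40.0]
```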