# Python Pandas For Your Grandpa - 2.2 Series Basic Indexing

In this video, we’ll see how to use the index of a Series to access and modify its elements in meaningful ways.

Suppose we have the following Series, `x`.

``````import numpy as np
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])
print(x)
## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## dtype: int64
``````

If you wanted to access the ith element of the Series, you might be inclined to use square-bracket indexing notation, just like accessing elements from a Python list or a NumPy array. And this would work. `x[0]` returns the 1st element, `x[1]` returns the 2nd element, and so on.

``````x[0]
## 5
x[1]
## 10
``````

But looks can be deceiving, and what’s happening here probably isn’t what you think. `x[0]` actually returns the element or elements of the Series with index label 0. In this case, that element happens to be the 1st element in the Series, but if we mix up the index, which we can do by setting

``````x.index = [3,1,4,0,2]
print(x)
## 3     5
## 1    10
## 4    15
## 0    20
## 2    25
## dtype: int64
``````

Now `x[0]` returns something totally different.

``````x[0]
## 20
``````

So, if you want to access the ith value of a Series, you should use the `Series.iloc` property. For example, `x.iloc[0]` returns the first element by position, and `x.iloc[1]` returns the 2nd element by position.

``````x.iloc[0]
## 5
x.iloc[1]
## 10
``````

Throughout this course, I’ll explicitly refer to the Series index values as index labels and Series positions as index positions.
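
To make the distinction concrete, here is a small sketch that rebuilds the shuffled Series from above and looks up the same element both ways. The `Index.get_loc` call, which translates a label into its position, is not used elsewhere in this video; it is included only as an illustration.

```python
import pandas as pd

# Rebuild the Series with the shuffled index used above
x = pd.Series([5, 10, 15, 20, 25], index=[3, 1, 4, 0, 2])

by_label = x.loc[0]        # element whose index label is 0
by_position = x.iloc[0]    # element at index position 0

# Index.get_loc translates an index label into an index position
position_of_label_0 = x.index.get_loc(0)

print(by_label)             # 20
print(by_position)          # 5
print(position_of_label_0)  # 3
```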

With `iloc`, you can also use negative indexing. So `x.iloc[-1]` returns the last element, and `x.iloc[-3]` returns the third to last element.

``````x.iloc[-1]
## 25
x.iloc[-3]
## 15
``````

Another trick you can do is use slicing to return a subSeries. For example, `x.iloc[1:4:2]` returns this 2-row subSeries.

``````x.iloc[1:4:2]
## 1    10
## 0    20
## dtype: int64
``````

In plain English, you could describe this as picking out “the rows from position 1 (inclusive), up to position 4 (exclusive), stepping by 2”. Notice that here we get back a Series object, whereas in the previous examples we got back scalars.

And lastly, you could pass in a list, array, or Series of integers to pick out specific rows of `x`.

``````x.iloc[[0, 2, 3]]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[np.array([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[pd.Series([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64
``````

Let’s take a step back and talk about the index. Every Series has an index and its purpose is to provide a label for each element in the Series. As we’ve seen, when you make a Series from scratch, it automatically gets an index of sequential values starting from 0.

For example, here we make a Series to represent the test grades of 6 students, and you can see how the index 0 1 2 3 4 5 automatically gets created.

``````grades = pd.Series([82, 94, 77, 89, 91, 54])
print(grades)
## 0    82
## 1    94
## 2    77
## 3    89
## 4    91
## 5    54
## dtype: int64
``````

We can change the index pretty easily, just by setting it equal to another array, list, or Series of values with the proper length. The index values don’t even need to be integers, and in fact, they’re often represented as strings since strings are more descriptive.

For example, here we might say

``````grades.index = ['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge']
print(grades)
## homer      82
## maggie     94
## grandpa    77
## bart       89
## lisa       91
## marge      54
## dtype: int64
``````

Now if we wanted to know what grade bart got, we could do `grades['bart']` and we’d get back 89.

``````grades['bart']
## 89
``````

Now, remember earlier when I said `x[0]` returns the value with index label 0, not the value at position 0? That’s not entirely true. If you test it out on our `grades` Series, `grades[0]` actually does give us the value at position 0.

``````grades[0]
## 82
``````

So, what gives? This works because, in this case, our index consists of strings but we requested the element with index 0, an integer. Pandas basically tries to be smart and figures, “Hey, this guy passed in an integer but the index consists of strings, therefore he must be requesting the element at position 0”. In the earlier example, our index datatype was int, so when we requested `x[0]`, Pandas assumed we were searching for the value with label 0.

While this behavior can be convenient and some people like it, I think it’s a little bit confusing and obfuscates what’s actually going on, so I highly recommend avoiding this square bracket notation. I think it leads to more problems than benefits. (Newer pandas releases deprecate this positional fallback, which is one more reason not to rely on it.) Instead, be explicit and use `.iloc` for positional indexing and `.loc` for label indexing.

So for example, if we want to pick out bart’s grade, we could do `grades['bart']` but it’s better if we explicitly do `grades.loc['bart']`. And if we wanted to get the 2nd value in the Series, we could do `grades[1]` but it’s better if we explicitly do `grades.iloc[1]`.
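
One practical payoff of being explicit is that mistakes surface immediately. In this sketch, asking `.loc` for the integer label 0 on the string-indexed `grades` Series raises a `KeyError` instead of silently falling back to a position.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge'],
)

barts_grade = grades.loc['bart']   # lookup by index label
second_grade = grades.iloc[1]      # lookup by index position

# .loc never falls back to positions: there is no label 0 here,
# so this lookup fails loudly
try:
    grades.loc[0]
    loc_raised = False
except KeyError:
    loc_raised = True

print(barts_grade)   # 89
print(second_grade)  # 94
print(loc_raised)    # True
```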

And just like positional indexing, we can use slicing to access a range of elements by labeled index (which is really cool). For example, we can say

``````grades.loc['homer':'grandpa']
## homer      82
## maggie     94
## grandpa    77
## dtype: int64
``````

to get back every person’s grade between homer and grandpa, including both endpoints. Note that this type of slicing is slightly different from positional slicing which excludes the right boundary.

For example, if we do `grades.iloc[0:2]`, we get back two rows, not three.

``````grades.iloc[0:2]
## homer     82
## maggie    94
## dtype: int64
``````
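
As a quick check of the endpoint rules, this sketch shows that the label slice above selects the same rows as a positional slice whose stop is one position further along.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['homer', 'maggie', 'grandpa', 'bart', 'lisa', 'marge'],
)

by_label = grades.loc['homer':'grandpa']  # right endpoint included
by_position = grades.iloc[0:3]            # right endpoint excluded

print(len(by_label), len(grades.iloc[0:2]))  # 3 2
print(by_label.equals(by_position))          # True
```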

And just like with `iloc`, we can pass a list, array, or Series of labels into `loc` to retrieve multiple rows.

``````grades.loc[['homer', 'grandpa', 'bart']]
## homer      82
## grandpa    77
## bart       89
## dtype: int64
``````

Before we move on, we need to address a few more things about the index. You may have noticed that when you make a Series from scratch, Pandas automatically gives you something called a RangeIndex. For example, if we make a Series of 1M random normal values like this and then print `x`’s index, you can see it’s a RangeIndex with start 0, stop 1M, and step 1.

``````x = pd.Series(np.random.normal(size = 10**6))
print(x.index)
## RangeIndex(start=0, stop=1000000, step=1)
``````

If we make a second Series, `y`, but this time specify the index as an integer array from 0 to 1M, printing `y`’s index shows it’s something called an Int64Index. (In pandas 2.0 and later, the same index prints as a plain `Index` with `dtype='int64'`.)

``````y = pd.Series(np.random.normal(size = 10**6), index=np.arange(10**6))
print(y.index)
## Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
##                  8,      9,
##             ...
##             999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997,
##             999998, 999999],
##            dtype='int64', length=1000000)
``````

From a high level user’s perspective, you can pretty much ignore this subtle difference. On the surface, these Series behave the same way. But, internally these things are significantly different.

Perhaps the most obvious difference is that the Int64Index consumes more memory than the RangeIndex since it literally stores all 1 million index values whereas the range index basically stores 3 values: start, stop, and step.

You can see this pretty clearly using the `sys.getsizeof()` function, noting that `y` is about twice the size of `x`.

``````import sys
sys.getsizeof(x)
## 8000144
sys.getsizeof(y)
## 16000016
``````
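
Since `sys.getsizeof` reports the size of the whole Series, values included, it can help to look at the indexes on their own. A minimal sketch, assuming the `nbytes` property of each index is a fair proxy for its memory footprint (exact numbers vary by platform and pandas version, so none are shown):

```python
import numpy as np
import pandas as pd

n = 10**6
x = pd.Series(np.zeros(n))                      # default RangeIndex
y = pd.Series(np.zeros(n), index=np.arange(n))  # explicit integer index

range_bytes = x.index.nbytes  # only start, stop, and step are stored
int_bytes = y.index.nbytes    # one stored integer per element

print(type(x.index).__name__, range_bytes)
print(type(y.index).__name__, int_bytes)
```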

Less obvious is the fact that RangeIndex actually provides a performance boost over Int64Index. When you ask for element 342, RangeIndex knows exactly where to go to fetch that data just by using start and step, but for Int64Index it’s not that simple since there’s no guarantee that the index values jump by a fixed size, or that they’re in order, or that there are no duplicates.
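
The arithmetic behind that lookup can be sketched in a few lines. The helper `range_position` below is hypothetical and only illustrates the idea for a positive step; it is not a description of how pandas is implemented internally. The result is compared against `RangeIndex.get_loc`.

```python
import pandas as pd

def range_position(label, start, stop, step):
    """Hypothetical helper: position of `label` in range(start, stop, step).

    Assumes step > 0. Raises KeyError if the label is not in the range.
    """
    offset = label - start
    if label < start or label >= stop or offset % step != 0:
        raise KeyError(label)
    return offset // step

idx = pd.RangeIndex(start=10, stop=50, step=5)  # labels 10, 15, ..., 45

computed = range_position(35, idx.start, idx.stop, idx.step)
looked_up = idx.get_loc(35)

print(computed, looked_up)  # 5 5
```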

One advantage Int64Index has over RangeIndex is that it allows for duplicate index values. For example, you can make a Series, `alpha`, like this, which has some repeated index values.

``````alpha = pd.Series([2, 3, 5, 7, 11], index = [0, 0, 1, 1, 2])
print(alpha)
## 0     2
## 0     3
## 1     5
## 1     7
## 2    11
## dtype: int64
``````

And then when you do things like ask for the element with label 0, you’ll get back every element with label 0.

``````alpha.loc[0]
## 0    2
## 0    3
## dtype: int64
``````
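
One consequence worth knowing: with duplicate labels, the type of the result depends on how many elements match. A short sketch using `alpha`, with `Index.is_unique` as a quick way to check for duplicates:

```python
import pandas as pd

alpha = pd.Series([2, 3, 5, 7, 11], index=[0, 0, 1, 1, 2])

print(alpha.index.is_unique)  # False

repeated = alpha.loc[0]  # label 0 appears twice, so a Series comes back
single = alpha.loc[2]    # label 2 appears once, so a scalar comes back

print(type(repeated).__name__)  # Series
print(single)                   # 11
```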

Now that we know how to access data from a Series using an index, overwriting data is pretty straightforward.

For example, if you have the Series `foo` with values `[10, 20, 30, 40, 50]` and index labels `['a', 'b', 'c', 'd', 'e']`

``````foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(foo)
## a    10
## b    20
## c    30
## d    40
## e    50
## dtype: int64
``````

and you want to change the 2nd element to 200, you can do

``````foo.iloc[1] = 200
``````

If you want to set the 1st, 2nd and 3rd elements to 99 you can do

``````foo.iloc[[0, 1, 2]] = 99
``````

or use slicing

``````foo.iloc[:3] = 999
``````

This is like saying, “select every element from the start of the Series up to but excluding position 3” and then update those values to 999.

And obviously, you can do the same exact operations using the index labels with `foo.loc`.

``````foo.iloc[1] = 200         # same as: foo.loc['b'] = 200
foo.iloc[[0, 1, 2]] = 99  # same as: foo.loc[['a', 'b', 'c']] = 99
foo.iloc[:3] = 999        # same as: foo.loc['a':'c'] = 999
``````

What if you wanted to overwrite the entire Series with a new set of values like the ones in this array?

``````new_vals = np.array([5, 10, 15, 20, 25])
``````

Your first instinct might be to overwrite the entire `foo` variable like `foo = pd.Series(new_vals)`, but then you’d lose `foo`’s index. Instead, use slicing to select and overwrite all of `foo`’s values without overwriting its index.

``````foo.iloc[:] = new_vals
print(foo)
## a     5
## b    10
## c    15
## d    20
## e    25
## dtype: int64
``````
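
To see what is at stake, this sketch compares the two approaches. It also shows one alternative not covered in the video: building a new Series and passing `foo`’s index to the constructor explicitly.

```python
import numpy as np
import pandas as pd

foo = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
new_vals = np.array([5, 10, 15, 20, 25])

# Rebuilding from scratch discards the labels and yields a default index
rebuilt = pd.Series(new_vals)
print(list(rebuilt.index))  # [0, 1, 2, 3, 4]

# Overwriting through a full slice keeps the existing labels
foo.iloc[:] = new_vals
print(list(foo.index))      # ['a', 'b', 'c', 'd', 'e']

# Alternative: construct a new Series that reuses foo's index
fresh = pd.Series(new_vals, index=foo.index)
print(fresh.loc['c'])       # 15
```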

Now suppose you have these two Series, `x` and `y`, each with four values, whose indices are different but share a few common values: 0, 2, and 3.

``````x = pd.Series([10, 20, 30, 40])
print(x)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64
``````

``````y = pd.Series([1, 11, 111, 1111], index=[7,3,2,0])
print(y)
## 7       1
## 3      11
## 2     111
## 0    1111
## dtype: int64
``````

What do you think the result of `x.loc[[0, 1]] = y` will be? This one’s a bit strange to get used to, but when you see the result, it’s pretty clear what’s happening.

``````x.loc[[0, 1]] = y
print(x)
## 0    1111.0
## 1       NaN
## 2      30.0
## 3      40.0
## dtype: float64
``````

Pandas starts by searching `x` for the values with index labels 0 and 1. Then it looks for matching labels in `y` to use to overwrite `x`.

Since `x`’s label 1 doesn’t match any elements in `y`, Pandas assigns it the value `NaN`. And since `NaN` only exists as a floating point value in NumPy, Pandas casts the entire Series from ints to floats. We’ll talk more about `NaN` in a future video, but basically it’s a special value to represent missing or invalid data.
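
One way to picture the alignment step is as a `reindex` of `y` onto the selected labels. The sketch below assumes that mental model and starts from float Series so that no dtype change is involved; `aligned` is just an illustrative name.

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0])
y = pd.Series([1.0, 11.0, 111.0, 1111.0], index=[7, 3, 2, 0])

# Look up labels 0 and 1 in y; label 1 is missing, so it becomes NaN
aligned = y.reindex([0, 1])
print(aligned.tolist())  # [1111.0, nan]

# Assigning y directly produces the same values in x
x.loc[[0, 1]] = y
print(x.tolist())        # [1111.0, nan, 30.0, 40.0]
```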

Also note that we could do the same exact thing using slicing. For example, if we do `x.iloc[:2] = y`, Pandas selects the first two values from `x`, and then searches `y` for replacement values with matching index labels.

If we try to do this using a NumPy array on the right-hand side, we’ll get an error. When the right-hand side is a NumPy array, pandas assigns its elements to the left-hand side element by element, and in this case we’re trying to replace 2 values with 4 values.

``````x.loc[[0, 1]] = y.to_numpy()  # ERROR
``````

If the NumPy array on the right-hand side is the same length as the Series subset on the left-hand side, this would work, but note that the array elements get assigned to the Series subset by position, not index label.

``````x.loc[[0, 1]] = y.to_numpy()[:2]
print(x)
## 0     1.0
## 1    11.0
## 2    30.0
## 3    40.0
## dtype: float64
``````
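
Both cases can be reproduced in one short sketch. It starts from a float Series to sidestep dtype changes, and it assumes the length mismatch surfaces as a `ValueError`, which is the exception type pandas is expected to raise here.

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0])
y = pd.Series([1, 11, 111, 1111], index=[7, 3, 2, 0])

# Four values cannot be assigned into a two-element selection
try:
    x.loc[[0, 1]] = y.to_numpy()
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True

# Matching lengths work, and values are assigned by position:
# y's labels (7, 3, ...) are ignored
x.loc[[0, 1]] = y.to_numpy()[:2]
print(x.tolist())       # [1.0, 11.0, 30.0, 40.0]
```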