March 18, 2020

1. Introduction
2. Series
2.1 Series Creation
2.2 Series Basic Operations
2.3 Series Basic Indexing
2.4 Series Overwriting Data
2.5 Series Apply
2.6 Series Concatenation
2.7 Series Boolean Indexing
2.8 Series View Vs Copy
2.9 Series Missing Values
2.10 Series Challenges

import numpy as np
import pandas as pd

# Accessing Series Elements

Suppose we have the following series

x = pd.Series([5, 10, 15, 20, 25])
print(x)
## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## dtype: int64

If you wanted to access the ith element of this series, you might be inclined to use square-bracket indexing notation, just like accessing elements from a Python list or a NumPy array. And this would work: x[0] returns the 1st element, x[1] returns the 2nd element, and so on. But looks can be deceiving, and what’s happening here is probably not what you think. x[0] actually returns the element or elements of the series with index label 0. In this case, that element happens to be the 1st element in the series, but if we shake up the index, the difference becomes apparent.

x.index = np.array([3,1,4,0,2])
x[0]
## 20

So, if you want to access the ith element of a series, you should use the .iloc indexer. For example:

x.iloc[0]
## 5
x.iloc[1]
## 10
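
To make the difference between label and position concrete, here is a minimal sketch that rebuilds the shuffled series from above and compares the explicit access styles. The variable names (`s`, `by_label`, `by_position`) are only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the series from above with a shuffled integer index
s = pd.Series([5, 10, 15, 20, 25], index=np.array([3, 1, 4, 0, 2]))

by_label = s.loc[0]       # element whose index label is 0
by_position = s.iloc[0]   # first element, whatever its label is

print(by_label)     # 20
print(by_position)  # 5

# Index.get_loc translates a label into its position
pos_of_label_0 = s.index.get_loc(0)
print(pos_of_label_0)  # 3
```

Here label 0 sits at position 3, which is why the two lookups disagree.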

With iloc, you can also use negative indexing

x.iloc[-2]
## 20

slicing

x.iloc[1:4:2]
## 1    10
## 0    20
## dtype: int64

and pass in a list, array, or series of indices to access multiple elements at once

x.iloc[[0, 2, 3]]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[np.array([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64
x.iloc[pd.Series([0, 2, 3])]
## 3     5
## 4    15
## 0    20
## dtype: int64

# What The Heck Is The Index?

Let’s take a step back and talk about the index. Every series has an index and its purpose is to provide a label for each element in the series. As we’ve seen, when you make a series from scratch, it automatically gets an index of sequential values starting from 0. For example, here we make a series to represent the test grades of 6 students.

grades = pd.Series([82, 94, 77, 89, 91, 54])
print(grades)
## 0    82
## 1    94
## 2    77
## 3    89
## 4    91
## 5    54
## dtype: int64

We can change the index pretty easily, just by setting it equal to another array, list, or series of values with the proper length. The index values don’t even need to be integers. In fact, they’re often represented as strings since strings are more descriptive. For example,

grades.index = ['john', 'ray', 'sue', 'ben', 'kellie', 'ed']
print(grades)
## john      82
## ray       94
## sue       77
## ben       89
## kellie    91
## ed        54
## dtype: int64

Now, if we wanted to know what grade Sue got, we can access her grade value by index using

grades['sue']
## 77

This convenience of being able to access values by some descriptive ID is a really powerful feature. But that’s not the only advantage. Remember how we discussed that, when you add two Series together, values with the same index get added to each other? This means we could have a second Series, grades2, with each student’s grade on another test, and then calculate an average score per student, even if the students are in a different order in each series.

grades2 = pd.Series(
    data=[91, 90, 79, 83, 98, 90],
    index=['ray', 'ed', 'ben', 'kellie', 'sue', 'john']
)
# Values are matched by index label before being added
(grades + grades2) / 2
## ben       84.0
## ed        72.0
## john      86.0
## kellie    87.0
## ray       92.5
## sue       87.5
## dtype: float64
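
One consequence of matching by label is worth seeing in a small sketch: if a label appears in only one of the two series, it has no partner to be added to. The series `a` and `b` and the student 'zoe' below are made up for illustration and are not part of the grades example.

```python
import pandas as pd

a = pd.Series([82, 94, 77], index=['john', 'ray', 'sue'])
b = pd.Series([91, 98, 60], index=['ray', 'sue', 'zoe'])  # 'zoe' is a made-up student

# Values are paired up by label, not by position
avg = (a + b) / 2
print(avg)

# 'john' and 'zoe' each appear in only one series, so their result is NaN
print(avg.isna().sum())  # 2
```

The labels shared by both series ('ray' and 'sue') get a real average, while the unmatched ones come back as missing values.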

Now, remember when I said x[0] returns the value with index label 0, not the value at position index 0? That’s not entirely true… If you test it out on our grades Series, grades[0] actually does give us the value at position index 0.

grades[0]
## 82

This works because, in this case, our index datatype is string, so when we ask for grades[0], pandas assumes we’re searching by position, not label. In the earlier example our index datatype was int, so pandas assumed we were searching for the label 0. This can be really confusing and lead to all sorts of bugs, so it’s highly recommended to avoid this square bracket notation at all costs. Instead, be explicit and use .iloc for positional indexing and .loc for label indexing.

grades.loc['john']
## 82
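
As a quick sketch of what "being explicit" buys you, the snippet below rebuilds the grades series and shows that .loc only accepts labels. The exact exception type raised for a non-label key is an assumption that may vary between pandas versions, so both likely candidates are caught.

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['john', 'ray', 'sue', 'ben', 'kellie', 'ed'],
)

first_by_position = grades.iloc[0]   # position 0
first_by_label = grades.loc['john']  # label 'john'
print(first_by_position, first_by_label)  # 82 82

# .loc looks for a label, and 0 is not one of the labels here
try:
    grades.loc[0]
    loc_rejected_position = False
except (KeyError, TypeError):
    loc_rejected_position = True
print(loc_rejected_position)  # True
```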

Just like positional indexing, you can use slicing to access a range of elements by labelled index (which is really cool).

grades.loc['sue':'kellie']  # 77,89,91
## sue       77
## ben       89
## kellie    91
## dtype: int64

Just note that this type of slicing includes the right boundary, unlike positional slicing.
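
A side-by-side sketch of the two slicing styles on the same grades series makes the boundary difference easy to check:

```python
import pandas as pd

grades = pd.Series(
    [82, 94, 77, 89, 91, 54],
    index=['john', 'ray', 'sue', 'ben', 'kellie', 'ed'],
)

by_label = grades.loc['sue':'kellie']  # right boundary 'kellie' is included
by_position = grades.iloc[2:4]         # right boundary (position 4) is excluded

print(list(by_label.index))     # ['sue', 'ben', 'kellie']
print(list(by_position.index))  # ['sue', 'ben']
```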

You can also use a list, array, or series of labels to access multiple elements.

grades.loc[['ray', 'john', 'ben']]
## ray     94
## john    82
## ben     89
## dtype: int64
grades.loc[np.array(['ray', 'john', 'ben'])]
## ray     94
## john    82
## ben     89
## dtype: int64
grades.loc[pd.Series(['ray', 'john', 'ben'])]
## ray     94
## john    82
## ben     89
## dtype: int64

# Index Types And Performance

Before we move on, we need to address a few more things about the index. You may have noticed that when you make a series from scratch, pandas automatically gives you something called a range index. For example

# Make a series of 10M random normal values
x = pd.Series(np.random.normal(size = 10**7))
x.index
## RangeIndex(start=0, stop=10000000, step=1)

If we make a second series, y, but this time we specify the index as an integer array from 0 to 9999999…

# Make a series with an int index
y = pd.Series(np.random.normal(size = 10**7), index=np.arange(10**7))
y.index
## Int64Index([      0,       1,       2,       3,       4,       5,       6,
##                   7,       8,       9,
##             ...
##             9999990, 9999991, 9999992, 9999993, 9999994, 9999995, 9999996,
##             9999997, 9999998, 9999999],
##            dtype='int64', length=10000000)

From a high level user’s perspective, you can pretty much ignore this subtle difference. On the surface, these Series behave the same way. But, internally these things are significantly different.

Perhaps the most obvious difference is that Int64Index uses much more memory than the range index, since it literally stores all 10 million index values, whereas the range index basically stores 3 values: start, stop, and step size. You can see this pretty clearly using sys.getsizeof() and noting that y is about twice the size of x in memory.

import sys

# size (in memory) of each series
sys.getsizeof(x)  # ~80M bytes
## 80000160
sys.getsizeof(y)  # ~160M bytes
## 160000032
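
sys.getsizeof() measures the whole series, data included. To isolate the index itself, one option is the index's memory_usage() method, sketched below on a smaller series so it runs quickly. The names `x_small`, `y_small`, and the size `n` are illustrative assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd

n = 10**5  # smaller than the 10M above, just to keep this fast

x_small = pd.Series(np.zeros(n))                      # default RangeIndex
y_small = pd.Series(np.zeros(n), index=np.arange(n))  # explicit integer index

range_bytes = x_small.index.memory_usage()
int_bytes = y_small.index.memory_usage()

print(range_bytes)  # small, and does not grow with n
print(int_bytes)    # grows with n, since every label is stored
```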

Less obvious is the fact that range index actually provides a performance boost over Int64Index. When you ask for element 342, RangeIndex knows exactly where to go to fetch that data just by using start and stepsize, but for Int64Index it’s not that simple since there’s no guarantee that the index values jump by a fixed size, or that they’re in order. We’ll cover this more later when we start joining dataframes to each other.
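
The idea that a range can locate a label with arithmetic alone can be sketched as follows. This is a conceptual illustration, not a description of pandas internals, and `range_position` is a hypothetical helper that assumes the label really is in the range.

```python
import pandas as pd

idx = pd.RangeIndex(start=10, stop=50, step=5)  # labels 10, 15, ..., 45

def range_position(label, start, step):
    """Position of `label` in a range, from start and step alone (illustrative helper)."""
    return (label - start) // step

pos_arithmetic = range_position(25, idx.start, idx.step)
pos_pandas = idx.get_loc(25)

print(pos_arithmetic, pos_pandas)  # 3 3
```

An arbitrary integer index offers no such shortcut, because its labels need not be evenly spaced or sorted.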

Now, one advantage int or string indexes have over range index is that they allow for duplicate index values. For example, you can make a series with repeated values like this.

foo = pd.Series([2, 3, 5, 7, 11], index = [0, 0, 1, 1, 2])

And then when you do things like ask for the element with label 0, you’ll get back every element with label 0

foo.loc[0]
## 0    2
## 0    3
## dtype: int64
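
Since a label lookup can return either a single value or a whole series depending on whether the label repeats, it can help to check for duplicates up front. A minimal sketch using the same foo series:

```python
import pandas as pd

foo = pd.Series([2, 3, 5, 7, 11], index=[0, 0, 1, 1, 2])

print(foo.index.is_unique)     # False
print(foo.index.duplicated())  # [False  True False  True False]

only_one = foo.loc[2]  # label 2 appears once, so this is a single value
several = foo.loc[0]   # label 0 appears twice, so this is a Series
print(type(several).__name__)  # Series
```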