Python Pandas For Your Grandpa - 2.4 Series Boolean Indexing

In this section, we’ll see how to use boolean indexing to select values from a Series based on logical conditions. Just like NumPy arrays, you can subset a Pandas Series using a boolean index. For example, if you have a Series of integers like this one called `foo`,

``````import numpy as np
import pandas as pd

foo = pd.Series([20, 50, 11, 45, 17, 31])
print(foo)
## 0    20
## 1    50
## 2    11
## 3    45
## 4    17
## 5    31
## dtype: int64
``````

if you check `foo < 20`, you’ll get back a corresponding Series of boolean values.

``````foo < 20
## 0    False
## 1    False
## 2     True
## 3    False
## 4     True
## 5    False
## dtype: bool
``````

If you assign that Series to a variable called `mask`, you can use it to subset `foo`, picking out the values less than 20.

``````mask = foo < 20
foo.loc[mask]
## 2    11
## 4    17
## dtype: int64
``````

Or, if you want to skip the intermediate step, you can use a one-liner like

``````foo.loc[foo < 20]
## 2    11
## 4    17
## dtype: int64
``````

Now, you might think that the ith value in `foo` gets returned if the ith value in `mask` is True. And you’d kind of be right, but watch what happens if we swap the index labels 4 and 5 in `foo` and then do the exact same boolean subset using `mask`.

``````foo.index = [0, 1, 2, 3, 5, 4]
foo.loc[mask]
## 2    11
## 4    31
## dtype: int64
``````

This time, the result includes 31 instead of 17. That’s because `foo.loc[mask]` aligns on index labels: it picks out the elements of `foo` whose labels match the labels where `mask` is True. Usually this is fine, but in some cases it might not be what you want. If you’d rather include or exclude values of `foo` by the positions of the True and False values in `mask`, just use `mask`’s underlying NumPy array to subset `foo`, like

``````foo.loc[mask.to_numpy()]
## 2    11
## 5    17
## dtype: int64
``````

In this case the third and fifth values of `mask` are True, so we get back the third and fifth values of `foo`.
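To see the two behaviours side by side, here is a minimal, self-contained sketch that rebuilds `foo` and `mask` from scratch. The names `by_label` and `by_position` are only illustrative and do not appear in the original.

```python
import pandas as pd

foo = pd.Series([20, 50, 11, 45, 17, 31])
mask = foo < 20                     # built while foo's index was 0..5

foo.index = [0, 1, 2, 3, 5, 4]      # swap index labels 4 and 5

# A boolean Series is aligned to foo by index label before selecting.
by_label = foo.loc[mask]

# A plain NumPy array carries no labels, so selection is purely positional.
by_position = foo.loc[mask.to_numpy()]

print(by_label.tolist())     # [11, 31]
print(by_position.tolist())  # [11, 17]
```

The only difference between the two selections is whether the mask still carries an index for Pandas to align on.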

If you want to combine boolean Series together, you can do that too, using `&` for "and" and `|` for "or". Note that when you combine two boolean Series, Pandas matches and combines boolean values based on their index labels, not their positions.

For example, suppose we have a Series called `ages` with the ages of five people,

``````ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)
print(ages)
## peter     42
## lois      43
## chris     14
## meg       18
## stewie     1
## dtype: int64
``````

and a corresponding Series called `genders` with their gender.

``````genders = pd.Series(
    data = ['female', 'female', 'male', 'male', 'male'],
    index = ['lois', 'meg', 'chris', 'peter', 'stewie'],
    dtype = 'string'
)
print(genders)
## lois      female
## meg       female
## chris       male
## peter       male
## stewie      male
## dtype: string
``````

Even though their indexes are in a different order, we can still answer questions like,

Who’s a male younger than 18?

``````mask = (genders == 'male') & (ages < 18)
mask.loc[mask]
## chris     True
## stewie    True
## dtype: bool
``````

Here, `genders == 'male'` builds a Series identifying whether each person is male, and `ages < 18` builds a second Series identifying whether each person is younger than 18. We combine them with an ampersand, i.e. the elementwise and operator, to identify whether each person is both male and younger than 18. After assigning the result to a variable called `mask`, we can index `mask` with itself to keep only the True entries. The names of the males younger than 18 are then in the index of the result.
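The `|` operator follows the same pattern. As a small sketch, reusing the `ages` and `genders` data from above (the name `either` is just an illustrative choice), we can ask who is female or younger than 18, and then use the result to pull the matching rows out of `ages`:

```python
import pandas as pd

ages = pd.Series(
    data=[42, 43, 14, 18, 1],
    index=['peter', 'lois', 'chris', 'meg', 'stewie'],
)
genders = pd.Series(
    data=['female', 'female', 'male', 'male', 'male'],
    index=['lois', 'meg', 'chris', 'peter', 'stewie'],
    dtype='string',
)

# Elementwise "or": True where a person is female, younger than 18, or both.
# The two Series are matched by name, not by position.
either = (genders == 'female') | (ages < 18)

# Use the combined mask to select the matching ages.
selected = ages.loc[either]
print(sorted(selected.index))  # ['chris', 'lois', 'meg', 'stewie']
```

Peter is the only person left out, since he is neither female nor younger than 18.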

We can also use the `~` operator to negate a boolean Series. So for example, if we do `~mask`, we can determine “Who’s not both male and younger than 18?”. In other words, “Who is female or at least 18?”.

``````~mask
## chris     False
## lois       True
## meg        True
## peter      True
## stewie    False
## dtype: bool
``````
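The rephrasing above is De Morgan's law: negating an "and" gives the "or" of the negations. A quick sketch to check this on the same data (the names `male`, `under_18`, `lhs` and `rhs` are illustrative only):

```python
import pandas as pd

ages = pd.Series(
    data=[42, 43, 14, 18, 1],
    index=['peter', 'lois', 'chris', 'meg', 'stewie'],
)
genders = pd.Series(
    data=['female', 'female', 'male', 'male', 'male'],
    index=['lois', 'meg', 'chris', 'peter', 'stewie'],
    dtype='string',
)

male = genders == 'male'
under_18 = ages < 18

lhs = ~(male & under_18)        # not (male and under 18)
rhs = (~male) | (~under_18)     # (not male) or (not under 18)

# Both sides should flag exactly the same people.
print(lhs.to_dict() == rhs.to_dict())
```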

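When you combine boolean Series, make sure you wrap each condition in parentheses. In Python, `&` and `|` bind more tightly than comparison operators like `>=` and `<=`, so without parentheses the interpreter evaluates things in the wrong order and you’ll probably get an error. For example, if we try to find the people between 18 and 42 like this, Python evaluates `18 & ages` first and we get an error.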

``````ages.loc[ages >= 18 & ages <= 42]  # ERROR
``````

The solution here is just to wrap the conditions in parentheses like this.

``````ages.loc[(ages >= 18) & (ages <= 42)]
## peter    42
## meg      18
## dtype: int64
``````
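As a hedged sketch of what goes wrong, the snippet below catches the error from the unparenthesized version and then shows `Series.between()` as one alternative for an inclusive range check. Using `between` is a suggestion of this sketch, not something the original walkthrough relies on, and `in_range` is an illustrative name.

```python
import pandas as pd

ages = pd.Series(
    data=[42, 43, 14, 18, 1],
    index=['peter', 'lois', 'chris', 'meg', 'stewie'],
)

# Without parentheses this parses as: ages >= (18 & ages) <= 42,
# a chained comparison that needs a single True/False from a whole Series.
try:
    ages.loc[ages >= 18 & ages <= 42]
except ValueError as err:
    print(type(err).__name__)  # ValueError

# between() is inclusive on both ends by default, so this matches
# (ages >= 18) & (ages <= 42) with no precedence pitfalls.
in_range = ages.loc[ages.between(18, 42)]
print(in_range.to_dict())  # {'peter': 42, 'meg': 18}
```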