Python Pandas For Your Grandpa - 2.4 Series Boolean Indexing
In this section, we’ll see how to use boolean indexing to select values from a Series based on logical conditions. Just like NumPy arrays, you can subset a Pandas Series using a boolean index. For example, suppose you have a Series of integers called foo:
import numpy as np
import pandas as pd
foo = pd.Series([20, 50, 11, 45, 17, 31])
print(foo)
## 0 20
## 1 50
## 2 11
## 3 45
## 4 17
## 5 31
## dtype: int64
If you check foo < 20, you’ll get back a corresponding Series of boolean values.
foo < 20
## 0 False
## 1 False
## 2 True
## 3 False
## 4 True
## 5 False
## dtype: bool
If you assign that Series to a variable called mask, you can use it to subset foo, picking out the values less than 20.
mask = foo < 20
foo.loc[mask]
## 2 11
## 4 17
## dtype: int64
Or, if you want to skip the intermediate step, you can do it in one line:
foo.loc[foo < 20]
## 2 11
## 4 17
## dtype: int64
Now, you might think that the ith value in foo gets returned if the ith value in mask is True. And you’d kind of be right, but watch what happens if we swap the last two index labels of foo, 4 and 5, and then do the exact same boolean subset using mask.
foo.index = [0, 1, 2, 3, 5, 4]
foo.loc[mask]
## 2 11
## 4 31
## dtype: int64
This time, the result includes 31 instead of 17. That’s because foo.loc[mask] picks out the elements of foo whose index labels match the labels in mask that have a True value. Usually this is fine, but in some cases it might not be what you want. If you’d rather include or exclude values of foo by the corresponding positions of the True and False values in mask, just use mask's underlying NumPy array to subset foo, like
foo.loc[mask.to_numpy()]
## 2 11
## 5 17
## dtype: int64
In this case, the third and fifth values of mask are True, so we get back the third and fifth values of foo.
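To see the two behaviors side by side, here’s a minimal sketch that rebuilds foo and mask from above. The names by_label and by_position are made up for illustration.

```python
import pandas as pd

foo = pd.Series([20, 50, 11, 45, 17, 31])
mask = foo < 20                          # True at labels 2 and 4

# Swap the last two index labels, as in the example above
foo.index = [0, 1, 2, 3, 5, 4]

by_label = foo.loc[mask]                 # mask is matched to foo by index label
by_position = foo.loc[mask.to_numpy()]   # a plain array is matched by position

print(by_label.tolist())                 # [11, 31]
print(by_position.tolist())              # [11, 17]
```

The label-based version follows the index wherever it goes, while the array-based version only looks at the order of the values.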
If you want to combine boolean Series, you can do that too, using & for and and | for or. Note that when you combine two boolean Series, Pandas matches and combines boolean values based on their index labels.
For example, suppose we have a Series called ages with the ages of five people,
ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)
print(ages)
## peter 42
## lois 43
## chris 14
## meg 18
## stewie 1
## dtype: int64
and a corresponding Series called genders with their genders.
genders = pd.Series(
    data = ['female', 'female', 'male', 'male', 'male'],
    index = ['lois', 'meg', 'chris', 'peter', 'stewie'],
    dtype = 'string'
)
print(genders)
## lois female
## meg female
## chris male
## peter male
## stewie male
## dtype: string
Even though their indexes are in a different order, we can still answer questions like,
Who’s a male younger than 18?
mask = (genders == 'male') & (ages < 18)
mask.loc[mask]
## chris True
## stewie True
## dtype: bool
In this case, we build one Series that identifies whether each person is male and a second Series that identifies whether each person is younger than 18. Then we combine them with an ampersand, i.e. the elementwise and operator, to identify whether each person is both male and younger than 18. If we assign the result to a variable called mask, we can index it with itself to get the names of males younger than 18. The names appear in the index of the result.
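The | operator follows the same pattern. As a minimal sketch, reusing the ages data from above (the variable name young_or_old is made up for illustration), here’s one way to find everyone younger than 5 or older than 40.

```python
import pandas as pd

ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)

# Elementwise "or": True wherever at least one of the conditions holds
young_or_old = (ages < 5) | (ages > 40)

print(ages.loc[young_or_old])
## peter 42
## lois 43
## stewie 1
## dtype: int64
```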
We can also use the ~ operator to negate a boolean Series. So for example, if we do ~mask, we can determine “Who’s not both male and younger than 18?” In other words, “Who is female or at least 18?”
~mask
## chris False
## lois True
## meg True
## peter True
## stewie False
## dtype: bool
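The “in other words” rewording is De Morgan’s law: negating an and is the same as or-ing the negations. Here’s a small sketch that checks it. To keep things simple, it assumes a hand-written boolean Series called is_male (not part of the example above) that shares its index with ages.

```python
import pandas as pd

ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)

# Assumption for illustration: gender flags written in the same order as ages
is_male = pd.Series([True, False, True, False, True], index = ages.index)
is_minor = ages < 18

negated = ~(is_male & is_minor)          # not (male and younger than 18)
de_morgan = (~is_male) | (~is_minor)     # (not male) or (at least 18)

print(negated.equals(de_morgan))         # True
print(negated.loc[negated].index.tolist())   # ['peter', 'lois', 'meg']
```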
When you combine boolean Series, make sure you wrap each condition in parentheses. The & and | operators bind more tightly than comparison operators like >= and <=, so without parentheses the interpreter evaluates things in the wrong order and you’ll probably get an error. For example, if we try to find people between 18 and 42 like this, we’ll get an error.
ages.loc[ages >= 18 & ages <= 42] # ERROR
The solution here is just to wrap the conditions in parentheses like this.
ages.loc[(ages >= 18) & (ages <= 42)]
## peter 42
## meg 18
## dtype: int64
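If you’re curious what goes wrong without the parentheses, here’s a rough sketch. Python evaluates 18 & ages first, which turns the whole expression into a chained comparison that Pandas can’t reduce to a single True or False. The sketch also shows Series.between(), which isn’t covered above but is one alternative for range checks like this one. It includes both endpoints by default.

```python
import pandas as pd

ages = pd.Series(
    data = [42, 43, 14, 18, 1],
    index = ['peter', 'lois', 'chris', 'meg', 'stewie']
)

# Without parentheses this is read as: ages >= (18 & ages) <= 42
error_name = None
try:
    ages.loc[ages >= 18 & ages <= 42]
except ValueError as err:
    error_name = type(err).__name__
print(error_name)                        # ValueError

# An alternative that needs no parentheses at all
in_range = ages.loc[ages.between(18, 42)]
print(in_range.tolist())                 # [42, 18]
```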