Python Pandas For Your Grandpa - 2.5 Series Missing Values
In this section, we’ll see how to use NaN to represent missing or invalid values in a Series. Let’s start by talking about NaN prior to Pandas version 1.0.0. Back in the day, if you wanted to represent missing or invalid data, you had to use NumPy’s special floating point constant, np.nan. So, if you had a Pandas Series of integers like this
import numpy as np
import pandas as pd
roux = pd.Series([1, 2, 3])
print(roux)
## 0 1
## 1 2
## 2 3
## dtype: int64
And then you set the 2nd element to np.nan.
roux.iloc[1] = np.nan
The Series would get cast to floats because nan only exists in NumPy as a floating point constant.
print(roux)
## 0 1.0
## 1 NaN
## 2 3.0
## dtype: float64
By the time you’re reading this article, this may have changed, but at the moment, fixing this problem is still on the NumPy Roadmap.
So, in the past you couldn’t have a Pandas Series of integers with NaN values, because you couldn’t (and still can’t) have a NumPy array of integers with NaN values. If you wanted NaN values, your Series had to be a Series of floats.
Then Pandas released version 1.0.0, which included a nullable integer datatype. It’s called “Int64” with a capital “I” to differentiate it from NumPy’s “int64” with a lower case “i”. So, let’s rebuild that Series, roux, this time specifying dtype='Int64'.
roux = pd.Series([1, 2, 3], dtype='Int64')
And, again, let’s set the 2nd element to np.nan.
roux.iloc[1] = np.nan
print(roux)
## 0 1
## 1 <NA>
## 2 3
## dtype: Int64
This time, the Series retains its Int64 datatype and doesn’t get cast to float. In this case, a better way to set that value to NaN is to use pd.NA.
roux.iloc[1] = pd.NA
print(roux)
## 0 1
## 1 <NA>
## 2 3
## dtype: Int64
You could also use the None keyword, but I’d probably opt for pd.NA.
Alright, now let’s see how this works on a Series of strings. So back in the day, if you wanted to build a Series of strings, you would do something like
gumbo = pd.Series(['a', 'b', 'c'])
print(gumbo)
## 0 a
## 1 b
## 2 c
## dtype: object
And then if you set the 2nd value to np.nan and the third value to None,
gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None
and then you print the Series, it actually looks like this worked pretty well
print(gumbo)
## 0 a
## 1 NaN
## 2 None
## dtype: object
...but should it have?
Notice, the Series has dtype object. What this means is, we basically have a python list. Each element of the Series is actually just a pointer, or a memory address, pointing to some random location in your computer’s memory that’s storing the value of the element. This is bad because:
- it’s inefficient for data access and
- it doesn’t enforce a homogeneous datatype constraint on our Series
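To see those mixed types directly, you can map Python’s built-in type function over the Series (note this snippet just rebuilds the gumbo Series from above):

```python
import pandas as pd
import numpy as np

gumbo = pd.Series(['a', 'b', 'c'])
gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None

# Each element is a pointer to a Python object, and the objects
# don't even share a common type
print(gumbo.map(type))
## 0         <class 'str'>
## 1       <class 'float'>
## 2    <class 'NoneType'>
## dtype: object
```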
We’re supposed to have a Series of strings, but I set the second element to a floating point. Pandas 1.0.0 fixed both of these issues in one fell swoop with the StringDtype extension type. So, today we’d rebuild that Series just like before, except we’d specify dtype='string'
gumbo = pd.Series(['a', 'b', 'c'], dtype='string')
print(gumbo)
## 0 a
## 1 b
## 2 c
## dtype: string
And now if we set the 2nd value to np.nan and the third value to None, our Series would end up looking like this
gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None
print(gumbo)
## 0 a
## 1 <NA>
## 2 <NA>
## dtype: string
If you’re a little confused by this - don’t worry it’s not that important for using Pandas and it’s something you’ll probably understand more over time.
In any case, Pandas provides two helper functions for identifying NaN values. If you have a Series x with some NaN values,
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
## 0 1
## 1 <NA>
## 2 3
## 3 <NA>
## dtype: Int64
you can use pd.isna() to check whether each value is NaN
pd.isna(x)
## 0 False
## 1 True
## 2 False
## 3 True
## dtype: bool
and pd.notna() to do the opposite.
pd.notna(x)
## 0 True
## 1 False
## 2 True
## 3 False
## dtype: bool
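Since pd.isna() returns a boolean Series, a handy trick is summing it to count the missing values (rebuilding x from above):

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')

# True counts as 1 and False as 0, so the sum is the number of NaNs
n_missing = pd.isna(x).sum()
print(n_missing)
## 2
```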
If you want to replace NaN values with -1, you could do something like
x.loc[pd.isna(x)] = -1
and this would work, but Pandas provides a really convenient fillna() method that makes this even simpler. So instead you could just do
x.fillna(-1)
## 0 1
## 1 -1
## 2 3
## 3 -1
## dtype: Int64
Note that this returns a modified copy of x, so x doesn’t actually get changed here. You can see if I print(x) it hasn’t changed.
print(x)
## 0 1
## 1 <NA>
## 2 3
## 3 <NA>
## dtype: Int64
If you want the changes to stick, you can do the same thing and set the inplace parameter equal to True.
x.fillna(-1, inplace=True)
print(x)
## 0 1
## 1 -1
## 2 3
## 3 -1
## dtype: Int64
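As an aside, if you’d rather avoid the inplace parameter, reassigning the result accomplishes the same thing:

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')

# fillna() returns a filled copy; reassigning it to x has the same
# net effect as x.fillna(-1, inplace=True)
x = x.fillna(-1)
print(x)
## 0    1
## 1   -1
## 2    3
## 3   -1
## dtype: Int64
```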
It’s also important to understand how NaNs work with boolean indexing. Suppose you have a Series of values like this
goo = pd.Series([10, 20, 30, 40])
print(goo)
## 0 10
## 1 20
## 2 30
## 3 40
## dtype: int64
and a corresponding Series of boolean values
choo = pd.Series([True, False, pd.NA, True])
print(choo)
## 0 True
## 1 False
## 2 <NA>
## 3 True
## dtype: object
What do you think goo.loc[choo] will return?
goo.loc[choo] # ValueError: Cannot mask with non-boolean array containing NA / NaN values
In this case we get “ValueError: Cannot mask with non-boolean array containing NA / NaN values”. Notice that choo here is one of those pesky Series with dtype ‘object’. In other words, it’s a Series of pointers. To fix this, we can rebuild choo, specifying dtype="boolean".
choo = pd.Series([True, False, pd.NA, True], dtype="boolean")
print(choo)
## 0 True
## 1 False
## 2 <NA>
## 3 True
## dtype: boolean
and now when we do goo.loc[choo], we get back 10 and 40, so the NaN value in choo is essentially ignored.
goo.loc[choo]
## 0 10
## 3 40
## dtype: int64
Keep in mind that the negation of NaN is still NaN, so if we do goo.loc[~choo], we only get back one row, not the two rows excluded in the previous subset.
goo.loc[~choo]
## 1 20
## dtype: int64
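If you actually want both rows excluded by the previous subset, one option is to fill the NA in the mask with False before negating (rebuilding goo and choo from above):

```python
import pandas as pd

goo = pd.Series([10, 20, 30, 40])
choo = pd.Series([True, False, pd.NA, True], dtype="boolean")

# fillna(False) makes the NA row count as "not selected", so the
# negated mask recovers both rows that goo.loc[choo] left out
print(goo.loc[~choo.fillna(False)])
## 1    20
## 2    30
## dtype: int64
```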