Python Pandas For Your Grandpa - 2.5 Series Missing Values

In this section, we’ll see how to use NaN to represent missing or invalid values in a Series. Let’s start by talking about NaN prior to Pandas version 1.0.0. Back in the day, if you wanted to represent missing or invalid data, you had to use NumPy’s special floating point constant, np.nan. So, if you had a Pandas Series of integers like this

import numpy as np
import pandas as pd

roux = pd.Series([1, 2, 3])
print(roux)
## 0    1
## 1    2
## 2    3
## dtype: int64

And then you set the 2nd element to np.nan.

roux.iloc[1] = np.nan

The Series would get cast to floats because nan only exists in NumPy as a floating point constant.

print(roux)
## 0    1.0
## 1    NaN
## 2    3.0
## dtype: float64

By the time you’re reading this article, this may have changed, but at the moment, fixing this problem is still on the NumPy Roadmap.

So, in the past you couldn’t have a Pandas Series of integers with NaN values, because you couldn’t (and still can’t) have a NumPy array of integers with NaN values. If you wanted NaN values, your Series had to be a Series of floats.
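If you’re curious, you can see that NumPy behavior directly. Here’s a quick sketch (the exact error message may vary a bit across NumPy versions):

arr = np.array([1, 2, 3])
print(arr.dtype)
## int64

try:
    arr[1] = np.nan  # NumPy can't store NaN in an integer array
except ValueError as err:
    print(err)
## cannot convert float NaN to integer

print(np.array([1, np.nan, 3]).dtype)  # a NaN in the data forces a float dtype
## float64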

Then Pandas released version 1.0.0 which included a Nullable integer datatype. It’s called “Int64” with a capital “I” to differentiate it from NumPy’s “int64” with a lower case “i”. So, let’s rebuild that Series, roux, this time specifying dtype='Int64'.

roux = pd.Series([1, 2, 3], dtype='Int64')

And, again, let’s set the 2nd element to np.nan.

roux.iloc[1] = np.nan
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64

This time, the Series retains its Int64 datatype and doesn’t get cast to float. In this case, a better way to set that value to NaN is to use pd.NA.

roux.iloc[1] = pd.NA
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64

You could also use the None keyword, but I’d probably opt for pd.NA.
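For instance, assigning None to the same element gives the identical result on this nullable Series:

roux.iloc[1] = None
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64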

Alright, now let’s see how this works on a Series of strings. So back in the day, if you wanted to build a Series of strings, you would do something like

gumbo = pd.Series(['a', 'b', 'c'])
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: object

And then if you set the second value to np.nan and the third value to None,

gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None

and then you print the Series, it actually looks like this worked pretty well

print(gumbo)
## 0       a
## 1     NaN
## 2    None
## dtype: object

...but should it have?

Notice the Series has dtype object. What this means is that we basically have a Python list. Each element of the Series is actually just a pointer, a memory address pointing to some random location in your computer’s memory that stores the value of the element. This is bad because:

  1. it’s inefficient for data access and
  2. it doesn’t enforce a homogeneous datatype constraint on our Series

We’re supposed to have a Series of strings, but I set the second element to a floating point value. Pandas 1.0.0 fixed both of these issues in one fell swoop with the StringDtype extension type. So, today we’d rebuild that Series just like before, except we’d specify dtype='string'

gumbo = pd.Series(['a', 'b', 'c'], dtype='string')
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: string

And now if we set the 2nd value to pd.NA and the third value to None, our Series would end up looking like this

gumbo.iloc[1] = pd.NA
gumbo.iloc[2] = None
print(gumbo)
## 0       a
## 1    <NA>
## 2    <NA>
## dtype: string
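As a bonus, the 'string' dtype also enforces the homogeneous-type constraint from earlier. Here’s a minimal sketch; the exact exception type and message may differ between Pandas versions, but non-string, non-missing values should be rejected:

try:
    gumbo.iloc[0] = 3.14  # non-string values are rejected (pd.NA / None are still allowed)
except (TypeError, ValueError) as err:
    print('rejected:', err)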

If you’re a little confused by this, don’t worry. It’s not that important for using Pandas, and it’s something you’ll probably understand more over time.

In any case, Pandas provides two helper functions for identifying NaN values. If you have a Series x with some NaN values,

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64

you can use pd.isna() to check whether each value is NaN

pd.isna(x)
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool

and pd.notna() to do the opposite.

pd.notna(x)
## 0     True
## 1    False
## 2     True
## 3    False
## dtype: bool
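Both of these are also available as Series methods, and pd.isna() happily accepts scalars too:

x.isna()
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool

pd.isna(pd.NA)
## True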

If you want to replace NaN values with -1, you could do something like

x.loc[pd.isna(x)] = -1

and this works, but Pandas provides a really convenient fillna() method that makes it even simpler. So, after rebuilding x with its missing values,

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')

you could instead just do

x.fillna(-1)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64

Note that this returns a modified copy of x, so x doesn’t actually get changed here. You can see that if I print(x), it hasn’t changed.

print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64

If you want the changes to stick, you can do the same thing and set the inplace parameter equal to True.

x.fillna(-1, inplace=True)
print(x)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64
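And if you’d rather drop missing values than fill them, Series has a dropna() method. A quick example on a fresh Series (y is just a throwaway name here):

y = pd.Series([1, pd.NA, 3], dtype='Int64')
y.dropna()
## 0    1
## 2    3
## dtype: Int64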

It’s also important to understand how NaNs work with boolean indexing. Suppose you have a Series of values like this.

goo = pd.Series([10, 20, 30, 40])
print(goo)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64

and a corresponding Series of boolean values

choo = pd.Series([True, False, pd.NA, True])
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: object

what do you think goo.loc[choo] will return?

goo.loc[choo]  # ValueError: Cannot mask with non-boolean array containing NA / NaN values

In this case we get “ValueError: Cannot mask with non-boolean array containing NA / NaN values”. Notice that choo here is one of those pesky Series with dtype ‘object’. In other words, it’s a Series of pointers. To fix this, we can rebuild choo, specifying dtype="boolean".

choo = pd.Series([True, False, np.nan, True], dtype="boolean")
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: boolean
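By the way, if you already have an object-dtype mask like the original choo, you shouldn’t need to rebuild it from scratch; converting it with astype() should work too (a sketch, assuming Pandas ≥ 1.0 behavior):

choo = pd.Series([True, False, pd.NA, True]).astype('boolean')
print(choo.dtype)
## boolean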

and now when we do goo.loc[choo] we get back 10 and 40, so the NaN value in choo is essentially ignored.

goo.loc[choo]
## 0    10
## 3    40
## dtype: int64

Keep in mind that the negation of NaN is still NaN, so if we do goo.loc[~choo], we only get back one row, not the two rows excluded in the previous subset.

goo.loc[~choo]
## 1    20
## dtype: int64
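If you want explicit control over how those missing mask values are treated, one option is to fill them before indexing. For example, to recover the rows excluded by the earlier goo.loc[choo] subset, you could treat NaN as False and then negate:

goo.loc[~choo.fillna(False)]
## 1    20
## 2    30
## dtype: int64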
