Python Pandas For Your Grandpa | Section 2.9 | Series Missing Values
March 18, 2020

Table Of Contents

  1. Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Operations
    2.3 Series Basic Indexing
    2.4 Series Overwriting Data
    2.5 Series Apply
    2.6 Series Concatenation
    2.7 Series Boolean Indexing
    2.8 Series View Vs Copy
    2.9 Series Missing Values
    2.10 Series Challenges

import numpy as np
import pandas as pd

One of the fundamental features of pandas is its ability to represent missing or invalid data using NaN.

This is an interesting lecture for me because I basically spent months putting this course together and right before I published my first lecture, pandas dropped a bomb on me - they released version 1.0.0. This version had breaking changes and major new features. Fortunately, most of my lectures weren’t affected - but this lecture, I basically had to re-write it from scratch.

So, back in the day, if you wanted to represent missing or invalid data, you had to use NumPy’s special floating point constant, np.nan. So, if you had a pandas series of integers like this

roux = pd.Series([1, 2, 3])

And then you set the 2nd element to np.nan

roux.iloc[1] = np.nan

The series would get cast to floats because NaN only existed as a floating point value.

print(roux)
## 0    1.0
## 1    NaN
## 2    3.0
## dtype: float64
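
The same upcast shows up without any assignment at all. Here is a minimal sketch, assuming the Series is built directly from a list that already contains np.nan (the names ints and with_nan are made up for illustration). Building the Series directly also sidesteps any warning a newer pandas release might raise about changing dtype during assignment.

```python
import numpy as np
import pandas as pd

# A list of plain integers is stored with an integer dtype.
ints = pd.Series([1, 2, 3])

# Including np.nan in the data forces a float64 result,
# because NumPy integers have no way to encode "missing".
with_nan = pd.Series([1, np.nan, 3])

print(ints.dtype)      # int64
print(with_nan.dtype)  # float64
```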

Today, pandas has a nullable integer datatype called Int64, with a capital I to differentiate it from NumPy’s int64 with a lowercase i. So, let’s rebuild that Series, this time specifying dtype='Int64'.

roux = pd.Series([1, 2, 3], dtype='Int64')

And, again, let’s set the 2nd element to np.nan

roux.iloc[1] = np.nan

This time, the series retains its Int64 datatype and doesn’t get cast to float. A couple of other, and probably better, ways to do this would be

roux.iloc[1] = None
roux.iloc[1] = pd.NA
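
If you want to check that the dtype really survives, here is a small self-contained sketch. It uses pd.NA for the assignment, which is just one of the options above.

```python
import pandas as pd

# Nullable integer Series (note the capital "I" in Int64).
roux = pd.Series([1, 2, 3], dtype="Int64")
roux.iloc[1] = pd.NA

print(roux.dtype)              # Int64
print(roux.isna().tolist())    # [False, True, False]
print(roux.iloc[1] is pd.NA)   # True
```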

Now let’s build a Series of strings, set the 2nd element to None and set the 3rd element to np.nan.

gumbo = pd.Series(['a', 'b', 'c'])
gumbo.iloc[1] = None
gumbo.iloc[2] = np.nan

If we print the Series, you’ll notice that this time pandas doesn’t convert anything. None and NaN are both stored as-is, and the dtype stays object.

gumbo
## 0       a
## 1    None
## 2     NaN
## dtype: object

That’s because a series of strings is a series of objects, and a series of objects is really just a NumPy array of pointers that can point to anything in memory.
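
To see those "pointers to anything" for yourself, here is a rough sketch. It passes dtype=object explicitly, which is an assumption of this example and not something the lecture code does, so the result doesn't depend on how a particular pandas version infers string data.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly (an assumption of this sketch).
gumbo = pd.Series(["a", None, np.nan], dtype=object)

# Each slot holds an arbitrary Python object.
print([type(v).__name__ for v in gumbo])  # ['str', 'NoneType', 'float']

# pd.isna() treats both None and NaN as missing.
print(pd.isna(gumbo).tolist())  # [False, True, True]
```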

Of course, in pandas 1.0.0, there’s a new experimental string datatype that makes everything I just said somewhat wrong or outdated. Now you can do stuff like this.

gumbo = pd.Series(['a', 'b', 'c'], dtype='string')
gumbo.iloc[1] = None
gumbo.iloc[2] = np.nan

print(gumbo)
## 0       a
## 1    <NA>
## 2    <NA>
## dtype: string
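
Here is a quick sketch of what that buys you, assuming the missing values are passed straight to the constructor instead of assigned afterwards. Both None and np.nan end up flagged as missing, so the usual helpers treat them the same way.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are treated as missing by the string dtype.
gumbo = pd.Series(["a", None, np.nan], dtype="string")

print(gumbo.isna().tolist())    # [False, True, True]
print(gumbo.dropna().tolist())  # ['a']
```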

In any case, pandas provides two helper functions for identifying NaN values. If you have a Series x with some NaN values, and then you check x == np.nan, you’ll get back a series of all False values. That’s because NaN follows the IEEE 754 floating point rules, under which nan == nan returns False.

x = pd.Series([1.0, np.nan, 3.0, np.nan])
x == np.nan
## 0    False
## 1    False
## 2    False
## 3    False
## dtype: bool
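
You can see the same behavior on plain scalars, no Series required. This is just a small illustration of NaN never comparing equal to anything, including itself.

```python
import math
import numpy as np

print(np.nan == np.nan)              # False
print(np.nan != np.nan)              # True
print(float("nan") == float("nan"))  # False

# Use a dedicated check instead of ==.
print(math.isnan(np.nan))            # True
```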

If you want to pick out NaN values from a Series, you should use the function pd.isna(), and if you want to pick out non-NaN values, use pd.notna().

pd.isna(x)
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool
pd.notna(x)
## 0     True
## 1    False
## 2     True
## 3    False
## dtype: bool
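
Since both functions return a boolean Series, you can plug them straight into boolean indexing. A minimal sketch (the names missing and present are just for illustration):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan, 3.0, np.nan])

missing = x[pd.isna(x)]    # rows where x is NaN
present = x[pd.notna(x)]   # rows where x is not NaN

print(missing.index.tolist())  # [1, 3]
print(present.tolist())        # [1.0, 3.0]
print(int(pd.isna(x).sum()))   # 2 (True counts as 1)
```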

If you want to replace NaN values with -1, you could do something like

x.loc[pd.isna(x)] = -1

and this works, but pandas provides a really convenient fillna() method that makes this even simpler.

x.fillna(-1)
## 0    1.0
## 1   -1.0
## 2    3.0
## 3   -1.0
## dtype: float64

Just remember that this returns a modified copy of x, so x doesn’t actually get changed here. If you did want to update x, you could do the same thing but set inplace=True.
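
Here is a small sketch that makes the copy behavior visible. It starts from a fresh x, and it also shows rebinding the name to the result, which is another common way to keep the change (the names filled and nans_before are made up for this example).

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan, 3.0, np.nan])

# fillna() returns a new Series; x itself is untouched.
filled = x.fillna(-1)
nans_before = int(x.isna().sum())

print(filled.tolist())  # [1.0, -1.0, 3.0, -1.0]
print(nans_before)      # 2

# Rebinding x to the result keeps the change.
x = x.fillna(-1)
print(int(x.isna().sum()))  # 0
```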

x.fillna(-1, inplace=True)

print(x)
## 0    1.0
## 1   -1.0
## 2    3.0
## 3   -1.0
## dtype: float64
