# Python Pandas For Your Grandpa - 2.5 Series Missing Values


In this section, we’ll see how to use `NaN` to represent missing or invalid values in a Series. Let’s start by talking about how `NaN` worked prior to Pandas version 1.0.0. Back in the day, if you wanted to represent missing or invalid data, you had to use NumPy’s special floating point constant, `np.nan`. So, suppose you had a Pandas Series of integers like this

``````
import numpy as np
import pandas as pd

roux = pd.Series([1, 2, 3])
print(roux)
## 0    1
## 1    2
## 2    3
## dtype: int64
``````

And then you set the 2nd element to `np.nan`.

``````
roux.iloc[1] = np.nan
``````

The Series would get cast to floats, because `NaN` only exists in NumPy as a floating point constant.

``````
print(roux)
## 0    1.0
## 1    NaN
## 2    3.0
## dtype: float64
``````

By the time you’re reading this article, this may have changed, but at the moment, fixing this problem is still on the NumPy Roadmap.

So, in the past you couldn’t have a Pandas Series of integers with `NaN` values, because you couldn’t (and still can’t) have a NumPy array of integers with `NaN` values. If you wanted `NaN` values, your Series had to be a Series of floats.
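To see where that limitation comes from, here’s a quick illustration in plain NumPy (this is standard NumPy behavior, shown here just for context): putting `np.nan` into an otherwise-integer array forces the whole array to a float dtype.

```python
import numpy as np

# An all-integer list produces an array with an integer dtype
a = np.array([1, 2, 3])
print(a.dtype)   # an integer dtype, e.g. int64

# Adding np.nan forces the whole array to float,
# because NumPy's integer dtypes can't represent NaN
b = np.array([1, np.nan, 3])
print(b.dtype)   # float64
```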

Then Pandas released version 1.0.0 which included a Nullable integer datatype. It’s called “Int64” with a capital “I” to differentiate it from NumPy’s “int64” with a lower case “i”. So, let’s rebuild that Series, `roux`, this time specifying `dtype='Int64'`.

``````
roux = pd.Series([1, 2, 3], dtype='Int64')
``````

And, again, let’s set the 2nd element to `np.nan`.

``````
roux.iloc[1] = np.nan
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64
``````

This time, the Series retains its Int64 datatype and doesn’t get cast to float. In this case, a better way to set that value to `NaN` is to use `pd.NA`.

``````
roux.iloc[1] = pd.NA
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64
``````

You could also use the `None` keyword, but I’d probably opt for `pd.NA`.
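For example, assigning `None` to an element of an ‘Int64’ Series gets stored as `<NA>` just like `pd.NA` would:

```python
import pandas as pd

roux = pd.Series([1, 2, 3], dtype='Int64')
roux.iloc[1] = None   # None is converted to NA in a nullable Series
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64
```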

Alright, now let’s see how this works on a Series of strings. So back in the day, if you wanted to build a Series of strings, you would do something like

``````
gumbo = pd.Series(['a', 'b', 'c'])
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: object
``````

And then if you set the 2nd value to `np.nan` and the third value to `None`,

``````
gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None
``````

and then you print the Series, it actually looks like this worked pretty well

``````
print(gumbo)
## 0       a
## 1     NaN
## 2    None
## dtype: object
``````

..but should it have?

Notice that the Series has dtype object. What this means is, we basically have a Python list. Each element of the Series is actually just a pointer, i.e. a memory address, pointing to some location in your computer’s memory that stores the value of the element. This is bad because:

1. it’s inefficient for data access and
2. it doesn’t enforce a homogeneous datatype constraint on our Series
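To make the second point concrete, here’s a small sketch showing that an object Series happily accepts values of completely different types without complaint:

```python
import pandas as pd

# An object Series enforces no type constraint on its elements
mixed = pd.Series(['a', 'b', 'c'])   # dtype: object
mixed.iloc[1] = 3.14                 # a float sneaks in...
mixed.iloc[2] = True                 # ...and so does a bool, with no error
print(mixed.dtype)                   # object
```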

We’re supposed to have a Series of strings, but I set the second element to a floating point value. Pandas 1.0.0 fixed both of these issues in one fell swoop with the StringDtype extension type. So, today we’d rebuild that Series just like before, except we’d specify `dtype='string'`

``````
gumbo = pd.Series(['a', 'b', 'c'], dtype='string')
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: string
``````

And now if we set the 2nd value to `pd.NA` and the third value to `None`, our Series would end up looking like this

``````
gumbo.iloc[1] = pd.NA
gumbo.iloc[2] = None
print(gumbo)
## 0       a
## 1    <NA>
## 2    <NA>
## dtype: string
``````

If you’re a little confused by this, don’t worry: it’s not that important for using Pandas, and it’s something you’ll probably understand more over time.

In any case, Pandas provides two helper functions for identifying missing values. If you have a Series `x` with some `NaN` values,

``````
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64
``````

you can use `pd.isna()` to check whether each value is `NaN`

``````
pd.isna(x)
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool
``````

and `pd.notna()` to do the opposite.

``````
pd.notna(x)
## 0     True
## 1    False
## 2     True
## 3    False
## dtype: bool
``````

If you want to replace `NaN` values with -1, you could do something like

``````
x.loc[pd.isna(x)] = -1
``````

and this works, but it modifies `x` in place. Pandas also provides a really convenient `fillna()` method that makes this even simpler. Let’s rebuild `x` with its `NaN` values and then call `fillna()`. So instead you could just do

``````
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
x.fillna(-1)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64
``````

Note that `fillna()` returns a modified copy of `x`, so `x` doesn’t actually get changed here. You can see if I `print(x)`, it still contains the `NaN` values.

``````
print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64
``````

If you want the changes to stick, you can do the same thing and set the `inplace` parameter equal to True.

``````
x.fillna(-1, inplace=True)
print(x)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64
``````
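An equivalent pattern, which avoids `inplace` altogether (and which the Pandas docs generally steer you toward these days), is to reassign the result:

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
x = x.fillna(-1)   # fillna returns a new Series; rebind the name to keep it
print(x)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64
```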

It’s also important to understand how `NaN`s work with boolean indexing. Suppose you have a Series of values like this.

``````
goo = pd.Series([10, 20, 30, 40])
print(goo)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64
``````

and a corresponding Series of boolean values

``````
choo = pd.Series([True, False, pd.NA, True])
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: object
``````

what do you think `goo.loc[choo]` will return?

``````
goo.loc[choo]  # ValueError: Cannot mask with non-boolean array containing NA / NaN values
``````

In this case we get “ValueError: Cannot mask with non-boolean array containing NA / NaN values”. Notice that `choo` here is one of those pesky Series with dtype ‘object’. In other words, it’s a Series of pointers. To fix this, we can rebuild `choo`, specifying `dtype="boolean"`.

``````
choo = pd.Series([True, False, pd.NA, True], dtype="boolean")
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: boolean
``````

and now when we do `goo.loc[choo]` we get back 10 and 40, so the `NaN` value in `choo` is essentially ignored.

``````
goo.loc[choo]
## 0    10
## 3    40
## dtype: int64
``````
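This matters in practice because comparisons on a nullable Series propagate `NaN` into the mask automatically, giving you a ‘boolean’ dtype mask for free. A quick sketch:

```python
import pandas as pd

s = pd.Series([10, pd.NA, 30, 40], dtype='Int64')
mask = s > 15          # comparison on a nullable Series gives dtype 'boolean'
print(mask)
## 0    False
## 1     <NA>
## 2     True
## 3     True
## dtype: boolean

print(s.loc[mask])     # NA in the mask is treated as False
## 2    30
## 3    40
## dtype: Int64
```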

Keep in mind that the negation of `NaN` is still `NaN`, so if we do `goo.loc[~choo]`, we only get back one row, not the two rows excluded in the previous subset.

``````
goo.loc[~choo]
## 1    20
## dtype: int64
``````