Python Pandas For Your Grandpa - 4.3 Categoricals

One of my favorite features of Pandas is its ability to represent categorical data using the Categorical type, much like the factor data type in R. In this video, we’ll see how Categoricals work, and when you should use them.

A Categorical takes on a limited, and usually fixed, number of possible values, called categories. To motivate this data structure, suppose we have four cars, and for each car we have its VIN number, color, and size classification:

  • VINs: (‘AX193Q43’, ‘Z11RTV201’, ‘WA4Q3371’, ‘QWP77491’)
  • colors: (‘red’, ‘blue’, ‘red’, ‘green’)
  • sizes: (‘standard’, ‘mini’, ‘standard’, ‘extended’)

Our goal is to store each set of data using Pandas. VIN numbers are unique identifiers for cars, so we don’t want to use the Categorical data type to store them: each VIN number belongs to exactly one car. In other words, VIN numbers aren’t categories, they’re unique identifiers. So, in this case, it’d be better to store them using a plain ole Series of strings.

import numpy as np
import pandas as pd

VINs = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')
print(VINs)
## 0     AX193Q43
## 1    Z11RTV201
## 2     WA4Q3371
## 3     QWP77491
## dtype: string
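One quick way to check whether a column is a poor fit for a Categorical is to compare its number of distinct values to its length. The sketch below is only an illustration of that rule of thumb; the variable names are made up for this example.

```python
import pandas as pd

vins = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')
colors = pd.Series(['red', 'blue', 'red', 'green'], dtype='string')

# Every VIN appears exactly once, so there is nothing to group into categories
print(vins.is_unique)
## True

# Colors repeat: 4 values, but only 3 distinct ones
print(colors.is_unique)
## False
print(colors.nunique())
## 3
```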

Colors, on the other hand, are a classic candidate for categorical data, since we have a limited set of colors and each color represents a group of cars. We can build a Categorical using pd.Categorical(), much like we’d build a Series using pd.Series().

colors = pd.Categorical(values = ['red', 'blue', 'red', 'green'])

If we print the Categorical, we can see the values and the unique categories: blue, green, red.

print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (3, object): ['blue', 'green', 'red']
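Under the hood, a Categorical stores the distinct categories once, plus an integer code per element that points into them. A minimal sketch of how to peek at both pieces through the .categories and .codes attributes:

```python
import pandas as pd

colors = pd.Categorical(values=['red', 'blue', 'red', 'green'])

# The distinct categories, stored once
print(list(colors.categories))
## ['blue', 'green', 'red']

# One integer per element, indexing into the categories
print(colors.codes)
## [2 0 2 1]

# Looking the codes up in the categories recovers the original values
print(list(colors.categories[colors.codes]))
## ['red', 'blue', 'red', 'green']
```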

Note that this is not a Series, but you could make it a Series just by wrapping it in pd.Series().

pd.Series(colors)
## 0      red
## 1     blue
## 2      red
## 3    green
## dtype: category
## Categories (3, object): ['blue', 'green', 'red']
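If you want a categorical Series from the start, you can also skip the intermediate Categorical and pass dtype='category', or convert an existing Series with astype('category'). The sketch below shows both, plus the .cat accessor that exposes the categories and codes on a Series. The variable names are just for illustration.

```python
import pandas as pd

# Build a categorical Series directly
color_series = pd.Series(['red', 'blue', 'red', 'green'], dtype='category')
print(list(color_series.cat.categories))
## ['blue', 'green', 'red']
print(color_series.cat.codes.tolist())
## [2, 0, 2, 1]

# Or convert an existing Series
plain = pd.Series(['red', 'blue', 'red', 'green'])
converted = plain.astype('category')
print(converted.dtype)
## category
```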

By default, when you build a Categorical, Pandas sets the categories to the unique, non-NaN values in the data. But if you want to set the categories explicitly, you can do so using the categories argument. This is especially useful if, for example, your website or dealership supports more colors than the cars you have in stock. So you could build a Categorical like the last one, with the additional categories black, orange, and yellow.

colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['black', 'blue', 'green', 'orange', 'red', 'yellow']
)
print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (6, object): ['black', 'blue', 'green', 'orange', 'red', 'yellow']
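Two consequences of setting the categories explicitly are worth seeing in action: categories with no matching values still show up (with a count of zero) in value_counts(), and any value that isn’t in the list of categories becomes NaN. A small sketch, where the 'purple' car is a made-up example:

```python
import pandas as pd

palette = ['black', 'blue', 'green', 'orange', 'red', 'yellow']
colors = pd.Categorical(values=['red', 'blue', 'red', 'green'], categories=palette)

# Unused categories are still counted, with a count of 0
counts = {color: int(n) for color, n in colors.value_counts().items()}
print(counts)
## {'black': 0, 'blue': 1, 'green': 1, 'orange': 0, 'red': 2, 'yellow': 0}

# A value outside the declared categories is stored as NaN (code -1)
with_unknown = pd.Categorical(values=['red', 'purple'], categories=palette)
print(with_unknown.isna().tolist())
## [False, True]
print(with_unknown.codes.tolist())
## [4, -1]
```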

The categories parameter is also useful because it lets you control the order in which categories are displayed, which can be handy for things like plots or reports. If you don’t specify the categories parameter, Pandas displays them in lexical order, which basically means A to Z. But maybe you want to report them in a different order, like bright to dark, in which case you can build the Categorical as

colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['yellow', 'orange', 'red', 'green', 'blue', 'black']
)
print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (6, object): ['yellow', 'orange', 'red', 'green', 'blue', 'black']
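The category order isn’t only cosmetic. Sorting a categorical Series follows the order of the categories rather than alphabetical order. A minimal sketch using the bright-to-dark ordering from above:

```python
import pandas as pd

colors = pd.Series(pd.Categorical(
    values=['red', 'blue', 'red', 'green'],
    categories=['yellow', 'orange', 'red', 'green', 'blue', 'black']
))

# Sorted by position in the categories list, not A to Z
print(colors.sort_values().tolist())
## ['red', 'red', 'green', 'blue']
```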

Now let’s talk about the car sizes. This is another good candidate for the Categorical data type. It’s a lot like colors but with one key difference: car sizes have an inherent order whereas colors don’t. Now, I did say that you could order the color categories to your liking for reporting or display purposes, but by no means did I imply that the colors themselves have an inherent order. Orange is not more or less than black, blue is not before or after yellow, and so on. By contrast, sizes have an inherent order, so when we build this Categorical, we’ll want to specify the categories in the correct order and set the ordered parameter to True.

sizes = pd.Categorical(
    values = ['standard', 'mini', 'standard', 'extended'],
    categories = ['mini', 'standard', 'extended'],
    ordered = True
)

Now when we print(sizes), you’ll notice the categories are reported with “<” symbols indicating that they have a meaningful order.

print(sizes)
## ['standard', 'mini', 'standard', 'extended']
## Categories (3, object): ['mini' < 'standard' < 'extended']

This is really cool because it means you can do things like compare sizes < 'extended' and get back a boolean array.

sizes < 'extended'
## array([ True,  True,  True, False])
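To see what ordered = True buys you, here’s a short sketch contrasting the ordered sizes with the unordered colors. Ordered Categoricals support min() and max(), while an unordered Categorical only supports equality checks and raises a TypeError on < or >.

```python
import pandas as pd

sizes = pd.Categorical(
    values=['standard', 'mini', 'standard', 'extended'],
    categories=['mini', 'standard', 'extended'],
    ordered=True
)
print(sizes.min(), sizes.max())
## mini extended

colors = pd.Categorical(values=['red', 'blue', 'red', 'green'])

# Equality works fine on an unordered Categorical
print((colors == 'red').tolist())
## [True, False, True, False]

# But an inequality comparison raises a TypeError
try:
    colors < 'red'
    unordered_comparison_failed = False
except TypeError:
    unordered_comparison_failed = True
print(unordered_comparison_failed)
## True
```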

Now it’s important to note that Categoricals don’t have .loc or .iloc accessors, so if you wanted to select the sizes less than ‘extended’, you’d have to do it using basic square bracket notation, like you would with a NumPy array.

sizes[sizes < 'extended']
## ['standard', 'mini', 'standard']
## Categories (3, object): ['mini' < 'standard' < 'extended']

Although, you could create a Series with a CategoricalIndex like

sizesSeries = pd.Series(
    data = [0,1,2,3],
    index = pd.CategoricalIndex(sizes)
)

In which case you could do

sizesSeries.loc['mini']
## 1
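Keep in mind that a CategoricalIndex can contain repeated labels. In this example ‘standard’ appears twice, so looking it up with .loc gives back a Series with both matching rows rather than a single value. A quick sketch (sizes_series is just this example’s name for the same Series):

```python
import pandas as pd

sizes = pd.Categorical(
    values=['standard', 'mini', 'standard', 'extended'],
    categories=['mini', 'standard', 'extended'],
    ordered=True
)
sizes_series = pd.Series(data=[0, 1, 2, 3], index=pd.CategoricalIndex(sizes))

# 'standard' labels two rows, so .loc returns both of them
matches = sizes_series.loc['standard']
print(matches.tolist())
## [0, 2]
```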

Another really cool benefit to using Categoricals is that you can one-hot-encode them using Pandas’s get_dummies() function. For example, if you call pd.get_dummies() on sizes, you get back a corresponding 4-row DataFrame of 0s and 1s where 1s indicate the size corresponding to each row. (As of Pandas 2.0, get_dummies() returns True/False values by default, so we pass dtype = int to get 0s and 1s.)

pd.get_dummies(sizes, prefix = 'size', dtype = int)
##    size_mini  size_standard  size_extended
## 0          0              1              0
## 1          1              0              0
## 2          0              1              0
## 3          0              0              1

If you’re not familiar with this format, it’s a really common input structure for a bunch of machine learning models.
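One nice side effect of one-hot-encoding a Categorical is that every declared category gets its own column, even if no row uses it. That keeps the column layout stable across datasets that share the same categories. Here’s a sketch using the six-color palette from earlier, with dtype=int passed explicitly so the result holds 0s and 1s regardless of Pandas version:

```python
import pandas as pd

colors = pd.Categorical(
    values=['red', 'blue', 'red', 'green'],
    categories=['black', 'blue', 'green', 'orange', 'red', 'yellow']
)
dummies = pd.get_dummies(colors, prefix='color', dtype=int)

# One column per category, including the unused ones
print(dummies.columns.tolist())
## ['color_black', 'color_blue', 'color_green', 'color_orange', 'color_red', 'color_yellow']

# No car is black, so that column is all zeros
print(dummies['color_black'].sum())
## 0

# Each row has exactly one 1
print(dummies.sum(axis=1).tolist())
## [1, 1, 1, 1]
```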

