Python Pandas For Your Grandpa - 4.3 Categoricals
One of my favorite features of Pandas is its ability to represent categorical data using the Categorical type, much like the factor data type in R. In this video, we’ll see how Categoricals work, and when you should use them.
A Categorical takes on a limited, and usually fixed, number of possible values, i.e. categories. To motivate this data structure, suppose we have four cars, and for each car we have its VIN number, color, and size classification:
- VINs: (‘AX193Q43’, ‘Z11RTV201’, ‘WA4Q3371’, ‘QWP77491’)
- colors: (‘red’, ‘blue’, ‘red’, ‘green’)
- sizes: (‘standard’, ‘mini’, ‘standard’, ‘extended’)
Our goal is to store each set of data using Pandas. VIN numbers are unique identifiers for cars, so we don’t want to use the Categorical data type to store them, because each VIN number belongs to a single car. In other words, VIN numbers aren’t categories, they’re unique identifiers. So, in this case, it’d be better to store them using a plain ole Series of strings.
import numpy as np
import pandas as pd

VINs = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')
print(VINs)
## 0     AX193Q43
## 1    Z11RTV201
## 2     WA4Q3371
## 3     QWP77491
## dtype: string
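One rough way to check the "identifier vs. category" intuition is to compare the number of distinct values to the number of rows. This is only a sketch of a rule of thumb, not something Pandas does for you; the names `vin_ratio` and `color_ratio` are made up for illustration.

```python
import pandas as pd

vins = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')
colors = pd.Series(['red', 'blue', 'red', 'green'], dtype='string')

# Share of distinct values in each Series (hypothetical helper names).
# A ratio of 1.0 means every value is unique, which looks like an identifier;
# a lower ratio means values repeat, which looks more like categories.
vin_ratio = vins.nunique() / len(vins)
color_ratio = colors.nunique() / len(colors)

print(vin_ratio)
print(color_ratio)
```

With only four cars the difference is small, but on real data a low ratio is usually a good hint that a Categorical is worth considering.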
In the case of colors, this is a classic candidate for categorical data since we have a limited set of colors and each color represents a collection of different cars. We can build a Categorical easily using the pd.Categorical() function:
colors = pd.Categorical(values = ['red', 'blue', 'red', 'green'])
If we print the Categorical, we can see the values and the unique categories: blue, green, red.
print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (3, object): ['blue', 'green', 'red']
Note that this is not a Series, but you could make it a Series just by wrapping it in pd.Series().
pd.Series(colors)
## 0      red
## 1     blue
## 2      red
## 3    green
## dtype: category
## Categories (3, object): ['blue', 'green', 'red']
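If you're curious what a Categorical holds under the hood, here's a small sketch using its `categories` and `codes` attributes. The transcript doesn't cover these, so treat this as a side note: each value is stored as an integer code that points into the list of categories.

```python
import pandas as pd

colors = pd.Categorical(values=['red', 'blue', 'red', 'green'])

# The unique labels, as an Index.
print(colors.categories)

# One integer per value: the position of that value within colors.categories.
print(colors.codes)

# Rebuild the original labels from the codes to confirm the mapping.
rebuilt = [colors.categories[code] for code in colors.codes]
print(rebuilt)
```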
By default, when you build a Categorical, Pandas sets the categories as the unique, non-NaN values in the data. But if you want to set the categories explicitly, you can do so using the categories argument. This is especially useful if, for example, your website or dealership supports more colors than the cars you have in stock. So you could build a Categorical like the last one, with the additional categories: black, orange, and yellow.
colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['black', 'blue', 'green', 'orange', 'red', 'yellow']
)
print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (6, object): ['black', 'blue', 'green', 'orange', 'red', 'yellow']
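Here's a quick sketch of why explicit categories matter for the "more colors than cars in stock" case. It assumes you want a count per supported color, including colors with no cars. The 'purple' value is a made-up example of a label that isn't in the category list.

```python
import pandas as pd

colors = pd.Categorical(
    values=['red', 'blue', 'red', 'green'],
    categories=['black', 'blue', 'green', 'orange', 'red', 'yellow']
)

# Counting a categorical Series reports every category, even those with zero cars.
counts = pd.Series(colors).value_counts(sort=False)
print(counts)

# A value that is not one of the categories ('purple' here) becomes missing (NaN).
with_unknown = pd.Categorical(
    values=['red', 'purple'],
    categories=['black', 'blue', 'green', 'orange', 'red', 'yellow']
)
print(with_unknown.isna())
```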
The categories parameter is also useful because it lets you organize the order in which categories should be displayed, which could be handy for things like plots or reports. Without specifying the categories parameter, Pandas displays them in lexical order, which basically means A to Z, but maybe you want to report them in a different order, like bright to dark, in which case you can build the Categorical as
colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['yellow', 'orange', 'red', 'green', 'blue', 'black']
)
print(colors)
## ['red', 'blue', 'red', 'green']
## Categories (6, object): ['yellow', 'orange', 'red', 'green', 'blue', 'black']
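To see the display order in action, here's a minimal sketch that sorts a Series built from the bright-to-dark Categorical. Sorting follows the order of the categories rather than the alphabet. The name `report_order` is just for illustration.

```python
import pandas as pd

colors = pd.Series(pd.Categorical(
    values=['red', 'blue', 'red', 'green'],
    categories=['yellow', 'orange', 'red', 'green', 'blue', 'black']
))

# sort_values() uses the category order (bright to dark), not A to Z.
report_order = colors.sort_values()
print(report_order.tolist())
```

Alphabetical sorting would have put blue first; here red comes first because it appears earlier in the category list.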
Now let’s talk about the car sizes. This is another good candidate for using the Categorical datatype. It’s a lot like colors but with one key difference: car sizes have an inherent order whereas colors don’t. Now, I did say that you could order the color categories to your liking for reporting or display purposes, but by no means did I imply that the colors themselves have an inherent order. Orange is not more or less than black, blue is not before or after yellow, and so on. By contrast, sizes have an inherent order, so when we build this Categorical, we’ll want to specify the categories in the correct order and set the ordered parameter to True.
sizes = pd.Categorical(
    values = ['standard', 'mini', 'standard', 'extended'],
    categories = ['mini', 'standard', 'extended'],
    ordered = True
)
Now when we print(sizes), you’ll notice the categories are reported with “<” symbols indicating that they have a meaningful order.
print(sizes)
## ['standard', 'mini', 'standard', 'extended']
## Categories (3, object): ['mini' < 'standard' < 'extended']
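Because the sizes are ordered, a few order-aware operations become available. The sketch below uses `min()`, `max()`, and `sort_values()` on the Categorical; it's a small illustration rather than a full tour.

```python
import pandas as pd

sizes = pd.Categorical(
    values=['standard', 'mini', 'standard', 'extended'],
    categories=['mini', 'standard', 'extended'],
    ordered=True
)

# Smallest and largest size according to the category order.
smallest = sizes.min()
largest = sizes.max()
print(smallest, largest)

# Sorting also follows mini < standard < extended.
sorted_sizes = sizes.sort_values()
print(sorted_sizes)
```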
This is really cool because it means you can do things like compare sizes < 'extended' and get back a boolean array.
sizes < 'extended'
## array([ True,  True,  True, False])
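For contrast, here's a sketch of what happens if you try the same kind of comparison on an unordered Categorical like the colors. Pandas refuses with a TypeError, since "less than" has no meaning without an order, while equality checks still work. The `raised` flag is only there for illustration.

```python
import pandas as pd

colors = pd.Categorical(values=['red', 'blue', 'red', 'green'])

# Ordering comparisons on an unordered Categorical raise a TypeError.
raised = False
try:
    colors < 'red'
except TypeError as err:
    raised = True
    print('TypeError:', err)

# Equality comparisons are fine and return a boolean array.
is_red = colors == 'red'
print(is_red)
```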
Now it’s important to note that Categoricals don’t have .iloc accessors, so if you wanted to subset sizes as those less than ‘extended’, you’d have to do it using basic square bracket notation, like a NumPy array.
sizes[sizes < 'extended']
## ['standard', 'mini', 'standard']
## Categories (3, object): ['mini' < 'standard' < 'extended']
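If you do want the usual Series tools, one option is to wrap the Categorical in a Series first. This is a sketch of that approach; `sizes_as_series` and `small` are illustrative names.

```python
import pandas as pd

sizes = pd.Categorical(
    values=['standard', 'mini', 'standard', 'extended'],
    categories=['mini', 'standard', 'extended'],
    ordered=True
)

# Wrapping in a Series keeps the ordered category dtype and adds .iloc and .loc.
sizes_as_series = pd.Series(sizes)

# Boolean indexing works just like it does on the raw Categorical.
small = sizes_as_series[sizes_as_series < 'extended']
print(small)

# Positional access through .iloc.
second = sizes_as_series.iloc[1]
print(second)
```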
Although, you could create a Series with a CategoricalIndex like
sizesSeries = pd.Series(
    data = [0,1,2,3],
    index = pd.CategoricalIndex(sizes)
)
In which case you could do
sizesSeries.loc['mini']
## 1
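One thing to keep in mind with a CategoricalIndex is that labels can repeat. Here's a sketch of looking up 'standard', which appears twice in the index, so `.loc` hands back a Series with both matching rows instead of a single value.

```python
import pandas as pd

sizes = pd.Categorical(
    values=['standard', 'mini', 'standard', 'extended'],
    categories=['mini', 'standard', 'extended'],
    ordered=True
)
sizesSeries = pd.Series(data=[0, 1, 2, 3], index=pd.CategoricalIndex(sizes))

# 'standard' labels two rows, so the result is a two-element Series.
standard_rows = sizesSeries.loc['standard']
print(standard_rows)
```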
Another really cool benefit of using Categoricals is that you can one-hot encode them using Pandas’s get_dummies() function. For example, if you call pd.get_dummies() on sizes, you get back a corresponding 4-row DataFrame of 0s and 1s where 1s indicate the size corresponding to each row. Pandas 2.0 and later return True/False by default, so we pass dtype = int to get 0s and 1s.
pd.get_dummies(sizes, prefix = 'size', dtype = int)
##    size_mini  size_standard  size_extended
## 0          0              1              0
## 1          1              0              0
## 2          0              1              0
## 3          0              0              1
If you’re not familiar with this format, it’s a really common input structure for a bunch of machine learning models.
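One handy consequence of explicit categories is that the one-hot columns cover every category, not just the ones present in the data. The sketch below reuses the six-color Categorical from earlier; `dtype=int` is passed so the output is 0s and 1s regardless of Pandas version.

```python
import pandas as pd

colors = pd.Categorical(
    values=['red', 'blue', 'red', 'green'],
    categories=['black', 'blue', 'green', 'orange', 'red', 'yellow']
)

# One column per category, including colors with no cars in stock.
dummies = pd.get_dummies(colors, prefix='color', dtype=int)
print(dummies)

# Columns for unobserved colors are all zeros.
print(dummies['color_black'].sum())
```

This can matter when training and scoring data contain different subsets of the categories, since the column layout stays the same.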