Contents

Python Pandas For Your Grandpa - 4.4 MultiIndex

We’ve seen earlier that when you groupby() and aggregate() a DataFrame on multiple columns, you get back a new DataFrame with a MultiIndex. A MultiIndex, or hierarchical index, is exactly what it sounds like - it’s an index that uses multiple values in its keys. In this video, we’ll see how MultiIndexes can be useful and how to use them.

Let’s make an example dataset of (store, product) pairs. There are a few different ways to make a MultiIndex from scratch. In my opinion, the most intuitive way is to use the pd.MultiIndex.from_tuples() constructor, passing in a list of tuples. We’ll do that here.

import numpy as np
import pandas as pd

store_products = pd.DataFrame(
    data = {'Price': [35.25, 45.00, 23.50, 1.95, 29.99, 35.65]},
    index = pd.MultiIndex.from_tuples([
        ('super store', 'basketball'), ('super store', 'football'), ('super store', 'soccerball'),
        ('sports dorks', 'golfball'), ('sports dorks', 'basketball'), ('sports dorks', 'football')
    ], names=['store', 'product'])
)
print(store_products)
##                          Price
## store        product          
## super store  basketball  35.25
##              football    45.00
##              soccerball  23.50
## sports dorks golfball     1.95
##              basketball  29.99
##              football    35.65

Before we do anything, it’s important to understand that our MultiIndex has has two levels. Level 0 is the store-level and level 1 is the product-level. We could have more but in this example we have two.

Now, let’s say you just hate MultiIndexes and you don’t want to use them. You can use the reset_index() method with inplace=True to convert the index levels as columns and go back to using a plain ole RangeIndex.

store_products.reset_index(inplace=True)
print(store_products)
##           store     product  Price
## 0   super store  basketball  35.25
## 1   super store    football  45.00
## 2   super store  soccerball  23.50
## 3  sports dorks    golfball   1.95
## 4  sports dorks  basketball  29.99
## 5  sports dorks    football  35.65

Alright so that covers our tutorial on MultiIndexes I hope you enjoyed this lecture. (jk)

If you want revert back to our original DataFrame with a MultiIndex, you can use set_index() method passing in a list of column names to use as the MultiIndex.

store_products.set_index(['store', 'product'], inplace=True)
print(store_products.set_index)
## <bound method DataFrame.set_index of                          Price
## store        product          
## super store  basketball  35.25
##              football    45.00
##              soccerball  23.50
## sports dorks golfball     1.95
##              basketball  29.99
##              football    35.65>

Now let’s try selecting certain rows from our multi-indexed DataFrame. If you wanted to pick out rows the with keys (‘sports dorks’, ‘golfball’) and (‘super store’, ‘football’), you can use .loc, passing in a list of tuples.

store_products.loc[[('sports dorks', 'golfball'), ('super store', 'football')]]
##                        Price
## store        product        
## sports dorks golfball   1.95
## super store  football  45.00

If you wanted to pick out every row for ‘sports dorks’, you can use loc passing in the key ‘sports dorks’.

store_products.loc[['sports dorks']]
##                          Price
## store        product          
## sports dorks golfball     1.95
##              basketball  29.99
##              football    35.65

But what if you wanted to pick out all rows where the product is ‘football’? In this case you can’t use .loc, at least not easily. But there is a method call xs() which lets you specify a key like ‘football’ and tell pandas which level of the index to search on. So for example we can do

store_products.xs(key='football', level='product')
##               Price
## store              
## super store   45.00
## sports dorks  35.65

And if we wanted to retain both the store and product indexes in the result, we can set the drop_level argument equal to False.

store_products.xs(key='football', level='product', drop_level=False)
##                        Price
## store        product        
## super store  football  45.00
## sports dorks football  35.65

Now let’s take a look at a groupby-aggregate operation since this is a more common way MultIndexes arise in practice. We’ll start by building a DataFrame with 7 rows and 4 columns.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, 30, 30, 90, 10, 20, 10]
})
print(df)
##      A      B    C   D
## 0  foo  False  2.1  50
## 1  bar   True  1.9  30
## 2  foo  False  3.6  30
## 3  bar   True  4.0  90
## 4  bar   True  1.9  10
## 5  foo   True  7.8  20
## 6  foo   True  2.8  10

Then we’ll set stew equal to stew = df.groupby(by=['A', 'B']).agg({'C': ['sum'], 'D': ['sum', 'mean']})

stew = df.groupby(by=['A', 'B']).agg({'C': ['sum'], 'D': ['sum', 'mean']})
print(stew)
##               C    D           
##             sum  sum       mean
## A   B                          
## bar True    7.8  130  43.333333
## foo False   5.7   80  40.000000
##     True   10.6   30  15.000000

In this case, the row index is a MultiIndex.

stew.index
## MultiIndex([('bar',  True),
##             ('foo', False),
##             ('foo',  True)],
##            names=['A', 'B'])

and so is the column index.

stew.columns
## MultiIndex([('C',  'sum'),
##             ('D',  'sum'),
##             ('D', 'mean')],
##            )

If we wanted to select all columns representing a sum, we could use our handy xs() method, but this time specify axis=1 so Pandas knows to look for the key within the columns of stew. Also, in this case the column index levels don’t have names, so we’ll have to pass in a positional index into the level parameter. In other words, we need to tell Pandas to search for the key named ‘sum’ in the second level of the MultiIndex, i.e. level 1.

stew.xs(key='sum', axis=1, level=1, drop_level=False)
##               C    D
##             sum  sum
## A   B               
## bar True    7.8  130
## foo False   5.7   80
##     True   10.6   30

We could also do something like pick out columns for D by specifying key = ‘D’ and level = 0.

stew.xs(key='D', axis=1, level=0, drop_level=False)
##              D           
##            sum       mean
## A   B                    
## bar True   130  43.333333
## foo False   80  40.000000
##     True    30  15.000000

Now let’s suppose you wanted to flatten the columns from a MultiIndex to a regular index. There are lots of ways you can do this, but perhaps the easiest is to use the to_flat_index() method. For example, if we call stew.columns.to_flat_index(), we get back an Index of tuples.

stew.columns.to_flat_index()
## Index([('C', 'sum'), ('D', 'sum'), ('D', 'mean')], dtype='object')

We could take this a step further and use basic python string joining and list comprehension to convert those tuples into strings separated by underscores.

['_'.join(s) for s in stew.columns.to_flat_index()]
## ['C_sum', 'D_sum', 'D_mean']

Setting stew.columns equal to our list of strings flattens the index how we want it.

stew.columns = ['_'.join(s) for s in stew.columns.to_flat_index()]
print(stew)
##            C_sum  D_sum     D_mean
## A   B                             
## bar True     7.8    130  43.333333
## foo False    5.7     80  40.000000
##     True    10.6     30  15.000000

Alright, so we’ve looked at how to use MultiIndexes but we haven’t really answered the question “Why would you use a MultiIndex?”. This is a bit of a sore spot for me because I really care about performance and you’d expect that a MultiIndex would provide significantly better performance than storing your data in two or more naive columns of a DataFrame, but from my testing, it really doesn’t.

Now, there’s still benefits to using a MultiIndex. The big one is probably self-documenting data. When you have a DataFrame with a MultiIndex, right away you know something important about the structure of that data. Also, MultiIndexes play nicely with some other functions like stack() and unstack() which we’ll see in a future lecture. But for the most part, the benefits to using a MultiIndex are mostly superficial, at least for now, I think that’s a real missed opportunity for Pandas.


Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts

Additional Content

  1. Python NumPy For Your Grandma
  2. Neural Networks For Your Dog
  3. Introduction To Google Colab