Contents

Python Pandas For Your Grandpa - 4.5 DataFrame Reshaping

In this section, we’ll see how to use the functions pivot(), pivot_table(), melt(), stack(), and unstack() to restructure the shape of a DataFrame.

Suppose you have DataFrame, df, like this

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'row': ['row0', 'row1', 'row2', 'row0', 'row1', 'row2'],
    'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
    'val': [44, 47, 64, 67, 67,  9]
})
print(df)
##     row   col  val
## 0  row0  col1   44
## 1  row1  col1   47
## 2  row2  col1   64
## 3  row0  col0   67
## 4  row1  col0   67
## 5  row2  col0    9

There’s a good chance you’ll want to reshape that data to look like this.

## col   col0  col1
## row             
## row0    67    44
## row1    67    47
## row2     9    64

Here, we converted the data from ‘long’ format to ‘wide’ format. Some people would say we converted the data from ‘stacked’ format to ‘unstacked’ format. For better or worse, there are a ton of ways to do this with Pandas. Perhaps the simplest method is to use the pivot() method of a DataFrame. In this case, we call df.pivot() telling Pandas which columns we want to use for the row index, column index, and values.

df.pivot(index='row', columns='col', values='val')
## col   col0  col1
## row             
## row0    67    44
## row1    67    47
## row2     9    64

Now suppose df had slightly different data, and looked like this.

df = pd.DataFrame({
    'row': ['row0', 'row0', 'row2', 'row2', 'row1', 'row1'],
    'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
    'val': [44, 47, 64, 67, 67, 9]
})
print(df)
##     row   col  val
## 0  row0  col1   44
## 1  row0  col1   47
## 2  row2  col1   64
## 3  row2  col0   67
## 4  row1  col0   67
## 5  row1  col0    9

What do you think the same call to df.pivot() would return in this case?

df.pivot(index='row', columns='col', values='val')  # ERROR

Trick question - we actually get an error here because there are a couple instances where multiple values in the input map to the same exact position in the output and we haven’t told Pandas how to deal with that. This is where the more general pivot_table() method comes in.

pivot_table() works just like pivot(), but you can specify an aggregation function to tell Pandas how to deal with multiple values mapping to the same place. To make this more clear, suppose we call df.pivot_table(), setting index='row', columns='col', and values='val', just like before, but here we also specify aggfunc=list.

pivotted = df.pivot_table(
    index='row',
    columns='col',
    values='val',
    aggfunc=list
)
print(pivotted)
## col      col0      col1
## row                    
## row0      NaN  [44, 47]
## row1  [67, 9]       NaN
## row2     [67]      [64]

Here it’s pretty obvious to see that 67 and 9 mapped to the same location, because the had the same row and col values. aggfunc tells Pandas how to deal with those values, in this case wrapping them into a list. Also, note that the combinations (row0, col0) and (row1, col1) didn’t exist in df, so Pandas fills those entries with NaN by default.

Now, you’re probably not interested in pivoting data just to build lists of values. More interestingly, you can use aggregation functions like mean(), sum(), size(), count(), or a combination of all of them. This works just like doing normal DataFrame aggregation. So, for example, if you wanted to get the count and sum of elements mapping to cells in the output table, you could do

df.pivot_table(
    index='row',
    columns='col',
    values='val',
    aggfunc=['count', 'sum']
)
##      count        sum      
## col   col0 col1  col0  col1
## row                        
## row0   NaN  2.0   NaN  91.0
## row1   2.0  NaN  76.0   NaN
## row2   1.0  1.0  67.0  64.0

As you could probably guess, Pandas supports this kind of pivot operation using multiple columns for the row index, column index, and values. So if you have input like this

toomuch = pd.DataFrame({
    'row_major': ['A', 'A', 'B', 'A', 'B', 'B'],
    'row_minor': ['x', 'x', 'y', 'y', 'z', 'z'],
    'col_major': ['MAMMAL', 'MAMMAL', 'MAMMAL', 'FISH', 'FISH', 'FISH'],
    'col_minor': ['dog', 'cat', 'dog', 'tuna', 'tuna', 'shark'],
    'val0': [44, 47, 64, 67, 67,  9],
    'val1': [91, 52, 86, 83, 79, 92]
})
print(toomuch)
##   row_major row_minor col_major col_minor  val0  val1
## 0         A         x    MAMMAL       dog    44    91
## 1         A         x    MAMMAL       cat    47    52
## 2         B         y    MAMMAL       dog    64    86
## 3         A         y      FISH      tuna    67    83
## 4         B         z      FISH      tuna    67    79
## 5         B         z      FISH     shark     9    92

you can do crazy stuff like this.

toomuch.pivot_table(
    index=['row_major', 'row_minor'],
    columns=['col_major', 'col_minor'],
    values=['val0', 'val1'],
    aggfunc=['count', 'sum']
)
##                     count                                           sum                                            
##                      val0                   val1                   val0                     val1                   
## col_major            FISH      MAMMAL       FISH      MAMMAL       FISH       MAMMAL        FISH       MAMMAL      
## col_minor           shark tuna    cat  dog shark tuna    cat  dog shark  tuna    cat   dog shark  tuna    cat   dog
## row_major row_minor                                                                                                
## A         x           NaN  NaN    1.0  1.0   NaN  NaN    1.0  1.0   NaN   NaN   47.0  44.0   NaN   NaN   52.0  91.0
##           y           NaN  1.0    NaN  NaN   NaN  1.0    NaN  NaN   NaN  67.0    NaN   NaN   NaN  83.0    NaN   NaN
## B         y           NaN  NaN    NaN  1.0   NaN  NaN    NaN  1.0   NaN   NaN    NaN  64.0   NaN   NaN    NaN  86.0
##           z           1.0  1.0    NaN  NaN   1.0  1.0    NaN  NaN   9.0  67.0    NaN   NaN  92.0  79.0    NaN   NaN

By the way, my column index in this DataFrame has 4 levels which is pretty nuts!

So far, we’ve been looking at ways to reshape a DataFrame from long to wide format, but what if we want to do the opposite - go from wide to long format; i.e. unstacked to stacked format? For example consider this DataFrame that shows two students’ scores on four tests.

wide = pd.DataFrame({
    'test': [1, 2, 3, 4],
    'john': [95, 81, 47, 99],
    'patty': [90, 85, 93, 97]
})
print(wide)
##    test  john  patty
## 0     1    95     90
## 1     2    81     85
## 2     3    47     93
## 3     4    99     97

If we wanted to reshape that DataFrame to look like this

##    test student  score
## 0     1    john     95
## 1     2    john     81
## 2     3    john     47
## 3     4    john     99
## 4     1   patty     90
## 5     2   patty     85
## 6     3   patty     93
## 7     4   patty     97

The tool for the job is melt(). Here’s what that looks like.

wide.melt(
    id_vars='test',
    value_vars=['john', 'patty'],
    var_name='student',
    value_name='score'
)
##    test student  score
## 0     1    john     95
## 1     2    john     81
## 2     3    john     47
## 3     4    john     99
## 4     1   patty     90
## 5     2   patty     85
## 6     3   patty     93
## 7     4   patty     97

The most important parameter here is value_vars which should be a list of the columns you want to stack. In this case, ‘john’ and ‘patty’. In fact, that’s literally all you need to provide for this function to work, although the output isn’t very friendly without other information.

wide.melt(value_vars=['john', 'patty'])
##   variable  value
## 0     john     95
## 1     john     81
## 2     john     47
## 3     john     99
## 4    patty     90
## 5    patty     85
## 6    patty     93
## 7    patty     97

var_name and value_name tell Pandas how to name the new ‘variable’ and ‘value’ columns, and id_vars tells Pandas what other columns from the original DataFrame you want to keep in the result.

Two other tools for reshaping DataFrames are the stack() and unstack() methods. For example if you have a DataFrame of daily stock prices like

df = pd.DataFrame({
    'ABC': [105.30, 107.17, 101.52],
    'DEF': [40.29, 40.97, 51.00],
    'GHI': [70.05, 64.13, 64.88]
    },
    index = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'])
)
print(df)
##                ABC    DEF    GHI
## 2020-01-01  105.30  40.29  70.05
## 2020-01-02  107.17  40.97  64.13
## 2020-01-03  101.52  51.00  64.88

And then you call df.stack(), Pandas reshapes the DataFrame into a Series, literally stacking the values of ABC, DEF, and GHI within each original row-index, resulting in a 2-level multi row index.

df.stack()
## 2020-01-01  ABC    105.30
##             DEF     40.29
##             GHI     70.05
## 2020-01-02  ABC    107.17
##             DEF     40.97
##             GHI     64.13
## 2020-01-03  ABC    101.52
##             DEF     51.00
##             GHI     64.88
## dtype: float64

Things get a little trickier when df has a MultiIndex column. Consider this DataFrame.

pets = pd.DataFrame(
    data=[[497, 1056, 2047, 3595], [352, 922, 3800, 1048], [217, 2017, 3571, 4521]],
    index=pd.Index(['AL', 'MS', 'LA'], name='state'),
    columns=pd.MultiIndex.from_tuples([('cat', 'persian'), ('cat', 'calico'), ('dog', 'corgie'), ('dog', 'lab')],
                                      names=['type', 'subtype'])
)
print(pets)
## type        cat           dog      
## subtype persian calico corgie   lab
## state                              
## AL          497   1056   2047  3595
## MS          352    922   3800  1048
## LA          217   2017   3571  4521

So, stack()'s most important parameter is called level, and it tells Pandas which level or levels from the column index you want to stack into the row index. Let’s see some examples, keeping in mind that our pets DataFrame has two levels in its column index: type with values ‘dog’ and ‘cat’ and subtype with values ‘persian’, ‘calico’, ‘corgie’, and ‘lab’.

If we call pets.stack() with level = 'type', you can see how the first level, type, gets extracted from the column index and inserted into the row index, and the data re-aligns itself accordingly.

pets.stack(level = 'type')
## subtype     calico  corgie     lab  persian
## state type                                 
## AL    cat   1056.0     NaN     NaN    497.0
##       dog      NaN  2047.0  3595.0      NaN
## MS    cat    922.0     NaN     NaN    352.0
##       dog      NaN  3800.0  1048.0      NaN
## LA    cat   2017.0     NaN     NaN    217.0
##       dog      NaN  3571.0  4521.0      NaN

Pretty much the same thing happens when you call stack() with level = 'subtype'.

pets.stack(level = 'subtype')
## type              cat     dog
## state subtype                
## AL    calico   1056.0     NaN
##       corgie      NaN  2047.0
##       lab         NaN  3595.0
##       persian   497.0     NaN
## MS    calico    922.0     NaN
##       corgie      NaN  3800.0
##       lab         NaN  1048.0
##       persian   352.0     NaN
## LA    calico   2017.0     NaN
##       corgie      NaN  3571.0
##       lab         NaN  4521.0
##       persian   217.0     NaN

Note that by default, stack() is called with level = -1 which automatically stacks the last column level into the row index. So in this case, pets.stack() with no parameters returns the same thing as pets.stack(level='subtype').

pets.stack()  # like pets.stack(level='subtype')
## type              cat     dog
## state subtype                
## AL    calico   1056.0     NaN
##       corgie      NaN  2047.0
##       lab         NaN  3595.0
##       persian   497.0     NaN
## MS    calico    922.0     NaN
##       corgie      NaN  3800.0
##       lab         NaN  1048.0
##       persian   352.0     NaN
## LA    calico   2017.0     NaN
##       corgie      NaN  3571.0
##       lab         NaN  4521.0
##       persian   217.0     NaN

You can also call pets.stack() with levels set to a list like ['type', 'subtype'] in which case both column levels are stacked into the row index and the result is a Series, not a DataFrame.

pets.stack(level = ['type', 'subtype'])
## state  type  subtype
## AL     cat   calico     1056.0
##              persian     497.0
##        dog   corgie     2047.0
##              lab        3595.0
## MS     cat   calico      922.0
##              persian     352.0
##        dog   corgie     3800.0
##              lab        1048.0
## LA     cat   calico     2017.0
##              persian     217.0
##        dog   corgie     3571.0
##              lab        4521.0
## dtype: float64

Just as Pandas give us a stack() method for stacking column levels into row levels, it gives us an unstack() method for unstacking row levels into column levels.

So, where we have pets.stack(level = 'type'), we can do almost the inverse operation with .unstack(level='type'), although in this case ‘type’ becomes the second level of the column index, and all the NaNs created by stack() get retained as columns with unstack().

pets.stack(level = 'type').unstack(level = 'type')
## subtype  calico     corgie         lab         persian    
## type        cat dog    cat     dog cat     dog     cat dog
## state                                                     
## AL       1056.0 NaN    NaN  2047.0 NaN  3595.0   497.0 NaN
## LA       2017.0 NaN    NaN  3571.0 NaN  4521.0   217.0 NaN
## MS        922.0 NaN    NaN  3800.0 NaN  1048.0   352.0 NaN

Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts