Python Pandas For Your Grandpa - 3.9 DataFrame groupby()

In this section, we’ll see how you can use groupby() to partition a DataFrame into groups and subsequently aggregate or transform the data in each group.

To start, let’s build a DataFrame with four columns, A, B, C, and D.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()
print(df)
##      A      B    C     D
## 0  foo  False  2.1    50
## 1  bar   True  1.9  <NA>
## 2  foo  False  3.6    30
## 3  bar   True  4.0    90
## 4  bar   True  1.9    10
## 5  foo   True  7.8  <NA>
## 6  foo   True  2.8    10

By the way, convert_dtypes() is a handy way to make sure we’re using Pandas’ modern, nullable string and integer dtypes, which support missing values (<NA>). Without it, A would be type ‘object’ and D would be type ‘float64’.
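You can verify this by inspecting df.dtypes. A quick check (re-creating df so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

# A gets the nullable 'string' dtype and D gets nullable 'Int64',
# rather than 'object' and 'float64'
print(df.dtypes)
```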

Every “Group By” operation starts with a call to the groupby() method. Perhaps the most important parameter of groupby() is the ‘by’ argument, which tells Pandas how you want to split the DataFrame into groups. Usually, you’ll specify one or more column names here. For example, df.groupby(by='A') tells Pandas you want to partition the data based on unique values in column ‘A’.

df.groupby(by='A')
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc0103d8640>

In this case, Pandas partitions the data into two groups. (See video.)
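If you don't have the video handy, you can see the split by iterating over the GroupBy object, which yields (key, sub-DataFrame) pairs, one per group. A quick sketch (re-creating df so it runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

# Iterating a GroupBy yields (key, sub-DataFrame) pairs, one per group
for key, group in df.groupby(by='A'):
    print(key, list(group.index))
## bar [1, 3, 4]
## foo [0, 2, 5, 6]
```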

Alternatively, you could do `df.groupby(by=['A', 'B'])` to partition the rows based on the set of unique (A, B) pairs.

df.groupby(by=['A', 'B'])
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc0302aafd0>

In this case, Pandas partitions the data into three groups. (See video.)
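If you just want the group count without the video, the ngroups attribute reports it directly:

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

# Three unique (A, B) pairs exist:
# ('bar', True), ('foo', False), ('foo', True)
print(df.groupby(by=['A', 'B']).ngroups)
## 3
```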

The result of calling groupby() is a special type of object called a DataFrameGroupBy.

print(df.groupby(by='A'))
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc03003ef10>

I should note that you can also call groupby() on a Series in which case you’d get back a SeriesGroupBy object. DataFrameGroupBy and SeriesGroupBy are both extensions of a generic GroupBy class, so they behave very similarly.
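For example, grouping the Series df['C'] by the values in column A produces a SeriesGroupBy. A quick illustration (re-creating df so it runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

# Grouping a Series by another Series gives a SeriesGroupBy object
g = df['C'].groupby(df['A'])
print(type(g).__name__)
## SeriesGroupBy
```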

GroupBy objects have some special features and attributes like .groups which returns a dictionary of ‘group-key:list-of-input-rows’ pairs. For example, if we set groups_ab = df.groupby(['A', 'B'])

groups_ab = df.groupby(['A', 'B'])

Then groups_ab.groups tells us that rows 0 and 2 correspond to the key (‘foo’, False).

groups_ab.groups
## {('bar', True): [1, 3, 4], ('foo', False): [0, 2], ('foo', True): [5, 6]}
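A related method is get_group(), which pulls out the sub-DataFrame for a single key. For example, the key ('foo', False) gives back rows 0 and 2 (re-creating df so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

groups_ab = df.groupby(['A', 'B'])

# Rows 0 and 2 make up the ('foo', False) group
print(groups_ab.get_group(('foo', False)))
```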

Similarly, groups_ab.ngroup() tells us the numeric group id associated with each row of the input data.

groups_ab.ngroup()
## 0    1
## 1    0
## 2    1
## 3    0
## 4    0
## 5    2
## 6    2
## dtype: int64

Notice that these group ids follow the sorted order of the keys. For example, group 0 corresponds to the key (‘bar’, True) because alphabetically ‘bar’ comes before ‘foo’, even though in the original DataFrame the key (‘foo’, False) is the first one to occur. So, if you’d rather have the groups ordered by the first occurrence of each key, you can do so by setting sort=False inside the call to groupby(). For example, if we set groups_ab as df.groupby(by=['A', 'B'], sort=False) and then we call groups_ab.ngroup()

groups_ab = df.groupby(by=['A', 'B'], sort=False)
groups_ab.ngroup()
## 0    0
## 1    1
## 2    0
## 3    1
## 4    1
## 5    2
## 6    2
## dtype: int64

you can see group 0 occurs before group 1, which occurs before group 2.
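One place this ordering shows up is the size() method, which counts the rows in each group; with sort=False the counts come back in first-occurrence order. A quick sketch (re-creating df so it runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

groups_ab = df.groupby(by=['A', 'B'], sort=False)

# First-occurrence order: ('foo', False)=2, ('bar', True)=3, ('foo', True)=2
print(groups_ab.size())
```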

Another thing we can do is, instead of passing a list of column names into the ‘by’ argument of groupby(), we can pass in a Series, in which case the Series values determine the groups and the Series index determines which rows of the DataFrame map to each group. For example, if we wanted to group the data based on the whole part of the values in column C, we could say

wholes = df.C.apply(np.floor)
print(wholes)
## 0    2.0
## 1    1.0
## 2    3.0
## 3    4.0
## 4    1.0
## 5    7.0
## 6    2.0
## Name: C, dtype: Float64

which returns a Series. And then

df.groupby(wholes)
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc0204bba60>

In this case, the values in the Series get partitioned into five groups. Then the index labels of the Series tell Pandas how to map the rows of df into those five groups. Index 0 goes to group 1, index 1 goes to group 0, index 2 goes to group 2, and so on. Note that typically this would be written as a one-liner, like this

df.groupby(df.C.apply(np.floor))
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc010467d90>
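To make the grouping concrete, here's a sketch that follows it up with an aggregation, taking the mean of C within each whole-number group (re-creating df so it runs standalone):

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

# Group C by its whole part, then take the mean of C within each group
# e.g. 2.1 and 2.8 both fall in the 2.0 group, so its mean is 2.45
means = df.groupby(df.C.apply(np.floor))['C'].mean()
print(means)
```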

Partitioning data into groups isn’t all that useful until you actually do something with the groups. Usually, after you group some data, you follow it up by aggregating the groups or transforming them. Let’s start by looking at some groupby-aggregate methods.

We’ve already seen how to aggregate an entire DataFrame down to one row. Aggregating a GroupBy object is essentially the same thing, where the aggregation applies to each group of data and then the results are combined into a DataFrame or Series. The best way to get a feel for how this works is to do some examples.

Let’s start by grouping df by column ‘A’, and then aggregate the groups taking the sum of column ‘C’. Here’s one way we could do this.

df.groupby('A')['C'].sum()
## A
## bar     7.8
## foo    16.3
## Name: C, dtype: Float64

Let’s break this down. df.groupby('A') creates a DataFrameGroupBy object. Tacking on ['C'] isolates column C, returning a SeriesGroupBy object.

Now keep in mind, DataFrameGroupBy and SeriesGroupBy are both derived from a generic GroupBy class, so if we look at the documentation for GroupBy we can see that it includes a sum() method for taking group sums, which is what we do here. The result is a Series whose index labels are the group keys and whose values are the group sums.

Here’s a similar solution, where we wrap 'C' into a list.

df.groupby('A')[['C']].sum()
##         C
## A        
## bar   7.8
## foo  16.3

In this case, Pandas doesn’t convert the DataFrameGroupBy object to a SeriesGroupBy object when we pick out column ‘C’, and as a result, we get back a DataFrame of group sums instead of a Series of group sums. This is analogous to doing df['C'] which returns a Series and df[['C']] which returns a DataFrame.

We could also use this technique to calculate group sums for multiple columns. For example, df.groupby('A')[['C','D']].sum() returns a DataFrame with the sums of C and D for each group in A.

df.groupby('A')[['C', 'D']].sum()
##         C    D
## A             
## bar   7.8  100
## foo  16.3   90

We could also replace sum() here with agg() and pass in a method name or function callable.

df.groupby(by='A')[['C']].agg('sum')
##         C
## A        
## bar   7.8
## foo  16.3

Basically everything learned when we covered DataFrame aggregation can be applied here, which means you can even do things like this to get the sum of C and the mean and variance of D for each group in A.

df.groupby(by='A').agg({'C': np.sum, 'D': ['mean', 'var']})
##         C     D        
##       sum  mean     var
## A                      
## bar   7.8  50.0  3200.0
## foo  16.3  30.0   400.0

Note that the resulting DataFrame here has something called a MultiIndex for its columns. Similarly, if we did df.groupby(by=['A', 'B'])[['D']].mean() we’d get back a DataFrame with a row MultiIndex. We’ll cover MultiIndexes in a future section, but for now just know they exist.

df.groupby(by=['A', 'B'])[['D']].mean()
##               D
## A   B          
## bar True   50.0
## foo False  40.0
##     True   10.0

If we group df by B and D, taking the min and max of C, you’ll notice that Pandas drops the rows where D is NaN, since rows with a NaN group key are excluded by default.

df.groupby(by=['B', 'D']).agg({'C':['min', 'max']})
##             C     
##           min  max
## B     D           
## False 30  3.6  3.6
##       50  2.1  2.1
## True  10  1.9  2.8
##       90  4.0  4.0

Personally I think this is a really unfortunate default behavior, but you can turn it off by setting dropna=False within the call to groupby().

df.groupby(by=['B', 'D'], dropna=False).agg({'C':['min', 'max']})
##              C     
##            min  max
## B     D            
## False 30   3.6  3.6
##       50   2.1  2.1
## True  10   1.9  2.8
##       90   4.0  4.0
##       NaN  1.9  7.8

Another question that comes up a lot is, “How do you rename the output columns?” So, let’s say you wanted to do the same thing as before, but you want the output columns to be named ‘C_min’ and ‘C_max’. In this case, you can use named aggregation: pass keyword arguments into the agg() method whose values are (column, function) tuples, like this, where the keywords become the output column names.

df.groupby(by=['B', 'D'], dropna=False).agg(
    C_min=('C', 'min'),
    C_max=('C', np.max)
)
##            C_min  C_max
## B     D                
## False 30     3.6    3.6
##       50     2.1    2.1
## True  10     1.9    2.8
##       90     4.0    4.0
##       NaN    1.9    7.8

Now let’s look at some groupby-transform operations. For example, suppose we want to sort the values in column C, but only relative to each group in A. One way we can do this is

df.groupby('A')['C'].transform(lambda x: x.sort_values())
## 0    2.1
## 1    1.9
## 2    2.8
## 3    1.9
## 4    4.0
## 5    3.6
## 6    7.8
## Name: C, dtype: Float64

Let’s break it down. df.groupby('A')['C'] groups the data by column A and then isolates column C. Next comes the transform() bit. The transform() method takes one primary argument, which is the function you want to apply. In this case, sort_values() isn’t a built-in GroupBy method, so we have to pass it in as a function callable. Here we’ve done it using a lambda, a useful construct because you can use lambdas to make all sorts of custom functions on the fly. But we could actually just reference the sort_values() function directly from the Series class.

df.groupby('A')['C'].transform(pd.Series.sort_values)
## 0    2.1
## 1    1.9
## 2    2.8
## 3    1.9
## 4    4.0
## 5    3.6
## 6    7.8
## Name: C, dtype: Float64

The result of this operation is a new Series whose index matches the row index of the original DataFrame. This is useful, for example, if you wanted to tack this on as a new column of df or overwrite the existing values in column C. For example, if we add this as a new column called C_sorted_within_A, you can see how the new values compare to the old values.

df['C_sorted_within_A'] = df.groupby('A')['C'].transform(pd.Series.sort_values)
print(df)
##      A      B    C     D  C_sorted_within_A
## 0  foo  False  2.1    50                2.1
## 1  bar   True  1.9  <NA>                1.9
## 2  foo  False  3.6    30                2.8
## 3  bar   True  4.0    90                1.9
## 4  bar   True  1.9    10                4.0
## 5  foo   True  7.8  <NA>                3.6
## 6  foo   True  2.8    10                7.8

Now as you probably guessed, if we do the same exact thing but we wrap ‘C’ inside a list, we get back a DataFrame with essentially the same exact data.

df.groupby('A')[['C']].transform(pd.Series.sort_values)
##      C
## 0  2.1
## 1  1.9
## 2  2.8
## 3  1.9
## 4  4.0
## 5  3.6
## 6  7.8

Let’s look at another example where again we group by column A, but this time we calculate the mean of columns C and D.

df.groupby('A')[['C', 'D']].transform('mean')
##        C     D
## 0  4.075  30.0
## 1    2.6  50.0
## 2  4.075  30.0
## 3    2.6  50.0
## 4    2.6  50.0
## 5  4.075  30.0
## 6  4.075  30.0

Like before, df.groupby('A') creates a DataFrameGroupBy object. Then we pick out columns C and D, giving us a slightly different DataFrameGroupBy object. And then we call .transform() passing in ‘mean’, which we can do because mean is one of the built-in GroupBy methods. At this point you might expect the result to have fewer rows than the input, since mean is an aggregation function, but to the contrary, the result is a DataFrame with the same number of rows as df. This is the important thing to understand about the transform() function: the output rows will always match the input rows, in stark contrast to the agg() function, where the output row count is usually much less than the input row count. In this case, the mean of each group in A was broadcast to match the length of the original DataFrame, which is why you see a bunch of repeated values in the output.
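To see the contrast side by side, here's a sketch comparing agg() and transform() on the same operation, followed by a classic use of transform: centering each value on its group mean. (The column name C_demeaned is just for illustration; df is re-created so the snippet runs standalone.)

```python
import numpy as np
import pandas as pd

# Re-create the example DataFrame from above
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, np.nan, 30, 90, 10, np.nan, 10]
}).convert_dtypes()

agg_mean = df.groupby('A')['C'].agg('mean')        # one row per group
tf_mean = df.groupby('A')['C'].transform('mean')   # one row per input row
print(len(agg_mean), len(tf_mean))
## 2 7

# A classic use of transform: subtract each group's mean from its members
df['C_demeaned'] = df['C'] - tf_mean
```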

