Python Pandas For Your Grandpa - 3.8 DataFrame Aggregation

In this section, we’ll see how you can aggregate the rows of a DataFrame to calculate summary statistics.

For example, suppose we have a DataFrame with two columns of floats, x and y, and we want to calculate the sum of each column.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [3.1, 5.5, 9.2, 1.7, 1.2, 8.3, 2.6],
    'y': [1.4, np.nan, 5.0, 5.8, 9.0, np.nan, 9.2]
##      x    y
## 0  3.1  1.4
## 1  5.5  NaN
## 2  9.2  5.0
## 3  1.7  5.8
## 4  1.2  9.0
## 5  8.3  NaN
## 6  2.6  9.2

To aggregate it, we just need to call df.agg() and then pass in the name of an aggregation function like ‘sum’.

## x    31.6
## y    30.4
## dtype: float64

or we could pass in a function callabe, like np.sum.

## x    31.6
## y    30.4
## dtype: float64

When you pass in a string like ‘foofunc’, Pandas first looks for a DataFrame method called ‘foofunc’. If it can’t find anything, then it checks for a NumPy function called ‘foofunc’. And if it still can’t find ‘foofunc’, it raises an error. If you defined your own function named ‘foofunc’ like this

def foofunc(x):
    return x.iloc[1]

you can use it with agg(), but you have to pass it in as a callable (not as a string).

## x    5.5
## y    NaN
## dtype: float64

Alright, so when we did df.agg('sum'), we got back a 2-element Series with the sum of each column. If we want to get the result back as a DataFrame, we could wrap the aggregation function inside a list, like

##         x     y
## sum  31.6  30.4

Using a list, we can aggregate the data using multiple functions. For example, to get the sum and mean of each column we could say

df.agg(['sum', 'mean'])
##               x      y
## sum   31.600000  30.40
## mean   4.514286   6.08

What if we wanted to calculate the sum and mean of column x, and the min and max of column y? In this case, we can use a dictionary of ‘column:list-of-functions’ pairs to tell Pandas exactly what functions to apply to each column. In this case,

df.agg({'x': ['sum', 'mean'], 'y': ['min', 'max']})
##               x    y
## sum   31.600000  NaN
## mean   4.514286  NaN
## min         NaN  1.4
## max         NaN  9.2

This might be obvious, but we should note that aggregation is intended to be used with a special class of functions that take in a Series of values and reduces them down to a single value, or at least fewer values than the input. These could be things like sum(), mean(), min(), or max(). By contrast, transformation functions like sort(), fillna(), and round() take in a Series of values and return another Series the same length as the input, so you shouldn’t use these functions to aggregate a DataFrame.

Looking at the docs for DataFrame.agg(), you’ll notice there’s an ‘axis’ argument. By default, it’s set to 0, so when we do somthing like df.agg('min'), Pandas calculates the min value across axis 0, in other words, aggregating the rows of df. If we wanted to, we could do df.agg('min', axis=1) to aggregate the columns, calculating the min value in each row.

df.agg('min', axis=1)
## 0    1.4
## 1    5.5
## 2    5.0
## 3    1.7
## 4    1.2
## 5    8.3
## 6    2.6
## dtype: float64

Also notice the *args and **kwargs parameters. So if you define some custom function like nth_value() that returns the nth value of a Series,

def nth_value(x, n):
    return x.iloc[n]

you could aggregate df, picking out the 3rd value in each column using

df.agg(nth_value, n=3)
## x    1.7
## y    5.8
## dtype: float64

By the way, for simple functions like this, you’ll often see people build them on the fly using lambda, like

df.agg(func = lambda x, n: x.iloc[n], n=3)
## x    1.7
## y    5.8
## dtype: float64

In case you’re wondering what aggregation functions exist in Pandas, a good place to start by browsing all the methods for DataFrame.

Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts