Python Pandas For Your Grandpa - 3.8 DataFrame Aggregation
In this section, we’ll see how you can aggregate the rows of a DataFrame to calculate summary statistics.
For example, suppose we have a DataFrame with two columns of floats, x
and y
, and we want to calculate the sum of
each column.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'x': [3.1, 5.5, 9.2, 1.7, 1.2, 8.3, 2.6],
'y': [1.4, np.nan, 5.0, 5.8, 9.0, np.nan, 9.2]
})
print(df)
## x y
## 0 3.1 1.4
## 1 5.5 NaN
## 2 9.2 5.0
## 3 1.7 5.8
## 4 1.2 9.0
## 5 8.3 NaN
## 6 2.6 9.2
To aggregate it, we just need to call df.agg()
and then pass in the name of an aggregation function like ‘sum’.
df.agg('sum')
## x 31.6
## y 30.4
## dtype: float64
or we could pass in a function callabe, like np.sum
.
df.agg(np.sum)
## x 31.6
## y 30.4
## dtype: float64
When you pass in a string like ‘foofunc’, Pandas first looks for a DataFrame method called ‘foofunc’. If it can’t find anything, then it checks for a NumPy function called ‘foofunc’. And if it still can’t find ‘foofunc’, it raises an error. If you defined your own function named ‘foofunc’ like this
def foofunc(x):
return x.iloc[1]
you can use it with agg()
, but you have to pass it in as a callable (not as a string).
df.agg(foofunc)
## x 5.5
## y NaN
## dtype: float64
Alright, so when we did df.agg('sum')
, we got back a 2-element Series with the sum of each column. If we want to get the result back as a DataFrame, we could wrap the aggregation function inside a list, like
df.agg(['sum'])
## x y
## sum 31.6 30.4
Using a list, we can aggregate the data using multiple functions. For example, to get the sum and mean of each column we could say
df.agg(['sum', 'mean'])
## x y
## sum 31.600000 30.40
## mean 4.514286 6.08
What if we wanted to calculate the sum and mean of column x
, and the min and max of column y
? In this case, we can use a dictionary of ‘column:list-of-functions’ pairs to tell Pandas exactly what functions to apply to each column. In this case,
df.agg({'x': ['sum', 'mean'], 'y': ['min', 'max']})
## x y
## sum 31.600000 NaN
## mean 4.514286 NaN
## min NaN 1.4
## max NaN 9.2
This might be obvious, but we should note that aggregation is intended to be used with a special class of functions that take in a Series of values and reduces them down to a single value, or at least fewer values than the input. These could be things like sum()
, mean()
, min()
, or max()
. By contrast, transformation functions like sort()
, fillna()
, and round()
take in a Series of values and return another Series the same length as the input, so you shouldn’t use these functions to aggregate a DataFrame.
Looking at the docs for DataFrame.agg()
, you’ll notice there’s an ‘axis’ argument. By default, it’s set to 0,
so when we do somthing like df.agg('min')
, Pandas calculates the min value across axis 0, in other words, aggregating the rows of df
. If we wanted to, we could do df.agg('min', axis=1)
to aggregate the columns, calculating the min value in each row.
df.agg('min', axis=1)
## 0 1.4
## 1 5.5
## 2 5.0
## 3 1.7
## 4 1.2
## 5 8.3
## 6 2.6
## dtype: float64
Also notice the *args and **kwargs parameters. So if you define some custom function like nth_value()
that returns the nth value of a Series,
def nth_value(x, n):
return x.iloc[n]
you could aggregate df
, picking out the 3rd value in each column using
df.agg(nth_value, n=3)
## x 1.7
## y 5.8
## dtype: float64
By the way, for simple functions like this, you’ll often see people build them on the fly using lambda, like
df.agg(func = lambda x, n: x.iloc[n], n=3)
## x 1.7
## y 5.8
## dtype: float64
In case you’re wondering what aggregation functions exist in Pandas, a good place to start by browsing all the methods for DataFrame.
Course Curriculum
- Introduction
1.1 Introduction - Series
2.1 Series Creation
2.2 Series Basic Indexing
2.3 Series Basic Operations
2.4 Series Boolean Indexing
2.5 Series Missing Values
2.6 Series Vectorization
2.7 Seriesapply()
2.8 Series View vs Copy
2.9 Challenge: Baby Names
2.10 Challenge: Bees Knees
2.11 Challenge: Car Shopping
2.12 Challenge: Price Gouging
2.13 Challenge: Fair Teams - DataFrame
3.1 DataFrame Creation
3.2 DataFrame To And From CSV
3.3 DataFrame Basic Indexing
3.4 DataFrame Basic Operations
3.5 DataFrameapply()
3.6 DataFrame View vs Copy
3.7 DataFramemerge()
3.8 DataFrame Aggregation
3.9 DataFramegroupby()
3.10 Challenge: Hobbies
3.11 Challenge: Party Time
3.12 Challenge: Vending Machines
3.13 Challenge: Cradle Robbers
3.14 Challenge: Pot Holes - Advanced
4.1 Strings
4.2 Dates And Times
4.3 Categoricals
4.4 MultiIndex
4.5 DataFrame Reshaping
4.6 Challenge: Class Transitions
4.7 Challenge: Rose Thorn
4.8 Challenge: Product Volumes
4.9 Challenge: Session Groups
4.10 Challenge: OB-GYM - Final Boss
5.1 Challenge: COVID Tracing
5.2 Challenge: Pickle
5.3 Challenge: TV Commercials
5.4 Challenge: Family IQ
5.5 Challenge: Concerts