Python Pandas For Your Grandpa - 4.9 Challenge: Session Groups
Setup
You run an ecommerce site called shoesfordogs.com. You want to do some analysis of your visitors, so you compile a DataFrame called hits
that represents each time a visitor hit some page on your site.
You suspect that the undocumented third-party tracking system on your website is buggy and sometimes splits one session into two or more session_id
s. You want to correct this behavior by creating a field called session_group_id
that stitches broken session_id
s together.
Two session, A & B, should belong to the same session group if
- They have the same visitor_id and
-
- Their hits overlap in time or
- The latest hit from A is within five minutes of the earliest hit from B, or vice-versa
Also associativity applies. So, if A is grouped with B, and B is grouped with C, then A should be grouped with C as well.
Create a column in hits
called session_group_id
that identifies which hits belong to the same session group.
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
generator = np.random.default_rng(90)
products = ['iev','pys','vae','dah','yck','axl','apx','evu','wqv','tfg','aur','rgy','kef','lzj','kiz','oma']
hits = pd.DataFrame({
'visitor_id':generator.choice(5, size=20, replace=True) + 1,
'session_id':generator.choice(4, size=20, replace=True),
'date_time':pd.to_datetime('2020-01-01') + pd.to_timedelta(generator.choice(60, size=20), unit='m'),
'page_url':[f'shoesfordogs.com/product/{x}' for x in generator.choice(products, size=20, replace=True)]
})
hits['session_id'] = hits.visitor_id * 100 + hits.session_id
print(hits)
## visitor_id session_id date_time page_url
## 0 4 400 2020-01-01 00:05:00 shoesfordogs.com/product/pys
## 1 2 200 2020-01-01 00:18:00 shoesfordogs.com/product/oma
## 2 1 102 2020-01-01 00:48:00 shoesfordogs.com/product/evu
## 3 4 403 2020-01-01 00:21:00 shoesfordogs.com/product/oma
## 4 2 201 2020-01-01 00:40:00 shoesfordogs.com/product/yck
## 5 3 302 2020-01-01 00:33:00 shoesfordogs.com/product/pys
## 6 2 203 2020-01-01 00:37:00 shoesfordogs.com/product/rgy
## 7 3 302 2020-01-01 00:54:00 shoesfordogs.com/product/tfg
## 8 3 302 2020-01-01 00:48:00 shoesfordogs.com/product/kef
## 9 4 402 2020-01-01 00:24:00 shoesfordogs.com/product/apx
## 10 3 300 2020-01-01 00:49:00 shoesfordogs.com/product/kef
## 11 1 101 2020-01-01 00:52:00 shoesfordogs.com/product/iev
## 12 3 302 2020-01-01 00:01:00 shoesfordogs.com/product/dah
## 13 4 403 2020-01-01 00:02:00 shoesfordogs.com/product/lzj
## 14 4 401 2020-01-01 00:42:00 shoesfordogs.com/product/evu
## 15 5 500 2020-01-01 00:39:00 shoesfordogs.com/product/apx
## 16 5 503 2020-01-01 00:31:00 shoesfordogs.com/product/dah
## 17 3 303 2020-01-01 00:01:00 shoesfordogs.com/product/lzj
## 18 2 200 2020-01-01 00:16:00 shoesfordogs.com/product/aur
## 19 1 100 2020-01-01 00:11:00 shoesfordogs.com/product/apx
Solution
hits.sort_values(by=['visitor_id', 'date_time'], inplace=True)
sessions = hits.groupby(['visitor_id', 'session_id']).agg(
date_time_min=('date_time', 'min'),
date_time_max=('date_time', 'max')
)
sessions.sort_values(['visitor_id', 'date_time_min'], inplace=True)
sessions['date_time_max_p5'] = sessions.date_time_max + pd.Timedelta(minutes=5)
sessions['date_time_max_p5_cummax'] = sessions.groupby('visitor_id')['date_time_max_p5'].cummax()
sessions['date_time_max_p5_cummax_prev'] = sessions.groupby('visitor_id')['date_time_max_p5_cummax'].transform(pd.Series.shift)
sessions['group_with_prev'] = sessions.date_time_min <= sessions.date_time_max_p5_cummax_prev
sessions['session_group_id'] = (~sessions.group_with_prev).cumsum()
hits = pd.merge(left=hits, right=sessions[['session_group_id']], on='session_id')
print(hits)
## visitor_id session_id date_time page_url session_group_id
## 0 1 100 2020-01-01 00:11:00 shoesfordogs.com/product/apx 1
## 1 1 102 2020-01-01 00:48:00 shoesfordogs.com/product/evu 2
## 2 1 101 2020-01-01 00:52:00 shoesfordogs.com/product/iev 2
## 3 2 200 2020-01-01 00:16:00 shoesfordogs.com/product/aur 3
## 4 2 200 2020-01-01 00:18:00 shoesfordogs.com/product/oma 3
## 5 2 203 2020-01-01 00:37:00 shoesfordogs.com/product/rgy 4
## 6 2 201 2020-01-01 00:40:00 shoesfordogs.com/product/yck 4
## 7 3 302 2020-01-01 00:01:00 shoesfordogs.com/product/dah 5
## 8 3 302 2020-01-01 00:33:00 shoesfordogs.com/product/pys 5
## 9 3 302 2020-01-01 00:48:00 shoesfordogs.com/product/kef 5
## 10 3 302 2020-01-01 00:54:00 shoesfordogs.com/product/tfg 5
## 11 3 303 2020-01-01 00:01:00 shoesfordogs.com/product/lzj 5
## 12 3 300 2020-01-01 00:49:00 shoesfordogs.com/product/kef 5
## 13 4 403 2020-01-01 00:02:00 shoesfordogs.com/product/lzj 6
## 14 4 403 2020-01-01 00:21:00 shoesfordogs.com/product/oma 6
## 15 4 400 2020-01-01 00:05:00 shoesfordogs.com/product/pys 6
## 16 4 402 2020-01-01 00:24:00 shoesfordogs.com/product/apx 6
## 17 4 401 2020-01-01 00:42:00 shoesfordogs.com/product/evu 7
## 18 5 503 2020-01-01 00:31:00 shoesfordogs.com/product/dah 8
## 19 5 500 2020-01-01 00:39:00 shoesfordogs.com/product/apx 9
Course Curriculum
- Introduction
1.1 Introduction - Series
2.1 Series Creation
2.2 Series Basic Indexing
2.3 Series Basic Operations
2.4 Series Boolean Indexing
2.5 Series Missing Values
2.6 Series Vectorization
2.7 Seriesapply()
2.8 Series View vs Copy
2.9 Challenge: Baby Names
2.10 Challenge: Bees Knees
2.11 Challenge: Car Shopping
2.12 Challenge: Price Gouging
2.13 Challenge: Fair Teams - DataFrame
3.1 DataFrame Creation
3.2 DataFrame To And From CSV
3.3 DataFrame Basic Indexing
3.4 DataFrame Basic Operations
3.5 DataFrameapply()
3.6 DataFrame View vs Copy
3.7 DataFramemerge()
3.8 DataFrame Aggregation
3.9 DataFramegroupby()
3.10 Challenge: Hobbies
3.11 Challenge: Party Time
3.12 Challenge: Vending Machines
3.13 Challenge: Cradle Robbers
3.14 Challenge: Pot Holes - Advanced
4.1 Strings
4.2 Dates And Times
4.3 Categoricals
4.4 MultiIndex
4.5 DataFrame Reshaping
4.6 Challenge: Class Transitions
4.7 Challenge: Rose Thorn
4.8 Challenge: Product Volumes
4.9 Challenge: Session Groups
4.10 Challenge: OB-GYM - Final Boss
5.1 Challenge: COVID Tracing
5.2 Challenge: Pickle
5.3 Challenge: TV Commercials
5.4 Challenge: Family IQ
5.5 Challenge: Concerts