Python Pandas For Your Grandpa - 4.9 Challenge: Session Groups

Setup

You run an ecommerce site called shoesfordogs.com. You want to do some analysis of your visitors, so you compile a DataFrame called hits that represents each time a visitor hit some page on your site.

You suspect that the undocumented third-party tracking system on your website is buggy and sometimes splits one session into two or more session_ids. You want to correct this behavior by creating a field called session_group_id that stitches broken session_ids together.

Two sessions, A and B, should belong to the same session group if

  1. They have the same visitor_id, and
  2. Either
    1. their hits overlap in time, or
    2. the latest hit from A is within five minutes of the earliest hit from B, or vice versa.

Grouping is also transitive. So, if A is grouped with B, and B is grouped with C, then A should be grouped with C as well.
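To make the pairwise rule concrete, here's a minimal sketch (the helper same_group and the example timestamps are made up for illustration; they're not part of the challenge). Each session is reduced to its earliest and latest hit times, and two sessions of the same visitor belong together exactly when those windows touch after extending each window's end by the five-minute tolerance.

import pandas as pd

def same_group(a_min, a_max, b_min, b_max, tol=pd.Timedelta(minutes=5)):
    # Overlap check after extending each session's last hit by the tolerance;
    # this covers both "hits overlap in time" and "the last hit of one session
    # is within five minutes of the first hit of the other"
    return (a_min <= b_max + tol) and (b_min <= a_max + tol)

a = (pd.Timestamp('2020-01-01 00:02'), pd.Timestamp('2020-01-01 00:21'))  # session A's first and last hit
b = (pd.Timestamp('2020-01-01 00:24'), pd.Timestamp('2020-01-01 00:24'))  # session B's only hit
print(same_group(*a, *b))
## True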

Create a column in hits called session_group_id that identifies which hits belong to the same session group.

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

generator = np.random.default_rng(90)
products = ['iev','pys','vae','dah','yck','axl','apx','evu','wqv','tfg','aur','rgy','kef','lzj','kiz','oma']
hits = pd.DataFrame({
    'visitor_id':generator.choice(5, size=20, replace=True) + 1,
    'session_id':generator.choice(4, size=20, replace=True),
    'date_time':pd.to_datetime('2020-01-01') + pd.to_timedelta(generator.choice(60, size=20), unit='m'),
    'page_url':[f'shoesfordogs.com/product/{x}' for x in generator.choice(products, size=20, replace=True)]
})
hits['session_id'] = hits.visitor_id * 100 + hits.session_id
print(hits)
##     visitor_id  session_id           date_time                      page_url
## 0            4         400 2020-01-01 00:05:00  shoesfordogs.com/product/pys
## 1            2         200 2020-01-01 00:18:00  shoesfordogs.com/product/oma
## 2            1         102 2020-01-01 00:48:00  shoesfordogs.com/product/evu
## 3            4         403 2020-01-01 00:21:00  shoesfordogs.com/product/oma
## 4            2         201 2020-01-01 00:40:00  shoesfordogs.com/product/yck
## 5            3         302 2020-01-01 00:33:00  shoesfordogs.com/product/pys
## 6            2         203 2020-01-01 00:37:00  shoesfordogs.com/product/rgy
## 7            3         302 2020-01-01 00:54:00  shoesfordogs.com/product/tfg
## 8            3         302 2020-01-01 00:48:00  shoesfordogs.com/product/kef
## 9            4         402 2020-01-01 00:24:00  shoesfordogs.com/product/apx
## 10           3         300 2020-01-01 00:49:00  shoesfordogs.com/product/kef
## 11           1         101 2020-01-01 00:52:00  shoesfordogs.com/product/iev
## 12           3         302 2020-01-01 00:01:00  shoesfordogs.com/product/dah
## 13           4         403 2020-01-01 00:02:00  shoesfordogs.com/product/lzj
## 14           4         401 2020-01-01 00:42:00  shoesfordogs.com/product/evu
## 15           5         500 2020-01-01 00:39:00  shoesfordogs.com/product/apx
## 16           5         503 2020-01-01 00:31:00  shoesfordogs.com/product/dah
## 17           3         303 2020-01-01 00:01:00  shoesfordogs.com/product/lzj
## 18           2         200 2020-01-01 00:16:00  shoesfordogs.com/product/aur
## 19           1         100 2020-01-01 00:11:00  shoesfordogs.com/product/apx

Solution

# Order hits chronologically within each visitor
hits.sort_values(by=['visitor_id', 'date_time'], inplace=True)
# Collapse hits into one row per session with its earliest and latest hit times
sessions = hits.groupby(['visitor_id', 'session_id']).agg(
    date_time_min=('date_time', 'min'),
    date_time_max=('date_time', 'max')
)
sessions.sort_values(['visitor_id', 'date_time_min'], inplace=True)
# A session's "reach" is its last hit plus the five-minute tolerance
sessions['date_time_max_p5'] = sessions.date_time_max + pd.Timedelta(minutes=5)
# Running maximum reach of the sessions seen so far, per visitor
sessions['date_time_max_p5_cummax'] = sessions.groupby('visitor_id')['date_time_max_p5'].cummax()
# Shift down one row so each session sees the reach of the sessions before it
# (NaT for a visitor's first session)
sessions['date_time_max_p5_cummax_prev'] = sessions.groupby('visitor_id')['date_time_max_p5_cummax'].transform(pd.Series.shift)
# A session joins the previous group if it starts before that reach ends
sessions['group_with_prev'] = sessions.date_time_min <= sessions.date_time_max_p5_cummax_prev
# Every False starts a new group, so the cumulative sum of ~group_with_prev labels the groups
sessions['session_group_id'] = (~sessions.group_with_prev).cumsum()
# Attach the group id back onto hits, joining on session_id (an index level of sessions)
hits = pd.merge(left=hits, right=sessions[['session_group_id']], on='session_id')
print(hits)
##     visitor_id  session_id           date_time                      page_url  session_group_id
## 0            1         100 2020-01-01 00:11:00  shoesfordogs.com/product/apx                 1
## 1            1         102 2020-01-01 00:48:00  shoesfordogs.com/product/evu                 2
## 2            1         101 2020-01-01 00:52:00  shoesfordogs.com/product/iev                 2
## 3            2         200 2020-01-01 00:16:00  shoesfordogs.com/product/aur                 3
## 4            2         200 2020-01-01 00:18:00  shoesfordogs.com/product/oma                 3
## 5            2         203 2020-01-01 00:37:00  shoesfordogs.com/product/rgy                 4
## 6            2         201 2020-01-01 00:40:00  shoesfordogs.com/product/yck                 4
## 7            3         302 2020-01-01 00:01:00  shoesfordogs.com/product/dah                 5
## 8            3         302 2020-01-01 00:33:00  shoesfordogs.com/product/pys                 5
## 9            3         302 2020-01-01 00:48:00  shoesfordogs.com/product/kef                 5
## 10           3         302 2020-01-01 00:54:00  shoesfordogs.com/product/tfg                 5
## 11           3         303 2020-01-01 00:01:00  shoesfordogs.com/product/lzj                 5
## 12           3         300 2020-01-01 00:49:00  shoesfordogs.com/product/kef                 5
## 13           4         403 2020-01-01 00:02:00  shoesfordogs.com/product/lzj                 6
## 14           4         403 2020-01-01 00:21:00  shoesfordogs.com/product/oma                 6
## 15           4         400 2020-01-01 00:05:00  shoesfordogs.com/product/pys                 6
## 16           4         402 2020-01-01 00:24:00  shoesfordogs.com/product/apx                 6
## 17           4         401 2020-01-01 00:42:00  shoesfordogs.com/product/evu                 7
## 18           5         503 2020-01-01 00:31:00  shoesfordogs.com/product/dah                 8
## 19           5         500 2020-01-01 00:39:00  shoesfordogs.com/product/apx                 9
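
As a quick sanity check (not part of the original solution), you could confirm that each session_id was assigned exactly one session_group_id and that no session group spans multiple visitors.

print(hits.groupby('session_id')['session_group_id'].nunique().max())
## 1
print(hits.groupby('session_group_id')['visitor_id'].nunique().max())
## 1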

Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts