R – Data.Table Rolling Joins

Rolling joins are commonly used for analyzing data involving time. A simple example – suppose you have a table of product sales and a table of commercials. You might want to associate each product sale with the most recent commercial that aired prior to the sale. In this case, you cannot do a basic join between the sales table and the commercials table because each sale was not tracked with a CommercialId attribute. Instead, you need to generate a mapping between sales and commercials based on logic involving their dates. Here, we use the data.table package to illustrate.

sales <- data.table(
  SaleId = c("S1", "S2", "S3", "S4", "S5"),
  SaleDate = as.Date(c("2014-2-20", "2014-5-1", "2014-6-15", "2014-7-1", "2014-12-31"))
)
sales
##    SaleId   SaleDate
## 1:     S1 2014-02-20
## 2:     S2 2014-05-01
## 3:     S3 2014-06-15
## 4:     S4 2014-07-01
## 5:     S5 2014-12-31

commercials <- data.table(
  CommercialId = c("C1", "C2", "C3", "C4"),
  CommercialDate = as.Date(c("2014-1-1", "2014-4-1", "2014-7-1", "2014-9-15"))
)
commercials
##    CommercialId CommercialDate
## 1:           C1     2014-01-01
## 2:           C2     2014-04-01
## 3:           C3     2014-07-01
## 4:           C4     2014-09-15

Before we answer the problem stated above, let’s analyze the behavior of the default rolling join using data.table. (Note: If you do a rolling join involving a multi-column key, the roll column should be the last keyed column.)

setkey(sales, "SaleDate")
setkey(commercials, "CommercialDate")

# Default rolling join
sales[commercials, roll = TRUE]
##    SaleId   SaleDate CommercialId
## 1:   <NA> 2014-01-01           C1
## 2:     S1 2014-04-01           C2
## 3:     S4 2014-07-01           C3
## 4:     S4 2014-09-15           C4

What just occurred is best described graphically.

data.table is associating each commercial with the most recent sale prior to the commercial date (and including the commercial date). In other words, the most recent sale prior to each commercial is said to “roll forward”, and the SaleDate is mapped to the CommercialDate. Notice that sale S4 was the most recent sale prior to commercial C3 and C4, so S4 appears twice in the resultant table.

If you don’t want to lose the original SaleDate column, you may prefer to create a designated RollDate column for your rolling join. I’m going to use this convention going forward.

sales[, RollDate := SaleDate]
commercials[, RollDate := CommercialDate]

setkey(sales, "RollDate")
setkey(commercials, "RollDate")

sales[commercials, roll = TRUE]
##    SaleId   SaleDate   RollDate CommercialId CommercialDate
## 1:   <NA>       <NA> 2014-01-01           C1     2014-01-01
## 2:     S1 2014-02-20 2014-04-01           C2     2014-04-01
## 3:     S4 2014-07-01 2014-07-01           C3     2014-07-01
## 4:     S4 2014-07-01 2014-09-15           C4     2014-09-15

Now let’s see what happens when we join commercials to sales.

commercials[sales, roll = TRUE]
##    CommercialId CommercialDate   RollDate SaleId   SaleDate
## 1:           C1     2014-01-01 2014-02-20     S1 2014-02-20
## 2:           C2     2014-04-01 2014-05-01     S2 2014-05-01
## 3:           C2     2014-04-01 2014-06-15     S3 2014-06-15
## 4:           C3     2014-07-01 2014-07-01     S4 2014-07-01
## 5:           C4     2014-09-15 2014-12-31     S5 2014-12-31

What it’s doing…

Here data.table is associating each sale with the most recent commercial prior to (or including) the SaleDate. Did you notice this is the solution to our originally stated problem?

data.table also supports backward rolling joins by setting roll = -Inf.

sales[commercials, roll = -Inf]
##    SaleId   SaleDate   RollDate CommercialId CommercialDate
## 1:     S1 2014-02-20 2014-01-01           C1     2014-01-01
## 2:     S2 2014-05-01 2014-04-01           C2     2014-04-01
## 3:     S4 2014-07-01 2014-07-01           C3     2014-07-01
## 4:     S5 2014-12-31 2014-09-15           C4     2014-09-15
commercials[sales, roll = -Inf]
##    CommercialId CommercialDate   RollDate SaleId   SaleDate
## 1:           C2     2014-04-01 2014-02-20     S1 2014-02-20
## 2:           C3     2014-07-01 2014-05-01     S2 2014-05-01
## 3:           C3     2014-07-01 2014-06-15     S3 2014-06-15
## 4:           C3     2014-07-01 2014-07-01     S4 2014-07-01
## 5:         <NA>           <NA> 2014-12-31     S5 2014-12-31

The above tools should provide you with the ability to solve the majority of your rolling join needs. However, I encourage you to read the docs for data.table as there are some other cool features for rolling joins. Specifically, you can

  • provide a positive/negative value to the roll argument, which will perform a forward/backward rolling join, only associating dates within n days of each other
  • roll dates to their nearest neighbor as opposed to a one-way (forward or backward) rolling join
  • specify the behavior of the end points in a rolling join (see the rollends argument)

Challenge

Remove sale S1 from the sales table.

commercials[sales, roll = -Inf]
##    CommercialId CommercialDate   RollDate SaleId   SaleDate
## 1:           C2     2014-04-01 2014-02-20     S1 2014-02-20
## 2:           C3     2014-07-01 2014-05-01     S2 2014-05-01
## 3:           C3     2014-07-01 2014-06-15     S3 2014-06-15
## 4:           C3     2014-07-01 2014-07-01     S4 2014-07-01
## 5:         <NA>           <NA> 2014-12-31     S5 2014-12-31
sales <- sales[SaleId != "S1"]

Now build a table that has each commercial and all sales that occurred after (and on) the CommercialDate, but before the next commercial. In other words, try to build this table that will allow you to analyze the performance of each commercial.

##    CommercialId SaleId   SaleDate CommercialDate
## 1:           C1   <NA>       <NA>     2014-01-01
## 2:           C2     S2 2014-05-01     2014-04-01
## 3:           C2     S3 2014-06-15     2014-04-01
## 4:           C3     S4 2014-07-01     2014-07-01
## 5:           C4     S5 2014-12-31     2014-09-15