R – Introduction to Data.Table Joins

2014-07-23 783 words 4 minutes

R’s data.table package provides fast methods for handling large tables of data with simplistic syntax. The following is an introduction to basic join operations using data.table.

Suppose you have two data.tables – a table of insurance policies

policies <- data.table(
  PolicyNumber = c(1, 2, 3),
  EffectiveDate = as.Date(c("2012-1-1", "2012-1-1", "2012-7-1")),
  ExpirationDate = as.Date(c("2012-12-31", "2012-6-30", "2012-12-31"))
)
policies
##    PolicyNumber EffectiveDate ExpirationDate
## 1:            1    2012-01-01     2012-12-31
## 2:            2    2012-01-01     2012-06-30
## 3:            3    2012-07-01     2012-12-31

and a table of insurance claims.

claims <- data.table(
  ClaimNumber = c(123, 124, 125, 126),
  PolicyNumber = c(1, 1, 3, 4),
  ClaimCost = c(100, 2400, 350, 8000)
)
claims
##    ClaimNumber PolicyNumber ClaimCost
## 1:         123            1       100
## 2:         124            1      2400
## 3:         125            3       350
## 4:         126            4      8000

If you want to see the policy data for each claim, you need to do a join on the policy number. In SQL terms, this is a right/left outer join. That is, you want the result to include every row from the claims table, and only rows from the policy table that are associated with a claim in the claims table. Right outer joins are the default behavior of data.table’s join method.

First we need to set the key of each table based on the column we want to use to match the rows of the tables.

setkey(policies, "PolicyNumber")
setkey(claims, "PolicyNumber")

Note: Technically we only need to specify the key of the policies table for this join to work, but the join runs quicker when you key both tables.

Now do the join.

policies[claims]
##    PolicyNumber EffectiveDate ExpirationDate ClaimNumber ClaimCost
## 1:            1    2012-01-01     2012-12-31         123       100
## 2:            1    2012-01-01     2012-12-31         124      2400
## 3:            3    2012-07-01     2012-12-31         125       350
## 4:            4          <NA>           <NA>         126      8000

Since claim 126’s policy number, 4, was not in the policies table its effective and expiration dates are set as NA.

The important thing to remember when doing a basic X[Y] join using data.table is that the table inside of the brackets will have all of its rows in the resultant table. So, doing claims[policies] will return all policies and any matching claims.

claims[policies]
##    ClaimNumber PolicyNumber ClaimCost EffectiveDate ExpirationDate
## 1:         123            1       100    2012-01-01     2012-12-31
## 2:         124            1      2400    2012-01-01     2012-12-31
## 3:          NA            2        NA    2012-01-01     2012-06-30
## 4:         125            3       350    2012-07-01     2012-12-31

If you want to return only claims that have a matching policy (i.e. rows where the key is in both tables), set the nomatch argument of data.table to 0.

policies[claims, nomatch = 0]
##    PolicyNumber EffectiveDate ExpirationDate ClaimNumber ClaimCost
## 1:            1    2012-01-01     2012-12-31         123       100
## 2:            1    2012-01-01     2012-12-31         124      2400
## 3:            3    2012-07-01     2012-12-31         125       350

(This is equivalent to claims[policies, nomatch = 0] and is referred to as an inner join.)

If you want to return rows in the claims table which are not in the policies table, you can do

claims[!policies]
##    ClaimNumber PolicyNumber ClaimCost
## 1:         126            4      8000

Or, for policies with no claims…

policies[!claims]
##    PolicyNumber EffectiveDate ExpirationDate
## 1:            2    2012-01-01     2012-06-30

Now suppose we add a field, Company, to each table and set all the values to “ABC”.

claims[, Company := "ABC"]
policies[, Company := "ABC"]

What would the result be if we try to join policies and claims based on the new Company field?

setkey(claims, "Company")
setkey(policies, "Company")

# Uncomment and run this yourself
# claims[policies]

data.table throws an error in this situation because our resultant table has more rows than the combined number of rows in each of the tables being joined. This is a common sign of a mistake, but in our case it’s desired. In this situation we need to tell data.table that this isn’t a mistake by specifying allow.cartesian = TRUE.

claims[policies, allow.cartesian = TRUE]
##     ClaimNumber PolicyNumber ClaimCost Company i.PolicyNumber EffectiveDate
##  1:         123            1       100     ABC              1    2012-01-01
##  2:         124            1      2400     ABC              1    2012-01-01
##  3:         125            3       350     ABC              1    2012-01-01
##  4:         126            4      8000     ABC              1    2012-01-01
##  5:         123            1       100     ABC              2    2012-01-01
##  6:         124            1      2400     ABC              2    2012-01-01
##  7:         125            3       350     ABC              2    2012-01-01
##  8:         126            4      8000     ABC              2    2012-01-01
##  9:         123            1       100     ABC              3    2012-07-01
## 10:         124            1      2400     ABC              3    2012-07-01
## 11:         125            3       350     ABC              3    2012-07-01
## 12:         126            4      8000     ABC              3    2012-07-01
##     ExpirationDate
##  1:     2012-12-31
##  2:     2012-12-31
##  3:     2012-12-31
##  4:     2012-12-31
##  5:     2012-06-30
##  6:     2012-06-30
##  7:     2012-06-30
##  8:     2012-06-30
##  9:     2012-12-31
## 10:     2012-12-31
## 11:     2012-12-31
## 12:     2012-12-31

Next to come – rolling joins.