Share on:

#### Sparse Matrix Construction And Use In R

##### April 14, 2019
machine learning R tutorial

In this post, we’ll cover the basics of constructing and using sparse matrices with R’s Matrix package. For background on what sparse matrices are and how they’re stored in compressed formats, check out my previous article Sparse Matrix Storage Formats.

# Sparse Matrix Construction

## Sparse Matrix From Base R Matrix

library(Matrix)

# Build a base R matrix from scratch, comprised of
#   0s with probability 0.80
#   1s with probability 0.10
#   2s with probability 0.10

set.seed(0)
nrows <- 4L
ncols <- 6L
vals <- sample(
x = c(0, 1, 2),
prob = c(0.8, 0.1, 0.1),
size = nrows * ncols,
replace = TRUE
)
matBaseR <- matrix(vals, nrow = nrows)
matBaseR
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    1    0    0    0    0
## [2,]    0    0    0    0    0    1
## [3,]    0    2    0    0    1    0
## [4,]    0    1    0    0    0    0

# Convert to sparse format
matSparse <- as(matBaseR, "sparseMatrix")
matSparse
## 4 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

Notice how the snippet as(matBaseR, "sparseMatrix") creates a dgCMatrix by default. This is a matrix in compressed sparse column (CSC) format. Instead of letting the Matrix package make this decision for you, I suggest being explicity about the storage format you want. Granted, this is usually going to be CSC.

# compressed sparse column format
matCSC <- as(matBaseR, "dgCMatrix")
matCSC
## 4 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

# compressed sparse row format
matCSR <- as(matBaseR, "dgRMatrix")
matCSR
## 4 x 6 sparse Matrix of class "dgRMatrix"
##
## [1,] 2 1 . . . .
## [2,] . . . . . 1
## [3,] . 2 . . 1 .
## [4,] . 1 . . . .

Before moving on, the following exercise is worth doing.

1. Modify the parameters defining the size of the matrix to nrows = 10^4 and ncols = 10^4
2. Re-build matBaseR, matCSC and matCSR
3. Compare their sizes in memory

## Sparse Matrix From (Row, Column) Pairs

We start by building a table of customers, products, and orders.

# Build a table of customers
custs <- data.frame(Customer = c("Bob", "Sue", "Ralf", "John", "Mary"))
custs
##   Customer
## 1      Bob
## 2      Sue
## 3     Ralf
## 4     John
## 5     Mary

# Build a table of products
prods <- data.frame(Product = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"))
prods
##   Product
## 1     AAA
## 2     BBB
## 3     CCC
## 4     DDD
## 5     EEE
## 6     FFF

# Build a table of orders: (customer, product, units) triplets
orders <- data.frame(
Customer = c("Bob", "Bob", "Sue", "Sue", "Bob", "Ralf", "John", "John"),
Product = c("AAA", "AAA", "FFF", "FFF", "EEE", "EEE", "AAA", "FFF")
)

# Insert RowIdx column, identifying the row index to assign each customer
orders$RowIdx <- match(orders$Customer, custs$Customer) # Insert ColIdx column, identifying the column index to assign each product orders$ColIdx <- match(orders$Product, prods$Product)

# Inspect
orders
##   Customer Product RowIdx ColIdx
## 1      Bob     AAA      1      1
## 2      Bob     AAA      1      1
## 3      Sue     FFF      2      6
## 4      Sue     FFF      2      6
## 5      Bob     EEE      1      5
## 6     Ralf     EEE      3      5
## 7     John     AAA      4      1
## 8     John     FFF      4      6

RowIdx and ColIdx in the orders table will be the (i,j) location of each record in the resulting matrix. Let’s start by constructing a basic sparse matrix where element (i,j) = TRUE if customer i purchased product j, and FALSE otherwise.

# Build sparse matrix indicating if customer i ever purchased project j
matSparse1 <- sparseMatrix(
i = orders$RowIdx, j = orders$ColIdx
)
matSparse1
## 4 x 6 sparse Matrix of class "ngCMatrix"
##
## [1,] | . . . | .
## [2,] . . . . . |
## [3,] . . . . | .
## [4,] | . . . . |

Notice that the resulting matrix has 4 rows, but we actually have 5 customers. The 5th customer, Mary, doesn’t have any orders. If we want to make sure every customer (and every product) appears in our matrix, we can set the dimensions explicitly. Furthermore, we can provide row names and column names to sparseMatrix().

matSparse2 <- sparseMatrix(
i = orders$RowIdx, j = orders$ColIdx,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse2
## 5 x 6 sparse Matrix of class "ngCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    |   .   .   .   |   .
## Sue    .   .   .   .   .   |
## Ralf   .   .   .   .   |   .
## John   |   .   .   .   .   |
## Mary   .   .   .   .   .   .

Instead of building a matrix where element (i,j) indicates if customer i purchased product j, let’s build one where element (i,j) shows the number of times customer i purchased product j.

matSparse3 <- sparseMatrix(
i = orders$RowIdx, j = orders$ColIdx,
x = 1L,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse3
## 5 x 6 sparse Matrix of class "dgCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    2   .   .   .   1   .
## Sue    .   .   .   .   .   2
## Ralf   .   .   .   .   1   .
## John   1   .   .   .   .   1
## Mary   .   .   .   .   .   .

Here, we use x = 1L to “add 1” to element (i,j) every time it is seen in the input. If you replaced x = 1L with x = 2L, we’d get a matrix that’s twice the one we just created.

## Sparse Matrix From (Row, Column, Value) Triplets

Now let’s assign a Units column to our orders table that indicates the number of units purchased for each order.

orders$Units <- c(1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L) orders ## Customer Product RowIdx ColIdx Units ## 1 Bob AAA 1 1 1 ## 2 Bob AAA 1 1 1 ## 3 Sue FFF 2 6 3 ## 4 Sue FFF 2 6 1 ## 5 Bob EEE 1 5 1 ## 6 Ralf EEE 3 5 2 ## 7 John AAA 4 1 1 ## 8 John FFF 4 6 1 We can build a sparse matrix where element (i,j) shows the number of units of product j purchased by customer i. matSparse4 <- sparseMatrix( i = orders$RowIdx,
j = orders$ColIdx, x = orders$Units,
dims = c(nrow(custs), nrow(prods)),
dimnames = list(custs$Customer, prods$Product)
)
matSparse4
## 5 x 6 sparse Matrix of class "dgCMatrix"
##      AAA BBB CCC DDD EEE FFF
## Bob    2   .   .   .   1   .
## Sue    .   .   .   .   .   4
## Ralf   .   .   .   .   2   .
## John   1   .   .   .   .   1
## Mary   .   .   .   .   .   .

In my experience, this method of constructing a sparse matrix is the most common.

# Sparse Matrix Deconstruction

## Base R Matrix From Sparse Matrix

# Make a compressed sparse matrix
matSparse <- as(matrix(data = c(1,0,0,0,1,0), nrow = 3), "dgCMatrix")
matSparse
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 .
## [2,] . 1
## [3,] . .

# Convert from compressed format to uncomressed, base R matrix
as.matrix(matSparse)
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## [3,]    0    0

## (Row, Column) Pairs From Sparse Matrix

# Make a compressed sparse matrix
matBaseR <- matrix(
data = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
nrow = 3L,
dimnames = list(Foo = c("A", "B", "C"), Bar = c("y", "z"))
)
matSparse <- as(matBaseR, "ngCMatrix")
matSparse
## 3 x 2 sparse Matrix of class "ngCMatrix"
##    Bar
## Foo y z
##   A | .
##   B . |
##   C . .

# Build (row, col) pairs from non-zero entries
df <- as.data.frame(summary(matSparse))
df$Foo <- rownames(matSparse)[df$i]
df$Col <- colnames(matSparse)[df$j]
df
##   i j Foo Col
## 1 1 1   A   y
## 2 2 2   B   z

## (Row, Column, Value) Triplets From Sparse Matrix

# Make a compressed sparse matrix
matBaseR <- matrix(
data = c(77, 0, 0, 0, 55, 0),
nrow = 3L,
dimnames = list(Foo = c("A", "B", "C"), Bar = c("y", "z"))
)
matSparse <- as(matBaseR, "dgCMatrix")
matSparse
## 3 x 2 sparse Matrix of class "dgCMatrix"
##    Bar
## Foo  y  z
##   A 77  .
##   B  . 55
##   C  .  .

# Build (row, col) pairs from non-zero entries
df <- as.data.frame(summary(matSparse))
df$Foo <- rownames(matSparse)[df$i]
df$Col <- colnames(matSparse)[df$j]
df
##   i j  x Foo Col
## 1 1 1 77   A   y
## 2 2 2 55   B   z

# Basic Operations on Sparse Matrices

All the normal operations you’d do on a matrix have the same behavior on Matrix’s matrix types as they do on a Base R matrix. The only caveat is that some operations will change the class type of your matrix.

## Multiplication and Division by a Scalar

# Make a compressed sparse matrix
matSparse <- as(matrix(data = c(1,0,0,0,3,0), nrow = 3), "dgCMatrix")
matSparse
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 .
## [2,] . 3
## [3,] . .

# Multiplication
matSparse * 10
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 10  .
## [2,]  . 30
## [3,]  .  .

# Division
matSparse / 10
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 0.1 .
## [2,] .   0.3
## [3,] .   .

## Addition and Subtraction by a Scalar

Addition and subtraction will change the class of a dgCMatrix to a dgeMatrix (unless you add/subtract 0). dgeMatrix is basically Matrix’s version of a normal, uncompressed matrix.

# Addition
matSparse + 10
## 3 x 2 Matrix of class "dgeMatrix"
##      [,1] [,2]
## [1,]   11   10
## [2,]   10   13
## [3,]   10   10

# Subtraction
matSparse - 10
## 3 x 2 Matrix of class "dgeMatrix"
##      [,1] [,2]
## [1,]   -9  -10
## [2,]  -10   -7
## [3,]  -10  -10

## Multiplication by a Vector

# Multiplication by vector
matSparse * c(-1, 1, 0)
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] -1 .
## [2,]  . 3
## [3,]  . .
matSparse * c(-1, 1)
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] -1  .
## [2,]  . -3
## [3,]  .  .

## Matrix Multiplication

# Matrix multiplication by vector
matSparse %*% c(-1, 1)
## 3 x 1 Matrix of class "dgeMatrix"
##      [,1]
## [1,]   -1
## [2,]    3
## [3,]    0

# Matrix multiplication by matrix
matSparse %*% t(matSparse)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 . .
## [2,] . 9 .
## [3,] . . .

## Combining Sparse Matrices

We can use rbind() to combine the rows of two sparse matrices and cbind() to combine their columns.

# Combine rows
rbind(matSparse, matSparse)
## 6 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 .
## [2,] . 3
## [3,] . .
## [4,] 1 .
## [5,] . 3
## [6,] . .

# Combine columns
cbind(matSparse, matSparse)
## 3 x 4 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 . 1 .
## [2,] . 3 . 3
## [3,] . . . .

# NA Values

It’s important to recognize the distinction between missing values and sparsity (a bunch of 0s). If element (i,j) of a matrix represents the number of times customer i purchased project j, an NA value could mean that customer i may have purchased product j, but due to a data issue, we’re not sure. This is distinctly different from customer i did not purchase product j. The Matrix package supports NA values and treats them just like any other non-zero number.

# Build a base R matric with NAs
matBaseR <- matrix(data = c(1,NA,1,0,0,NA), nrow = 3)
matBaseR
##      [,1] [,2]
## [1,]    1    0
## [2,]   NA    0
## [3,]    1   NA

# Convert to compressed sparse column format
matCSC <- as(matBaseR, "dgCMatrix")
matCSC
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,]  1  .
## [2,] NA  .
## [3,]  1 NA

# View the structure
str(matCSC)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:4] 0 1 2 2
##   ..@ p       : int [1:3] 0 3 4
##   ..@ Dim     : int [1:2] 3 2
##   ..@ Dimnames:List of 2
##   .. ..$: NULL ## .. ..$ : NULL
##   ..@ x       : num [1:4] 1 NA 1 NA
##   ..@ factors : list()

# Common Matrix Types

You may have noticed there are a number of different Matrix types. Here’s a list of some common ones.
Class Compression Element Type
ngeMatrix none logicals
dgeMatrix none real numbers
dgCMatrix compressed sparse column real numbers
dgRMatrix compressed sparse row real numbers
ngCMatrix compressed sparse column logicals
ngRMatrix compressed sparse row real numbers 