Sparse Matrix Storage Formats

Coordinate List (COO)

We Can Do Better

Compressed Sparse Row (CSR)

Compressed Sparse Column (CSC)

Sparse Matrices In Practice

2019-04-12 1232 words 6 minutes

Contents

Sparse matrices arise natrually in many problems. For example, if we want to predict the price of an item on craigslist using the post’s text, we could build a matrix where each row represents a craigslist post, each column represents a keyword {bad, boat, car, good, new, shoes, used}, and element $ (i,j) $ represents the number of times keyword $ j $ appears in post $ i $.

PostId	Post
post 123	selling my boat
post 225	used car for sale. car is in good condition
post 481	getting rid of my truck
post 992	selling old shoes because shoes don't fit. good shoes!

	boat	car	good	shoes	used
post 123	1	0	0	0	0
post 225	0	2	1	0	1
post 481	0	0	0	0	0
post 992	0	0	1	3	0

In practice, such a matrix can have millions of rows and columns, and storing every single element of that matrix is expensive. Compressed matrix formats take advantage of the fact that sparse matrices are comprised mostly of 0s, and so they only store the non-zero entries. In this post, we’ll look at common techniques for storing sparse matrices in a compressed format.

Perhaps the most obvious way to store a sparse matrix is in coordinate list format. In this format, we simply store a list of triplets (row, column value) for every non-zero entry. For example,

	boat	car	good	shoes	used
post 123	1	0	0	0	0
post 225	0	2	1	0	1
post 481	0	0	0	0	0
post 992	0	0	1	3	0

Note that we also need to store the dimensions of the matrix. If the last row or last column of the matrix is all 0s, we wouldn’t know it from the list of triplets alone.

Most sparse matrices that occur in the wild have multiple non-zero entries on each row. This means our coordinate list format can be improved by a simple trick. Instead of storing the row index of every non-zero value, we only need to store the number of non-zero entries per row. Here’s what that looks like for our sample matrix

	boat	car	good	shoes	used
post 123	1	0	0	0	0
post 225	0	2	1	0	1
post 481	0	0	0	0	0
post 992	0	0	1	3	0

Note that I’ve picked off the non-zero values of the original matrix in row-major order for convenience.

Notice how this method stores less data than coordinate list format. This isn’t guaranteed to be true, but will usually be true in practice.

Now suppose we’re the computer storing the compressed matrix above. Our programmer says, “Hey, give me the 4th row of my matrix in uncompressed format, machine!”. In other words, he wants to get the vector (0,0,0,1,0,3,0). How might we do this? Here’s an algorithm for it…

Determine the number of non-zero elements that appear above row 4. Do this by cummulatively summing the elements in the vector Non-0s By Row up until the 4th index (i.e. indices 1-3). In this case we get 1 + 3 + 0 = 4 → there are 4 non-zero entries above row 4.
Determine the number of non-zero elements that appear in row 4. This value is given by the 4th element in the vector Non-0s By Row - in this case, 2.
Since there are 4 non-zero entries above row 4, we “jump” to the 5th elements of Column and Value.
The next 2 (Column, Value) pairs tell us that the non-zero elements of the 4th row are (column 4, value 1) and (column 6, value 3).
Hand Wavy. Return a vector of length 7 whose 4th element = 1, 6th element = 3, and all other elements = 0.

Following our example above, suppose that our programmer asks for row 3 of the matrix. .. and then row 2. .. and then row 4 again. Every time we execute our algorithm for retrieving a row, we are cumulatively summing the elements of Non-0s By Row. Instead of doing this expensive task repeatedly, we can just replace Non-0s By Row with its cumulative sum instead.

	boat	car	good	shoes	used
post 123	1	0	0	0	0
post 225	0	2	1	0	1
post 481	0	0	0	0	0
post 992	0	0	1	3	0

Assume our programmer asks for row 4 again. Now we can modify our row-retrieval algorithm as follows:

Determine the number of non-zero elements that appear above row 4. This value is stored in the 3rd element of Non-0s By Row Cmltv, in this case, 4.
Determine the number of non-zero elements that appear in row 4. This value is given by the difference of the 3rd and 4th elements of Non-0s By Row Cmltv, in this case, 6 - 4 = 2.

Steps 3, 4, and 5 are the same as above

The only scenario where this breaks is, when the programmer asks for row 1 of the matrix, step 1 results in us checking a value in index 0 which doesn’t exist. To prevent us from having to check for this edge case, we can prepend the vector Non-0s By Row Cmltv with a 0.

Our final compressed sparse row matrix looks like this

This storage format is known as Compressed Sparse Row (CSR).

In machine learning, it’s much more common to retrieve columns of a matrix than it is to retrieve rows. If we have a square matrix in compressed sparse row format, we’ll always be able to reconstruct rows faster than we can reconstruct columns. Thus arises the need for compressed sparse column (CSC) format. It’s an obvious reflection of compressed sparse row format. Here’s what it looks like applied to our sample matrix.

	boat	car	good	shoes	used
post 123	1	0	0	0	0
post 225	0	2	1	0	1
post 481	0	0	0	0	0
post 992	0	0	1	3	0

Non-0 Vals By Col Cmltv:	0	0	1	2	4	4	5	6
Row:	1	2	2	4	4	2
Value:	1	2	1	1	3	1
Dimensions: 4 x 7

Here, we pick off the non-zero values of the original matrix in column-major order

Dimensions: 4 x 7

Okay, but now what? Check out my article, Sparse Matrix Construction And Use In R.