# Sparse Matrix Storage Formats

Sparse matrices arise natrually in many problems. For example, if we want to predict the price of an item on craigslist using the post’s text, we could build a matrix where each row represents a craigslist post, each column represents a keyword {bad, boat, car, good, new, shoes, used}, and element $ (i,j) $ represents the number of times keyword $ j $ appears in post $ i $.

PostId | Post |
---|---|

post 123 | selling my boat |

post 225 | used car for sale. car is in good condition |

post 481 | getting rid of my truck |

post 992 | selling old shoes because shoes don't fit. good shoes! |

bad | boat | car | good | new | shoes | used | |
---|---|---|---|---|---|---|---|

post 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

post 225 | 0 | 0 | 2 | 1 | 0 | 0 | 1 |

post 481 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

post 992 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |

In practice, such a matrix can have millions of rows and columns, and storing every single element of that matrix is **expensive**. *Compressed matrix formats* take advantage of the fact that sparse matrices are comprised mostly of 0s, and so they only store the non-zero entries. In this post, we’ll look at common techniques for storing sparse matrices in a compressed format.

## Coordinate List (COO)

Perhaps the most obvious way to store a sparse matrix is in *coordinate list* format. In this format, we simply store a list of triplets (row, column value) for every non-zero entry. For example,

bad | boat | car | good | new | shoes | used | |
---|---|---|---|---|---|---|---|

post 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

post 225 | 0 | 0 | 2 | 1 | 0 | 0 | 1 |

post 481 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

post 992 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |

Row: | 1 | 2 | 2 | 2 | 4 | 4 |

Column: | 2 | 3 | 4 | 7 | 4 | 6 |

Value: | 1 | 2 | 1 | 1 | 1 | 3 |

Dimensions: 4 x 7 |

Note that we also need to store the dimensions of the matrix. If the last row or last column of the matrix is all 0s, we wouldn’t know it from the list of triplets alone.

## We Can Do Better

Most sparse matrices that occur in the wild have multiple non-zero entries on each row. This means our coordinate list format can be improved by a simple trick. Instead of storing the row index of every non-zero value, **we only need to store the number of non-zero entries per row**. Here’s what that looks like for our sample matrix

bad | boat | car | good | new | shoes | used | |
---|---|---|---|---|---|---|---|

post 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

post 225 | 0 | 0 | 2 | 1 | 0 | 0 | 1 |

post 481 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

post 992 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |

Non-0s By Row: | 1 | 3 | 0 | 2 | ||

Column: | 2 | 3 | 4 | 7 | 4 | 6 |

Value: | 1 | 2 | 1 | 1 | 1 | 3 |

Dimensions: 4 x 7 |

*Note that I’ve picked off the non-zero values of the original matrix in row-major order for convenience.*

Notice how this method stores less data than coordinate list format. This isn’t guaranteed to be true, but will usually be true in practice.

Now suppose we’re the computer storing the compressed matrix above. Our programmer says, “Hey, give me the 4th row of my matrix in uncompressed format, machine!”. In other words, he wants to get the vector (0,0,0,1,0,3,0). How might we do this? Here’s an algorithm for it…

- Determine the number of non-zero elements that appear above row 4. Do this by cummulatively summing the elements in the vector
*Non-0s By Row*up until the 4th index (i.e. indices 1-3). In this case we get 1 + 3 + 0 = 4 → there are 4 non-zero entries above row 4. - Determine the number of non-zero elements that appear in row 4. This value is given by the 4th element in the vector
*Non-0s By Row*- in this case, 2. - Since there are 4 non-zero entries above row 4, we “jump” to the 5th elements of
*Column*and*Value*. - The next 2 (
*Column*,*Value*) pairs tell us that the non-zero elements of the 4th row are (column 4, value 1) and (column 6, value 3). *Hand Wavy*. Return a vector of length 7 whose 4th element = 1, 6th element = 3, and all other elements = 0.

## Compressed Sparse Row (CSR)

Following our example above, suppose that our programmer asks for row 3 of the matrix. .. and then row 2. .. and then row 4 again. Every time we execute our algorithm for retrieving a row, we are cumulatively summing the elements of *Non-0s By Row*. Instead of doing this expensive task repeatedly, we can just replace *Non-0s By Row* with its cumulative sum instead.

bad | boat | car | good | new | shoes | used | |
---|---|---|---|---|---|---|---|

post 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

post 225 | 0 | 0 | 2 | 1 | 0 | 0 | 1 |

post 481 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

post 992 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |

Non-0s By Row Cmltv: | 1 | 4 | 4 | 6 | ||

Column: | 2 | 3 | 4 | 7 | 4 | 6 |

Value: | 1 | 2 | 1 | 1 | 1 | 3 |

Dimensions: 4 x 7 |

Assume our programmer asks for row 4 again. Now we can modify our row-retrieval algorithm as follows:

- Determine the number of non-zero elements that appear above row 4. This value is stored in the 3rd element of
*Non-0s By Row Cmltv*, in this case, 4. - Determine the number of non-zero elements that appear in row 4. This value is given by the difference of the 3rd and 4th elements of
*Non-0s By Row Cmltv*, in this case, 6 - 4 = 2.

*Steps 3, 4, and 5 are the same as above*

The only scenario where this breaks is, when the programmer asks for row 1 of the matrix, step 1 results in us checking a value in index 0 which doesn’t exist. To prevent us from having to check for this edge case, we can prepend the vector *Non-0s By Row Cmltv* with a 0.

Our final compressed sparse row matrix looks like this

Non-0s By Row Cmltv: | 0 | 1 | 4 | 4 | 6 | |

Column: | 2 | 3 | 4 | 7 | 4 | 6 |

Value: | 1 | 2 | 1 | 1 | 1 | 3 |

Dimensions: 4 x 7 |

This storage format is known as *Compressed Sparse Row (CSR)*.

## Compressed Sparse Column (CSC)

In machine learning, it’s much more common to retrieve columns of a matrix than it is to retrieve rows. If we have a square matrix in compressed sparse row format, we’ll always be able to reconstruct rows faster than we can reconstruct columns. Thus arises the need for *compressed sparse column (CSC)* format. It’s an obvious reflection of compressed sparse row format. Here’s what it looks like applied to our sample matrix.

bad | boat | car | good | new | shoes | used | |
---|---|---|---|---|---|---|---|

post 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

post 225 | 0 | 0 | 2 | 1 | 0 | 0 | 1 |

post 481 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

post 992 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |

Non-0 Vals By Col Cmltv: | 0 | 0 | 1 | 2 | 4 | 4 | 5 | 6 |

Row: | 1 | 2 | 2 | 4 | 4 | 2 | ||

Value: | 1 | 2 | 1 | 1 | 3 | 1 | ||

Dimensions: 4 x 7 |

*Here, we pick off the non-zero values of the original matrix in column-major order*

## Sparse Matrices In Practice

Okay, but now what? Check out my article, Sparse Matrix Construction And Use In R.