## Thursday, March 26, 2015

### Randomization and Probabilistic Techniques to scale up Machine Learning

Some time back I blogged about the possibilities that probabilistic techniques and randomization bring on to the paradigm of stream computing. Architectures based on big data not only relate to high volume storage, but also on low latency velocities, and this is exactly where stream computing has a role to play. I discussed a few data structures like bloom filters, count min sketch and hyperloglog and algorithms like Locality Sensitive Hashing that use probabilistic techniques to reduce the search and storage space while processing huge volumes of data.

Of late, I have been studying some of the theories behind machine learning algorithms and how they can be used in conjunction with the petabytes of data that we generate everyday. And the same thing strikes here - there are algorithms that can model the most perfect classifier. But you need randomization and probabilistic techniques to make them scale, even at the expense of a small amount of inaccuracy creeping within your model. In most cases we will see that the small inaccuracy that comes within your algorithm because of probabilistic bounds can be compensated by the ability to process more data within the specified computational timeline. This is true even for some of the basic algorithms like matrix multiplication that form the core of machine learning models.

The contents of this post is nothing original or new. It's just to share some of my thoughts in learning the usage of approximation techniques in building machine learning classifiers.

# Matrix Multiplication

Not only these specialized data structures or algorithms, randomization has been found to be quite effective for processing large data sets even for standard algorithms like matrix multiplication, polynomial identity verification or min cut identification from large graphs. In all such cases the best available algorithms have computational complexity which works well for a small data set but doesn't scale well enough with the volumes of data.

Consider a case where we are given 3 matrices, $A$, $B$ and $C$ and we need to verify if $AB = C$. The standard algorithm for matrix multiplication takes $\Theta(n^3)$ operations and there's also a sophisticated algorithm that works in $\Theta(n^{2.37})$ operations. Instead let's consider some randomization and choose a random vector $\bar{r} = (r_1, r_2, .. r_n) \in \{0, 1\}^n$. Now we can compute $AB\bar{r}$ by first computing $B\bar{r}$ and then $A(B\bar{r})$. And then we compute $C\bar{r}$. If we find $A(B\bar{r}) \neq C\bar{r}$, then $AB \neq C$. Otherwise we return $AB = C$. Instead of matrix-matrix multiplication our randomized algorithm uses matrix-vector multiplication, which can be done in $\Theta(n^2)$ operations the standard way.

Obviously a $\Theta(n^2)$ algorithm has a lower computational complexity than $\Theta(n^3)$ and scales better with larger data sets. Now the question is how accurate is this algorithm ? Is it guaranteed to give the correct answer every time we run it ? As with other probabilistic algorithms, there's a chance that our algorithm will return a wrong result. But as long as we can show that the chance is minimal and can be reduced by tuning some parameters, we should be fine.

It can be shown that if $AB \neq C$ and if $\bar{r}$ is chosen uniformly at random from $\{0, 1\}^n$ then $Pr(AB\bar{r} = C\bar{r}) <= 1/2$. But the trick is that we can run our randomized algorithm many times choosing $\bar{r}$ with replacement from $\{0, 1\}^n$. If for any of these trials we get $AB\bar{r} \neq C\bar{r}$, then we can conclude $AB \neq C$. And the probability that we get $AB\bar{r} = C\bar{r}$ for all $k$ trials despite $AB \neq C$ is $2^{-k}$. So for $100$ trials, the chance of error is $2^{-100}$, which we can see is really small. The detailed proof of this analysis can be found in the excellent book Probability and Computing by Michael Mitzenmacher & Eli Upfal.

Matrix multiplication is something that's used heavily especially in implementing machine learning classifiers. And if we can tolerate that little chance of error we get an algorithm with lower computational complexity that scales much better.

# Stochastic Gradient Descent

Consider another use case from core machine learning classifier design. Gradient descent is a standard way to minimize the empirical risk for measuring training set performance. The empirical risk is given by the following equation:
$$E_n(f) = (1/n)\sum_i l(f_w(x_i),y_i)$$
where $l$ is the loss function that measures the cost of predicting $f_w(x_i)$ from $n$ training examples where the actual answer is $y$ and $f_w(x)$ is the function parameterized by the weight vector $w$. Each iteration of gradient descent updates the weights $w$ on the basis of the gradient of $E_n(f_w)$ according to the following iterative step:

$$w_{t+1} = w_t - \gamma (1/n) \sum_i \nabla_w Q(z_i, w_t)$$
where $\gamma$ is an adequately chosen gain. Note that a single update step for the parameter runs through all the training examples and this gets repeated for every update step that you do before convergence. Compare this with Stochastic Gradient Descent (SGD) where the update step is given by the following:

$$w_{t+1} = w_t - \gamma \nabla_w Q(z_t, w_t)$$
Note instead of running through all examples and compute the exact gradient, SGD computes the gradient based on one randomly picked example $z_t$. So, SGD does a noisy approximation to the true gradient. But since it does not have to process all the examples in every iteration it scales better with a large data set. In this paper on Large Scale Machine Learning With Stochastic Gradient Descent, Leon Bottou classifies the error in building the classifier into 3 components:

• Approximation Error, which comes from the fact that the function $f$ that we choose is different from the optimal function $f^*$ and we approximate using a few examples

• Estimation Error, which comes from the fact that we have a finite number of training examples and would have gone away with infinite number of them

• Optimization Error, which comes from the fact that we are using an inferior algorithm to estimate the gradient

• With normal gradient descent we will have low optimization error since we run through all the training examples in every iteration to compute the gradient, which is clearly superior to the algorithm of SGD that does a noisy approximation. But SGD will report a lower approximation and estimation error since we will be able to process a larger dataset within the stipulated computation time. So it's a tradeoff of that we make using SGD, but clearly we scale better with larger data sets.

# Singular Value Decomposition

Singular Value Decomposition is a dimensionality reduction technique to unearth a smaller number of intrinsic concepts from a high dimensional matrix by removing unnecessary information. It does so by projecting the original matrix on to lower dimensions such that the reconstruction error is minimized. What this means is that given a matrix $A$ we decompose it into lower dimensional matrices by removing the lesser important information. And we do this in such a way that we can reconstruct a fairly close approximation to $A$ from those lower dimensional matrices. In theory SVD gives the best possible projection in terms of reconstruction error (optimal low rank approximation). But in practice it suffers from scalability problems with large data sets. It generates dense singular vectors even if the original matrix is a sparse one and hence is computationally inefficient, taking cubic time in the size of the data.

This can be addressed by another algorithm, the CUR algorithm which allows larger reconstruction error but lesser computation time. CUR decomposes the original matrix into ones of lesser dimensions but uses a randomized algorithm in selection of columns and rows based on their probability distribution. Now it can be shown that CUR reconstruction is just an additive term away from SVD reconstruction and it's a probabilistic bound subject to the condition that we select a specific range of columns and rows from $A$. The computational bound of CUR is of the order of the data set, which is much less than that of SVD (which as I mentioned earlier is cubic). This is yet another example where we apply randomization and probabilistic techniques to scale our algorithm better for larger data sets in exchange for a little amount of inaccuracy.

These are only a few instances of probabilistic bounds being applied to solve real world machine learning problems. There are a lots more. In fact I find that scalability of machine learning has a vey direct correlation with application of probabilistic techniques to the model. As I mentioned earlier the point of this post is to share some of my thoughts as I continue to learn techniques to scale up machine learning models. Feel free to share your ideas, thoughts and discussions in comments.