Ridge regression and lasso regression are two techniques for making ordinary least squares regression more robust against collinearity. Both algorithms minimize a cost function made up of two terms: the residual sum of squares (RSS), taken from ordinary least squares, and an additional regularization penalty. The second term is an L2 norm penalty in ridge regression, and an L1 norm penalty in lasso regression.

## Overview

Let's look at the equations. In ordinary least squares, we minimize the following cost function:
$$\text{Cost} = (y - X\beta)^T (y - X\beta)$$
This term is the RSS, residual sum of squares. In ridge regression we instead solve:
$$\text{Cost} = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$$
The $\lambda \beta^T \beta$ term is an L2 penalty: the sum of the squared coefficients, scaled by λ. In lasso regression we instead solve:
$$\text{Cost} = (y - X\beta)^T (y - X\beta) + \lambda |\beta|$$
The $\lambda |\beta|$ term is an L1 norm. At a higher level, the chief difference between the L1 and the L2 terms is that the L2 term is proportional to the square of the β values, while the L1 term is proportional to the absolute value of the values in β. This fundamental difference accounts for all of the difference between how lasso regression and ridge regression "work". L1-versus-L2 pops up elsewhere in machine learning as well, so it's important to understand what's going on here!

## Definition of a norm

Let's step back for a moment and consider the question: what is a norm? A norm is a function that is applied to a vector (like the vector β above) and maps it to a value in $[0, \infty)$. In machine learning, norms are useful because they express distances: this vector and that vector are so-and-so far apart, according to this-or-that norm.

Going a bit further, we define $\|x\|_p$ as a "p-norm". Given $x$, a vector with $n$ components, the p-norm (for $p \ge 1$) is defined as:
$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$
The simplest norm conceptually is Euclidean distance. This is what we typically think of as distance between two points in space:
$$\|x\|_2 = \sqrt{\sum_i x_i^2} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$$
Another common norm is taxicab distance, which is the 1-norm:
$$\|x\|_1 = \sum_i |x_i| = |x_1| + |x_2| + \ldots + |x_n|$$
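The p-norm definition can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `p_norm` and the example vector are made up for illustration.

```python
import numpy as np

def p_norm(x, p):
    """p-norm of a vector: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

# Illustrative vector (not from the original text)
x = np.array([3.0, -4.0])

euclidean = p_norm(x, 2)  # 2-norm: sqrt(3^2 + 4^2)
taxicab = p_norm(x, 1)    # 1-norm: |3| + |-4|

print(euclidean, taxicab)
```

For a 1-D array, `np.linalg.norm(x, 2)` and `np.linalg.norm(x, 1)` should give the same two values, so in practice there is no need for a hand-written helper.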
Taxicab distance is so called because it emulates moving between two points as though you are moving through the streets of Manhattan in a taxi cab. Instead of measuring the distance "as the crow flies", it measures the right-angle distance between two points.

[Figure: taxicab distance between two points]

This way of measuring distance is known as taxicab geometry.

## p-norms and regularization

Taxicab distance is the 1-norm, also known as the L1 norm. The L2 penalty is actually the 2-norm, Euclidean distance, squared. Hence, we can rewrite our cost equations as:
$$\text{Ridge Cost} = (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_2^2$$

$$\text{Lasso Cost} = (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1$$
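To make the two cost functions concrete, here is a minimal sketch that evaluates them for a fixed coefficient vector. It assumes NumPy, and it assumes the penalty carries the same λ weight (`lam`) as in the earlier equations; the toy data and the function names are made up for illustration.

```python
import numpy as np

def rss(X, y, beta):
    """Residual sum of squares: (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return float(residual @ residual)

def ridge_cost(X, y, beta, lam):
    """RSS plus lam times the squared 2-norm of beta."""
    return rss(X, y, beta) + lam * float(beta @ beta)

def lasso_cost(X, y, beta, lam):
    """RSS plus lam times the 1-norm of beta."""
    return rss(X, y, beta) + lam * float(np.sum(np.abs(beta)))

# Toy data (hypothetical values, chosen so the arithmetic is easy to follow)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.25])
lam = 2.0

print(rss(X, y, beta))              # residuals are [1, 1.5, 2]
print(ridge_cost(X, y, beta, lam))  # adds 2 * (0.25 + 0.0625)
print(lasso_cost(X, y, beta, lam))  # adds 2 * (0.5 + 0.25)
```

Both penalties grow with the size of the coefficients, but the ridge penalty grows with their squares while the lasso penalty grows with their absolute values.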
This process of adding a norm to our cost function is known as regularization. We can regularize a model for different underlying reasons and with different effects. In the case of ridge and lasso regression, both regularizers are built to address collinearity and model complexity; but as we saw in earlier notebooks, the way in which they go about doing so is fundamentally different. The properties of the L1 and L2 norms are what cause these differences.

Since these norms will pop up in other places later, it's a good idea to study what gives them their properties, using ridge regression and lasso regression as guides.

(This notebook is based on this blog post.)

## L1-L2 norm comparisons

## Robustness: L1 > L2

Robustness is defined as resistance to outliers in a dataset. The more able a model is to ignore extreme values in the data, the more robust it is. The L1 norm is more robust than the L2 norm, for a fairly simple reason: the L2 norm squares values, so the cost of an outlier grows quadratically; the L1 norm only takes the absolute value, so the cost grows linearly.

## Stability: L2 > L1

Stability is defined as resistance to small horizontal adjustments of a data point. This is the perpendicular opposite of robustness. The L2 norm is more stable than the L1 norm. A later notebook will explore why.

## Solution numeracy: L2 one, L1 many

Because L2 is Euclidean distance, there is always one right answer as to how to get between two points fastest. Because L1 is taxicab distance, there are as many solutions to getting between two points as there are ways of driving between two points in Manhattan! This is best illustrated by the taxicab graphic from earlier.

## L1-L2 regularizer comparisons

## Computational efficiency: L2 > L1

L2 has a closed-form solution because the penalty is a sum of squares, which is differentiable everywhere. L1 does not have a closed-form solution because it involves an absolute value, which makes it a piecewise function that is non-differentiable at 0.
For this reason, L1 is computationally more expensive: we can't solve it in terms of matrix math alone, and must rely on iterative approximations (in the lasso case, coordinate descent).

## Sparsity: L1 > L2

Sparsity is the property of having coefficients that fall into two clear groups: very near 0 or clearly far from 0. In theory, the coefficients very near 0 can later be eliminated.

Feature selection is a further-involved form of sparsity: instead of shrinking coefficients to near 0, feature selection takes them to exactly 0, and hence excludes certain features from the model entirely. Feature selection is a technique more so than a property: you can do feature selection as an additional step after running a highly sparse model. But lasso regression is interesting in that it features inbuilt feature selection, while ridge regression only shrinks coefficients toward 0 without eliminating any.

That about covers the high-level properties of L2 and L1 norms and regularizers. Hopefully you can see how these properties are exactly the same ones exposed in ridge and lasso regression!
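The closed-form and sparsity points can be seen side by side in a small sketch. It assumes NumPy and the unscaled cost functions used in this notebook (RSS plus λ times the penalty); the synthetic data, the value of `lam`, and all function names are made up for illustration. Ridge is solved in one step from the normal equations $(X^T X + \lambda I)\beta = X^T y$, while lasso is approximated with a simple coordinate descent loop that uses soft-thresholding. This is not how any particular library implements it.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y in a single linear solve."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward 0 by t; anything within [-t, t] becomes exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize RSS + lam * ||beta||_1 by updating one coefficient at a time."""
    beta = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for every column
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution added back in
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # The threshold is lam / 2 because the RSS term has no 1/2 factor
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data (hypothetical): only the first of three features matters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
lam = 20.0

beta_ridge = ridge_closed_form(X, y, lam)
beta_lasso = lasso_coordinate_descent(X, y, lam)

print("ridge:", beta_ridge)  # all coefficients shrunk, none exactly 0
print("lasso:", beta_lasso)  # irrelevant features driven to exactly 0
```

On data like this, the lasso coefficients for the two irrelevant features should come out as exactly 0, while the ridge coefficients for the same features are small but non-zero.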