import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score
14 Penalized Regression
14.1 Introduction
Now that we know how to create and use pipelines, it’s finally time to start adding more model specifications to our repertoire.
To motivate today’s new models, consider the house price prediction problem. This time, we are going to use ALL of our predictors!
Click here to download the full Ames housing dataset
Click here for data documentation
First, we’ll read and clean our data…
# Read the data
ames = pd.read_csv("data/AmesHousing.csv")

# Get rid of columns with mostly NaN values
good_cols = ames.isna().sum() < 100
ames = ames.loc[:, good_cols]

# Drop other NAs
ames = ames.dropna()
Then, we’ll set up our pipeline…
# Separate the predictors from the target
X = ames.drop(["SalePrice", "Order", "PID"], axis = 1)
y = ames["SalePrice"]

# Dummify the categorical columns and standardize the numeric ones
ct = ColumnTransformer(
    [
        ("dummify",
         OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'),
         make_column_selector(dtype_include = object)),
        ("standardize",
         StandardScaler(),
         make_column_selector(dtype_include = np.number))
    ],
    remainder = "passthrough"
)

lr_pipeline_1 = Pipeline(
    [("preprocessing", ct),
     ("linear_regression", LinearRegression())]
)
Then, we cross-validate:
cross_val_score(lr_pipeline_1, X, y, cv = 5, scoring = 'r2')
array([-1.00227561e+21, -2.13473460e+19, -4.65481157e+21, -4.24892786e+21,
-4.16001805e+22])
Oof. This is terrible! We used so many features in our model that we overfit to the training data and ended up with terrible predictions.
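Part of the problem is sheer dimensionality: one-hot encoding every categorical predictor produces far more columns than the raw data had. As a quick side check (not part of the original analysis), we can count the columns our preprocessing produces:

# How many features are we actually fitting?
ct.fit_transform(X).shape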
14.2 Ridge Regression
In the previous analysis, we fit our model according to Least Squares Regression. That is, we fit a model that predicts the \(i\)-th house price with the equation
\[\hat{y}_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\]
When we chose the best numbers to plug in for \(\beta_1, \ldots, \beta_p\), we picked the values that minimize the squared error in the training data:
\[\sum_{i = 1}^n (y_i - \hat{y}_i)^2\]
But this equation is not our only way to define the “best” \(\beta\)’s - and in this case, fitting as closely as possible to the training data did not result in good future predictions.
Recall that modeling is a kind of “teeter totter” between models that are too inflexible (underfit) and those that are too flexible (overfit). In this instance, we are overfitting, so we need our model to be less flexible; i.e., less able to get too close to the training data.
How will we do this? Regularization - a fancy word that says we will add a new piece to our loss function that restricts the flexibility of the \(\beta\)’s. We call these new pieces penalties.
For example, what if the “best betas” were instead defined as the ones that minimize:
\[ \ell(\beta) = \sum_{i = 1}^n (\hat{y}_i - y_i)^2 + \sum_{j = 1}^p \beta_j^2 \] Here, the \(\sum \beta_j^2\) is a Ridge penalty, another name for the sum of squared coefficients.
Now our loss function - which defines the best betas - has two concerns:
To make the Sum of Squared Error (SSE) small
To make the betas themselves small
These two concerns are acting in opposition to each other: We can make the SSE small by choosing \(\beta\)’s that overfit; or we can make the Ridge penalty small by choosing \(\beta\)’s near 0, but we can’t do both at the same time.
To decide how much we want to prioritize each concern, we’ll throw a number in called \(\lambda\):
\[ \ell(\beta) = \sum_{i = 1}^n (\hat{y}_i - y_i)^2 + \lambda \sum_{j = 1}^p \beta_j^2 \]
Now, we have a “knob” we can use to balance the pieces of the loss function. If \(\lambda\) is very large, then we care much more about restricting our coefficients to small values than about getting small SSE. If \(\lambda\) is close to 0, then we care much more about SSE.
When \(\beta\)’s are chosen according to the above loss function, instead of just considering SSE, we call this Ridge Regression.
The sklearn function Ridge() uses the argument name alpha for \(\lambda\).
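For instance, here is a minimal sketch of how we might swap Ridge Regression into our earlier pipeline; the step name "ridge" and the value alpha = 1.0 are arbitrary illustrative choices, not prescribed ones:

# Same preprocessing as before, but with a Ridge penalty on the coefficients.
# alpha = 1.0 is just a starting value for lambda.
ridge_pipeline = Pipeline(
    [("preprocessing", ct),
     ("ridge", Ridge(alpha = 1.0))]
)

cross_val_score(ridge_pipeline, X, y, cv = 5, scoring = 'r2')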
14.2.1 Tuning
Now, you might be wondering, how do we know what value to use for \(\lambda\)? That is, how should we prioritize between the two concerns?
Well, we don’t really know! But what we can do is try many different values of \(\lambda\) and see which one results in the best metrics.
That is, we can tune the hyperparameter \(\lambda\) to pick a final value - not because we are interested in the value itself, but because it will tell us the best model specification (i.e., the best loss function) to use.
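In sklearn, one way to do this is GridSearchCV, which cross-validates every candidate value for us. Here is a sketch reusing ridge_pipeline from above; the candidate alphas are arbitrary illustrative values:

from sklearn.model_selection import GridSearchCV

# Try several values of lambda (called alpha by sklearn) and keep
# whichever gives the best cross-validated R-squared.
alpha_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

gscv = GridSearchCV(ridge_pipeline, alpha_grid, cv = 5, scoring = 'r2')
gscv.fit(X, y)

gscv.best_params_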
14.3 LASSO
This idea of adding a penalty to the loss function is very powerful - and the Ridge penalty is not the only one we could have used.
Another common choice is the LASSO (least absolute shrinkage and selection operator) penalty, which regularizes the \(\beta\)’s according to their absolute value instead of the square.
\[ \ell(\beta) = \sum_{i = 1}^n (\hat{y}_i - y_i)^2 + \lambda \sum_{j = 1}^p |\beta_j| \]
This difference may feel fairly trivial, but it has surprisingly different effects on the resulting coefficients! While Ridge Regression results in coefficients that are generally smaller than those from OLS, LASSO Regression will cause some coefficients to be exactly 0.
This outcome, where some of the \(\beta\)’s are fully eliminated from the model, is very powerful. It essentially performs automatic feature selection: the predictors whose coefficients became zero are the ones that were not needed for predictive power.
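To see this in action, here is a minimal sketch that fits a LASSO model in our pipeline and counts the zeroed-out coefficients; the step name "lasso" and the value alpha = 10 are arbitrary choices for illustration:

# Same preprocessing as before, but with a LASSO penalty.
lasso_pipeline = Pipeline(
    [("preprocessing", ct),
     ("lasso", Lasso(alpha = 10))]
)
lasso_pipeline.fit(X, y)

# How many coefficients did the penalty set exactly to 0?
coefs = lasso_pipeline.named_steps["lasso"].coef_
(coefs == 0).sum()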
14.3.1 Your Turn
The sklearn function Lasso() uses the argument name alpha for \(\lambda\).
14.3.2 Elastic Net
If Ridge regression is one way to improve OLS in overfit cases, and LASSO regression is another, why not have the best of both worlds?
There is no reason not to add both penalties to our loss function:
\[ \ell(\beta) = SSE + \lambda \left(\sum_{j = 1}^p \beta_j^2 + \sum_{j = 1}^p |\beta_j| \right)\]
Of course, we now have three concerns to balance: the SSE, the Ridge penalty, and the LASSO penalty. So in addition to \(\lambda\), which balances SSE against the penalties, we need to add in a number \(\alpha\) for how much we care (relatively) about LASSO vs Ridge:
\[ \ell(\beta) = SSE + \lambda \left(\alpha \sum_{j = 1}^p |\beta_j| + (1-\alpha) \sum_{j = 1}^p \beta_j^2 \right)\]
But what is the best choice of \(\alpha\)? I think you can see where this is headed…
14.3.3 Your Turn
The sklearn function ElasticNet() uses the argument name alpha for \(\lambda\), and the argument name l1_ratio for \(\alpha\).
Yes, this is confusing and annoying! But since essentially all of the literature uses \(\lambda\) and \(\alpha\) the way we have used them in this chapter, we are choosing to be consistent with that notation instead of sklearn’s argument names.
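So, for example, a model with \(\lambda = 1\) and \(\alpha = 0.5\) (arbitrary values, just to show the mapping) would be specified as:

# Our lambda is sklearn's alpha; our alpha is sklearn's l1_ratio.
enet = ElasticNet(alpha = 1, l1_ratio = 0.5)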
14.4 Wrap-Up
In this chapter, you may feel a bit tricked! We promised you new model specifications, but actually, we still haven’t changed our prediction approach from a simple linear model:
\[\hat{y}_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\]
It’s important to remember, though, that how you fit the model to data is every bit as important a piece of the model specification decision as the equation itself.
By changing our definition of “best” coefficients - that is, changing the loss function that our ideal \(\beta\)’s would minimize - we were able to massively impact the resulting prediction procedure.