An epic journey through statistics and machine learning

Regularization Part 1: Ridge Regression

10 thoughts on “Regularization Part 1: Ridge Regression”

Hello Josh,

Thank you for your great videos!

One question: In your example, the data outliers (red dots) are arranged so that the slope of the red line is higher than the slope of the green data regression line (higher values of y for high values of x and smaller values of y for small values of x).

What if the slope of the red line is smaller than the slope of the green line (smaller values of y for high values of x and higher values of y for small values of x)? How does Ridge Regression work in this scenario?

If ridge regression can not improve predictions by shrinking parameters, then it will do nothing at all.

I understand cross validation, but I don’t understand how you would use cross validation to find the best lambda. Do you simply plug in different values for lambda, holding slope^2 constant, and take the lambda value that returns the smallest SSE from cross validation? Thanks in advance.

For CV, you try various values for lambda, like 0, 0.1, 1 and 10. Then, for each candidate value of lambda, you find the slope that minimizes SSR + the Ridge Regression Penalty on the training data. Then we see how well that slope predicts the values in the testing data, and we pick the value for lambda that performs best on the testing data.
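The procedure above can be sketched in a few lines. This is a minimal illustration, not code from the video: the data values are made up, the closed-form `fit_ridge_slope` helper is something I wrote for the sketch (it penalizes only the slope, matching the video), and a single train/test split stands in for full cross validation.

```python
import numpy as np

# Made-up 1-D data in the spirit of the video: x = Weight, y = Size
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 0.8 * x + 0.9 + rng.normal(0, 1.0, 30)

# One train/test split (real CV would rotate through several folds)
x_train, x_test = x[:20], x[20:]
y_train, y_test = y[:20], y[20:]

def fit_ridge_slope(x, y, lam):
    """Minimize SSR + lam * slope^2 for the line y = intercept + slope * x.
    Only the slope is penalized, so the intercept is the usual mean offset."""
    xm, ym = x.mean(), y.mean()
    slope = ((x - xm) @ (y - ym)) / (((x - xm) @ (x - xm)) + lam)
    intercept = ym - slope * xm
    return intercept, slope

best_lam, best_ssr = None, np.inf
for lam in [0.0, 0.1, 1.0, 10.0]:          # candidate lambda values
    b0, b1 = fit_ridge_slope(x_train, y_train, lam)
    ssr = np.sum((y_test - (b0 + b1 * x_test)) ** 2)  # testing-data SSR
    if ssr < best_ssr:
        best_lam, best_ssr = lam, ssr
print("best lambda:", best_lam)
```

Note that the penalty only enters the fit on the training data; the candidates are then compared purely on how well they predict the testing data.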

How do you know which values of lambda to test? If a lambda value can go from 0 to infinity, how do you know what are the highest/lowest lambda values to try in cross validation?

I talk about how to find good “hyperparameter” values, including lambda for regularization, in this video: https://www.youtube.com/watch?v=GrJP9FLV3FE&t=2701s
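One common heuristic (an assumption on my part, not something stated in the thread) is to space the candidate lambdas evenly on a log scale, since the useful range often spans several orders of magnitude, and to include 0 so plain least squares is always a candidate:

```python
import numpy as np

# Candidate lambdas: 0 plus seven values from 0.001 to 1000, evenly log-spaced
lambdas = np.concatenate(([0.0], np.logspace(-3, 3, 7)))
```

If cross validation picks a value at either edge of the grid, that is a sign to widen the grid and search again.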

I have a question about the video Part 1: Ridge Regression. The formula for the ridge regression line is Size = 0.9 + 0.8 * Weight. How did you determine this formula? And is the penalty 0.74 for the ridge regression line?

The optimal parameters for ridge regression are determined using an iterative procedure like Gradient Descent. For details on Gradient Descent, see: https://youtu.be/sDv4f4s2SB8
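As a rough sketch of that idea (not the code behind the video: the data, lambda = 1.0, the learning rate, and the iteration count are all made up for illustration), gradient descent on the ridge loss SSR + lambda * slope^2 looks like this:

```python
import numpy as np

# Made-up data loosely following the video's line Size = 0.9 + 0.8 * Weight
rng = np.random.default_rng(1)
weight = rng.uniform(0, 5, 40)
size = 0.9 + 0.8 * weight + rng.normal(0, 0.2, 40)

lam, lr = 1.0, 0.001   # penalty strength and learning rate (both assumed)
b0, b1 = 0.0, 0.0      # start with intercept = 0 and slope = 0

for _ in range(5000):
    resid = size - (b0 + b1 * weight)
    # Gradients of SSR + lam * slope^2 (the intercept is not penalized)
    grad_b0 = -2.0 * resid.sum()
    grad_b1 = -2.0 * (resid @ weight) + 2.0 * lam * b1
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1
```

Each step moves the intercept and slope a small amount downhill on the penalized loss; the extra `2 * lam * b1` term is what pulls the slope toward zero relative to plain least squares.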

I viewed the video several times, but I do not understand anything.

I’m sorry to hear that! Perhaps it would be helpful to review linear regression first? https://youtu.be/nk2CQITm_eo