Advanced Regression.
- Generalized Linear Regression
- Regularized Regression - Ridge and Lasso Regression
The generalized linear regression process consists of the following two steps:
1. Conduct exploratory data analysis by examining scatter plots of explanatory and dependent variables.
2. Choose an appropriate set of functions which seem to fit the plot well and build models using them.
If the plot shows multiple roots and several local maxima and minima (with no clear global maximum or minimum), a polynomial function is usually a good candidate.
Ex: ax^4 + bx^3 + cx^2 + dx + f
If the plot is monotonically increasing, an exponential function may fit well.
Ex: e^x
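As a concrete illustration of step 2, here is a minimal sketch (using numpy on synthetic data invented for this example) that fits a quartic of the form ax^4 + bx^3 + cx^2 + dx + f to a scatter of points:

import numpy as np

# Synthetic data invented for this illustration: a noisy quartic relationship
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 0.5 * x**4 - 2 * x**2 + x + rng.normal(scale=2.0, size=x.shape)

# Step 2: choose a family of functions (here a degree-4 polynomial) and fit it.
# np.polyfit returns the coefficients [a, b, c, d, f] that minimise the squared error.
coeffs = np.polyfit(x, y, deg=4)
y_hat = np.polyval(coeffs, x)
print("Fitted coefficients (a, b, c, d, f):", coeffs)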
1. Can x1·x2·x3 be a feature if the raw attributes are x1, x2, x3 and x4?
A. Yes. Derived features can be created from any combination of the raw attributes, linear or non-linear. In this case, the product x1·x2·x3 is a non-linear combination.
2. What is the maximum number of features that can be created if we have d raw attributes for n data points? Note that (n r)
here refers to the number of ways of selecting r items from a set of n.
A. Infinite. In principle, you can create as many derived features as you want.
Summarising the important takeaways from the lecture:
- In generalised regression models, the basic algorithm remains the same as in linear regression: we compute the values of the coefficients that result in the least possible error (the best fit).
The only difference is that we now use the features ϕ1(x), ϕ2(x), ϕ3(x), ..., ϕk(x) instead of the raw attributes (a short sketch of this follows after the summary).
- The term 'linear' in linear regression refers to linearity in the coefficients, i.e. the target variable y is linearly related to the model coefficients.
It does not require that y be linearly related to the raw attributes or features; the feature functions themselves can be non-linear.
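A minimal sketch of this idea, assuming scikit-learn is available and using made-up data: the raw attributes x1...x4 are mapped to derived features ϕ(x), including the non-linear product x1·x2·x3 discussed above, and an ordinary linear regression is then fitted on those features:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up raw attributes x1, x2, x3, x4 for 200 data points
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 4))
y = 3 * X_raw[:, 0] - 2 * X_raw[:, 0] * X_raw[:, 1] * X_raw[:, 2] + rng.normal(scale=0.1, size=200)

# Feature functions phi_1(x), ..., phi_5(x): the raw attributes plus the product x1*x2*x3
def phi(X):
    product = (X[:, 0] * X[:, 1] * X[:, 2]).reshape(-1, 1)
    return np.hstack([X, product])

# The algorithm is unchanged: ordinary least squares, applied to the derived features
model = LinearRegression().fit(phi(X_raw), y)
print(model.coef_)  # one coefficient per feature phi_1(x), ..., phi_5(x)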
1. Is the following equation linear? y = a·x1 + b·e^(x2+x3) + cos(d·x4)
Note: a, b, c and d are coefficients of regression.
A. No, because of the term cos(d·x4): the coefficient d appears inside the cosine, so y is not linear in d.
As stated in the text and the video, the model must be linear with respect to the coefficients.
Regularized Regression:
Regularization is a process used to create an optimally complex model, i.e. a model that is as simple as possible while still performing well on the training data.
In regularized regression, the objective function has two parts:
- The error term
- The regularization term
Ridge regression and Lasso regression
Both these methods are used to make the regression model simpler while balancing the 'bias-variance' tradeoff.
Ridge regression: the cost function is the error term plus λ times the sum of the squares of the coefficients.
Lasso regression: the cost function is the error term plus λ times the sum of the absolute values of the coefficients.
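Written out explicitly (a sketch, with λ denoting the regularization strength, β_j the model coefficients and RSS the error term):
Ridge objective: minimise RSS + λ * Σ_j β_j^2
Lasso objective: minimise RSS + λ * Σ_j |β_j|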
One important benefit of Lasso regression is that it produces model parameters in which the coefficients of the less important features become exactly zero.
In other words, Lasso regression indirectly performs feature selection.
1. Which of Ridge and Lasso regression is computationally more intensive?
A. Lasso regression, because Ridge regression almost always has a closed-form (matrix) solution, while Lasso requires an iterative procedure to reach the final solution.
2. Which of the following methods performs variable selection, i.e. can help discard redundant variables from the model?
A. Lasso regression. Lasso shrinks the coefficients of redundant variables all the way to zero, and thus indirectly performs variable selection as well. Ridge, on the other hand, reduces the coefficients to arbitrarily small values, but not exactly zero.
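A minimal sketch of this behaviour, assuming scikit-learn and using synthetic data in which only the first two of five predictors actually matter:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: predictors 3-5 are redundant
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of the λ hyperparameter
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", ridge.coef_)  # all shrunk, but none exactly zero
print("Lasso coefficients:", lasso.coef_)  # the redundant predictors are driven exactly to 0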
3. Suppose there are 14 predictors available to build a model. How many models can be built using 0 predictors, 1 predictor, 2 predictors and 3 predictors?
A. 1, 14, 91 and 364 respectively, i.e. C(14, 0), C(14, 1), C(14, 2) and C(14, 3).
4. How many models in total can be built using 10 predictors?
A. 2^10 = 1024, since each predictor is either included in or excluded from the model.
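These counts are just binomial coefficients; a quick check in plain Python:

from math import comb

# Models with exactly k of the 14 predictors: C(14, k)
print([comb(14, k) for k in range(4)])  # [1, 14, 91, 364]

# Total number of models with 10 predictors: each predictor is either in or out
print(2 ** 10)  # 1024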
5. Suppose we are regressing using 120 predictors on a dataset of 80 observations. Which of the following methods can be used to perform regression?
A. Forward Stepwise Selection
When n < p, the full model cannot be fitted, so backward stepwise selection is ruled out, and with more than 40 predictors best subset selection is computationally infeasible; forward stepwise selection is therefore the only option.
Forward stepwise selection: start with the null model (0 features) and, at every iteration, add the feature that most reduces the error (or most increases R^2).
Backward stepwise selection: start with all p features (the full model Mp) and, at every iteration, remove the feature whose removal results in the smallest error (or the largest R^2).
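One way to run forward stepwise selection in practice is scikit-learn's SequentialFeatureSelector; the sketch below uses made-up data with n = 80 and p = 120 (as in the question) and an arbitrary choice of 5 features to keep, so treat it as illustrative rather than as the exact M0, M1, ..., Mp procedure:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Made-up data: n = 80 observations, p = 120 predictors (n < p)
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 120))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=80)

# Forward stepwise: start from 0 features and greedily add one feature at a time
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
selector.fit(X, y)
print("Selected feature indices:", np.flatnonzero(selector.get_support()))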
6. Suppose we are regressing a dependent variable y on 18 predictors on a dataset with 400 observations. Which method will give us the best model, i.e. the one with the lowest test error?
A. Best subset selection.
Best subset selection can find the best model because it tries every possible combination of predictors, and here the number of predictors is less than 40, so it is computationally feasible.
7. As λ increases from 0 to infinity, select the option that describes the pattern of the residual sum of squares (RSS) on the training dataset.
A. Steadily increases. At λ = 0, minimising the cost function yields exactly the coefficients that minimise the training RSS. At λ = infinity, we get a constant model with the maximum RSS. Thus, the training RSS steadily increases as λ increases.
8. As λ increases from 0 to infinity, select the option that describes the pattern of the variance of the model.
A. Steadily decreases. When λ = 0, the coefficients take their least-squares estimates, which depend heavily on the training data, so the variance is high. As λ increases, the coefficients shrink and the model becomes simpler. In the limiting case of λ approaching infinity, all coefficients reduce to zero, the model predicts a constant and has no variance.
9. As λ increases from 0 to infinity, select the option that describes the pattern of the (squared) bias of the model.
A. Steadily increases. When λ = 0, the coefficients take their least-squares estimates and hence have the least bias. As λ increases, the coefficients shrink towards zero, the model fits the training data less accurately and the bias increases. In the limiting case of λ approaching infinity, the model predicts a constant and the bias is maximum.
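A small sketch of these patterns, assuming scikit-learn and synthetic data: as the ridge penalty λ (called alpha in scikit-learn) grows, the training RSS rises while the coefficients shrink towards zero:

import numpy as np
from sklearn.linear_model import Ridge

# Synthetic training data with six predictors, only three of which matter
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0]) + rng.normal(scale=1.0, size=100)

for lam in [0.001, 1.0, 10.0, 100.0, 1000.0]:   # 0.001 is essentially the least-squares fit
    model = Ridge(alpha=lam).fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)   # training RSS grows with lambda
    coef_norm = np.linalg.norm(model.coef_)     # coefficient magnitudes shrink towards 0
    print(f"lambda = {lam:g}   training RSS = {rss:.2f}   ||coefficients|| = {coef_norm:.3f}")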
10. Reasons for High Variance - Overfitting, Multicollinearity & Outliers
11. You decide to use regularization (Ridge and Lasso regression) to tackle this problem. What will happen if we use a very large value of the hyperparameter λ?
A. 1. The test error will be high. Even though the variance will be very low, the model will not have captured the behaviour of the data correctly (high bias).
2. Ridge will shrink some of the coefficients very close to 0, since Ridge shrinks coefficients but does not set them exactly to zero.
3. Lasso will set some of the coefficients exactly to 0, since the L1 penalty drives the less important coefficients all the way to zero.
12. Which of the following models should be used to predict on the test set?
Model BIC
M1 186.1
M2 193.6
M3 188.9
A. M1. The lower the BIC, the lower the expected test error.
13. Which of the following models should be used to predict on the test set?
Model Adjusted R^2
M1 0.72
M2 0.67
M3 0.83
A. M3. The higher the adjusted R^2, the lower the expected test error.
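For reference, the adjusted R^2 used above can be computed with a small helper (a sketch; n is the number of observations and p the number of predictors):

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example with numbers from the questions above: R^2 = 0.83, 400 observations, 18 predictors
print(round(adjusted_r2(0.83, 400, 18), 3))  # 0.822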