International Journal of Statistics and Applications
p-ISSN: 2168-5193 e-ISSN: 2168-5215
2018; 8(4): 167-172
doi:10.5923/j.statistics.20180804.02

N. Herawati, K. Nisa, E. Setiawan, Nusyirwan, Tiryono
Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Lampung, Bandar Lampung, Indonesia
Correspondence to: N. Herawati, Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Lampung, Bandar Lampung, Indonesia.
Copyright © 2018 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

This study aims to compare the performance of Ordinary Least Squares (OLS), the Least Absolute Shrinkage and Selection Operator (LASSO), Ridge Regression (RR) and Principal Component Regression (PCR) in handling severe multicollinearity among explanatory variables in multiple regression analysis using simulated data. To select the best method, a Monte Carlo experiment was carried out in which the simulated data contain severe multicollinearity among all explanatory variables (ρ = 0.99), with different sample sizes (n = 25, 50, 75, 100, 200) and different numbers of explanatory variables (p = 4, 6, 8, 10, 20). The performances of the four methods are compared using the Average Mean Square Error (AMSE) and the Akaike Information Criterion (AIC). The results show that PCR has the lowest AMSE of the four methods, indicating that PCR is the most accurate estimator of the regression coefficients for every sample size and number of explanatory variables studied. PCR also performs best as an estimation model, since it gives the lowest AIC values compared to OLS, RR, and LASSO.
Keywords: Multicollinearity, LASSO, Ridge Regression, Principal Component Regression
Cite this paper: N. Herawati, K. Nisa, E. Setiawan, Nusyirwan, Tiryono, Regularized Multiple Regression Methods to Deal with Severe Multicollinearity, International Journal of Statistics and Applications, Vol. 8 No. 4, 2018, pp. 167-172. doi: 10.5923/j.statistics.20180804.02.
Given a data set of n observations, where each observation i includes a scalar response $y_i$ and a vector of p explanatory variables (regressors) $x_{ij}$ for $j = 1, \ldots, p$, a multiple linear regression model can be written as

$$Y = X\beta + \varepsilon,$$

where $Y = (y_1, \ldots, y_n)'$ is the vector of the dependent variable, $X$ is the matrix of the explanatory variables, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients to be estimated, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ represents the errors or residuals.
The OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ is obtained by minimizing the squared distances between the observed and the predicted values of the dependent variable [1, 4]. For the OLS estimation in this model to be unbiased, several assumptions must be satisfied: the errors have an expected value of zero, the explanatory variables are non-random, the explanatory variables are linearly independent, and the disturbances are homoscedastic and not autocorrelated. Explanatory variables subject to multicollinearity produce imprecise estimates of the regression coefficients in a multiple regression. Several regularized methods are available to deal with this problem, among them RR, LASSO and PCR. Many studies of these three methods have been carried out over the decades; nevertheless, the investigation of RR, LASSO and PCR remains an interesting topic that has continued to attract authors in recent years, see e.g. [7-12] for recent studies on the three methods.
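As a minimal illustration of the OLS estimator above (a sketch on toy data; the variable names and values are hypothetical and not taken from the paper), the closed-form solution $\hat{\beta} = (X'X)^{-1}X'Y$ can be computed directly in R and compared with lm():

```r
## Minimal OLS sketch: closed-form (X'X)^{-1} X'Y versus lm() (illustrative data only)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                          # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y
beta_hat
coef(lm(y ~ x1 + x2))                          # matches the closed-form solution
```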
RR, LASSO and PCR require X to be a centered and scaled matrix. The cross-product matrix X'X is nearly singular when the columns of X are highly correlated, and it is often the case that X'X is "close" to singular. This phenomenon is called multicollinearity. In this situation $\hat{\beta}$ can still be obtained, but small perturbations in the data lead to significant changes in the coefficient estimates [13]. One way to detect multicollinearity in regression data is to use the variance inflation factor (VIF). The formula of the VIF is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination from regressing the j-th explanatory variable on the remaining explanatory variables. A VIF value larger than 10 is commonly taken to indicate serious multicollinearity.
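As a sketch of how this formula can be evaluated in R (the simulated data and variable names below are illustrative assumptions, not the paper's data), each $R_j^2$ is taken from the auxiliary regression of $x_j$ on the remaining regressors; the car package's vif() function gives the same result:

```r
## VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other explanatory variables
set.seed(2)
n   <- 100
z   <- rnorm(n)
dat <- data.frame(x1 = z + rnorm(n, sd = 0.05),   # x1 and x2 are nearly collinear
                  x2 = z + rnorm(n, sd = 0.05),
                  x3 = rnorm(n))

vif_manual <- sapply(names(dat), function(j) {
  aux <- lm(reformulate(setdiff(names(dat), j), response = j), data = dat)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual   # x1 and x2 show very large VIFs, signalling severe multicollinearity
```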
a. Ridge Regression (RR)

The ridge regression technique is based on adding a ridge parameter (λ) to the diagonal of the X'X matrix, forming a new matrix (X'X + λI). It is called ridge regression because the diagonal of ones in the correlation matrix can be described as a ridge [6]. The ridge estimator of the coefficients is

$$\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X'Y.$$

When λ = 0, the ridge estimator becomes the OLS estimator. If all the ridge parameters $\lambda_j$ are the same, the resulting estimators are called the ordinary ridge estimators [14, 15]. It is often convenient to rewrite ridge regression in Lagrangian form:

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\}.$$
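A hedged sketch of how the ridge estimator can be obtained in R with the glmnet package, where alpha = 0 selects the L2 (ridge) penalty; the data, the λ grid, and the variable names are illustrative assumptions:

```r
## Ridge regression with glmnet (alpha = 0): squared-error loss with an L2 penalty
library(glmnet)
set.seed(3)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.01)       # make two columns nearly collinear
y <- drop(X %*% rep(1, p) + rnorm(n))

cv_ridge <- cv.glmnet(X, y, alpha = 0)       # cross-validation over a grid of lambda values
coef(cv_ridge, s = "lambda.min")             # ridge coefficients at the selected lambda
```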
Ridge regression has the ability to overcome multicollinearity by constraining the coefficient estimates; hence, it can reduce the estimator's variance, but it introduces some bias [16].

b. The LASSO

The LASSO estimates the regression coefficients by solving the optimization problem

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \|Y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

for some $t \ge 0$. By Lagrangian duality, there is a one-to-one correspondence between this constrained problem and the Lagrangian form

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$

For each value of t in the range where the constraint $\sum_{j=1}^{p} |\hat{\beta}_j| \le t$ is active, there is a corresponding value of λ that yields the same solution from the Lagrangian form. Conversely, the solution $\hat{\beta}_{\lambda}$ of the Lagrangian problem solves the bound problem with $t = \sum_{j=1}^{p} |\hat{\beta}_{\lambda,j}|$
[17, 18]. Like ridge regression, penalizing the absolute values of the coefficients introduces shrinkage towards zero. However, unlike ridge regression, some of the coefficients are shrunk all the way to zero; such solutions, with multiple values that are identically zero, are said to be sparse. The penalty thereby performs a kind of continuous variable selection.

c. Principal Component Regression (PCR)

Let $V = [V_1, \ldots, V_p]$ be the $p \times p$ matrix whose columns are the normalized eigenvectors of $X'X$, and let $\lambda_1, \ldots, \lambda_p$ be the corresponding eigenvalues. Let $W = [W_1, \ldots, W_p] = XV$. Then $W_j = XV_j$ is the j-th sample principal component of X. The regression model can be written as

$$Y = W\alpha + \varepsilon, \quad \text{where } \alpha = V'\beta.$$

Under this formulation, the least squares estimator of $\alpha$ is

$$\hat{\alpha} = (W'W)^{-1}W'Y,$$

and hence the principal component estimator of β is defined by

$$\hat{\beta}_{PCR} = V_k \hat{\alpha}_k,$$

where $V_k$ contains the eigenvectors of the k retained components and $\hat{\alpha}_k$ the corresponding elements of $\hat{\alpha}$ [19-21]. Calculation of the OLS estimates via principal component regression may be numerically more stable than direct calculation [22]. Severe multicollinearity is detected as very small eigenvalues. To rid the data of the multicollinearity, principal component regression omits the components associated with small eigenvalues.
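To make the LASSO and PCR estimators above concrete, the following R sketch fits the LASSO with glmnet (alpha = 1) and builds a basic principal component regression from prcomp() and lm(); the data, the number of retained components k, and all names are illustrative assumptions (the pls package's pcr() function is an alternative):

```r
## LASSO and principal component regression on the same illustrative data
library(glmnet)
set.seed(4)
n <- 100; p <- 6
common <- rnorm(n)
X <- sapply(1:p, function(j) common + rnorm(n, sd = 0.05))   # highly correlated columns
y <- drop(X %*% rep(1, p) + rnorm(n))

## a) LASSO: L1-penalized least squares; some coefficients are shrunk exactly to zero
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")

## b) PCR: regress y on the leading k principal components, then map back to beta
pc        <- prcomp(X, center = TRUE, scale. = TRUE)
k         <- 2                                   # retained components (illustrative choice)
W         <- pc$x[, 1:k, drop = FALSE]           # scores W = X V for the first k components
alpha_hat <- coef(lm(y ~ W))[-1]                 # least-squares estimates of alpha
beta_pcr  <- pc$rotation[, 1:k, drop = FALSE] %*% alpha_hat   # beta_PCR = V_k alpha_k (scaled X)
beta_pcr
```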
To compare the performance of the four methods, the Average Mean Square Error (AMSE) of the estimated regression coefficients $\hat{\beta}$ is measured. The AMSE is defined by

$$\mathrm{AMSE}(\hat{\beta}) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{p} \sum_{j=1}^{p} \left( \hat{\beta}_j^{(l)} - \beta_j \right)^2,$$

where $\hat{\beta}^{(l)}$ denotes the estimated parameter vector in the l-th simulation and L is the number of simulations. An AMSE value close to zero indicates that the slope and intercept are correctly estimated. In addition, the Akaike Information Criterion (AIC) is used as a performance criterion, with formula

$$\mathrm{AIC} = 2k - 2\ln\!\big(L(\hat{\theta} \mid x)\big),$$

where $\hat{\theta}$ are the parameter values that maximize the likelihood function L, x is the observed data, n is the number of data points in x, and k is the number of parameters estimated by the model [23, 24]. The best model is indicated by the lowest value of AIC.
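A sketch of how the two criteria might be computed in R for a single estimator (the loop, sample sizes, and true β below are assumptions for illustration, not the authors' exact setup); for an lm fit, the built-in AIC() evaluates the Gaussian form of the formula above:

```r
## AMSE over L Monte Carlo replications, and AIC of one fitted model
set.seed(5)
L <- 100; n <- 50; p <- 4
beta_true <- rep(1, p)

mse_l <- replicate(L, {
  X   <- matrix(rnorm(n * p), n, p)
  y   <- drop(X %*% beta_true + rnorm(n))
  fit <- lm(y ~ X - 1)                        # OLS without intercept, for simplicity
  mean((coef(fit) - beta_true)^2)             # MSE of the coefficients in this replication
})
AMSE <- mean(mse_l)                           # average over the L simulations
AMSE

X   <- matrix(rnorm(n * p), n, p)
y   <- drop(X %*% beta_true + rnorm(n))
fit <- lm(y ~ X)
AIC(fit)                                                     # built-in AIC
-2 * as.numeric(logLik(fit)) + 2 * (length(coef(fit)) + 1)   # same value; +1 counts sigma^2
```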
We simulate sets of data with sample sizes n = 25, 50, 75, 100, 200 that contain severe multicollinearity among all explanatory variables (ρ = 0.99) using R, with 100 iterations. Following [25], the explanatory variables are generated by

$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i(p+1)}, \quad i = 1, \ldots, n, \ j = 1, \ldots, p,$$

where the $z_{ij}$ are independent standard normal pseudo-random numbers and ρ is specified so that the theoretical correlation between any two explanatory variables is given by $\rho^2$. The dependent variable $y_i$ for each set of p explanatory variables is generated from

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

with the β parameter vectors chosen arbitrarily for p = 4, 6, 8, 10, 20 and ε ~ N(0, 1). To measure the amount of multicollinearity in the data set, the variance inflation factor (VIF) is examined. The performances of the OLS, LASSO, RR, and PCR methods are compared based on the values of AMSE and AIC. Cross-validation is used to choose the value of λ for RR and LASSO.
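A sketch of this data-generating scheme and the cross-validated choice of λ in R, following the form of the generator in [25]; the β vector below is a placeholder, since the exact values used in the paper are not reproduced here:

```r
## Explanatory variables with theoretical pairwise correlation rho^2, plus lambda selection
library(glmnet)
set.seed(6)
n <- 100; p <- 4; rho <- 0.99

Z <- matrix(rnorm(n * (p + 1)), n, p + 1)           # independent N(0,1) pseudo-random numbers
X <- sqrt(1 - rho^2) * Z[, 1:p] + rho * Z[, p + 1]  # x_ij = (1 - rho^2)^(1/2) z_ij + rho z_i(p+1)

beta <- rep(1, p)                                   # placeholder for the arbitrarily chosen beta
y <- drop(X %*% beta + rnorm(n))                    # eps ~ N(0, 1)

cv_ridge <- cv.glmnet(X, y, alpha = 0)              # lambda for ridge regression
cv_lasso <- cv.glmnet(X, y, alpha = 1)              # lambda for the LASSO
c(ridge = cv_ridge$lambda.min, lasso = cv_lasso$lambda.min)
```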
[Figure 1. AMSE of OLS, LASSO, RR and PCR for p=4]
[Figure 2. AMSE of OLS, LASSO, RR and PCR for p=6]
[Figure 3. AMSE of OLS, LASSO, RR and PCR for p=8]
[Figure 4. AMSE of OLS, LASSO, RR and PCR for p=10]
[Figure 5. AMSE of OLS, LASSO, RR and PCR for p=20]
[Figure 6. Bar-graph of AIC for p=4]
[Figure 7. Bar-graph of AIC for p=6]
[Figure 8. Bar-graph of AIC for p=8]
[Figure 9. Bar-graph of AIC for p=10]
[Figure 10. Bar-graph of AIC for p=20]