International Journal of Statistics and Applications

p-ISSN: 2168-5193    e-ISSN: 2168-5215

2020;  10(3): 55-59

doi:10.5923/j.statistics.20201003.01

 

Selecting the Method to Overcome Partial and Full Multicollinearity in Binary Logistic Model

N. Herawati, K. Nisa, Nusyirwan

Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Lampung, Bandar Lampung, Indonesia

Correspondence to: K. Nisa, Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Lampung, Bandar Lampung, Indonesia.

Copyright © 2020 The Author(s). Published by Scientific & Academic Publishing.

This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

Abstract

The aim of our study is to select the best method for overcoming partial and full multicollinearity in the binary logistic model for different sample sizes. Logistic ridge regression (LRR), the least absolute shrinkage and selection operator (LASSO) and principal component logistic regression (PCLR) were compared to the maximum likelihood estimator (MLE) using simulated data with different levels of multicollinearity and different sample sizes (n=20, 50, 100, 200). The best method is chosen based on mean square error (MSE) values and the best model is characterized by its AIC value. The results show that LRR, LASSO and PCLR surpass MLE in overcoming partial and full multicollinearity in the binary logistic model. PCLR exceeds LRR and LASSO when full multicollinearity occurs in the binary logistic model, but LASSO and LRR are better choices when partial multicollinearity exists in the model.

Keywords: Binary logistic model, Multicollinearity, LRR, LASSO, PCLR

Cite this paper: N. Herawati, K. Nisa, Nusyirwan, Selecting the Method to Overcome Partial and Full Multicollinearity in Binary Logistic Model, International Journal of Statistics and Applications, Vol. 10 No. 3, 2020, pp. 55-59. doi: 10.5923/j.statistics.20201003.01.

1. Introduction

Consider a model of the form $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$, where the dependent variable $y_i$ takes the value either 0 or 1. Estimating the parameters of such a model, where the response variable is binary or multinomial, is not appropriate with the linear regression estimation method, because the linear regression model is based on a ratio-scale measurement [1,2,3]. In this case the logistic regression model is more suitable.
The logistic regression model uses a logistic function to model a binary dependent variable; it is used to classify individuals into different groups. Unlike multiple regression, logistic regression is much more flexible in terms of the basic assumptions to be met. As a nonlinear regression model, it does not require a linear relationship between the independent and dependent variables, normally distributed errors, or homoscedasticity. Despite all this flexibility, the logistic regression model still requires that the independent variables be uncorrelated [4,5]. When there is correlation between the independent variables, the logistic model becomes unstable. This can cause errors in the interpretation of the relationship between the dependent variable and each independent variable in terms of odds ratios [6,7].
Several methods for overcoming the problem of multicollinearity in logistic models have been proposed and examined by researchers [8,9,10,11,12]. In this research, a comparison of the LRR, LASSO and PCLR methods was conducted for a logistic model with binary responses and a set of continuous predictor variables. Each method was evaluated on simulated data containing partial and full multicollinearity with different sample sizes. The best method was selected based on the minimum value of MSE and the best model is characterized by its AIC value.

2. Logistic Regression Model

Suppose the response variable of interest has two possible outcomes, so that $Y_i$ is a Bernoulli random variable with probability distribution $P(Y_i = 1) = \pi_i$ and $P(Y_i = 0) = 1 - \pi_i$. The probability function for each observation is $f(y_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i}$, $i = 1, 2, \ldots, n$ [4,5,6,7]. The multiple logistic regression model of the response variable on a predictor vector $x$ is

$$\pi(x) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)} \qquad (1)$$

where $\beta$ is the vector of parameters to be estimated. The logit function of $\pi(x)$, which is linear in the parameters, can be written as [3,13]:

$$g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = x'\beta \qquad (2)$$

The parameters are estimated by maximizing the likelihood function $L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i}$. Setting the derivative of the log-likelihood with respect to $\beta$ equal to zero and solving by iteratively reweighted least squares gives

$$\hat{\beta} = (X'WX)^{-1}X'WZ \qquad (3)$$

where $W = \mathrm{diag}[\hat{\pi}_i(1-\hat{\pi}_i)]$ and $Z$ is an $n \times 1$ column vector with elements $z_i = x_i'\hat{\beta} + (y_i - \hat{\pi}_i)/[\hat{\pi}_i(1-\hat{\pi}_i)]$ [7].
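As an illustration (not part of the original study), the iteration behind equation (3) can be sketched in R as follows; the function name and defaults are ours, and X is assumed to include an intercept column.

# Minimal IRLS sketch of equation (3)
logit_mle <- function(X, y, tol = 1e-8, max_iter = 50) {
  beta <- rep(0, ncol(X))
  for (iter in 1:max_iter) {
    eta <- drop(X %*% beta)
    pi  <- 1 / (1 + exp(-eta))               # fitted probabilities
    W   <- diag(pi * (1 - pi))               # weight matrix W
    z   <- eta + (y - pi) / (pi * (1 - pi))  # working response Z
    beta_new <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
# Sanity check against R's built-in fit: glm(y ~ X - 1, family = binomial)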

2.1. Logistic Ridge Regression (LRR)

When multicollinearity exists between the independent variables in the logistic model, the matrix $X'WX$ is (near) singular. The maximum likelihood method is then not suitable for estimating the parameters in the model because the matrix cannot be stably inverted. As a result, the maximum likelihood estimates of the logistic model parameters become unstable and cannot be uniquely determined. In this situation, the ridge regression method can be applied, adding a penalty to the diagonal of $X'WX$ to stabilize the coefficient estimates [14,15,16]. Although this method introduces bias into the coefficient estimates, it provides a lower variance of the estimates than the unpenalized model. The ridge estimator of the logistic model is obtained by maximizing the ridge-penalized log-likelihood [17,18,19,20,21]:

$$\ell^{\lambda}(\beta) = \sum_{i=1}^{n}\left[y_i x_i'\beta - \ln\!\left(1 + e^{x_i'\beta}\right)\right] - \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (4)$$

where the ridge penalty is the second summand (the sum of the squares of the elements of $\beta$) with $\lambda \geq 0$ as the penalty parameter. Because the resulting score equations are not linear in $\beta$, the Newton-Raphson method is used to solve them, following the iteratively reweighted least squares algorithm to obtain the estimates. The logistic ridge regression (LRR) estimator following [17] is:

$$\hat{\beta}_R = (X'WX + \lambda I)^{-1}X'WZ \qquad (5)$$

with $W$ and $Z$ as in equation (3) [17].
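In practice a ridge-penalized logistic fit is available off the shelf; the sketch below (ours, not the authors') uses glmnet with alpha = 0 for a pure L2 penalty, choosing lambda by k-fold cross-validation. Note that glmnet parameterizes its penalty differently from equation (4), so lambda values are not directly comparable.

# Logistic ridge regression via glmnet: alpha = 0 selects the L2 penalty
library(glmnet)

fit_lrr <- function(x, y) {
  cv <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # CV over a lambda grid
  coef(cv, s = "lambda.min")  # shrunken, stabilized coefficient estimates
}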

2.2. Least Absolute Shrinkage and Selection Operator (LASSO)

The LASSO method can be used to overcome multicollinearity problems [22]. LASSO shrinks the coefficients of correlated predictors, setting some of them exactly to zero or close to zero [23]. A Lagrangian constraint (the L1-norm of the coefficients) can be combined with the log-likelihood in the parameter estimation of logistic regression [24,25]. Combining the log-likelihood and the Lagrangian constraint gives the penalized criterion:

$$\ell^{\lambda}(\beta) = \sum_{i=1}^{n}\left[y_i x_i'\beta - \ln\!\left(1 + e^{x_i'\beta}\right)\right] - \lambda \sum_{j=1}^{p} |\beta_j| \qquad (6)$$

so that the LASSO estimate of the logistic regression parameters is:

$$\hat{\beta}_{LASSO} = \arg\max_{\beta}\; \ell^{\lambda}(\beta) \qquad (7)$$

Here $\lambda > 0$ is the tuning parameter that controls the strength of the penalty in the LASSO method; it can be obtained using generalized cross-validation [22].
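A hedged sketch of the L1-penalized fit of equations (6)-(7), again using glmnet (alpha = 1); here lambda is tuned by k-fold cross-validation, a common substitute for the generalized cross-validation cited above.

# LASSO logistic regression via glmnet: alpha = 1 selects the L1 penalty
library(glmnet)

fit_lasso <- function(x, y) {
  cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
  coef(cv, s = "lambda.min")  # some coefficients are shrunk exactly to zero
}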

2.3. Principal Component Logistic Regression (PCLR)

In linear regression analysis, principal component regression (PCR) is one of the methods that has been confirmed to overcome the problem of multicollinearity [1,11,26,27]. PCR simplifies the observed variables by reducing the dimension, where the chosen principal components must retain as much of the variability as possible. This is done by removing the correlation between the independent variables through a transformation of the original independent variables into new variables that are completely uncorrelated. In terms of the principal components (PCs) of the predictor variables, the logit transformation (2) can be written in principal component regression form as:

$$L = Z\gamma = XV\gamma \qquad (8)$$

where $Z = XV$ is an $n \times k$ matrix whose columns are the PCs of $X$, $V$ is a $k \times k$ matrix whose columns are the eigenvectors $v_1, \ldots, v_k$ of the matrix $X'X$, and $\gamma = V'\beta$. It is obvious that $\gamma$ can be estimated by

$$\hat{\gamma} = V'\hat{\beta} \qquad (9)$$

The prediction equation of the MLE is $\hat{L} = Z\hat{\gamma}$ with $z_j = x'v_j$, where $z_j$ is the $j$-th PC value for a point $x$. The logit model (8) can then be expressed as

$$L = \sum_{j=1}^{k} z_j \gamma_j \qquad (10)$$

The principal component logistic regression (PCLR) model in terms of the first $s$ PCs is $L_{(s)} = Z_{(s)}\gamma_{(s)}$, the logit transformation retaining only the components $z_1, \ldots, z_s$. The parameter estimate of the PCLR [9] is:

$$\hat{\beta}_{(s)} = V_{(s)}\hat{\gamma}_{(s)} \qquad (11)$$

where the subscript $(s)$ indicates the number of PCs used in the PCLR model.
This method was introduced by Aguilera et al. [10] for solving the problem of high-dimensional multicollinear data in logistic regression with a binary response variable and a set of continuous predictor variables. They showed that the PCLR model provides better estimates of the model parameters than partial least squares (PLS) logistic regression.
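A minimal sketch of PCLR as described above, under our own assumptions (centered, unscaled predictors; illustrative function names): fit the logit on the first s PCs, then map the estimates back to the original variables via equation (11).

# PCLR sketch: logistic fit on the first s principal components
pclr <- function(X, y, s) {
  pc <- prcomp(X, center = TRUE, scale. = FALSE)  # columns of pc$rotation are V
  Z  <- pc$x[, 1:s, drop = FALSE]                 # first s PCs, Z = XV
  fit <- glm(y ~ Z, family = binomial)
  gamma_hat <- coef(fit)[-1]                                         # gamma-hat_(s)
  beta_hat  <- drop(pc$rotation[, 1:s, drop = FALSE] %*% gamma_hat)  # eq. (11)
  list(intercept = unname(coef(fit)[1]), beta = beta_hat)
}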

3. Methods

The performance of the LRR, LASSO and PCLR methods in this study was illustrated through a simulation carried out in R, showing how these methods can improve the estimation of the parameters of a binary logistic model containing partial and full multicollinearity. Six independent variables (p=6) were generated with a prescribed correlation structure among them, and the dependent variable $Y$ was generated from the binary logistic regression probability
$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}.$$
Partial and full multicollinearity between the independent variables was introduced into the model for different sample sizes (n=20, 50, 100, 200), with 1000 replications. Multicollinearity of the independent variables is measured by $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing the $j$-th variable on the remaining ones. The best method for estimating the parameters is evaluated using the MSE over the $R = 1000$ replications:
$$MSE(\hat{\beta}) = \frac{1}{R}\sum_{r=1}^{R}\sum_{j=1}^{p}\left(\hat{\beta}_j^{(r)} - \beta_j\right)^2$$
and the best model is characterized by $AIC = -2\ln(\hat{L}) + 2k$, where $\hat{L}$ is the maximized value of the likelihood function, $n$ is the number of recorded measurements and $k$ is the number of parameters estimated [28,29].
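The exact generating constants did not survive extraction, so the sketch below only illustrates the kind of design described: a McDonald-Galarneau-type scheme for inducing a chosen correlation among the predictors, a logistic response, and the VIF and squared-error quantities defined above. The values of rho, beta and the seed are illustrative assumptions, not the authors' settings.

# Illustrative simulation sketch (rho, beta and seed are assumptions)
set.seed(1)
n <- 100; p <- 6; rho <- 0.99
beta <- rep(1, p)                                   # assumed true coefficients

z <- matrix(rnorm(n * (p + 1)), n, p + 1)
X <- sqrt(1 - rho^2) * z[, 1:p] + rho * z[, p + 1]  # correlated predictors
eta <- drop(X %*% beta)
y   <- rbinom(n, 1, 1 / (1 + exp(-eta)))            # binary response

# VIF_j = 1 / (1 - R_j^2) from regressing X_j on the remaining predictors
vif <- function(X) sapply(seq_len(ncol(X)), function(j)
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared))
vif(X)  # values > 10 indicate serious multicollinearity

# squared-error loss for one replication; averaging over 1000 runs gives the MSE
sq_err <- function(beta_hat) sum((beta_hat - beta)^2)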

4. Results and Discussion

The partial and full multicollinearity structures applied to the independent variables in this study are shown in Table 1: first, partial multicollinearity in which correlation applies only between X1 and X2; second, partial multicollinearity in which correlation occurs only among X1, X2, X3, and X4; third, full multicollinearity among all independent variables. These conditions were applied for all sample sizes studied.
Table 1. Multicollinearity in independent variables for all sample sizes studied
From Table 1 we can see that the VIF values are greater than 10 for all cases considered in this study, indicating that the independent variables are correlated with each other and that multicollinearity is present. To select the method that best overcomes the multicollinearity problems in this study, the MSE value is used: the best method is the one whose MSE is closest to zero. The MSE values of MLE, LRR, LASSO and PCLR for partial and full multicollinearity in the model at different sample sizes are shown in Table 2.
Table 2. MSE of MLE, LRR, LASSO, PCLR
From Table 2, where partial multicollinearity between X1 and X2 occurs in the model, MLE gives MSE = 35608.77 for n=20, MSE = 1112.951 for n=50, and MSE = 0.0820 for n=100. These values are far above the MSEs of LRR, LASSO and PCLR, which give MSE = 0.0857, 0.0170, 0.0207 for n=20; MSE = 0.0506, 0.0085, 0.0175 for n=50; and MSE = 0.0385, 0.0026, 0.0126 for n=100, respectively. Similar results are obtained when partial multicollinearity exists among X1, X2, X3, X4 and when the model contains full multicollinearity. It is obvious that MLE is unable to cope with partial and full multicollinearity between independent variables in logistic regression with binary responses when the sample size is small. For a larger sample size (n=200) the MSE of MLE decreases significantly, but its value is still above the MSEs of LRR, LASSO and PCLR. This suggests that MLE should not be used for estimating the parameters of binary logistic models with partial or full multicollinearity at either small or large sample sizes.
To show more clearly how the LRR, LASSO and PCLR methods handle partial and full multicollinearity across all sample sizes (n=20, 50, 100, 200), we compare the MSEs of the three methods separately from MLE, as shown in Figures 1-3.
Figure 1. MSE of partial multicollinearity in X1, X2
Figure 2. MSE of Partial multicollinearity in X1, X2, X3, X4
Figure 3. MSE of full multicollinearity
Figures 1-3 show the MSEs of LRR, LASSO and PCLR when the binary logistic model contains partial or full multicollinearity at different sample sizes (n=20, 50, 100, 200). It can be seen that the MSE values of LRR, LASSO and PCLR vary depending on the number of correlated variables and the sample size. When partial multicollinearity occurs between X1 and X2 in the model, LASSO gives MSE = 0.0170, 0.0085, 0.0025, and 0.0017 for n=20, 50, 100, and 200, respectively. These values are much lower than the MSE values of LRR and PCLR. However, when partial multicollinearity occurs among X1, X2, X3 and X4 in the model, the results vary: for n=20 and 50, LASSO and LRR give lower MSEs than PCLR; conversely, for n=100 and n=200, PCLR has the lowest MSE compared to LASSO and LRR. This suggests that when partial multicollinearity exists in the binary logistic model, LASSO, LRR, or PCLR can be used depending on the extent of the multicollinearity and the sample size.
In the situation where there is full multicollinearity between the independent variables in the model, we can see from Figure 3 that PCLR has lower MSE values than LRR and LASSO, with MSE = 0.0182, 0.0089, 0.0049, and 0.0006 for n=20, 50, 100, 200, respectively. LRR and LASSO appear unable to overcome full multicollinearity in logistic regression with binary responses; in this case PCLR exceeds them for each sample size studied (n=20, 50, 100, 200). This indicates that PCLR is the best method for overcoming full multicollinearity in the binary logistic model for all sample sizes studied.
To determine the best model, we examine the AIC values of LRR, LASSO and PCLR as shown in Table 3.
Table 3. AIC of LRR, LASSO and PCLR
Based on the AIC values in Table 3, the best model depends on the number of correlated variables in the binary logistic model and on the sample size. This supports the results obtained from the MSE values.

5. Conclusions

We conclude from the results of this study that LRR, LASSO and PCLR surpass MLE in overcoming both partial and full multicollinearity in the binary logistic model. PCLR exceeds LRR and LASSO when full multicollinearity occurs in the binary logistic model, but LASSO and LRR are better choices when partial multicollinearity exists in the model.

ACKNOWLEDGEMENTS

This study was financially supported by University of Lampung. The authors thank the Rector and the Institute of Research and Community Service for their support and assistance.

References

[1]  D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis, New York: Wiley-Interscience, 2012.
[2]  N.R. Draper and H. Smith, Applied Regression Analysis, 3rd ed., New York: Wiley, 1998.
[3]  R.H. Myers, Classical and Modern Regression with Applications, Boston: PWS-KENT Publishing Company, 1990.
[4]  M.H. Kutner, C.J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models, Boston: McGraw-Hill, 2005.
[5]  Midi, H., Sarkar, S.H., and Rana, S., 2010, Collinearity diagnostics of binary logistic regression model, Journal of Interdisciplinary Mathematics, 13(3): 253-267.
[6]  D.W. Hosmer and S. Lemeshow, Applied Logistic Regression. New York: John Wiley & Sons, 2000.
[7]  T.P. Ryan, Modern Regression Methods. New York: Wiley, 1997.
[8]  Schaefer, R.L., 1986, Alternative estimators in logistic regression when the data are collinear, Journal of Statistical Computation and Simulation, 25: 75-91.
[9]  Aguilera, A.M., and Escabias, M., 2000, Principal component logistic regression. In: Bethlehem J.G., van der Heijden P.G.M. (eds) COMPSTAT. Physica, Heidelberg, 175-180.
[10]  Aguilera, A.M., Escabias, M., and Valderrama, M.J., 2006, Using principal components for estimating logistic regression with high-dimensional multicollinear data, Computational Statistics & Data Analysis, 50: 1905-1924.
[11]  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., New York: Springer-Verlag, 2009.
[12]  Toka, O., 2016, A Comparative Study on Regression Methods in the presence of Multicollinearity, Journal of Statisticians: Statistics and Actuarial Sciences, 2: 47-53.
[13]  P. McCullagh, and J.A. Nelder, Generalized Linear Models, 2nd Ed., London: Chapman and Hall, 1989.
[14]  Hoerl, A.E., 1962, Application of ridge analysis to regression problems, Chem. Eng. Prog., 58: 54-59.
[15]  Hoerl, A.E. and Kennard, R.W., 1970, Ridge Regression: Biased Estimation for nonorthogonal problems, Technometrics, 12(1): 55-67.
[16]  Dorugade, A.V. and Kashid, D.N., 2010, Alternative method for choosing ridge parameter for regression, Applied Mathematical Sciences, 4(9): 447-456.
[17]  Schaefer, R.L., Roi, L.D., and Wolfe, R.A., 1984, A ridge logistic estimator, Communications in Statistics - Theory and Methods, 13: 99-113.
[18]  Le Cessie, S. and van Houwelingen, J.C., 1992, Ridge estimators in logistic regression, Applied Statistics, 41(1), 191-201.
[19]  Kibria, B.M.G., Shukur, G., and Mansson, K., 2012, Performance of some logistic ridge regression estimators, Computational Economics, 40: 401-414.
[20]  Wu, J. and Asar, Y., 2016, On almost unbiased ridge logistic estimator for the logistic regression model, Hacettepe Journal of Mathematics and Statistics, 45(3): 989-998.
[21]  Duffy, D.E. and Santner, T.J., 1989, On the small sample properties of norm-restricted maximum likelihood estimators for logistic regression models, Communications in Statistics - Theory and Methods, 18: 959-980.
[22]  Tibshirani, R., 1996, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58(1): 267-288.
[23]  G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. New York: Springer-Verlag, 2013.
[24]  Hastie, T., Tibshirani, R., and Wainwright, M., Statistical Learning with Sparsity, The LASSO and Generalizations. New Jersey: CRC Press, 2015.
[25]  Fonti, V. and Belitser, E., 2017, Feature Selection using LASSO, Research Paper in Business Analytics, VU Amsterdam, pp. 1-25.
[26]  I.T. Jolliffe, Principal Component Analysis, 2nd ed., New York: Springer-Verlag, 2002.
[27]  Herawati, N., Nisa, K., Setiawan, E., Nusyirwan and Tiryono, 2018, Regularized Multiple Regression Methods to Deal with Severe Multicollinearity, International Journal of Statistics and Applications, 8(4): 167-172.
[28]  Akaike, H., 1973, Information theory and an extension of the maximum likelihood principle, in B.N. Petrov and F. Csaki (eds), Second International Symposium on Information Theory (pp. 267-281), Budapest: Akademiai Kiado.
[29]  Akaike, H., 1974, A new look at the statistical model identification, IEEE Transactions on Automatic Control, 19: 716-723.