International Journal of Statistics and Applications
p-ISSN: 2168-5193 e-ISSN: 2168-5215
2017; 7(5): 239-249
doi:10.5923/j.statistics.20170705.01
Haider R. Mannan
Translational Health Research Institute and School of Medicine, Western Sydney University, New South Wales, Australia
Correspondence to: Haider R. Mannan, Translational Health Research Institute and School of Medicine, Western Sydney University, New South Wales, Australia.
Copyright © 2017 Scientific & Academic Publishing. All Rights Reserved.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/
In public health, and in applied research generally, analysts frequently use automated variable selection methods to identify independent predictors of an outcome. However, these methods can cause noise variables to be mistakenly identified as independent predictors of the outcome, and they lead to overestimated effect sizes and underestimated standard errors and p values. Methods exist for correcting p values after forward selection (Taylor and Tibshirani, 2015) and after a wider range of automated variable selection methods (Brombin et al., 2007), but they are not yet directly available in any software. We assess the performance of epidemiologic logistic regression models selected by forward, backward and stepwise variable selection against models selected by forced entry using multiple bootstrap samples. Potential predictors were first selected from a list of candidate variables by univariate logistic regression and then screened to eliminate collinear variables. Harrell (2001) showed that variable selection by forced entry regression based on multiple bootstrap samples is a simple and acceptable method. The metrics used to evaluate model performance were effect sizes (odds ratios) and p values. The analysis was demonstrated using a sample from the original Framingham study, predicting the odds of an incident cardiovascular event from 10 potential predictors. SAS macros are provided to perform the analyses. The results showed that a noise predictor (VLDL cholesterol) was selected only by the forward variable selection method. Regression coefficients and effect sizes were overestimated for the independent predictors selected by automated methods, and the degree of overestimation was higher for forward selection than for the other two automated methods. The method presented provides a convenient way to assess the independent predictors selected by automated methods and their estimated effect sizes. The SAS macros are easy to follow and implement, and they can be readily adapted to other datasets involving a range of predictors and any binary outcome variable.
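The general idea of refitting a forced-entry logistic model on multiple bootstrap samples can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's SAS macros: it is written in Python using only the standard library, the data are simulated (not the Framingham sample), and every name in it (`fit_logistic`, `bootstrap_odds_ratios`, `percentile_summary`, the predictors `x1` and `x2`) is hypothetical. One simulated predictor has a true effect and the other is pure noise.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_logistic(X, y, iters=25):
    """Forced-entry logistic regression (all columns kept), fitted by Newton-Raphson.
    Each row of X must already contain a leading 1.0 for the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta

def bootstrap_odds_ratios(X, y, n_boot, rng):
    """Refit the full model on n_boot resamples drawn with replacement;
    return one list of odds ratios (intercept excluded) per resample."""
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        beta = fit_logistic([X[i] for i in idx], [y[i] for i in idx])
        draws.append([math.exp(b) for b in beta[1:]])
    return draws

def percentile_summary(values):
    """Approximate 2.5th percentile, median and 97.5th percentile."""
    s = sorted(values)
    b = len(s)
    return s[int(0.025 * b)], s[b // 2], s[int(0.975 * b) - 1]

# Simulated data (an assumption for illustration): x1 has a true effect, x2 is noise.
rng = random.Random(2017)
X, y = [], []
for _ in range(200):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    prob = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * x1)))
    X.append([1.0, x1, x2])
    y.append(1 if rng.random() < prob else 0)

draws = bootstrap_odds_ratios(X, y, n_boot=100, rng=rng)
summary = {
    name: percentile_summary([d[j] for d in draws])
    for j, name in enumerate(["x1", "x2"])
}
for name, (lo, med, hi) in summary.items():
    print(f"{name}: median OR {med:.2f}, 95% percentile interval ({lo:.2f}, {hi:.2f})")
```

In a sketch like this, a predictor whose bootstrap interval for the odds ratio comfortably excludes 1 would be treated as supported, and the bootstrap median can be compared with the odds ratio reported by an automated selection procedure to gauge overestimation. The number of resamples and the percentile summary are illustrative choices, not settings taken from the paper.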
Keywords: Automated methods, Model assessment, Bootstrapping, Forced entry, SAS macros, Framingham cohort
Cite this paper: Haider R. Mannan, A Practical Application of a Simple Bootstrapping Method for Assessing Predictors Selected for Epidemiologic Risk Models Using Automated Variable Selection, International Journal of Statistics and Applications, Vol. 7 No. 5, 2017, pp. 239-249. doi: 10.5923/j.statistics.20170705.01.