Stepwise selection of variables in regression is Evil


Stepwise variable selection is bad and dangerous, and you shouldn't do it. It increases false positives. It drops variables that should be in the model. It gives biased estimates for regression coefficients. The problems are worse with smaller samples, with higher correlation between the X variables, and in models with weaker explanatory power for y (i.e. lower R-squared).

Most of the time that I see the method used (including recent examples being distributed by so-called experts as part of their online teaching), the final model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Rather than include a whole bunch of individual cases, I ran some more simulations covering a range of such values, so we can see how the average bias in the estimated regression coefficients that remain in the model relates to those parameters.

If it's explanation you're after, use theory-driven model selection; Bayesian methods are a good complement to that, and they force you to think about the problem. For regression-based prediction, use lasso or elastic net regularization.
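
The three complaints above (false positives, dropped real variables, biased coefficients) can be seen in a small simulation. What follows is a minimal sketch under stated assumptions, not the simulation code behind the results described here. It uses backward elimination with a |t| >= 1.96 cutoff (a normal approximation to the usual 5% test), independent X variables, and made-up settings (50 observations, 3 real predictors with coefficient 0.3, 12 pure-noise predictors). The function names `ols_t`, `backward_stepwise` and `simulate` are hypothetical helpers invented for this illustration.

```python
import numpy as np


def ols_t(design, y):
    """OLS coefficients and t-statistics; `design` must already include an intercept column."""
    n, k = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard errors
    return beta, beta / se


def backward_stepwise(X, y, t_crit=1.96):
    """Repeatedly drop the predictor with the smallest |t| until all remaining |t| >= t_crit.

    Returns the kept column indices and their estimated coefficients.
    """
    n = X.shape[0]
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(n), X[:, keep]])
        beta, t = ols_t(design, y)
        abs_t = np.abs(t[1:])                 # ignore the intercept
        worst = int(np.argmin(abs_t))
        if abs_t[worst] >= t_crit:
            return keep, beta[1:]
        keep.pop(worst)
    return [], np.array([])


def simulate(n_sims=500, n=50, n_real=3, n_noise=12, true_beta=0.3, seed=0):
    """Assumed setup: the first n_real columns truly affect y, the rest are pure noise."""
    rng = np.random.default_rng(seed)
    p = n_real + n_noise
    sims_with_noise = 0
    real_kept = 0
    real_estimates = []
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        y = X[:, :n_real] @ np.full(n_real, true_beta) + rng.standard_normal(n)
        keep, beta = backward_stepwise(X, y)
        if any(j >= n_real for j in keep):
            sims_with_noise += 1
        for j, b in zip(keep, beta):
            if j < n_real:
                real_kept += 1
                real_estimates.append(b)
    return {
        "share_sims_with_noise_kept": sims_with_noise / n_sims,
        "share_real_dropped": 1 - real_kept / (n_sims * n_real),
        "mean_kept_real_estimate": float(np.mean(real_estimates)),
        "true_beta": true_beta,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the average estimate for a real variable that survives selection sits above the true 0.3, because a variable is only kept when its estimate happens to be large enough to pass the cutoff. Changing `n`, `true_beta` or the number of noise columns gives a rough feel for how the three problems respond to sample size and explanatory power; adding correlation between the columns of `X` would need a different data-generating step than the one sketched here.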
