What's the state-of-the-art #rstats #MachineLearning approach to penalized regression/variable selection with n = 1900, p = 1000 (all binary predictors, many with low base rates)? Goal = maximize validated (out-of-sample) predictive validity. Interpretability is nice but not necessary.
A full linear fit works, but it overfits heavily because there are too many predictors. Most of the predictors have little to no validity when used alone in a simple t-test (last 2 pics). pic.twitter.com/3DRAmTASmY