Feature selection in machine learning refers to the process of isolating only those variables (or “features”) in a dataset that are pertinent to the analysis. Failure to do this effectively has many drawbacks, including: 1) unnecessarily complex models with difficult-to-interpret outcomes, 2) longer computing time, and 3) collinearity and overfitting. Effective feature selection eliminates redundant variables and keeps only the best subset of predictors in the model, thus making it possible to represent the data in the simplest way.
This post begins by identifying steps that must be taken to prepare datasets for meaningful analysis—and how machine learning can help. We then introduce and discuss some commonly used machine learning techniques for variable selection.
Real world data contains a wide range of holes, noise, and inconsistencies. Before doing any statistical analysis, it is crucial to ensure that the data can be meaningfully analyzed. In practice, data cleansing is often the most time-consuming part of data analysis. This upfront investment is necessary, however, because the quality of data has a direct bearing on the reliability of model outputs.
Various machine learning projects require different sorts of data cleansing steps, but in general, when people speak of data cleansing, they are referring to the following specific tasks.
Cleaning Missing Values
Many machine learning techniques do not support data with missing values. To address this, we first need to understand why data are missing. Missing values usually occur simply because no information is provided, but other circumstances can lead to data holes as well. For instance, setting incorrect data types for attributes when data is extracted and integrated from multiple sources can cause data loss.
One way to investigate missing values is to identify patterns for missing data. For example, missing answers for certain questions from female respondents in a survey may indicate that those questions are only asked of male respondents. Another example might involve two loan records that share the same ID. If the second record contains blank values for every attribute except ‘Market Price,’ then the second record is likely simply updating the market price of the first record.
Once the early-stage evaluation of missing data is complete, we can set about determining how to address the problem. The easiest way to handle missing values is simply to ignore the records that contain them. However, this solution is not always practical. If a relatively large portion of the dataset contains missing values, then removing all of them could result in remaining data that may not be a good representation of the initial population. In that case, rather than filtering out relevant rows or attributes, a more proper approach is to impute missing values with sensible values.
A typical imputing method for categorical variables involves replacing the missing values with the most frequent value or with a newly created “unknown” category. For numeric variables, missing values might be replaced with mean or median values. Other, more advanced methods for dealing with missing values, e.g., listwise deletion for deleting rows with missing data and multiple imputation for substituting missing values, exist as well.
Reducing Noise in Data
“Noise” in data refers to erroneous values and outliers. Noise is an unavoidable problem which can be caused by human mistakes in data entry, technical problems, and many other factors. Noisy data adversely influences model performance, so its detection and removal has a key role to play in the data cleaning process.
There are two major noise types in data: class noise and attribute noise. Class noise often occurs in categorical variables and can include: 1) non-standardized class labels, 2) duplicate records mapping to different class labels, and 3) mislabeled records. Attribute noise refers to corruptive values and outliers, such as percentages inappropriately greater than 100% and placeholders (e.g., 999,000).1
There are many ways to deal with noisy data. Certain type of noise can be easily identified by sorting the data—thus isolating text input where numeric input is expected and other placeholders. Other noise can be addressed only using statistical methods. Clustering analysis groups the data by similarity and can help with detecting irrelevant objects and outliers. Data binning is used to reduce the impact of observation errors by combining ‘neighborhood’ data into a small number of bins. Advanced smoothing algorithms, including moving average and loess, fit the data into regression functions to eliminate the effect due to random variation and allow important patterns to stand out.
Data normalization converts numerical values into specific ranges to meet the needs of a model. Performing data normalization makes it possible to aggregate data with different scales. Several algorithms require normalized data. For example, it is necessary to normalize data before feeding into principal component analysis (PCA) so that all variables have zero mean and unit variance and therefore the same weight. This also applies when performing support vector machines (SVM), which assumes that the input data is in range [0,1] or [-1,1]. Unnormalized data slows down model convergence time and skews results.
The most common way of normalizing data involves Z-score. Also known as standard-score normalization, this approach normalizes the error by dividing the difference between the data and mean by standard deviation. Z-score normalization is often used when min and max are unknown. Another common method is feature scaling, which brings all values into range [0,1] by dividing the difference between the data and min by the difference between max and min. Other normalization methods include studentized residual, t-statistics, and coefficient of variation.
Feature Selection Methods2
A stepwise procedure adds or subtracts individual features from a model until the optimal mix is identified. Stepwise procedures take three forms: backward elimination, forward selection, and stepwise regression.
Backward elimination is the simplest method. It fits the model using all available features and then systematically removes features one at a time, beginning with the feature with the highest p-value (provided the p-value exceeds a given threshold, usually 5%). The model is refit after each elimination and process loops until a model is identified in which each feature’s p-value falls below the threshold.
Forward selection is the opposite of backward elimination. It includes no variables in the model at first and then systematically adds features one at a time, beginning with the lowest p-value (provided the p-value falls below a threshold). The model is refit after each addition and loops until additional features do not help model performance.
Stepwise regression combines backward elimination and forward selection by allowing a feature to be added or dropped at each iteration. Using this method, a newly added variable in an early stage may be removed later, and vice versa.
A variable’s p-value is not the only statistic that can be used for feature selection. Penalized-likelihood criteria, such as akaike information criterion (AIC) and bayesian information criterion (BIC), are also valuable. Lower AICs and BICs indicate that a model is more likely to be true. They are given as: nlog (RSS/n) + kp, where RSS is residual sum of square (which decreases as the model complexity increases), n is sample size, p is numbers of predictors, and k is two for AIC and log(n) for BIC. Both criteria penalize larger models as p goes up, and BIC penalizes model complexity more heavily, which explains why BIC tends to favor smaller models in comparison to AIC. Other criteria are 1) Adjusted R2, which increases only if a new feature improves model performance more than expected, 2) PRESS, summing up squares of predicted residuals, and 3) Mallow’s Cp Statistic, estimating the average MSE of prediction.
Lasso and Ridge Regression
Lasso and ridge regressions are powerful techniques for dealing with large feature coefficients. Both approaches reduce overfitting by penalizing features with large coefficients and minimizing the difference between predicted value and observation, but they differ when adding penalized terms. Lasso adds a penalty term equivalent to the absolute value of the magnitude of coefficients, so that it zeros out target variables’ coefficients and eliminates them from the model. Ridge assigns a penalty equivalent to square of the magnitudes of the coefficients. Even though it does not shrink the coefficient to zero, it can regularize and constrain the coefficients to control variance.
Lasso and ridge regression models have been widely used in finance since their introduction. A recent example used both these methods in predicting corporate bankruptcy.3 In this study, the authors discovered that these regression methods are optimal as they handle multicollinearity and minimize the numerical instability that may occur due to overfitting.
“Dimensionality reduction” is a process of transforming an extraordinarily complex, “high-dimensional” dataset (i.e., one with thousands of variables or more) into a dataset that can tell the story using a significantly smaller number of variables.
The most popular linear technique for dimensionality reduction is principal component analysis (PCA). It converts complex dataset features into a new set of coordinates named principal components (PCs). PCs are created in such a way that each succeeding PC preserves the largest possible variance under the condition that it is uncorrelated with the preceding PCs. Keeping only the first several PCs in the model reduces data dimensionality and eliminates multi-collinearity among features.
PCA has a couple of potential pitfalls: 1) PCA is sensitive to the scale effects of the original variables (data normalization is required for performing PCA), and 2) Applying PCA to the data will hurt its ability to interpret the influence of individual features since the PCs are not real variables any more. For these reasons, PCA is not a good choice for feature selection if interpretation of results is important.
Dimensionality reduction and specifically PCA have practical applications to fixed income analysis, particularly in explaining term-structure variation in interest rates. Dimensionality reduction has also been applied to portfolio construction and analytics. It is well known that the first eigenvector identified by PCA maximally captures the systematic risk (variation of returns) of a portfolio.4 Quantifying and understanding this risk is essential when balancing a portfolio.
 Pereira, J. M., Basto, M., & da Silva, A. F. (2016). The Logistic Lasso and Ridge Regression in Predicting Corporate Failure. Procedia Economics and Finance, v.39, pp.634-641.
 Alexander, C. (2001). Market models: A guide to financial data analysis. John Wiley & Sons.