There are two main challenges when implementing a machine learning solution: building a model that performs well and effectively leveraging the results. Having a good understanding of the machine learning process and model being used is key to tackling both issues. Using a predictive model without appropriately understanding it can substantially increase risk and lead to missed opportunities. If the performance of a model is unclear, misunderstood, or overestimated then subsequent decisions will be biased or outright wrong. Likewise, if the ability of a model is underestimated then its use will not be optimized.

With these challenges in mind, this blog post serves as a comprehensive introduction to machine learning – a peek into the black box to see why machine learning and its underlying concepts are powerful and what the different parts of the process are trying to accomplish.

## Machine Learning Basics

Machine learning is powerful because it can solve challenging problems. The term machine learning was first defined by Arthur Samuel as “a field of study that gives computers the ability to learn without being explicitly programmed.”[1] This definition points to the magic of machine learning – the ability of a computer to iteratively learn an optimal solution. Given a set of data points, complex relationships can be learned that may have otherwise been missed and vast quantities of data can be rapidly consumed and examined for insights.

With problem solving at its center, the machine learning landscape can be broadly divided into two areas: *supervised machine learning* and *unsupervised machine learning*. *Supervised machine learning* contains a set of methods for discovering the relationship between data when the problem has a defined target variable. For example, if the problem is to predict if a loan will default, the method used will be supervised, and the target will have a value of either yes or no based on whether the loan defaulted. This problem can be further defined as *classification* because the target in this example is categorical. If the target is a continuous variable (e.g. dollars lost on a defaulted loan) then the problem falls into the *regression* category.

When the problem does not include a defined target, then *unsupervised machine learning* methods will be used. Imagine if the problem is to identify different subgroups within the mortgage population. Even though there is no subgroup label in the data, it is possible to discover these subgroups using unsupervised learning techniques.

Once the problem is defined and the type of learning has been selected, the machine learning process generally includes these steps:

- Variable selection and creation
- Model selection
- Model evaluation
- Model tuning
- Implementation

These steps overlap quite often, but each has a distinct goal that can be discussed individually.

## Variable Selection and Creation

The process of choosing a subset of available variables to keep in a final machine learning model is known as variable selection. The main benefits of variable selection include reduced model complexity (making models easier to interpret), improved model performance, reduced training time, and a reduction in *overfitting* (this will be covered in the model evaluation section). The goal of variable selection is to end with a set of variables that captures the relationship as well as possible without providing extraneous information (noise) that will hurt model performance. The two major considerations for variable selection are data quality and information contribution.

### Data Quality

The saying “garbage in, garbage out” applies to variable selection. If variables are included in the model that have lots of missing values, outliers (abnormally high/low values), or other quality issues, then the model will suffer as it learns. The model is going to learn the optimal solution based on the data provided – if the data does not represent the real world, neither will the resulting model.

Generating descriptive statistics is a great way to explore data quality. Variable counts (non-NA values), measures of central tendency (mean, median, mode), standard deviation, and distribution plots will help identify data quality issues. There are also automated methods for detecting outliers that may need to be removed from the data.
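As a rough illustration of the statistics described above, the sketch below summarizes a single variable using only the Python standard library. The `describe` helper and the loan balance values are hypothetical and exist only for this example; `None` is assumed to mark a missing value.

```python
import statistics


def describe(values):
    """Summarize one variable; None is assumed to mark a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                  # non-NA values
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical loan balances (in thousands); 950 looks like an outlier.
balances = [120, 135, None, 128, 950, 131, None, 125]
summary = describe(balances)
print(summary)
```

In this made-up data the mean sits far above the median, which is one simple signal that an extreme value deserves a closer look.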

There are a few options when dealing with missing data points: dropping rows, dropping variables, filling based on business rules, or substituting with the mean value. A new option that has been developed by RiskSpan is to use machine learning to fill in missing data with their most likely value. Each of these methods has pros and cons and should be considered carefully. There is no silver bullet to solving data quality issues; the guiding principle is to end with a dataset that is as representative of the real world as possible.
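Two of the simpler options above, dropping rows and substituting the mean, can be sketched as follows. This is a minimal example under stated assumptions: the helper names and the (credit score, loan-to-value) rows are invented for illustration, and `None` again stands in for a missing value.

```python
import statistics


def drop_missing_rows(rows):
    """Keep only rows with no missing values."""
    return [row for row in rows if None not in row]


def fill_with_mean(rows, column):
    """Replace missing values in one column with that column's mean."""
    present = [row[column] for row in rows if row[column] is not None]
    mean_value = statistics.mean(present)
    return [
        tuple(
            mean_value if (i == column and value is None) else value
            for i, value in enumerate(row)
        )
        for row in rows
    ]


# Hypothetical rows of (credit score, loan-to-value ratio).
rows = [(720, 0.80), (None, 0.95), (680, 0.75), (700, None)]

complete = drop_missing_rows(rows)      # loses half of the rows
filled = fill_with_mean(rows, column=0)  # keeps every row
print(complete)
print(filled)
```

Dropping rows discards information that was present in the other columns, while mean substitution keeps the rows but flattens the variable's distribution, which is the kind of tradeoff that should be weighed for each dataset.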

### Information Contribution

Model performance can be improved by providing only variables that are important (i.e. contribute to performance). This is especially useful when the number of available variables is very large. Methods for selecting variables based on their contribution include filter methods, wrapper methods, embedded methods, and dimensionality reduction.

Filter methods select variables by eliminating those that do not meet a certain threshold. One of the simplest thresholds to use is the *variance* of each variable. *Variance* is a measure of how much the values of a variable are spread out. When the variance level is very low there is a good chance that the model will struggle as it attempts to find a nonexistent relationship in the data. Two other prominent measures used to select variables include the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). These measures are powerful because they judge whether an added variable improves model fit enough to justify the added complexity.
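A variance filter of the kind described above might look like the following sketch. The variable names, values, and the threshold of 0.01 are assumptions chosen for illustration; because variance depends on a variable's units, a real threshold would need to account for scale.

```python
import statistics


def variance_filter(columns, threshold):
    """Return the names of variables whose sample variance exceeds threshold."""
    return [
        name
        for name, values in columns.items()
        if statistics.variance(values) > threshold
    ]


# Hypothetical variables; "is_mortgage" is constant and carries no information.
columns = {
    "interest_rate": [3.5, 4.0, 4.5, 5.0, 3.0],
    "is_mortgage": [1, 1, 1, 1, 1],
    "loan_age": [12, 48, 7, 30, 60],
}

kept = variance_filter(columns, threshold=0.01)
print(kept)
```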

Wrapper methods use iterative processes to reduce the number of features selected. When performing *forward selection*, the model begins with no features. In each iteration, the variable that improves the model most is added until the addition of more variables does not lead to an increase in performance. Alternatively, *recursive feature elimination* begins by building a model with all available variables and then setting aside the weakest feature. This process continues iteratively until all variables have been eliminated, at which point they are ranked and a best subset can be selected.
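Forward selection can be sketched in plain Python as shown here. Several simplifications are assumed: the model is ordinary least squares, performance is measured by the sum of squared errors on the training data (a validation score would normally be preferable), and the `min_gain` stopping threshold, helper names, and data are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(features, target):
    """Sum of squared errors of a least-squares fit with an intercept."""
    rows = [[1.0] + [f[i] for f in features] for i in range(len(target))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(rows, target)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum(
        (y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, target)
    )


def forward_selection(candidates, target, min_gain):
    """Greedily add the variable that reduces the error most, until gains stall."""
    selected = []
    best = sse([], target)  # intercept-only model
    while len(selected) < len(candidates):
        scores = {
            name: sse([candidates[n] for n in selected + [name]], target)
            for name in candidates
            if name not in selected
        }
        name = min(scores, key=scores.get)
        if best - scores[name] < min_gain:
            break  # no remaining variable improves the model enough
        selected.append(name)
        best = scores[name]
    return selected


# Hypothetical data: the target depends on x1 and x2 but not on x3.
candidates = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [5, 3, 4, 4, 3, 5],
}
target = [2 * a + 3 * b for a, b in zip(candidates["x1"], candidates["x2"])]

selected = forward_selection(candidates, target, min_gain=1.0)
print(selected)
```

The small linear solver assumes the candidate variables are not perfectly collinear; a production implementation would rely on a numerical library instead.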

Embedded variable selection methods are characterized by being a direct part of the model being built. One of the most popular methods is the *lasso* regression model. Recall that when performing a regression, the goal is to predict the value of a continuous variable. This is often accomplished by minimizing the distance (known as error) between the observed value of the target and a value that is generated by the model as a prediction. In doing so, coefficients are created for each variable that explain the impact on the prediction. When conducting lasso regression, a penalty term is added that constrains the size of the variable coefficients. The lasso model will often shrink the worst performing variable coefficients to 0, effectively eliminating them from the model.
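To make the shrink-to-zero behavior concrete, here is a minimal coordinate descent sketch of lasso regression. It assumes centered variables with no intercept and the objective (1/2n)·Σ(error²) + α·Σ|coefficient|, which is one common convention; the function names and data are hypothetical.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha, returning exactly 0.0 inside the band."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso(columns, target, alpha, sweeps=100):
    """Coordinate descent for lasso on centered data without an intercept."""
    n = len(target)
    coefs = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, column in enumerate(columns):
            # Residual of the target after removing every variable except j.
            partial = [
                y - sum(coefs[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i, y in enumerate(target)
            ]
            rho = sum(x * r for x, r in zip(column, partial)) / n
            scale = sum(x * x for x in column) / n
            coefs[j] = soft_threshold(rho, alpha) / scale
    return coefs


# Hypothetical centered data: the target is driven almost entirely by "signal".
signal = [-2, -1, 0, 1, 2]
noise = [1, -2, 0, 2, -1]
target = [-5.8, -3.4, 0.0, 3.4, 5.8]

unpenalized = lasso([signal, noise], target, alpha=0.0)
penalized = lasso([signal, noise], target, alpha=0.5)
print(unpenalized)  # both coefficients are nonzero
print(penalized)    # the noise coefficient is shrunk to exactly 0.0
```

With no penalty the weak variable keeps a small coefficient; with the penalty applied it drops out of the model entirely while the strong variable's coefficient is only reduced.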

Rather than removing variables from a model, dimensionality reduction techniques solve the variable selection problem by constructing a new smaller set of variables from the original dataset. One of the most popular techniques for dimensionality reduction is *principal component analysis*, which has two goals. First, the newly created variables (called components) are created in a way that preserves as much variance as possible (recall that having low variance can hurt performance). Second, the variables are created so that the original variables can be reconstructed as well as possible. An added benefit of principal component analysis is that the resulting variables are *not* correlated with each other, which avoids the problem known as *multicollinearity*. Multicollinearity is problematic because it can be impossible to determine which variable is impacting the model when the variables are correlated with each other. This greatly reduces model interpretability and applicability.
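The variance-preserving goal can be illustrated with a small sketch that finds only the first principal component of two correlated variables. It uses power iteration on the covariance matrix, which is one simple way to obtain the leading component and is assumed here for brevity; the function names and data are hypothetical.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    size = len(cov)
    vec = [1.0] * size
    for _ in range(iterations):
        vec = [sum(cov[i][k] * vec[k] for k in range(size)) for i in range(size)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(
        vec[i] * sum(cov[i][k] * vec[k] for k in range(size)) for i in range(size)
    )
    return vec, eigenvalue


# Hypothetical pair of strongly correlated variables.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

cov = covariance_matrix([x, y])
direction, eigenvalue = first_component(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance kept
print(direction, explained)
```

Because the two variables move together, a single component retains nearly all of their combined variance, which is why one new variable can stand in for both.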

## Model Selection

Selecting the right machine learning model is crucial because models can vary significantly in their performance, interpretability, and speed based on the type of problem. For some problems and data types, certain models have been established as top performers. For this reason, seeking domain experience and research can be beneficial when approaching a new problem.

The first step in selecting a machine learning model is to define the type of problem. As mentioned previously, if the problem requires *supervised learning* and the target variable is a quantity, then the model will be in the *regression* family. If the target is a categorical variable, then the model will be *classification*. If the problem is *unsupervised* then the model will most likely be a *clustering* algorithm (though there are other unsupervised models, such as dimensionality reduction).

When selecting a model, it is important to consider the model *bias-variance tradeoff*. In this context *bias* refers to error that is a result of flawed model assumptions. For example, no matter how well a linear regression model is trained, it will always struggle to capture nonlinear relationships. It is the underlying assumption of linearity that increases bias and results in increased error when a nonlinear relationship is present. To combat this problem and capture nonlinear relationships, polynomial regression, smoothing splines, or other methods can be used. However, this comes at the cost of increased error due to model sensitivity to fluctuations, also known as *variance*.

In the example below, the bias of the linear regression line (blue) results in a poorly fitting model for the data points generated by the true function (black). The line created by the smoothing spline (grey) fits the data points well, but is significantly different from the true function and has a high *variance*. The second order polynomial model (red) fits better than the linear model, but is outperformed significantly by the third order polynomial (yellow), which is the best estimate of the true function.

#### Figure One: The Bias-Variance Tradeoff

The goal of model selection is to pick a model that balances the bias-variance tradeoff, successfully captures the true relationship, and is effective at predicting on unseen data. To accomplish this, it is common to train and evaluate multiple models using *cross-validation*.

## Model Evaluation

To properly evaluate a model, the data is first segmented into *training* and *test* sets and the *test* set is put aside. Next, the *training* set is split again so that a third set of *validation* data is created. The model is first trained on the training set, which returns a model with the parameters fit to the data. Once the model is fit, the *validation* set is used to evaluate how well the model learned. There are several metrics available to evaluate *unsupervised* learning methods, but they are highly dependent on the type of problem and specific model selected. For *supervised* learning, models are evaluated by using the fitted model to generate a set of predictions from the validation data and then comparing the predictions to the actual target values.
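The two-stage split described above could be sketched like this. The fractions, the seed, and the `split` helper are assumptions made for the example; the only fixed idea is that the test set is carved off first and the validation set is then taken from what remains.

```python
import random


def split(rows, test_fraction, validation_fraction, seed):
    """Shuffle rows, set aside a test set, then split off a validation set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test, remaining = shuffled[:n_test], shuffled[n_test:]

    n_validation = round(len(remaining) * validation_fraction)
    validation, training = remaining[:n_validation], remaining[n_validation:]
    return training, validation, test


# Hypothetical dataset of 100 row identifiers.
rows = list(range(100))
training, validation, test = split(
    rows, test_fraction=0.2, validation_fraction=0.25, seed=0
)
print(len(training), len(validation), len(test))
```

Every row lands in exactly one of the three sets, so the test set stays untouched until the final evaluation.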

This process of generating predictions and evaluating accuracy can be completed multiple times by alternating what portion of the training data is split off as the validation set. This is called *cross-validation*. For example, using *10-fold* cross-validation, the training set would be split into ten equal parts. The model would be trained on the first nine subsets, and then validated using the last one. This process continues for ten iterations, each time holding out a different subset to validate against. This will result in ten model scores which can be analyzed.
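A 10-fold procedure like the one just described can be sketched with a simple one-variable linear regression as the model. The helper names, the mean squared error metric, and the data are assumptions for this example; rows are taken in their existing order, so real data would usually be shuffled first.

```python
def k_fold_indices(n_rows, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n_rows) if i < start or i >= start + size]
        yield training, validation
        start += size


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single variable."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k):
    """Return one validation mean squared error per fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return scores


# Hypothetical data: a linear trend with a small alternating disturbance.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

scores = cross_validate(xs, ys, k=10)
print(scores)
```

The spread of the ten scores is as informative as their average, since a model whose score swings widely between folds is sensitive to which rows it was trained on.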

Cross-validation is powerful because it evaluates model prediction accuracy and the bias-variance tradeoff. If there is a large decrease in performance between the training and validation sets, then it is likely that the model has learned the training data too well. This is analogous to memorizing the answers rather than learning the rules. This problem is called *overfitting* and occurs when the model *variance* is too high. Overfitting is a serious concern because the model gives the appearance of being accurate but will fail when predicting on new data in a production environment. The opposite problem, *underfitting*, may occur due to high model *bias*. This can be identified when the model cross-validation results are poor (though the possibility exists that no relationship is present in the data).
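The "memorizing the answers" analogy can be shown with a deliberately extreme sketch: a nearest-neighbor model that simply stores the training data, compared against a straight-line fit. Both models, the helper names, and the data are chosen purely for illustration.

```python
def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def memorizer_predict(train_x, train_y, x):
    """Return the target of the closest training point (1-nearest neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single variable."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Hypothetical noisy observations of a roughly linear relationship.
train_x, train_y = [0, 2, 4, 6, 8], [1.2, 5.9, 9.1, 13.8, 17.3]
val_x, val_y = [1, 3, 5, 7], [3.6, 7.3, 11.7, 15.4]

memorizer_train = mse(train_y, [memorizer_predict(train_x, train_y, x) for x in train_x])
memorizer_val = mse(val_y, [memorizer_predict(train_x, train_y, x) for x in val_x])

slope, intercept = fit_line(train_x, train_y)
line_train = mse(train_y, [slope * x + intercept for x in train_x])
line_val = mse(val_y, [slope * x + intercept for x in val_x])

print("memorizer:", memorizer_train, memorizer_val)
print("line:     ", line_train, line_val)
```

The memorizing model is perfect on the training data and much worse on the validation data, which is the gap that signals overfitting. The line has a small error on both sets.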

## Model Tuning

Once a model has been selected it must be tuned. Most machine learning models have *hyperparameters* that are not learned automatically and must be set by the researcher. Recall the *lasso* model from the variable selection section. This model contains a regularization hyperparameter *α* that will shrink the regression coefficients. As α increases, the amount that the coefficients shrink will increase. There are many methods for determining at what level α should be set, including random search, grid search, and Bayesian methods. Random and grid search are brute force methods that will build many models, each with a different value for α, and return the value that performs best. Bayesian tuning methods are a relatively new and growing area of machine learning that work by iteratively trying a parameter value and then generating a predicted value that is expected to perform best. This continues until the process converges to an optimal value.
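Grid search over α can be sketched for the simplest possible case, a one-variable lasso on centered data, where the fitted slope has a closed form. The grid values, the helper names, and the data are assumptions, and a single validation set is used here for brevity rather than full cross-validation.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha, returning 0.0 inside the band."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def fit_lasso_slope(xs, ys, alpha):
    """Closed-form lasso slope for one centered variable and no intercept."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    scale = sum(x * x for x in xs) / n
    return soft_threshold(rho, alpha) / scale


def validation_mse(slope, xs, ys):
    """Mean squared error of the fitted slope on held-out data."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(alphas, train, validation):
    """Fit one model per alpha and return the alpha with the lowest validation error."""
    scores = {
        alpha: validation_mse(fit_lasso_slope(*train, alpha), *validation)
        for alpha in alphas
    }
    return min(scores, key=scores.get), scores


# Hypothetical data: the noisy training set slightly overstates the true slope of 2.
train = ([-2, -1, 0, 1, 2], [-4.6, -2.2, 0.1, 2.3, 4.4])
validation = ([-1.5, -0.5, 0.5, 1.5], [-3.0, -1.0, 1.0, 3.0])

best_alpha, scores = grid_search([0.0, 0.1, 0.5, 1.0, 2.0], train, validation)
print(best_alpha, scores)
```

In this contrived data a moderate penalty wins: too little shrinkage leaves the slope overstated, and too much pulls it below the true relationship.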

Model tuning is crucial as it can substantially increase model performance. To ensure that the correct hyperparameters are selected, the tuning process is performed within cross-validation as part of the model selection process. Once the model is finished tuning, it can be evaluated on the *test* set that was set aside at the beginning of the cross-validation process. This will deliver the most accurate picture of the model's ability to generalize to unseen data and perform in a production environment.

There are many different hyperparameters, and some models contain more than one. When combined with cross-validation, the number of models to be considered can increase rapidly and become computationally intensive, which is a big reason why Bayesian methods are growing in popularity. Regardless of the method chosen, the goal of tuning is to select hyperparameters that maximize performance without causing overfitting.

## Implementing a Machine Learning Solution

When implementing a machine learning solution, each step in the model building process should be considered due to the number of crucial decisions that need to be made. Without understanding the underlying data quality, variable selection, model building, and tuning processes, it is difficult, if not impossible, to have a proper grasp of the model's ability. Machine learning is powerful and can yield impressive results when the expected performance and limitations are thoroughly understood.

Machine learning is a growing discipline, particularly within the mortgage industry, the application of which is nuanced and complex. While this article was intended to provide a high-level introduction to the machine learning process, there is much more to learn about each of the steps outlined above. RiskSpan sees great potential within the mortgage industry for applied machine learning and will continue to explore its viability within the sector and share what we’ve learned.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.368.2254&rep=rep1&type=pdf