The ability of machine learning models to predict loan performance makes them particularly interesting to lenders and fixed-income investors. This expanded post provides an example of applying the machine learning process to a loan-level dataset in order to predict delinquency. The process includes variable selection, model selection, model evaluation, and model tuning.
The data used in this example are from the first quarter of 2005 and come from the publicly available Fannie Mae performance dataset. The data are segmented into two different sets: acquisition and performance. The acquisition dataset contains 217,000 loans (rows) and 25 variables (columns) collected at origination (Q1 2005). The performance dataset contains the same set of 217,000 loans coupled with 31 variables that are updated each month over the life of the loan. Because there are multiple records for each loan, the performance dataset contains approximately 16 million rows.
For this exercise, the problem is to build a model capable of predicting which loans will become severely delinquent, defined as falling behind six or more months on payments. This delinquency variable was calculated from the performance dataset for all loans and merged with the acquisition data based on the loan’s unique identifier. This brings the total number of variables to 26. Plenty of other hypotheses can be tested, but this analysis focuses on just this one.
1 Variable Selection
An overview of the dataset can be found below, showing the name of each variable as well as the number of observations available
Most of the variables in the dataset are fully populated, with the exception of DTI, MI Percentage, MI Type, and Co-Borrower Credit Score. Many options exist for dealing with missing variables, including dropping the rows that are missing, eliminating the variable, substituting with a value such as 0 or the mean, or using a model to fill the most likely value.
The following chart plots the frequency of the 34,000 MI Percentage values
The distribution suggests a decent amount of variability. Most loans that have mortgage insurance are covered at 25%, but there are sizeable populations both above and below. Mortgage insurance is not required for the majority of borrowers, so it makes sense that this value would be missing for most loans. In this context, it makes the most sense to substitute the missing values with 0, since 0% mortgage insurance is an accurate representation of the state of the loan. An alternative that could be considered is to turn the variable into a binary yes/no variable indicating if the loan has mortgage insurance, though this would result in a loss of information.
The next variable with a large number of missing values is Mortgage Insurance Type. Querying the dataset reveals that that of the 34,400 loans that have mortgage insurance, 33,000 have type 1 borrower paid insurance and the remaining 1,400 have type 2 lender paid insurance. Like the mortgage insurance variable, the blank values can be filled. This will change the variable to indicate if the loan has no insurance, type 1, or type 2.
The remaining variable with a significant number of missing values is Co-Borrower Credit Score, with approximately half of its values missing. Unlike MI Percentage, the context does not allow us to substitute missing values with zeroes. The distribution of both borrower and co-borrower credit score as well as their relationship can be found below
As the plot demonstrates, borrower and co-borrower credit scores are correlated. Because of this, the removal of co-borrower credit score would only result in a minimal loss of information (especially within the context of this example). Most of the variance captured by co-borrower credit score is also captured in borrower credit score. Turning the co-borrower credit score into a binary yes/no ‘has co-borrower’ variable would not be of much use in this scenario as it would not differ significantly from the Number of Borrowers variable. Alternate strategies such as averaging borrower/co-borrower credit score might work, but for this example we will simply drop the variable.
In summary, the dataset is now smaller—Co-Borrower Credit Score has been dropped. Additionally, missing values for MI Percentage and MI Type have been filled in. Now that the data have been cleaned up, the values and distributions of the remaining variables can be examined to determine what additional preprocessing steps are required before model building. Scatter matrices of pairs of variables and distribution plots of individual variables along the diagonal can be found below. The scatter plots are helpful for identifying multicollinearity between pairs of variables, and the distributions can show if a variable lacks enough variance that it won’t contribute to model performance.
The third row of scatterplots, above, reflects a lack of variability in the distribution of Original Loan Term. The variance of 3.01 (calculated separately) is very small, and as a result the variable can be removed—it will not contribute to any model as there is very little information to learn from. This process of inspecting scatterplots and distributions is repeated for the remaining pairs of variables. The Number of Units variable suffers from the same issue and can also be removed.
2 Heatmaps and Pairwise Grids
Matrices of scatterplots are useful for looking at the relationships between variables. Another useful plot is a heatmap and pairwise grid of correlation coefficients. In the plot below a very strong correlation between Original LTV and Original CLTV is identified.
This multicollinearity can be problematic for both the interpretation of the relationship between the variables and delinquency as well as the actual performance of some models. To combat this problem, we remove Original CLTV because Original LTV is a more accurate representation of the loan at origination. Loans in this population that were not refinanced kept their original LTV value as CLTV. If CLTV were included in the model it would introduce information not available at origination to the model. The problem of allowing unexpected additional information in a dataset introduces an issue known as leakage, which will bias the model.
Now that the numeric variables have been inspected, the remaining categorical variables must be analyzed to ensure that the classes are not significantly unbalanced. Count plots and simple descriptive statistics can be used to identify categorical variables are problematic. Two examples below show the count of loans by state and by seller.
Inspecting the remaining variables uncovers that Relocation Indicator (indicating a mortgage issued when an employer moves an employee) and Product Type (fixed vs. adjustable rate) must be removed as they are extremely unbalanced and do not contain any information that will help the models learn. We also removed first payment date and origination date, which were largely redundant. The final cleanup results in a dataset that contains the following columns:
The final two steps before model building are to standardize each of the numeric variables and turn each categorical variable into a series of dummy or indicator variables. Numeric variables are scaled with mean 0 and standard deviation 1 so that it is easier to compare variables that have a different scale (e.g. interest rate vs. LTV). Additionally, standardizing is also a requirement for many algorithms (e.g. principal component analysis).
Categorical variables are transformed by turning each n value of the variable into its own yes/no feature. For example, Property State originally has 50 possible values, so it will be turned into 50 variables (e.g. Alabama yes/no, Alaska yes/no). For categorical variables with many values this transformation will significantly increase the number of variables in the model.
After scaling and transforming the dataset, the final shape is 199,716 rows and 106 columns. The target variable—loan delinquency—has 186,094 ‘no’ values and 13,622 ‘yes’ values. The data are now ready to be used to build, evaluate, and tune machine learning models.
3 Model Selection
Because the target variable loan delinquency is binary (yes/no) the methods available will be classification machine learning models. There are many classification models, including but not limited to: neural networks, logistic regression, support vector machines, decision trees and nearest neighbors. It is always beneficial to seek out domain expertise when tackling a problem to learn best practices and reduce the number of model builds. For this example, two approaches will be tried—nearest neighbors and decision tree.
The first step is to split the dataset into two segments: training and testing. For this example, 40% of the data will be partitioned into the test set, and 60% will remain as the training set. The resulting segmentations are as follows:
1. 60% of the observations (as training set)- X_train
2. The associated target (loan delinquency) for each observation in X_train- y_train
3. 40% of the observations (as test set)- X_test
4. The targets associated with the test set- y_test
Data should be randomly shuffled before they are split, as datasets are often in some type of meaningful order. Once the data are segmented the model will first be exposed to the training data to begin learning.
4 K-Nearest Neighbors Classifier
Training a K-neighbors model requires the fitting of the model on X_train (variables) and y_train (target) training observations. Once the model is fit, a summary of the model hyperparameters is returned. Hyperparameters are model parameters not learned automatically but rather are selected by the model creator.
The K-neighbors algorithm searches for the K closest (i.e., most similar) training examples for each test observation using a metric that calculates the distance between observations in high-dimensional space. Once the nearest neighbors are identified, a predicted class label is generated as the class that is most prevalent in the neighbors. The biggest challenge with a K-neighbors classifier is choosing the number of K neighbors to use. Another significant consideration is the type of distance metric to use.
To see more clearly how this method works, the 6 nearest neighbors of two random observations from the training set were selected, one that is a non-default (0 label) observation and one that is not.
Random delinquent observation: 28919
Random non delinquent observation: 59504
The indices and minkowski distances to the 6 nearest neighbors of the two random observations are found below. Unsurprisingly, the first nearest neighbor is always itself and the first distance is 0.
Indices of closest neighbors of obs. 28919 [28919 112677 88645 103919 27218 15512]
Distance of 5 closest neighbor for obs. 28919 [0 0.703 0.842 0.883 0.973 1.011]
Indices of 5 closest neighbors for obs. 59504 [59504 87483 25903 22212 96220 118043]
Distance of 5 closest neighbor for obs. 59504 [0 0.873 1.185 1.186 1.464 1.488]
Recall that in order to make a classification prediction, the kneighbors algorithm finds the K nearest neighbors of each observation. Each neighbor is given a ‘vote’ via their class label, and the majority vote wins. Below are the labels (or votes) of either 0 (non-delinquent) or 1 (delinquent) for the 6 nearest neighbors of the random observations. Based on the voting below, the delinquent observation would be classified correctly as 3 of the 5 nearest neighbors (excluding itself) are also delinquent. The non-delinquent observation would also be classified correctly, with 4 of 5 neighbors voting non-delinquent.
Delinquency label of nearest neighbors- non delinquent observation: [0 1 0 0 0 0]
Delinquency label of nearest neighbors- delinquent observation: [1 0 1 1 0 1]
5 Tree-Based Classifier
Tree based classifiers learn by segmenting the variable space into a number of distinct regions or nodes. This is accomplished via a process called recursive binary splitting. During this process observations are continuously split into two groups by selecting the variable and cutoff value that results in the highest node purity where purity is defined as the measure of variance across the two classes. The two most popular purity metrics are the gini index and cross entropy. A low value for these metrics indicates that the resulting node is pure and contains predominantly observations from the same class. Just like the nearest neighbor classifier, the decision tree classifier makes classification decisions by ‘votes’ from observations within each final node (known as the leaf node).
To illustrate how this works, a decision tree was created with the number of splitting rules (max depth) limited to 5. An excerpt of this tree can be found below. All 120,000 training examples start together in the top box. From top to bottom, each box shows the variable and splitting rule applied to the observations, the value of the gini metric, the number of observations the rule was applied to, and the current segmentation of the target variable. The first box indicates that the 6th variable (represented by the 5th index ‘X’) Borrower Credit Score was used to split the training examples. Observations where the value of Borrower Credit Score was below or equal to -0.4413 follow the line to the box on the left. This box shows that 40,262 samples met the criteria. This box also holds the next splitting rule, also applied to the Borrower Credit Score variable. This process continues with X (Original LTV) and so on until the tree is finished growing to its depth of 5. The final segments at the bottom of the tree are the aforementioned leaf nodes which are used to make classification decisions. When making a prediction on new observations, the same splitting rules are applied and the observation receives the label of the most commonly occurring class in its leaf node.
A more advanced tree based classifier is the Random Forest Classifier. The Random Forest works by generating many individual trees, often hundreds or thousands. However, for each tree, number of variables considered at each split is limited to a random subset. This helps reduce model variance and de-correlate the trees (since each tree will have a different set of available splitting choices). In our example, we fit a random forest classifier on the training data. The resulting hyperparameters and model documentation indicate that by default the model generates 10 trees, considers a random subset of variables the size of the square root of all variables (approximately 10 in this case), has no depth limitation, and only requires each leaf node to have 1 observation.
Since the random forest contains many trees and does not have a depth limitation, it is incredibly difficult to visualize. In order to better understand the model, a plot showing which variables were selected and resulted in the largest drop in the purity metric (gini index) can be useful. Below are the top 10 most important variables in the model, ranked by the total (normalized) reduction to the gini index. Intuitively, this plot can be described as showing which variables can be used to best segment the observations into groups that are predominantly one class, either delinquent and non-delinquent.
6 Model Evaluation
Now that the models have been fitted, their performance must be evaluated. To do this, the fitted model will first be used to generate predictions on the test set (X_test). Next, the predicted class labels are compared to the actual observed class label (y_test). Three of the most popular classification metrics that can be used to compare the predicted and actual values are recall, precision, and the f1-score. These metrics are calculated for each class, delinquent and not-delinquent.
Recall is calculated for each class as the ratio of events that were correctly predicted. More precisely, it is defined as the number of true positive predictions divided by the number of true positive predictions plus false negative predictions. For example, if the data had 10 delinquent observations and 7 were correctly predicted, recall for delinquent observations would be 7/10 or 70%.
Precision is the number of true positives divided by the number of true positives plus false positives. Precision can be thought of as the ratio of events correctly predicted to the total number of events predicted. In the hypothetical example above, assume that the model made a total of 14 predictions for the label delinquent. If so, then the precision for delinquent predictions would be 7/14 or 50%.
The f1 score is calculated as the harmonic mean of recall and precision: (2(Precision*Recall/Precision+Recall)).
The classification reports for the K-neighbors and decision tree below show the precision, recall, and f1 scores for label 0 (non-delinquent) and 1 (delinquent).
There is no silver bullet for choosing a model—often it comes down to the goals of implementation. In this situation, the tradeoff between identifying more delinquent loans at the cost of misclassification can be analyzed with a specific tool called a roc curve. When the model predicts a class label, a probability threshold is used to make the decision. This threshold is set by default at 50% so that observations with more than a 50% chance of membership belong to one class and vice-versa.
The majority vote (of the neighbor observations or the leaf node observations) determines the predicted label. Roc curves allow us to see the impact of varying this voting threshold by plotting the true positive prediction rate against the false positive prediction rate for each threshold value between 0% and 100%.
The area under the ROC curve (AUC) quantifies the model’s ability to distinguish between delinquent and non-delinquent observations. A completely useless model will have an AUC of .5 as the probability for each event is equal. A perfect model will have an AUC of 1 as it is able to perfectly predict each class.
To better illustrate, the ROC curves plotting the true positive and false positive rate on the held-out test set as the threshold is changed are plotted below.
7 Model Tuning
Up to this point the models have been built and evaluated using a single train/test split of the data. In practice this is often insufficient because a single split does not always provide the most robust estimate of the error on the test set. Additionally, there are more steps required for model tuning. To solve both of these problems it is common to train multiple instances of a model using cross validation. In K-fold cross validation, the training data that was first created gets split into a third dataset called the validation set. The model is trained on the training set and then evaluated on the validation set. This process is repeated K times, each time holding out a different portion of the training set to validate against. Once the model has been tuned using the train/validation splits, it is tested against the held out test set just as before. As a general rule, once data have been used to make a decision about the model they should never be used for evaluation.
8 K-Nearest Neighbors Tuning
Below a grid search approach is used to tune the K-nearest neighbors model. The first step is to define all of the possible hyperparameters to try in the model. For the KNN model, the list nk = [10, 50, 100, 150, 200, 250] specifies the number of nearest neighbors to try in each model. The list is used by the function GridSearchCV to build a series of models, each using the different value of nk. By default, GridSearchCV uses 3-fold cross validation. This means that the model will evaluate 3 train/validate splits of the data for each value of nk. Also specified in GridSearchCV is the scoring parameter used to evaluate each model. In this instance it is set to the metric discussed earlier, the area under the roc curve. GridSearchCV will return the best performing model by default, which can then be used to generate predictions on the test set as before. Many more values of K could be specified to search through, and the default minkowski distance could be set to a series of metrics to try. However, this comes at a cost of computation time that increases significantly with each added hyperparameter.
In the plot below the mean training and validation scores of the 3 cross-validated splits is plotted for each value of K. The plot indicates that for the lower values of K the model was overfitting the training data and causing lower validation scores. As K increases, the training score lowers but the validation score increases because the model gets better at generalizing to unseen data.
9 Random Forest Tuning
There are many hyperparameters that can be adjusted to tune the random forest model. We use three in our example: n_estimators, max_features, and min_samples_leaf. N_estimators refers to the number of trees to be created. This value can be increased substantially, so the search space is set to list estimators. Random Forests are generally very robust to overfitting, and it is not uncommon to train a classifier with more than 1,000 trees. Second, the number of variables to be randomly considered at each split can be tuned via max_features. Having a smaller value for the number of random features is helpful for decorrelating the trees in the forest, which is especially useful when multicollinearity is present. We tried a number of different values for max_features, which can be found in the list features. Finally, the number of observations required in each leaf node is tuned via the min_samples_leaf parameter and list samples.
The resulting plot, below, shows a subset of the grid search results. Specifically, it shows the mean test score for each number of trees and leaf size when the number of random features considered at each split is limited to 5. The plot demonstrates that the best performance occurs with 500 trees and a requirement of at least 5 observations per leaf. To see the best performing model from the entire grid space the best estimator method can be used.
By default, parameters of the best estimator are assigned to the GridSearch object (cvknc and cvrfc). This object can now be used generate future predictions or predicted probabilities. In our example, the tuned models are used to generate predicted probabilities on the held out test set. The resulting
ROC curves show an improvement in the KNN model from an AUC of .62 to .75. Likewise, the tuned Random Forest AUC improves from .64 to .77.
Predicting loan delinquency using only origination data is not an easy task. Presumably, if significant signal existed in the data it would trigger a change in strategy by MBS investors and ultimately origination practices. Nevertheless, this exercise demonstrates the capability of a machine learning approach to deconstruct such an intricate problem and suggests the appropriateness of using machine learning model to tackle these and other risk management data challenges relating to mortgages and a potentially wide range of asset classes.