# Forecasting loan status

# Presentation of the problem

This document proposes a simple solution to the problem presented by Benjamin Obi Tayo on this page. The author talks about an interview he had some years ago where he was asked to build a prediction model to foresee the status at the end of the repayment period of the loan portfolio managed by a business entity. The idea was to construct a simple model capable to predict the loan loss of the business entity at the end of the repayment period by studying the time evolution of the loan status prior to that end.

The specific question is: which is the fraction of loans that will have charged off at the end of the repayment period based in the loan status of the portfolio at the current moment (the data series goes up to about two years since origination of the loan). The input is a dataset with the records of the loan status for 50000 originated loans. The data provides the elapsed days from loan origination till charge-off is registered (if it occurs) together with the elapsed days till the loan status was noted down.

A transcription of the original formulation of the problem is given next:

In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:

* First, the borrower receives the funds. This event is called the origination.

* The borrower then makes regular repayments until one of the following happens:

(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.

(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid. In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:

* The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.

* For loans that charged off before the data was collected, the column with header “days from origination to charge-off” indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.

# Model and assumptions

The author does not specify whether every instance of the dataset corresponds to a **different** loan. For the sake of simplicity we will assume that is the case, since the data do not provide any hint about it.

The second assumption we do is to consider that the global behavior of the borrowers will not change after the time the data was collected. In effect, we are going to assume that all the annotations were done within an economically stable period of time where no abrupt financial events, that could have triggered more charge-offs at certain dates, occurred. This will allow us to model the problem with a probability distribution function whose features do not change with time.

Thus, as it is done for this cases we propose the use of the uniform Poisson probability distribution function to model the behavior of the loan’s funds repayment based in the dataset registered in the portfolio of the business entity.

# Proposed solution

We are going to use R to solve this problem and the first thing we want to do is to plot a data histogram showing the amount of days when a certain number of people stopped paying back the loan at the same day. To do so we need to create a table storing the amount of days that 1 charge off was registered, the amount of days that 2 charge offs were registered the same day, the amount of days that 3 charge offs were registered the same day, and so on….

Then, let us first load the dataset from the website provided by Benjamin. We store the original dataset in the variable *rawdata* (we will need this variable later again).

`sourceFile = “https://raw.githubusercontent.com/bot13956/Monte_Carlo_Simulation_Loan_Status/master/loan_timing.csv"`

rawdata <- read.csv(header = TRUE, sep = “,”, file=sourceFile)

head(rawdata)

Next, we eliminate the missing values and we store the second column of the dataset in the variable *elapsedDaysChargeOff*:

`data <- na.omit(rawdata)`

elapsedDaysChargeOff <- data[2]

In order to obtain the number of people that stopped paying at each day since origination, we use the R function *table()* on the newly created variable. This function computes the appearance frequency of the values within the input data, representing the number of loans who were charged off each day.

`dataTable <- data.frame(table(elapsedDaysChargeOff))`

colnames(dataTable) <- c(“days”, “persons”)

dataTable$days <- as.numeric(as.character(dataTable$days))

head(dataTable)

Now we can plot the histogram showing the amount of days when 1 loan has charge off, the amount of days when 2 loans have charge off the same day, the amount of days when 3 loans have charge off the same day, and so on….

The number of breaks used in the histogram is exactly the maximum frequency registered in the data frame *dataTable*, being of 19 and corresponding to the maximum number of loans having charged off the same day. Curiously enough the data from the loan portfolio shows that 18 charge-offs have never been observed in the same day, even though 17 and 19 loan charge-offs occurring the same day have been observed several times.

As we have assumed that the Poisson distribution function is adequate to model the time behavior of unpaid loan funds, we are going to estimate the mean value of the sampling distribution by using the expected value formula:

where *f(i)* and *x(i)* are the frequency vector (whose value is represented by the height of the bars in the histogram) and the vector storing the amount of loans that have charged off the same day, respectively. We use the library *’geometry’* to calculate the expected value of the sampling distribution as a vector dot product with the command *dot().* In order to obtain the mean value of charge offs per day we divide the obtained expected value by the maximum number of days present in the original dataframe. The frequency vector is retrieved as an output of the previously created histogram.

`frequency <- ggplot_build(output)$data[[1]]$count`

xi <- c(1:max(dataTable$persons))

library(geometry)

meanValDay <- dot(xi,frequency)/max(dataTable$days)

As we have assumed that the global situation of the economy during the period the data has been collected has been relatively calm, we can project an estimate of the average number of charge offs that we would have observed over a span of 3 years from the data obtained within the time period represented in the records of the portfolio (725 days). We have just to evaluate the product of the mean value of loan charge-offs per day times 365 (days per year) times 3 (years).

`meanVal3Yrs <- meanValDay*365*3`

The fraction of loans that have charged off before the 3 years refund period over the total amount of originated loans is just the value of the variable *meanVal3Yrs* divided by the total number of loans existing in the original dataset (50000 as provided by the original data stored in the variable *rawdata*).

`frac3Yrs <-meanVal3Yrs/length(rawdata$days.from.origination.to.chargeoff)`

This yields **a fraction of charge-offs after three years period of around 10%**, what is a bit large, but still within the order of magnitude of the typical charge off rate values observed on the portfolio of the Federal Reserve in the USA for real state loans: federalreserve

Of course the biggest weakness of this approach is that it assumes future loss rates similar to the historical series, since we have assumed that the global behavior of the borrowers does not to change with time. This can only yield accurate predictions if no special event causes a radical change in the economical status of the families that received the loans. Indeed, a strong crisis, like the one we are still experiencing with the Covid-19 can radically modify the scenario of the economy, shattering all the models that we have built with the available dataset.

# Alternative solution

As an alternative we can use the library *“fitdistrplus”* to fit our data to the Poisson distribution so that we can estimate more rigorously the average number of charge offs per day from the sampling distribution. To be rigorous in the use of the fitting library we must fill the empty values for the days not appearing in the original data frame (since no annotation is registered for several number of elapsed days).

We create a new table with the original data, filling with zeros (by default) the rows corresponding to the number of elapsed days from origination that do not appear in the original dataset.

`newdataTable <- data.frame(c(0:725), rep(0,726))`

colnames(newdataTable) <- c(“days”, “persons”)

We copy the original dataframe in the right rows of the new dataframe.

`newdataTable$persons[dataTable$days+1] <- dataTable$persons[]`

Now we can use the library *’fitdistrplus’* to fit our data to the Poisson probability distribution function in order to find the lambda value (the mean value of the sampling distribution).

`library(fitdistrplus)`

pois = fitdist(newdataTable$persons, ‘pois’, method = ‘mle’)

Indeed, we see that we obtain exactly the same value for the lambda of the

distribution, what represents the mean value:

`lambdaFit <- pois$estimate`

We see that the mean value of the fitted distribution is very close to the expected value of the sampling distribution. We can now apply the same operation as before but this time with the newly *“fitted”* daily mean of charge offs.

`meanVal3YrsFit <- lambdaFit*365*3`

Again the value is very close to the one obtained with the first approach. Let us finally calculate the fraction of loans which will charge off at the end of the payment period.

`frac3YrsFit <- meanVal3YrsFit/length(rawdata$days.from.origination.to.chargeoff)`

We see that **the fraction of loans that will charge off by the time of the 3-year term is finished is also around 10%** like the result obtained with the first approach. Therefore, the model looks meaningful and the outcome seems to be coherent with the values obtained in real-world scenarios.