**A primer on neural networks**

By Ryan Huff

Machine learning (ML) has entered the common vernacular as an umbrella term for algorithms pertaining to artificial intelligence and complex data analysis. In practice, ML encompasses any process that learns patterns from one dataset and attempts to extrapolate into another, traditionally using statistics. While not typically labeled as ML, the generalized linear model (GLM) is an example of machine learning. The GLM is used as both a descriptive and predictive tool for various applications, ranging from demand planning to sociology. Because of its transparency and interpretability, the GLM can be an appealing tool for stakeholders of different backgrounds (not necessarily statistics-based).

However, there are some drawbacks to the GLM, namely the amount of discretion required by the analyst to construct robust, well-informed models. That is, GLMs are specified by the analyst, and it can be difficult to identify and capture the many complex relationships and interactions present in the data. A neural network is also an ML model, with origins dating back to the 20th century. With advances in computing power, neural networks have recently become a more popular modeling approach for their ability to capture complex relationships and interactions. While neural networks can solve complex problems (ChatGPT being a specific type of neural network known as a large language model), they can also be used to solve traditional analytical tasks. The neural network shares many properties with the more widely adopted GLM, while often producing more accurate forecasts. This article seeks to highlight the conceptual overlap between the GLM and the neural network, as well as contrast their strengths and weaknesses.

Machine learning (ML) encompasses any process that learns patterns from one dataset and attempts to extrapolate into another. While ML has been labeled as a “black box” approach to modeling, ML models can be related to more common regression approaches used in the financial insurance industry. As actuaries, we can focus on building and implementing models that best reflect the risks being evaluated. Often these risks include complex relationships. ML models are one approach to capture complex relationships.

**A familiar face—mechanics of the GLM**

The GLM is a flexible class of models that linearly relates a set of predictor variables to a target variable. When a GLM is estimated, a set of coefficients is produced that represents the effect of a 1-unit change in each predictor. To illustrate, suppose we are developing a model to estimate the probability of mortgage default. Assume that the only variable available for this model is a borrower’s credit score. For this exercise, we’ll demonstrate using a linear probability model (LPM), which assumes that the probability a borrower defaults is a linear function of credit score. The setup of the model would appear as Equation 1:
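Based on the components described in the next paragraph (an intercept, a coefficient beta on credit score, and a residual û), Equation 1 can be written along these lines. The subscript i for an individual borrower is a notational assumption, and the fitted part of the right-hand side (everything except the residual) is the predicted probability of default:

```latex
\begin{equation}
\text{Default}_i = \text{Intercept} + \beta \cdot \text{Credit Score}_i + \hat{u}_i
\tag{1}
\end{equation}
```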

Our model has a few components. First, the intercept, which in this case represents the probability of default when a borrower’s credit score is 0. If one were drawing a line, consider this to be the point where the line crosses the y-axis, where no intercept would mean that the line goes through the point (0, 0). The second component is beta, which represents the estimated change in the probability of default if credit score increases by 1. If beta were 0.01 and the credit score increases by 1, the probability of default would increase by 1 percentage point. The final component, û, is the residual, an estimate of the model’s error. This isn’t provided as a single value; rather, there is one residual per observation, measuring how far each prediction deviates from its actual value.

Why is this setup appealing? For one, all components of the model are laid out for us. It’s easy to explain that by adding the intercept to the product of beta and the credit score, we can obtain a predicted default probability. Secondly, beta provides insight into the *relationship* between the probability of default and the credit score. Interpreting the direction (positive or negative) and size of the coefficient and comparing it with the relationship we would expect can help teach us about the data or provide confidence that the model will produce logical results.
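As a minimal sketch of that arithmetic, the snippet below computes a predicted default probability from an intercept and a single coefficient. The coefficient values and the function name are hypothetical, chosen only for illustration, and are not estimates from any dataset.

```python
# Hypothetical coefficients for illustration only (not estimated from data).
INTERCEPT = 0.70   # predicted probability of default at a credit score of 0
BETA = -0.0008     # change in predicted probability per 1-point increase in score


def predict_default_probability(credit_score, intercept=INTERCEPT, beta=BETA):
    """Linear probability model: intercept plus beta times credit score."""
    return intercept + beta * credit_score


prediction_700 = predict_default_probability(700)
prediction_701 = predict_default_probability(701)

print(f"Predicted default probability at 700: {prediction_700:.4f}")
print(f"Change from a 1-point increase:       {prediction_701 - prediction_700:.4f}")
```

The sign of the hypothetical beta is negative, so a higher credit score lowers the predicted probability, and the change from a 1-point increase is simply beta itself.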

Now let’s make the example slightly more complicated. Assume that two additional variables, the borrower’s income and the unemployment rate, are known. The new model would be specified by Equation 2:
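Extending the single-variable setup with the two additional predictors, Equation 2 can be written along these lines. The ordering of the coefficients (credit score first, then income, then unemployment) and the subscript i are notational assumptions:

```latex
\begin{equation}
\text{Default}_i = \text{Intercept}
  + \beta_1 \cdot \text{Credit Score}_i
  + \beta_2 \cdot \text{Income}_i
  + \beta_3 \cdot \text{Unemployment}_i
  + \hat{u}_i
\tag{2}
\end{equation}
```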

Adding variables into the model has resulted in a few important contributions. For one, we now account for additional information about borrower default that wasn’t explained by credit score. We’ve also begun to isolate the actual effects of credit score, income, and unemployment in our model. To this second point, assume that income and credit score are related, where higher levels of income are associated with higher credit scores. We’ve established that both income and credit score are predictive of borrower default. In Equation 1, where only credit score was considered, the individual effect of credit score on borrower default was likely overstated, because credit score also carries information about income. By specifying a model where credit score and income are provided separately, these two effects can be separated. Therefore, each beta in Equation 2 (β_{1}, β_{2}, and β_{3}) is the marginal effect on the probability of default *holding the other variables in the model constant*. The more saturated with *relevant* predictors our model is, the more information on default behavior we can explain, and the more isolated estimated individual variable effects become.
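The overstatement described above can be sketched with a small, made-up dataset. The borrowers, the "true" coefficients, and the helper names below are all assumptions for illustration; unemployment is left out to keep the arithmetic short. The target is generated without noise, so the two-variable fit recovers the true effects, while the credit-score-only fit absorbs part of the income effect.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance; covariance(xs, xs) is the variance."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def slope_one_predictor(x, y):
    """OLS slope when y is regressed on x alone."""
    return covariance(x, y) / covariance(x, x)


def slopes_two_predictors(x1, x2, y):
    """OLS slopes when y is regressed on x1 and x2 together (normal equations)."""
    s11, s22, s12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)
    s1y, s2y = covariance(x1, y), covariance(x2, y)
    determinant = s11 * s22 - s12 ** 2
    beta_1 = (s1y * s22 - s2y * s12) / determinant
    beta_2 = (s2y * s11 - s1y * s12) / determinant
    return beta_1, beta_2


# Made-up borrowers: income (in $ thousands) tends to rise with credit score.
credit_score = [580, 620, 660, 700, 740, 780]
income = [40, 55, 50, 80, 75, 110]

# Hypothetical "true" relationship used to generate the target (no noise).
TRUE_BETA_SCORE, TRUE_BETA_INCOME = -0.0008, -0.002
default_probability = [
    0.9 + TRUE_BETA_SCORE * s + TRUE_BETA_INCOME * i
    for s, i in zip(credit_score, income)
]

score_only = slope_one_predictor(credit_score, default_probability)
score_joint, income_joint = slopes_two_predictors(
    credit_score, income, default_probability
)

print(f"Credit score effect, income omitted:  {score_only:.6f}")
print(f"Credit score effect, income included: {score_joint:.6f}")
print(f"Income effect, income included:       {income_joint:.6f}")
```

With income omitted, the credit score coefficient is larger in magnitude than the value used to generate the data, because it is also standing in for income.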

Regression models produce coefficients that are interpretable and allow end users to understand the relationships assumed in the model. However, regression models must be specified by the analyst:

To have the appropriate form (e.g., linear vs. logistic regression)

To have the appropriate relationships in the data (e.g., non-linear effects)

To have the appropriate interactions in the data (e.g., credit score and income)

Changes to the data, model form, and model variables could have material impacts on the results of the model.

How did we get here? The GLM determines each of the coefficients (the intercept and our betas) using a process known as optimization. For our linear probability model, a method called ordinary least squares (OLS) is used. OLS selects the values of the intercept and each beta that minimize the sum of squared errors, (actual - predicted)^{2}, across the entire sample of data. Variants of the GLM that are suited to different types of problems (e.g., logistic regression, Poisson regression) use other estimation methods, such as maximum likelihood estimation, to choose coefficients. The result is an equation that represents the “best fit” to the target given the variables in the model.
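For a single predictor, the OLS solution has a well-known closed form, which makes the "minimize the sum of squared errors" idea easy to check. The sketch below uses a made-up sample of six borrowers (1 = defaulted, 0 = did not); the data and function names are assumptions for illustration.

```python
def fit_ols(x, y):
    """Closed-form OLS for one predictor: returns (intercept, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    return intercept, beta


def sum_squared_error(x, y, intercept, beta):
    """Sum over the sample of (actual - predicted) squared."""
    return sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up sample for illustration only.
credit_scores = [580, 620, 660, 700, 740, 780]
defaulted = [1, 1, 0, 1, 0, 0]

intercept, beta = fit_ols(credit_scores, defaulted)
best_sse = sum_squared_error(credit_scores, defaulted, intercept, beta)

# Nudging either coefficient away from the OLS values can only increase the error.
nudged_sse = sum_squared_error(credit_scores, defaulted, intercept + 0.05, beta)
tilted_sse = sum_squared_error(credit_scores, defaulted, intercept, beta * 1.01)

print(f"intercept={intercept:.3f}, beta={beta:.4f}, SSE={best_sse:.4f}")
print(f"SSE after nudging intercept: {nudged_sse:.4f}")
print(f"SSE after tilting beta:      {tilted_sse:.4f}")
```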

There are a few pitfalls associated with using GLMs for inference and prediction. As mentioned, there are many variations on the GLM that are specific to different distributions in the data. Choosing the correct model requires taking on different assumptions that may or may not hold when using the model on new data. In some instances, the “correct” answer depends on the use case. For our example, we’ve proposed a linear probability model to predict borrower default. For a task to predict binary outcomes (i.e., did the borrower default), many in the industry would default to a logistic regression, which exclusively produces predictions between 0 and 1 (as a probability would behave), whereas a linear probability model can produce predictions below 0 or above 1.
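The difference in prediction range can be sketched as follows. Both sets of coefficients are hypothetical and are not fitted to any data; the point is only that a straight line is unbounded, while the logistic form always stays strictly between 0 and 1.

```python
import math

# Hypothetical coefficients for illustration only.
LINEAR_INTERCEPT, LINEAR_BETA = 3.9, -0.005
LOGISTIC_INTERCEPT, LOGISTIC_BETA = 10.0, -0.015


def linear_probability(credit_score):
    """Linear probability model: nothing stops the line leaving [0, 1]."""
    return LINEAR_INTERCEPT + LINEAR_BETA * credit_score


def logistic_probability(credit_score):
    """Logistic regression form: the linear score is squeezed into (0, 1)."""
    linear_score = LOGISTIC_INTERCEPT + LOGISTIC_BETA * credit_score
    return 1.0 / (1.0 + math.exp(-linear_score))


for score in (500, 700, 850):
    print(
        f"score={score}: linear={linear_probability(score):+.3f}, "
        f"logistic={logistic_probability(score):.3f}"
    )
```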

Choosing which variables to include in the model, and what shape they take, is a manual process for the analyst when estimating a GLM. Adding variables indiscriminately, without logical justification, can result in a very noisy model that doesn’t predict or infer well. Conversely, omitting key variables can limit how well the model explains the target and, in some cases, bias the estimated coefficients (as in the case of not considering income when using credit score to explain mortgage default).

Variables can also have complex relationships to the target. Assume that higher credit scores are associated with lower default rates, but with diminishing returns. This would suggest that increases to credit score for a borrower on the lower end of the distribution are associated with a *larger* reduction in default probability than increases on the upper end. Properly identifying this trend may be critical to obtaining accurate predictions of borrower default across different credit scores. Variables may also have interaction effects that provide additional information in the model. It may be that the effect of credit score depends on the level of income. The analyst must identify this relationship and then decide to build it into the model. When datasets are large and relationships are complex, choosing the best subset of variables and transformations can be an extremely time-consuming process. This can also turn the transparency advantage of the model into a deficit, as confirmation bias leaks in when the analyst tweaks the model to produce outcomes they expect. This can lead to biased (systematically incorrect) forecasts as well as Type I or Type II errors when interpreting coefficients to explain results.
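To make the analyst's burden concrete, here is a sketch of a model in which the diminishing return and the interaction have been specified by hand, using a squared credit score term and a credit score times income term. Every coefficient is hypothetical, and a squared term is only one of several ways an analyst might capture a diminishing return.

```python
# Hypothetical coefficients for illustration only (income in $ thousands).
INTERCEPT = 1.85
B_SCORE = -0.004          # linear credit score effect
B_SCORE_SQUARED = 2.2e-6  # squared term: lets the effect flatten at high scores
B_INCOME = -0.001         # income effect
B_SCORE_X_INCOME = 1e-6   # interaction: the score effect varies with income


def predict_default(credit_score, income):
    """Linear model with hand-specified nonlinear and interaction terms."""
    return (
        INTERCEPT
        + B_SCORE * credit_score
        + B_SCORE_SQUARED * credit_score ** 2
        + B_INCOME * income
        + B_SCORE_X_INCOME * credit_score * income
    )


# Effect of a 50-point increase at the low end vs. the high end (income fixed).
low_end_change = predict_default(450, 60) - predict_default(400, 60)
high_end_change = predict_default(800, 60) - predict_default(750, 60)

# Same 50-point increase at the low end, but for a higher-income borrower.
low_end_change_high_income = predict_default(450, 150) - predict_default(400, 150)

print(f"400 -> 450, income 60:  {low_end_change:+.4f}")
print(f"750 -> 800, income 60:  {high_end_change:+.4f}")
print(f"400 -> 450, income 150: {low_end_change_high_income:+.4f}")
```

None of these terms appear unless the analyst thinks to add them, which is the manual step the paragraph above describes.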

If the cost of an inaccurate forecast is high, the need for an accurate prediction may outweigh the need for explainability. This is the point where exploring alternative predictive models may be appropriate.

**Conceptually similar—mechanics of the neural network**

Neural networks are a flexible set of models that can be adapted to many challenges that the GLM is not well suited for. For instance, they efficiently handle high-volume, unstructured datasets to produce strong performance in speech detection, image recognition, and text generation. Neural networks can also be applied to many traditional modeling problems, like the mortgage default example. At their core, neural networks utilize coefficients and intercepts, known as weights and biases, respectively, to obtain predictions much as GLMs do.

To deconstruct the components of the neural network, we will point to the mortgage default model in Equation 2. Figure 1 shows a conceptual diagram of a simple neural network:

When talking about neural networks, there is a shift in terminology compared to the GLM, but the parallels are abundant. Layers can be thought of as different stages in the model. In the example put forth in Figure 1, we have two layers: an input layer and an output layer. The input layer contains each of our model variables as a “stack.” The output layer represents the predicted probability. In this example, these can be thought of as the right-hand side and left-hand side of the GLM equation. Each circle represents a single “neuron.” For a single observation (row of our data), each neuron can be thought of as one value. The neuron for income represents the income for that single borrower. Each arrow represents a connection, signifying that the value of a neuron in one layer feeds into the neuron it is connected to in the next layer. Specific to each connection is a weight (w_{1}, w_{2}, w_{3}). The weight is multiplied by the value of the neuron (e.g., w_{1} * Credit Score), after which these products are added up to obtain the probability of default. The green neuron is a placeholder for 1, which is multiplied by what we’ve differentiated as “bias” along its connection and added to the total. Why do we differentiate this out? Equation 3, which writes out our prediction as an equation, reveals why:
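Following the description of Figure 1, Equation 3 can be written along these lines. The text pairs w_{1} with credit score; the pairing of w_{2} with income and w_{3} with unemployment is an assumption:

```latex
\begin{equation}
\widehat{P}(\text{Default}) = \text{bias} \cdot 1
  + w_1 \cdot \text{Credit Score}
  + w_2 \cdot \text{Income}
  + w_3 \cdot \text{Unemployment}
\tag{3}
\end{equation}
```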

This should look familiar. The neural network we have outlined above produces a similar setup to Equation 2. The intercept, which is a constant, is analogous to our bias multiplied by 1. Each weight is multiplied by each variable in the same way that coefficients are applied in the GLM. Finally, all these terms are added together to produce the output. In the same way that the GLM selects the intercept and set of betas to minimize error, the neural network chooses weights and biases to achieve the same result. If one examined each weight and bias in the model, they are likely to be similar to the intercept and betas produced by the GLM (though given that the process for estimating weights and biases is different, they may differ slightly). Nonetheless, the predictions between this model and the GLM should be very similar.
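The equivalence can be sketched directly: if the bias is set to the GLM intercept and the weights to the GLM betas, the two calculations give the same number. The coefficient values and the (scaled) inputs below are hypothetical.

```python
def glm_predict(intercept, betas, predictors):
    """GLM form: intercept plus the sum of beta times each predictor."""
    return intercept + sum(b * x for b, x in zip(betas, predictors))


def two_layer_network_predict(bias, weights, predictors):
    """Network form: a constant neuron of 1 carries the bias along its connection."""
    input_layer = [1.0] + list(predictors)   # the placeholder neuron comes first
    connections = [bias] + list(weights)     # one value per arrow in the diagram
    return sum(w * value for w, value in zip(connections, input_layer))


# Hypothetical coefficients and one borrower's (scaled) inputs.
intercept, betas = 0.25, [-0.12, -0.05, 0.08]
borrower = [1.03, 1.25, 0.83]   # credit score, income, unemployment

glm_output = glm_predict(intercept, betas, borrower)
network_output = two_layer_network_predict(intercept, betas, borrower)

print(f"GLM prediction:     {glm_output:.6f}")
print(f"Network prediction: {network_output:.6f}")
```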

To outperform the GLM, the neural network scales the number of neurons and layers to account for more complexity.

**Capturing nonlinearity—hidden layers and activation functions**

In the above case, we’ve considered a neural network with two layers: one input layer to enter each variable into the model and one output layer to produce our forecast. This was to demonstrate that neural networks function similarly to GLMs. Neural networks also have the special property of allowing for more layers. Figure 2 demonstrates how we could estimate our mortgage default model with three layers.

While the diagram may seem intimidating, the process is very similar to what has occurred in the previous neural network. To simplify, let’s consider what goes on to get the output for neuron h_{1}. Credit score, unemployment, and income are each multiplied by the weight specific to that connection and then added together with the bias term (or the “intercept” from the GLM). The calculation for h_{1} is very much like that of a neural network with one input layer and one output layer. This same process is performed for h_{2}, h_{3}, and h_{4}. So, what is different? Adding the second layer allows the model to “learn” nonlinear relationships and interactions in the data. This exercise is not always a pure “black box” exercise where the model picks up random effects in the data. The data input to the model is limited to known variables that should impact the probability of default. The benefit of the added layer is that the model can identify how default rates change as the input variables interact. This simplifies the model development process and allows the data to inform the model instead of the analyst iteratively trying to evaluate relationships and interactions.

Along with hidden layers, activation functions enable neural networks to autonomously learn nonlinear relationships in the data. Activation functions alter the output of each individual neuron, which carries several benefits. As a neural network gets deeper (more layers), the mathematical training process becomes more complicated. Activation functions help compress the output of each neuron to allow the model to find the best weights and biases. These functions are often nonlinear, which also leads the model to pick up on the complex relationships in the data between the predictors and the target. Different layers can have different activation functions, and the activation function for the output layer generally matches the type of data being predicted.
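One way to see why the nonlinearity matters: without an activation function, a hidden layer adds nothing, because stacked weighted sums collapse back into a single linear equation. The sketch below checks this with a small network of two hidden neurons; all weights and biases are hypothetical, and the sigmoid is used as one example of a nonlinear activation.

```python
import math

# Hypothetical weights and biases for illustration only.
HIDDEN_WEIGHTS = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]  # one row per hidden neuron
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [0.7, -0.5]
OUTPUT_BIAS = 0.05


def weighted_sum(values, weights, bias):
    return bias + sum(w * v for w, v in zip(weights, values))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def network(inputs, activation=None):
    """Hidden layer then output; `activation` is applied to hidden neurons if given."""
    hidden = [weighted_sum(inputs, w, b) for w, b in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)]
    if activation is not None:
        hidden = [activation(h) for h in hidden]
    return weighted_sum(hidden, OUTPUT_WEIGHTS, OUTPUT_BIAS)


def collapsed_linear_model(inputs):
    """The single linear equation equivalent to the network with no activation."""
    combined_weights = [
        sum(OUTPUT_WEIGHTS[j] * HIDDEN_WEIGHTS[j][i] for j in range(len(OUTPUT_WEIGHTS)))
        for i in range(len(inputs))
    ]
    combined_bias = OUTPUT_BIAS + sum(
        ow * hb for ow, hb in zip(OUTPUT_WEIGHTS, HIDDEN_BIASES)
    )
    return weighted_sum(inputs, combined_weights, combined_bias)


x = [1.0, 0.9, 1.1]
print(f"No activation:   {network(x):.6f}")
print(f"Collapsed model: {collapsed_linear_model(x):.6f}")
print(f"With sigmoid:    {network(x, activation=sigmoid):.6f}")

# A linear model's output at a midpoint equals the average of the endpoint outputs;
# with the sigmoid in place, that is no longer the case.
low, mid, high = [0.0] * 3, [2.0] * 3, [4.0] * 3
linear_gap = network(mid) - (network(low) + network(high)) / 2
sigmoid_gap = network(mid, sigmoid) - (network(low, sigmoid) + network(high, sigmoid)) / 2
```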

To demonstrate the mechanics of a single neuron, let’s perform an example calculation using the “sigmoid” activation function. The sigmoid transformation is given by Equation 4.
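The standard sigmoid (logistic) function, which is what Equation 4 refers to, is shown below. Here x stands for the weighted sum entering the neuron; the symbol names are notational assumptions:

```latex
\begin{equation}
\sigma(x) = \frac{1}{1 + e^{-x}}
\tag{4}
\end{equation}
```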

This equation is used in a logistic regression to compress the predictions between 0 and 1. This same compression concept is used to help the neural network training algorithm learn.

Assume we want to evaluate the model output of neuron h_{1} for a borrower with a 700 credit score, yearly income of $100,000, and an unemployment rate of 5%. Further assume the model has learned the weights and biases for this neuron already. When running the model, we need to *scale* the data, which puts every variable on a comparable range of values. This is a necessary step because the scale for income is much larger than the scale for credit score. Credit scores typically range from 300 to 850, and income can be anywhere from $0 to several million dollars. In our example, we’ve divided each variable by its average and input these into our model. In practice, there are many different ways to scale data. The sample calculation is given in Figure 3.

In this table, the predictor variables are first divided by their average to get a “Scaled” value. Then, the model weight (which is estimated by the neural network) is multiplied by the scaled value. Then, these are added together. Finally, this sum (1.51) is passed through Equation 4 (the activation function) to get the final output of 0.82. Using different activation functions provides some necessary features. In the same way that we scale each of the variables going into the model, the activation function helps scale the value of each neuron. This helps prevent one pattern in the data from dominating the model.
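The steps in this calculation (scale, multiply by the weight, add, apply the sigmoid) can be sketched as below. The borrower's values and the 1.51 to 0.82 step come from the text; the averages and weights are hypothetical stand-ins rather than the values from Figure 3, so the second result is not expected to match the figure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(raw_values, averages, weights, bias=0.0):
    """Scale each input by its average, take the weighted sum, apply the sigmoid."""
    scaled = [value / average for value, average in zip(raw_values, averages)]
    weighted_sum = bias + sum(w * s for w, s in zip(weights, scaled))
    return scaled, weighted_sum, sigmoid(weighted_sum)


# From the text: a sum of 1.51 passes through the sigmoid to give about 0.82.
activation_from_text = sigmoid(1.51)
print(f"sigmoid(1.51) = {activation_from_text:.2f}")

# Borrower from the text: credit score, yearly income, unemployment rate (%).
borrower = [700, 100_000, 5.0]

# Hypothetical averages and weights (NOT the values behind Figure 3).
HYPOTHETICAL_AVERAGES = [680, 80_000, 6.0]
HYPOTHETICAL_WEIGHTS = [0.6, 0.9, -0.4]

scaled, weighted_sum, output = neuron_output(
    borrower, HYPOTHETICAL_AVERAGES, HYPOTHETICAL_WEIGHTS
)
print(f"scaled inputs: {[round(s, 3) for s in scaled]}")
print(f"weighted sum:  {weighted_sum:.3f}")
print(f"neuron output: {output:.3f}")
```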

Because different neurons will have different weights and biases, they can learn different information about the data. In this sense, each neuron has a different “perspective” about the data. Maybe h_{1} is well calibrated to credit scores in the higher end not signaling much additional information about default. Then, suppose that h_{2} has figured out that on the lower end of the credit score distribution, differences in scores matter more to predicting mortgage default. Each neuron tries to become an expert about some pattern in the previous layer. The final prediction is a weighted combination of the perspectives that the experts in the second-to-last layer have formed.
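Putting the pieces together, a three-layer network like the one described (three inputs, four hidden neurons, one output) can be sketched as a single forward pass. All weights and biases below are hypothetical rather than learned, and using the sigmoid on every neuron is an assumption made for simplicity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """Weighted sum of the previous layer plus bias, passed through the sigmoid."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, inputs)))


# Hypothetical (weights, bias) for each hidden neuron; not learned from data.
HIDDEN_LAYER = [
    ([0.6, 0.9, -0.4], 0.10),    # h1
    ([-1.2, 0.3, 0.8], 0.50),    # h2
    ([0.2, -0.7, 1.1], -0.30),   # h3
    ([-0.5, -0.5, 0.6], 0.00),   # h4
]
# Hypothetical output neuron: one weight per hidden "expert", plus a bias.
OUTPUT_NEURON = ([-1.0, 0.8, 0.9, 0.4], -0.6)


def predict(scaled_inputs):
    """Forward pass: inputs -> four hidden perspectives -> one prediction."""
    hidden_values = [neuron(scaled_inputs, w, b) for w, b in HIDDEN_LAYER]
    output_weights, output_bias = OUTPUT_NEURON
    return hidden_values, neuron(hidden_values, output_weights, output_bias)


# Scaled credit score, income, and unemployment for one borrower (made up).
hidden_values, probability = predict([1.03, 1.25, 0.83])

for name, value in zip(("h1", "h2", "h3", "h4"), hidden_values):
    print(f"{name} = {value:.3f}")
print(f"Predicted probability of default = {probability:.3f}")
```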

**Pitfalls and considerations—there’s no silver bullet**

While the prospect of automatically capturing complex relationships in the data is alluring, this can be a trap for those looking to deploy neural networks or any other complicated ML process. One notable advantage the GLM has over neural networks is its interpretability. The coefficients produced by the model are easy to understand: Some variables make the prediction go up, some make it go down. If a relationship isn’t linear, this is spelled out in the model and explainable. Neural networks with increasing complexity are not easy to examine in the same way. The weights and biases are visible, but with multiple layers it can be difficult to discern how the predictor variables reach the output.

When training a neural network, more valuable information will receive more weight. Activation functions regulate how strongly each neuron’s signal is passed on to the next layer. Adding more layers and neurons lets the model draw more information from the data to inform its estimates (but we must be careful not to overfit!). In the GLM, each relationship must be specified in the model, and this is largely the work of the analyst. A neural network with multiple layers automates some of the decision-making, as the impact of unimportant variables can be reduced and complex relationships don’t need to be manually specified.

Does this mean that neural networks are entirely a “black box”? No. Responsible deployment of these models should demand a similar level of exploratory analysis on the predictor variables and the forecast as with the GLM. The analyst should understand what complex relationships may exist in the data, and then test whether these come through in the final prediction. There are several tools that can open neural networks up to identify what relationships may exist and what variables provide the most importance to the final prediction. No matter what type they are, trustworthy models are always backed by well-documented, credible analysis. Whether you’re considering using someone else’s model or building your own, the importance of spending the time to break down what the model has learned cannot be overstated. Much like a self-driving car, we need to keep our hands on the wheel. It is still essential to understand and hypothesize on underlying relationships in the data even if the process is more automatic.

By learning “hidden” information in the data, neural networks can produce very powerful forecasts relative to other methods. When conducting a modeling task, it is usually best practice to test several different models. Other than neural networks and GLMs, several strong alternatives exist, including tree-based methods and support vector regressions. Comparing the results of different models can help determine whether the improvement in predictive power is worth sacrificing interpretability. This is where the business constraints can play a role.

The complexity of the model should be determined by the business problem: What costs do we associate with producing inaccurate forecasts? How long will this model take to train, and how expensive will it be? How complicated is my data? Failure to consider *how* the model will be used and what the end-user looks like can result in overengineering the solution.

**Down the rabbit hole—the world of deep learning**

Similar to GLMs, neural networks have many variations used to solve different problems. As one would imagine, the more complex the challenge, the more complicated the model. Time series analysis, image recognition, and text generation all have neural network variants fit to the unique characteristics of those problems. Tools such as ChatGPT use a much more sophisticated version of the models discussed in this article.

The business constraints can also be an important factor in determining whether the neural network is an appropriate choice. Neural networks can be computationally intensive to estimate and deploy. Without the proper technical infrastructure, slow prediction and re-estimation times can be expensive. This expense may include the need for specialized hardware such as graphics processing units (GPUs), which present an explicit cost barrier.

The impressive performance of neural networks is undeniable. However, model transparency, interpretability, and operational feasibility are important considerations when deciding whether a neural network is an appropriate choice. Errantly choosing the most novel and complex model without justification can sink an analytics project. Used effectively, however, these models can be powerful decision-making aids.

The nuances of neural network training involve a deep dive into mathematics. We have opted to avoid discussion of topics like backpropagation, gradient descent, and hyperparameter tuning. While deeper knowledge may be beneficial for model development and deployment, hopefully the conceptual comparison made here inspires some interest in how neural networks may be used to produce more powerful predictions than existing techniques.

**RYAN HUFF is a data scientist at Milliman.**