Choosing a loss function is an important step in setting up a well-designed machine learning task. It’s a choice that requires domain and business context. It also often requires some amount of technical experience. Finally, it’s something you probably don’t want to change too often or at all.
So, an up-front, somewhat irreversible decision that requires expertise and weigh-in from multiple disciplines. Super fun, right? Let’s talk about what goes into this kind of decision, what loss functions entail, and then how you can pick the best one. I’ll also have a list of common loss functions toward the bottom.
The pain of failure
Loss functions are mathematical measurements of the pain of failure. Any decision process, including automated ones such as machine learning algorithms, may suffer some kind of punishment when it makes the wrong decision. At the very least, it’s an opportunity cost: the more optimal decision would have made more profit.
Generally speaking, loss functions are critical for machine learning systems because—at the highest level—all an ML system does is search over a huge space of possible configurations for whichever one minimizes loss. If you pick a loss function that’s poor for the situation at hand, then the supposedly-optimal model may still make errors both frequently and severely.
As a tangible example, you might work in a space where you can expect, fairly regularly, substantial outliers. You understand from the business perspective that these are just lost causes and would prefer to consider the pain of failure on the non-outlier cases for the most part. A poorly tuned loss function can cause the optimal model to “chase after” those lost causes and suffer regularly worse performance on the standard cases. Choosing the proper loss function is one opportunity for you to encode that little gem of business intelligence.
In very realistic scenarios, this loss can be actual money lost or some measurement of risk-adjusted injury. For instance, in choosing whether or not to diagnose a patient with a disease, given that they have a borderline lab result, one has to balance the risks:
- If you choose not to diagnose, then you put yourself at risk that the patient truly does have the disease. You may need to design a follow-up plan which gives more “shots on goal” in the future. In either case, the patient suffers the risk of injury from not beginning treatment immediately in the case they had the disease.
- On the other hand, if you choose to diagnose and begin treatment you suffer the risk that the patient actually did not have the condition. This could in simple cases be lost time and money from the treatment. In more severe cases, the treatment itself induces real risks to the patient’s health that they would have otherwise avoided. You don’t want to be in the middle of a surgery only to discover it was unnecessary.
These kinds of scenarios chart the true nature of loss and suggest why business context is key. We want to build out a plan which discovers and considers the whole space of possible actions. We want to consider things like follow up and supplemental testing and decision making.
Or sometimes the decision is whether or not to buy ad X at cost C this very minute. Choose yes and risk the price C, choose no and risk the marginal chance of that ad converting. Much more cut and dry.
Most of the time, loss functions aren’t actually this complex. The situations described above can be very important to model and consider, but they are also much more difficult. Most of the time, data scientists will begin with much more austere and mathematical loss functions (as discussed toward the end of this article).
Human-in-the-loop losses and interpretability
Many statistical systems are decision support systems. This occurs when a model influences a decision but doesn’t actually, automatically, make that decision. There’s a “human in the loop”, someone who witnesses the outputs of the model, augments them with other information and their own experience and intuition, and then presses the button.
The doctor/patient example above is, of course, such a system.
The loss of the entire system becomes vastly more difficult to measure when there’s a human in the loop. It’s a question of both statistics and behavior. How does every tool, every process, every interview question affect the decision making of individuals, teams, the whole org, oh wow!
In fact, the doctor/patient model above is one of the rare cases where human-in-the-loop systems are actually regularly evaluated. The consequences of error, especially systematic error across practices, hospitals, or the whole industry, are giant. There are multiple academic fields evaluating these decision systems in the large and they often spend a lot of time considering how doctors interpret and respond to data and inference.
In the small, though, human-in-the-loop systems can make the model evaluation much easier. Because there’s an expert evaluator who brings domain knowledge and experience into play we prefer models to have simple, interpretable notions of accuracy and loss. These align much better with the easier, simpler loss functions we’ll discuss below.
If you find yourself as the expert in a human-in-the-loop system (or maybe training such experts) then it’s important to understand the tradeoffs of the loss function being employed.
Average loss and adversarial loss
As I’ve discussed it so far, loss is a concept defined against a single prediction of a model. We’d like to think about how much injury we suffer by making individualized bad decisions. This is only one part of what must be considered when designing loss overall, though. In practice, we’re designing a policy and want to understand how that decision making policy suffers loss over repeated decisions.
So, generally speaking we want to measure loss by taking a whole set of simulated decisions and somehow combining all of the individual losses across that set.
There are two major ways of doing this. The first, and most common, is “average loss” where we (a) assume that our testing scenarios are a reasonable sample of things that would really happen and then (b) average the losses suffered in each individual case. This gives us one final number which represents, more or less, the loss we could expect to see from this policy in real situations.
That said, any given case may perform quite a bit better or worse than the average. To this end, we might also want to look at other distributional statistics of the individual losses. You might choose to measure median loss instead of average loss to account for outliers. You might want to consider the average loss and the “variance of loss” to suggest a measure of spread.
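As a rough sketch of the summaries mentioned above, here is how one might compute them in Python for a list of per-decision losses. The function name `summarize_losses` and the example numbers are made up for illustration; nothing here comes from a particular library beyond the standard `statistics` module.

```python
import statistics

def summarize_losses(losses):
    """Summarize per-decision losses with a few distributional statistics."""
    return {
        "mean": statistics.fmean(losses),        # the usual "average loss"
        "median": statistics.median(losses),     # less sensitive to outliers
        "variance": statistics.pvariance(losses) # a rough measure of spread
    }

# Hypothetical losses: mostly small, with one severe outlier.
example_losses = [0.1, 0.2, 0.1, 0.3, 9.0]
summary = summarize_losses(example_losses)
print(summary)
```

With a single large outlier in the mix, the mean gets dragged far above the median, which is exactly the kind of disagreement that makes it worth looking at more than one number.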
Loss is a non-negative quantity, and the loss distribution of a well-performing model should be concentrated near zero. That makes the distribution lopsided, so the mean and variance of loss may not actually be terribly good summaries of it, despite their popularity. In particular, they may not capture a sense of “tail loss” or “outlier loss”. For this, we consider the second method.
Adversarial loss (or minmax loss) is when you deliberately design the test scenarios for your models to include an unfair mix of worst-case scenarios. The name “minmax” comes from the idea of seeking the “minimal possible loss against an adversary trying to induce maximal loss”.
Many standard models perform surprisingly poorly in minmax settings. If you recognize that you need to be prepared for adversarial settings, you need to also be prepared for investigating less common models and anticipating that they will be slower to converge, more conservative. Generally, an adversarial world is one that is less regular, less exploitable.
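To make the contrast between average loss and minmax loss concrete, here is a small sketch. The two candidate models and their per-scenario losses are entirely hypothetical; the point is only that the two ways of combining losses can disagree about which model is best.

```python
def average_loss(losses):
    """Combine individual losses by averaging them."""
    return sum(losses) / len(losses)

def worst_case_loss(losses):
    """Combine individual losses by taking the single worst one."""
    return max(losses)

# Hypothetical losses for two candidate models on the same four scenarios.
candidates = {
    "model_a": [0.1, 0.1, 0.1, 2.0],  # usually excellent, occasionally terrible
    "model_b": [0.8, 0.9, 0.7, 1.0],  # mediocre everywhere, never terrible
}

# Pick the model minimizing each kind of combined loss.
best_on_average = min(candidates, key=lambda name: average_loss(candidates[name]))
best_minmax = min(candidates, key=lambda name: worst_case_loss(candidates[name]))
print(best_on_average, best_minmax)
```

Here the more conservative model wins under the minmax criterion even though it loses on average, which matches the intuition that adversarial settings reward caution.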
Some example loss functions
There are as many loss functions as one can imagine, but in standard settings there are very standard loss functions which show up over and over. These usually are popular due to their mathematical convenience and elegance.
I’m going to classify them into two different varieties: regression losses and classification losses. These correspond to different problem forms.
- Classification losses are the simplest and relate to models which try to segment observations into two or more classes.
- Regression losses occur with models that make predictions for which there’s a continuous notion of how bad those outputs are.
There are other sorts of losses than these, though most ML settings are either classification or regression. There are also variations and extensions on normal losses. For instance, sometimes loss functions—which normally deal only with model predictions and the underlying true values—also consider the complexity of the model. These kinds of losses tend to penalize more complex models in a variety of ways. This process is known as regularization.
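One possible shape for a regularized loss, as a sketch: take whatever loss the predictions earned and add a penalty that grows with the size of the model’s weights. The squared-weights penalty used here is just one common choice, and the names `regularized_loss` and `strength` are my own placeholders.

```python
def regularized_loss(data_loss, weights, strength=0.1):
    """Prediction loss plus a complexity penalty on the model's weights.

    `strength` (an assumed tuning knob) controls how harshly complexity
    is punished relative to prediction error.
    """
    penalty = sum(w * w for w in weights)  # sum of squared weights
    return data_loss + strength * penalty

# Same prediction loss, but the second model uses much larger weights.
simple_model = regularized_loss(1.0, [0.5, 0.5])
complex_model = regularized_loss(1.0, [5.0, -4.0])
print(simple_model, complex_model)
```

Two models that fit the data equally well now get different losses, with the one using larger weights scored as worse.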
Another popular generalization of the basic loss functions is to mix them together. Given any two loss functions f and g and any two positive numbers p and q, you can create a new loss function m(x) = p f(x) + q g(x). This loss function will behave a little like f and a little like g, the mixture of the two.
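The mixture m(x) = p f(x) + q g(x) translates almost directly into code. In this sketch, `mix_losses` is a hypothetical helper, and I’m assuming each loss function takes a single residual (prediction minus truth) as its input.

```python
def mix_losses(f, g, p, q):
    """Build the mixed loss m(x) = p*f(x) + q*g(x) for positive p and q."""
    if p <= 0 or q <= 0:
        raise ValueError("p and q must both be positive")

    def m(x):
        return p * f(x) + q * g(x)

    return m

def squared(residual):
    return residual * residual

def absolute(residual):
    return abs(residual)

# A loss that behaves a little like squared error and a little like absolute error.
mixed = mix_losses(squared, absolute, p=0.5, q=2.0)
print(mixed(3.0))
```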
Finally, it’s worth spending a moment thinking of truly exotic loss functions which occur in exotic machine learning settings.
One example might be an ML model which is attempting to predict an output distribution over potential states. If the number of states is finite and the true state is known, this is just a classification problem. Otherwise, it’s a distribution learning problem and we need to design new loss functions for that situation (though cross-entropy, defined below, will still work).
Another example might be structured prediction algorithms, such as a model which will emit a whole molecule design in a chemistry setting. This kind of output isn’t clearly a class nor is there an obvious notion of distance or residual which can be used to treat it as a regression problem. In this setting, you may need to design a number of ad hoc residuals, choose a technique like random projections to do that for you, or devise an entirely new form of loss appropriate specifically to this setting.
Regression losses apply when a model outputs some continuous prediction. For example, a model which predicts the temperature tomorrow. We can consider ideas of “how far off” the model was. For this example, it’d be the number of degrees for which the prediction was wrong.
Most of these can be generalized to more exotic situations as long as some idea of “distance” between the predicted answer and the true answer can be devised.
Mean Squared Error (L2 Loss, Quadratic Loss)
This is perhaps the very most standard form of loss. It is incredibly popular due to its mathematical convenience. Many models built atop it have straightforward, easy-to-compute forms. So easy that they used to be commonly computed by hand.
The key idea of this loss function is that larger errors get penalized more severely than smaller errors. Specifically, if one prediction is twice as far away from the truth as another, it will be penalized four times as much.
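A minimal sketch of mean squared error, written in plain Python so the quadratic penalty is easy to see. The function name is my own; real projects would more likely lean on a numerical library.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Two single predictions, one twice as far from the truth as the other.
off_by_two = mean_squared_error([0.0], [2.0])
off_by_four = mean_squared_error([0.0], [4.0])
print(off_by_two, off_by_four)
```

Doubling the distance from the truth quadruples the penalty, which is the whole personality of this loss.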
This loss is associated with the idea of a mean, expectation, and with the Gaussian/Normal distribution. It’s the basis of the incredibly standard linear model. I’d take a bet and say that nearly all regression analyses either explicitly or implicitly use this loss.
On the other hand, this loss does very poorly in situations where there are regular outliers. It is also a poor fit for situations where the pain of each additional unit of error shrinks as the error grows. For instance, in the temperature example, I might be a little annoyed if the prediction is 8 degrees (F) off, but the difference between being 20 degrees and 30 degrees off is less material: in both cases, I’m dressed very inappropriately.
Mean Absolute Loss (L1 Loss)
Absolute loss is a variation on L2 loss where errors get penalized in direct proportion to their size. Unlike above, a prediction that’s twice as bad as another will only be penalized twice as much. Also unlike above, the models built on this loss function can be harder to compute. That said, in the era of computers this is not such a huge issue.
Despite that, L1 loss is vastly less popular than L2 loss.
L1 loss is often called robust in that it is less sensitive to outliers, weighing them only linearly as opposed to quadratically. This is usually the reason why L1 is chosen as opposed to L2, the anticipation and desire to be insensitive to regular outliers.
L1 loss is associated with the ideas of median, percentile, and with the Laplacian distribution. Many standard ideas associated with L2 loss can be recast in their robust form by switching to L1 loss (e.g., robust linear models).
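The link between L2 and the mean, and between L1 and the median, can be checked with a quick experiment. In this sketch we summarize a handful of made-up values with a single number and ask which number suffers the least total loss under each rule. The helper names are hypothetical.

```python
import statistics

def total_squared_loss(center, values):
    """Total L2 loss from summarizing every value by `center`."""
    return sum((v - center) ** 2 for v in values)

def total_absolute_loss(center, values):
    """Total L1 loss from summarizing every value by `center`."""
    return sum(abs(v - center) for v in values)

# Made-up data with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
mean_value = statistics.fmean(values)
median_value = statistics.median(values)

print(mean_value, median_value)
print(total_squared_loss(mean_value, values), total_squared_loss(median_value, values))
print(total_absolute_loss(mean_value, values), total_absolute_loss(median_value, values))
```

The mean gets pulled toward the outlier because that is what squared loss rewards, while the median stays with the bulk of the data because that is what absolute loss rewards.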
Generally speaking, models based on L1 loss will be more conservative than those based on L2 loss. In cases where there is little data, an L1 model will produce wider error bars and less informative predictions. This can be a downside of “robust” statistics generally.
L1 models can be a little challenging when used predictively. Because they “absorb” outliers more freely, their predictions will also sometimes include these outlier values. This can be very desirable, but also sometimes unexpected. This tendency is sometimes called the property of having “fat tails” or modeling “black swans”.
Other Lp Losses
So far I’ve been using a convention of naming losses as Lp for p as 2 and 1. In general, we can make a sensible loss function for Lp where p is any value greater than 0. And, actually, there’s a notion of L0 loss that gets used sometimes as well.
Discussion of these norms could easily get into exotic territory, but there are 3 cases worth thinking about even if they only very rarely get used.
The L0 loss is the 1-0 loss. This is only meaningful in regression situations where it’s sometimes possible to get the answer exactly right. In these cases, 1-0 loss is the loss where you add up all of the misses and count them all as the same. You can also easily soften this loss and make it appropriate in more situations by considering a prediction a miss if it’s outside of some neighborhood of the truth.
For instance, L0 loss with a margin might be sensible for the temperature detector. If you get within plus or minus 5 degrees of the true temperature, I’ll be fine. If you’re outside of that, I will suffer.
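Here is a sketch of that softened 1-0 loss for the temperature example. I’m assuming that landing exactly on the edge of the margin still counts as a hit; the function name and the sample temperatures are made up.

```python
def count_misses_with_margin(y_true, y_pred, margin=5.0):
    """Count predictions that land outside +/- `margin` of the truth.

    Every miss costs the same, no matter how far off it is.
    """
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) > margin)

# Hypothetical true and predicted temperatures in degrees F.
true_temps = [70.0, 60.0, 50.0]
predicted_temps = [73.0, 70.0, 50.0]
misses = count_misses_with_margin(true_temps, predicted_temps)
print(misses)
```

Setting `margin=0.0` recovers the strict version, where only exact hits escape the penalty.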
The L-infinity loss is the maximal loss. It’s what happens when you consider only the worst prediction that gets made. In other words, if your model makes 4 guesses and then gets penalized in proportion to the least accurate of them then you’re using L-infinity.
Finally, you might consider something like the L-1/2 loss. For values of p between 0 and 1 you get behavior almost opposite of the L2 loss. In L2 loss, the further away a prediction is from truth the more and more it is punished (quadratically). In L-1/2 loss, the penalty of loss ramps up very quickly and then eventually flattens out. You can see it as a smoothed version of L0 loss.
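A short sketch of the general pattern, assuming we measure each error as the absolute difference between prediction and truth. `lp_penalty` gives the per-prediction penalty for a chosen p, and `l_infinity_loss` looks only at the worst prediction. Both names are my own.

```python
def lp_penalty(residual, p):
    """Penalty for a single error under an Lp-style loss, for p > 0."""
    if p <= 0:
        raise ValueError("p must be greater than 0")
    return abs(residual) ** p

def l_infinity_loss(y_true, y_pred):
    """Loss determined entirely by the least accurate prediction."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))

# With p = 1/2, quadrupling the error only doubles the penalty.
print(lp_penalty(4.0, 0.5), lp_penalty(16.0, 0.5))

# With p = 2, quadrupling the error multiplies the penalty by sixteen.
print(lp_penalty(4.0, 2), lp_penalty(16.0, 2))

# Four guesses, judged only by the worst one.
print(l_infinity_loss([0.0, 0.0, 0.0, 0.0], [1.0, -3.0, 2.0, 0.5]))
```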
A final kind of loss that occurs from time to time is a non-standard mixture of L1 and L2 loss called Huber Loss. In this case, a threshold distance is chosen and the loss is calculated as L2 beneath that threshold and L1 above. Huber loss is an attempt to get the best of both worlds from L1 and L2, the robustness of L1 to severe (above threshold) outliers with the sensitivity and confidence of an L2 model.
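A sketch of Huber loss for a single error, using one conventional formulation: the quadratic piece carries a factor of one half, and the linear piece is shifted so the two pieces meet without a jump at the threshold. The parameter name `delta` for the threshold is a common convention, not something fixed by this article.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * size ** 2
    # Linear growth beyond the threshold, offset so both pieces agree at delta.
    return delta * (size - 0.5 * delta)

for error in [0.5, 1.0, 3.0, 10.0]:
    print(error, huber_loss(error))
```

Beyond the threshold, each extra unit of error adds a constant amount of loss, so a severe outlier can’t dominate the total the way it would under pure L2.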
In a classification problem, our models may output their best guess as to which of K classes, labeled by a number from 1 to K, each observation belongs to. Alternatively, they may instead output a set of scores for each observation. In this case, there will be one score for each class and we generally interpret them to say that the higher the score is for a given class the better off you’d be guessing that class.
Certain kinds of score-producing models additionally offer interpretable scores. Most commonly, these scores will have the property that they sum to one and represent probabilities of the observation belonging to each class.
Consider a model that’s predicting the sentiment of a tweet. We consider three sentiments, happy, sad, and neutral. One model reads a tweet and just outputs one of those three words. Another model outputs the score set [353, 2, 298] suggesting with some strength that the tweet is happy, but maybe also representing evidence that it could just be neutral. A final model outputs [0.53, 0.02, 0.45] suggesting the same idea, but justifying words like “this tweet is probably happy or neutral with a slight preference for happy”. The former score set may or may not be justifiably interpreted that way.
Generally speaking, probability-producing classifiers can also be considered calibrated or otherwise. A well-calibrated classifier produces probabilities which can be genuinely interpreted as either sampling or Bayesian posterior probabilities. Uncalibrated or poorly-calibrated classifiers might still produce probabilities, but these may be known to be, say, overconfident.
1-0 Loss (Indicator Loss)
The principal loss for any kind of classifier is to (1) pick the best class that it predicts and (2) count each successful prediction as “no loss” and each bad prediction as “full loss”. In other words, we count up how often the classifier misses.
This is the most straightforward and obvious loss function for a classifier. That said, it has the disadvantage that it throws away the rich scoring information that some classifiers might produce. It’s also usually not possible to design a classifier to directly optimize this metric—it has essentially no nice mathematical properties. That said, it’s simple to compute for any given classifier and thus is still a good tool for model evaluation.
It’s also often a good proxy for real world loss. If you have to choose between three options and are only graded on whether you’re correct, then you’ve got 1-0 loss.
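The two steps of 1-0 loss are easy to sketch for a classifier that emits scores. I’m assuming classes are identified by their position in the score list (starting from zero, as Python does) and that ties go to the earliest class. The function names and most of the score sets are hypothetical.

```python
def predicted_class(scores):
    """Step 1: pick the class with the highest score."""
    return max(range(len(scores)), key=lambda index: scores[index])

def zero_one_loss(true_classes, score_rows):
    """Step 2: the fraction of observations where the top class is wrong."""
    misses = sum(
        1 for truth, scores in zip(true_classes, score_rows)
        if predicted_class(scores) != truth
    )
    return misses / len(true_classes)

# Classes: 0 = happy, 1 = sad, 2 = neutral.
true_classes = [0, 1, 2]
score_rows = [
    [353, 2, 298],       # top class is happy: correct
    [0.1, 0.2, 0.7],     # top class is neutral, but the truth is sad: a miss
    [0.2, 0.3, 0.5],     # top class is neutral: correct
]
loss = zero_one_loss(true_classes, score_rows)
print(loss)
```

Notice that the size of the scores never matters here, only which one is largest. That is the “throwing away rich scoring information” mentioned above.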
The next simplest classification loss occurs when different bad choices have different penalties. This is a very common model of real world loss. For instance, the doctor/patient diagnosis scenario could be modeled in this way if we can measure the “amount of pain” each wrong decision would entail.
To get more concrete, we could build a very simple model
- If we diagnose as sick and are wrong, the cost is (cost of treatment) + (probability of complication)(cost of managing complication)
- If we diagnose as well and are wrong, the cost is (cost of consequence of disease)
This kind of thing may not be very accurate, but can be a huge improvement over 1-0 loss. In 1-0 loss we assume that the pain of each kind of wrong decision is equal. Here we can “bias” our loss in a given direction due to our preference for a more conservative or more liberal strategy.
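The simple diagnosis model above can be sketched as a function. Every number in the example is invented purely for illustration, and I’m assuming correct decisions carry no loss at all.

```python
def misdiagnosis_cost(diagnosed_sick, truly_sick, cost_of_treatment,
                      p_complication, cost_of_complication, cost_of_disease):
    """Loss for a single diagnosis decision with unequal penalties."""
    if diagnosed_sick == truly_sick:
        return 0.0  # assumed: a correct decision is loss free
    if diagnosed_sick:
        # Diagnosed as sick but actually well: unnecessary treatment.
        return cost_of_treatment + p_complication * cost_of_complication
    # Diagnosed as well but actually sick: consequences of untreated disease.
    return cost_of_disease

# Invented costs, in arbitrary units.
costs = dict(cost_of_treatment=1000.0, p_complication=0.05,
             cost_of_complication=20000.0, cost_of_disease=50000.0)

false_alarm = misdiagnosis_cost(True, False, **costs)
missed_case = misdiagnosis_cost(False, True, **costs)
print(false_alarm, missed_case)
```

With these particular numbers a missed case hurts far more than a false alarm, so a model trained against this loss would lean toward diagnosing in borderline cases.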
Cross Entropy Loss
Cross entropy loss is one of the most popular losses for classifiers which emit probabilistic scores. It is somewhat limited in that it tends to assume that these scores represent at least somewhat calibrated probabilities.
Cross entropy models, in some sense, the amount of “surprise” you feel at discovering the true class of an observation given that you believe the scores produced by the model. That’s a complex statement to unpack. The answer is tied up in the field of information theory, but let me give an abridged version.
“Surprise”, as a mathematical quantity, is an attempt to measure the degree to which you were confidently wrong. If I make you make a bet on whether I’m going to roll a 1 on a 6-sided die you would be smart to weight your bet by the anticipation that there’s a 1-in-6 chance of that occurring. Most of the time, when you see me roll 2, 3, 4, 5, or 6, you aren’t terrifically “surprised”. That’s the expected result. In the rarer events when a 1 actually comes up, you are more “surprised”.
I’ll stop with the scare quotes now, and note that this idea of surprise relates to the likelihood you feel of a given outcome occurring. In the event of a classification algorithm giving you a (calibrated, probabilistic) score vector like [0.2, 0.5, 0.3] you would be less surprised if the middle class were true and more surprised if the first class were true.
So cross-entropy minimizes surprise. We want to create a classifier where the output scores represent the best possible mental state we could have in preparation for learning the true answer. This drives its popularity—that’s a pretty human-interpretable situation.
As a final note, if you’re following along so far and go look at the equation for cross-entropy loss you’ll see that it involves a lot of logarithms. The idea of using the negative logarithm of a probability as a measure of “surprise” is core to information theory. To my eye, the key principle that makes this work is that the logarithm transformation makes repeated observations additive as opposed to multiplicative.
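A small sketch of surprise as the negative logarithm of a probability, and of cross-entropy loss as the average surprise at the true class. I’m assuming the model’s scores are probabilities and that none of them is exactly zero (the logarithm would blow up). Function names are my own.

```python
import math

def surprise(probability):
    """Negative log of the probability we assigned to what actually happened."""
    return -math.log(probability)

def cross_entropy_loss(true_classes, probability_rows):
    """Average surprise at the true class across all observations."""
    surprises = [
        surprise(probabilities[truth])
        for truth, probabilities in zip(true_classes, probability_rows)
    ]
    return sum(surprises) / len(surprises)

scores = [0.2, 0.5, 0.3]
print(surprise(scores[1]))  # the middle class was true: less surprising
print(surprise(scores[0]))  # the first class was true: more surprising

# Two independent observations: probabilities multiply, surprises add.
print(surprise(0.2 * 0.5), surprise(0.2) + surprise(0.5))
```

The last line is the additivity I mentioned: the joint probability of two independent outcomes is a product, but the total surprise is a plain sum.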
Hinge loss is a special form of loss that’s most commonly used in binary classification with (not necessarily probabilistic) scores. In this case, you can see the model as emitting a single, positive or negative number. The sign of the number suggests the class being chosen and its magnitude the “degree of confidence”.
Hinge loss penalizes classifiers for being wrong and for being insufficiently confident when right. This can be a little complex, so we can consider some cases for an observation where the true class is +1.
- If the model predicts -1, the hinge loss is 2. This demonstrates that incorrect predictions are given positive losses.
- If the model predicts -2, the hinge loss is 3. This demonstrates that it penalizes being wrong confidently in a linear way.
- If the model predicts 0, the hinge loss is 1. This demonstrates that non-decisiveness is also penalized.
- If the model predicts +0.5, weak correctness, the hinge loss is 0.5. This demonstrates that weak confidence is still partially penalized.
- Finally, if the model predicts anything +1 or above then the hinge loss is 0. This demonstrates that being very confidently correct doesn’t actually reduce the loss.
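The cases above all fall out of one short formula: the loss is one minus the product of the true class and the score, floored at zero. Here’s a sketch, assuming the true class is encoded as +1 or -1.

```python
def hinge_loss(true_class, score):
    """Hinge loss for one observation; true_class is +1 or -1."""
    return max(0.0, 1.0 - true_class * score)

# Walk through the cases for an observation whose true class is +1.
for score in [-1.0, -2.0, 0.0, 0.5, 1.0, 3.0]:
    print(score, hinge_loss(+1, score))
```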
Hinge loss is famously the form of loss used in the SVM algorithm. In this setting, it’s known as a “maximal margin loss”. This is due to its effect of penalizing insufficiently confident correct answers. It punishes uncertainty and maximizes the “margin”, the size of that zone of uncertainty.
There’s also a small variation on hinge loss called logistic loss. Logistic loss is reasonably similar to hinge loss, but where hinge loss will eventually treat a sufficiently confident correct answer as “loss free”, logistic loss will always slightly prefer an even more confident correct answer.
The major advantage of logistic loss over hinge loss is that it is somewhat simpler to work with mathematically: it is smooth everywhere, whereas hinge loss has a kink where the score reaches the margin.
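To compare the two side by side, here is a sketch using the natural-log form of logistic loss, with the true class encoded as +1 or -1. Other scalings of logistic loss exist, so treat the exact numbers as one convention among several.

```python
import math

def hinge_loss(true_class, score):
    """Zero once a correct answer is confident enough."""
    return max(0.0, 1.0 - true_class * score)

def logistic_loss(true_class, score):
    """Shrinks toward zero as confidence grows, but never quite reaches it."""
    return math.log1p(math.exp(-true_class * score))

# Increasingly confident correct answers for a true class of +1.
for score in [0.0, 1.0, 2.0, 3.0]:
    print(score, hinge_loss(+1, score), logistic_loss(+1, score))
```

Past a score of +1 the hinge column sits at zero while the logistic column keeps shrinking, so logistic loss keeps nudging the model toward ever more confident correct answers.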