An extremely painful, easily missed issue with machine learning products is that their performance tends to degrade over time. Generally speaking, the best day of a new model’s life is its last day in development. Performance will likely take a hit the moment the model hits production and slowly degrade from there. This is totally normal and simply something to prepare for as your data products become more and more highly developed.
The pain of lost opportunity can be subtle or dramatic. We often spend a lot of time developing data sources and inferential products, and we struggle to get them to achieve strong performance in our lab tests. After spending all that time, it can be easy to hold high expectations of the model’s performance. Really, though, lab performance should be thought of as something closer to a soft upper bound on live model performance.
In practice, model performance can be severely impacted almost immediately, or it can slowly degrade over time in ways that are more subtle but leave just as big a gap. Even a very advanced model can perform no better than chance if the context it has been deployed in changes significantly. Finally, model degradation is difficult and expensive to measure in the lab: it’s possible you won’t even know how bad the degradation problem will be until the model is live. It’ll just show up later in the bottom line.
What causes degradation?
Why do models degrade? Essentially, most models can only capture patterns that reflect the training data they’ve seen. A good model captures the essential pieces of this data and ignores the non-essential. This produces strong test, or generalization, performance, but there are limits to how well any model can prepare for data it has never seen.
The truest test of generalization performance is to see how a model performs on real world data over a long period of time. There are at least two major artifacts of this process.
- You’re going to be overfit. Any experienced team is deeply aware of this problem and has built in protections against it, but if you’ve done any amount of model development you are likely still at least partially susceptible. You’ll pay for lingering overfitting very quickly after deploying the model to production.
- The world is always changing. It might be your own business context, or real changes in the state of the world. It might even be adversarial changes, where feedback loops form around the predictions of your model. The data that would let you train and validate for future performance… simply doesn’t exist yet.
The moment you put your model into production it will likely suffer a loss in performance. Then, that loss will only grow over time as the world continues to change.
Handling model drift
You have to plan for the maintenance of your trained models. Engineering teams inexperienced with the way models work may not anticipate this need, but it’s critical. Without it you may only notice degradation losses in bottom-line business outcomes. A productionized model includes monitoring and maintenance.
Model performance on fresh data sets should be evaluated regularly. It’s standard practice to compare more sophisticated models against simple or randomized baseline models. This whole suite of comparison models should be maintained and evaluated continuously on these fresh data sets. The resulting performance traces should be visualized and reviewed regularly so that you can identify when it’s time to intervene.
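As a minimal sketch of this comparison, here is a hypothetical evaluation of a production model against a majority-class baseline on a fresh labeled batch. The data and the model’s error rate are simulated; in practice, the predictions would come from your real models.

```python
import random

random.seed(0)

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical fresh evaluation batch: true labels plus each model's predictions.
labels = [random.choice([0, 1]) for _ in range(1000)]
model_preds = [y if random.random() < 0.8 else 1 - y for y in labels]  # ~80% accurate

# Majority-class baseline: always predict the most common label.
majority = max(set(labels), key=labels.count)
baseline_preds = [majority] * len(labels)

scores = {
    "production_model": accuracy(model_preds, labels),
    "majority_baseline": accuracy(baseline_preds, labels),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# If the gap between the model and its baselines shrinks over successive
# fresh batches, that's a signal the model is degrading.
```

The interesting quantity is not any single score but the gap between the model and its baselines, tracked as a trace over time.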
As a model begins to degrade, the simplest way to resolve the performance issues is retraining. A productionized model should be easy to retrain, and new training data sets should be regularly constructed for this purpose. Then the model and all of its comparators should be retrained on these “fresh” data sets periodically. You can either monitor performance and choose reactively when to retrain, or use the performance trace to estimate a suitable regular schedule.
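The reactive option can be as simple as a threshold rule over the performance trace. The sketch below assumes you log one accuracy score per fresh evaluation batch; the window size and drop tolerance are illustrative values you’d tune for your own domain.

```python
def should_retrain(trace, window=4, drop_tolerance=0.05):
    """Trigger retraining when the recent average accuracy falls more than
    `drop_tolerance` below the best windowed average seen so far."""
    if len(trace) < window:
        return False  # not enough history to judge
    recent = sum(trace[-window:]) / window
    best = max(
        sum(trace[i:i + window]) / window
        for i in range(len(trace) - window + 1)
    )
    return best - recent > drop_tolerance

# A steadily degrading trace of per-batch accuracy scores.
trace = [0.91, 0.90, 0.91, 0.89, 0.87, 0.85, 0.83, 0.82]
print(should_retrain(trace))  # → True
```

Averaging over a window rather than reacting to single scores keeps the trigger from firing on ordinary batch-to-batch noise.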
You should consider what kinds of training-set sampling processes are needed to capture the behavior that matters to the business. For example, if you always build retraining data from the last month of data, you might witness model instability: the model will always be chasing the behavior of the prior month instead of normalizing to longer-term patterns.
Consider, for instance, a seasonal effect. If you don’t weight your retraining sample to potentially contain data from across the entire year, then your model will always be caught off-guard by the changing season. By including older data in fresh data sets you can offset these kinds of “aliasing” problems.
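One simple way to offset this aliasing is to blend recent data with a uniform draw from the whole year when constructing a retraining set. The pool, month tags, and 50/50 blend below are illustrative assumptions, not a prescription.

```python
import random

random.seed(1)

# Hypothetical pool of (month, record) pairs covering a full year.
pool = [(month, f"record-{month}-{i}") for month in range(1, 13) for i in range(100)]

def build_retraining_set(pool, current_month, n=240, recent_frac=0.5):
    """Blend recent data with a uniform draw across the rest of the year,
    so the model tracks current behavior without forgetting seasonal patterns."""
    recent = [r for r in pool if r[0] == current_month]
    historical = [r for r in pool if r[0] != current_month]
    n_recent = min(int(n * recent_frac), len(recent))
    sample = random.sample(recent, n_recent)
    sample += random.sample(historical, n - len(sample))
    return sample

training_set = build_retraining_set(pool, current_month=6)
months_covered = {m for m, _ in training_set}
print(len(training_set), len(months_covered))
```

Tuning `recent_frac` trades responsiveness to current behavior against robustness to seasonal swings.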
More generally, however, the problem of maintenance and retraining of models is a data science question in its own right. The reasons for model degradation can be discovered and modeled explicitly. Recurrent temporal effects can be studied, understood, and exploited. This can be a project for the data science team to tackle once a model has gathered sufficient performance metrics. Well, assuming you’ve been tracking them.
Reinforcement learning still requires care
One idea you might have is to use a reinforcement learning algorithm to automate this kind of retraining. On its surface, this is exactly the kind of automation which could reduce or prevent model drift. Reinforcement learners continuously evaluate their own performance and adapt. The most famous example of this is the Multi-Armed Bandit model.
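For concreteness, here is a minimal epsilon-greedy bandit choosing between three hypothetical model variants with unknown success rates. The arm names, rates, and exploration rate are all invented for illustration.

```python
import random

random.seed(2)

# Three hypothetical model variants, each with an unknown true success rate.
true_rates = {"model_a": 0.50, "model_b": 0.65, "model_c": 0.55}
counts = {arm: 0 for arm in true_rates}
rewards = {arm: 0.0 for arm in true_rates}
epsilon = 0.1  # exploration rate: itself a hyperparameter that needs tuning

for _ in range(5000):
    if random.random() < epsilon or not any(counts.values()):
        arm = random.choice(list(true_rates))  # explore: try a random arm
    else:
        # Exploit: pick the arm with the best observed success rate so far.
        arm = max(counts, key=lambda a: rewards[a] / counts[a] if counts[a] else 0.0)
    counts[arm] += 1
    rewards[arm] += 1.0 if random.random() < true_rates[arm] else 0.0

best = max(counts, key=counts.get)
print(best, counts[best])  # the bandit concentrates traffic on one arm
```

Note that the learner continuously re-evaluates its own estimates, which is exactly the property that makes bandits attractive for drift, but the exploration rate and reward definition still have to be monitored and tuned by hand.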
In practice, reinforcement learning techniques can help to automate away some amount of model degradation and drift. That said, they’re not a silver bullet. Tuning a reinforcement learner is more complex than tuning a static machine learning model: in addition to the core model tuning needed to achieve strong static performance, hyperparameters such as learning rates need to be tuned against performance time-series data. In other words, you will still need to build and manage performance monitoring systems to properly support reinforcement learning systems.
Monitoring your models in production
All of this adds up to one story: you need to be monitoring model performance in production. Models should be tested consistently using fresh data sets. These data sets should be sampled in a domain-specific fashion that captures the full range of phenomena the business cares about. Then model performance needs to be compared against baseline models and business outcomes to gauge the damage of any degradation. Beyond that, you also need a procedure for regularly reviewing these performance metrics and triggering the retraining or rebuilding of models. Without it, you’ll be able to see the loss in performance but have no system in place for resolving it.
A model put into production without these kinds of support systems can still be successful and drive a lot of business value, but missing those systems should be seen as a form of tech debt. Eventually you will be impacted by the loss in performance, and it’s better to be prepared as soon as you can afford to be than to be caught off-guard.