The last year has seen no shortage of unprecedented circumstances. All aspects of our lives, from work to travel to shopping, have changed. During this massive disruption, we have (unfortunately) learned why MLOps, the practice of machine learning (ML) in production and the management of the ML lifecycle, should not be an afterthought but rather a critical element of getting value from AI.
So – what happened?
Figure 1 below shows a simplified example of an AI model in action. The model is first trained on data (past examples of the environment) and then put into the real world to make predictions on new inputs, which are implicitly assumed to be sufficiently similar to the training examples. With COVID, many scenarios occurred that were unlike anything that had occurred in the past.
For example, last year, I noticed that an online retailer’s website had started recommending baking goods to me regardless of what product I was viewing, even though I had never bought any such product from this retailer. A plausible reason is that the AI powering the product recommendations had never seen the kind of rampant purchasing of baking goods that had recently occurred, and was unable to adjust and recommend good related products given this abrupt sea change in buying patterns. Is this acceptable or unacceptable? It depends…
Most AI will make a prediction for any input data that comes in. Because ML predictions are statistical rather than exact, a wide range of answers is “acceptable”. However, ML is also quite capable of providing very unacceptable answers. The question is: when do we go from the edge of acceptable to entirely unacceptable? How do we detect this, and how do we fix it?
Where does MLOps fit in?
While COVID-19 may have brought such events to many companies at the same time, they are expected events in the life of a production ML service. MLOps is the practice of Machine Learning in production, covering, among other things, the behavior and diagnostics of production ML and its relationship to other stages of the ML lifecycle – such as training and origin data.
In our initial breakdowns of MLOps areas, my team and I at ParallelM called the particular area that the COVID failures highlighted ML Health, i.e. the notion of ensuring that production ML operates correctly in the face of unexpected real-world issues. ML Health includes monitoring, managing, and root-causing ML issues in production.
COVID-triggered behavior patterns are causing an ML Health issue called Drift. Many types of AI learn from examples. The AI studies these examples to learn patterns that are codified as Models. The Models are then used to make predictions for new data. While this approach is incredibly powerful, its core assumption is that past data contains patterns that are appropriate to use for new predictions. Drift occurs when this core assumption breaks down.
So, how can COVID-19 cause Drift? For example, restaurant closures have likely changed the grocery purchasing patterns of many restaurants, so capacity forecasting AI applications are now getting very different inputs than has historically been the case for this time of year.
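One way to make "the core assumption breaks down" measurable is to compare the distribution of a feature in the training data with its distribution in recent production data. The sketch below uses the Population Stability Index (PSI), which is one of several possible statistics and not necessarily the one any particular ML Health tool uses. The function name, the bin count, and the sample data are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample (expected) and a production sample (actual)."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("training sample has no spread; cannot bin it")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Equal-width bins over the training range; values outside that
            # range are clamped into the first or last bin.
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # floor that avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Hypothetical feature values: what the model was trained on vs. what it sees now.
training = [float(v) for v in range(100)]
production_similar = list(training)
production_shifted = [v + 60.0 for v in training]

psi_similar = population_stability_index(training, production_similar)
psi_shifted = population_stability_index(training, production_shifted)
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the use case, so it should be treated as a starting point rather than a fixed rule.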
This type of problem does not just occur during worldwide pandemics. Simple mistakes can cause this problem too. For example, if your AI takes temperature as input and was trained on Fahrenheit, accidental entries of temperatures in Celsius will generate drift.
Drift can do anything from triggering hidden bugs in your prediction code to generating sub-optimal predictions. Other types of software typically either fail outright or generate errors. Drift-caused AI prediction failures, by contrast, are silent: your AI will continue to make bad predictions, causing downstream applications to behave suboptimally or even generating business or legal risk.
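The Fahrenheit and Celsius mix-up can be turned from a silent failure into a loud one with a simple guard in front of the model. The following is a minimal sketch, assuming a single numeric input and a batch of incoming values. The names `Baseline`, `fit_baseline`, and `check_batch`, the sample temperatures, and the threshold of 3 standard deviations are all hypothetical choices for illustration.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Baseline:
    mean: float
    stdev: float


def fit_baseline(training_values: Sequence[float]) -> Baseline:
    """Summarize the training data for one numeric input."""
    return Baseline(statistics.fmean(training_values), statistics.stdev(training_values))


def check_batch(baseline: Baseline, batch: Sequence[float], max_shift: float = 3.0) -> float:
    """Return the batch mean's distance from the training mean, measured in
    training standard deviations, and raise if it exceeds max_shift."""
    shift = abs(statistics.fmean(batch) - baseline.mean) / baseline.stdev
    if shift > max_shift:
        raise ValueError(
            f"input mean is {shift:.1f} training standard deviations from the training mean"
        )
    return shift


# Hypothetical training temperatures, in Fahrenheit.
training_f = [60, 62, 65, 68, 70, 72, 75, 78, 80, 71]
baseline = fit_baseline(training_f)

# A normal batch passes the check.
normal_shift = check_batch(baseline, [66, 69, 73, 70])

# The same kind of weather entered in Celsius trips the guard.
celsius_batch = [(f - 32) * 5 / 9 for f in training_f]
try:
    check_batch(baseline, celsius_batch)
    drift_detected = False
except ValueError:
    drift_detected = True
```

A check this simple only catches large shifts in the mean. Subtler changes in the shape of the distribution need a richer comparison, but even this guard would stop the unit error from flowing silently into downstream applications.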
But this will go away when COVID-19 goes away – right?
No. This kind of AI problem is endemic to how AI works. COVID-19 caused a massive business disruption and triggered many instances of Drift, but Drift can occur any time a business’ assumptions about the future do not match its history. As we come out of the pandemic, we will be in a third uncharted territory, not like the last year but not exactly like the pre-pandemic world either.
Protecting your business from Drift-related risk
For businesses that rely on AI for anything from product recommendations to supply chain or capacity planning, these kinds of Drift can have disastrous financial consequences. So what can businesses do?
- The first thing is to make sure that your AI team (from data scientists to ML engineers to ML Ops engineers) has an understanding of Drift types, how they can manifest, and the known methods for detecting Drift. Like many aspects of AI, technologies to detect and mitigate Drift are in their nascent stages.
- Once your team understands Drift, a successful Drift mitigation strategy requires that your AI team determine how Drift can manifest in your use cases and put in place appropriate testing and response processes for when it occurs. Good overviews of the Drift problem and related ML Health techniques to detect Drift can be found here and here.
- Given the early nature of Drift detection technologies, ensure that your team stays up to date with the latest best practices for Drift detection. To keep up with the latest technologies for mitigating Drift, conferences focused on MLOps and production ML, like OpML 2020, are great venues.
- Make Drift management part of a holistic MLOps strategy. As more and more AI goes into production, all organizations should have a well-defined MLOps practice (similar to a DevOps practice) where clear roles and best practices are defined and can be applied to a range of algorithms and toolchains.
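The "testing and response processes" mentioned in the list above can be as simple as an agreed mapping from a drift score to an action. The sketch below is one hypothetical way to write that policy down in code. The `Action` names and the two thresholds are assumptions, and each team would choose its own scores, thresholds, and responses per use case.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve predictions as usual"
    ALERT = "serve predictions and alert the ML team"
    FALLBACK = "switch to a fallback and investigate"


def choose_action(drift_score: float, warn_at: float = 0.1, act_at: float = 0.25) -> Action:
    """Map a drift score (higher means more drift) to a response."""
    if warn_at > act_at:
        raise ValueError("warn_at must not exceed act_at")
    if drift_score >= act_at:
        return Action.FALLBACK
    if drift_score >= warn_at:
        return Action.ALERT
    return Action.SERVE


# Example: three hypothetical drift scores from a monitoring job.
decisions = [choose_action(score) for score in (0.02, 0.15, 0.8)]
```

Writing the policy as code keeps the response consistent across models and makes the thresholds reviewable, in the same way other operational practices are reviewed.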