Author: Thomas Gaddy
The last few years have seen an explosion in interest around predictive maintenance. But what is predictive maintenance, and is this interest actually justified? In the first part of this blog series we will explore what predictive maintenance is, why so much attention is being paid to the area, and how machine learning and AI are playing a crucial role. This first post is meant to give a bird’s-eye view of the field. Stay tuned for more posts where we make these ideas more concrete with a tutorial in Python and Google Cloud using real data.
We are seeing an overall trend towards automation and data exchange in manufacturing technologies. Increasing integration of digital and physical systems allows for the collection of huge amounts of data. On its own, though, this data isn’t necessarily that useful. Manufacturers are hoping they can use this data to become more efficient, responsive to market demands, and sustainable. In particular, many manufacturers are looking for ways to minimize operating costs, and equipment and machine maintenance can be a significant cost.
Simply put, maintenance is all the things we do to keep our machines and equipment running as they’re supposed to. When we do maintenance there are really two major things we are trying to avoid. The first is the machine wearing down and running inefficiently or unexpectedly breaking. This is probably the more obvious—and often more costly—challenge; an unexpected break could be a severe hazard for workers and the environment, broken machines need to be replaced or fixed, and an inefficient or broken piece of equipment could mean lost revenue. But maintenance itself can be costly and dangerous, and so the second thing we’d like to avoid is doing it unnecessarily and not exploiting the full useful lifetime of our machines.
A common maintenance strategy is run to failure (R2F), sometimes called corrective maintenance. This is basically the simplest possible strategy: if a machine breaks, we fix it. This sort of purely reactive approach can lead to unpredictable outcomes, and by extension to a lot of unexpected machine breaks. Alternatively, we have preventative maintenance where repairs are conducted on a regular, time-oriented schedule. We might be able to reduce the number of unexpected breaks with this approach, but we are being naively proactive. Some of the maintenance we do might be completely unnecessary, and we still might have some unexpected breaks because we aren’t actually considering the state of the machine.
Enter predictive maintenance, a strategy to perform maintenance based on the estimated health of the piece of equipment. Predictive maintenance, also called condition-based maintenance, can allow us to reduce the uncertainty of maintenance activities by being intelligently proactive and performing maintenance at the right time. What makes this possible is the collection of large amounts of data and the ability to process and analyze this data. We use this data—often from embedded sensors or machine logs—to find evidence of machine degradation. Once we’ve made a diagnosis of the machine status based on this collected data, we can use this knowledge to execute appropriate maintenance actions. This leads us to the following basic workflow: gather data, estimate health of machine, and execute appropriate actions.
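To make the three-step workflow concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Reading` fields, the made-up sensor values, the 10 mm/s and 100 °C limits, and the 0.3 decision threshold are hypothetical, and the hand-written health rule simply stands in for whatever estimator is actually used.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    """One sensor sample. The fields are hypothetical."""
    machine_id: str
    vibration_mm_s: float
    temperature_c: float


def gather_data():
    # Step 1: in practice this would read from sensors or machine logs.
    # Hard-coded, made-up values stand in for real data here.
    return [
        Reading("pump-1", 2.1, 61.0),
        Reading("pump-1", 2.3, 62.5),
        Reading("pump-2", 7.8, 88.0),
        Reading("pump-2", 8.4, 91.0),
    ]


def estimate_health(readings):
    # Step 2: map readings to a health score in [0, 1] per machine.
    # A hand-written rule is used here; a trained model could replace it.
    by_machine = {}
    for r in readings:
        by_machine.setdefault(r.machine_id, []).append(r)
    scores = {}
    for machine_id, rs in by_machine.items():
        vib = mean(r.vibration_mm_s for r in rs)
        temp = mean(r.temperature_c for r in rs)
        # Assumed limits: 10 mm/s or 100 degrees C count as fully degraded.
        degradation = max(vib / 10.0, temp / 100.0)
        scores[machine_id] = max(0.0, 1.0 - degradation)
    return scores


def choose_action(health, threshold=0.3):
    # Step 3: turn the health estimate into a maintenance decision.
    return "schedule maintenance" if health < threshold else "keep running"


health_by_machine = estimate_health(gather_data())
actions = {m: choose_action(h) for m, h in health_by_machine.items()}
print(actions)
```

The point of the structure is that the second step is swappable: the rule in `estimate_health` could be replaced by a learned model without touching data collection or the decision logic.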
Notice I haven’t mentioned anything about machine learning, so where does it come in? Machine learning enters the picture at the second stage of the workflow: using the collected data to predict the status of the machine. There are several reasons why we may or may not choose to use machine learning here, and we will get back to these reasons shortly.
On paper, predictive maintenance looks like the ideal maintenance strategy. When done correctly we could improve the equipment condition, reduce equipment failure rates, minimize maintenance costs, and maximize the life of the equipment. There are also less obvious benefits. Depending on our approach, we might be able to associate our estimated health status of the machine with a particular component, helping with diagnostic procedures to identify the source of our problems. Overall, predictive maintenance is already having a profound economic impact and is predicted to continue delivering value (see here for more detail).
If predictive maintenance is so great, why isn’t everyone doing it? Well, for one, there needs to be a relevant and feasible use case. Predictive maintenance is also much harder than either of the two simpler maintenance strategies we mentioned earlier. Many of these difficulties can be traced back to the data we have to work with. Often we are working in the regime of “big data”. Big data is usually characterized by various “v’s”, including volume, variety, velocity, veracity, and value. We will briefly touch on each of these and how they might be related to the data we are likely to encounter in a predictive maintenance use case:
Volume: We are usually dealing with a lot of data. Each piece of equipment might have dozens of sensors, each of which is constantly monitoring the machine over its lifetime. A single Boeing jet, for instance, generates as much as 10 terabytes of data during 30 minutes of flight time. Effectively processing and analyzing the sheer amount of data can quickly become a daunting task.
Variety: Sensors might be recording data in different formats or at different time scales. This is just for one piece of equipment. We may be dealing with different models or different types of machine in a larger plant.
Velocity: Data may be generated quickly, leading to problems of how we extract signals or aggregate these readings over time. We may also have to analyze this data in real time.
Veracity: Data can often be of dubious quality, and some data may be missing. Sensor data is often noisy, and sensors themselves can easily become defective.
Value: There may (or may not!) be useful information in the data that can be extracted to help inform decision making. This is what we want to uncover!
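As a small illustration of the variety, velocity, and veracity issues, the sketch below puts two sensors that sample at different rates onto a shared time grid and fills gaps. The readings, the 20-second window, and the forward-fill choice are all assumptions made for the example; other aggregation and imputation choices may suit real data better.

```python
from statistics import mean

# Hypothetical raw readings as (timestamp_seconds, value).
# None marks a dropped sample.
temperature = [(0, 60.0), (10, 61.0), (20, None), (30, 63.0), (40, 64.0), (50, 65.0)]
vibration = [(0, 2.0), (30, 2.6)]  # sampled less often than temperature


def resample(readings, window_s, n_windows):
    """Average the valid samples in each fixed window; None if a window is empty."""
    out = []
    for w in range(n_windows):
        start, end = w * window_s, (w + 1) * window_s
        values = [v for t, v in readings if start <= t < end and v is not None]
        out.append(mean(values) if values else None)
    return out


def forward_fill(values):
    """Replace gaps with the last seen value (one simple imputation choice)."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


# Put both sensors on a shared 20-second grid covering one minute.
temp_grid = forward_fill(resample(temperature, 20, 3))
vib_grid = forward_fill(resample(vibration, 20, 3))
aligned = list(zip(temp_grid, vib_grid))
print(aligned)  # [(60.5, 2.0), (63.0, 2.6), (64.5, 2.6)]
```

Note that the last vibration value is carried forward rather than measured, which is exactly the kind of quiet assumption that can matter later when a model is evaluated.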
Beyond these big data challenges, there might be some additional challenges. We may be dealing with data of high dimensionality with complex correlations between variables. Reasoning in these high dimensional spaces is difficult and uncovering these correlations can be next to impossible for a human. This is where machine learning can help.
As in any successful data science project, we need to define appropriate baselines and metrics. This helps us take all our great work from an academic exercise to something with real-world impact. Assuming we have all our baselines and metrics in place, what machine learning approach do we take? Like most problems in machine learning, our framing will depend on the available data (in addition to our business strategy we just identified), which in turn might depend on our historical maintenance strategy! One initial categorization that is probably familiar to you is unsupervised vs supervised.
An unsupervised approach is necessary if we only have process information and no maintenance-related data. An example use case could be anomaly detection in which potential anomalies are identified and flagged for further investigation. Of course, like any unsupervised machine learning problem, model evaluation becomes a challenging task. In this case, we also have the added complexity of mapping identified anomalies to specific maintenance actions.
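One of the simplest ways to flag anomalies without any failure labels is a z-score against a baseline period. The sketch below is only an illustration: the readings are made up, and the choice of the first 10 points as a "healthy" baseline and a threshold of 3 standard deviations are assumptions, not recommendations.

```python
from statistics import mean, stdev


def flag_anomalies(values, baseline_size=10, threshold=3.0):
    """Return indices after the baseline whose z-score exceeds the threshold.

    Assumes the first `baseline_size` points reflect healthy operation.
    """
    baseline = values[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    return [
        i for i, v in enumerate(values)
        if i >= baseline_size and abs(v - mu) / sigma > threshold
    ]


# Made-up vibration readings: stable at first, then two spikes.
readings = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0,
            2.1, 2.0, 3.5, 2.0, 1.9, 4.2]
anomalies = flag_anomalies(readings)
print(anomalies)  # [12, 15]
```

The flagged indices only say that something looks unusual. Deciding whether a flagged point calls for maintenance, and which kind, still needs domain knowledge.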
A supervised approach is possible if we have labelled occurrences of failure in the dataset, which would be true if we had used a run to failure maintenance strategy in the past. Within a supervised approach, we may decide to do classification or regression. Which one we choose again depends on our data and on how we want the outputs of our machine learning model to inform our maintenance actions (are you noticing a pattern here?).
In a classification setup we are trying to predict a fail vs. no-fail label (we could also have a multiclass setup where we, for example, try to classify between various failure modes). This is a natural framing, and standard classification metrics can map nicely onto the actual problem (if a label of “1” indicates failure, then recall is the percentage of failures we catch). There are a couple of problems with this approach. First and foremost, we are probably dealing with highly imbalanced classes (hopefully we don’t have too many failures in our data or we may have a bigger problem!). The second problem is that we would probably like to anticipate failure in order to allow time for maintenance actions, rather than just classify machines as having failed or not. We will look at how to address these problems in the next part.
An example of the regression setup is to predict the remaining useful life of the equipment. This type of approach side-steps some of the issues with the classification approach at the cost of potentially making evaluation more difficult, since predictions don’t map onto the actual problem as neatly.
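To see why class imbalance matters for the classification framing, and what a regression target might look like, here is a short sketch on made-up data. The 2-in-100 failure rate, both toy "models", and the convention that a unit fails at its last observed cycle are assumptions made for illustration.

```python
def recall(y_true, y_pred):
    """Share of actual failures (label 1) that were predicted as failures."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def remaining_useful_life(n_cycles):
    """Regression target for a unit that fails at its last observed cycle."""
    return [n_cycles - 1 - i for i in range(n_cycles)]


# Made-up data: 2 failures among 100 machines.
y_true = [1, 1] + [0] * 98
always_healthy = [0] * 100              # never predicts failure
cautious = [1, 0] + [1] * 3 + [0] * 95  # catches 1 failure, 3 false alarms

acc_naive = accuracy(y_true, always_healthy)
rec_naive = recall(y_true, always_healthy)
acc_cautious = accuracy(y_true, cautious)
rec_cautious = recall(y_true, cautious)
rul = remaining_useful_life(5)

print(acc_naive, rec_naive)        # 0.98 0.0
print(acc_cautious, rec_cautious)  # 0.96 0.5
print(rul)                         # [4, 3, 2, 1, 0]
```

The model that never predicts failure has the higher accuracy but catches nothing, which is why accuracy alone is a poor guide when failures are rare.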
But which algorithm do we use? Unfortunately, there are few well-established best practices in predictive maintenance and there is no clear “winner” when it comes to machine learning algorithms in this area (there’s no free lunch!). On the bright side, that gives us plenty of room to experiment and test out different approaches we may be familiar with. This includes everything from linear models to SVMs to hidden Markov models to data mining techniques to the latest deep learning models (to see the wide variety of machine learning models used and the equipment they were used for, check table 3 in this paper).
One problem that we are likely to encounter no matter our framing is how to deal with time series in the data. There are some questions to ask ourselves here:
There’s no right answer to any of these questions. The best advice is to put in the effort to familiarize yourself with your data and your business requirements. There’s a lot more to consider as well. One key point is interpretability. If models are being used to guide human decision making, it may be necessary for our model to be interpretable in order to gain trust and encourage widespread adoption (this may be a deciding factor in favor of extracting your own time series features and using a simple model as opposed to feeding raw data into a deep learning model). Another is impact. We already mentioned this briefly above but it’s worth repeating: does predictive maintenance provide measurable business impact compared to existing maintenance strategies? Even if we decide we need predictive maintenance, do we need machine learning or will heuristics based on domain knowledge suffice?
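As one example of extracting your own time series features, the sketch below summarises the most recent readings of a sensor with a mean, a standard deviation, and a crude slope. The feature set, the window length of 4, and the temperature values are assumptions chosen for illustration; which features are useful depends on the equipment and the data.

```python
from statistics import mean, pstdev


def window_features(values, window):
    """Summarise the most recent `window` readings (window must be at least 2)."""
    recent = values[-window:]
    # Slope: average change per step, from the first to the last reading in the window.
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return {"mean": mean(recent), "std": pstdev(recent), "slope": slope}


# Made-up temperature readings that start to climb.
temps = [60.0, 60.5, 60.2, 61.0, 62.0, 63.0, 64.0]
features = window_features(temps, window=4)
print(features["mean"], features["slope"])  # 62.5 1.0
```

Features like these are easy to explain to the people acting on the predictions, and they can feed either a simple model or a hand-written rule.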
For those of you looking for a silver bullet, I’m sorry to disappoint. Not every problem in maintenance will benefit from machine learning, and even if we find an appropriate use case, it may still take a lot of exploration and experimentation before a suitable solution can be found. All that being said, I don’t want to leave you with the impression that machine learning based predictive maintenance is impossible! This post was a little dry and abstract, but in the next blog post we will get our hands dirty with some exploratory data analysis and preliminary modelling on a specific use case—predicting hard drive failure. I won’t spoil too much here, but we’ll see what happens when a simple machine learning model is put to the test versus a manually-defined heuristic and a run to failure maintenance strategy. In part 3 we will see if our model can be improved even further.