Predictive Maintenance at Fleet Scale with Sparse Failure Data

The customer had a simple idea: if their platform could tell operators which machines were likely to fail before they failed, the customers would pay a premium for that signal. The idea was right. The path to production was not the one anyone planned.

The problem with the obvious approach

Predictive maintenance models learn from failure events. To predict a bearing failure, you need historical data showing what the sensor readings looked like in the hours and days before the bearing actually failed.

The customer’s platform served more than 6,000 industrial companies across manufacturing, food processing, and logistics, and had been collecting sensor telemetry for four years. On paper, that was a large dataset. In practice, most of the machines in most of those facilities had never failed. The ones that had failed often had gaps in the telemetry that preceded the failure, because the failure itself was frequently the event that caused the monitoring system to stop recording.

So: four years of data, almost all of it negative examples. Failure rates of between 0.3% and 1.8% depending on asset class. Labels that were unreliable in exactly the cases that mattered most.

The standard supervised learning approach would have produced a model that performed excellently in testing and badly in production. We had seen this at two other customers in adjacent verticals. The customer had not.
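One common way a model can look excellent in testing on data like this is that plain accuracy rewards never predicting a failure. The sketch below works through that arithmetic for the failure rates quoted above. The function name is illustrative, and the source does not say which metrics the team actually used, so treat this as one plausible mechanism rather than the diagnosis.

```python
def always_healthy_baseline(failure_rate: float) -> dict:
    """Metrics for a trivial model that never predicts a failure."""
    accuracy = 1.0 - failure_rate  # every healthy example is classified correctly
    recall = 0.0                   # no failure is ever caught
    return {"accuracy": accuracy, "recall": recall}


# Failure rates at the low and high end of the range quoted in the text.
for rate in (0.003, 0.018):
    m = always_healthy_baseline(rate)
    print(f"failure rate {rate:.1%}: accuracy {m['accuracy']:.1%}, recall {m['recall']:.0%}")
# failure rate 0.3%: accuracy 99.7%, recall 0%
# failure rate 1.8%: accuracy 98.2%, recall 0%
```

A model that does nothing useful scores above 98% accuracy at either end of the range, which is why accuracy alone says little about whether failures will be caught in production.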

The real problem

The failure data problem was real, but it was not the root problem. The root problem was organizational.

The engineering team had been building toward a supervised ML pipeline for eighteen months. They had data engineers, ML engineers, and a platform team. They had a roadmap, budget, and executive sponsorship. What they did not have was a mechanism to change direction when the data reality became clear.

By the time we were engaged, three separate engineers had independently concluded that the labelled data was insufficient for the planned approach. None of them had felt empowered to stop the roadmap. Each had filed the concern in a ticket, received an acknowledgment, and continued building.

When the architecture review surfaced this, the engineering lead’s first response was not “we should change approach.” It was “can we get more labelled data?” That question had already been asked and answered: the sales team had surveyed thirty of the largest customers. Only four had records of historical failures sufficient to be useful, and at that point those four were not willing to share them.

The intervention was not technical. It was a single meeting with the CPO where the evidence was laid out plainly: the approach will not produce a model that works, and the people closest to the data have known this for several months. The question is not whether to change approach, but how quickly and with what acknowledgment of the sunk cost.

That conversation took forty-five minutes. The eighteen months that preceded it had been spent building the wrong thing.

What worked instead

The pivot was to a hybrid approach combining anomaly detection with a lightweight supervised layer trained on synthetic failure data.

Anomaly detection does not require labelled failures. It learns the normal operating envelope for a given machine type, then flags deviations from that envelope. This is a cruder signal than “bearing failure in 72 hours,” but it is a signal that works on data that actually exists.
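As a rough illustration of the idea, the sketch below learns an envelope from healthy readings and flags values far outside it. The source does not say which anomaly detection method was used; a z-score threshold is a deliberately simple stand-in, and the function names, readings, and threshold are all hypothetical.

```python
from statistics import mean, stdev


def fit_envelope(normal_readings: list[float]) -> tuple[float, float]:
    """Learn a simple 'normal' envelope (mean, standard deviation) from healthy telemetry."""
    return mean(normal_readings), stdev(normal_readings)


def is_anomalous(reading: float, envelope: tuple[float, float], z_threshold: float = 3.0) -> bool:
    """Flag a reading more than z_threshold standard deviations from the mean."""
    mu, sigma = envelope
    if sigma == 0:
        # Degenerate case: perfectly constant history, so any change is a deviation.
        return reading != mu
    return abs(reading - mu) / sigma > z_threshold


# Hypothetical vibration readings from a healthy machine of one type.
healthy = [2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.3, 2.1, 1.8, 2.0]
envelope = fit_envelope(healthy)

print(is_anomalous(2.2, envelope))  # False: inside the envelope
print(is_anomalous(3.5, envelope))  # True: far outside the envelope
```

Note that no failure labels appear anywhere in the fitting step, which is the property that matters here.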

The supervised layer was trained on a combination of real failure events from the four cooperative large customers, synthetic failure signatures generated by domain experts who knew what bearing failures look like in telemetry, and transfer learning from publicly available industrial datasets that had better label quality.

The synthetic data generation required the three most experienced engineers on the team to spend four weeks with two external domain consultants who had twenty years of industrial maintenance experience between them. This was the most valuable four weeks in the programme. The consultants knew things that were not in any dataset: failure modes that are slow enough that sensor data looks normal until the last 48 hours, specific failure signatures that vary significantly by machine age and duty cycle, the difference between a sensor reading that indicates a problem and a sensor reading that indicates the sensor is dirty.

That knowledge, encoded into synthetic training data, made the model work.
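The source does not describe how the synthetic signatures were generated. As one hypothetical illustration, expert knowledge of a slow failure mode could be encoded by superimposing a late drift on a healthy trace. The linear ramp, the parameter values, and the function name below are assumptions made for the sketch, not the team's method.

```python
import random


def inject_late_drift(healthy_trace, onset_hours=48, peak_increase=1.5, seed=0):
    """Return a copy of an hourly trace with an upward drift over the final onset_hours.

    The linear ramp and the default parameters are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(healthy_trace)
    start = max(0, n - onset_hours)
    synthetic = list(healthy_trace)
    for i in range(start, n):
        progress = (i - start + 1) / (n - start)  # rises from just above 0 to 1.0
        synthetic[i] += peak_increase * progress + rng.gauss(0, 0.02)
    return synthetic


# Hypothetical: 200 hours of flat healthy readings.
healthy = [2.0] * 200
failing = inject_late_drift(healthy)
label = 1  # the synthetic trace is treated as a positive (failure) training example

print(failing[:152] == healthy[:152])  # True: unchanged until the final 48 hours
```

In practice the shape, onset, and magnitude would come from the domain experts and would vary with machine age and duty cycle, as described above.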

What did not work

Three things failed clearly enough to be worth naming.

The deployment timeline was off by three months. The first production deployment was scoped for month six. It happened in month nine. Not because of technical delays, but because the customer’s IT security team had not been included in the architecture review process. When deployment approached, they raised concerns about the data pipeline touching systems that were classified as operational technology. The OT/IT boundary in industrial settings is a real constraint, not bureaucracy. We had treated it as bureaucracy.

The CSAT target was right; the measurement was wrong. The 10/10 customer satisfaction score came from a survey of the internal stakeholders who had been involved in the programme. The end users, the maintenance technicians who would actually use the alerts, were not surveyed until month ten. Their feedback was 6/10 and highly specific: the alert interface required too many clicks to acknowledge a false positive, and false positives in the first three months were running at about 35% for one asset class. That asset class turned out to have a known quirk in its temperature sensor behaviour that the domain consultants had not mentioned and the training data had not captured. The fix took three weeks. The relationship took longer to repair.

The team size peaked too early. A delivery team of 25 is large for an 11-month programme. By month seven, roughly eight of them had no clear work to do. The roadmap had been front-loaded with discovery and engineering and had not adequately planned the delivery and refinement phases. Those eight people were not idle; they found work to do. Some of it was valuable. Some of it was gold-plating that created maintenance burden.

The lesson about team size is simple and routinely ignored: it is much easier to add people to a programme than to remove them. The overhead of managing a team of 25 consumed more of the engineering lead’s time than any technical problem in the programme.

Outcome

The platform shipped. In production, across approximately 800 customer sites in the initial rollout, the model catches 73% of significant failure events with sufficient lead time for the customer to take preventative action. False positive rate is 12% overall and declining as the model accumulates real-world feedback.
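The text does not define how these two figures are computed. One plausible reading is that the catch rate is recall over significant failures and that the false positive figure is the share of alerts that turned out to be false alarms. The sketch below shows that calculation on made-up counts, which are not data from this engagement; the function and field names are hypothetical.

```python
def alert_metrics(caught_failures: int, missed_failures: int, false_alerts: int) -> dict:
    """Compute recall and the share of alerts that were false alarms.

    Assumes each caught failure corresponds to exactly one true alert.
    """
    recall = caught_failures / (caught_failures + missed_failures)
    false_alert_share = false_alerts / (caught_failures + false_alerts)
    return {"recall": recall, "false_alert_share": false_alert_share}


# Made-up counts for illustration only.
m = alert_metrics(caught_failures=30, missed_failures=10, false_alerts=6)
print(f"recall {m['recall']:.0%}, false alerts {m['false_alert_share']:.1%} of all alerts")
# recall 75%, false alerts 16.7% of all alerts
```

If the false positive figure were instead measured against all healthy machine-hours, the same model would report a much smaller number, so the definition is worth stating alongside the result.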

The customer extended the engagement by six months for a second asset class. The second engagement started with a two-day domain expert workshop, a deliberate OT/IT boundary review, and a deployment timeline that the security team co-authored.

The 10/10 CSAT number on my engagements page is from the internal programme review at the end of month eleven. I include it because it reflects something real about how the work was received by the people running the programme. It does not reflect the end-user experience in the first three months. Both are true.