Context

A large Nordic telematics provider ran a fleet-management platform for more than 1.6 million commercial vehicles. Each vehicle reported position, diagnostics, and driver behavior several times per minute. At peak, the ingestion pipeline handled about 200,000 messages per second.

The existing stack, a mix of dedicated EC2 instances running Kafka and bespoke Java consumers, was seven years old. It worked. It also woke people up twice a week. The business wanted to add three new vehicle platforms in the next fiscal year, which would roughly triple the throughput. The existing architecture could not get there without doubling the operations budget.

The real problem

The stated problem was throughput. The actual problem was that the ingest layer was organizationally coupled to the team that ran it. New product launches required new clusters. New clusters required new on-call rotations. New rotations competed for the same six engineers. The business was trying to scale a team, not a platform.

This is the pattern you see repeatedly in long-running data infrastructure. The technology is the symptom. The constraint is whoever has to babysit it.

Approach

We proposed a serverless ingest layer built on managed services: Kinesis Data Streams for buffering, Lambda for transforms, DynamoDB for state, Timestream for time series, and EventBridge for downstream fanout. The bet was that we could shrink the team needed to operate the platform from six engineers to two, and make the platform horizontally scalable per product line rather than per cluster.
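The shape of that pipeline is easier to see in code than in prose. The sketch below is a minimal, illustrative version of the transform stage only: a Lambda handler consuming a Kinesis batch, writing one Timestream row per report, and fanning out a slim EventBridge event. The resource names, environment variables, and message fields (vehicle_id, speed_kph, ts_ms) are assumptions for illustration, not the customer's schema, and the DynamoDB state path is omitted to keep it short.

```python
import base64
import json
import os
import time

import boto3

# Illustrative resource names; the real pipeline's naming is not public.
timestream = boto3.client("timestream-write")
events = boto3.client("events")

TS_DATABASE = os.environ.get("TS_DATABASE", "fleet_telemetry")
TS_TABLE = os.environ.get("TS_TABLE", "vehicle_positions")
EVENT_BUS = os.environ.get("EVENT_BUS", "fleet-ingest")


def handler(event, context):
    """Triggered by a Kinesis event source mapping; one invocation per batch."""
    ts_records, fanout_entries = [], []

    for rec in event["Records"]:
        # Kinesis payloads arrive base64-encoded.
        msg = json.loads(base64.b64decode(rec["kinesis"]["data"]))

        # One time-series row per report (assumed message shape).
        ts_records.append({
            "Dimensions": [
                {"Name": "vehicle_id", "Value": str(msg["vehicle_id"])},
                {"Name": "platform", "Value": msg.get("platform", "unknown")},
            ],
            "MeasureName": "speed_kph",
            "MeasureValue": str(msg["speed_kph"]),
            "MeasureValueType": "DOUBLE",
            "Time": str(msg.get("ts_ms", int(time.time() * 1000))),
            "TimeUnit": "MILLISECONDS",
        })

        # A slim event for downstream product teams.
        fanout_entries.append({
            "Source": "fleet.ingest",
            "DetailType": "VehicleReport",
            "Detail": json.dumps({"vehicle_id": msg["vehicle_id"]}),
            "EventBusName": EVENT_BUS,
        })

    # Timestream accepts at most 100 records per WriteRecords call.
    for i in range(0, len(ts_records), 100):
        timestream.write_records(
            DatabaseName=TS_DATABASE,
            TableName=TS_TABLE,
            Records=ts_records[i : i + 100],
        )

    # PutEvents accepts at most 10 entries per call.
    for i in range(0, len(fanout_entries), 10):
        events.put_events(Entries=fanout_entries[i : i + 10])

    # Partial-batch-failure response shape, assuming ReportBatchItemFailures
    # is enabled on the event source mapping.
    return {"batchItemFailures": []}
```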

The first customer on the new pipeline was the smallest vehicle segment, by design. We wanted the failure modes to surface on a population where the business could absorb the mistakes. The largest segment migrated ninth.

What worked

  • Cost held flat through 3x growth. The managed-service model priced linearly with traffic, not with headcount. Capacity planning stopped being a meeting.
  • On-call volume dropped by about 70%. The remaining pages were almost entirely downstream, in services owned by product teams.
  • The smallest segment, migrated first, produced the three most important failure signals. We caught two bugs that would have taken down the largest segment had it been migrated first.

What did not

  • We underestimated the cost of observability on a Lambda-heavy pipeline. The first six months of Datadog bills were double what we had planned. Fixing that meant rewriting trace emission to sample aggressively in production, which we should have designed in from the start; a sketch of the sampling approach follows this list.
  • The original rollback plan was theatre. In the first real incident, we discovered that rolling back a partial migration would have required replaying 36 hours of Kinesis data against the old consumers. We built a real rollback path in month four, not month zero, and got lucky. A sketch of that replay path follows the sampling sketch below.
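For the observability point, the fix amounted to head-based sampling decided at the edge of each invocation: trace a small, deterministic fraction of healthy traffic and everything that errors. The sketch below shows the general technique with a stdlib-only sampler; the sample rate, the correlation-id keying, and the placeholder emit call are illustrative assumptions, not the Datadog client's own configuration.

```python
import hashlib
import os

# Fraction of invocations that emit full traces in production (assumed knob).
TRACE_SAMPLE_RATE = float(os.environ.get("TRACE_SAMPLE_RATE", "0.05"))


def should_trace(correlation_id: str) -> bool:
    """Deterministic head-based sampling: the same correlation id always gets
    the same decision, so a sampled request stays sampled across functions."""
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / float(1 << 64)
    return bucket < TRACE_SAMPLE_RATE


def emit_trace(correlation_id: str, span: dict, error: bool = False) -> None:
    """Errors are always traced; healthy traffic is sampled down."""
    if error or should_trace(correlation_id):
        # Hand the span to whatever tracing client is in use (Datadog in our
        # case); the actual client call is omitted from this sketch.
        print(span)
```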
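For the rollback point, the replay path we eventually built boiled down to re-reading the stream from a timestamp and feeding the records back through the old consumer path. A minimal sketch of that shape, assuming the stream's retention had been extended past the 24-hour default and that a hypothetical feed_old_consumer callable wraps the legacy consumers:

```python
from datetime import datetime, timedelta, timezone

import boto3

kinesis = boto3.client("kinesis")
STREAM = "fleet-ingest"  # illustrative stream name


def replay_since(hours: int, feed_old_consumer) -> None:
    """Re-read every shard from a point in time and hand each record to the
    legacy consumer path. This is the shape of the rollback we lacked."""
    start = datetime.now(timezone.utc) - timedelta(hours=hours)

    # Pagination over list_shards NextToken omitted for brevity.
    for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM,
            ShardId=shard["ShardId"],
            ShardIteratorType="AT_TIMESTAMP",
            Timestamp=start,
        )["ShardIterator"]

        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                feed_old_consumer(record["Data"])  # raw bytes, old format
            iterator = resp.get("NextShardIterator")
            if resp.get("MillisBehindLatest", 0) == 0:
                break  # caught up to the tip of the shard
```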

What I would do differently

I would cut the scope of the first milestone in half. The team shipped the first segment at month seven, which was a win on paper. In practice, the later segments inherited architectural decisions that had already hardened around the wrong tradeoffs. If I had forced a two-month first segment, even on a smaller slice, the platform shape for segments two through nine would have been better.

I would also have spent more time, early on, with the operations team that owned the old stack. They had absorbed seven years of pattern knowledge about how this data behaved at peak. We mined that knowledge too late.

Outcome

The platform has been in production for three years. It has scaled through two more vehicle-platform additions without adding infrastructure headcount. The operations team that was consumed by the old stack now owns two new high-leverage internal products. The original six-engineer team shrank to two on the platform and four on the new products.

The deal underneath this work was roughly $12M in AWS platform revenue across the three years, on a Professional Services engagement of about $4M. The real return was the on-call hours the customer got back.