What Running Serverless at Scale Actually Teaches You

After building serverless platforms that process hundreds of thousands of messages per second, I have found that the real lessons are not about Lambda. They are about organizational design.

The interesting problems in a serverless system at scale do not announce themselves as serverless problems. They announce themselves as pager alerts at 3 AM, as data inconsistencies discovered by a customer three weeks after they occurred, as deployment pipelines that work perfectly in staging and fail silently in production.

When we were running a Nordic telematics platform at 200,000 messages per second across 1.6 million vehicles, the operational challenges had almost nothing to do with Lambda’s execution model. They had to do with observability, organizational design, and the specific ways that distributed systems fail when engineers treat their own assumptions as universal.

Cold starts are not the problem you are waiting for

Every enterprise audience I have talked to about serverless asks about cold starts within the first five minutes. It is a question that comes from reading, not from operating.

In the telematics deployment, cold starts never showed up in an incident post-mortem. Not once. The problems that showed up were: a downstream service changing its payload schema with a minor version bump that we had not contracted against; a dead-letter queue that filled silently because the alert threshold had been set too high during initial setup and never reviewed; a retry loop that masked a persistent error for eleven hours before the cumulative volume triggered an alert.
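The dead-letter queue and retry-loop failures share a shape: the alert was keyed on volume, and volume arrived late. As a rough sketch of one alternative, the rule below also pages on the age of the oldest message, so a depth threshold that was set too high is no longer the only line of defense. The QueueSample type, the should_page function, and the default thresholds are illustrative assumptions, not what the telematics platform actually ran.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One hypothetical metrics sample for a dead-letter queue."""
    depth: int                 # messages currently in the queue
    oldest_age_seconds: float  # age of the oldest message, 0 if empty


def should_page(sample: QueueSample,
                max_depth: int = 0,
                max_age_seconds: float = 900.0) -> bool:
    """Page when the queue is too deep or any message has sat too long.

    The defaults are illustrative: any message at all pages, and a message
    older than 15 minutes pages even if a higher depth limit was configured.
    """
    if sample.depth > max_depth:
        return True
    return sample.depth > 0 and sample.oldest_age_seconds > max_age_seconds
```

With max_depth raised to 1000, a sample of three messages that are an hour old still pages, which is the case a volume-only threshold misses.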

None of these are exotic. All of them are preventable. None of them are about cold starts.

The architecture mirrors the org chart, accurately

Conway’s Law is not a warning. It is a description. In a serverless system, where every function boundary is a deployment unit and every event topic is a team contract, the organizational structure becomes visible in the architecture in a way that is harder to obscure than in a monolith.

The telematics platform had three upstream teams contributing to the same event bus. Two of them had alignment meetings. One of them had a different manager who sat in a different part of the building. The function boundary between the two aligned teams was clean and well-documented. The boundary between those two and the third team was the one that caused every cross-team incident.

Reorganizing the event bus topology would not have fixed this. Reorganizing the team reporting structure might have. The architectural conversation was a proxy for an organizational conversation that nobody wanted to have.

Events require more discipline, not less

Asynchronous architectures are sometimes sold as simpler than synchronous ones. They are not simpler. They are different, and the failure modes are harder to see until you have seen them.

In a synchronous system, when a request fails, the failure is usually visible immediately: an error response, a timeout, a circuit breaker opening. The feedback loop is tight.

In an event-driven system, messages can be silently dropped, silently retried, or silently delivered out of order. Data inconsistencies can accumulate over hours before any downstream effect becomes visible. A consumer that processes events non-idempotently can corrupt state across hundreds of records before anyone notices.
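To make the idempotency point concrete, here is a minimal sketch of a handler that tolerates redelivery by remembering which message IDs it has already applied. The OdometerStore class, the event fields (message_id, vehicle_id, km), and the in-memory storage are all hypothetical; a real consumer would keep the deduplication record in the same durable store as the state it protects.

```python
from dataclasses import dataclass, field


@dataclass
class OdometerStore:
    """Hypothetical in-memory store; a real one would be a database."""
    totals: dict[str, float] = field(default_factory=dict)
    seen_message_ids: set[str] = field(default_factory=set)


def handle_distance_event(store: OdometerStore, event: dict) -> bool:
    """Apply a distance increment at most once per message_id.

    Returns True if the event changed state, False if it was a duplicate.
    """
    message_id = event["message_id"]
    if message_id in store.seen_message_ids:
        return False  # redelivery: acknowledge without touching state

    vehicle = event["vehicle_id"]
    store.totals[vehicle] = store.totals.get(vehicle, 0.0) + event["km"]
    # In a durable store, this write and the update above belong in one
    # transaction, otherwise a crash between them reintroduces the bug.
    store.seen_message_ids.add(message_id)
    return True
```

Without the seen_message_ids check, every retry of the same event would add the distance again, which is exactly the kind of quiet corruption that accumulates across hundreds of records.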

The discipline required is specific: dead-letter queues with alerting, not just monitoring. Retry policies with jitter and maximum retry counts that are set deliberately, not copied from a tutorial. Idempotency enforced at the message handler level, not assumed. Schema contracts versioned and linted in CI, not documented in a Confluence page nobody reads.
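A retry policy with jitter and a deliberate ceiling can be small. The sketch below uses exponential backoff with full jitter and hands the message to a dead-letter callback once the attempts run out. The function names, the default base, cap, and attempt count, and the callback signatures are assumptions chosen for illustration; the right numbers depend on the workload.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def process_with_retries(message: dict,
                         handler: Callable[[dict], None],
                         dead_letter: Callable[[dict, Exception], None],
                         max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try the handler up to max_attempts times, then dead-letter the message.

    Returns True on success, False if the message was dead-lettered.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:  # sketch only; catch narrower errors in real code
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))

    # Attempts exhausted: stop retrying and make the failure visible.
    dead_letter(message, last_error)
    return False
```

The bounded loop is the point: an unbounded retry is how a persistent error stays hidden for hours, while a capped one turns it into a dead-letter entry that an alert can see.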

These are not optional enhancements. They are prerequisites for running events reliably in production.

The real cost is cognitive, and it compounds

At the scale we were running, the Lambda compute cost was genuinely lower than that of any comparable containerized setup. The total cost of ownership was not.

Each function is a deployment unit, a monitoring surface, and a conceptual boundary. At 200 functions deployed across three environments, the cognitive overhead of understanding the system’s current state becomes non-trivial. New engineers joining the team spent weeks before they could confidently trace a request path from ingestion to storage.

The mitigation is not architectural. It is cultural: strong conventions enforced through tooling, shared libraries for the patterns that repeat (telemetry, error handling, schema validation), and ruthless resistance to adding new functions when an existing function can be extended.

Freedom without structure produces a system that only its original authors can understand. Structure without freedom produces a system that cannot adapt. The balance is negotiated differently for every team and every workload, which is why “start with serverless” is the wrong advice. Start with understanding what you are optimizing for.

What I tell teams at the beginning

Get your observability stack right before you have your first incident, not after. Structured logging, distributed tracing, and alerting on dead-letter queue depth are not nice-to-haves. They are the instruments you need to fly in the dark.
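As one small illustration of what structured logging means in practice, the sketch below uses Python's standard logging module to emit one JSON object per line, carrying a trace ID that lets a request be followed across functions. The JsonFormatter class and the trace_id and component field names are assumptions made for this example, not a required convention.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "component": getattr(record, "component", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger("ingest").info("stored position", extra={"trace_id": "t-123", "component": "ingest"}) then produces a line that can be filtered by trace_id, which is what makes a request path traceable from ingestion to storage.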

Define your event schemas before writing business logic. A schema registry is not premature optimization at the scale you are planning for. It is the thing that prevents your third team boundary from becoming your biggest operational risk.
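A schema contract only helps if something checks it. As a minimal sketch of the kind of check that could run in CI, the function below compares two versions of a schema and reports changes that would break a consumer written against the older one. The flat field-name-to-type representation and the breaking_changes function are hypothetical simplifications; a real schema registry applies richer compatibility rules.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes in `new` that would break a consumer written against `old`.

    Schemas here are hypothetical flat maps of field name -> type name.
    Added fields are treated as compatible; removals and type changes are not.
    """
    problems = []
    for name, old_type in sorted(old.items()):
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name]}")
    return problems
```

A build that fails whenever this list is non-empty turns a quiet payload change in a minor version bump into a conversation between teams before deployment rather than an incident after it.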

And accept that serverless does not eliminate complexity. It redistributes it from the infrastructure layer to the application layer, where it is harder to see but no less consequential.