📘 Capacity Planning Is Harder Than It Looks

3 Apr 2022 · 3 min read

Why “we’re under 50% CPU” is a dangerous lie

Capacity planning is not about how fast your system is.
It’s about how it fails under load.

1️⃣ The Most Common (Wrong) Mental Model

❌ What teams think

“Our service can handle 10k RPS.”

✅ Reality

Capacity is work per second, not requests per second.

Two requests are never equal.

2️⃣ The First Big Mistake — RPS ≠ Load

Example

Request	CPU Cost
GET /profile	1 ms
GET /feed	40 ms
POST /checkout	120 ms

At 1,000 RPS:

All profiles → fine
All checkout → system dead

Traffic shape matters more than traffic volume.
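To make this concrete, the per-request costs from the table can be turned into total CPU demand. This is only a sketch: `cpuCostMs` and `coresNeeded` are hypothetical names, and the costs are the example figures above rather than measurements.

```javascript
// CPU cost per request in milliseconds, copied from the example table.
const cpuCostMs = {
  "GET /profile": 1,
  "GET /feed": 40,
  "POST /checkout": 120,
};

// Returns how many fully busy cores are needed to serve `rps` requests per
// second, given a traffic mix expressed as fractions that sum to 1.
function coresNeeded(rps, mix) {
  let cpuMsPerSecond = 0;
  for (const [route, share] of Object.entries(mix)) {
    cpuMsPerSecond += rps * share * cpuCostMs[route];
  }
  return cpuMsPerSecond / 1000; // 1,000 ms of CPU time = one busy core
}

console.log(coresNeeded(1000, { "GET /profile": 1 }));   // 1
console.log(coresNeeded(1000, { "POST /checkout": 1 })); // 120
```

Same 1,000 RPS, but one mix fits on a single core and the other needs 120 of them.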

3️⃣ The Real Unit of Capacity: CPU-Seconds

Key Rule

A system with N cores has N CPU-seconds per second.

Example:

8 cores
Each core ≈ 1 CPU-second / second

Total budget:

8 CPU-seconds / second

If one request costs:

50 ms CPU = 0.05 CPU-seconds

Max sustainable throughput:

8 / 0.05 = 160 RPS

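Async I/O lets a core do other work while a request waits, but it does not make CPU-bound work cheaper. No amount of async raises this ceiling.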
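The same arithmetic as a small helper. `maxSustainableRps` is a hypothetical name, and the model assumes CPU-bound requests that spread evenly across all cores, which real systems only approximate.

```javascript
// Upper bound on throughput when CPU is the only constraint.
function maxSustainableRps(cores, cpuMsPerRequest) {
  // Each core supplies 1,000 ms of CPU time per wall-clock second.
  return (cores * 1000) / cpuMsPerRequest;
}

console.log(maxSustainableRps(8, 50));  // 160
console.log(maxSustainableRps(8, 100)); // 80
```

Doubling the CPU cost of a request halves the ceiling, regardless of how the request is scheduled.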

4️⃣ Little’s Law (The Law Everyone Ignores)

Concurrency = Throughput × Latency

If:

Throughput = 200 RPS
Latency = 500 ms

Then:

Concurrency = 100 in-flight requests

Latency up 4× at the same throughput → 4× the in-flight requests → 4× the memory they hold.
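A quick way to play with Little's Law. The function names and the 512 KB per-request footprint are assumptions for illustration; measure your own.

```javascript
// Little's Law: average in-flight requests = throughput × average latency.
function inFlight(throughputRps, latencyMs) {
  return (throughputRps * latencyMs) / 1000;
}

// Rough memory held by in-flight requests, given a per-request footprint in KB.
function memoryMb(concurrency, kbPerRequest) {
  return (concurrency * kbPerRequest) / 1024;
}

const normal = inFlight(200, 500);    // 100 in-flight requests
const degraded = inFlight(200, 2000); // 400 in-flight requests

console.log(normal, memoryMb(normal, 512));     // 100 50
console.log(degraded, memoryMb(degraded, 512)); // 400 200
```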

5️⃣ Why “Headroom” Doesn’t Save You

❌ Common rule

“Keep CPU under 70%”

Why this fails

GC pauses
Kernel scheduling
Cache misses
Lock contention

At roughly 70–80% CPU, many systems see:

Context switching skyrockets
Tail latency spikes
Throughput drops

Systems don’t degrade linearly.

6️⃣ The Knee of the Curve (Critical Insight)

Every system has a knee point:

Before knee → stable
After knee → latency explodes
Slight load increase → collapse
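One way to see why the knee exists is a textbook queueing model. The sketch below uses the M/M/1 formula (single server, random arrivals, exponentially distributed service times), which is an assumption brought in for illustration and not a description of any particular system. Multi-core services follow a different curve with a later knee, but the shape is similar.

```javascript
// Mean response time in an M/M/1 queue: serviceTime / (1 - utilization).
function meanLatencyMs(serviceTimeMs, utilization) {
  if (utilization >= 1) return Infinity; // the queue grows without bound
  return serviceTimeMs / (1 - utilization);
}

// Hypothetical 50 ms service time at increasing utilization.
for (const u of [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]) {
  const latency = Math.round(meanLatencyMs(50, u));
  console.log(`${Math.round(u * 100)}% busy -> ~${latency} ms`);
}
```

Under this model a 50 ms request averages about 100 ms at 50% utilization, about 500 ms at 90%, and about 5,000 ms at 99%. The last few percent of utilization cost far more than the first fifty.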

7️⃣ Capacity Planning Mistake #2 — Ignoring Variance

Even if average load is safe:

Slow (P95+) requests
Cache misses
Cold shards
Slow disks

…will dominate P99.

Capacity planning must consider worst-case paths, not averages.
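A small illustration of how an average hides the tail. The sample data is made up (98 fast requests and 2 slow ones), and `percentile` uses the simple nearest-rank method, one of several common definitions.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile; `p` is an integer percentage such as 99.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical sample: 98 requests at 10 ms, 2 requests at 1,000 ms.
const latenciesMs = [...Array(98).fill(10), ...Array(2).fill(1000)];

console.log(mean(latenciesMs));           // 29.8
console.log(percentile(latenciesMs, 50)); // 10
console.log(percentile(latenciesMs, 99)); // 1000
```

A plan sized for the 29.8 ms average is off by more than 30× for the requests that define P99.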

8️⃣ Async, Queues, and the Capacity Illusion

❌ “We added async + a queue”

What actually happened:

Requests stopped blocking
In-flight count increased
Memory usage spiked
Tail latency worsened
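The reason is simple arithmetic: a queue stores work, it does not perform it. The sketch below assumes constant arrival and service rates and an unbounded queue. `queueAfter` is a hypothetical helper, and the rates are example numbers.

```javascript
// Backlog and queueing delay when arrivals exceed what the workers can serve.
function queueAfter(seconds, arrivalRps, serviceRps) {
  const backlog = Math.max(0, (arrivalRps - serviceRps) * seconds);
  // A request arriving now waits for the whole backlog to drain first.
  return { backlog, waitSeconds: backlog / serviceRps };
}

console.log(queueAfter(60, 200, 160)); // { backlog: 2400, waitSeconds: 15 }
console.log(queueAfter(60, 150, 160)); // { backlog: 0, waitSeconds: 0 }
```

One minute at 200 RPS against a 160 RPS ceiling leaves 2,400 requests queued and a 15-second wait for anything new.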

9️⃣ The Only Capacity That Matters: Bottlenecks

Your system’s capacity is the minimum of:

CPU
Memory
DB connections
Disk IOPS
Network
External dependencies

You don’t scale a system.
You scale its tightest bottleneck.
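Expressed as code, the system limit is just a minimum over per-resource limits. The resource names and numbers below are hypothetical placeholders. In practice each one comes from measurement.

```javascript
// Hypothetical per-resource ceilings, each expressed in requests per second.
const limitsRps = { cpu: 160, dbConnections: 2000, diskIops: 900, network: 5000 };

// Returns [resourceName, rps] for the tightest constraint.
function bottleneck(limits) {
  return Object.entries(limits).reduce((min, cur) => (cur[1] < min[1] ? cur : min));
}

console.log(bottleneck(limitsRps));                      // [ 'cpu', 160 ]
console.log(bottleneck({ ...limitsRps, cpu: 1600 }));    // [ 'diskIops', 900 ]
```

Giving the service 10× the CPU raises capacity to 900 RPS, not 1,600, because disk becomes the new bottleneck.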

1️⃣0️⃣ Practical Capacity Model (Simple & Useful)

For each critical dependency, compute:

Capacity = (Resource Budget) / (Cost per request)

Example:

DB connections = 100
Avg query time = 50 ms

Max DB RPS:

100 / 0.05 s = 2,000 queries per second (Little’s Law, rearranged)

Now apply:

Cache miss rate
Retries
Fanout

Real capacity is much lower.
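A rough way to apply those factors. The model and every parameter value here are assumptions for illustration: each user request is taken to issue `queriesPerRequest` lookups, a fraction `cacheMissRate` of them reach the database, and retries multiply the result by `retryFactor`.

```javascript
// Raw database ceiling in queries per second.
function dbQueriesPerSecond(connections, avgQueryMs) {
  return (connections * 1000) / avgQueryMs;
}

// User-facing ceiling once fanout, cache misses, and retries are applied.
function userRequestsPerSecond(dbQps, { queriesPerRequest, cacheMissRate, retryFactor }) {
  const dbQueriesPerUserRequest = queriesPerRequest * cacheMissRate * retryFactor;
  return dbQps / dbQueriesPerUserRequest;
}

const dbQps = dbQueriesPerSecond(100, 50); // 2000

// Warm cache: half of the lookups reach the database.
console.log(userRequestsPerSecond(dbQps, { queriesPerRequest: 4, cacheMissRate: 0.5, retryFactor: 1.25 })); // 800

// Cold cache: every lookup reaches the database.
console.log(userRequestsPerSecond(dbQps, { queriesPerRequest: 4, cacheMissRate: 1, retryFactor: 1.25 })); // 400
```

With these example numbers the 2,000 RPS headline shrinks to 800 user requests per second, and to 400 right after a cache flush.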

1️⃣1️⃣ Why Load Tests Lie

Load tests often:

Use uniform traffic
Ignore cache warmup
Ignore retries
Ignore long tails

Result:

“It handled 5k RPS in staging!” Production traffic then breaks it well below that.

1️⃣2️⃣ What Good Capacity Planning Actually Looks Like

✅ Principles

Plan for P99 cost, not average
Assume partial failure
Model fanout
Enforce hard limits
Fail fast beyond capacity

1️⃣3️⃣ Capacity Without Limits Is Fiction

❌ No limit

app.get("/data", async (req, res) => {
  res.send(await work());
});

✅ Capacity-aware

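// Assumes inFlight is incremented when a request starts and decremented when it finishes.
if (inFlight >= MAX) {
  // Shed load immediately instead of queueing beyond capacity.
  return res.status(503).send("Over capacity");
}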

Capacity only exists if it’s enforced.
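For a version that runs end to end, here is a minimal, framework-free sketch of the same idea. `withLimit` is a hypothetical helper, the `work` function is a stand-in that just waits 50 ms, and the result objects only imitate HTTP responses. In an Express handler the same counter logic would wrap `res.send`.

```javascript
// Wraps an async handler so that at most `max` calls run at once.
// Calls beyond the limit are rejected immediately instead of being queued.
function withLimit(max, handler) {
  let inFlight = 0;
  return async function limited(...args) {
    if (inFlight >= max) {
      return { status: 503, body: "Over capacity" };
    }
    inFlight++;
    try {
      return { status: 200, body: await handler(...args) };
    } finally {
      inFlight--; // always release the slot, even if the handler throws
    }
  };
}

// Hypothetical stand-in for work(): resolves after 50 ms.
const work = () => new Promise((resolve) => setTimeout(() => resolve("done"), 50));

async function demo() {
  const limited = withLimit(2, work);
  // Three calls start at the same time; only two slots exist.
  const results = await Promise.all([limited(), limited(), limited()]);
  return results.map((r) => r.status);
}

demo().then((statuses) => console.log(statuses)); // [ 200, 200, 503 ]
```

The third caller gets its answer immediately rather than after a long wait, which is the point: rejection is cheap, queueing is not.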

1️⃣4️⃣ The Dark Truth

Capacity planning is not about preventing overload.
It’s about deciding how you fail.

Fail modes:

Slow failure ❌
Cascading failure ❌
Fast, bounded failure ✅