📘 Clock Skew & Time-Based Bugs
When “now” is different everywhere
In distributed systems, time is a guess.
And guesses break correctness.
1️⃣ The False Assumption (Root of All Time Bugs)
“Time moves forward at the same rate everywhere.”
This is never true in distributed systems.
Reasons:
Clock drift (hardware)
NTP adjustments
VM pauses
GC pauses
Leap seconds
Network delays (limit how precisely clocks can be synced)
Two machines disagreeing by milliseconds is normal.
By seconds is common.
By minutes happens in production.
2️⃣ What Is Clock Skew?
Clock Skew = difference between the readings of clocks on different machines at the same instant.
Machine A: 10:00:01
Machine B: 09:59:58
Both are “correct” locally.
3️⃣ Why Time-Based Logic Is Dangerous
Time is often used to:
Order events
Enforce uniqueness
Expire data
Resolve conflicts
Detect staleness
Every one of these can break under skew.
4️⃣ Bug #1 — Time-Based Ordering Is Wrong
❌ Naive Event Ordering
events.sort((a, b) => a.timestamp - b.timestamp);
What you assume
- Earlier timestamp → happened first
Reality
- Event B may have happened after A but been recorded with an earlier timestamp
Sorting by timestamp lies about causality.
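A minimal sketch of the failure, assuming two hypothetical nodes where node B's clock runs 3 seconds behind node A's. The names `clockA`, `clockB`, and the event labels are invented for illustration.

```javascript
// Hypothetical setup: node B's wall clock is 3 seconds behind node A's.
const trueNow = 1_700_000_000_000;   // "real" time in ms (unknowable in practice)
const clockA = (t) => t;             // node A: accurate
const clockB = (t) => t - 3000;      // node B: 3 s slow

// A happens first on node A; B happens 1 s later on node B, caused by A.
const events = [
  { name: "A: order placed", timestamp: clockA(trueNow) },
  { name: "B: order shipped", timestamp: clockB(trueNow + 1000) },
];

const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
console.log(sorted.map((e) => e.name));
// The effect ("shipped") is sorted before its cause ("placed").
```

A 3-second skew is enough to invert a 1-second causal gap.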
5️⃣ Bug #2 — “Last Write Wins” Loses Writes
❌ LWW Conflict Resolution
if (incoming.timestamp > stored.timestamp) {
  overwrite();
}
Failure Scenario
Node with fast clock overwrites correct data
Node with slow clock loses updates forever
Correctness depends on clock accuracy, which you don’t control.
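A small sketch of the lost update, assuming a hypothetical `lwwMerge` helper and two nodes, one with a clock 5 seconds fast. The values and timestamps are made up.

```javascript
// Hypothetical LWW register: keeps whichever write carries the larger timestamp.
function lwwMerge(stored, incoming) {
  return incoming.timestamp > stored.timestamp ? incoming : stored;
}

// Node X's clock is 5 s fast; node Y's clock is accurate.
// X writes at real time 1_000_000; Y writes 2 s later.
const writeFromX = { value: "old address", timestamp: 1_000_000 + 5000 };
const writeFromY = { value: "new address", timestamp: 1_000_000 + 2000 };

let stored = { value: "initial", timestamp: 0 };
stored = lwwMerge(stored, writeFromX);
stored = lwwMerge(stored, writeFromY); // rejected: its timestamp looks older

console.log(stored.value); // the later write from Y was silently dropped
```

No error is raised anywhere. The newer value simply never lands.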
6️⃣ Bug #3 — Time-Based IDs Are Not Unique
❌ Timestamp-based IDs
const id = Date.now();
Under concurrency:
Same millisecond
Same ID
Collision
Worse across machines:
- Two nodes generate IDs in the same millisecond → duplicates
- Clock jumps backward → duplicates
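A quick sketch that makes the collision visible on a single machine, and contrasts it with clock-independent IDs. Using `randomUUID` from Node's `crypto` module is one possible replacement chosen here for illustration, not something the notes above prescribe.

```javascript
const { randomUUID } = require("node:crypto");

// Timestamp IDs: a tight loop generates many IDs within one millisecond.
const timestampIds = Array.from({ length: 1000 }, () => Date.now());
const uniqueTimestampIds = new Set(timestampIds).size;

// Random (version 4) UUIDs do not depend on the clock at all.
const uuids = Array.from({ length: 1000 }, () => randomUUID());
const uniqueUuids = new Set(uuids).size;

console.log({ uniqueTimestampIds, uniqueUuids });
```

The timestamp set collapses to a handful of distinct values; the UUID set does not.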
7️⃣ Bug #4 — TTL & Expiry Bugs
❌ Local Time Expiry
if (Date.now() > expiresAt) {
  invalidate();
}
Failure Modes
Clock jumps forward → premature expiry
Clock jumps backward → late expiry, stale data served
Different nodes disagree on validity
8️⃣ Bug #5 — Timeouts That Aren’t Comparable
❌ Distributed Timeout Logic
if (Date.now() - start > timeout) {
  fail();
}
If start was recorded on another machine:
Negative elapsed time
Infinite waits
Premature failures
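One common way around this is to send a duration (time remaining) across the wire instead of a start timestamp, and let each machine measure against its own clock. A minimal sketch, assuming hypothetical helpers `remainingBudget`, `localDeadline`, and `isExpired`; it ignores network transit time, which a real system would have to budget for.

```javascript
// Caller: convert "started at" + timeout into time remaining (a duration).
function remainingBudget(startedAt, timeoutMs, now = performance.now()) {
  return Math.max(0, timeoutMs - (now - startedAt));
}

// Receiver: anchor the received budget to its own local clock.
function localDeadline(remainingMs, now = performance.now()) {
  return now + remainingMs;
}

function isExpired(deadline, now = performance.now()) {
  return now >= deadline;
}

// Caller started at t=100 (its clock) with a 1000 ms timeout; it is now t=400.
const budget = remainingBudget(100, 1000, 400);
// Receiver's clock is unrelated (t=5000); only the duration crosses the wire.
const deadline = localDeadline(budget, 5000);
console.log(budget, deadline);
```

The two machines never compare clock readings, so skew between them cannot produce negative elapsed time.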
9️⃣ Why NTP Doesn’t “Fix” This
NTP:
Usually adjusts clocks gradually (slewing)
Sometimes jumps time (stepping) when the offset is large
A step can move the clock backward
NTP reduces skew.
It does not eliminate it.
1️⃣0️⃣ The Golden Rule of Time
Never use wall-clock time to establish ordering or correctness.
Use time only for:
Human display
Logging
Approximate expiry (with tolerance)
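"With tolerance" can be sketched for a lease: the holder gives up early, everyone else takes over late, and the gap in between absorbs skew. `MAX_SKEW_MS` is an assumed bound picked for illustration, and both function names are hypothetical. The scheme is only as safe as that bound.

```javascript
// Assumed upper bound on clock skew between nodes (illustrative value).
const MAX_SKEW_MS = 2000;

// Holder's view: stop trusting the lease before it nominally expires.
function holderMayUse(expiresAt, now = Date.now()) {
  return now < expiresAt - MAX_SKEW_MS;
}

// Everyone else's view: treat the lease as free only after a safety margin.
function othersMayTakeOver(expiresAt, now = Date.now()) {
  return now > expiresAt + MAX_SKEW_MS;
}

const expiresAt = 10_000;
console.log(holderMayUse(expiresAt, 9000), othersMayTakeOver(expiresAt, 9000));
// Inside the tolerance window nobody may act on the lease.
```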
1️⃣1️⃣ Correct Tool #1 — Monotonic Clocks
Monotonic clocks:
Always move forward
Not affected by wall-clock jumps (NTP steps, manual changes)
Local only
✅ Correct Timeout Measurement
const start = performance.now();
// work
if (performance.now() - start > timeoutMs) {
  fail();
}
Monotonic clocks are safe only locally: their readings mean nothing on another machine.
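The same idea extends to in-process expiry. A rough sketch of a local TTL cache that stores deadlines on the monotonic clock, so wall-clock jumps cannot expire entries early or keep them alive. `LocalTtlCache` is a hypothetical class, and the injectable `clock` parameter exists only to make the behavior easy to test.

```javascript
// Minimal in-process TTL cache keyed to a monotonic clock.
class LocalTtlCache {
  constructor(clock = () => performance.now()) {
    this.clock = clock;
    this.entries = new Map();
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, deadline: this.clock() + ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.clock() >= entry.deadline) {
      this.entries.delete(key); // expired: drop it
      return undefined;
    }
    return entry.value;
  }
}

// Fake clock for demonstration; real code would use the default.
let fakeNow = 0;
const cache = new LocalTtlCache(() => fakeNow);
cache.set("session", "abc", 1000);
fakeNow = 999;
const beforeDeadline = cache.get("session");
fakeNow = 1000;
const atDeadline = cache.get("session");
console.log(beforeDeadline, atDeadline);
```

The deadlines are valid only inside this process. They must not be persisted or sent to other nodes.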
1️⃣2️⃣ Correct Tool #2 — Versioning (Not Time)
Replace timestamps with:
Versions
Counters
CAS tokens
UPDATE doc
SET value = ?, version = version + 1
WHERE id = ? AND version = ?;
-- 0 rows updated → someone else wrote first: re-read and retry
Correctness without time.
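The same compare-and-set logic as an in-memory sketch, to show what the version check buys you. The `docs` map and `compareAndSet` function are stand-ins for the table and the UPDATE statement, not a real database API.

```javascript
// In-memory stand-in for the versioned UPDATE statement.
const docs = new Map([["doc-1", { value: "draft", version: 1 }]]);

function compareAndSet(id, expectedVersion, newValue) {
  const doc = docs.get(id);
  // Equivalent to "0 rows updated": missing row or stale version.
  if (!doc || doc.version !== expectedVersion) return false;
  docs.set(id, { value: newValue, version: doc.version + 1 });
  return true;
}

// Two writers both read version 1, then race.
const first = compareAndSet("doc-1", 1, "edit from writer A");
const second = compareAndSet("doc-1", 1, "edit from writer B"); // stale version

console.log(first, second, docs.get("doc-1"));
```

The losing writer is told it lost and can re-read and retry. No clock is consulted at any point.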
1️⃣3️⃣ Correct Tool #3 — Logical Clocks
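Lamport Clock (Simple)
On every local event or send: counter++
On every receive: counter = max(counter, received) + 1
Attach counter to events.
Guarantee:
- If A caused B, then counter(A) < counter(B)
- The reverse does not hold: a smaller counter does not prove causality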
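A minimal Lamport clock sketch. Besides incrementing on local events, a node that receives a message must move its counter past the sender's value. The class and method names here are illustrative.

```javascript
class LamportClock {
  constructor() {
    this.counter = 0;
  }

  // Local event or message send: advance and return the new value.
  tick() {
    this.counter += 1;
    return this.counter;
  }

  // Message receive: jump past the sender's timestamp, then advance.
  receive(remoteCounter) {
    this.counter = Math.max(this.counter, remoteCounter) + 1;
    return this.counter;
  }
}

const a = new LamportClock();
const b = new LamportClock();
a.tick();
a.tick();
const sent = a.tick();             // a sends a message stamped with its counter
const received = b.receive(sent);  // b's counter moves past the sender's
console.log(sent, received);
```

The receive event always gets a larger counter than the send that caused it, regardless of what either machine's wall clock says.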
1️⃣4️⃣ Correct Tool #4 — Vector Clocks (Advanced)
Track causality across nodes.
Use when:
Conflict resolution matters
Order matters
Strong correctness required
Cost:
Metadata size
Complexity
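A compact vector clock sketch, representing each clock as a plain object from node ID to event count. The function names (`increment`, `merge`, `compare`) and node IDs are assumptions made for this example. Unlike a single counter, the comparison can report that two versions are concurrent, which is the case that needs conflict resolution.

```javascript
// Record one more event on nodeId.
function increment(clock, nodeId) {
  return { ...clock, [nodeId]: (clock[nodeId] ?? 0) + 1 };
}

// Combine two clocks by taking the per-node maximum (used on receive).
function merge(a, b) {
  const merged = { ...a };
  for (const [node, count] of Object.entries(b)) {
    merged[node] = Math.max(merged[node] ?? 0, count);
  }
  return merged;
}

// Returns "before", "after", "equal", or "concurrent" (a relative to b).
function compare(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[node] ?? 0;
    const y = b[node] ?? 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

const base = increment({}, "n1");     // { n1: 1 }
const left = increment(base, "n1");   // { n1: 2 }
const right = increment(base, "n2");  // { n1: 1, n2: 1 }
console.log(compare(base, left));     // "before"
console.log(compare(left, right));    // "concurrent"
```

The metadata grows with the number of nodes that have ever written, which is the size cost listed above.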
1️⃣5️⃣ Correct Tool #5 — Server-Assigned Time
If time is needed:
Assign it in one place
Accept bottleneck
INSERT INTO events (created_at)
VALUES (CURRENT_TIMESTAMP);
Centralized time is slow, but consistent: every timestamp comes from the same clock.
1️⃣6️⃣ Why “Time-Based Sharding” Is Risky
logs_2026_10
logs_2026_11
Clock skew →
Writes go to wrong shard
Reads miss data
Data looks lost (it sits in the neighboring shard)
Always add overlap or buffers around shard boundaries.
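One way to add a buffer on the read side is to widen the query window by the expected skew, so a read near a month boundary also checks the neighboring shard. A sketch under assumptions: shard names follow the `logs_YYYY_MM` pattern above in UTC, and `shardFor`, `shardsToQuery`, and the 60-second buffer are hypothetical.

```javascript
// Shard name for a UTC timestamp in ms, following the logs_YYYY_MM pattern.
function shardFor(ms) {
  const d = new Date(ms);
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  return `logs_${d.getUTCFullYear()}_${month}`;
}

// Shards a reader should query for [fromMs, toMs], widened by a skew buffer.
function shardsToQuery(fromMs, toMs, bufferMs) {
  const shards = [];
  const end = toMs + bufferMs;
  // Start at the first instant of the month containing the widened start.
  const cursor = new Date(fromMs - bufferMs);
  cursor.setUTCDate(1);
  cursor.setUTCHours(0, 0, 0, 0);
  while (cursor.getTime() <= end) {
    shards.push(shardFor(cursor.getTime()));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return shards;
}

// Query starting 5 s after midnight UTC on 1 Nov 2026.
const from = Date.UTC(2026, 10, 1, 0, 0, 5);
const to = Date.UTC(2026, 10, 1, 0, 10, 0);
console.log(shardsToQuery(from, to, 0));       // November shard only
console.log(shardsToQuery(from, to, 60_000));  // October and November shards
```

A writer whose clock was a few seconds slow would have put early-November events into the October shard. The buffered read still finds them.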
1️⃣7️⃣ Production Rulebook (Hard-Won)
Never compare timestamps from different machines
Never rely on client time
Never order events by wall-clock time
Use monotonic clocks for durations
Use versions for correctness
Assume clocks go backward
