How Batch Push to SQS Can Still Melt Your Consumers

4 Apr 2022 · 4 min read

At first glance, batching with Amazon Simple Queue Service (SQS) looks like the perfect optimization.

  • Send 10 messages in one API call.

  • Reduce network overhead.

  • Increase throughput.

  • Lower cost.

So teams aggressively batch-produce messages into Amazon Simple Queue Service queues.

Everything works beautifully in staging.

Then production traffic arrives.

Suddenly:

  • Consumers start slowing down.

  • Delete APIs begin throttling.

  • Messages reappear.

  • Duplicate processing starts happening.

  • Queue depth explodes.

And the surprising part?

The bottleneck was not sending messages. It was deleting them.


The Architecture Most Teams Build

A common architecture looks like this:

Producer Service
      |
      | SendMessageBatch (10 msgs/request)
      v
+-------------------+
|       SQS         |
+-------------------+
      |
      v
Consumer Workers
      |
      | process each message
      |
      | DeleteMessage
      v
SQS Acknowledgement

The producer uses:

SendMessageBatch

which sends up to 10 messages per request.

This is efficient.

But many teams unknowingly do this on the consumer side:

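for (Message msg : messages) {
    process(msg);
    // One DeleteMessage call per message (AWS SDK for Java v2 style; queueUrl is defined elsewhere)
    sqs.deleteMessage(r -> r.queueUrl(queueUrl).receiptHandle(msg.receiptHandle()));
}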

That means:

  • 1 batch receive call

  • BUT 10 individual delete calls

This becomes dangerous at scale.


The Hidden Problem

Imagine this traffic:

Metric                  Value
----------------------  -------
Messages/sec            100,000
Batch size              10
Receive requests/sec    10,000
Delete requests/sec     100,000

Even though producers optimized API usage with batching, consumers accidentally multiplied API traffic again during deletion.
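
The numbers in the table follow from one ceiling division. Here is a minimal sketch of that arithmetic; the class and method names are made up for illustration and are not part of any SDK.

```java
final class SqsRequestMath {
    /** API requests per second needed to move messagesPerSec at batchSize messages per request. */
    static long requestsPerSec(long messagesPerSec, int batchSize) {
        if (batchSize < 1 || batchSize > 10) {
            throw new IllegalArgumentException("SQS batch size must be between 1 and 10");
        }
        return (messagesPerSec + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        long messagesPerSec = 100_000;
        System.out.println("receive, batch of 10: " + requestsPerSec(messagesPerSec, 10)); // 10000
        System.out.println("delete, one by one:   " + requestsPerSec(messagesPerSec, 1));  // 100000
        System.out.println("delete, batch of 10:  " + requestsPerSec(messagesPerSec, 10)); // 10000
    }
}
```

The receive path and the delete path handle the same messages. Only the batch size differs, and that alone accounts for the 10x gap.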

Now your system performs:

  • 10x more delete requests

  • Higher TCP overhead

  • More AWS API throttling

  • Increased latency

  • Retry storms

The queue itself remains healthy.

But the acknowledgement path collapses.


What Happens During Throttling

When DeleteMessage gets throttled (a particular risk on FIFO queues, which enforce per-API request quotas):

Consumer processed message successfully
        |
DeleteMessage failed
        |
Visibility timeout expires
        |
Message becomes visible again
        |
Another consumer reprocesses it

Now duplicate processing begins.

This creates secondary problems:

  • Duplicate payments

  • Duplicate emails

  • Double inventory deduction

  • Repeated notifications

  • Idempotency pressure on downstream systems

The real issue was never SQS delivery.

It was acknowledgement scalability.
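
Redelivery can never be ruled out entirely, so downstream handlers usually need a guard of their own. Below is a minimal sketch of an idempotent handler. It assumes the SQS message ID is a good enough deduplication key (a business key such as an order ID may be a better fit) and it keeps state in memory only. `IdempotentHandler` is a hypothetical name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class IdempotentHandler {
    // In-memory only: a real deployment would need a shared store with expiry.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<String> sideEffect;

    IdempotentHandler(Consumer<String> sideEffect) {
        this.sideEffect = sideEffect;
    }

    /** Runs the side effect at most once per message ID; returns false for a redelivery. */
    boolean handle(String messageId, String body) {
        if (!seen.add(messageId)) {
            return false; // duplicate delivery: skip the side effect
        }
        try {
            sideEffect.accept(body);
            return true;
        } catch (RuntimeException ex) {
            seen.remove(messageId); // allow a retry after a failed attempt
            throw ex;
        }
    }
}
```

A duplicate that returns false can still be deleted from the queue, since its work has already been done.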


The Dangerous Feedback Loop

This creates a nasty feedback cycle.

Delete throttling
      ↓
Messages reappear
      ↓
Consumers receive more messages
      ↓
More delete attempts
      ↓
Even more throttling

Eventually:

  • Consumer CPU spikes

  • Retry queues grow

  • Visibility timeout tuning becomes unstable

  • DLQs start filling

Teams often incorrectly blame:

  • SQS

  • AWS networking

  • Consumer autoscaling

  • Visibility timeout settings

But the root cause is usually:

Individual deletes after batched receives.
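
A toy model shows how quickly the acknowledgement backlog grows. It assumes a fixed ceiling on delete requests per second. The 50,000 figure is hypothetical and is not an SQS quota. The model does not simulate redelivery: every message left pending here is one that would reappear once its visibility timeout expires.

```java
final class AckBacklogModel {
    /** Messages processed but not yet deleted after the given number of seconds. */
    static long pendingDeletes(long processedPerSec, long deleteRequestLimitPerSec,
                               int messagesPerDeleteRequest, int seconds) {
        long pending = 0;
        for (int t = 0; t < seconds; t++) {
            pending += processedPerSec; // newly processed, awaiting acknowledgement
            long capacity = deleteRequestLimitPerSec * messagesPerDeleteRequest;
            pending -= Math.min(pending, capacity); // acknowledge as many as the ceiling allows
        }
        return pending;
    }

    public static void main(String[] args) {
        long limit = 50_000; // hypothetical ceiling, not an SQS quota
        System.out.println(pendingDeletes(100_000, limit, 1, 10));  // 500000
        System.out.println(pendingDeletes(100_000, limit, 10, 10)); // 0
    }
}
```

With single deletes the backlog grows by 50,000 messages every second. With batches of 10, the same ceiling leaves nothing pending.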


The Correct Design

If you batch receive messages:

ReceiveMessage(MaxNumberOfMessages=10)

you should also batch delete them:

DeleteMessageBatch

Correct architecture:

Producer
   |
SendMessageBatch
   |
   v
SQS
   |
ReceiveMessage (up to 10 messages)
   |
Consumer
   |
Process messages, track successes
   |
DeleteMessageBatch

Now instead of:

100,000 delete requests/sec

you get:

10,000 delete requests/sec

That is a massive reduction.
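
DeleteMessageBatch accepts at most 10 entries per request, so a consumer holding more receipt handles than that has to split them first. Here is a minimal sketch of that step using plain strings for receipt handles. `DeleteBatcher` and `chunk` are illustrative names, not SDK calls.

```java
import java.util.ArrayList;
import java.util.List;

final class DeleteBatcher {
    static final int MAX_BATCH = 10; // DeleteMessageBatch accepts at most 10 entries

    /** Splits receipt handles into consecutive groups of at most MAX_BATCH. */
    static List<List<String>> chunk(List<String> receiptHandles) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < receiptHandles.size(); i += MAX_BATCH) {
            int end = Math.min(i + MAX_BATCH, receiptHandles.size());
            batches.add(new ArrayList<>(receiptHandles.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> handles = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            handles.add("handle-" + i);
        }
        // 25 handles become 3 delete requests instead of 25
        System.out.println(chunk(handles).size()); // 3
    }
}
```

Each inner list then maps to one DeleteMessageBatch request.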


Why This Improves More Than Cost

Most people think batching is only about reducing AWS billing.

But batching also improves:

1. Network Efficiency

Fewer:

  • TLS handshakes (when connections are not reused)

  • TCP packets

  • HTTP requests

2. Better Consumer Throughput

Workers spend less time waiting on acknowledgement APIs.

3. Lower Retry Amplification

With a tenth of the delete requests, throttling becomes far less likely, so fewer retries pile up.

4. More Stable Visibility Timeouts

Messages are acknowledged faster and more consistently.

5. Better Horizontal Scaling

Consumers can scale without overwhelming SQS APIs.


The Production-Grade Consumer Pattern

A robust consumer flow usually looks like this:

1. Receive batch of messages
2. Process in parallel
3. Track successful messages
4. Batch delete only successful ones
5. Retry failed messages later

Pseudo-flow:

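// Assumes AWS SDK for Java v2; sqs is an SqsClient and queueUrl is defined elsewhere.
List<DeleteMessageBatchRequestEntry> successful = new ArrayList<>();

for (Message msg : messages) {
    try {
        process(msg);
        // Entry IDs only need to be unique within one batch request
        successful.add(DeleteMessageBatchRequestEntry.builder()
                .id(msg.messageId())
                .receiptHandle(msg.receiptHandle())
                .build());
    } catch (Exception ex) {
        // Leave the message undeleted so it becomes visible again and is retried
        log.error("processing failed for message {}", msg.messageId(), ex);
    }
}

if (!successful.isEmpty()) { // an empty batch request is rejected
    DeleteMessageBatchResponse response =
            sqs.deleteMessageBatch(r -> r.queueUrl(queueUrl).entries(successful));
    // A batch call can partially fail, so check the per-entry failures
    response.failed().forEach(f -> log.warn("delete failed for {}: {}", f.id(), f.message()));
}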

This avoids:

  • deleting failed messages

  • unnecessary retries

  • excessive API calls
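
Step 2 of the flow, processing in parallel, can be sketched with the standard library alone. The sketch assumes message bodies are plain strings and that a handler signals failure by throwing. `ParallelBatchProcessor` is a hypothetical name. A long-running consumer would reuse one thread pool rather than create one per batch as this does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelBatchProcessor {
    /** Processes bodies concurrently and returns the indexes that succeeded. */
    static List<Integer> processAll(List<String> bodies, Consumer<String> handler) {
        List<Integer> succeeded = new ArrayList<>();
        if (bodies.isEmpty()) {
            return succeeded;
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(10, bodies.size()));
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String body : bodies) {
                tasks.add(() -> { handler.accept(body); return null; });
            }
            // invokeAll waits for every task and returns futures in task order
            List<Future<Void>> results = pool.invokeAll(tasks);
            for (int i = 0; i < results.size(); i++) {
                try {
                    results.get(i).get();
                    succeeded.add(i);
                } catch (ExecutionException ex) {
                    // Handler threw: leave this message undeleted so it is redelivered
                }
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // preserve the interrupt, report what finished
        } finally {
            pool.shutdown();
        }
        return succeeded;
    }
}
```

Only the returned indexes should feed the batch delete. Everything else stays on the queue and comes back after the visibility timeout.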


Another Common Mistake

Some systems do this:

Receive 10 messages
Process 1 message
Immediately delete 1 message
Repeat

This destroys batching benefits entirely.

Instead:

  • accumulate acknowledgements

  • flush periodically

  • batch deletes intelligently

Many high-scale systems maintain:

  • in-memory delete buffers

  • timed flush intervals

  • max batch thresholds

This is much like how Kafka producers batch writes internally, using a size threshold (batch.size) and a time threshold (linger.ms).
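
A delete buffer combining those three pieces might look like the sketch below. It is a sketch only: `DeleteBuffer` is a hypothetical class, and `batchDeleter` is a stand-in for whatever code issues one DeleteMessageBatch call. The flush interval is assumed to be much shorter than the queue's visibility timeout, otherwise buffered messages reappear before they are acknowledged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class DeleteBuffer implements AutoCloseable {
    private static final int MAX_BATCH = 10; // max batch threshold
    private final List<String> pending = new ArrayList<>(); // in-memory delete buffer
    private final Consumer<List<String>> batchDeleter; // hypothetical: wraps one DeleteMessageBatch call
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delete-buffer-flush");
                t.setDaemon(true); // never keep the JVM alive just for flushing
                return t;
            });

    DeleteBuffer(Consumer<List<String>> batchDeleter, long flushIntervalMillis) {
        this.batchDeleter = batchDeleter;
        timer.scheduleWithFixedDelay(() -> {
            try {
                flush();
            } catch (RuntimeException ex) {
                // Keep the timer alive; undeleted messages reappear after the visibility timeout
            }
        }, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Queues a receipt handle and flushes immediately once a full batch is buffered. */
    void add(String receiptHandle) {
        List<String> full = null;
        synchronized (pending) {
            pending.add(receiptHandle);
            if (pending.size() >= MAX_BATCH) {
                full = drain();
            }
        }
        if (full != null) {
            batchDeleter.accept(full); // call outside the lock
        }
    }

    /** Sends whatever is buffered, even if the batch is not full. */
    void flush() {
        List<String> batch;
        synchronized (pending) {
            batch = drain();
        }
        if (!batch.isEmpty()) {
            batchDeleter.accept(batch);
        }
    }

    private List<String> drain() { // caller must hold the lock on pending
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }

    @Override
    public void close() {
        timer.shutdown();
        flush(); // do not strand acknowledgements on shutdown
    }
}
```

The size threshold keeps requests full under load, and the timer bounds how long an acknowledgement can wait when traffic is light.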


Real-World Scaling Insight

At large scale, queue systems are rarely bottlenecked by:

  • enqueue throughput

  • storage

  • message delivery

They are bottlenecked by:

  • acknowledgements

  • retries

  • visibility timeout churn

  • duplicate processing amplification

The “delete path” becomes the real scalability limit.

That is why production-grade messaging systems obsess over:

  • batch acknowledgements

  • offset commits

  • checkpointing

  • ack aggregation

Apache Kafka takes the same approach: a consumer commits a single offset per partition, which acknowledges every message before it at once.


Final Takeaway

Batching only the producer side is half an optimization.

If you:

  • batch send

  • batch receive

  • BUT individually delete

then your architecture still behaves like a high-request-rate system.

The real optimization comes when the entire pipeline becomes batch-aware:

Batch Produce
    ↓
Batch Consume
    ↓
Batch Acknowledge

In distributed systems, the slowest path is often not processing.

It is coordination.

And in SQS-based systems, deletion is coordination.