Introduction
Apache Kafka is designed for high availability, but failures still happen—network issues, broker crashes, or cluster downtime. To ensure message delivery, applications must implement retry mechanisms. However, retries behave differently in traditional (blocking) vs. reactive (non-blocking) Kafka producers.
This guide covers:
✅ Kafka’s built-in retries (retries
, retry.backoff.ms
)
✅ Blocking vs. non-blocking retry strategies
✅ Reactive Kafka retries with backoff
✅ Fallback strategies for guaranteed delivery
✅ Real-world failure scenarios and fixes
1. Kafka Producer Retry Basics
When Do Retries Happen?
Kafka producers automatically retry on:
Network errors (e.g., broker disconnect)
Leader election (e.g., broker restart)
Temporary errors (e.g.,
NOT_ENOUGH_REPLICAS
)
Key Configuration Properties
Property | Default | Description |
---|---|---|
retries | 0 | Number of retries for transient failures. |
retry.backoff.ms | 100 | Delay (ms) between retries. |
delivery.timeout.ms | 120000 | Total timeout for a send (including retries). |
max.block.ms | 60000 | Max time to wait for metadata or buffer full. |
2. Blocking (Traditional) Kafka Producer
Example Configuration
spring:
kafka:
producer:
bootstrap-servers: localhost:9092
retries: 3
retry.backoff.ms: 1000
max.block.ms: 2000 # Fail fast if Kafka is down
Behavior
Initial send → Fails if Kafka is unreachable.
Kafka retries 3 times (1s delay between attempts).
If still failing:
Throws
TimeoutException
orKafkaException
.
Problem: Thread Blocking
The
KafkaTemplate.send()
blocks the thread until:The message is acknowledged, or
delivery.timeout.ms
is reached.
Risk: Thread starvation in high-load scenarios.
Workaround: Async Send with Callbacks
@Autowired
private KafkaTemplate<String, String> kafkaTemplate;
public void sendMessage(String topic, String message) {
ListenableFuture<SendResult<String, String>> future =
kafkaTemplate.send(topic, message);
future.addCallback(
result -> log.info("Sent: {}", result),
ex -> log.error("Failed: {}", ex.getMessage())
);
}
Pros: Non-blocking.
Cons: No built-in retry control.
3. Reactive Kafka Producer
Example Configuration
spring:
kafka:
producer:
bootstrap-servers: localhost:9092
retries: 3
retry.backoff.ms: 1000
max.block.ms: 2000
Retry + Fallback Code
@Autowired
private ReactiveKafkaProducerTemplate<String, String> reactiveKafkaProducer;
public Mono<Void> sendMessage(String topic, String message) {
return reactiveKafkaProducer.send(topic, message)
.retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))
.onErrorResume(e -> {
log.error("Failed after retries: {}", e.getMessage());
return storeInFallback(topic, message); // Fallback to DB
});
}
private Mono<Void> storeInFallback(String topic, String message) {
// Save to DB/Redis for later reprocessing
return Mono.fromRunnable(() -> log.warn("Stored in fallback: {}", message));
}
Key Advantages
✅ Non-blocking: No thread starvation.
✅ Exponential backoff: Smarter retry delays.
✅ Fallback integration: Store failed messages safely.
4. Real-World Failure Scenarios
Case 1: Kafka Cluster Down for 10 Seconds
Approach | Behavior |
---|---|
Blocking | Retries 3 times (1s delay), then throws exception. Thread blocked. |
Reactive | Retries 3 times (Kafka) + 3 times (Reactive), then falls back to DB. |
Case 2: Network Partition (50% Packet Loss)
Approach | Behavior |
---|---|
Blocking | Some messages succeed; others fail after retries. |
Reactive | Automatically retries + applies backpressure. |
Case 3: Broker Overload (High Latency)
Approach | Behavior |
---|---|
Blocking | Threads stuck waiting for acks. |
Reactive | Queues requests without blocking. |
5. Best Practices
For Blocking Producers
Use
max.block.ms=2000
to avoid long blocks.Prefer async
ListenableFuture
overget()
.Log failures for monitoring.
For Reactive Producers
Use
retryWhen
for application-level retries.Always implement a fallback (e.g., DB, dead-letter queue).
Monitor reactive streams for backpressure.
Universal Rules
Set
delivery.timeout.ms
≥retries * retry.backoff.ms
.Avoid
retries = Integer.MAX_VALUE
(could retry forever).Use idempotent producers (
enable.idempotence=true
) to avoid duplicates.
Conclusion
Blocking Kafka: Simple but risks thread starvation. Use async callbacks.
Reactive Kafka: Resilient, non-blocking, and scalable.
Always combine:
Kafka’s low-level retries (
retries
) +Application-level retries (
retryWhen
) +Fallback storage.
Comments
Post a Comment