
Handling Kafka Retries in Spring Boot: Blocking vs. Reactive Approaches


Introduction

Apache Kafka is designed for high availability, but failures still happen—network issues, broker crashes, or cluster downtime. To ensure message delivery, applications must implement retry mechanisms. However, retries behave differently in traditional (blocking) vs. reactive (non-blocking) Kafka producers.

This guide covers:
✅ Kafka’s built-in retries (retries, retry.backoff.ms)
✅ Blocking vs. non-blocking retry strategies
✅ Reactive Kafka retries with backoff
✅ Fallback strategies for guaranteed delivery
✅ Real-world failure scenarios and fixes


1. Kafka Producer Retry Basics

When Do Retries Happen?

Kafka producers automatically retry on:

  • Network errors (e.g., broker disconnect)

  • Leader election (e.g., broker restart)

  • Temporary errors (e.g., NOT_ENOUGH_REPLICAS)

Key Configuration Properties

| Property | Default | Description |
|---|---|---|
| retries | 2147483647 (Kafka ≥ 2.1; older clients: 0) | Number of retries for transient failures. |
| retry.backoff.ms | 100 | Delay (ms) between retries. |
| delivery.timeout.ms | 120000 | Upper bound on the total time for a send, including retries. |
| max.block.ms | 60000 | Max time send() blocks waiting for metadata or buffer space. |
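As a concrete illustration, the retry-related settings above can be bundled into a plain java.util.Properties object. This is a minimal sketch using raw string keys rather than the ProducerConfig constants; the retryProps helper and its sanity check are hypothetical, not part of the Kafka API:

```java
import java.util.Properties;

public class ProducerRetryConfig {

    // Hypothetical helper: bundles the retry-related settings from the
    // table above, checking that the total retry budget fits inside
    // delivery.timeout.ms.
    static Properties retryProps(int retries, long backoffMs, long deliveryTimeoutMs) {
        if ((long) retries * backoffMs >= deliveryTimeoutMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms is too small for retries * retry.backoff.ms");
        }
        Properties props = new Properties();
        props.setProperty("retries", Integer.toString(retries));
        props.setProperty("retry.backoff.ms", Long.toString(backoffMs));
        props.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = retryProps(3, 1000, 120_000);
        System.out.println(props.getProperty("retry.backoff.ms")); // prints 1000
    }
}
```

The same Properties instance can then be handed to a KafkaProducer constructor.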

2. Blocking (Traditional) Kafka Producer

Example Configuration

```yaml
spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      properties:
        retry.backoff.ms: 1000
        max.block.ms: 2000  # Fail fast if Kafka is down
```

Note that retry.backoff.ms and max.block.ms are not first-class Spring Boot keys, so they must go under the producer's properties map.

Behavior

  1. Initial send → Fails if Kafka is unreachable.

  2. Kafka retries 3 times (1s delay between attempts).

  3. If still failing:

    • Throws TimeoutException or KafkaException.

Problem: Thread Blocking

  • KafkaTemplate.send() itself can block for up to max.block.ms while waiting for metadata or for buffer space.

  • Blocking on the returned future with get() ties up the thread until:

    • The message is acknowledged, or

    • delivery.timeout.ms is reached.

  • Risk: Thread starvation in high-load scenarios.

Workaround: Async Send with Callbacks

```java
@Autowired
private KafkaTemplate<String, String> kafkaTemplate;

// Note: in Spring Kafka 3.0+, send() returns CompletableFuture
// instead of ListenableFuture; use whenComplete(...) there.
public void sendMessage(String topic, String message) {
    ListenableFuture<SendResult<String, String>> future =
        kafkaTemplate.send(topic, message);

    future.addCallback(
        result -> log.info("Sent: {}", result),
        ex -> log.error("Failed: {}", ex.getMessage())
    );
}
```

Pros: Non-blocking.
Cons: No built-in retry control.
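In Spring Kafka 3.0+ the same callback pattern uses CompletableFuture.whenComplete(). The sketch below illustrates only the pattern: its send() method is a stand-in that simulates a broker acknowledgment, not the real KafkaTemplate API:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncSendSketch {

    // Stand-in for kafkaTemplate.send(topic, message), which in
    // Spring Kafka 3.0+ returns CompletableFuture<SendResult<K, V>>.
    static CompletableFuture<String> send(String topic, String message) {
        return CompletableFuture.supplyAsync(() -> topic + ":" + message);
    }

    public static void main(String[] args) {
        send("orders", "payload")
            .whenComplete((result, ex) -> {
                if (ex == null) {
                    System.out.println("Sent: " + result);
                } else {
                    System.err.println("Failed: " + ex.getMessage());
                }
            })
            .join(); // demo only; production code stays non-blocking
    }
}
```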


3. Reactive Kafka Producer

Example Configuration

```yaml
spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      properties:
        retry.backoff.ms: 1000
        max.block.ms: 2000
```

Retry + Fallback Code

```java
@Autowired
private ReactiveKafkaProducerTemplate<String, String> reactiveKafkaProducer;

public Mono<Void> sendMessage(String topic, String message) {
    return reactiveKafkaProducer.send(topic, message)
        .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))
        .then() // discard the SenderResult to match Mono<Void>
        .onErrorResume(e -> {
            log.error("Failed after retries: {}", e.getMessage());
            return storeInFallback(topic, message); // Fallback to DB
        });
}

private Mono<Void> storeInFallback(String topic, String message) {
    // Save to DB/Redis for later reprocessing
    return Mono.fromRunnable(() -> log.warn("Stored in fallback: {}", message));
}
```

Key Advantages

✅ Non-blocking: No thread starvation.
✅ Exponential backoff: Smarter retry delays.
✅ Fallback integration: Store failed messages safely.
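The "exponential backoff" point deserves a number: Retry.backoff(3, Duration.ofSeconds(1)) roughly doubles the delay on each attempt (the real Reactor operator also adds random jitter). A pure-JDK sketch of the resulting schedule, with a hypothetical delays helper:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {

    // Approximates Reactor's Retry.backoff(maxAttempts, firstBackoff):
    // each attempt doubles the previous delay (Reactor adds jitter on top).
    static List<Duration> delays(int maxAttempts, Duration firstBackoff) {
        List<Duration> out = new ArrayList<>();
        Duration d = firstBackoff;
        for (int i = 0; i < maxAttempts; i++) {
            out.add(d);
            d = d.multipliedBy(2);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(delays(3, Duration.ofSeconds(1)));
        // → [PT1S, PT2S, PT4S]
    }
}
```

So three reactive retries wait roughly 1 s, 2 s, and 4 s, spreading load away from a struggling broker instead of hammering it at a fixed interval.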


4. Real-World Failure Scenarios

Case 1: Kafka Cluster Down for 10 Seconds

| Approach | Behavior |
|---|---|
| Blocking | Retries 3 times (1s delay), then throws exception. Thread blocked. |
| Reactive | Retries 3 times (Kafka) + 3 times (Reactive), then falls back to DB. |

Case 2: Network Partition (50% Packet Loss)

| Approach | Behavior |
|---|---|
| Blocking | Some messages succeed; others fail after retries. |
| Reactive | Automatically retries + applies backpressure. |

Case 3: Broker Overload (High Latency)

| Approach | Behavior |
|---|---|
| Blocking | Threads stuck waiting for acks. |
| Reactive | Queues requests without blocking. |

5. Best Practices

For Blocking Producers

  • Use a low max.block.ms (e.g., 2000) so sends fail fast instead of blocking for the 60 s default.

  • Prefer async callbacks on the returned future over blocking with get().

  • Log failures for monitoring.

For Reactive Producers

  • Use retryWhen for application-level retries.

  • Always implement a fallback (e.g., DB, dead-letter queue).

  • Monitor reactive streams for backpressure.

Universal Rules

  1. Give retries room: set delivery.timeout.ms ≥ retries * retry.backoff.ms (Kafka also requires it to be ≥ linger.ms + request.timeout.ms).

  2. Don’t count on retries to bound time: since Kafka 2.1 it defaults to Integer.MAX_VALUE, and delivery.timeout.ms is what actually caps the retry window.

  3. Use idempotent producers (enable.idempotence=true) to avoid duplicates.
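Rule 3 translates to a small addition in the Spring Boot config. A sketch, using the standard producer property names (idempotence requires acks=all, which Kafka enforces):

```yaml
spring:
  kafka:
    producer:
      acks: all                    # required by idempotence
      properties:
        enable.idempotence: true   # dedupe retried sends broker-side
```

With idempotence enabled, a retry of an already-written batch is detected by the broker via producer IDs and sequence numbers, so retries never create duplicates.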


Conclusion

  • Blocking Kafka: Simple but risks thread starvation. Use async callbacks.

  • Reactive Kafka: Resilient, non-blocking, and scalable.

  • Always combine:

    • Kafka’s low-level retries (retries) +

    • Application-level retries (retryWhen) +

    • Fallback storage.

Introduction In distributed systems, failures are inevitable—network partitions, broker crashes, or consumer lag can disrupt data flow. While retries help recover from transient issues, you need stronger guarantees for mission-critical systems. This guide covers three advanced Kafka resilience patterns: Dead-Letter Queues (DLQs) – Handle poison pills and unprocessable messages. Circuit Breakers – Prevent cascading failures when Kafka is unhealthy. Exactly-Once Delivery – Avoid duplicates in financial/transactional systems. Let's dive in! 1. Dead-Letter Queues (DLQs) in Kafka What is a DLQ? A dedicated Kafka topic where "failed" messages are sent after max retries (e.g., malformed payloads, unrecoverable errors). ...