Handling Kafka Retries in Spring Boot: Blocking vs. Reactive Approaches

 

Introduction

Apache Kafka is designed for high availability, but failures still happen—network issues, broker crashes, or cluster downtime. To ensure message delivery, applications must implement retry mechanisms. However, retries behave differently in traditional (blocking) vs. reactive (non-blocking) Kafka producers.

This guide covers:
✅ Kafka’s built-in retries (retries, retry.backoff.ms)
✅ Blocking vs. non-blocking retry strategies
✅ Reactive Kafka retries with backoff
✅ Fallback strategies for guaranteed delivery
✅ Real-world failure scenarios and fixes


1. Kafka Producer Retry Basics

When Do Retries Happen?

Kafka producers automatically retry on:

  • Network errors (e.g., broker disconnect)

  • Leader election (e.g., broker restart)

  • Temporary errors (e.g., NOT_ENOUGH_REPLICAS)

Key Configuration Properties

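Property | Default | Description
retries | 2147483647 (Integer.MAX_VALUE) since Kafka 2.1; 0 in older clients | Maximum number of retries for transient failures.
retry.backoff.ms | 100 | Delay (ms) between retries.
delivery.timeout.ms | 120000 | Upper bound (ms) on the total time for a send, including retries.
max.block.ms | 60000 | Max time (ms) send() may block waiting for metadata or buffer space.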
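These properties interact: the producer stops retrying as soon as either the retry count or delivery.timeout.ms is exhausted. The sketch below is a rough back-of-the-envelope estimate, not how the client computes anything internally. It assumes every attempt waits the full request.timeout.ms (30000 ms by default) before failing, and it ignores batching and linger time. The class and method names are hypothetical.

```java
public class RetryBudget {

    /**
     * Rough worst-case duration (ms) of one send if every attempt times out:
     * (retries + 1) attempts, with a backoff pause between consecutive attempts.
     */
    static long worstCaseMs(int retries, long backoffMs, long requestTimeoutMs) {
        long attempts = retries + 1L;
        return attempts * requestTimeoutMs + retries * backoffMs;
    }

    /** The producer gives up at whichever limit is reached first. */
    static long effectiveLimitMs(int retries, long backoffMs,
                                 long requestTimeoutMs, long deliveryTimeoutMs) {
        return Math.min(worstCaseMs(retries, backoffMs, requestTimeoutMs), deliveryTimeoutMs);
    }

    public static void main(String[] args) {
        // retries=3, retry.backoff.ms=1000, request.timeout.ms=30000
        System.out.println(worstCaseMs(3, 1000, 30_000));               // 123000
        // With delivery.timeout.ms=120000 the timeout wins before the last retry finishes.
        System.out.println(effectiveLimitMs(3, 1000, 30_000, 120_000)); // 120000
    }
}
```

In practice many failures (such as a refused connection) surface much faster than request.timeout.ms, so the real duration is usually well below this estimate.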

2. Blocking (Traditional) Kafka Producer

Example Configuration

yaml

spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      # Settings without a dedicated Spring Boot key go under "properties"
      properties:
        retry.backoff.ms: 1000
        max.block.ms: 2000  # Fail fast if Kafka is down

Behavior

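  1. Initial send → If the producer cannot fetch metadata (Kafka unreachable), send() blocks for up to max.block.ms and then fails with a TimeoutException.

  2. For records already accepted into the send buffer, Kafka retries transient failures up to 3 times (1s delay between attempts).

  3. If still failing:

    • The send fails with a TimeoutException or another KafkaException.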

Problem: Thread Blocking

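  • KafkaTemplate.send() is asynchronous and returns a future, but the calling thread still blocks when:

    • send() waits for metadata or buffer space (up to max.block.ms), or

    • the code calls get() on the returned future, which waits until the message is acknowledged or delivery.timeout.ms is reached.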

  • Risk: Thread starvation in high-load scenarios.

Workaround: Async Send with Callbacks

java

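import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.util.concurrent.ListenableFuture;

@Autowired
private KafkaTemplate<String, String> kafkaTemplate;

// Spring Kafka 2.x API. In 3.x, send() returns a CompletableFuture; use whenComplete() instead.
// "log" is assumed to be an SLF4J logger defined on the enclosing class.
public void sendMessage(String topic, String message) {
    ListenableFuture<SendResult<String, String>> future =
        kafkaTemplate.send(topic, message);

    // Runs after the broker acknowledges the record or the producer gives up retrying
    future.addCallback(
        result -> log.info("Sent: {}", result),
        ex -> log.error("Failed: {}", ex.getMessage())
    );
}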

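Pros: The calling thread does not wait for the acknowledgment.
Cons: No application-level retry control. Once the producer's own retries are exhausted, the callback only reports the failure.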
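One way to add application-level retries on top of an asynchronous send is a small helper built on CompletableFuture. The following is a minimal sketch using only the JDK (Java 9+), not a Spring Kafka feature; the class name AsyncRetry and the flaky sender in main are hypothetical stand-ins for a real send call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class AsyncRetry {

    /**
     * Runs an async operation and retries it up to maxRetries times,
     * waiting delayMs between attempts without blocking the caller.
     */
    static <T> CompletableFuture<T> withRetry(Supplier<CompletableFuture<T>> operation,
                                              int maxRetries, long delayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, maxRetries, delayMs, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> operation,
                                    int retriesLeft, long delayMs,
                                    CompletableFuture<T> result) {
        operation.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                // Schedule the next attempt on a delayed executor instead of sleeping
                CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS)
                    .execute(() -> attempt(operation, retriesLeft - 1, delayMs, result));
            } else {
                result.completeExceptionally(error);
            }
        });
    }

    public static void main(String[] args) {
        // Hypothetical sender: fails twice, then succeeds
        AtomicInteger calls = new AtomicInteger();
        Supplier<CompletableFuture<String>> flakySend = () ->
            calls.incrementAndGet() < 3
                ? CompletableFuture.failedFuture(new RuntimeException("broker unavailable"))
                : CompletableFuture.completedFuture("ack");

        System.out.println(withRetry(flakySend, 3, 10).join()); // ack
        System.out.println(calls.get());                        // 3
    }
}
```

With Spring Kafka 3.x, where send() returns a CompletableFuture, the supplier could plausibly be `() -> kafkaTemplate.send(topic, message)`. Note that re-sending at the application level creates a new produce request, so the idempotent producer does not deduplicate it; consumers may see duplicates.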


3. Reactive Kafka Producer

Example Configuration

yaml

spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      # Settings without a dedicated Spring Boot key go under "properties"
      properties:
        retry.backoff.ms: 1000
        max.block.ms: 2000

Retry + Fallback Code

java

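import java.time.Duration;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.reactive.ReactiveKafkaProducerTemplate;
import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

// Spring Boot does not auto-configure this template; define it as a bean built from SenderOptions.
@Autowired
private ReactiveKafkaProducerTemplate<String, String> reactiveKafkaProducer;

public Mono<Void> sendMessage(String topic, String message) {
    return reactiveKafkaProducer.send(topic, message)
        // Re-sends up to 3 times with exponential backoff starting at 1s
        .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))
        .then() // Drop the SenderResult so the chain matches Mono<Void>
        .onErrorResume(e -> {
            // e wraps the last failure once retries are exhausted; see e.getCause()
            log.error("Failed after retries: {}", e.getMessage());
            return storeInFallback(topic, message); // Fallback to DB
        });
}

private Mono<Void> storeInFallback(String topic, String message) {
    // Save to DB/Redis for later reprocessing
    return Mono.fromRunnable(() -> log.warn("Stored in fallback: {}", message));
}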

Key Advantages

✅ Non-blocking: No thread starvation.
✅ Exponential backoff: Smarter retry delays.
✅ Fallback integration: Store failed messages safely.
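To make the exponential backoff point concrete, the sketch below computes the nominal delays that a doubling backoff produces for a given retry count. It is a plain-JDK illustration, not Reactor's implementation: Reactor's Retry.backoff also applies random jitter (a factor of 0.5 by default), so real delays vary around these values. The class name and the 10s cap are assumptions for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {

    /** Nominal delay before each retry: minBackoff doubled each time, capped at maxBackoff. */
    static List<Duration> delays(int maxRetries, Duration minBackoff, Duration maxBackoff) {
        List<Duration> result = new ArrayList<>();
        Duration next = minBackoff;
        for (int i = 0; i < maxRetries; i++) {
            result.add(next.compareTo(maxBackoff) > 0 ? maxBackoff : next);
            // Stop doubling once the cap is reached to avoid overflow on long schedules
            if (next.compareTo(maxBackoff) < 0) {
                next = next.multipliedBy(2);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same shape as Retry.backoff(3, Duration.ofSeconds(1)), jitter ignored
        System.out.println(delays(3, Duration.ofSeconds(1), Duration.ofSeconds(10)));
        // prints [PT1S, PT2S, PT4S]
    }
}
```

Three retries therefore spend about 7 seconds waiting in total, on top of the time each send attempt itself takes to fail.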


4. Real-World Failure Scenarios

Case 1: Kafka Cluster Down for 10 Seconds

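Approach | Behavior
Blocking | Retries 3 times (1s delay), then fails with an exception. A thread waiting on get() stays blocked throughout.
Reactive | Each reactive attempt starts its own round of Kafka retries, so the counts multiply rather than add: up to 4 reactive attempts × 4 Kafka attempts. Falls back to DB only if all of them fail.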

Case 2: Network Partition (50% Packet Loss)

Approach | Behavior
Blocking | Some messages succeed; others fail after retries.
Reactive | Automatically retries + applies backpressure.

Case 3: Broker Overload (High Latency)

Approach | Behavior
Blocking | Threads stuck waiting for acks.
Reactive | Queues requests without blocking.

5. Best Practices

For Blocking Producers

  • Use max.block.ms=2000 to avoid long blocks.

  • Prefer async callbacks (ListenableFuture in Spring Kafka 2.x, CompletableFuture in 3.x) over a blocking get().

  • Log failures for monitoring.

For Reactive Producers

  • Use retryWhen for application-level retries.

  • Always implement a fallback (e.g., DB, dead-letter queue).

  • Monitor reactive streams for backpressure.
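The fallback idea can be sketched as a small outbox that holds failed messages and replays them once Kafka is reachable again. This is an in-memory illustration only, with hypothetical names (FallbackOutbox, Pending); a real fallback would persist to a database or Redis so that messages survive a restart. It assumes a single thread calls replay() at a time and requires Java 16+ for records.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiPredicate;

public class FallbackOutbox {

    /** A message that could not be delivered. */
    record Pending(String topic, String message) {}

    private final Queue<Pending> pending = new ConcurrentLinkedQueue<>();

    /** Called from the error path when all retries have failed. */
    void store(String topic, String message) {
        pending.add(new Pending(topic, message));
    }

    /**
     * Re-sends stored messages in order. The sender returns true on success.
     * Stops at the first failure so the remaining messages stay queued.
     * Returns the number of messages replayed.
     */
    int replay(BiPredicate<String, String> sender) {
        int sent = 0;
        Pending next;
        while ((next = pending.peek()) != null) {
            if (!sender.test(next.topic(), next.message())) {
                break;
            }
            pending.poll(); // Remove only after a successful send
            sent++;
        }
        return sent;
    }

    int size() {
        return pending.size();
    }
}
```

A scheduled job could call replay() periodically, passing a sender that wraps the real producer call. Because a message is removed only after the sender reports success, a crash between send and removal can produce a duplicate, which is the usual at-least-once trade-off.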

Universal Rules

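  1. Set delivery.timeout.ms high enough to cover the retries you expect, counting both the backoff pauses (retries * retry.backoff.ms) and the time each attempt can take (up to request.timeout.ms). The producer also requires delivery.timeout.ms ≥ linger.ms + request.timeout.ms.

  2. If retries is left at Integer.MAX_VALUE (the default in current clients), make sure delivery.timeout.ms is set deliberately, since it is then the only limit on how long a send keeps retrying.

  3. Use idempotent producers (enable.idempotence=true) to avoid duplicates caused by producer-level retries.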
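These rules can be checked before a producer is created. The sketch below validates a plain map of settings against constraints the Kafka producer documents: delivery.timeout.ms must be at least linger.ms + request.timeout.ms, and idempotence requires acks=all, retries greater than 0, and max.in.flight.requests.per.connection of at most 5. The class name is hypothetical, and the fallback defaults in the code are assumptions based on recent client versions; check the documentation for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ProducerConfigCheck {

    /** Returns a list of rule violations; an empty list means the settings look consistent. */
    static List<String> problems(Map<String, String> cfg) {
        List<String> out = new ArrayList<>();

        // Assumed defaults (verify against your client version)
        long delivery = longOf(cfg, "delivery.timeout.ms", 120_000);
        long linger = longOf(cfg, "linger.ms", 0);
        long request = longOf(cfg, "request.timeout.ms", 30_000);
        if (delivery < linger + request) {
            out.add("delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }

        if (Boolean.parseBoolean(cfg.getOrDefault("enable.idempotence", "true"))) {
            String acks = cfg.getOrDefault("acks", "all");
            if (!acks.equals("all") && !acks.equals("-1")) {
                out.add("idempotence requires acks=all");
            }
            if (longOf(cfg, "retries", Integer.MAX_VALUE) <= 0) {
                out.add("idempotence requires retries > 0");
            }
            if (longOf(cfg, "max.in.flight.requests.per.connection", 5) > 5) {
                out.add("idempotence requires max.in.flight.requests.per.connection <= 5");
            }
        }
        return out;
    }

    private static long longOf(Map<String, String> cfg, String key, long fallback) {
        String value = cfg.get(key);
        return value == null ? fallback : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        System.out.println(problems(Map.of("retries", "3")));                 // []
        System.out.println(problems(Map.of("delivery.timeout.ms", "2000")));  // one violation
    }
}
```

The real client performs its own validation at construction time and throws a ConfigException on conflicts; a check like this only surfaces the problem earlier, for example in a unit test.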


Conclusion

  • Blocking Kafka: Simple but risks thread starvation. Use async callbacks.

  • Reactive Kafka: Resilient, non-blocking, and scalable.

  • Always combine:

    • Kafka’s low-level retries (retries) +

    • Application-level retries (retryWhen) +

    • Fallback storage.
