Skip to main content

Handling Kafka Retries in Spring Boot: Blocking vs. Reactive Approaches

 

Introduction

Apache Kafka is designed for high availability, but failures still happen—network issues, broker crashes, or cluster downtime. To ensure message delivery, applications must implement retry mechanisms. However, retries behave differently in traditional (blocking) vs. reactive (non-blocking) Kafka producers.

This guide covers:
✅ Kafka’s built-in retries (retriesretry.backoff.ms)
✅ Blocking vs. non-blocking retry strategies
✅ Reactive Kafka retries with backoff
✅ Fallback strategies for guaranteed delivery
✅ Real-world failure scenarios and fixes


1. Kafka Producer Retry Basics

When Do Retries Happen?

Kafka producers automatically retry on:

  • Network errors (e.g., broker disconnect)

  • Leader election (e.g., broker restart)

  • Temporary errors (e.g., NOT_ENOUGH_REPLICAS)

Key Configuration Properties

PropertyDefaultDescription
retries0Number of retries for transient failures.
retry.backoff.ms100Delay (ms) between retries.
delivery.timeout.ms120000Total timeout for a send (including retries).
max.block.ms60000Max time to wait for metadata or buffer full.

2. Blocking (Traditional) Kafka Producer

Example Configuration

yaml

spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      retry.backoff.ms: 1000
      max.block.ms: 2000  # Fail fast if Kafka is down

Behavior

  1. Initial send → Fails if Kafka is unreachable.

  2. Kafka retries 3 times (1s delay between attempts).

  3. If still failing:

    • Throws TimeoutException or KafkaException.

Problem: Thread Blocking

  • The KafkaTemplate.send() blocks the thread until:

    • The message is acknowledged, or

    • delivery.timeout.ms is reached.

  • Risk: Thread starvation in high-load scenarios.

Workaround: Async Send with Callbacks

java

@Autowired
private KafkaTemplate<String, String> kafkaTemplate;

public void sendMessage(String topic, String message) {
    ListenableFuture<SendResult<String, String>> future = 
        kafkaTemplate.send(topic, message);
    
    future.addCallback(
        result -> log.info("Sent: {}", result),
        ex -> log.error("Failed: {}", ex.getMessage())
    );
}

Pros: Non-blocking.
Cons: No built-in retry control.


3. Reactive Kafka Producer

Example Configuration

yaml

spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      retries: 3
      retry.backoff.ms: 1000
      max.block.ms: 2000

Retry + Fallback Code

java

@Autowired
private ReactiveKafkaProducerTemplate<String, String> reactiveKafkaProducer;

public Mono<Void> sendMessage(String topic, String message) {
    return reactiveKafkaProducer.send(topic, message)
        .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))
        .onErrorResume(e -> {
            log.error("Failed after retries: {}", e.getMessage());
            return storeInFallback(topic, message); // Fallback to DB
        });
}

private Mono<Void> storeInFallback(String topic, String message) {
    // Save to DB/Redis for later reprocessing
    return Mono.fromRunnable(() -> log.warn("Stored in fallback: {}", message));
}

Key Advantages

✅ Non-blocking: No thread starvation.
✅ Exponential backoff: Smarter retry delays.
✅ Fallback integration: Store failed messages safely.


4. Real-World Failure Scenarios

Case 1: Kafka Cluster Down for 10 Seconds

ApproachBehavior
BlockingRetries 3 times (1s delay), then throws exception. Thread blocked.
ReactiveRetries 3 times (Kafka) + 3 times (Reactive), then falls back to DB.

Case 2: Network Partition (50% Packet Loss)

ApproachBehavior
BlockingSome messages succeed; others fail after retries.
ReactiveAutomatically retries + applies backpressure.

Case 3: Broker Overload (High Latency)

ApproachBehavior
BlockingThreads stuck waiting for acks.
ReactiveQueues requests without blocking.

5. Best Practices

For Blocking Producers

  • Use max.block.ms=2000 to avoid long blocks.

  • Prefer async ListenableFuture over get().

  • Log failures for monitoring.

For Reactive Producers

  • Use retryWhen for application-level retries.

  • Always implement a fallback (e.g., DB, dead-letter queue).

  • Monitor reactive streams for backpressure.

Universal Rules

  1. Set delivery.timeout.ms ≥ retries * retry.backoff.ms.

  2. Avoid retries = Integer.MAX_VALUE (could retry forever).

  3. Use idempotent producers (enable.idempotence=true) to avoid duplicates.


Conclusion

  • Blocking Kafka: Simple but risks thread starvation. Use async callbacks.

  • Reactive Kafka: Resilient, non-blocking, and scalable.

  • Always combine:

    • Kafka’s low-level retries (retries) +

    • Application-level retries (retryWhen) +

    • Fallback storage.

Comments

Popular posts from this blog

Mastering Java Logging: A Guide to Debug, Info, Warn, and Error Levels

Comprehensive Guide to Java Logging Levels: Trace, Debug, Info, Warn, Error, and Fatal Comprehensive Guide to Java Logging Levels: Trace, Debug, Info, Warn, Error, and Fatal Logging is an essential aspect of application development and maintenance. It helps developers track application behavior and troubleshoot issues effectively. Java provides various logging levels to categorize messages based on their severity and purpose. This article covers all major logging levels: Trace , Debug , Info , Warn , Error , and Fatal , along with how these levels impact log printing. 1. Trace The Trace level is the most detailed logging level. It is typically used for granular debugging, such as tracking every method call or step in a complex computation. Use this level sparingly, as it can generate a large volume of log data. 2. Debug The Debug level provides detailed information useful during dev...

Choosing Between Envoy and NGINX Ingress Controllers for Kubernetes

As Kubernetes has become the standard for deploying containerized applications, ingress controllers play a critical role in managing how external traffic is routed to services within the cluster. Envoy and NGINX are two of the most popular options for ingress controllers, and each has its strengths, weaknesses, and ideal use cases. In this blog, we’ll explore: How both ingress controllers work. A detailed comparison of their features. When to use Envoy vs. NGINX for ingress management. What is an Ingress Controller? An ingress controller is a specialized load balancer that: Manages incoming HTTP/HTTPS traffic. Routes traffic to appropriate services based on rules defined in Kubernetes ingress resources. Provides features like TLS termination, path-based routing, and host-based routing. How Envoy Ingress Controller Works Envoy , initially built by Lyft, is a high-performance, modern service proxy and ingress solution. Here's how it operates in Kubernetes: Ingress Resource : You d...

Understanding API Parameters in Spring Boot

Understanding API Parameters in Spring Boot When designing APIs in Spring Boot, it's essential to understand how to handle different types of parameters. These parameters define how data is sent from the client to the server. Let's break down the common types of parameters used in API development, with examples and cURL commands. 1. Types of Parameters Parameter Type Location Use Case Example Format Annotation in Spring Boot Query Param URL after `?` Filtering, Pagination ?key=value @RequestParam Path Param In the URL path Identifying specific resource /resource/{id} @PathVariable Form Param Form-encoded body Simple form submissions ...