Apache Kafka is a high-throughput, fault-tolerant message streaming system. At the heart of Kafka's data pipeline is the Producer, the client responsible for publishing data to Kafka topics. This blog dives deep into the internal workings of the Kafka Producer, especially what happens under the hood when send() is called. We'll also break down the different delivery guarantees and transactional semantics, with diagrams.
Table of Contents
- Kafka Producer Architecture Overview
- What Happens When send() is Called
- Delivery Semantics
- Kafka Transactions & Idempotence
- Error Handling and Retries
- Diagram: Kafka Producer Internals
- Conclusion
Kafka Producer Architecture Overview
The Kafka Producer is composed of the following core components:
- Serializer: Converts key/value to bytes.
- Partitioner: Determines which partition a record should go to.
- Accumulator: Buffers the records in memory before sending.
- Sender Thread: Batches and sends data to brokers asynchronously.
- Metadata Manager: Keeps broker and partition metadata up to date.
- NetworkClient: Handles the actual communication over TCP.
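As a concrete starting point, here is a minimal sketch of how the serializer, accumulator, and network layer are wired up through configuration. The broker address is a placeholder, and the buffer/batch values shown are just illustrative choices, not recommendations:

```java
import java.util.Properties;

public class ProducerConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        // NetworkClient: where to find the cluster (placeholder address)
        props.put("bootstrap.servers", "localhost:9092");
        // Serializer: converts key/value to bytes
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Accumulator: how records are buffered before the sender thread ships them
        props.put("batch.size", "16384");       // max bytes per partition batch
        props.put("linger.ms", "5");            // wait up to 5 ms to fill a batch
        props.put("buffer.memory", "33554432"); // total accumulator memory (32 MB)
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("batch.size"));
    }
}
```

Raising linger.ms trades a little latency for larger batches and better throughput; batch.size caps how large a single per-partition batch can grow.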
What Happens When send() is Called?
producer.send(new ProducerRecord<>("topic", key, value));
Internal Flow (Simplified):
- Serialization – Key and value are serialized using configured serializers.
- Partition Selection – The partitioner chooses a partition.
- Buffering in Accumulator – The record is added to a batch buffer.
- Sender Thread Wakeup – A background thread sends the batch.
- Broker Acknowledgment – Broker persists and acknowledges.
- Callback Execution – Optional callback is executed after response.
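The buffering and batching steps above can be sketched with a toy model. This is not the real RecordAccumulator (which also tracks batch sizes, memory limits, and in-flight state); it only shows the core idea of grouping records by partition so the sender thread can ship one batch per partition:

```java
import java.util.*;

public class AccumulatorSketch {
    // Simplified stand-in for a ProducerRecord that already has a partition assigned
    record PendingRecord(int partition, String value) {}

    private final Map<Integer, List<PendingRecord>> batches = new HashMap<>();

    // Buffer a record into the batch for its partition
    void append(PendingRecord r) {
        batches.computeIfAbsent(r.partition(), p -> new ArrayList<>()).add(r);
    }

    // The sender thread drains a partition's batch and ships it to the leader broker
    List<PendingRecord> drain(int partition) {
        return batches.getOrDefault(partition, List.of());
    }

    public static void main(String[] args) {
        AccumulatorSketch acc = new AccumulatorSketch();
        acc.append(new PendingRecord(0, "a"));
        acc.append(new PendingRecord(1, "b"));
        acc.append(new PendingRecord(0, "c"));
        System.out.println(acc.drain(0).size()); // two records batched for partition 0
    }
}
```

Batching per partition is what lets the producer amortize network round-trips: one request to a broker can carry full batches for every partition that broker leads.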
Delivery Semantics
Kafka provides 3 main delivery guarantees:
✅ At Most Once
Data is sent once, no retries. Can be lost if errors occur.
Config:
acks=0
retries=0
Use Case: Logs, analytics
Pros: Low latency
Cons: No durability guarantee
At Least Once
Retries if acknowledgment not received → duplicates possible.
Config:
acks=all
retries=3
Use Case: Payment events
Pros: Reliable delivery
Cons: Possible duplicates
Exactly Once (EOS)
Guarantees that records are neither lost nor duplicated, even across retries. Note that idempotence alone deduplicates only within a single partition; transactions are needed for atomic writes across partitions and for exactly-once consume-transform-produce pipelines.
Config:
enable.idempotence=true
acks=all
retries=3
transactional.id=your-producer-id
Use Case: Financial systems
Pros: Precise semantics
Cons: Slightly higher latency
Kafka Transactions & Idempotence
✅ Idempotent Producer
- Each producer gets a unique Producer ID (PID)
- Each message has a sequence number
- Broker deduplicates based on PID + sequence
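Conceptually, the broker-side deduplication can be sketched like this. This is a toy model, not the broker implementation (which tracks sequences per PID per partition and handles epochs): each (PID, sequence) pair is accepted at most once, so a retried batch is acknowledged without being appended again.

```java
import java.util.*;

public class BrokerDedupSketch {
    // Highest sequence number accepted so far, keyed by producer ID
    // (the real broker tracks this per PID *per partition*)
    private final Map<Long, Integer> lastSeq = new HashMap<>();

    // Returns true if the write is appended, false if it is a duplicate retry
    boolean append(long producerId, int sequence) {
        int last = lastSeq.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // duplicate: already persisted, just re-acknowledge
        }
        lastSeq.put(producerId, sequence);
        return true;
    }

    public static void main(String[] args) {
        BrokerDedupSketch broker = new BrokerDedupSketch();
        System.out.println(broker.append(42L, 0)); // true: first delivery
        System.out.println(broker.append(42L, 0)); // false: retried duplicate dropped
        System.out.println(broker.append(42L, 1)); // true: next message accepted
    }
}
```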
Transactions
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (KafkaException e) {
    // Roll back both sends atomically (fatal errors like
    // ProducerFencedException require closing the producer instead)
    producer.abortTransaction();
}
Transactional state is maintained via Kafka's internal transaction log (the __transaction_state topic). Consumers will only see committed transactions if configured with isolation.level=read_committed.
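On the consuming side, that isolation setting is just another config entry. A minimal sketch, with placeholder broker address and group id:

```java
import java.util.Properties;

public class ReadCommittedConsumerConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "example-group");           // placeholder group
        // Hide records from aborted or still-open transactions
        props.put("isolation.level", "read_committed");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("isolation.level"));
    }
}
```

The default is read_uncommitted, so a consumer that needs transactional guarantees must opt in explicitly.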
Error Handling & Retries
Failure Scenario | Kafka Producer Response
---|---
Network timeout | Retries if enabled
Leader not available | Refreshes metadata and retries
Serialization failure | Fails immediately (not retriable)
Batch too large | Throws an exception (record exceeds max size)
Idempotent retry mismatch | Producer is fenced
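The "retries if enabled" behavior in the table can be sketched as a simple retry loop. This is a toy model: the real producer retries per batch, distinguishes retriable from fatal errors, and bounds the whole attempt by retries and delivery.timeout.ms rather than looping synchronously.

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Attempt an operation, retrying up to maxRetries extra times on failure,
    // as the producer does for retriable errors like network timeouts
    static <T> T withRetries(Supplier<T> op, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // retriable failure; try again
            }
        }
        throw last; // retries exhausted: surface the error to the callback
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds -- like a transient network timeout
        String result = withRetries(() -> {
            if (calls[0]++ < 2) throw new RuntimeException("timeout");
            return "acked";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that retries are exactly why at-least-once delivery can produce duplicates: a batch may have been persisted even though the acknowledgment was lost, which is the problem idempotence solves.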
Diagram: Kafka Producer Internal Flow
         +-------------------------+
         |     Kafka Producer      |
         +-------------------------+
                     |
                     v
         +-------------------------+      +-------------+
         |       Serializer        | ---> | Serialized  |
         |      (Key + Value)      |      |   Record    |
         +-------------------------+      +-------------+
                     |
                     v
         +-------------------------+
         |       Partitioner       | ---> Select Partition
         +-------------------------+
                     |
                     v
         +-------------------------+
         |   Record Accumulator    | ---> Batches by Partition
         +-------------------------+
                     |
                     v
         +-------------------------+
         |      Sender Thread      | ---> Sends Batch
         +-------------------------+
                     |
                     v
         +-------------------------+
         |      NetworkClient      | ---> Kafka Broker
         +-------------------------+
                     |
                     v
         +-------------------------+
         |   Broker Acknowledges   |
         +-------------------------+
                     |
                     v
         +-------------------------+
         |    Callback Executed    |
         +-------------------------+
✅ Conclusion
Understanding Kafka Producer internals is essential for building reliable, performant data pipelines. Whether you’re dealing with logs, transactions, or real-time processing, choose the appropriate delivery semantics and configure retries/acks wisely.
Key Takeaways:
- send() is asynchronous
- Data flows through serialization → partitioning → buffering → batching
- Kafka supports at-most-once, at-least-once, and exactly-once guarantees
- Use idempotence and transactions for full exactly-once support
✍️ Feel free to share this blog or connect with me for more deep-dives into Kafka, streaming, and distributed systems!