Managing Complex Application Architectures with Apache Kafka
Table of Contents
- Summary
- The Growing Complexity of Application Structures
- What is Kafka?
- Key Features of Kafka
- High Availability: Partition Replication
- Kafka Partitioner
- Consumer Lag
Summary
Reasons to use Kafka:
- Decouples source and target applications
- Scalable & Fault-Tolerant architecture
- Supports high-throughput data streaming
- Flexible partitioning and replication
- Handles real-time data processing efficiently
The Growing Complexity of Application Structures
As the number of source (sending) and target (receiving) applications increases, data pipelines become more complex, requiring efficient data processing systems.
@Figure 1: Increasing complexity in application data pipelines
Challenges of Complex Data Pipelines:
- Difficult Deployment: More dependencies lead to complex release cycles.
- Harder Issue Detection: Identifying failure points becomes challenging.
- Protocol Diversity: Different applications use different communication methods, increasing processing workload.
What is Kafka?
Apache Kafka is an open-source distributed event streaming platform originally developed by LinkedIn. If you’re interested in Kafka’s implementation at LinkedIn, check out Kafka Ecosystem at LinkedIn.
Why Use Kafka?
Kafka’s primary goal is to decouple source and target applications, ensuring:
- Producers (source applications) only need to send data to Kafka.
- Consumers (target applications) only need to fetch data from Kafka.
For example, in an e-commerce system:
- Source Applications (Producers): Log user actions (e.g., product clicks, purchases).
- Target Applications (Consumers): Store and process logs.
- Kafka: Acts as a central hub, reducing system dependencies.
Benefits of Kafka
- Fault Tolerance – Handles failures without losing data.
- Low Latency & High Throughput – Efficiently processes large volumes of real-time data.
@Figure 2: Kafka simplifies the architecture by providing a unified event streaming platform. (reference)
@Figure 3: How Kafka facilitates data transmission between producers and consumers.
Key Features of Kafka
1. Topics
Kafka organizes data into topics, which act as named message queues backed by an append-only log.
- Producers send messages to topics.
- Consumers subscribe to topics to retrieve messages.
@Figure 4: Kafka Topics - Message queues for data processing.
Characteristics of Kafka Topics:
- Multiple topics can exist in Kafka, similar to database tables or file system folders.
- Topics have unique names to ensure clarity and maintainability.
2. Partitions
Kafka splits topics into partitions to distribute data across multiple brokers.
@Figure 5: Kafka partitions - distributing data across multiple consumers.
Advantages of Partitions:
- Fault Tolerance: If a broker fails, Kafka can recover the partition's data from replicas held on other brokers.
- Parallel Processing: Multiple consumers can read different partitions simultaneously.
- Multi-System Integration: The same data can be ingested into multiple platforms (e.g., Elasticsearch, Hadoop).
How Kafka Assigns Data to Partitions:
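- Without a key: Data is spread across partitions using round robin (newer clients may instead use a sticky strategy that fills one batch per partition at a time).
- With a key: The key is hashed and the hash value determines the partition, so records with the same key always land in the same partition.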
Important points when managing partitions:
- Increasing partitions is allowed.
- Decreasing partitions is not possible once they are created.
When to Increase Partitions:
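- Within a consumer group, each partition is read by only one consumer, so if the consumer count grows beyond the partition count, the partition count should also increase.
- More partitions -> More consumers -> Faster parallel processing.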
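The relationship between partition count and consumer count can be sketched in a few lines of Python. This is a simplified model of a single consumer group, not Kafka's actual assignor: the function name `assign_partitions` and the consumer names are hypothetical, and partitions are handed out in plain round-robin order.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over the consumers of one group.

    Simplified model: each partition is owned by exactly one consumer,
    so consumers beyond the partition count receive nothing.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["c1", "c2", "c3", "c4"]  # hypothetical consumer names

# With 3 partitions, the fourth consumer has nothing to read.
three_partitions = assign_partitions(3, group)

# With 6 partitions, every consumer owns at least one partition.
six_partitions = assign_partitions(6, group)

print(three_partitions)
print(six_partitions)
```

In this model, adding consumers only speeds up processing while there are still unowned partitions to hand out, which is why the partition count sets the upper bound on parallelism within a group.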
3. Partition Retention & Deletion
Messages are not deleted immediately after being read. Kafka retains them so that new consumers can read them and existing consumers can reprocess them.
Conditions for Retaining Messages:
- A different consumer group requests the data.
- The consumer offset reset policy allows access (auto.offset.reset = earliest).
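To make the two conditions concrete, here is a toy in-memory model of a single partition. It is only a sketch: the class `TopicLog`, its methods, and the group names are hypothetical and are not part of any Kafka client. The `offset_reset` argument mimics the `earliest` and `latest` values of auto.offset.reset.

```python
class TopicLog:
    """Toy single-partition log: reading never deletes messages."""

    def __init__(self):
        self.messages = []   # append-only list; list index == offset
        self.committed = {}  # consumer group -> next offset to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, group, offset_reset="latest"):
        # A group with no committed offset starts where the reset policy says.
        if group not in self.committed:
            if offset_reset == "earliest":
                self.committed[group] = 0
            else:
                self.committed[group] = len(self.messages)
        start = self.committed[group]
        batch = self.messages[start:]
        self.committed[group] = len(self.messages)
        return batch


log = TopicLog()
log.append("click:1")
log.append("purchase:2")

# Two different groups each receive the full history.
first_read = log.poll("analytics", offset_reset="earliest")
second_group = log.poll("archiver", offset_reset="earliest")

# A group that starts at "latest" only sees messages appended afterwards.
late_group = log.poll("alerts", offset_reset="latest")
```

Each group tracks its own offset, so one group reading a message has no effect on what another group can still read.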
Message Deletion is Configurable:
- log.retention.ms: Maximum time a message is stored.
- log.retention.bytes: Maximum size of stored messages per partition before the oldest data is deleted.
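The effect of these two settings can be illustrated with a simplified model. The function `apply_retention` and its tuple layout are hypothetical. The model assumes the broker removes whole log segments, oldest first, and it ignores details such as the active segment and the exact boundary rule the broker applies.

```python
def apply_retention(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments that survive retention, oldest first.

    segments: list of (newest_timestamp_ms, size_bytes), oldest first.
    """
    kept = list(segments)
    # Time-based: drop segments whose newest record is older than the limit.
    if retention_ms is not None:
        kept = [seg for seg in kept if now_ms - seg[0] <= retention_ms]
    # Size-based: drop the oldest segments until the total fits the limit.
    if retention_bytes is not None:
        while kept and sum(size for _, size in kept) > retention_bytes:
            kept.pop(0)
    return kept


segments = [(1_000, 400), (5_000, 400), (9_000, 400)]

by_time = apply_retention(segments, now_ms=10_000, retention_ms=6_000)
by_size = apply_retention(segments, now_ms=10_000, retention_bytes=500)

print(by_time)
print(by_size)
```

Either limit can trigger deletion on its own, so in this model the data that survives is whatever passes both checks.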
High Availability: Partition Replication
1. Brokers
A Kafka broker is a server running Kafka. Recommended setup: At least 3 brokers for redundancy.
Message Broker vs. Event Broker:
Type | Function |
---|---|
Message Broker | Deletes messages after processing |
Event Broker (Kafka) | Retains messages for later use |
2. Replication
Kafka replicates partitions across multiple brokers to prevent data loss.
@Figure 6: Kafka’s In-Sync Replica (ISR) mechanism for high availability.
Leader and Followers
- Leader Partition: Handles all writes and, by default, all reads.
- Follower Partitions: Replicated copies of the leader.
Acknowledgment (acks) Levels
acks Setting | Behavior | Speed | Data Loss Risk |
---|---|---|---|
0 | No confirmation from leader or followers | Fastest | Highest risk |
1 | Leader confirms receipt, but followers may not replicate | Moderate | Some risk |
all | Leader and all in-sync followers confirm replication | Slowest | Minimal risk |
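The trade-off in the table can be sketched as a small decision function. This models only when the producer would consider a send successful. The function `producer_sees_success` and its parameters are hypothetical, and the model ignores timeouts, retries, and broker-side settings.

```python
def producer_sees_success(acks, leader_stored, followers_stored):
    """Decide whether the producer treats a send as successful.

    acks: "0", "1", or "all".
    leader_stored: True if the leader wrote the record.
    followers_stored: one boolean per in-sync follower.
    """
    if acks == "0":
        return True  # fire and forget: no confirmation is awaited
    if acks == "1":
        return leader_stored
    if acks == "all":
        return leader_stored and all(followers_stored)
    raise ValueError(f"unknown acks value: {acks!r}")


# The leader has the record, but one follower has not replicated it yet.
with_acks_1 = producer_sees_success("1", True, [True, False])
with_acks_all = producer_sees_success("all", True, [True, False])

# The leader never stored the record; acks=0 does not notice.
with_acks_0 = producer_sees_success("0", False, [False, False])
```

With acks=1 the send already counts as successful here, so a leader failure before the follower catches up could lose the record. With acks=all the producer keeps waiting and can retry.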
As a best practice, set replication.factor = 3 when the cluster has 3 or more brokers.
Kafka Partitioner
The partitioner decides how messages are assigned to partitions.
@Figure 7: Kafka Producer Partitioning Strategy.
Message Key | Assignment Strategy | Characteristics |
---|---|---|
Yes | Uses a hash function on the key | Ordered processing |
No | Round-robin distribution | Balanced distribution |
Example Use Case:
- For weather logs, using “Seoul” as a key ensures all “Seoul” logs go to the same partition, preserving order.
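The two strategies in the table can be sketched as follows. This is an illustration rather than Kafka's implementation: `make_partitioner` and `choose_partition` are hypothetical names, and `zlib.crc32` is a stand-in hash chosen because it ships with Python (Kafka's Java client uses a murmur2 hash, so real partition numbers will differ).

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Build a function that picks a partition for each message."""
    round_robin = itertools.cycle(range(num_partitions))

    def choose_partition(key=None):
        if key is None:
            # No key: rotate through the partitions in order.
            return next(round_robin)
        # Key present: a stable hash (stand-in for murmur2) modulo the count.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose_partition


choose_partition = make_partitioner(4)

# Every "Seoul" log maps to the same partition, which preserves their order.
seoul_partitions = {choose_partition("Seoul") for _ in range(5)}

# Keyless messages are spread evenly over all partitions.
keyless_partitions = [choose_partition() for _ in range(8)]

print(seoul_partitions)
print(keyless_partitions)
```

Because the partition is computed as the hash modulo the partition count, changing the number of partitions can move a key to a different partition, which is worth keeping in mind before increasing partitions on a keyed topic.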
Consumer Lag
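Consumer lag is the gap between the latest offset written by producers and the last offset read by the consumer. It grows when producers send data faster than consumers can process it.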
@Figure 8: Consumer lag – The gap between producer and consumer processing speed.
In the consumer lag diagram, the left side represents the last offset read by the consumer, while the right side shows the last offset delivered by the producer.
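The gap described above can be computed per partition as the latest produced offset minus the consumer's offset. The sketch below assumes both sets of offsets are already available as dictionaries. The function `consumer_lag` is hypothetical, and treating a partition with no consumer offset as starting from 0 is a simplifying assumption.

```python
def consumer_lag(log_end_offsets, consumer_offsets):
    """Return lag per partition: latest produced offset minus consumer offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition the consumer has never read counts from 0.
        consumed = consumer_offsets.get(partition, 0)
        lag[partition] = max(end_offset - consumed, 0)
    return lag


lag_by_partition = consumer_lag(
    log_end_offsets={0: 120, 1: 95, 2: 40},
    consumer_offsets={0: 100, 1: 95},
)
total_lag = sum(lag_by_partition.values())

print(lag_by_partition)
print(total_lag)
```

A lag of 0 means the consumer has caught up on that partition, while a value that keeps growing over time signals that consumers cannot keep pace.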
Why is Consumer Lag a Problem?
- Causes delays in data processing.
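- Can overload Kafka: consumers that fall far behind read older data that may no longer be cached in memory, which adds disk load on brokers and reduces performance.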
Monitoring Consumer Lag
- Avoid relying on lag metrics reported by the consumer itself: if the consumer fails, it stops reporting lag.
- Use Burrow (by LinkedIn) for real-time lag monitoring.
Burrow
- Monitors multiple Kafka clusters.
- Categorizes consumer status as ERROR, WARNING, OK.
- Provides HTTP APIs for integration.