Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Kafka Streams
Build real-time stream processing applications with the Kafka Streams API and ksqlDB.What is Kafka Streams?
Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. Think of it as a spreadsheet that updates itself in real time: instead of running a batch job to recompute totals every night, Kafka Streams continuously processes events the moment they arrive. Unlike Spark or Flink, there is no separate cluster to manage — Kafka Streams runs inside your application, the same way a JDBC driver runs inside your app. You scale by running more instances of your application, and Kafka handles partition assignment automatically.Library, not Cluster
Runs in your application (Java/Scala), not on Kafka brokers
Scalable
Elastic scaling based on partitions
Stateful
Built-in state management (RocksDB)
Exactly-Once
Guaranteed processing semantics
Core Concepts
Streams vs Tables
The duality between streams and tables is the single most important concept in stream processing. Think of it like this: a bank statement (stream) is a list of every deposit and withdrawal. Your account balance (table) is the result of applying all those transactions. You can always derive one from the other.- KStream: An infinite stream of records (insert-only). Each record is a fact that happened — it does not replace previous records with the same key.
- Example: Credit card transactions. Two purchases by the same user are two separate events.
- KTable: A changelog stream (upsert). Represents the current state for each key. New records with the same key replace older ones.
- Example: User account balances. You only care about the latest balance, not every intermediate state.
Topology
A graph of processing nodes (sources, processors, sinks).Kafka Streams API (Java)
Stateless Transformations
Operations that don’t require memory of previous events (e.g., filter, map). These are the cheapest operations in stream processing because they examine each record in isolation — no state store, no disk I/O, no checkpoint overhead. If your logic can be expressed as a stateless transformation, prefer it.Stateful Transformations
Operations that require state (e.g., count, aggregate, join). These are where Kafka Streams truly shines — and where most of the complexity lives. Under the hood, each stateful operation creates a local RocksDB instance and a changelog topic in Kafka. The changelog topic is your safety net: if the application crashes, a new instance can restore state by replaying the changelog.ksqlDB: Streaming SQL
ksqlDB allows you to write stream processing applications using SQL syntax. If Kafka Streams is a general-purpose programming language, ksqlDB is a domain-specific language — it trades flexibility for dramatically lower development effort. For teams where data engineers know SQL but not Java, ksqlDB can reduce stream processing development time from weeks to hours.Create a Stream
Create a Table (State)
Stream-Table Join
Windowed Aggregation
Architecture
Partitioning & Scaling
Kafka Streams automatically handles load balancing. If you run multiple instances of your application, they will share the partitions of the input topics. The maximum parallelism equals the number of input partitions — if your topic has 12 partitions, you can run up to 12 stream tasks across however many application instances you want. This is the same consumer group model you already know, which means scaling a Kafka Streams app is identical to scaling a consumer group: just start more instances.State Stores
Stateful operations (likecount()) use local RocksDB instances to store state on disk. This state is also backed up to a Kafka “changelog topic” for fault tolerance. Think of it as a local database with an automatic backup to Kafka — if an instance crashes and restarts on a different machine, it rebuilds its local state by replaying the changelog topic. This is why Kafka Streams can be stateful without requiring an external database.
Use Cases
- Real-time Fraud Detection: Filter and analyze transaction streams. Flag transactions that exceed a threshold within a time window (e.g., more than 5 purchases in 60 seconds from the same card).
- Enrichment: Join a fast-moving event stream with a slowly-changing lookup table (e.g., enrich click events with user profile data). This replaces what would otherwise be a database lookup on every event.
- Monitoring: Aggregate logs and metrics in real-time. Build dashboards that show request counts, error rates, and latency percentiles without waiting for a batch pipeline.
- ETL: Transform, filter, and reshape data before loading into a data warehouse. Unlike traditional batch ETL, data arrives in the warehouse within seconds of being produced.
Kafka Streams vs Flink vs Spark Streaming
| Feature | Kafka Streams | Apache Flink | Spark Structured Streaming |
|---|---|---|---|
| Deployment | Library (no cluster) | Dedicated cluster | Spark cluster |
| Latency | Milliseconds | Milliseconds | Seconds to minutes |
| State Management | RocksDB + changelog | RocksDB + checkpoints | In-memory + checkpoint |
| Exactly-Once | Yes (Kafka-only) | Yes (any source/sink) | Yes (with limitations) |
| Language | Java/Scala | Java/Scala/Python/SQL | Scala/Java/Python/SQL |
| Best For | Kafka-to-Kafka pipelines | Complex event processing | Batch + streaming unification |
Common Pitfalls
Key Takeaways
- Kafka Streams is a library, not a cluster. It runs inside your application.
- KStream (events) and KTable (state) are dual representations of the same data.
- Stateless operations (filter, map) are cheap. Stateful operations (count, join) use RocksDB and changelog topics.
- ksqlDB gives you SQL over streams — ideal for simple transformations without writing Java.
- Maximum parallelism equals the number of input partitions.
- For Kafka-to-Kafka pipelines, Kafka Streams is the simplest production choice. For non-Kafka sources or complex event-time processing, consider Flink.
Next: Kafka Operations →