In 2013, Google faced a critical challenge: existing batch processing frameworks (like MapReduce and Spark) couldn’t handle real-time streaming data properly. The industry had two bad options:
Batch Processing (MapReduce, early Spark):
Wait hours for results
Can’t handle continuous data
Good for historical analysis, terrible for real-time
Micro-Batching (Spark Streaming):
Chop stream into tiny batches
Latency measured in seconds (at best)
Pretends batches are streams
Event time processing is a hack
What was missing? A framework that treats streaming as the default, not an afterthought.
Full Citation:
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”. VLDB ‘15.
Storm provided low-latency streaming (but no correctness guarantees)
Spark Streaming used micro-batching (high latency)
No framework handled out-of-order, unbounded data with event time semantics
The Problem Statement from the Paper:
“Existing systems force an uncomfortable choice: either sacrifice latency (batch systems), or sacrifice correctness (streaming systems that ignore event time).”
3. WHEN in processing time do you materialize results?
```java
// WHEN: Emit results when watermark passes end of window
// (This is the DEFAULT trigger in Flink)
// Or: Emit early results every 1 minute, then final result at watermark
DataStream<Tuple2<String, Integer>> earlyResults = events
    .map(event -> Tuple2.of(event.type, 1))
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .trigger(ContinuousEventTimeTrigger.of(Time.minutes(1)))
    .sum(1);
```
Triggers control when windows emit results (early, on-time, late).
Definition: When an event is processed by the streaming system.
```java
// Processing time: Flink uses the machine's wall clock at the moment
// each event is processed, so no timestamp extraction or watermarks
// are needed
DataStream<SensorReading> stream = env
    .addSource(new KafkaSource<>(...));
// Each event's effective timestamp = System.currentTimeMillis() on arrival;
// pair with processing-time window assigners downstream
```
“A watermark is a monotonically increasing timestamp indicating that no more events with timestamps less than the watermark will arrive.”
In Plain English: A watermark is the streaming system saying: “I’m confident that all events with timestamps before time T have arrived. I can now safely compute results for time T.”
```java
// Example: Reading from a Kafka topic with known partitions
// where timestamps are perfectly ordered within each partition
WatermarkStrategy<Event> perfectStrategy = WatermarkStrategy
    .<Event>forMonotonousTimestamps() // Assumes perfect order
    .withTimestampAssigner((event, timestamp) -> event.eventTime);
```
When to Use: Controlled environments, pre-sorted data, append-only logs.
Downside: A single out-of-order event violates the assumption; it arrives behind the watermark, gets classified as late data, and is typically dropped.
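In practice, most pipelines instead use Flink’s bounded-out-of-orderness strategy, which holds the watermark a fixed slack behind the maximum timestamp seen so far. A sketch; the 5-second bound is an assumed value, and the `eventTime` field is carried over from the example above:

```java
// Watermark = (max event timestamp seen so far) - 5 seconds.
// Events up to 5 seconds out of order are still counted as on time;
// anything later than that becomes late data.
WatermarkStrategy<Event> boundedStrategy = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, timestamp) -> event.eventTime);
```

The trade-off: a larger bound tolerates more disorder but delays every window result by that amount.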
```java
SingleOutputStreamOperator<Tuple2<String, Integer>> counts = events
    .keyBy(event -> event.sensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.minutes(1))   // Accept events up to 1 min late
    .sideOutputLateData(lateDataTag)    // Capture very late events
    .sum("value");

// Get the very late events that were rejected
DataStream<Event> veryLateEvents = counts.getSideOutput(lateDataTag);
```
Behavior:
Watermark passes end of window → Window closes and emits result
Late events (within allowed lateness) → Window reopens, updates result
Very late events (beyond allowed lateness) → Sent to side output
From the research paper “Lightweight Asynchronous Snapshots for Distributed Dataflows” (Paris Carbone et al., 2015): Flink implements a variant of the Chandy-Lamport distributed snapshot algorithm, called asynchronous barrier snapshotting, for exactly-once processing.
How it Works (Simplified):
1. JobManager injects a "barrier" into each source
2. Barrier flows through the DAG along with the data
3. When an operator receives the barrier from all inputs:
   - Snapshots its state to durable storage
   - Forwards the barrier downstream
4. When all operators have snapshotted: checkpoint complete
5. On failure: restore from the last completed checkpoint
Result: Even if a node crashes mid-processing, every event affects the state exactly once (events may be replayed after a failure, but their effect is never double-counted).
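The barrier-alignment idea can be sketched without Flink at all. The toy Java below (all names hypothetical, not Flink code) includes an event in the snapshot only if it arrived before the barrier on its own channel, which is exactly what keeps the snapshot consistent no matter how the channels interleave:

```java
import java.util.List;

// Toy illustration of barrier alignment (NOT Flink code; names hypothetical).
// An operator with two input channels snapshots only pre-barrier events,
// so the snapshot is consistent regardless of channel interleaving.
public class BarrierAlignmentDemo {
    static final String BARRIER = "#BARRIER#";

    // Counts pre-barrier events across both channels: the "snapshotted" state.
    static int alignedSnapshot(List<String> channel1, List<String> channel2) {
        int count = 0;
        for (List<String> channel : List.of(channel1, channel2)) {
            for (String record : channel) {
                if (record.equals(BARRIER)) break; // barrier reached: stop including
                count++;                           // pre-barrier event, part of snapshot
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "c", "y", "z" arrive after the barrier: they belong to the NEXT checkpoint
        List<String> channel1 = List.of("a", "b", BARRIER, "c");
        List<String> channel2 = List.of("x", BARRIER, "y", "z");
        System.out.println("snapshot count = " + alignedSnapshot(channel1, channel2)); // prints 3
    }
}
```

In real Flink, the "snapshot" is the operator state written to durable storage, and post-barrier records are simply accounted to the next checkpoint interval.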
```java
// Enable checkpointing
env.enableCheckpointing(60000); // Every 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// Flink guarantees:
// - No duplicates in output
// - No lost events
// - Consistent state
```
Misconception 3: “Exactly-once means no duplicates in external systems”
Reality: Exactly-once applies to Flink’s internal state. External sinks (databases, files) may still see duplicates on retries.
Solution: Use idempotent sinks or two-phase-commit sinks (Flink’s KafkaSink, or JdbcSink with XA transactions).
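For example, Flink’s KafkaSink can be configured for end-to-end exactly-once delivery using Kafka transactions. A sketch; the broker address, topic name, and transactional-id prefix below are placeholder values:

```java
// Two-phase commit via Kafka transactions: records become visible to
// consumers only when the Flink checkpoint that covers them completes.
KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("broker:9092")             // placeholder address
    .setRecordSerializer(
        KafkaRecordSerializationSchema.builder()
            .setTopic("results")                    // placeholder topic
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("my-app")             // required for EXACTLY_ONCE
    .build();
```

Note that downstream Kafka consumers must read with isolation.level=read_committed, or they will still see uncommitted (and possibly aborted) records.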
Q1: Explain event time vs processing time. When would you use each?
Answer:
Event Time: Timestamp embedded in the event (when it happened). Use when events can arrive out-of-order or with delays (mobile, IoT, distributed logs).
Processing Time: Timestamp when Flink processes the event. Use when events arrive in order and low latency matters more than correctness.
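The choice shows up directly in the window assigner. A sketch, reusing the `events` stream and the `sensorId`/`value` fields from the earlier examples:

```java
// Event time: window membership is decided by the timestamp inside each
// event, and the window closes when the watermark passes its end.
DataStream<SensorReading> eventTimeCounts = events
    .keyBy(e -> e.sensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .sum("value");

// Processing time: window membership is decided by the wall clock of the
// processing machine; no watermarks are involved, so results come fast
// but depend on when events happen to arrive.
DataStream<SensorReading> processingTimeCounts = events
    .keyBy(e -> e.sensorId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .sum("value");
```

Same pipeline shape, two different notions of time: only the window assigner changes.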
Q2: What is a watermark? Why is it necessary?
Answer:
A watermark is a monotonically increasing timestamp indicating “all events before time T have (probably) arrived.” Necessary because:
Infinite streams have no natural “end”
Need to know when to close windows and emit results
Allows handling late data gracefully
Q3: How does Flink achieve exactly-once semantics?
Answer:
Flink uses the Chandy-Lamport distributed snapshot algorithm:
Periodically injects barriers into the stream
Operators snapshot state when barrier arrives
On failure, restores from last successful checkpoint
Replays events from checkpoint point
Q4: Why is Flink better than Spark Streaming for low-latency use cases?
Answer:
Flink is a true streaming engine: it processes each event individually as it arrives, so latency can be in the milliseconds. Spark Streaming chops the stream into micro-batches, so its latency can never drop below the batch interval, which is typically measured in seconds.
Q: How would you handle a data stream where 10% of events arrive > 1 hour late?
Answer:
```java
OutputTag<Event> veryLateTag = new OutputTag<Event>("very-late"){};

SingleOutputStreamOperator<Result> mainStream = events
    .keyBy(...)
    .window(...)
    .allowedLateness(Time.minutes(10))  // Reasonable lateness
    .sideOutputLateData(veryLateTag)    // Capture very late events
    .aggregate(...);

// Process very late data separately (e.g., batch reprocessing)
DataStream<Event> veryLateData = mainStream.getSideOutput(veryLateTag);
veryLateData.addSink(new LateDataSink());
```
✅ Dataflow Model Foundations: The four questions (What, Where, When, How)
✅ Event Time Processing: Why it matters, how it works
✅ Watermarks: The core innovation for handling infinite streams
✅ Flink Architecture: JobManager, TaskManager, task slots, parallelism
✅ Exactly-Once Semantics: Chandy-Lamport snapshots
✅ Hands-On Examples: Word count, sensor monitoring, late data handling
Practice Time: Spend 4-6 hours implementing the examples in this module with your own data sources (Kafka, files, sockets) to truly internalize these concepts.