
Kafka Ecosystem

Kafka is more than just a message broker. It’s a complete event streaming platform.

1. Kafka Connect

Problem: Writing custom code to move data between Kafka and other systems (databases, S3, Elasticsearch) is tedious and error-prone.

Solution: Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.

Architecture

  • Source Connectors: Pull data from a system (e.g., PostgreSQL) into Kafka.
  • Sink Connectors: Push data from Kafka into a system (e.g., Elasticsearch).

Example: PostgreSQL Source Connector

Capture row-level changes from a database table and stream them into Kafka topics (Change Data Capture, or CDC). This example uses the Debezium PostgreSQL connector:
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventory",
    "topic.prefix": "db-server1"
  }
}
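
Connectors are usually registered by POSTing this JSON to the Connect worker's REST API (port 8083 by default). A minimal sketch in Python, assuming a worker is reachable at http://localhost:8083 and using the third-party requests library:

import json

import requests  # pip install requests

# The connector definition shown above, as a Python dict.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "topic.prefix": "db-server1",
    },
}

# POST /connectors creates the connector on the Connect cluster.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())  # echoes the created connector's name, config, and tasks

Once registered, inserts, updates, and deletes on the captured tables appear as change events on topics prefixed with db-server1.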

2. Schema Registry

Problem: How do you ensure producers and consumers agree on the data format? What happens when the schema changes?

Solution: Schema Registry, a centralized service for storing, versioning, and validating schemas (Avro, Protobuf, JSON Schema).

How it works

  1. The producer registers the schema with (or looks it up in) the Schema Registry before sending a message.
  2. The producer serializes the data (e.g., to Avro) and embeds the schema ID in the message.
  3. The consumer reads the schema ID, fetches the matching schema from the registry, and deserializes the data.
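
Concretely, the common Confluent serializers frame each message value as a single magic byte (0), the 4-byte schema ID in big-endian order, and then the Avro-encoded payload. A minimal sketch of that framing in Python; the schema ID and payload bytes are placeholders, and the actual Avro encoding is omitted:

import struct

MAGIC_BYTE = 0  # marker byte used by the Confluent wire format

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(value: bytes) -> tuple[int, bytes]:
    """Split a message value into (schema_id, avro_payload).

    A real consumer would use schema_id to fetch the writer's schema
    from the registry before decoding the payload.
    """
    magic, schema_id = struct.unpack(">bI", value[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Schema Registry framed message")
    return schema_id, value[5:]

# Round trip with placeholder values: schema ID 42, fake payload bytes.
print(unframe(frame(42, b"<avro bytes>")))  # (42, b'<avro bytes>')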

Benefits

  • Data Governance: Enforce schema compatibility rules (Backward, Forward, Full) so schema changes cannot silently break consumers; the sketch below shows how a rule is set.
  • Smaller Payloads: The schema is stored once in the registry; each message carries only a compact schema ID instead of the full schema.
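
Compatibility rules are configured per subject through the registry's REST API. A minimal sketch, assuming the registry runs at http://localhost:8081 (its default port) and that the subject is named user_clicks-value (a made-up name following the common <topic>-value convention):

import requests  # pip install requests

# Subject name and registry URL are assumptions for illustration.
subject = "user_clicks-value"
resp = requests.put(
    f"http://localhost:8081/config/{subject}",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"compatibility": "BACKWARD"},
)
resp.raise_for_status()
print(resp.json())  # typically echoes {'compatibility': 'BACKWARD'}

With BACKWARD compatibility, new schema versions that existing data cannot satisfy (for example, adding a required field with no default value) are rejected at registration time.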

3. ksqlDB

Problem: Writing stream processing applications in Java or Scala (e.g., with Kafka Streams) is complex and verbose.

Solution: ksqlDB lets you build stream processing applications using SQL.
-- Create a stream from a Kafka topic
CREATE STREAM user_clicks (
    user_id INT,
    url VARCHAR,
    event_time VARCHAR
) WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');

-- Real-time aggregation
CREATE TABLE clicks_per_user AS
    SELECT user_id, COUNT(*) AS click_count
    FROM user_clicks
    GROUP BY user_id;
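
These statements are typically run from the ksql CLI, but ksqlDB also exposes an HTTP API. A minimal sketch of a pull query against the clicks_per_user table via the /query endpoint, assuming the server runs at http://localhost:8088 (its default port) and using a made-up user_id:

import requests  # pip install requests

# Server URL and the user_id value are assumptions for illustration.
payload = {
    "ksql": "SELECT user_id, click_count FROM clicks_per_user WHERE user_id = 1;",
    "streamsProperties": {},
}
resp = requests.post(
    "http://localhost:8088/query",
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json=payload,
)
resp.raise_for_status()
print(resp.text)  # a schema header followed by the matching row(s)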

4. Kafka vs RabbitMQ vs ActiveMQ

| Feature | Kafka | RabbitMQ | ActiveMQ |
| --- | --- | --- | --- |
| Model | Log-based (Pull) | Queue-based (Push) | Queue-based (Push) |
| Throughput | Extremely High | High | Moderate |
| Persistence | Long-term (Days/Years) | Short-term (Until consumed) | Short-term |
| Use Case | Event Streaming, Logs, Analytics | Complex Routing, Task Queues | Enterprise Integration |

Key Takeaways

  • Use Kafka Connect to integrate with external systems without writing custom integration code.
  • Use Schema Registry to manage data contracts and evolution.
  • Use ksqlDB for simple, SQL-based stream processing.
  • Choose Kafka for high-throughput event streaming, and RabbitMQ for complex routing.

Next: Kafka Architecture →