Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant data pipelines and real-time applications. Originally developed at LinkedIn and open-sourced in 2011, Kafka handles trillions of events daily for organizations worldwide. Unlike traditional message brokers, Kafka stores event streams as durable, append-only logs that consumers can replay, making it ideal for event sourcing, log aggregation, metrics collection, and building real-time data pipelines. Kafka's distributed architecture provides horizontal scalability, fault tolerance through replication, and millisecond latency for publishing and consuming events.

What is Apache Kafka?
Apache Kafka is a distributed streaming platform that enables applications to publish, subscribe to, store, and process streams of events in real time. Kafka organizes events into topics (categories), which are partitioned and distributed across a cluster of brokers for scalability and fault tolerance. Producers write events to topics, and consumers read from topics at their own pace—Kafka retains events for configurable periods (hours to years) regardless of consumption. This publish-subscribe model combined with durable storage makes Kafka fundamentally different from traditional message queues—it's a distributed commit log optimized for sequential writes and reads.
Kafka achieves exceptional throughput (millions of messages per second) through sequential disk I/O, zero-copy transfers, and batching. Each topic partition is replicated across multiple brokers for fault tolerance—if a broker fails, consumers seamlessly switch to replicas. Kafka Connect provides pre-built connectors for integrating with databases, cloud storage, and other systems. Kafka Streams and ksqlDB enable stream processing directly within Kafka—transforming, aggregating, and joining event streams without external processing frameworks. This comprehensive ecosystem makes Kafka the de facto standard for building event-driven architectures, microservices communication, and real-time data platforms.
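To make this model concrete, here is a minimal sketch using the kafka-python client (the broker address, topic name, and consumer group are illustrative assumptions): a producer writes keyed JSON events, and a consumer in a group reads them back along with the partition and offset each record occupies in the log.

```python
# Minimal produce/consume sketch with kafka-python (pip install kafka-python).
# Broker address, topic name, and group id are illustrative placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Events with the same key land on the same partition, preserving per-key order.
producer.send("user-events", key="user-42", value={"action": "login", "ts": 1700000000})
producer.send("user-events", key="user-42", value={"action": "view", "page": "/pricing"})
producer.flush()  # block until the broker acknowledges the batched sends

# A consumer in a group; Kafka assigns it a share of the topic's partitions.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",  # start from the beginning if no committed offset exists
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each record carries the partition and offset it occupies in the commit log.
    print(record.partition, record.offset, record.key, record.value)
```

Because records with the same key hash to the same partition, per-key ordering is preserved even as a topic is spread across many brokers.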
Core Features and Capabilities
Event Streaming Fundamentals
- Topics and partitions - Organize events into categories and distribute for scalability
- Durable storage - Retain events for hours, days, or indefinitely
- Event replay - Consumers reprocess historical events from any offset (see the sketch after this list)
- Producer acknowledgments - Configure durability vs latency tradeoffs
- Consumer groups - Distribute partition consumption across multiple consumers
- Exactly-once semantics - Transactional guarantees for critical workflows
- Compacted topics - Retain only latest value per key for changelog semantics
- Time-based indexing - Access events by timestamp for time-travel queries
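The replay and time-based indexing items above can be driven directly from a client. A hedged sketch with kafka-python (topic name and timestamp are placeholders): ask the broker for the earliest offset at or after a timestamp with offsets_for_times, seek each partition there, and reprocess from that point.

```python
# Replay a topic from a point in time using Kafka's time-based offset index.
# Assumes a local broker and a topic named "user-events"; both are placeholders.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)

partitions = [TopicPartition("user-events", p)
              for p in consumer.partitions_for_topic("user-events")]
consumer.assign(partitions)

# Ask the broker for the earliest offset at or after this timestamp (milliseconds).
replay_from_ms = 1700000000000
offsets = consumer.offsets_for_times({tp: replay_from_ms for tp in partitions})

for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)  # rewind this partition to the matching offset

for record in consumer:
    print(record.timestamp, record.offset, record.value)  # reprocess history
```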
Scalability and Fault Tolerance
- Horizontal scaling - Add brokers to increase throughput and storage
- Partition replication - Configurable replication factor for redundancy (see the sketch after this list)
- Leader election - Automatic failover when brokers fail
- Rack awareness - Distribute replicas across failure domains
- Multi-datacenter replication - MirrorMaker for cross-cluster streaming
- Tiered storage - Offload old data to S3/GCS while keeping recent data local
- Elastic scaling - Dynamically add/remove brokers without downtime
- High throughput - Millions of messages/second per broker
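Replication is set per topic when it is created. A minimal sketch with kafka-python's admin client (broker address, topic name, and counts are assumptions) creates a topic whose six partitions each keep three replicas, so losing a single broker does not lose data:

```python
# Create a replicated topic; with replication_factor=3 each partition has a
# leader plus two follower replicas on other brokers.
# Broker address, topic name, partition count, and retention are illustrative.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topic = NewTopic(
    name="user-events",
    num_partitions=6,        # parallelism: up to 6 consumers in one group
    replication_factor=3,    # requires a cluster with at least 3 brokers
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # keep 7 days
)
admin.create_topics([topic])
```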
Stream Processing and Integration
- Kafka Streams - Java library for stateful stream processing
- ksqlDB - SQL interface for stream processing and materialized views
- Kafka Connect - Connectors for databases, S3, Elasticsearch, HDFS
- Schema Registry - Manage Avro/Protobuf/JSON schemas with compatibility checking
- Exactly-once processing - Transactions for atomicity across streams
- Windowing - Tumbling, hopping, sliding, session windows for aggregations (illustrated below)
- State stores - Local key-value stores for stateful transformations
- Interactive queries - Query materialized views from stream processors
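Kafka Streams and ksqlDB provide windowing natively (in Java and SQL respectively); the snippet below is only a hand-rolled Python illustration of the tumbling-window idea, counting events per key in fixed one-minute buckets with a plain consumer. Topic name, group id, and window size are assumptions, and the in-memory dictionary stands in for a fault-tolerant state store.

```python
# Hand-rolled tumbling-window count: NOT Kafka Streams, just an illustration of
# bucketing events into fixed, non-overlapping time windows keyed on the record key.
from collections import defaultdict
from kafka import KafkaConsumer

WINDOW_MS = 60_000  # one-minute tumbling windows

consumer = KafkaConsumer(
    "page-views",                       # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="windowed-counter",
    auto_offset_reset="earliest",
)

# (window_start_ms, key) -> count; a real stream processor keeps this state in a
# changelog-backed state store and emits results when windows close.
counts = defaultdict(int)

for record in consumer:
    window_start = (record.timestamp // WINDOW_MS) * WINDOW_MS
    counts[(window_start, record.key)] += 1
    print(window_start, record.key, counts[(window_start, record.key)])
```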
Apache Kafka for AI/ML Applications
Kafka is essential for AI/ML data pipelines and real-time systems:
- Feature pipelines - Stream features from source systems to feature stores
- Real-time inference - Stream prediction requests to ML models at scale (see the sketch after this list)
- Training data ingestion - Collect labeled examples for continuous learning
- Model monitoring - Stream predictions and actuals for drift detection
- Event-driven retraining - Trigger model updates based on performance metrics
- A/B testing infrastructure - Route traffic across model versions
- Online learning - Update models with streaming data in real time
- Data lake ingestion - Stream raw data to S3/GCS for batch processing
- Metrics aggregation - Collect model performance metrics across services
- Change data capture - Stream database changes for feature computation
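As noted in the real-time inference item above, a common pattern is to run model scoring as a consumer group: read feature payloads from a request topic, score them, and publish predictions to a results topic. A hedged sketch follows; topic names, message fields, and the scoring function are placeholders for a real model.

```python
# Streaming inference loop: read feature events, score them, emit predictions.
# Topics, field names, and the scoring function are illustrative placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

def score(features):
    # Stand-in for a real model (e.g. a loaded scikit-learn or ONNX model).
    return sum(features.values()) / max(len(features), 1)

consumer = KafkaConsumer(
    "inference-requests",
    bootstrap_servers="localhost:9092",
    group_id="model-v1-scorers",         # scale out by adding group members
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for request in consumer:
    prediction = score(request.value["features"])
    producer.send("inference-results", value={
        "request_id": request.value["request_id"],
        "model": "model-v1",
        "prediction": prediction,
    })
```

Throughput scales by adding partitions to the request topic and more consumers to the scoring group; Kafka rebalances partition assignments automatically.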
Use Cases and Applications
- Event sourcing - Store all state changes as immutable event log
- Log aggregation - Centralized logging from distributed services
- Metrics collection - Time-series metrics for monitoring and alerting
- Stream processing - Real-time transformations, aggregations, enrichment
- CDC (Change Data Capture) - Replicate database changes to downstream systems
- Microservices communication - Asynchronous event-driven messaging
- Activity tracking - User behavior, clickstreams, application events
- IoT data ingestion - Telemetry from millions of devices
- Real-time analytics - Dashboards updated with millisecond latency
- Data integration - Connect heterogeneous systems with Kafka as backbone
Apache Kafka vs RabbitMQ and Other Solutions
Compared to RabbitMQ (traditional message broker), Kafka excels at high-throughput event streaming, long-term message retention, and stream processing. Kafka can handle millions of messages per second with durable storage, while RabbitMQ focuses on flexible routing and lower latency for transactional messaging. RabbitMQ provides richer routing (topic exchanges, headers), request/reply patterns, and message prioritization. Kafka is better for log aggregation, event sourcing, and analytics; RabbitMQ for task queues, RPC, and complex routing.
Compared to cloud-native services (AWS Kinesis, Google Pub/Sub, Azure Event Hubs), Kafka offers vendor neutrality, on-premises deployment, and richer ecosystem (Kafka Streams, ksqlDB, Connect). Managed Kafka services (Confluent Cloud, AWS MSK, Azure HDInsight) provide Kafka's power with cloud convenience. For applications requiring maximum throughput, event replay, and stream processing, Kafka is typically the best choice. For simpler use cases with cloud-native requirements, managed alternatives may suffice.
Getting Started with Apache Kafka
Install Kafka locally with Docker Compose or by downloading the binaries. Recent releases (Kafka 3.3+) can run in KRaft mode without ZooKeeper; the classic ZooKeeper-based quickstart looks like this:
- Start ZooKeeper: `bin/zookeeper-server-start.sh config/zookeeper.properties`
- Start the Kafka broker: `bin/kafka-server-start.sh config/server.properties`
- Create a topic: `bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092`
- Produce messages: `bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092`
- Consume them: `bin/kafka-console-consumer.sh --topic test --from-beginning --bootstrap-server localhost:9092`
In applications, use a client library for your language (Java: kafka-clients, Python: kafka-python, Node.js: kafkajs).
For production, deploy a multi-broker cluster (at least three brokers), configure a replication factor of 3 or higher, set up monitoring with Prometheus/Grafana, define topic retention policies deliberately, and secure the cluster with TLS plus SASL authentication. Managed Kafka services (Confluent Cloud, AWS MSK, Aiven) handle infrastructure and operations. Use Schema Registry for schema management and validation. Start with the official Kafka documentation and tutorials to understand partitioning, consumer groups, and performance tuning.
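On the client side, durability and security settings matter as much as cluster sizing. Below is a sketch of a kafka-python producer configured for stronger delivery guarantees over TLS with SASL/SCRAM authentication; broker addresses, credentials, and certificate paths are placeholders.

```python
# Producer tuned for durability and secured with TLS + SASL/SCRAM.
# Broker addresses, credentials, and certificate paths are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9093", "broker2:9093", "broker3:9093"],
    acks="all",                       # wait for all in-sync replicas to acknowledge
    retries=5,                        # retry transient broker errors
    compression_type="gzip",          # reduce network and storage footprint
    linger_ms=10,                     # small batching delay to improve throughput
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="app-producer",
    sasl_plain_password="change-me",
    ssl_cafile="/etc/kafka/ca.pem",
)
producer.send("user-events", b"hello")
producer.flush()
```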
Integration with 21medien Services
21medien implements Apache Kafka for event-driven AI/ML architectures. We use Kafka for real-time feature pipelines, streaming inference requests to ML models, collecting training data continuously, and monitoring model performance at scale. Our team provides Kafka consulting, architecture design (topic design, partitioning strategy, retention policies), performance tuning (throughput optimization, latency reduction), and managed operations. We specialize in Kafka for building real-time ML systems, event-driven microservices, and scalable data platforms for AI applications. We help clients migrate from batch to streaming architectures, implement Kafka Connect pipelines, and build stream processing applications with Kafka Streams or ksqlDB.
Pricing and Access
Apache Kafka is open-source and free (Apache 2.0 license). Self-hosting costs are infrastructure only. Managed services pricing: Confluent Cloud charges per GB ingress (~$0.11/GB), egress (~$0.09/GB), and storage (~$0.10/GB-month), typical costs $100-2000+/month. AWS MSK ~$0.21/hour per broker (kafka.t3.small) to $9.36/hour (kafka.m5.24xlarge), plus storage $0.10/GB-month, typical $300-5000+/month for production clusters. Aiven for Kafka starts ~$120/month for small clusters, $500-3000+/month for production. Self-hosted on cloud VMs: $200-2000/month for small clusters (3-5 nodes), $2000-10,000+/month for high-throughput deployments. For AI/ML workloads with real-time feature streaming, budget $500-2000/month managed, $200-1000/month self-hosted for moderate scale.