Apache Samza is a distributed stream processing framework that enables stateful applications to process real-time data from multiple sources with low latency and high throughput. It supports flexible deployment and integrates with systems like Kafka, HDFS, and cloud services.

Vendor

The Apache Software Foundation

Product details

Apache Samza

Apache Samza is a distributed stream processing framework for building stateful applications that process real-time data from multiple sources. It runs in production at large scale and supports several deployment options: on YARN, on Kubernetes, or as a library embedded in an existing application. Samza integrates with systems such as Apache Kafka, HDFS, Amazon Kinesis, and Azure Event Hubs.

Features

  • High-performance stream processing with low latency and high throughput.
  • Horizontal scalability with support for terabytes of state and thousands of cores.
  • Rich APIs including Streams DSL, Samza SQL, Apache Beam, and low-level task APIs.
  • Unified API for both batch and streaming data.
  • Pluggable architecture for integrating with various data sources and sinks.
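  • Flexible deployment: standalone, embedded as a library, or run by a cluster manager such as YARN or Kubernetes.
  • Fault tolerance through host affinity (restarting a task on the machine that already holds its local state) and incremental checkpointing.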
  • Asynchronous processing for high-throughput remote I/O.
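
To illustrate why asynchronous processing helps with remote I/O, here is a minimal sketch in plain Python. It does not use Samza's API. The `fetch_profile` lookup, the `member_id` field, and the `max_in_flight` limit are hypothetical names chosen for illustration. The idea is that many remote calls are in flight at once, so time spent waiting on one call overlaps with the others.

```python
import asyncio


async def fetch_profile(member_id: int) -> dict:
    """Hypothetical remote lookup; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"id": member_id, "name": f"member-{member_id}"}


async def enrich(events: list, max_in_flight: int = 8) -> list:
    """Enrich events with bounded concurrency, preserving input order."""
    # The semaphore caps how many remote calls are outstanding at once.
    limit = asyncio.Semaphore(max_in_flight)

    async def enrich_one(event: dict) -> dict:
        async with limit:
            profile = await fetch_profile(event["member_id"])
        return {**event, "name": profile["name"]}

    # gather() runs the coroutines concurrently and returns results
    # in the same order as the inputs.
    return await asyncio.gather(*(enrich_one(e) for e in events))
```

Calling `asyncio.run(enrich(events))` on a list of dictionaries that each carry a `member_id` returns the same events with a `name` field added. Bounding the number of in-flight calls is one way to keep a slow remote service from being overwhelmed.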

Capabilities

  • Real-time data processing from multiple sources with guaranteed at-least-once delivery.
  • Stateful stream processing using scalable, fault-tolerant local state stores.
  • Stream partitioning and parallel task execution for efficient scaling.
  • Event-time and processing-time semantics for accurate time-based operations.
  • Dynamic task migration and recovery using changelogs and host-affinity.
  • Embedded library mode for lightweight integration into existing applications.
  • Managed service mode for large-scale deployments using YARN or Kubernetes.
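
The partitioning and changelog-based recovery described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Samza's implementation or API: `CountingTask`, `partition_for`, and `NUM_PARTITIONS` are hypothetical names, the state store is an in-memory dictionary, and the changelog is a local list standing in for a replicated stream.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition using a stable hash (crc32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class CountingTask:
    """Hypothetical task that owns one partition and counts events per key."""

    def __init__(self, partition: int):
        self.partition = partition
        self.store = {}      # local state store (in memory here)
        self.changelog = []  # stand-in for a replicated changelog stream

    def process(self, key: str) -> None:
        new_count = self.store.get(key, 0) + 1
        self.store[key] = new_count
        # Record every write so the store can be rebuilt on another host.
        self.changelog.append((key, new_count))

    @classmethod
    def restore(cls, partition: int, changelog: list) -> "CountingTask":
        """Rebuild a task's store by replaying its changelog in order."""
        task = cls(partition)
        for key, value in changelog:
            task.store[key] = value  # the last write per key wins
        task.changelog = list(changelog)
        return task


def run(events: list) -> dict:
    """Route each event to the task that owns its key's partition."""
    tasks = {p: CountingTask(p) for p in range(NUM_PARTITIONS)}
    for key in events:
        tasks[partition_for(key)].process(key)
    return tasks
```

Because all events for a given key land on the same task, each task can keep its counts locally without coordinating with the others. If a task is lost, `CountingTask.restore` rebuilds the same store from the changelog. Host affinity avoids most of that replay by restarting the task where its local state already exists.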

Benefits

  • Scalability and reliability proven in production by companies like LinkedIn, Uber, and Slack.
  • Flexibility to run in diverse environments from cloud to bare-metal.
  • Efficient resource usage with incremental state flushing and local storage.
  • Simplified development with declarative and imperative APIs.
  • Resilience to failures with fast recovery and minimal downtime.
  • Open-source and community-driven under the Apache Software Foundation.