Apache Iceberg for Event-Driven Architectures

Apache Iceberg serves as the ideal durable event store for event-driven architectures. Unlike Kafka—which has limited retention and no SQL query interface—Iceberg on S3 retains events indefinitely, supports time travel to replay any historical period, and exposes full SQL analytics. RisingWave bridges the two: consuming Kafka events in real time and materializing them into queryable Iceberg tables with exactly-once semantics.

The Gap in Traditional Event-Driven Architectures

Event-driven systems excel at decoupling producers from consumers. Kafka handles the real-time bus. But Kafka has two fundamental limitations for analytics:

  1. Bounded retention: Kafka topics typically retain data for 7 days. Querying events from 6 months ago requires a separate archival layer.
  2. No SQL analytics: Kafka Streams and KSQL handle simple transformations but cannot run complex analytical queries across the full event history.

Adding Iceberg as an event store solves both problems. Events flow through Kafka into Iceberg with full history. Any SQL engine—Trino, Spark, DuckDB, or RisingWave itself—can query the entire event timeline.

Architecture: Event-Driven + Lakehouse

The pattern combines three systems:

  • Kafka: Real-time event bus, short-term retention (hours to days)
  • RisingWave: Stream processor, computes derived state from event streams
  • Iceberg on S3: Durable event archive, long-term retention, full SQL queryable

RisingWave simultaneously serves real-time consumers (via materialized views) and archives raw events + derived state to Iceberg for historical analytics.

Step 1: Connect to the Event Stream

CREATE SOURCE domain_events (
    event_id       VARCHAR,
    aggregate_id   VARCHAR,
    aggregate_type VARCHAR,
    event_type     VARCHAR,
    payload        VARCHAR,
    metadata       VARCHAR,
    occurred_at    TIMESTAMPTZ,
    sequence_num   BIGINT
)
WITH (
    connector = 'kafka',
    topic = 'domain.events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

Step 2: Compute Aggregate State as a Materialized View

In event sourcing, the current state of an aggregate is derived by replaying its events. RisingWave maintains this derived state continuously:

CREATE MATERIALIZED VIEW order_aggregate_state AS
SELECT
    aggregate_id                                AS order_id,
    MAX(sequence_num)                           AS latest_sequence,
    MAX(occurred_at)                            AS last_updated,
    COUNT(*)                                    AS total_events,
    SUM(CASE WHEN event_type = 'OrderPlaced'
             THEN 1 ELSE 0 END)                 AS placed_count,
    SUM(CASE WHEN event_type = 'OrderShipped'
             THEN 1 ELSE 0 END)                 AS shipped_count,
    SUM(CASE WHEN event_type = 'OrderCancelled'
             THEN 1 ELSE 0 END)                 AS cancelled_count,
    MAX(CASE WHEN event_type = 'OrderPlaced'
             THEN payload END)                  AS placed_payload
FROM domain_events
WHERE aggregate_type = 'Order'
GROUP BY aggregate_id;
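Downstream services read this derived state with an ordinary query instead of replaying events themselves. For example, a point lookup (the order ID here is illustrative):

SELECT order_id, total_events, shipped_count, cancelled_count
FROM order_aggregate_state
WHERE order_id = 'order-12345';

Because the view is incrementally maintained, the query returns the current state without scanning the event history.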

Step 3: Archive Raw Events to Iceberg

CREATE SINK raw_events_archive AS
SELECT
    event_id,
    aggregate_id,
    aggregate_type,
    event_type,
    payload,
    metadata,
    occurred_at,
    sequence_num
FROM domain_events
WITH (
    connector       = 'iceberg',
    type            = 'append-only',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://event-store/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'events',
    table.name      = 'domain_events'
);

Step 4: Sink Derived State to Iceberg

CREATE SINK order_state_sink AS
SELECT * FROM order_aggregate_state
WITH (
    connector       = 'iceberg',
    type            = 'upsert',
    primary_key     = 'order_id',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://event-store/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'events',
    table.name      = 'order_state'
);

Comparing Event Store Approaches

| Approach                         | Retention         | Query Interface          | Replay Support         | Cost at Scale  |
|----------------------------------|-------------------|--------------------------|------------------------|----------------|
| Kafka only                       | Days              | Limited (KSQL)           | Yes (within retention) | Medium         |
| Kafka + Elasticsearch            | Days + indefinite | Full-text + limited SQL  | Partial                | High           |
| Kafka + PostgreSQL               | Indefinite        | Full SQL                 | Full                   | High (compute) |
| Event sourcing DB (EventStoreDB) | Indefinite        | Proprietary              | Full                   | Medium         |
| Kafka + Iceberg + RisingWave     | Indefinite        | Full SQL                 | Full                   | Low            |

Event Replay and Backfilling

One of the most powerful capabilities of Iceberg as an event store is support for event replay. When you add a new downstream consumer that needs historical events, replay them directly from Iceberg:

-- Replay the events for a specific aggregate, in order.
-- Run this from any engine attached to the Iceberg catalog
-- (Trino, Spark SQL, DuckDB, or RisingWave itself).
SELECT
    aggregate_id,
    event_type,
    payload,
    occurred_at,
    sequence_num
FROM events.domain_events
WHERE aggregate_id = 'order-12345'
ORDER BY sequence_num ASC;

For full-history projections, RisingWave v2.8 can query Iceberg directly, enabling new materialized views that backfill from the complete event archive.
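As a sketch of that backfill path: define a source over the archive table and build a new materialized view on top of it. This assumes RisingWave's iceberg source connector accepts the same catalog options as the sinks above; check the connector documentation for the exact option names.

CREATE SOURCE events_archive
WITH (
    connector       = 'iceberg',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://event-store/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'events',
    table.name      = 'domain_events'
);

-- A new projection that backfills from the full archive,
-- not just whatever Kafka still retains.
CREATE MATERIALIZED VIEW lifetime_events_per_aggregate AS
SELECT aggregate_id, COUNT(*) AS total_events
FROM events_archive
GROUP BY aggregate_id;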

Handling Event Ordering Guarantees

Event-driven systems require strict event ordering per aggregate. Iceberg's append-only writes preserve events as produced, and the sequence_num assigned by the producer records the original ordering. Because the aggregations in the materialized view (MAX, SUM, COUNT) are order-insensitive, out-of-order deliveries (common with distributed Kafka consumers) do not corrupt aggregate state.

For strict ordering requirements, use HOP windows to detect sequence gaps:

CREATE MATERIALIZED VIEW event_gap_detector AS
SELECT
    aggregate_id,
    window_start,
    MAX(sequence_num) - MIN(sequence_num) + 1 AS expected_count,
    COUNT(*)                                   AS actual_count,
    MAX(sequence_num) - MIN(sequence_num) + 1
        != COUNT(*)                            AS has_gap
FROM HOP(domain_events, occurred_at, INTERVAL '1 MINUTE', INTERVAL '5 MINUTES')
GROUP BY aggregate_id, window_start
HAVING MAX(sequence_num) - MIN(sequence_num) + 1 != COUNT(*);

FAQ

Q: Can Iceberg replace Kafka for event streaming? A: No. Iceberg is a storage format optimized for batch reads and streaming writes, not for low-latency pub/sub messaging. Keep Kafka for real-time event distribution and use Iceberg for durable archival and analytics.

Q: How do I implement the outbox pattern with RisingWave and Iceberg? A: Use a postgres-cdc source to capture outbox table inserts, materialize them in RisingWave, and sink to both a Kafka topic (for downstream consumers) and an Iceberg table (for auditing and replay).
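A minimal sketch of that outbox pipeline, assuming a Postgres outbox table named outbox and hypothetical connection details (hostname, credentials, and column names are placeholders):

-- Capture outbox inserts via CDC; CDC tables need a primary key.
CREATE TABLE outbox_events (
    id           BIGINT,
    aggregate_id VARCHAR,
    event_type   VARCHAR,
    payload      VARCHAR,
    created_at   TIMESTAMPTZ,
    PRIMARY KEY (id)
)
WITH (
    connector     = 'postgres-cdc',
    hostname      = 'postgres',
    port          = '5432',
    username      = 'rw',
    password      = 'secret',
    database.name = 'app',
    schema.name   = 'public',
    table.name    = 'outbox'
);

-- Fan out: one sink to Kafka for downstream consumers.
-- A second iceberg sink (as in Step 3) covers auditing and replay.
CREATE SINK outbox_to_kafka FROM outbox_events
WITH (
    connector = 'kafka',
    topic = 'domain.events',
    properties.bootstrap.server = 'kafka:9092'
)
FORMAT PLAIN ENCODE JSON;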

Q: Does Iceberg support event sourcing projections? A: Iceberg stores data; projections are computed by query engines. Use RisingWave materialized views as real-time projections and query Iceberg directly for point-in-time historical projections.
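For point-in-time projections, Iceberg time travel does the heavy lifting. The syntax below is Spark SQL's (engine-specific), and the timestamp is illustrative:

-- State of all orders as the derived table stood at a past moment
SELECT order_id, shipped_count, cancelled_count
FROM events.order_state TIMESTAMP AS OF '2024-06-01 00:00:00'
WHERE cancelled_count > 0;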

Q: How many events per second can RisingWave write to Iceberg? A: In benchmarks, RisingWave sustains 1–10 million events/second into Iceberg sinks depending on payload size and hardware. The bottleneck is typically S3 throughput, not RisingWave compute.

Q: How do I correlate events across multiple aggregate types? A: Use RisingWave's temporal join (FOR SYSTEM_TIME AS OF) to join events from different aggregate types at a consistent point in time, preserving causal ordering across domain boundaries.
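As a sketch, assuming a hypothetical customer_profiles table maintained in RisingWave (keyed on customer_id, with the events carrying a matching field), a temporal join enriches each event with the profile row as of the moment the event is processed:

CREATE MATERIALIZED VIEW events_with_profile AS
SELECT
    e.event_id,
    e.aggregate_id,
    e.event_type,
    c.tier
FROM domain_events AS e
JOIN customer_profiles FOR SYSTEM_TIME AS OF PROCTIME() AS c
    ON e.aggregate_id = c.customer_id;

The join key must cover the table's primary key so each event matches exactly one profile version.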

Get Started

Build your event-driven lakehouse with the RisingWave quickstart guide. Discuss event-driven architecture patterns with the team on Slack.
