Apache Iceberg combined with RisingWave provides the ideal foundation for IoT data lakehouses. RisingWave ingests high-frequency sensor streams via Kafka, applies real-time transformations using materialized views, and sinks enriched data into Iceberg tables on S3—delivering both sub-second freshness and petabyte-scale analytics without a traditional data warehouse.
Why IoT Data Demands a Lakehouse Architecture
IoT deployments generate relentless, high-cardinality data. A single industrial plant can produce millions of sensor readings per minute. Traditional architectures force a painful choice: low-latency pipelines that can't scale, or data warehouses that can't keep up with ingestion volume.
The lakehouse pattern resolves this tension. Open table formats like Apache Iceberg sit on cheap object storage (S3, GCS), handle schema evolution gracefully, and expose ACID transactions. Add RisingWave as the stream processor and you get a pipeline that is fresh, cheap, and queryable.
Architecture Overview
A production IoT lakehouse with RisingWave and Iceberg has four layers:
- Ingestion — MQTT brokers or edge gateways publish to Kafka topics
- Stream Processing — RisingWave reads Kafka, joins reference data, aggregates in real time
- Storage — Iceberg tables on S3 store raw and aggregated facts
- Analytics — Query engines (Trino, Spark, or RisingWave itself in v2.8) read Iceberg directly
Setting Up the IoT Pipeline
Step 1: Ingest Raw Sensor Data from Kafka
CREATE SOURCE iot_sensor_events (
device_id VARCHAR,
location_id VARCHAR,
metric VARCHAR,
value DOUBLE PRECISION,
unit VARCHAR,
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'iot.sensor.events',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
Step 2: Create a Reference Table for Device Metadata
CREATE TABLE device_registry (
device_id VARCHAR PRIMARY KEY,
device_type VARCHAR,
plant_id VARCHAR,
threshold DOUBLE PRECISION
);
Step 3: Build a Materialized View for Anomaly Detection
RisingWave's TUMBLE window function groups readings into fixed intervals. Joining against device_registry enriches each window with metadata and enables threshold-based alerting inline.
CREATE MATERIALIZED VIEW iot_window_metrics AS
SELECT
w.device_id,
d.device_type,
d.plant_id,
w.metric,
w.window_start,
w.window_end,
AVG(w.value) AS avg_value,
MAX(w.value) AS max_value,
MIN(w.value) AS min_value,
COUNT(*) AS reading_count,
MAX(w.value) > MAX(d.threshold) AS threshold_exceeded -- one threshold per device, so MAX(d.threshold) is just that threshold
FROM TUMBLE(iot_sensor_events, event_time, INTERVAL '1 MINUTE') AS w
JOIN device_registry AS d
ON w.device_id = d.device_id
GROUP BY
w.device_id, d.device_type, d.plant_id,
w.metric, w.window_start, w.window_end;
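Materialized views are queryable like ordinary tables, so the alert flag can be checked interactively. A hypothetical spot-check:

```sql
-- Devices that breached their threshold in the most recent windows
SELECT device_id, plant_id, metric, window_start, max_value
FROM iot_window_metrics
WHERE threshold_exceeded
ORDER BY window_start DESC
LIMIT 20;
```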
Step 4: Sink Aggregated Metrics to Iceberg
Because iot_window_metrics is an aggregation, it emits updates as windows refine; an append-only Iceberg sink therefore needs force_append_only to convert the changelog into plain inserts.
CREATE SINK iot_metrics_iceberg AS
SELECT * FROM iot_window_metrics
WITH (
connector = 'iceberg',
type = 'append-only',
force_append_only = 'true',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://my-iot-lakehouse/warehouse',
s3.region = 'us-east-1',
database.name = 'iot',
table.name = 'window_metrics'
);
Comparing IoT Storage Approaches
| Approach | Latency | Cost at Scale | Schema Evolution | Query Flexibility |
| --- | --- | --- | --- | --- |
| Traditional Data Warehouse | Minutes | High (compute always on) | Difficult | SQL only |
| Raw S3 / Parquet | Hours (batch) | Low | Manual | Requires catalog |
| Kafka + KSQL | Seconds | Medium | Limited | Kafka only |
| Iceberg + RisingWave | Seconds | Low (S3 pricing) | Built-in | Any engine |
| Time-Series DB | Sub-second | High at scale | None | Limited |
Handling Late-Arriving IoT Data
IoT devices lose connectivity, so data arrives late. RisingWave's watermark mechanism bounds how long a computation waits for stragglers, while Iceberg's hidden partitioning keeps late rows queryable wherever they land. Declare a watermark on the source to state how much delay you tolerate; downstream aggregations such as this hourly rollup then account for it automatically:
CREATE MATERIALIZED VIEW iot_late_data_view AS
SELECT
device_id,
DATE_TRUNC('hour', event_time) AS hour_bucket,
metric,
AVG(value) AS avg_value,
COUNT(*) AS sample_count
FROM iot_sensor_events
GROUP BY device_id, DATE_TRUNC('hour', event_time), metric;
When late data lands in an earlier partition, Iceberg simply appends new data files to that partition and commits a fresh snapshot, keeping downstream queries consistent without rewriting existing files.
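The watermark itself is declared on the source, not the view. A sketch of a source definition with a 5-minute tolerance (the interval and source name are assumptions to adapt):

```sql
CREATE SOURCE iot_sensor_events_wm (
    device_id VARCHAR,
    metric VARCHAR,
    value DOUBLE PRECISION,
    event_time TIMESTAMPTZ,
    -- Events up to 5 minutes late are still assigned to their window
    WATERMARK FOR event_time AS event_time - INTERVAL '5 MINUTES'
)
WITH (
    connector = 'kafka',
    topic = 'iot.sensor.events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
```

Events arriving later than the watermark tolerance may be dropped from windowed aggregations, so size the interval to your fleet's worst expected reconnection gap.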
Querying Iceberg Tables Directly in RisingWave v2.8
Starting in v2.8, RisingWave can query Iceberg tables directly—no Spark or Trino required for ad hoc analytics:
-- Query historical Iceberg data directly from RisingWave
SELECT
plant_id,
metric,
AVG(avg_value) AS daily_avg,
MAX(max_value) AS daily_peak
FROM iceberg_scan(
's3://my-iot-lakehouse/warehouse',
'iot',
'window_metrics'
)
WHERE window_start >= NOW() - INTERVAL '7 days'
GROUP BY plant_id, metric
ORDER BY daily_peak DESC;
This enables joining live streaming data with historical Iceberg data in a single query.
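Combining the two is a matter of a single statement. A hedged sketch, reusing the iceberg_scan form from the query above, that stitches recent live windows onto historical ones:

```sql
-- Live, fresh windows from the streaming materialized view
SELECT plant_id, metric, avg_value, window_start
FROM iot_window_metrics
UNION ALL
-- Older windows read back from Iceberg on S3
SELECT plant_id, metric, avg_value, window_start
FROM iceberg_scan('s3://my-iot-lakehouse/warehouse', 'iot', 'window_metrics')
WHERE window_start < NOW() - INTERVAL '1 hour';
```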
Production Considerations
Partitioning strategy: Partition Iceberg tables by plant_id and DATE(window_start). This aligns with the most common query patterns (per-plant daily reports) and minimizes scan costs.
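Iceberg partition specs are table properties, typically set once via an engine with Iceberg SQL extensions. A sketch in Spark SQL (catalog and table names follow the examples above):

```sql
-- Hidden partitioning: per-plant, per-day, without exposing partition columns to queries
ALTER TABLE iot.window_metrics ADD PARTITION FIELD plant_id;
ALTER TABLE iot.window_metrics ADD PARTITION FIELD days(window_start);
```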
File compaction: Configure Iceberg's compaction service (or use Spark's rewriteDataFiles) to merge small files produced by streaming writes. RisingWave's commit intervals default to 60 seconds; tune commit_checkpoint_interval based on your latency/file-count tradeoff.
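With Spark attached to the same catalog, compaction is a built-in Iceberg procedure. A sketch (the catalog name and target file size are assumptions):

```sql
-- Merge small streaming-written files into ~128 MB files
CALL my_catalog.system.rewrite_data_files(
  table => 'iot.window_metrics',
  options => map('target-file-size-bytes', '134217728')
);
```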
Schema evolution: Add new sensor metrics without pipeline restarts. Iceberg's column addition is a metadata-only operation—existing data reads correctly with null values for the new column.
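For example, adding a new metric column is a one-line, metadata-only change in Spark SQL (the column name here is hypothetical):

```sql
-- Existing rows read back as NULL for the new column; no data files are rewritten
ALTER TABLE iot.window_metrics ADD COLUMN humidity DOUBLE;
```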
FAQ
Q: How many Kafka partitions should I use for high-volume IoT?
A: Match Kafka partitions to RisingWave parallelism. For 1M events/second, 32–64 partitions is typical. RisingWave scales compute nodes independently of storage.
Q: Does RisingWave support exactly-once delivery to Iceberg?
A: Yes. RisingWave uses a two-phase commit protocol with its checkpoint mechanism, ensuring exactly-once semantics when writing to Iceberg sinks.
Q: Can I mix upsert and append-only sinks for different IoT tables?
A: Yes. Use type = 'append-only' for raw event tables and type = 'upsert' for aggregated tables with primary keys (e.g., latest device state).
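A sketch of the upsert variant for a latest-device-state table (sink name, query, and the primary_key parameter are assumptions to adapt to your schema):

```sql
CREATE SINK device_state_iceberg AS
SELECT device_id, MAX(event_time) AS last_seen
FROM iot_sensor_events
GROUP BY device_id
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'device_id',   -- upsert sinks need a key to merge on
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-iot-lakehouse/warehouse',
    s3.region = 'us-east-1',
    database.name = 'iot',
    table.name = 'device_state'
);
```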
Q: How do I handle device schema changes (new sensor types)?
A: Use Iceberg schema evolution to add columns on the storage side. In RisingWave, add the column to the source (ALTER SOURCE ... ADD COLUMN), then recreate the affected materialized views and sinks with the updated column list.
Q: What is the minimum viable infrastructure for this architecture?
A: A single Kafka broker, one RisingWave node, an Iceberg REST catalog (e.g., Polaris or Nessie), and an S3-compatible bucket. You can run this on a single cloud VM for development.
Get Started
Ready to build your IoT lakehouse? Follow the RisingWave quickstart guide to spin up a cluster in minutes. Join the community on Slack to share your architecture and get help from the RisingWave team.