Apache Iceberg combined with RisingWave provides the ideal foundation for IoT data lakehouses. RisingWave ingests high-frequency sensor streams via Kafka, applies real-time transformations using materialized views, and sinks enriched data into Iceberg tables on S3—delivering both sub-second freshness and petabyte-scale analytics without a traditional data warehouse.
Why IoT Data Demands a Lakehouse Architecture
IoT deployments generate relentless, high-cardinality data. A single industrial plant can produce millions of sensor readings per minute. Traditional architectures force a painful choice: low-latency pipelines that can't scale, or data warehouses that can't keep up with ingestion volume.
The lakehouse pattern resolves this tension. Open table formats like Apache Iceberg sit on cheap object storage (S3, GCS), handle schema evolution gracefully, and expose ACID transactions. Add RisingWave as the stream processor and you get a pipeline that is fresh, cheap, and queryable.
Architecture Overview
A production IoT lakehouse with RisingWave and Iceberg has four layers:
- Ingestion — MQTT brokers or edge gateways publish to Kafka topics
- Stream Processing — RisingWave reads Kafka, joins reference data, aggregates in real time
- Storage — Iceberg tables on S3 store raw and aggregated facts
- Analytics — Query engines (Trino, Spark, or RisingWave itself in v2.8) read Iceberg directly
Setting Up the IoT Pipeline
Step 1: Ingest Raw Sensor Data from Kafka
CREATE SOURCE iot_sensor_events (
device_id VARCHAR,
location_id VARCHAR,
metric VARCHAR,
value DOUBLE PRECISION,
unit VARCHAR,
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'iot.sensor.events',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
Step 2: Create a Reference Table for Device Metadata
CREATE TABLE device_registry (
device_id VARCHAR PRIMARY KEY,
device_type VARCHAR,
plant_id VARCHAR,
threshold DOUBLE PRECISION
);
Step 3: Build a Materialized View for Anomaly Detection
RisingWave's TUMBLE window function groups readings into fixed intervals. Joining against device_registry enriches each window with metadata and enables threshold-based alerting inline.
CREATE MATERIALIZED VIEW iot_window_metrics AS
SELECT
w.device_id,
d.device_type,
d.plant_id,
w.metric,
w.window_start,
w.window_end,
AVG(w.value) AS avg_value,
MAX(w.value) AS max_value,
MIN(w.value) AS min_value,
COUNT(*) AS reading_count,
MAX(w.value) > MAX(d.threshold) AS threshold_exceeded -- one threshold per device, so MAX(d.threshold) is just that threshold
FROM TUMBLE(iot_sensor_events, event_time, INTERVAL '1 MINUTE') AS w
JOIN device_registry AS d
ON w.device_id = d.device_id
GROUP BY
w.device_id, d.device_type, d.plant_id,
w.metric, w.window_start, w.window_end;
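Materialized views are queryable like ordinary tables, so the alert flag can be checked interactively. A hypothetical spot-check:

```sql
-- Devices that breached their threshold in the most recent windows
SELECT device_id, plant_id, metric, window_start, max_value
FROM iot_window_metrics
WHERE threshold_exceeded
ORDER BY window_start DESC
LIMIT 20;
```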
Step 4: Sink Aggregated Metrics to Iceberg
Because iot_window_metrics is an aggregation, it emits updates as windows refine; an append-only Iceberg sink therefore needs force_append_only to convert the changelog into plain inserts.
CREATE SINK iot_metrics_iceberg AS
SELECT * FROM iot_window_metrics
WITH (
connector = 'iceberg',
type = 'append-only',
force_append_only = 'true',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://my-iot-lakehouse/warehouse',
s3.region = 'us-east-1',
database.name = 'iot',
table.name = 'window_metrics'
);
Comparing IoT Storage Approaches
| Approach | Latency | Cost at Scale | Schema Evolution | Query Flexibility |
| --- | --- | --- | --- | --- |
| Traditional Data Warehouse | Minutes | High (compute always on) | Difficult | SQL only |
| Raw S3 / Parquet | Hours (batch) | Low | Manual | Requires catalog |
| Kafka + KSQL | Seconds | Medium | Limited | Kafka only |
| Iceberg + RisingWave | Seconds | Low (S3 pricing) | Built-in | Any engine |
| Time-Series DB | Sub-second | High at scale | None | Limited |
Handling Late-Arriving IoT Data
IoT devices lose connectivity, so data arrives late. RisingWave's watermark mechanism bounds how long a computation waits for stragglers, while Iceberg's hidden partitioning keeps late rows queryable wherever they land. Declare a watermark on the source to state how much delay you tolerate; downstream aggregations such as this hourly rollup then account for it automatically:
CREATE MATERIALIZED VIEW iot_late_data_view AS
SELECT
device_id,
DATE_TRUNC('hour', event_time) AS hour_bucket,
metric,
AVG(value) AS avg_value,
COUNT(*) AS sample_count
FROM iot_sensor_events
GROUP BY device_id, DATE_TRUNC('hour', event_time), metric;
When late data lands in an earlier partition, Iceberg simply appends new data files to that partition and commits a fresh snapshot, keeping downstream queries consistent without rewriting existing files.
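The watermark itself is declared on the source, not the view. A sketch of a source definition with a 5-minute tolerance (the interval and source name are assumptions to adapt):

```sql
CREATE SOURCE iot_sensor_events_wm (
    device_id VARCHAR,
    metric VARCHAR,
    value DOUBLE PRECISION,
    event_time TIMESTAMPTZ,
    -- Events up to 5 minutes late are still assigned to their window
    WATERMARK FOR event_time AS event_time - INTERVAL '5 MINUTES'
)
WITH (
    connector = 'kafka',
    topic = 'iot.sensor.events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
```

Events arriving later than the watermark tolerance may be dropped from windowed aggregations, so size the interval to your fleet's worst expected reconnection gap.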
Querying Iceberg Tables Directly in RisingWave v2.8
Starting in v2.8, RisingWave can query Iceberg tables directly—no Spark or Trino required for ad hoc analytics:
-- Query historical Iceberg data directly from RisingWave
SELECT
plant_id,
metric,
AVG(avg_value) AS daily_avg,
MAX(max_value) AS daily_peak
FROM iceberg_scan(
's3://my-iot-lakehouse/warehouse',
'iot',
'window_metrics'
)
WHERE window_start >= NOW() - INTERVAL '7 days'
GROUP BY plant_id, metric
ORDER BY daily_peak DESC;
This enables joining live streaming data with historical Iceberg data in a single query.
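Combining the two is a matter of a single statement. A hedged sketch, reusing the iceberg_scan form from the query above, that stitches recent live windows onto historical ones:

```sql
-- Live, fresh windows from the streaming materialized view
SELECT plant_id, metric, avg_value, window_start
FROM iot_window_metrics
UNION ALL
-- Older windows read back from Iceberg on S3
SELECT plant_id, metric, avg_value, window_start
FROM iceberg_scan('s3://my-iot-lakehouse/warehouse', 'iot', 'window_metrics')
WHERE window_start < NOW() - INTERVAL '1 hour';
```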
Production Considerations
Partitioning strategy: Partition Iceberg tables by plant_id and DATE(window_start). This aligns with the most common query patterns (per-plant daily reports) and minimizes scan costs.
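Iceberg partition specs are table properties, typically set once via an engine with Iceberg SQL extensions. A sketch in Spark SQL (catalog and table names follow the examples above):

```sql
-- Hidden partitioning: per-plant, per-day, without exposing partition columns to queries
ALTER TABLE iot.window_metrics ADD PARTITION FIELD plant_id;
ALTER TABLE iot.window_metrics ADD PARTITION FIELD days(window_start);
```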
File compaction: Configure Iceberg's compaction service (or use Spark's rewriteDataFiles) to merge small files produced by streaming writes. RisingWave's commit intervals default to 60 seconds; tune commit_checkpoint_interval based on your latency/file-count tradeoff.
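With Spark attached to the same catalog, compaction is a built-in Iceberg procedure. A sketch (the catalog name and target file size are assumptions):

```sql
-- Merge small streaming-written files into ~128 MB files
CALL my_catalog.system.rewrite_data_files(
  table => 'iot.window_metrics',
  options => map('target-file-size-bytes', '134217728')
);
```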
Schema evolution: Add new sensor metrics without pipeline restarts. Iceberg's column addition is a metadata-only operation—existing data reads correctly with null values for the new column.
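For example, adding a new metric column is a one-line, metadata-only change in Spark SQL (the column name here is hypothetical):

```sql
-- Existing rows read back as NULL for the new column; no data files are rewritten
ALTER TABLE iot.window_metrics ADD COLUMN humidity DOUBLE;
```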
FAQ
Q: How many Kafka partitions should I use for high-volume IoT?
A: Match Kafka partitions to RisingWave parallelism. For 1M events/second, 32–64 partitions is typical. RisingWave scales compute nodes independently of storage.
Q: Does RisingWave support exactly-once delivery to Iceberg?
A: Yes. RisingWave uses a two-phase commit protocol with its checkpoint mechanism, ensuring exactly-once semantics when writing to Iceberg sinks.
Q: Can I mix upsert and append-only sinks for different IoT tables?
A: Yes. Use type = 'append-only' for raw event tables and type = 'upsert' for aggregated tables with primary keys (e.g., latest device state).
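A sketch of the upsert variant for a latest-device-state table (sink name, query, and the primary_key parameter are assumptions to adapt to your schema):

```sql
CREATE SINK device_state_iceberg AS
SELECT device_id, MAX(event_time) AS last_seen
FROM iot_sensor_events
GROUP BY device_id
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'device_id',   -- upsert sinks need a key to merge on
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-iot-lakehouse/warehouse',
    s3.region = 'us-east-1',
    database.name = 'iot',
    table.name = 'device_state'
);
```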
Q: How do I handle device schema changes (new sensor types)?
A: Use Iceberg schema evolution to add columns on the storage side. In RisingWave, add the column to the source (ALTER SOURCE ... ADD COLUMN), then recreate the affected materialized views and sinks with the updated column list.
Q: What is the minimum viable infrastructure for this architecture?
A: A single Kafka broker, one RisingWave node, an Iceberg REST catalog (e.g., Polaris or Nessie), and an S3-compatible bucket. You can run this on a single cloud VM for development.
Get Started
Ready to build your IoT lakehouse? Follow the RisingWave quickstart guide to spin up a cluster in minutes. Join the community on Slack to share your architecture and get help from the RisingWave team.