Iceberg + Trino + RisingWave: End-to-End Data Platform Architecture

Combining RisingWave, Apache Iceberg, and Trino creates a complete open-source data platform: RisingWave continuously processes streams and writes fresh data to Iceberg, while Trino provides fast, interactive SQL queries over the same tables — with no data duplication and no proprietary lock-in.

Why This Trio Works

Each tool does one thing exceptionally well:

  • RisingWave handles streaming ingestion and continuous transformations. It speaks standard SQL and maintains materialized views that are always up to date.
  • Apache Iceberg provides the open, ACID-compliant table format that both RisingWave (for writes) and Trino (for reads) understand natively.
  • Trino delivers interactive, distributed SQL queries over Iceberg tables at petabyte scale — ideal for dashboards, ad-hoc exploration, and scheduled reports.

The integration point is Apache Iceberg: RisingWave writes to it, Trino reads from it, and the REST catalog coordinates metadata. No ETL copies, no intermediate staging tables, no format conversions.

Architecture Overview

[Kafka / PostgreSQL CDC]
         │
         ▼
   [RisingWave]
   • CREATE SOURCE
   • CREATE MATERIALIZED VIEW
   • CREATE SINK (iceberg)
         │
         ▼
[Iceberg REST Catalog] ←──── metadata
         │
         ▼
  [S3 / GCS / MinIO]  ←──── Parquet files
         │
         ▼
     [Trino]
   • Interactive queries
   • BI tool connectivity
   • Scheduled reports

Trino and RisingWave never communicate directly. They share data exclusively through Iceberg — a clean separation of concerns that lets you upgrade, scale, or replace either engine independently.

Component Comparison

| Component  | Role             | Query Latency   | Write Latency   | Best For                   |
|------------|------------------|-----------------|-----------------|----------------------------|
| RisingWave | Stream processor | Seconds         | Sub-second      | Continuous transforms, CDC |
| Trino      | Query engine     | Seconds–minutes | N/A (read-only) | Ad-hoc analytics, reports  |
| Iceberg    | Table format     | N/A (storage)   | N/A (storage)   | Open format, time travel   |
| Kafka      | Message bus      | Milliseconds    | Milliseconds    | Event streaming            |

Step 1: Build the Streaming Layer

Ingest events and apply continuous transformations in RisingWave:

-- Raw clickstream from Kafka
CREATE SOURCE clickstream (
    session_id   VARCHAR,
    user_id      BIGINT,
    page_path    VARCHAR,
    referrer     VARCHAR,
    device_type  VARCHAR,
    event_ts     TIMESTAMPTZ
)
WITH (
    connector        = 'kafka',
    topic            = 'web.clickstream',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

-- Session-level aggregation using HOP windows
CREATE MATERIALIZED VIEW session_summary AS
SELECT
    session_id,
    user_id,
    window_start                              AS session_start,
    window_end                                AS session_end,
    COUNT(*)                                  AS page_views,
    COUNT(DISTINCT page_path)                 AS unique_pages,
    MIN(page_path)                            AS entry_page,
    MAX(event_ts)                             AS last_event,
    MODE() WITHIN GROUP (ORDER BY device_type) AS primary_device
FROM HOP(clickstream, event_ts, INTERVAL '1 MINUTE', INTERVAL '30 MINUTES')
GROUP BY session_id, user_id, window_start, window_end;
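Once the view exists, you can sanity-check it from any Postgres-compatible client connected to RisingWave. A minimal spot-check (the specific columns match the view defined above):

```sql
-- In RisingWave: inspect the freshest sessions the view has materialized
SELECT session_id, page_views, unique_pages, last_event
FROM session_summary
ORDER BY last_event DESC
LIMIT 10;
```

Because the view is incrementally maintained, this query reads precomputed state rather than rescanning the Kafka topic.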

Step 2: Sink to Iceberg

Write the session summaries to an Iceberg table:

CREATE SINK session_summary_sink AS
SELECT * FROM session_summary
WITH (
    connector      = 'iceberg',
    type           = 'upsert',
    primary_key    = 'session_id,session_start',
    catalog.type   = 'rest',
    catalog.uri    = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-platform/warehouse',
    s3.region      = 'us-east-1',
    database.name  = 'web_analytics',
    table.name     = 'session_summary'
);

Also sink raw events for long-term retention (append-only for immutability):

CREATE SINK clickstream_archive_sink AS
SELECT session_id, user_id, page_path, referrer, device_type, event_ts
FROM clickstream
WITH (
    connector      = 'iceberg',
    type           = 'append-only',
    catalog.type   = 'rest',
    catalog.uri    = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-platform/warehouse',
    s3.region      = 'us-east-1',
    database.name  = 'web_analytics',
    table.name     = 'clickstream_events'
);
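Depending on your setup, you may want to pre-create the archive table from Trino rather than letting the sink create it, so you control the partition layout up front. A sketch, assuming day-level partitioning on the event timestamp (the partition spec is an illustrative choice, not a requirement):

```sql
-- In Trino: optionally pre-create the archive table with day partitioning
CREATE TABLE IF NOT EXISTS iceberg.web_analytics.clickstream_events (
    session_id   VARCHAR,
    user_id      BIGINT,
    page_path    VARCHAR,
    referrer     VARCHAR,
    device_type  VARCHAR,
    event_ts     TIMESTAMP(6) WITH TIME ZONE
)
WITH (partitioning = ARRAY['day(event_ts)']);
```

Partitioning by day keeps Trino's scans pruned to the date ranges your queries actually touch.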

Step 3: Query with Trino

Connect Trino to the same Iceberg REST catalog. In etc/catalog/iceberg.properties:

connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-catalog:8181
iceberg.file-format=PARQUET

Now Trino can query the same tables RisingWave is writing to:

-- In Trino: daily active users from the last 7 days
SELECT
    DATE(session_start) AS day,
    COUNT(DISTINCT user_id) AS dau,
    SUM(page_views) AS total_page_views,
    AVG(page_views) AS avg_pages_per_session
FROM iceberg.web_analytics.session_summary
WHERE session_start >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY DATE(session_start)
ORDER BY day DESC;

Trino sees data that RisingWave committed in the last checkpoint interval — typically 30–60 seconds ago. For dashboards, this is effectively real-time.

Trino + Iceberg Time Travel for Historical Analysis

Trino's Iceberg connector supports time travel queries:

-- Compare today's DAU to the same day last year
SELECT
    'current' AS period,
    COUNT(DISTINCT user_id) AS dau
FROM iceberg.web_analytics.session_summary
FOR TIMESTAMP AS OF CURRENT_TIMESTAMP
WHERE DATE(session_start) = CURRENT_DATE

UNION ALL

SELECT
    'prior_year' AS period,
    COUNT(DISTINCT user_id) AS dau
FROM iceberg.web_analytics.session_summary
FOR TIMESTAMP AS OF CURRENT_TIMESTAMP - INTERVAL '365' DAY
WHERE DATE(session_start) = CURRENT_DATE - INTERVAL '365' DAY;

This kind of comparison is impractical in most streaming architectures, where historical state has long since been overwritten. Iceberg's immutable snapshot chain makes year-over-year comparisons trivial.
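Every snapshot Trino can travel to is visible through Iceberg's metadata tables. For example:

```sql
-- In Trino: list committed snapshots for the table
SELECT snapshot_id, committed_at, operation
FROM iceberg.web_analytics."session_summary$snapshots"
ORDER BY committed_at DESC;
```

You can also pass a specific snapshot_id from this list to FOR VERSION AS OF instead of a timestamp.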

Operational Considerations

Compaction: RisingWave writes many small Parquet files (one per checkpoint per partition). Trino performs best with fewer, larger files. Run Iceberg's optimize command in Trino periodically:

-- In Trino: compact small files
ALTER TABLE iceberg.web_analytics.session_summary
EXECUTE optimize(file_size_threshold => '128MB');
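To gauge whether compaction is needed, the table's "$files" metadata view shows the current data files and their sizes:

```sql
-- In Trino: count data files below the 128 MB compaction threshold
SELECT
    COUNT(*)                  AS small_files,
    AVG(file_size_in_bytes)   AS avg_file_bytes
FROM iceberg.web_analytics."session_summary$files"
WHERE file_size_in_bytes < 128 * 1024 * 1024;
```

If small_files keeps climbing between optimize runs, shorten the compaction schedule or raise RisingWave's checkpoint interval.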

Snapshot expiration: Snapshots accumulate over time. Expire old snapshots to reclaim S3 storage:

-- In Trino: expire snapshots older than 30 days
ALTER TABLE iceberg.web_analytics.session_summary
EXECUTE expire_snapshots(retention_threshold => '30d');
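Expired snapshots can leave behind data files that no remaining snapshot references; Trino can clean those up as well:

```sql
-- In Trino: delete files no longer referenced by any snapshot
ALTER TABLE iceberg.web_analytics.session_summary
EXECUTE remove_orphan_files(retention_threshold => '7d');
```

Run this less aggressively than expire_snapshots: the retention threshold must be long enough that in-flight writes are never mistaken for orphans.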

Catalog HA: The Iceberg REST catalog is a critical shared component. Deploy it with at least 2 replicas behind a load balancer and back its metadata store with a managed database (RDS, Cloud SQL).

FAQ

Q: Can I use RisingWave to query Iceberg tables that Trino created? A: Yes. Starting in RisingWave v2.8, you can run lakehouse queries against Iceberg tables regardless of which engine created them, as long as the catalog is reachable.

Q: Does Trino see data in real time as RisingWave writes it? A: Trino sees committed Iceberg snapshots. New data appears in Trino after RisingWave commits a checkpoint (typically every 30–60 seconds). For sub-second latency, query RisingWave materialized views directly.

Q: What catalog should I use for a production deployment? A: For open-source setups, the Apache Iceberg REST catalog reference implementation or Project Nessie works well. For AWS, integrate with Glue Data Catalog via a REST adapter. For large multi-team setups, Tabular or Polaris offer enterprise features.

Q: Can I use Superset or Metabase with this stack? A: Yes. Both tools support Trino as a data source via their Trino/Presto connectors. Connect them to Trino, and they can query all Iceberg tables managed by RisingWave.

Q: How do I handle schema changes across both RisingWave and Trino? A: Iceberg's schema evolution is the coordination point. Evolve the schema in the catalog (via Trino's ALTER TABLE or the catalog API), then update your RisingWave materialized view and sink definition to match.
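As a concrete sketch, adding a column might look like this (the column name is illustrative):

```sql
-- In Trino: evolve the Iceberg schema first
ALTER TABLE iceberg.web_analytics.session_summary
ADD COLUMN country VARCHAR;

-- In RisingWave: then recreate the view and sink to populate it
-- (DROP SINK / DROP MATERIALIZED VIEW, then CREATE both again
--  with the new column included in the SELECT list)
```

Existing rows read as NULL for the new column until the updated sink starts writing it, which is Iceberg's standard schema-evolution behavior.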

Get Started

Build your open data platform with RisingWave, Iceberg, and Trino.
