Gaming Data Lakehouse with RisingWave and Apache Iceberg

A gaming data lakehouse built on RisingWave and Apache Iceberg gives you the best of both worlds. RisingWave processes live event streams with sub-second latency and continuously sinks results to Iceberg tables on object storage, so your ML engineers, data scientists, and BI analysts can query months of match history while your operations team reads real-time dashboards. One unified architecture, with no Lambda-layer complexity.

Why Gaming Needs a Lakehouse Architecture

Game studios accumulate enormous volumes of historical event data: years of match telemetry, billions of player interactions, millions of daily transactions. This data is invaluable for ML model training (LTV prediction, matchmaking ranking, churn models) and long-horizon design retrospectives.

But studios also need real-time metrics: active concurrent players, today's revenue, current server health. These two workloads — historical batch analytics and real-time streaming — have historically required completely separate stacks. The Lambda architecture (parallel batch and streaming paths merged at query time) addressed this but at enormous operational cost.

The modern lakehouse pattern solves this more elegantly: a streaming layer writes to an open table format (Iceberg) that both streaming and batch engines can query. RisingWave is the streaming layer; Iceberg is the storage layer.

Building the Streaming Ingest Layer

Start by ingesting the primary game event streams:

CREATE SOURCE match_events (
    match_id        VARCHAR,
    player_id       BIGINT,
    event_type      VARCHAR,
    hero_id         VARCHAR,
    map_id          VARCHAR,
    value           NUMERIC,
    region          VARCHAR,
    event_time      TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'match-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

CREATE SOURCE revenue_events (
    player_id       BIGINT,
    product_id      VARCHAR,
    revenue_usd     NUMERIC,
    event_time      TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'revenue-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
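For context, the JSON payloads these sources consume are flat objects whose keys match the column names, with TIMESTAMPTZ fields sent as ISO-8601 strings. A minimal Python sketch of a producer-side payload builder (the function name and sample values are hypothetical, not part of the pipeline):

```python
import json
from datetime import datetime, timezone

def make_match_event(match_id, player_id, event_type, hero_id, map_id, value, region):
    """Build one payload matching the match_events source schema."""
    return {
        "match_id": match_id,
        "player_id": player_id,
        "event_type": event_type,
        "hero_id": hero_id,
        "map_id": map_id,
        "value": value,
        "region": region,
        # TIMESTAMPTZ parses ISO-8601 strings with a UTC offset
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

event = make_match_event("m-001", 42, "kill", "ranger", "canyon", 1.0, "eu-west")
payload = json.dumps(event).encode("utf-8")  # bytes ready for a Kafka producer
```

Any producer library works; RisingWave only sees the resulting Kafka topic.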

Real-Time Aggregation Before Lakehouse Write

Rather than writing raw events to Iceberg (which creates enormous small-file problems), pre-aggregate in RisingWave first:

CREATE MATERIALIZED VIEW hourly_match_stats AS
SELECT
    region,
    map_id,
    hero_id,
    window_start               AS hour_start,
    window_end                 AS hour_end,
    COUNT(DISTINCT match_id)   AS matches_played,
    COUNT(DISTINCT player_id)  AS unique_players,
    COUNT(*) FILTER (WHERE event_type = 'kill') AS total_kills,
    AVG(value)                 AS avg_event_value,
    SUM(value)                 AS total_score
FROM TUMBLE(match_events, event_time, INTERVAL '1 hour')
GROUP BY region, map_id, hero_id, window_start, window_end;

CREATE MATERIALIZED VIEW daily_revenue_summary AS
SELECT
    window_start::DATE         AS revenue_date,
    product_id,
    COUNT(DISTINCT player_id)  AS paying_players,
    COUNT(*)                   AS transactions,
    SUM(revenue_usd)           AS total_revenue,
    AVG(revenue_usd)           AS avg_transaction_value,
    MAX(revenue_usd)           AS max_transaction
FROM TUMBLE(revenue_events, event_time, INTERVAL '1 day')
GROUP BY window_start, product_id;

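TUMBLE assigns each event to exactly one fixed-size, non-overlapping window based on its event time. A minimal Python sketch of the hourly bucketing, purely to illustrate the semantics (RisingWave computes this internally):

```python
from datetime import datetime, timedelta

def tumble_window(event_time: datetime, size: timedelta = timedelta(hours=1)):
    """Return (window_start, window_end) for the tumbling window containing event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=event_time.tzinfo)
    offset = (event_time - epoch) % size   # distance into the current window
    start = event_time - offset
    return start, start + size

start, end = tumble_window(datetime(2026, 1, 15, 14, 37, 22))
# an event at 14:37:22 lands in the [14:00, 15:00) window
```

Because each event belongs to exactly one window, the hourly aggregates partition cleanly and never double-count.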
Sinking to Apache Iceberg

Write the aggregated views directly to Iceberg tables in your object store:

CREATE SINK match_stats_to_iceberg
FROM hourly_match_stats
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'region,map_id,hero_id,hour_start',
    warehouse.path = 's3://game-lakehouse/warehouse/',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    catalog.credential = 'access-key:secret-key',
    database.name = 'game_analytics',
    table.name = 'hourly_match_stats'
);

CREATE SINK revenue_to_iceberg
FROM daily_revenue_summary
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'revenue_date,product_id',
    warehouse.path = 's3://game-lakehouse/warehouse/',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    catalog.credential = 'access-key:secret-key',
    database.name = 'game_analytics',
    table.name = 'daily_revenue_summary'
);

Note that the Iceberg sink needs no encode clause or schema registry: it writes columnar data files directly, and the table schema lives in Iceberg metadata. The upsert keys mirror each view's grouping columns so that window updates overwrite rather than duplicate.

Every completed hourly window is written as a new Iceberg partition, immediately queryable by Spark, Trino, or AWS Athena.

Lakehouse Architecture Comparison

| Layer | Lambda Architecture | Kappa Architecture | RisingWave + Iceberg Lakehouse |
| --- | --- | --- | --- |
| Real-time analytics | Streaming layer | Streaming layer | RisingWave materialized views |
| Historical analytics | Batch layer | Replay from stream | Iceberg tables via Spark/Trino |
| Consistency | Merge complexity | Single source | Iceberg ACID transactions |
| Operational complexity | Very high | High | Moderate (SQL-centric) |
| ML training data | Batch export | Stream replay | Direct Iceberg query |
| Time-travel queries | No | No | Yes (Iceberg snapshots) |
| Open format (no vendor lock-in) | Partial | Partial | Yes (Iceberg is open) |

Querying the Lakehouse from Spark

Data scientists query Iceberg tables directly from Spark without touching RisingWave:

# PySpark example: reads the Iceberg table written by RisingWave
# (assumes the REST catalog is registered as Spark's current catalog)
spark.table("game_analytics.hourly_match_stats") \
    .filter("hero_id = 'ranger' AND hour_start >= '2026-01-01'") \
    .groupBy("map_id") \
    .agg({"avg_event_value": "avg", "unique_players": "sum"}) \
    .show()

Iceberg's time-travel feature lets ML engineers reproduce training datasets from any historical snapshot, which is critical for model reproducibility.
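Conceptually, each Iceberg snapshot is an immutable pointer to the table state at one commit, and a timestamp-based time-travel query resolves to the latest snapshot at or before the requested time. A toy Python model of that lookup (the real resolution happens inside the engine's Iceberg library; names here are illustrative):

```python
def snapshot_as_of(snapshots, ts):
    """snapshots: list of (commit_ts, snapshot_id) in ascending commit order.
    Return the snapshot_id visible at time ts, or None if the table didn't exist yet."""
    visible = [sid for commit_ts, sid in snapshots if commit_ts <= ts]
    return visible[-1] if visible else None

history = [(100, "snap-a"), (200, "snap-b"), (300, "snap-c")]
snapshot_as_of(history, 250)  # → "snap-b"
```

Because snapshots are immutable, re-running training against the same snapshot always yields the same dataset.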

FAQ

Q: Does writing to Iceberg from RisingWave introduce latency in the real-time dashboards? A: No. The Iceberg sink is asynchronous. RisingWave materialized views serve real-time queries directly; the sink writes completed window results to Iceberg in the background. The two serving paths are independent.

Q: How do we handle schema evolution (adding new event fields) without breaking the Iceberg tables? A: Iceberg supports additive schema evolution (adding nullable columns) natively. Update your RisingWave source definition and materialized view, then add the new column to the Iceberg table. Existing Iceberg snapshots remain readable with the old schema.
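The mechanics behind that answer: older data files simply lack the new column, and readers project them onto the current schema, filling the missing column with NULL. A toy Python model of that projection (illustrative only, not Iceberg's actual reader code):

```python
def read_with_schema(record, schema):
    """Project a stored record onto the current schema; columns added after
    the record was written read as None (NULL)."""
    return {col: record.get(col) for col in schema}

old_record = {"player_id": 1, "revenue_usd": 4.99}
new_schema = ["player_id", "revenue_usd", "currency"]  # 'currency' added later
read_with_schema(old_record, new_schema)
# → {'player_id': 1, 'revenue_usd': 4.99, 'currency': None}
```

This is why only nullable (or defaulted) columns can be added without rewriting history.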

Q: Can we do time-travel queries on game events for debugging match replay or fraud investigation? A: Yes. Iceberg maintains snapshot history. In Spark SQL, SELECT * FROM game_analytics.hourly_match_stats FOR SYSTEM_VERSION AS OF <snapshot_id> returns the table state at a previous snapshot (Trino uses FOR VERSION AS OF), which is invaluable for match auditing or investigating reported cheating incidents.

Q: What catalog should we use for Iceberg in a cloud gaming environment? A: AWS Glue Catalog is the most common choice for AWS-hosted games. Nessie and Apache Polaris are open-source alternatives that speak the Iceberg REST catalog protocol. RisingWave's Iceberg sink supports REST catalog, Glue, and Hive Metastore configurations.

Q: How do we manage the RisingWave state store and Iceberg storage costs at scale? A: RisingWave stores only incremental aggregation state (not raw events). The main storage cost is in S3 for Iceberg. Iceberg's table maintenance (compaction, snapshot expiry) controls S3 costs over time. Run Spark's rewrite_data_files procedure or Trino's OPTIMIZE periodically.
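To make the small-file problem concrete, compaction is essentially a bin-packing pass: undersized data files are grouped into rewrite groups that approach a target file size. A toy Python sketch of such a planner (illustrative only; the real rewrite_data_files procedure also accounts for partitions, delete files, and sort order):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into rewrite groups of roughly target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if size >= target_mb:        # already large enough; leave as-is
            continue
        current.append(size)
        current_size += size
        if current_size >= target_mb:
            groups.append(current)
            current, current_size = [], 0
    if len(current) > 1:             # a lone leftover file isn't worth rewriting
        groups.append(current)
    return groups

plan_compaction([5, 8, 600, 20, 480, 30])  # → [[5, 8, 20, 30, 480]]
```

Pre-aggregating in RisingWave before sinking (as above) keeps the file count low in the first place, so compaction runs stay cheap.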

Build a Future-Proof Gaming Data Platform

The gaming data lakehouse pattern with RisingWave and Iceberg eliminates the need to choose between real-time and historical analytics. You get both, on open formats, with SQL throughout.
