A gaming data lakehouse built on RisingWave and Apache Iceberg gives you the best of two worlds. RisingWave processes live event streams with sub-second latency and continuously sinks results to Iceberg tables on object storage, so your ML engineers, data scientists, and BI analysts can query months of match history at the same time your operations team reads real-time dashboards. One unified architecture, no Lambda-layer complexity.
Why Gaming Needs a Lakehouse Architecture
Game studios accumulate enormous volumes of historical event data: years of match telemetry, billions of player interactions, millions of daily transactions. This data is invaluable for ML model training (LTV prediction, matchmaking ranking, churn models) and long-horizon design retrospectives.
But studios also need real-time metrics: active concurrent players, today's revenue, current server health. These two workloads — historical batch analytics and real-time streaming — have historically required completely separate stacks. The Lambda architecture (parallel batch and streaming paths merged at query time) addressed this but at enormous operational cost.
The modern lakehouse pattern solves this more elegantly: a streaming layer writes to an open table format (Iceberg) that both streaming and batch engines can query. RisingWave is the streaming layer; Iceberg is the storage layer.
Building the Streaming Ingest Layer
Start by ingesting the primary game event streams:
```sql
CREATE SOURCE match_events (
    match_id VARCHAR,
    player_id BIGINT,
    event_type VARCHAR,
    hero_id VARCHAR,
    map_id VARCHAR,
    value NUMERIC,
    region VARCHAR,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'match-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```
```sql
CREATE SOURCE revenue_events (
    player_id BIGINT,
    product_id VARCHAR,
    revenue_usd NUMERIC,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'revenue-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```
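For local testing, you can feed these sources with synthetic events before wiring up real game servers. A minimal sketch that builds JSON payloads matching the two source schemas above (the field values are invented for illustration; publishing with `kafka-python`'s `KafkaProducer` is one option, shown commented out):

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_match_event():
    """Build one event matching the match_events source schema."""
    return {
        "match_id": str(uuid.uuid4()),
        "player_id": random.randint(1, 10_000_000),
        "event_type": random.choice(["kill", "death", "assist", "objective"]),
        "hero_id": random.choice(["ranger", "mage", "tank"]),
        "map_id": random.choice(["dust_canyon", "frost_keep"]),
        "value": round(random.uniform(0, 100), 2),
        "region": random.choice(["na", "eu", "apac"]),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

def make_revenue_event():
    """Build one event matching the revenue_events source schema."""
    return {
        "player_id": random.randint(1, 10_000_000),
        "product_id": random.choice(["battle_pass", "skin_bundle", "coins_500"]),
        "revenue_usd": round(random.uniform(0.99, 99.99), 2),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

# To publish for real (requires kafka-python and a reachable broker):
# from kafka import KafkaProducer
# producer = KafkaProducer(
#     bootstrap_servers="kafka:9092",
#     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
# )
# producer.send("match-events", make_match_event())
```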
Real-Time Aggregation Before Lakehouse Write
Rather than writing raw events to Iceberg (which creates enormous small-file problems), pre-aggregate in RisingWave first:
```sql
CREATE MATERIALIZED VIEW hourly_match_stats AS
SELECT
    region,
    map_id,
    hero_id,
    window_start AS hour_start,
    window_end AS hour_end,
    COUNT(DISTINCT match_id) AS matches_played,
    COUNT(DISTINCT player_id) AS unique_players,
    COUNT(*) FILTER (WHERE event_type = 'kill') AS total_kills,
    AVG(value) AS avg_event_value,
    SUM(value) AS total_score
FROM TUMBLE(match_events, event_time, INTERVAL '1 hour')
GROUP BY region, map_id, hero_id, window_start, window_end;
```
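TUMBLE assigns each event to exactly one fixed-size, non-overlapping window. If the semantics are unfamiliar, here is the same bucketing expressed in plain Python (a sketch of the window math only, not how RisingWave computes it internally):

```python
from datetime import datetime, timedelta, timezone

def tumble_window(event_time: datetime, size: timedelta) -> tuple[datetime, datetime]:
    """Return the (window_start, window_end) tumbling window containing event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    elapsed = event_time - epoch
    # Floor-divide elapsed time by the window size to find the window index,
    # then scale back up to the window's start timestamp.
    start = epoch + timedelta(seconds=(elapsed // size) * size.total_seconds())
    return start, start + size
```

An event at 14:37:12 with a one-hour window lands in [14:00, 15:00); the GROUP BY on `window_start`/`window_end` then aggregates every event in that bucket together.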
```sql
CREATE MATERIALIZED VIEW daily_revenue_summary AS
SELECT
    window_start::DATE AS revenue_date,
    product_id,
    COUNT(DISTINCT player_id) AS paying_players,
    COUNT(*) AS transactions,
    SUM(revenue_usd) AS total_revenue,
    AVG(revenue_usd) AS avg_transaction_value,
    MAX(revenue_usd) AS max_transaction
FROM TUMBLE(revenue_events, event_time, INTERVAL '1 day')
GROUP BY window_start, product_id;
```
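To make the aggregate semantics concrete, here is the same daily rollup over an in-memory list of transactions. Note in particular that COUNT(DISTINCT player_id) counts a player once per day per product no matter how many purchases they make (a sketch for illustration; the function and field names mirror the MV above):

```python
from collections import defaultdict

def daily_revenue_summary(events):
    """Mirror the daily_revenue_summary MV over a list of event dicts.

    Each event needs: player_id, product_id, revenue_usd, and an ISO-8601
    event_time string. Returns {(revenue_date, product_id): summary_dict}.
    """
    groups = defaultdict(list)
    for e in events:
        # Take the date portion of the timestamp, like window_start::DATE.
        key = (e["event_time"][:10], e["product_id"])
        groups[key].append(e)
    out = {}
    for key, evts in groups.items():
        revenues = [e["revenue_usd"] for e in evts]
        out[key] = {
            "paying_players": len({e["player_id"] for e in evts}),  # distinct players
            "transactions": len(evts),
            "total_revenue": sum(revenues),
            "avg_transaction_value": sum(revenues) / len(revenues),
            "max_transaction": max(revenues),
        }
    return out
```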
Sinking to Apache Iceberg
Write the aggregated views directly to Iceberg tables in your object store:
```sql
CREATE SINK match_stats_to_iceberg
FROM hourly_match_stats
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'region,map_id,hero_id,hour_start',
    warehouse.path = 's3://game-lakehouse/warehouse/',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    catalog.credential = 'access-key:secret-key',
    database.name = 'game_analytics',
    table.name = 'hourly_match_stats'
);
```
```sql
CREATE SINK revenue_to_iceberg
FROM daily_revenue_summary
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'revenue_date,product_id',
    warehouse.path = 's3://game-lakehouse/warehouse/',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    catalog.credential = 'access-key:secret-key',
    database.name = 'game_analytics',
    table.name = 'daily_revenue_summary'
);
```

The Iceberg sink writes Parquet data files in the table's own schema, so no FORMAT ... ENCODE clause or schema registry is involved; those belong to Kafka sinks. Upsert mode with a primary key lets RisingWave refresh a window's row as late events arrive.
Every completed hourly window is committed as new data files in a fresh Iceberg snapshot, immediately queryable by Spark, Trino, or AWS Athena.
Lakehouse Architecture Comparison
| Layer | Lambda Architecture | Kappa Architecture | RisingWave + Iceberg Lakehouse |
|---|---|---|---|
| Real-time analytics | Streaming layer | Streaming layer | RisingWave materialized views |
| Historical analytics | Batch layer | Replay from stream | Iceberg tables via Spark/Trino |
| Consistency | Merge complexity | Single source | Iceberg ACID transactions |
| Operational complexity | Very high | High | Moderate (SQL-centric) |
| ML training data | Batch export | Stream replay | Direct Iceberg query |
| Time-travel queries | No | No | Yes (Iceberg snapshots) |
| Open format (no vendor lock-in) | Partial | Partial | Yes (Iceberg is open) |
Querying the Lakehouse from Spark
Data scientists query Iceberg tables directly from Spark without touching RisingWave:
```python
# PySpark example: reads the Iceberg table written by RisingWave.
# Assumes Spark is configured with an Iceberg catalog named "lakehouse"
# pointing at the same REST catalog and warehouse path.
spark.table("lakehouse.game_analytics.hourly_match_stats") \
    .filter("hero_id = 'ranger' AND hour_start >= '2026-01-01'") \
    .groupBy("map_id") \
    .agg({"avg_event_value": "avg", "unique_players": "sum"}) \
    .show()
```
Iceberg's time-travel feature lets ML engineers reproduce training datasets from any historical snapshot, which is critical for model reproducibility.
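Reproducing a training set means pinning a snapshot ID. Given snapshot metadata, for example (snapshot_id, committed_at) pairs as exposed by Iceberg's snapshots metadata table, picking the snapshot in effect at a training cutoff is a small piece of logic worth getting right (the function name here is illustrative, not an Iceberg API):

```python
from datetime import datetime

def snapshot_as_of(snapshots, cutoff):
    """Return the id of the latest snapshot committed at or before cutoff.

    snapshots: iterable of (snapshot_id, committed_at: datetime) pairs.
    Returns None if no snapshot predates the cutoff.
    """
    # Keep only snapshots committed by the cutoff, then take the most recent.
    eligible = [(ts, sid) for sid, ts in snapshots if ts <= cutoff]
    return max(eligible)[1] if eligible else None
```

The returned ID is what you would plug into the time-travel query above to regenerate the exact training dataset.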
FAQ
Q: Does writing to Iceberg from RisingWave introduce latency in the real-time dashboards? A: No. The Iceberg sink is asynchronous. RisingWave materialized views serve real-time queries directly; the sink writes completed window results to Iceberg in the background. The two serving paths are independent.
Q: How do we handle schema evolution (adding new event fields) without breaking the Iceberg tables? A: Iceberg supports additive schema evolution (adding nullable columns) natively. Update your RisingWave source definition and materialized view, then add the new column to the Iceberg table. Existing Iceberg snapshots remain readable with the old schema.
Q: Can we do time-travel queries on game events for debugging match replay or fraud investigation?
A: Yes. Iceberg maintains snapshot history. `SELECT * FROM game_analytics.hourly_match_stats FOR SYSTEM_VERSION AS OF <snapshot_id>` in Spark (Trino uses `FOR VERSION AS OF`) returns the table state at a previous snapshot — invaluable for match auditing or investigating reported cheating incidents.
Q: What catalog should we use for Iceberg in a cloud gaming environment? A: AWS Glue Catalog is the most common choice for AWS-hosted games. Nessie and Polaris (Apache Iceberg REST catalog) are open-source options. RisingWave's Iceberg sink supports REST catalog, Glue, and Hive Metastore configurations.
Q: How do we manage the RisingWave state store and Iceberg storage costs at scale?
A: RisingWave stores only incremental aggregation state (not raw events). The main storage cost is in S3 for Iceberg. Iceberg's table maintenance (compaction, snapshot expiry) controls S3 costs over time. Run Iceberg's `rewrite_data_files` procedure in Spark, or `ALTER TABLE ... EXECUTE optimize` in Trino, periodically.
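A quick back-of-the-envelope on why compaction matters: each sink commit adds data files, and file count (not just total bytes) drives S3 request costs and query planning time. A hedged estimator, where the commit cadence and files-per-commit figures are illustrative assumptions rather than RisingWave defaults:

```python
def files_written(windows_per_day: int, files_per_commit: int, days: int) -> int:
    """Data files accumulated if no compaction or snapshot expiry ever runs."""
    return windows_per_day * files_per_commit * days

# Hourly windows committed once each, 4 parallel writers -> 4 files per commit:
# after 90 days that is 24 * 4 * 90 = 8640 files for a single table.
```

Thousands of small files per table is exactly the regime where periodic `rewrite_data_files` pays for itself.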
Build a Future-Proof Gaming Data Platform
The gaming data lakehouse pattern with RisingWave and Iceberg eliminates the need to choose between real-time and historical analytics. You get both, on open formats, with SQL throughout.
- Start building your gaming lakehouse with RisingWave: https://docs.risingwave.com/get-started
- Discuss Iceberg integration and gaming data architecture: https://risingwave.com/slack

