Real-Time Game Telemetry Processing with RisingWave

Real-time game telemetry processing with RisingWave means ingesting millions of client-reported performance events — frame rates, load times, crash signals, network latency — and continuously aggregating them into materialized views that surface regressions, device-specific issues, and geographic bottlenecks within seconds of their occurrence.

Telemetry Is Only Valuable When It's Fresh

Game telemetry answers the most important operational question in live-service gaming: is the game actually working well for players right now? Frame rate drops, loading screen freezes, network desync events, and memory crashes are invisible from the server side unless clients are sending telemetry and that telemetry is being processed continuously.

The industry standard has been to batch-process telemetry overnight in a data warehouse. This catches long-term trends but misses the acute incidents that matter most: a patch that tanks GPU performance on a popular device, a server region routing change that degrades latency for 20% of players, or a memory leak that starts crashing clients 45 minutes into a session.

With RisingWave, telemetry flows from Kafka into materialized views that are always current. Your on-call engineer's dashboard shows the real frame rate distribution as of 10 seconds ago — not yesterday.

Setting Up the Telemetry Source

Client telemetry should be batched on-device and flushed every 30 seconds to an ingestion endpoint, which produces the events to a Kafka topic. Define the source in RisingWave:

CREATE SOURCE game_telemetry (
    client_id       VARCHAR,
    player_id       BIGINT,
    session_id      VARCHAR,
    platform        VARCHAR,
    device_model    VARCHAR,
    os_version      VARCHAR,
    game_version    VARCHAR,
    region          VARCHAR,
    metric_type     VARCHAR,
    metric_value    FLOAT,
    level_id        VARCHAR,
    recorded_at     TIMESTAMPTZ
)
WITH (
    connector     = 'kafka',
    topic         = 'game.telemetry.client',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
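On the client side, each batched event must serialize to JSON matching the source schema above. A minimal sketch of the payload construction, assuming a hypothetical `build_event` helper and illustrative field defaults (the commented-out producer wiring uses confluent-kafka and the topic name from the source definition):

```python
import json
import time
import uuid

# Build one telemetry event matching the game_telemetry source schema.
# Field defaults here are illustrative, not part of any real SDK.
def build_event(client_id, player_id, session_id, metric_type, metric_value,
                platform="ios", device_model="iPhone15,2", os_version="17.4",
                game_version="2.8.1", region="us-east", level_id="level_12"):
    return {
        "client_id": client_id,
        "player_id": player_id,
        "session_id": session_id,
        "platform": platform,
        "device_model": device_model,
        "os_version": os_version,
        "game_version": game_version,
        "region": region,
        "metric_type": metric_type,        # e.g. 'fps', 'load_ms', 'crash'
        "metric_value": float(metric_value),
        "level_id": level_id,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%S+00:00", time.gmtime()),
    }

event = build_event(str(uuid.uuid4()), 42, "sess-1", "fps", 58.5)
payload = json.dumps(event).encode("utf-8")

# Producer wiring sketch (requires a running broker):
# from confluent_kafka import Producer
# Producer({"bootstrap.servers": "kafka:9092"}).produce("game.telemetry.client", payload)
```

The string-encoded timestamp parses into the `TIMESTAMPTZ` column on ingest.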

Live Performance Percentiles

Raw averages hide the tail latency that ruins player experience. Use window aggregations to compute percentile distributions:

CREATE MATERIALIZED VIEW telemetry_performance AS
SELECT
    window_start,
    window_end,
    game_version,
    platform,
    region,
    metric_type,
    COUNT(*)                                                AS sample_count,
    AVG(metric_value)                                       AS avg_value,
    MIN(metric_value)                                       AS min_value,
    MAX(metric_value)                                       AS max_value,
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY metric_value) AS p50,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY metric_value) AS p90,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY metric_value) AS p99
FROM TUMBLE(game_telemetry, recorded_at, INTERVAL '1 minute')
GROUP BY window_start, window_end, game_version, platform, region, metric_type;

For frame-rate telemetry (metric_type = 'fps'), a p99 below 30 on a high-end device after a patch release is a regression signal. For load times (metric_type = 'load_ms'), a rising p90 indicates a specific level or asset pack is causing problems.
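A dashboard or alerting job can probe the view directly for that signal; the 30-fps threshold and one-hour lookback below are illustrative values, not fixed recommendations:

```sql
-- Illustrative probe: recent windows where tail frame rate fell below 30 fps.
SELECT window_start, game_version, platform, region, p99
FROM telemetry_performance
WHERE metric_type = 'fps'
  AND p99 < 30
  AND window_start > NOW() - INTERVAL '1 hour'
ORDER BY window_start DESC;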

Detecting Crash Clusters

Crash events in telemetry often cluster around specific game versions, levels, or device models. Build a materialized view that surfaces crash hotspots:

CREATE MATERIALIZED VIEW crash_cluster_detection AS
SELECT
    window_start,
    window_end,
    game_version,
    platform,
    device_model,
    level_id,
    COUNT(*) FILTER (WHERE metric_type = 'crash')          AS crash_count,
    COUNT(DISTINCT player_id) FILTER (WHERE metric_type = 'crash') AS affected_players,
    COUNT(DISTINCT session_id)                              AS total_sessions,
    ROUND(
        COUNT(*) FILTER (WHERE metric_type = 'crash')::DECIMAL /
        NULLIF(COUNT(DISTINCT session_id), 0) * 100, 2
    )                                                      AS crash_rate_pct
FROM HOP(game_telemetry, recorded_at, INTERVAL '5 minutes', INTERVAL '30 minutes')
GROUP BY window_start, window_end, game_version, platform, device_model, level_id
HAVING COUNT(*) FILTER (WHERE metric_type = 'crash') >= 5;

When crash_rate_pct exceeds 1% for a specific (game_version, device_model, level_id) combination, your alerting pipeline can see it within minutes of the first crashes appearing — the view updates incrementally as events arrive, not the next morning.
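An alerting job can poll the view for hotspots crossing that 1% threshold; the query shape below is illustrative:

```sql
-- Illustrative alert feed: crash hotspots above the 1% crash-rate threshold.
SELECT window_start, game_version, device_model, level_id,
       crash_count, affected_players, crash_rate_pct
FROM crash_cluster_detection
WHERE crash_rate_pct > 1.0
ORDER BY crash_rate_pct DESC;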

Sinking Telemetry Aggregates to the Data Warehouse

Real-time monitoring handles incident response. Long-term trend analysis requires archival. Sink telemetry aggregates to both a live dashboard and a data warehouse:

CREATE SINK telemetry_to_iceberg
FROM telemetry_performance
WITH (
    connector = 'iceberg',
    type = 'append-only',
    -- The aggregate view emits updates, so appends must be forced;
    -- S3 credentials (access key, secret, region) are omitted here.
    force_append_only = 'true',
    catalog.type = 'storage',
    warehouse.path = 's3://game-data-lake/telemetry',
    database.name = 'game_telemetry',
    table.name = 'performance_minutely'
);

The Iceberg sink writes aggregated telemetry to object storage in a format queryable by Spark, Trino, or Athena for historical trend analysis, while RisingWave continues serving the live dashboard.
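A historical query over the archived table might look like the following (Trino syntax; the table name follows the sink definition above, and the 30-day lookback is illustrative):

```sql
-- Hypothetical Trino query over the archived minutely aggregates.
SELECT game_version, approx_percentile(p99, 0.5) AS median_p99
FROM game_telemetry.performance_minutely
WHERE metric_type = 'load_ms'
  AND window_start > current_timestamp - INTERVAL '30' DAY
GROUP BY game_version;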

Comparison: Telemetry Processing Architectures

| Approach                      | Freshness  | Percentile Support | Crash Detection Speed | Storage Cost |
|-------------------------------|------------|--------------------|-----------------------|--------------|
| Nightly warehouse ETL         | 24 hours   | Yes (offline)      | Next day              | Low          |
| APM tools (Datadog, Instana)  | 1 minute   | Approximate        | Minutes               | High         |
| Custom Flink pipeline         | Seconds    | Custom             | Seconds               | Medium       |
| RisingWave streaming SQL      | Sub-second | Exact              | Sub-minute            | Low          |

Version Regression Detection

Compare current game version telemetry against the previous version to automatically detect regressions:

CREATE MATERIALIZED VIEW version_regression_check AS
SELECT
    curr.window_start,
    curr.game_version                                       AS current_version,
    prev.game_version                                       AS previous_version,
    curr.platform,
    curr.metric_type,
    curr.p99                                                AS current_p99,
    prev.p99                                                AS prev_p99,
    ROUND((curr.p99 - prev.p99) / NULLIF(prev.p99, 0) * 100, 2) AS pct_change
FROM telemetry_performance curr
JOIN telemetry_performance prev
    -- Join on the same window so both versions are compared over the identical
    -- time period (both are live during a staged rollout) and the streaming
    -- join state stays bounded instead of growing across all windows.
    ON  curr.window_start  = prev.window_start
    AND curr.platform      = prev.platform
    AND curr.metric_type   = prev.metric_type
    AND curr.game_version != prev.game_version
WHERE curr.sample_count > 1000
  AND prev.sample_count > 1000;

A pct_change greater than 10% on frame rate or load time metrics triggers an automated alert to the engineering team.
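The alert condition can be expressed as a query over the view; the metric names and 10% threshold come from the surrounding text, and the query shape is illustrative:

```sql
-- Illustrative alert condition: >10% p99 shift on frame-rate or load-time metrics.
SELECT window_start, current_version, platform, metric_type,
       current_p99, prev_p99, pct_change
FROM version_regression_check
WHERE metric_type IN ('fps', 'load_ms')
  AND ABS(pct_change) > 10;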

FAQ

Q: How do I handle telemetry from players with poor network connections who send delayed events? A: RisingWave supports configurable watermarks per source. Set the watermark delay to accommodate late arrivals (e.g., 2 minutes) and the window functions will wait for late data before finalizing window results.
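As a sketch, the watermark is declared on the source itself; the two-minute delay is the example value from the answer above, and the elided columns and connector options are unchanged from the earlier source definition:

```sql
-- Redefine the source with a watermark so windowed results tolerate late events.
CREATE SOURCE game_telemetry (
    ...,
    recorded_at TIMESTAMPTZ,
    WATERMARK FOR recorded_at AS recorded_at - INTERVAL '2 minutes'
)
WITH (...)
FORMAT PLAIN ENCODE JSON;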

Q: Can I correlate client telemetry with server-side performance metrics? A: Yes. Create a second source ingesting server metrics from a separate Kafka topic and join the two sources in a materialized view on session_id or region to correlate client-perceived performance with server load.
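A minimal sketch of such a correlation by region, assuming a hypothetical `server_metrics` source with `(region, cpu_pct, recorded_at)` columns and a client-side `latency_ms` metric type:

```sql
-- Sketch: correlate client-perceived latency with server CPU load per region/window.
CREATE MATERIALIZED VIEW client_vs_server AS
SELECT c.window_start, c.region, c.p99 AS client_p99_latency_ms, s.avg_cpu_pct
FROM telemetry_performance c
JOIN (
    SELECT window_start, region, AVG(cpu_pct) AS avg_cpu_pct
    FROM TUMBLE(server_metrics, recorded_at, INTERVAL '1 minute')
    GROUP BY window_start, region
) s ON c.window_start = s.window_start AND c.region = s.region
WHERE c.metric_type = 'latency_ms';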

Q: What is the recommended Kafka topic partitioning strategy for high-volume telemetry? A: Partition by player_id or client_id to ensure ordered processing per player. Use at least as many partitions as RisingWave compute nodes to maximize parallelism.

Q: How do I prevent outlier devices (rooted phones, jailbroken consoles) from skewing percentiles? A: Add a filtering step in the materialized view. Exclude values outside a plausible range for the metric type: WHERE metric_value BETWEEN 1 AND 300 for FPS, WHERE metric_value BETWEEN 100 AND 60000 for load time milliseconds.

Q: Is RisingWave suitable for telemetry from mobile games with 50 million daily active users? A: Yes. Even in the worst case of all 50M players online simultaneously, one telemetry event per client every 30 seconds is roughly 1.7 million events per second; in practice peak concurrency is a fraction of DAU, so the sustained rate is far lower and within reach of a horizontally scaled RisingWave cluster.

Know What's Happening in Every Player's Game

Telemetry without real-time processing is just expensive storage. With RisingWave, every frame rate reading, load time measurement, and crash report becomes actionable intelligence the moment it arrives.

Begin at https://docs.risingwave.com/get-started and discuss telemetry patterns with other engineers at https://risingwave.com/slack.
