Introduction
You have an events table partitioned by date. A new analyst joins the team and writes WHERE event_date = '2026-03-15' instead of WHERE event_date_partition = '2026-03'. The query scans every partition. Nobody notices until the cloud bill arrives.
This scenario plays out constantly in Hive-style partitioned tables, where users must know the exact partition column names and values to get efficient queries. It gets worse in streaming pipelines, where data arrives continuously and partition management becomes an ongoing operational burden.
Apache Iceberg solves this with hidden partitioning, a design that separates how data is physically organized from how users query it. Users write queries against the original columns. Iceberg handles partition pruning automatically behind the scenes. This article explains how hidden partitioning works, why it changes the game for streaming workloads, and how RisingWave leverages Iceberg partition specs when sinking streaming data. All SQL examples target RisingWave v2.3.
What Is Iceberg Hidden Partitioning?
Hidden partitioning is Iceberg's approach to organizing data files without exposing partition details to query writers. Instead of requiring separate partition columns in the table schema, Iceberg applies partition transforms to existing columns and tracks the mapping in metadata.
A partition transform is a function that derives a partition value from a source column. Iceberg supports several transforms:
| Transform | Input | Output Example | Use Case |
|---|---|---|---|
| year | 2026-03-15 09:30:00 | 2026 | Low-cardinality time grouping |
| month | 2026-03-15 09:30:00 | 2026-03 | Monthly rollups |
| day | 2026-03-15 09:30:00 | 2026-03-15 | Daily partitions |
| hour | 2026-03-15 09:30:00 | 2026-03-15-09 | High-volume streaming |
| bucket(N) | customer_id = 42 | bucket 7 of 16 | Even distribution |
| truncate(N) | product_sku = 'ABC123' | 'ABC' (width 3) | String prefix grouping |
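The transforms above can be sketched in a few lines of Python. This is an illustration, not Iceberg's implementation: the function names are stand-ins, and Iceberg's real bucket transform applies a Murmur3 hash before the modulo, so its bucket numbers differ from the plain modulo shown here.

```python
from datetime import datetime

# Illustrative stand-ins for Iceberg's partition transforms.
def day_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d")

def month_transform(ts: datetime) -> str:
    return ts.strftime("%Y-%m")

def truncate_transform(value: str, width: int) -> str:
    return value[:width]

def bucket_transform(value: int, n: int) -> int:
    # Iceberg hashes the value with Murmur3, then takes modulo N.
    # Plain modulo keeps this sketch short but yields different buckets.
    return value % n

ts = datetime(2026, 3, 15, 9, 30)
print(day_transform(ts))                # 2026-03-15
print(month_transform(ts))              # 2026-03
print(truncate_transform("ABC123", 3))  # ABC
print(bucket_transform(42, 16))         # 10 (Iceberg's Murmur3 hash gives a different bucket)
```

The key property is that every transform is a pure function of a source column, so the partition value never needs to be stored as a separate column.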
When you define a partition spec like PARTITIONED BY (day(event_time), bucket(16, customer_id)), Iceberg writes data files into directory structures based on these derived values. But the table schema still only contains event_time and customer_id. No synthetic partition columns clutter the schema.
How Partition Pruning Works Automatically
When a query engine reads an Iceberg table and encounters a filter like WHERE event_time > '2026-03-01', Iceberg's planning layer checks the partition spec, applies the day transform to the filter bounds, and eliminates data files in partitions outside the range. The query writer never references a partition column directly.
This automatic pruning relies on Iceberg's metadata layer. Each data file in a manifest records its partition tuple (the transformed values for that file's data). The query planner reads manifests, compares partition tuples against query predicates, and skips files that cannot contain matching rows.
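The file-skipping step can be sketched as follows. The manifest entries and field names below are made up for illustration; real Iceberg manifests store binary partition tuples per file, but the comparison logic is the same idea.

```python
from datetime import date

# Each data file's manifest entry records its partition tuple -- here, the
# day-transformed value of event_time for the rows in that file.
manifest = [
    {"path": "file-a.parquet", "partition_day": date(2026, 2, 27)},
    {"path": "file-b.parquet", "partition_day": date(2026, 3, 2)},
    {"path": "file-c.parquet", "partition_day": date(2026, 3, 10)},
]

def prune(files, lower_bound: date):
    # WHERE event_time > '2026-03-01' implies day(event_time) >= 2026-03-01,
    # so any file whose partition day is earlier cannot contain matching rows
    # and is skipped without ever being opened.
    return [f for f in files if f["partition_day"] >= lower_bound]

kept = prune(manifest, date(2026, 3, 1))
print([f["path"] for f in kept])  # ['file-b.parquet', 'file-c.parquet']
```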
How Does Hive-Style Partitioning Compare?
To appreciate hidden partitioning, consider the problems it replaces.
The Hive Partitioning Model
In Hive-style partitioning, partition values are encoded in directory paths:
```
/data/events/year=2026/month=03/day=15/
/data/events/year=2026/month=03/day=16/
```
The table schema includes explicit partition columns (year, month, day) that users must reference in queries. If a user forgets the partition column and filters on the raw timestamp instead, the engine performs a full table scan.
Problems with Hive-Style Partitioning
Silent correctness issues: Users must produce partition values in the exact format the table expects. Filtering by month = '3' instead of month = '03' returns zero rows without any error.
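This failure mode is easy to reproduce, since partition matching is a plain string comparison (the partition values below are illustrative):

```python
# Hive-style partition values are strings; the filter must match exactly.
partitions = ["month=01", "month=02", "month=03"]

print([p for p in partitions if p == "month=3"])   # [] -- zero rows, no error
print([p for p in partitions if p == "month=03"])  # ['month=03']
```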
Schema pollution: Partition columns are added to the table schema even though they are derived from existing columns. A table with event_time, year, month, and day columns has three redundant columns that exist only for partitioning.
Rigid layouts: Changing the partition scheme (say, switching from daily to hourly partitions as volume grows) requires rewriting the entire table. There is no way to evolve the partition strategy without a full data migration.
Streaming complexity: In streaming pipelines, producers must compute partition values and include them in every record. If the partitioning logic changes, every producer needs updating.
Side-by-Side Comparison
| Aspect | Hive-Style | Iceberg Hidden |
|---|---|---|
| Partition columns in schema | Yes (explicit) | No (derived from transforms) |
| User must know partition layout | Yes | No |
| Wrong filter causes full scan | Yes | No, auto-pruning |
| Partition evolution | Requires full rewrite | Metadata-only change |
| Producer complexity | Must compute partition values | Writes raw values |
Why Does Hidden Partitioning Matter for Streaming?
Streaming pipelines have characteristics that amplify the problems of traditional partitioning and make hidden partitioning particularly valuable.
Continuous Data Arrival
Streaming data arrives without natural batch boundaries. In Hive-style systems, the streaming writer must compute partition values for every record and manage directory creation in real time. With Iceberg hidden partitioning, the writer sends raw column values, and the Iceberg library applies transforms during the write. This eliminates a category of bugs where partition computation logic diverges between the producer and the table definition.
Schema and Partition Evolution
Streaming pipelines run continuously for weeks or months. During that time, data volumes change. A table partitioned by day might need hour partitioning after traffic doubles. With Hive-style partitioning, this requires stopping the pipeline, rewriting all historical data, and updating all producers.
Iceberg supports partition evolution as a metadata-only operation. You can change the partition spec, and new data files use the updated spec while existing files retain their original partitioning. Both old and new files remain queryable through the same table. The streaming pipeline keeps running without interruption.
For example, you can evolve from daily to hourly partitioning:
```sql
-- Original spec: partitioned by day(event_time)
-- Evolve to hourly partitioning (metadata-only change; Spark SQL syntax):
ALTER TABLE events REPLACE PARTITION FIELD day(event_time) WITH hour(event_time);
```
After this change, new data files are written with hourly partitions. Existing daily-partitioned files continue to work. Iceberg's query planning handles both partition specs transparently.
Consumer Simplicity
Downstream consumers of streaming data should not need to understand partition layouts. An analyst querying a real-time dashboard table should write:
```sql
SELECT region, COUNT(*) AS order_count
FROM orders
WHERE order_time >= NOW() - INTERVAL '1 hour';
```
Not:
```sql
SELECT region, COUNT(*) AS order_count
FROM orders
WHERE order_year = '2026'
  AND order_month = '03'
  AND order_day = '29'
  AND order_hour >= '14';
```
Hidden partitioning makes the first query efficient automatically. The second query pattern is fragile, error-prone, and creates tight coupling between application code and physical storage layout.
How Does RisingWave Leverage Iceberg Partition Specs?
RisingWave can sink streaming results directly into Apache Iceberg tables with partition-aware writes. When you create a sink in RisingWave targeting an Iceberg table, you specify the partition spec using the partition_by parameter.
Creating a Partitioned Iceberg Sink
Here is a complete example that streams order data from RisingWave into an Iceberg table with hidden partitioning:
```sql
-- Step 1: Create source table in RisingWave
CREATE TABLE order_events (
    order_id INT,
    customer_id INT,
    product_category VARCHAR,
    amount DECIMAL,
    region VARCHAR,
    order_time TIMESTAMP
);

-- Step 2: Create a materialized view for pre-aggregation
CREATE MATERIALIZED VIEW hourly_orders AS
SELECT
    region,
    product_category,
    window_start,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM TUMBLE(order_events, order_time, INTERVAL '1 hour')
GROUP BY region, product_category, window_start;

-- Step 3: Sink to Iceberg with hidden partitioning
CREATE SINK orders_iceberg_sink FROM hourly_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    catalog.type = 'rest',
    catalog.name = 'my_catalog',
    catalog.uri = 'http://iceberg-rest:8181',
    warehouse.path = 's3a://my-warehouse/iceberg',
    database.name = 'analytics',
    table.name = 'hourly_orders',
    s3.endpoint = 'http://minio:9000',
    s3.access.key = 'admin',
    s3.secret.key = 'password',
    s3.region = 'us-east-1',
    create_table_if_not_exists = 'true',
    partition_by = 'day(window_start), truncate(3, region)'
);
```
In this example, the partition_by = 'day(window_start), truncate(3, region)' parameter tells RisingWave to apply Iceberg's day transform on the window_start column and truncate(3) on the region column. RisingWave handles the partitioning automatically during writes. Downstream query engines reading the Iceberg table get automatic partition pruning on both time-based and region-based queries.
Partition Transforms Available in RisingWave Sinks
RisingWave supports the same partition transforms as Iceberg when creating sinks:
- year(column), month(column), day(column), hour(column) for time-based partitioning
- bucket(N, column) for hash-based distribution
- truncate(N, column) for string or numeric prefix grouping
You can combine multiple transforms in a single partition_by parameter, separated by commas. For detailed configuration options, see the RisingWave Iceberg sink documentation.
Upsert Mode with Partitioning
For use cases that require updates (such as CDC pipelines), RisingWave supports upsert sinks to Iceberg:
```sql
CREATE SINK customer_profiles_sink FROM customer_profiles_mv
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'customer_id',
    catalog.type = 'rest',
    catalog.name = 'my_catalog',
    catalog.uri = 'http://iceberg-rest:8181',
    warehouse.path = 's3a://my-warehouse/iceberg',
    database.name = 'analytics',
    table.name = 'customer_profiles',
    s3.endpoint = 'http://minio:9000',
    s3.access.key = 'admin',
    s3.secret.key = 'password',
    s3.region = 'us-east-1',
    create_table_if_not_exists = 'true',
    partition_by = 'bucket(32, customer_id)'
);
```
This sink uses bucket(32, customer_id) to distribute customer profiles evenly across 32 partitions, ensuring balanced write throughput for high-volume streaming updates.
How Does Partition Evolution Work in Practice?
Partition evolution is one of Iceberg's most powerful features for long-running streaming pipelines. It allows you to change how new data is partitioned without affecting existing data.
A Real-World Evolution Scenario
Consider an IoT sensor pipeline. You start with daily partitioning because initial data volumes are low:
Partition spec v0: day(reading_time)
Six months later, sensor count triples and daily partitions contain too many files. You evolve to hourly partitioning:
Partition spec v1: hour(reading_time)
After the evolution:
- Existing data files from v0 retain their daily partition metadata
- New data files written by RisingWave (or any writer) use hourly partitions
- A single query spanning both periods works correctly: Iceberg's planner applies the appropriate partition spec for each file based on when it was written
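The planner's per-file spec dispatch can be sketched as follows. The file paths, spec ids, and partition strings are illustrative (real manifests store binary partition tuples), but the shape of the logic matches: each file is pruned using the transform of the spec it was written under.

```python
from datetime import datetime

# One file written under spec v0 (daily) and one under spec v1 (hourly).
files = [
    {"path": "old.parquet", "spec": 0, "partition": "2026-03-14"},     # day
    {"path": "new.parquet", "spec": 1, "partition": "2026-03-15-09"},  # hour
]

def may_match(f, lower: datetime) -> bool:
    # For WHERE reading_time >= lower, apply the transform belonging to the
    # spec each file was written under, then compare partition values.
    if f["spec"] == 0:  # day(reading_time)
        return f["partition"] >= lower.strftime("%Y-%m-%d")
    else:               # hour(reading_time)
        return f["partition"] >= lower.strftime("%Y-%m-%d-%H")

lower = datetime(2026, 3, 15, 8)
print([f["path"] for f in files if may_match(f, lower)])  # ['new.parquet']
```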
What This Means for Streaming Pipelines
In a streaming context, partition evolution is particularly valuable because:
- No pipeline downtime: The streaming sink (RisingWave, Flink, or Spark Streaming) keeps running. The next commit simply uses the updated partition spec.
- No historical rewrite: Old data stays in place. You do not need to reprocess months of data to change the partitioning strategy.
- Gradual rollout: You can monitor query performance on new partitions before committing to the change permanently.
For a deeper look at building streaming pipelines that sink to Iceberg, see the RisingWave lakehouse integration guide.
What Are Best Practices for Iceberg Partitioning in Streaming?
Choosing the right partition strategy for streaming workloads requires balancing write throughput, query performance, and operational simplicity.
Match Partition Granularity to Query Patterns
If most queries filter by hour, partition by hour(event_time). If queries typically filter by day, use day(event_time). Over-partitioning (hourly partitions when queries only filter by month) creates excessive metadata overhead and small files.
Use Bucket Partitioning for High-Cardinality Keys
For tables frequently queried by a high-cardinality column (like user_id or device_id), add a bucket(N, column) transform. Choose N based on your expected data volume: too few buckets create large files, too many create small files. A good starting point is 16-64 buckets.
Combine Time and Key Partitioning
For streaming tables, a combined partition spec often works best:
```sql
partition_by = 'day(event_time), bucket(32, device_id)'
```
This gives you time-based pruning for range queries and hash-based pruning for point lookups, covering the two most common streaming query patterns.
Plan for Evolution
Start with a coarser partition granularity (daily) and evolve to finer granularity (hourly) as data volume grows. Iceberg makes this a metadata-only change. Starting too fine and needing to coarsen is also possible but less common.
Monitor File Sizes
Target data files between 128 MB and 512 MB. In streaming pipelines, small files accumulate quickly due to frequent commits. Use Iceberg's compaction procedures to merge small files periodically. RisingWave's commit_checkpoint_interval parameter controls how often data is flushed to Iceberg, directly affecting file sizes.
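A back-of-envelope calculation shows how commit frequency and partition fan-out drive file size; all of the numbers below are illustrative, and a real pipeline's rates will differ.

```python
# With frequent commits, each commit writes roughly one file per active
# partition, so file size is (throughput * commit interval) / partitions.
throughput_mb_per_s = 50   # illustrative sink write rate
commit_interval_s = 60     # how often the sink commits to Iceberg
active_partitions = 32     # e.g. bucket(32, device_id) within the current hour

bytes_per_commit_mb = throughput_mb_per_s * commit_interval_s
file_size_mb = bytes_per_commit_mb / active_partitions
print(f"{file_size_mb:.1f} MB per file per commit")  # 93.8 MB per file per commit
```

At these rates the files land in the healthy range, but halving the commit interval or doubling the bucket count would push them well below 128 MB, which is when compaction becomes necessary.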
FAQ
What is hidden partitioning in Apache Iceberg?
Hidden partitioning is Iceberg's method of organizing data files using partition transforms applied to existing columns, without adding explicit partition columns to the table schema. Users query the original columns directly, and Iceberg automatically applies partition pruning during query planning. This eliminates the need for users to know or reference the physical partition layout.
How does Iceberg hidden partitioning differ from Hive partitioning?
Hive partitioning requires explicit partition columns in the table schema and directory paths, and users must filter on those exact columns for efficient queries. Iceberg hidden partitioning derives partition values from source columns using transforms (year, month, day, hour, bucket, truncate), keeps them out of the user-facing schema, and applies pruning automatically regardless of how the user writes their WHERE clause.
Can you change Iceberg partition specs without rewriting data?
Yes. Iceberg supports partition evolution as a metadata-only operation. You can change the partition spec for a table, and new data files use the updated spec while existing files retain their original partitioning. Both old and new partition layouts remain queryable through the same table, and query planners handle the mixed specs transparently.
How does RisingWave handle Iceberg partitioning when sinking data?
RisingWave applies Iceberg partition transforms during sink writes using the partition_by parameter in the CREATE SINK statement. You specify transforms like day(column), bucket(N, column), or truncate(N, column), and RisingWave automatically organizes output data files according to the partition spec. This works with both append-only and upsert sink modes.
Conclusion
Iceberg hidden partitioning eliminates the operational burden and query fragility of Hive-style partitioning. For streaming pipelines, the benefits are even more pronounced:
- No partition column management: Streaming writers send raw column values. Iceberg handles the rest.
- Automatic query pruning: Consumers write natural filters on source columns. No need to know the partition layout.
- Partition evolution without downtime: Change partitioning strategies while the streaming pipeline keeps running.
- RisingWave integration: Sink streaming SQL results directly into partitioned Iceberg tables using familiar transform syntax.
The combination of RisingWave's streaming SQL and Iceberg's hidden partitioning gives you a lakehouse architecture where data is always fresh and always queryable, without the operational complexity of managing partition columns across producers and consumers.
Ready to try streaming to Iceberg with hidden partitioning? Get started with RisingWave in 5 minutes with the Quickstart guide.
Join our Slack community to ask questions and connect with other stream processing developers.

