How to Choose Between Kafka Connect, Flink, and RisingWave for Apache Iceberg

Introduction

You have decided to use Apache Iceberg as your table format. Now you need to get data into it. Not in hourly batches, but continuously, from Kafka topics, CDC streams, and event pipelines. The question is: which streaming tool should write the data?

Three options dominate this space. Kafka Connect offers simplicity with sink connectors. Apache Flink provides a full stream processing framework with native Iceberg support. RisingWave delivers SQL-based stream processing with a built-in Iceberg sink and automatic compaction. Each makes fundamentally different trade-offs between power, complexity, and operational cost.

This guide compares all three approaches across ten dimensions, with concrete examples and clear recommendations for when each tool is the right choice. If you are evaluating how to stream data into Iceberg, this is the comparison you need.

What Are the Three Approaches to Streaming Into Iceberg?

Before comparing them, let's clarify what each tool is and how it connects to Iceberg.

Kafka Connect + Iceberg sink connector

Kafka Connect is a framework for moving data between Apache Kafka and external systems. It runs connectors: source connectors pull data into Kafka, and sink connectors push data out. For Iceberg, you deploy an Iceberg sink connector (such as the one from the Tabular/Apache Iceberg project) that reads records from Kafka topics and writes them to Iceberg tables.

Kafka Connect does not process data. It moves it. You can apply Simple Message Transforms (SMTs) for field-level operations like renaming columns or converting timestamps, but you cannot join streams, compute aggregations, or maintain stateful logic.
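For illustration, here is roughly what an SMT chain looks like inside a connector config. `ReplaceField` and `TimestampConverter` are standard Kafka Connect transforms; the field names are hypothetical:

```json
{
  "transforms": "renameField,tsConvert",
  "transforms.renameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.renameField.renames": "ts:event_time",
  "transforms.tsConvert.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
  "transforms.tsConvert.field": "event_time",
  "transforms.tsConvert.target.type": "Timestamp"
}
```

Each SMT operates on one record at a time with no memory of previous records, which is exactly why joins and aggregations are out of reach.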

Apache Flink + Iceberg connector

Apache Flink is a distributed stream processing framework. It reads from sources (Kafka, files, databases), applies transformations defined in Java, Scala, or Flink SQL, and writes to sinks. Flink has native Iceberg support through the Flink Iceberg connector, which handles both reading from and writing to Iceberg tables.

Flink is a general-purpose processing engine. You can build arbitrary streaming pipelines with complex stateful operations, custom operators, and exactly-once guarantees. The trade-off is operational complexity: Flink clusters require JVM tuning, checkpoint configuration, state backend management, and dedicated platform expertise.

RisingWave + Iceberg sink

RisingWave is a streaming database that uses PostgreSQL-compatible SQL for defining streaming pipelines. You create sources, materialized views, and sinks entirely in SQL. RisingWave's built-in Iceberg sink writes processed data to Iceberg tables with exactly-once guarantees.

RisingWave stores streaming state on S3-compatible object storage instead of local SSDs, which changes the cost model. It also offers an Iceberg Table Engine that manages CDC ingestion, Iceberg writes, and file compaction in a single SQL statement.

How Do They Compare? A Detailed Breakdown

1. Setup and deployment

Kafka Connect: Deploy as part of your existing Kafka infrastructure. Add the Iceberg sink connector JAR, create a connector configuration in JSON, and submit it via the REST API. If you already run Kafka Connect for other connectors, adding an Iceberg sink is minimal additional work.

{
  "name": "iceberg-sink",
  "config": {
    "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
    "tasks.max": "2",
    "topics": "orders",
    "iceberg.tables": "analytics.orders",
    "iceberg.catalog.type": "rest",
    "iceberg.catalog.uri": "https://polaris.example.com/api/catalog",
    "iceberg.catalog.warehouse": "s3://data-lake/warehouse"
  }
}

Flink: Deploy a Flink cluster (JobManager + TaskManagers), configure the state backend (RocksDB for production), set up checkpointing, and submit your Flink job. For Kubernetes deployments, you also need the Flink Kubernetes operator.

-- Flink SQL
CREATE TABLE iceberg_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL(10,2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'iceberg',
    'catalog-type' = 'hive',
    'catalog-name' = 'iceberg_catalog',
    'warehouse' = 's3://data-lake/warehouse',
    'database' = 'analytics',
    'table' = 'orders'
);

INSERT INTO iceberg_orders
SELECT * FROM kafka_orders;

RisingWave: Deploy a single binary or use RisingWave Cloud. Connect via any PostgreSQL client and define your pipeline in SQL. No JVM, no state backend configuration, no separate orchestrator.

-- RisingWave SQL
CREATE SOURCE kafka_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL,
    order_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

CREATE SINK orders_to_iceberg FROM kafka_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    database.name = 'analytics',
    table.name = 'orders',
    catalog.type = 'glue',
    catalog.name = 'my_catalog',
    warehouse.path = 's3://data-lake/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);

2. Transformation capabilities

This is where the three tools diverge most.

| Capability | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Field renaming | Yes (SMT) | Yes | Yes |
| Type casting | Yes (SMT) | Yes | Yes |
| Filtering | Yes (SMT, limited) | Yes | Yes |
| Stateless mapping | Yes (SMT) | Yes | Yes |
| Stream-stream joins | No | Yes | Yes |
| Stream-table joins | No | Yes | Yes |
| Windowed aggregations | No | Yes | Yes |
| Deduplication | No | Yes | Yes |
| Custom UDFs | No | Yes (Java/Python) | Yes (Python/Java/Rust) |
| Pattern matching (CEP) | No | Yes | Limited |

If your pipeline is "Kafka topic to Iceberg table, no changes needed," Kafka Connect is the simplest choice. If you need to join, aggregate, or enrich data before it lands in Iceberg, you need Flink or RisingWave.
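To make the difference concrete: an enrichment that no SMT can express, a stream-table join plus a windowed aggregation, is a few lines of SQL in Flink or RisingWave. A sketch in RisingWave-style SQL, assuming the `kafka_orders` source defined earlier and a hypothetical `customers` table:

```sql
-- Hourly revenue per customer tier: join + tumbling-window aggregation
CREATE MATERIALIZED VIEW revenue_by_tier AS
SELECT
    c.customer_tier,
    window_start,
    SUM(o.amount) AS revenue
FROM TUMBLE(kafka_orders, order_time, INTERVAL '1 hour') AS o
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY c.customer_tier, window_start;
```

The equivalent logic in Flink SQL is similar; in Kafka Connect it is simply not possible without an upstream processing job.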

3. Exactly-once delivery

All three support exactly-once semantics for Iceberg writes, but the mechanisms differ:

Kafka Connect: Depends on the specific connector implementation. The Apache Iceberg sink connector supports exactly-once using Iceberg's transactional commit protocol and Kafka Connect's offset tracking.

Flink: Uses distributed snapshots (Chandy-Lamport algorithm) coordinated by the JobManager. Checkpoints capture the state of all operators and the positions in source streams. On recovery, Flink restores from the last successful checkpoint and replays from the recorded positions.
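In practice this means checkpointing must be enabled before the Flink Iceberg sink can commit exactly-once. A minimal setup in the Flink SQL client might look like this (the interval is illustrative; tune it to your commit-latency target):

```sql
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'state.backend' = 'rocksdb';
```

The Iceberg commit happens as part of checkpoint completion, so the checkpoint interval effectively becomes the Iceberg commit interval.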

RisingWave: Uses barrier-based checkpointing similar to Flink, but stores checkpoint data on S3 instead of local disk. Iceberg commits are coordinated with checkpoints, ensuring exactly-once delivery by default (is_exactly_once = true).
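Continuing the earlier sink example, the setting can be made explicit with the `is_exactly_once` parameter (catalog and storage parameters as configured before):

```sql
CREATE SINK orders_to_iceberg FROM kafka_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    database.name = 'analytics',
    table.name = 'orders',
    catalog.type = 'glue',
    warehouse.path = 's3://data-lake/warehouse',
    is_exactly_once = 'true'
);
```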

4. State management and cost

State management is a critical differentiator for workloads that involve joins or aggregations.

Kafka Connect: No state management needed. Connectors track offsets but do not maintain application state. This keeps the operational model simple but limits what you can do.

Flink: Uses RocksDB as its state backend for production deployments. State is stored on local SSDs attached to TaskManager nodes. This provides fast state access but means you must provision SSD storage proportional to your state size. For large-state workloads (wide joins, long windows), SSD costs can become the dominant infrastructure expense.

RisingWave: Stores streaming state on S3-compatible object storage. This decouples state size from compute node storage: you do not need to provision local SSDs, and state can grow without resizing compute instances. The trade-off is slightly higher per-access latency than local SSD, which RisingWave mitigates with in-memory caching. For most SQL workloads, the cost savings of S3 over provisioned SSDs deliver up to 10x cost efficiency compared to Flink.

5. Operational complexity

| Operational aspect | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Cluster components | Workers (1 type) | JobManager + TaskManagers | Single binary or managed cloud |
| Configuration surface | Connector JSON config | JVM settings, checkpointing, state backends, parallelism | SQL DDL parameters |
| Scaling | Add workers | Adjust parallelism (may require restart) | Auto-scaling (cloud) or add nodes |
| Monitoring | Kafka Connect REST API, JMX | Flink Web UI, JMX, metrics reporters | PostgreSQL-compatible monitoring, Prometheus |
| Failure recovery | Automatic (offset-based) | Checkpoint-based restart | Checkpoint-based restart |
| Upgrades | Rolling connector updates | Job savepoint + restart | Rolling upgrades |
| JVM tuning required | Minimal | Extensive (GC, memory, buffer pools) | None (Rust-based) |
| Team skills required | Kafka administration | JVM + Flink expertise | SQL |

6. Iceberg-specific features

| Feature | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Append writes | Yes | Yes | Yes |
| Upsert writes | Connector-dependent | Yes | Yes |
| Partition transforms | Connector-dependent | Yes | Yes (identity, bucket, truncate, time) |
| Schema evolution | Limited | Yes | Yes |
| Iceberg V2 support | Connector-dependent | Yes | Yes |
| Small file compaction | No (external process) | No (external process) | Yes (built-in with Table Engine) |
| Catalog support | REST, Hive | REST, Hive, Glue, Nessie | REST, Glue, JDBC |
| CDC to Iceberg | Debezium + SMTs | Flink CDC (native) | Built-in CDC connectors |
| Iceberg source (read) | No | Yes | Yes (batch reads) |

7. CDC to Iceberg pipeline

A common pattern is capturing changes from an OLTP database and streaming them to Iceberg. Each tool handles this differently.

Kafka Connect: Requires two connectors in series. A Debezium source connector captures CDC events into a Kafka topic. An Iceberg sink connector reads the topic and writes to Iceberg. You need to configure Debezium's event format and the sink connector's interpretation of INSERT/UPDATE/DELETE operations. SMTs handle the format transformation between Debezium's envelope format and the Iceberg table schema.
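As a sketch, the source half of that chain might look like the following Debezium MySQL connector config. Hostnames and credentials are placeholders; `ExtractNewRecordState` is the standard Debezium SMT that unwraps the change-event envelope into a flat record the Iceberg sink can consume:

```json
{
  "name": "orders-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "prod-db",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "${MYSQL_PASSWORD}",
    "topic.prefix": "ecommerce",
    "database.include.list": "ecommerce",
    "table.include.list": "ecommerce.orders",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
```

The Iceberg sink connector from earlier then subscribes to the resulting topic, completing the two-connector pipeline.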

Flink: Flink CDC reads the database WAL directly (no Kafka required) and writes to Iceberg. This simplifies the architecture but means Flink is responsible for both CDC capture and processing. In Java:

// Flink Java API for CDC to Iceberg
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
    .hostname("prod-db")
    .port(3306)
    .username("cdc_user")
    .password(System.getenv("MYSQL_PASSWORD"))
    .databaseList("ecommerce")
    .tableList("ecommerce.orders")
    .deserializer(new JsonDebeziumDeserializationSchema())
    .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L); // checkpointing is required for exactly-once Iceberg commits
env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");

Or in Flink SQL:

CREATE TABLE mysql_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL(10,2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'prod-db',
    'port' = '3306',
    'username' = 'cdc_user',
    'password' = '${MYSQL_PASSWORD}',
    'database-name' = 'ecommerce',
    'table-name' = 'orders'
);

INSERT INTO iceberg_orders SELECT * FROM mysql_orders;

RisingWave: Built-in CDC connectors read directly from PostgreSQL, MySQL, and other databases. The Iceberg Table Engine can handle the entire pipeline in a single SQL statement:

CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.internal',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce'
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL,
    order_status VARCHAR,
    updated_at TIMESTAMPTZ
)
FROM pg_source TABLE 'public.orders'
ENGINE = iceberg;

This single statement sets up CDC capture, streaming writes to Iceberg, and automatic compaction. No Kafka topic in between, no connector orchestration.

8. Performance characteristics

| Metric | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Throughput (simple pass-through) | High | High | High |
| Throughput (with aggregations) | N/A | High | High |
| Latency (event to Iceberg commit) | Seconds to minutes | Seconds to minutes | 30-60 seconds (configurable) |
| State access latency | N/A | Sub-millisecond (local SSD) | Milliseconds (S3 + cache) |
| Startup time | Seconds | Minutes (JVM + state restore) | Seconds |
| Memory footprint | Low | High (JVM heap + off-heap) | Moderate (Rust, no GC pauses) |

9. Ecosystem and maturity

| Aspect | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Project age | 10+ years | 10+ years | 4 years |
| Community size | Very large | Very large | Growing |
| Commercial support | Confluent, Aiven, others | Confluent (Flink Cloud), Ververica, Immerok | RisingWave Labs |
| Iceberg connector maturity | Moderate (newer connectors) | Mature | Mature |
| Documentation | Extensive | Extensive | Good |
| Managed cloud options | Confluent Cloud, Aiven | Confluent Cloud, AWS Managed Flink | RisingWave Cloud |

10. Total cost of ownership

Cost depends on your workload, but the patterns are predictable:

Kafka Connect: Lowest cost for simple data movement. You pay for Kafka Connect workers (typically small instances) and Kafka storage. No additional compute for transformations. If you already run Kafka, the incremental cost is one more connector.

Flink: Highest cost for stateful workloads due to SSD-backed state stores. A Flink cluster processing 50,000 events/second with joins requires provisioned SSDs on every TaskManager. JVM memory overhead further increases instance sizes. Managed Flink services (Confluent Cloud, AWS Managed Flink) reduce operational cost but charge a premium.

RisingWave: Lower cost than Flink for stateful workloads because state lives on S3 (roughly 10x cheaper per GB than SSD). Compute instances can be smaller since they do not need local storage. The managed cloud service (RisingWave Cloud) includes auto-scaling, which avoids over-provisioning.

When Should You Use Each Tool?

Choose Kafka Connect when:

  • You need simple, direct data movement from Kafka to Iceberg
  • Data is already in its final form (no joins, aggregations, or enrichment needed)
  • You are already running Kafka Connect for other connectors
  • Your team has Kafka expertise but limited stream processing experience
  • Budget is tight and the workload is straightforward
Choose Apache Flink when:

  • You need complex, custom stream processing that goes beyond SQL
  • Your team has strong JVM and Flink expertise
  • You are already running Flink for other workloads
  • You need advanced patterns like complex event processing (CEP) or custom operators
  • You require the deepest ecosystem integration and broadest community support

Choose RisingWave when:

  • Your transformations are expressible in SQL (joins, aggregations, windows, filters)
  • You want to minimize operational complexity and avoid JVM tuning
  • Cost efficiency matters, especially for large-state workloads
  • You prefer streaming without Java
  • You want built-in Iceberg compaction and CDC handling
  • Your team has SQL expertise and wants to avoid learning a new framework

Can You Combine These Tools?

Yes. These tools are not mutually exclusive. A common pattern uses Kafka Connect for simple data movement and RisingWave (or Flink) for workloads that require transformation:

  • Kafka Connect sinks raw event data from Kafka to Iceberg (append-only, no transformation)
  • RisingWave reads from additional Kafka topics, joins and aggregates the data, and sinks the enriched results to separate Iceberg tables
  • Query engines (Trino, Spark) read all Iceberg tables regardless of how the data arrived

This hybrid approach uses each tool for its strength: Kafka Connect for volume, RisingWave for intelligence.
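Because everything lands in standard Iceberg tables, the query layer never needs to know which tool wrote which table. For instance, a Trino query can join a raw table and an enriched one in a single statement (the `enriched_orders` table name is hypothetical; the catalog layout follows the earlier examples):

```sql
-- Trino: join tables written by different ingestion tools
SELECT e.customer_tier, SUM(r.amount) AS revenue
FROM iceberg.analytics.orders AS r
JOIN iceberg.analytics.enriched_orders AS e
  ON r.order_id = e.order_id
GROUP BY e.customer_tier;
```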

FAQ

Which tool has the lowest latency for streaming to Iceberg?

All three tools can achieve comparable Iceberg commit latency (30 to 60 seconds), since the dominant factor is the Iceberg commit interval rather than processing speed. For raw event-to-commit latency, Kafka Connect can be slightly faster for simple pass-through since it has no processing overhead. Flink and RisingWave add minimal processing latency (milliseconds) for transformations.

Can I switch between these tools without losing data?

Yes, if you design your pipeline correctly. Since all three tools write to standard Apache Iceberg tables, the downstream query layer is agnostic to the ingestion tool. Switching requires setting up the new tool's connector, pointing it at the same Kafka topics (replaying from a known offset), and decommissioning the old tool. The Iceberg tables remain unchanged.

Do I need Kafka to stream data into Iceberg?

Not always. Kafka Connect requires Kafka by definition. Flink can read from Kafka but also supports direct CDC capture (Flink CDC) and file sources. RisingWave supports Kafka, direct CDC from PostgreSQL and MySQL, S3 sources, and other connectors. If you want to stream CDC data to Iceberg without Kafka, both Flink and RisingWave can do this directly.

How do these tools handle schema evolution in Iceberg?

All three tools support Iceberg's schema evolution to varying degrees. Flink and RisingWave handle column additions and type widening through their respective Iceberg connectors. Kafka Connect's schema evolution support depends on the specific connector implementation and the schema registry configuration. For complex schema changes (column renames, reordering), you may need to update the streaming pipeline definition in addition to evolving the Iceberg schema.
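A column addition on the Iceberg side is a metadata-only operation. As a sketch in Spark SQL (the column name is illustrative; the table is the one used throughout this guide):

```sql
-- Evolve the Iceberg table schema; existing data files are untouched
ALTER TABLE analytics.orders ADD COLUMN discount DECIMAL(10,2);
```

After evolving the table, you would typically update the streaming pipeline's source schema or DDL so the new column is actually populated by incoming records.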

Conclusion

Choosing the right tool for streaming data into Apache Iceberg depends on three factors: the complexity of your transformations, your team's expertise, and your cost constraints.

Key takeaways:

  • Kafka Connect is the simplest option for direct, no-transformation data movement from Kafka to Iceberg. Use it when the data is already in its final form.
  • Apache Flink is the most powerful option for complex, custom stream processing with the broadest ecosystem support. Use it when you need Java-based custom operators or advanced patterns.
  • RisingWave offers the best balance of capability and simplicity for SQL-expressible workloads, with lower cost for stateful processing and built-in Iceberg compaction.
  • The tools are complementary, not exclusive. Many architectures use Kafka Connect for simple movement and a streaming engine for enriched workloads.
  • Iceberg's open format ensures portability: you can switch ingestion tools without affecting downstream query engines.

For more detail on how RisingWave compares to Flink specifically, see our RisingWave vs. Apache Flink comparison. To try streaming SQL to Iceberg yourself, check the Iceberg sink documentation.


Ready to stream data into Iceberg with SQL? Try RisingWave Cloud free, with no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
