How to Choose Between Kafka Connect, Flink, and RisingWave for Apache Iceberg

Introduction

You have decided to use Apache Iceberg as your table format. Now you need to get data into it. Not in hourly batches, but continuously, from Kafka topics, CDC streams, and event pipelines. The question is: which streaming tool should write the data?

Three options dominate this space. Kafka Connect offers simplicity with sink connectors. Apache Flink provides a full stream processing framework with native Iceberg support. RisingWave delivers SQL-based stream processing with a built-in Iceberg sink and automatic compaction. Each makes fundamentally different trade-offs between power, complexity, and operational cost.

This guide compares all three approaches across ten dimensions, with concrete examples and clear recommendations for when each tool is the right choice. If you are evaluating how to stream data into Iceberg, this is the comparison you need.

What Are the Three Approaches to Streaming Into Iceberg?

Before comparing them, let's clarify what each tool is and how it connects to Iceberg.

Kafka Connect + Iceberg sink connector

Kafka Connect is a framework for moving data between Apache Kafka and external systems. It runs connectors: source connectors pull data into Kafka, and sink connectors push data out. For Iceberg, you deploy an Iceberg sink connector (such as the one from the Tabular/Apache Iceberg project) that reads records from Kafka topics and writes them to Iceberg tables.

Kafka Connect does not process data. It moves it. You can apply Simple Message Transforms (SMTs) for field-level operations like renaming columns or converting timestamps, but you cannot join streams, compute aggregations, or maintain stateful logic.
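For illustration, here is roughly what an SMT chain looks like inside a connector config. `ReplaceField` and `TimestampConverter` are standard Kafka Connect transforms; the field names are hypothetical:

```json
{
  "transforms": "renameField,tsConvert",
  "transforms.renameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.renameField.renames": "ts:event_time",
  "transforms.tsConvert.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
  "transforms.tsConvert.field": "event_time",
  "transforms.tsConvert.target.type": "Timestamp"
}
```

Each SMT operates on one record at a time with no memory of previous records, which is exactly why joins and aggregations are out of reach.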

Apache Flink + Iceberg connector

Apache Flink is a distributed stream processing framework. It reads from sources (Kafka, files, databases), applies transformations defined in Java, Scala, or Flink SQL, and writes to sinks. Flink has native Iceberg support through the Flink Iceberg connector, which handles both reading from and writing to Iceberg tables.

Flink is a general-purpose processing engine. You can build arbitrary streaming pipelines with complex stateful operations, custom operators, and exactly-once guarantees. The trade-off is operational complexity: Flink clusters require JVM tuning, checkpoint configuration, state backend management, and dedicated platform expertise.

RisingWave + Iceberg sink

RisingWave is a streaming database that uses PostgreSQL-compatible SQL for defining streaming pipelines. You create sources, materialized views, and sinks entirely in SQL. RisingWave's built-in Iceberg sink writes processed data to Iceberg tables with exactly-once guarantees.

RisingWave stores streaming state on S3-compatible object storage instead of local SSDs, which changes the cost model. It also offers an Iceberg Table Engine that manages CDC ingestion, Iceberg writes, and file compaction in a single SQL statement.

How Do They Compare? A Detailed Breakdown

1. Setup and deployment

Kafka Connect: Deploy as part of your existing Kafka infrastructure. Add the Iceberg sink connector JAR, create a connector configuration in JSON, and submit it via the REST API. If you already run Kafka Connect for other connectors, adding an Iceberg sink is minimal additional work.

{
  "name": "iceberg-sink",
  "config": {
    "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
    "tasks.max": "2",
    "topics": "orders",
    "iceberg.tables": "analytics.orders",
    "iceberg.catalog.type": "rest",
    "iceberg.catalog.uri": "https://polaris.example.com/api/catalog",
    "iceberg.catalog.warehouse": "s3://data-lake/warehouse"
  }
}

Flink: Deploy a Flink cluster (JobManager + TaskManagers), configure the state backend (RocksDB for production), set up checkpointing, and submit your Flink job. For Kubernetes deployments, you also need the Flink Kubernetes operator.

-- Flink SQL
CREATE TABLE iceberg_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL(10,2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'iceberg',
    'catalog-type' = 'hive',
    'catalog-name' = 'iceberg_catalog',
    'warehouse' = 's3://data-lake/warehouse',
    'database' = 'analytics',
    'table' = 'orders'
);

INSERT INTO iceberg_orders
SELECT * FROM kafka_orders;

RisingWave: Deploy a single binary or use RisingWave Cloud. Connect via any PostgreSQL client and define your pipeline in SQL. No JVM, no state backend configuration, no separate orchestrator.

-- RisingWave SQL
CREATE SOURCE kafka_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL,
    order_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

CREATE SINK orders_to_iceberg FROM kafka_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    database.name = 'analytics',
    table.name = 'orders',
    catalog.type = 'glue',
    catalog.name = 'my_catalog',
    warehouse.path = 's3://data-lake/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);

2. Transformation capabilities

This is where the three tools diverge most.

| Capability | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Field renaming | Yes (SMT) | Yes | Yes |
| Type casting | Yes (SMT) | Yes | Yes |
| Filtering | Yes (SMT, limited) | Yes | Yes |
| Stateless mapping | Yes (SMT) | Yes | Yes |
| Stream-stream joins | No | Yes | Yes |
| Stream-table joins | No | Yes | Yes |
| Windowed aggregations | No | Yes | Yes |
| Deduplication | No | Yes | Yes |
| Custom UDFs | No | Yes (Java/Python) | Yes (Python/Java/Rust) |
| Pattern matching (CEP) | No | Yes | Limited |

If your pipeline is "Kafka topic to Iceberg table, no changes needed," Kafka Connect is the simplest choice. If you need to join, aggregate, or enrich data before it lands in Iceberg, you need Flink or RisingWave.
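To make the difference concrete: an enrichment that no SMT can express, a stream-table join plus a windowed aggregation, is a few lines of SQL in Flink or RisingWave. A sketch in RisingWave-style SQL, assuming the `kafka_orders` source defined earlier and a hypothetical `customers` table:

```sql
-- Hourly revenue per customer tier: join + tumbling-window aggregation
CREATE MATERIALIZED VIEW revenue_by_tier AS
SELECT
    c.customer_tier,
    window_start,
    SUM(o.amount) AS revenue
FROM TUMBLE(kafka_orders, order_time, INTERVAL '1 hour') AS o
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY c.customer_tier, window_start;
```

The equivalent logic in Flink SQL is similar; in Kafka Connect it is simply not possible without an upstream processing job.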

3. Exactly-once delivery

All three support exactly-once semantics for Iceberg writes, but the mechanisms differ:

Kafka Connect: Depends on the specific connector implementation. The Apache Iceberg sink connector supports exactly-once using Iceberg's transactional commit protocol and Kafka Connect's offset tracking.

Flink: Uses distributed snapshots (Chandy-Lamport algorithm) coordinated by the JobManager. Checkpoints capture the state of all operators and the positions in source streams. On recovery, Flink restores from the last successful checkpoint and replays from the recorded positions.
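In practice this means checkpointing must be enabled before the Flink Iceberg sink can commit exactly-once. A minimal setup in the Flink SQL client might look like this (the interval is illustrative; tune it to your commit-latency target):

```sql
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'state.backend' = 'rocksdb';
```

The Iceberg commit happens as part of checkpoint completion, so the checkpoint interval effectively becomes the Iceberg commit interval.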

RisingWave: Uses barrier-based checkpointing similar to Flink, but stores checkpoint data on S3 instead of local disk. Iceberg commits are coordinated with checkpoints, ensuring exactly-once delivery by default (is_exactly_once = true).
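Continuing the earlier sink example, the setting can be made explicit with the `is_exactly_once` parameter (catalog and storage parameters as configured before):

```sql
CREATE SINK orders_to_iceberg FROM kafka_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    database.name = 'analytics',
    table.name = 'orders',
    catalog.type = 'glue',
    warehouse.path = 's3://data-lake/warehouse',
    is_exactly_once = 'true'
);
```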

4. State management and cost

State management is a critical differentiator for workloads that involve joins or aggregations.

Kafka Connect: No state management needed. Connectors track offsets but do not maintain application state. This keeps the operational model simple but limits what you can do.

Flink: Uses RocksDB as its state backend for production deployments. State is stored on local SSDs attached to TaskManager nodes. This provides fast state access but means you must provision SSD storage proportional to your state size. For large-state workloads (wide joins, long windows), SSD costs can become the dominant infrastructure expense.

RisingWave: Stores streaming state on S3-compatible object storage. This decouples state size from compute node storage: you do not need to provision local SSDs, and state can grow without resizing compute instances. The trade-off is slightly higher per-access latency than local SSD, which RisingWave mitigates with in-memory caching. For most SQL workloads, the cost savings of S3 over provisioned SSDs deliver up to 10x cost efficiency compared to Flink.

5. Operational complexity

| Operational aspect | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Cluster components | Workers (1 type) | JobManager + TaskManagers | Single binary or managed cloud |
| Configuration surface | Connector JSON config | JVM settings, checkpointing, state backends, parallelism | SQL DDL parameters |
| Scaling | Add workers | Adjust parallelism (may require restart) | Auto-scaling (cloud) or add nodes |
| Monitoring | Kafka Connect REST API, JMX | Flink Web UI, JMX, metrics reporters | PostgreSQL-compatible monitoring, Prometheus |
| Failure recovery | Automatic (offset-based) | Checkpoint-based restart | Checkpoint-based restart |
| Upgrades | Rolling connector updates | Job savepoint + restart | Rolling upgrades |
| JVM tuning required | Minimal | Extensive (GC, memory, buffer pools) | None (Rust-based) |
| Team skills required | Kafka administration | JVM + Flink expertise | SQL |

6. Iceberg-specific features

| Feature | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Append writes | Yes | Yes | Yes |
| Upsert writes | Connector-dependent | Yes | Yes |
| Partition transforms | Connector-dependent | Yes | Yes (identity, bucket, truncate, time) |
| Schema evolution | Limited | Yes | Yes |
| Iceberg V2 support | Connector-dependent | Yes | Yes |
| Small file compaction | No (external process) | No (external process) | Yes (built-in with Table Engine) |
| Catalog support | REST, Hive | REST, Hive, Glue, Nessie | REST, Glue, JDBC |
| CDC to Iceberg | Debezium + SMTs | Flink CDC (native) | Built-in CDC connectors |
| Iceberg source (read) | No | Yes | Yes (batch reads) |

7. CDC to Iceberg pipeline

A common pattern is capturing changes from an OLTP database and streaming them to Iceberg. Each tool handles this differently.

Kafka Connect: Requires two connectors in series. A Debezium source connector captures CDC events into a Kafka topic. An Iceberg sink connector reads the topic and writes to Iceberg. You need to configure Debezium's event format and the sink connector's interpretation of INSERT/UPDATE/DELETE operations. SMTs handle the format transformation between Debezium's envelope format and the Iceberg table schema.
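As a sketch, the source half of that chain might look like the following Debezium MySQL connector config. Hostnames and credentials are placeholders; `ExtractNewRecordState` is the standard Debezium SMT that unwraps the change-event envelope into a flat record the Iceberg sink can consume:

```json
{
  "name": "orders-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "prod-db",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "${MYSQL_PASSWORD}",
    "topic.prefix": "ecommerce",
    "database.include.list": "ecommerce",
    "table.include.list": "ecommerce.orders",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
```

The Iceberg sink connector from earlier then subscribes to the resulting topic, completing the two-connector pipeline.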

Flink: Flink CDC reads the database WAL directly (no Kafka required) and writes to Iceberg. This simplifies the architecture but means Flink is responsible for both CDC capture and processing. In Java:

// Flink Java API for CDC to Iceberg
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
    .hostname("prod-db")
    .port(3306)
    .username("cdc_user")
    .password(System.getenv("MYSQL_PASSWORD"))
    .databaseList("ecommerce")
    .tableList("ecommerce.orders")
    .deserializer(new JsonDebeziumDeserializationSchema())
    .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L); // checkpointing is required for exactly-once Iceberg commits
env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");

Or in Flink SQL:

CREATE TABLE mysql_orders (
    order_id BIGINT,
    customer_id INT,
    amount DECIMAL(10,2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'prod-db',
    'port' = '3306',
    'username' = 'cdc_user',
    'password' = '${MYSQL_PASSWORD}',
    'database-name' = 'ecommerce',
    'table-name' = 'orders'
);

INSERT INTO iceberg_orders SELECT * FROM mysql_orders;

RisingWave: Built-in CDC connectors read directly from PostgreSQL, MySQL, and other databases. The Iceberg Table Engine can handle the entire pipeline in a single SQL statement:

CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.internal',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce'
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL,
    order_status VARCHAR,
    updated_at TIMESTAMPTZ
)
FROM pg_source TABLE 'public.orders'
ENGINE = iceberg;

This single statement sets up CDC capture, streaming writes to Iceberg, and automatic compaction. No Kafka topic in between, no connector orchestration.

8. Performance characteristics

| Metric | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Throughput (simple pass-through) | High | High | High |
| Throughput (with aggregations) | N/A | High | High |
| Latency (event to Iceberg commit) | Seconds to minutes | Seconds to minutes | 30-60 seconds (configurable) |
| State access latency | N/A | Sub-millisecond (local SSD) | Milliseconds (S3 + cache) |
| Startup time | Seconds | Minutes (JVM + state restore) | Seconds |
| Memory footprint | Low | High (JVM heap + off-heap) | Moderate (Rust, no GC pauses) |

9. Ecosystem and maturity

| Aspect | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Project age | 10+ years | 10+ years | 4 years |
| Community size | Very large | Very large | Growing |
| Commercial support | Confluent, Aiven, others | Confluent (Flink Cloud), Ververica, Immerok | RisingWave Labs |
| Iceberg connector maturity | Moderate (newer connectors) | Mature | Mature |
| Documentation | Extensive | Extensive | Good |
| Managed cloud options | Confluent Cloud, Aiven | Confluent Cloud, AWS Managed Flink | RisingWave Cloud |

10. Total cost of ownership

Cost depends on your workload, but the patterns are predictable:

Kafka Connect: Lowest cost for simple data movement. You pay for Kafka Connect workers (typically small instances) and Kafka storage. No additional compute for transformations. If you already run Kafka, the incremental cost is one more connector.

Flink: Highest cost for stateful workloads due to SSD-backed state stores. A Flink cluster processing 50,000 events/second with joins requires provisioned SSDs on every TaskManager. JVM memory overhead further increases instance sizes. Managed Flink services (Confluent Cloud, AWS Managed Flink) reduce operational cost but charge a premium.

RisingWave: Lower cost than Flink for stateful workloads because state lives on S3 (roughly 10x cheaper per GB than SSD). Compute instances can be smaller since they do not need local storage. The managed cloud service (RisingWave Cloud) includes auto-scaling, which avoids over-provisioning.

When Should You Use Each Tool?

Choose Kafka Connect when:

  • You need simple, direct data movement from Kafka to Iceberg
  • Data is already in its final form (no joins, aggregations, or enrichment needed)
  • You are already running Kafka Connect for other connectors
  • Your team has Kafka expertise but limited stream processing experience
  • Budget is tight and the workload is straightforward
Choose Apache Flink when:

  • You need complex, custom stream processing that goes beyond SQL
  • Your team has strong JVM and Flink expertise
  • You are already running Flink for other workloads
  • You need advanced patterns like complex event processing (CEP) or custom operators
  • You require the deepest ecosystem integration and broadest community support

Choose RisingWave when:

  • Your transformations are expressible in SQL (joins, aggregations, windows, filters)
  • You want to minimize operational complexity and avoid JVM tuning
  • Cost efficiency matters, especially for large-state workloads
  • You prefer streaming without Java
  • You want built-in Iceberg compaction and CDC handling
  • Your team has SQL expertise and wants to avoid learning a new framework

Can You Combine These Tools?

Yes. These tools are not mutually exclusive. A common pattern uses Kafka Connect for simple data movement and RisingWave (or Flink) for workloads that require transformation:

  • Kafka Connect sinks raw event data from Kafka to Iceberg (append-only, no transformation)
  • RisingWave reads from additional Kafka topics, joins and aggregates the data, and sinks the enriched results to separate Iceberg tables
  • Query engines (Trino, Spark) read all Iceberg tables regardless of how the data arrived

This hybrid approach uses each tool for its strength: Kafka Connect for volume, RisingWave for intelligence.
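Because everything lands in standard Iceberg tables, the query layer never needs to know which tool wrote which table. For instance, a Trino query can join a raw table and an enriched one in a single statement (the `enriched_orders` table name is hypothetical; the catalog layout follows the earlier examples):

```sql
-- Trino: join tables written by different ingestion tools
SELECT e.customer_tier, SUM(r.amount) AS revenue
FROM iceberg.analytics.orders AS r
JOIN iceberg.analytics.enriched_orders AS e
  ON r.order_id = e.order_id
GROUP BY e.customer_tier;
```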

FAQ

Which tool has the lowest latency for streaming to Iceberg?

All three tools can achieve comparable Iceberg commit latency (30 to 60 seconds), since the dominant factor is the Iceberg commit interval rather than processing speed. For raw event-to-commit latency, Kafka Connect can be slightly faster for simple pass-through since it has no processing overhead. Flink and RisingWave add minimal processing latency (milliseconds) for transformations.

Can I switch between these tools without losing data?

Yes, if you design your pipeline correctly. Since all three tools write to standard Apache Iceberg tables, the downstream query layer is agnostic to the ingestion tool. Switching requires setting up the new tool's connector, pointing it at the same Kafka topics (replaying from a known offset), and decommissioning the old tool. The Iceberg tables remain unchanged.

Do I need Kafka to stream data into Iceberg?

Not always. Kafka Connect requires Kafka by definition. Flink can read from Kafka but also supports direct CDC capture (Flink CDC) and file sources. RisingWave supports Kafka, direct CDC from PostgreSQL and MySQL, S3 sources, and other connectors. If you want to stream CDC data to Iceberg without Kafka, both Flink and RisingWave can do this directly.

How do these tools handle schema evolution in Iceberg?

All three tools support Iceberg's schema evolution to varying degrees. Flink and RisingWave handle column additions and type widening through their respective Iceberg connectors. Kafka Connect's schema evolution support depends on the specific connector implementation and the schema registry configuration. For complex schema changes (column renames, reordering), you may need to update the streaming pipeline definition in addition to evolving the Iceberg schema.
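A column addition on the Iceberg side is a metadata-only operation. As a sketch in Spark SQL (the column name is illustrative; the table is the one used throughout this guide):

```sql
-- Evolve the Iceberg table schema; existing data files are untouched
ALTER TABLE analytics.orders ADD COLUMN discount DECIMAL(10,2);
```

After evolving the table, you would typically update the streaming pipeline's source schema or DDL so the new column is actually populated by incoming records.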

Conclusion

Choosing the right tool for streaming data into Apache Iceberg depends on three factors: the complexity of your transformations, your team's expertise, and your cost constraints.

Key takeaways:

  • Kafka Connect is the simplest option for direct, no-transformation data movement from Kafka to Iceberg. Use it when the data is already in its final form.
  • Apache Flink is the most powerful option for complex, custom stream processing with the broadest ecosystem support. Use it when you need Java-based custom operators or advanced patterns.
  • RisingWave offers the best balance of capability and simplicity for SQL-expressible workloads, with lower cost for stateful processing and built-in Iceberg compaction.
  • The tools are complementary, not exclusive. Many architectures use Kafka Connect for simple movement and a streaming engine for enriched workloads.
  • Iceberg's open format ensures portability: you can switch ingestion tools without affecting downstream query engines.

For more detail on how RisingWave compares to Flink specifically, see our RisingWave vs. Apache Flink comparison. To try streaming SQL to Iceberg yourself, check the Iceberg sink documentation.


Ready to stream data into Iceberg with SQL? Try RisingWave Cloud free, with no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
