You have Kafka. You have topics full of events, clicks, transactions, sensor readings. You need to aggregate, filter, join, and route that data in real time. The traditional path? Write a Java application with Kafka Streams, wrestle with serdes, manage state stores, handle rebalancing, and deploy a JVM-based service. For many teams, that is a lot of overhead for what boils down to a few transformations.
There is another way. Instead of writing Java or Scala code, you can process Kafka streams using SQL. Not a simplified subset of SQL, but full-featured streaming SQL with joins, window functions, aggregations, and materialized views. This approach eliminates the need for a JVM, custom application code, and the operational burden that comes with deploying stream processing applications.
In this guide, you will learn how to build complete Kafka stream processing pipelines using only SQL with RisingWave, an open-source streaming database. We will cover ingestion, transformation, real-time aggregation, multi-stream joins, and output to downstream systems.
Why Do Teams Want to Process Kafka Without Java?
Kafka Streams is a powerful library, but it comes with a specific set of requirements. You need Java or Scala expertise, a build toolchain (Maven or Gradle), a deployment target for your JVM application, and the operational knowledge to manage stateful stream processing in production. For data engineers and analysts who think in SQL, this creates an unnecessary barrier.
Here are the common pain points that push teams toward SQL-based alternatives:
- Hiring and skills gap: Not every team has JVM developers. Data engineers, analysts, and platform teams often know SQL far better than Java.
- Development velocity: Writing a Kafka Streams application for a simple aggregation can take days. The equivalent SQL query takes minutes.
- Operational complexity: Each Kafka Streams app is a separate service to deploy, monitor, and scale. SQL-based systems centralize that management.
- State management: Kafka Streams uses RocksDB for local state, which requires careful tuning of compaction, memory limits, and disk allocation. A streaming database handles state management internally.
- Testing and iteration: Changing a Java application requires recompilation and redeployment. Changing a SQL query is instant.
The SQL-first approach does not replace Kafka itself. Kafka remains your event backbone. What changes is the processing layer: instead of custom Java code, you write declarative SQL that a streaming database executes continuously.
How Does SQL-Based Kafka Stream Processing Work?
A streaming database like RisingWave connects to Kafka as a source, continuously reads events, and lets you define transformations using standard PostgreSQL-compatible SQL. The results can be queried directly, materialized into views that stay up to date automatically, or sent to downstream systems through sinks.
The architecture looks like this:
Kafka Topics -> RisingWave (SQL Processing) -> Materialized Views / Sinks (PostgreSQL, Kafka, Iceberg, etc.)
RisingWave acts as both a stream processor and a serving layer. Unlike Kafka Streams, which is a library embedded in your application, RisingWave is a standalone database that you connect to with any PostgreSQL client (psql, DBeaver, your application's Postgres driver).
Setting Up RisingWave
Getting started takes one command. No JVM, no Maven, no build step:
# Install and start RisingWave
curl -L https://risingwave.com/sh | sh
risingwave
# Connect with psql
psql -h localhost -p 4566 -d dev -U root
You can also run it with Docker:
docker run -it --pull=always -p 4566:4566 -p 5691:5691 \
risingwavelabs/risingwave:latest single_node
Connecting to Kafka
To ingest data from a Kafka topic, you create a source or table. Here is an example that reads page view events:
CREATE SOURCE page_views (
user_id VARCHAR,
page_url VARCHAR,
referrer VARCHAR,
user_agent VARCHAR,
view_time TIMESTAMP,
session_id VARCHAR
) WITH (
connector = 'kafka',
topic = 'website.page_views',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
That is the entire connection setup. No Java consumer configuration, no deserializer classes, no consumer group management code. RisingWave handles offset tracking, rebalancing, and fault recovery internally.
If you need the data persisted in RisingWave (so you can query historical records), use CREATE TABLE instead of CREATE SOURCE:
CREATE TABLE orders (
order_id VARCHAR,
customer_id VARCHAR,
product_id VARCHAR,
quantity INT,
price DECIMAL,
order_time TIMESTAMP,
status VARCHAR
) WITH (
connector = 'kafka',
topic = 'ecommerce.orders',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
What Can You Build With Streaming SQL on Kafka?
Let us walk through real-world processing patterns that would each require a separate Java application with Kafka Streams but can be expressed as simple SQL with RisingWave.
Real-Time Aggregations
Count page views per URL in 5-minute windows:
CREATE MATERIALIZED VIEW page_view_counts AS
SELECT
page_url,
window_start,
window_end,
COUNT(*) AS view_count,
COUNT(DISTINCT user_id) AS unique_visitors
FROM TUMBLE(page_views, view_time, INTERVAL '5 minutes')
GROUP BY page_url, window_start, window_end;
This materialized view updates continuously as new events arrive from Kafka. You can query it at any time:
SELECT * FROM page_view_counts
WHERE window_start >= NOW() - INTERVAL '1 hour'
ORDER BY view_count DESC
LIMIT 10;
Expected output:
page_url | window_start | window_end | view_count | unique_visitors
------------------------+---------------------+---------------------+------------+-----------------
/products/summer-sale | 2026-03-29 10:15:00 | 2026-03-29 10:20:00 | 847 | 312
/checkout | 2026-03-29 10:15:00 | 2026-03-29 10:20:00 | 523 | 498
/ | 2026-03-29 10:15:00 | 2026-03-29 10:20:00 | 412 | 287
With Kafka Streams in Java, this same aggregation requires defining a TimeWindowedKStream, configuring a Materialized state store, setting up serdes for the key and value, and writing a topology. That is roughly 50-80 lines of Java compared to fewer than ten lines of SQL.
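Because a materialized view is itself a queryable relation, you can also layer further views on top of it. Here is a small sketch of that pattern; the threshold of 100 views is an arbitrary illustrative value:
-- A second view stacked on the first: pages with heavy traffic per window
CREATE MATERIALIZED VIEW busy_pages AS
SELECT page_url, window_start, view_count
FROM page_view_counts
WHERE view_count > 100;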
Multi-Stream Joins
Join orders with customer profiles to build enriched order records. With Kafka Streams, this requires a KStream-KTable join, careful co-partitioning, and matching serdes. In SQL:
-- Customer profile table (could come from a CDC source or another Kafka topic)
CREATE TABLE customers (
customer_id VARCHAR PRIMARY KEY,
name VARCHAR,
email VARCHAR,
tier VARCHAR,
region VARCHAR
) WITH (
connector = 'kafka',
topic = 'crm.customers',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT UPSERT ENCODE JSON;
-- Enriched orders: join the stream with the table
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
o.order_id,
o.product_id,
o.quantity,
o.price,
o.order_time,
c.name AS customer_name,
c.email AS customer_email,
c.tier AS customer_tier,
c.region AS customer_region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
RisingWave handles the join state, co-partitioning, and incremental maintenance. When a new order arrives, it immediately joins with the latest customer profile. When a customer profile updates, all related materialized results update too.
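Once the view exists, you query it like any other table. A usage sketch (the 'gold' tier value is illustrative, not something defined above):
-- Ad-hoc query against the continuously maintained join result
SELECT order_id, customer_name, customer_region, price * quantity AS total_amount
FROM enriched_orders
WHERE customer_tier = 'gold'
ORDER BY order_time DESC
LIMIT 10;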
Filtering and Routing
Detect high-value orders and route them to a separate Kafka topic for priority processing:
-- Materialized view for high-value orders
CREATE MATERIALIZED VIEW high_value_orders AS
SELECT
order_id,
customer_id,
price * quantity AS total_amount,
order_time
FROM orders
WHERE price * quantity > 500;
-- Sink the results back to Kafka
CREATE SINK high_value_orders_sink FROM high_value_orders
WITH (
connector = 'kafka',
topic = 'alerts.high_value_orders',
properties.bootstrap.server = 'kafka-broker:9092'
) FORMAT PLAIN ENCODE JSON (
force_append_only = 'true'
);
The sink continuously pushes new high-value orders to the downstream Kafka topic. No consumer, no producer client code, no error handling boilerplate.
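To confirm what RisingWave has registered so far, the usual catalog commands work from any PostgreSQL client:
-- List the sinks and materialized views defined in the current database
SHOW SINKS;
SHOW MATERIALIZED VIEWS;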
Sliding Window Analytics
Calculate a rolling 1-hour revenue total that updates every minute:
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT
window_start,
window_end,
SUM(price * quantity) AS total_revenue,
COUNT(*) AS order_count,
AVG(price * quantity) AS avg_order_value
FROM HOP(orders, order_time, INTERVAL '1 minute', INTERVAL '1 hour')
GROUP BY window_start, window_end;
In Kafka Streams, a hopping window requires TimeWindows.ofSizeAndGrace().advanceBy() with an explicit grace period, suppression settings, and a custom timestamp extractor. The SQL version is one statement.
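Reading the rolling totals is a plain query against the view; a usage sketch that returns the most recent windows:
-- Most recent rolling 1-hour windows, newest first
SELECT window_start, window_end, total_revenue, order_count, avg_order_value
FROM hourly_revenue
ORDER BY window_start DESC
LIMIT 5;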
How Does This Compare to ksqlDB and Flink SQL?
You might be wondering: ksqlDB also lets you write SQL on Kafka. Flink SQL does too. What makes RisingWave different?
RisingWave vs. ksqlDB
ksqlDB was designed specifically for Kafka. It uses a custom SQL dialect and runs on top of Kafka Streams internally.
| Capability | ksqlDB | RisingWave |
| --- | --- | --- |
| SQL dialect | Custom (KSQL) | PostgreSQL-compatible |
| Data sources | Kafka only | Kafka, PostgreSQL CDC, MySQL CDC, S3, Iceberg, and 30+ connectors |
| State storage | RocksDB on Kafka brokers | Built-in distributed storage (S3/MinIO) |
| Join support | Limited (stream-stream, stream-table) | Full SQL joins including temporal, interval, and as-of joins |
| Serving queries | Pull queries with limitations | Full PostgreSQL-wire-protocol queries |
| License | Confluent Community License | Apache 2.0 |
ksqlDB ties you to the Kafka ecosystem exclusively. If you ever need to join Kafka data with a PostgreSQL table or write results to Iceberg, you need additional tools. RisingWave supports these natively.
RisingWave vs. Flink SQL
Flink SQL is powerful but operationally heavy. Flink requires a cluster (JobManager + TaskManagers), a JVM runtime, and significant tuning for checkpointing, parallelism, and memory. RisingWave provides comparable SQL capabilities with simpler operations:
- No JVM required: RisingWave is written in Rust, resulting in lower memory footprint and no garbage collection pauses.
- Built-in state management: Flink requires configuring RocksDB or HashMapStateBackend. RisingWave manages state transparently.
- PostgreSQL compatibility: Connect with any Postgres client. Flink requires its own SQL client or REST API.
- Simpler deployment: Single binary or Docker container vs. multi-component cluster.
On the Nexmark benchmark suite, RisingWave outperforms Flink on 22 of 27 queries while using significantly less memory.
How Do You Handle Advanced Patterns?
Late-Arriving Events
Real-world Kafka data often arrives out of order. RisingWave handles late events through watermarks:
CREATE SOURCE sensor_readings (
sensor_id VARCHAR,
temperature DOUBLE PRECISION,
reading_time TIMESTAMP,
WATERMARK FOR reading_time AS reading_time - INTERVAL '30 seconds'
) WITH (
connector = 'kafka',
topic = 'iot.sensor_readings',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
The WATERMARK clause tells RisingWave to wait up to 30 seconds for late-arriving data before finalizing window results. This is the same concept as Flink's watermarks, but configured inline with the source definition rather than in separate Java code.
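A common reason to declare a watermark is to emit window results only once a window closes, rather than updating them continuously. Here is a minimal sketch of that pattern on the source above; EMIT ON WINDOW CLOSE requires a watermark on the event-time column:
-- Per-sensor averages, finalized once each 1-minute window closes
CREATE MATERIALIZED VIEW sensor_minute_avg AS
SELECT
sensor_id,
window_start,
AVG(temperature) AS avg_temperature
FROM TUMBLE(sensor_readings, reading_time, INTERVAL '1 minute')
GROUP BY sensor_id, window_start
EMIT ON WINDOW CLOSE;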
Complex Event Processing
Detect patterns across events, such as a user who views a product page more than 3 times within 10 minutes without purchasing:
CREATE MATERIALIZED VIEW cart_abandonment_signals AS
SELECT
pv.user_id,
pv.page_url AS product_page,
COUNT(*) AS view_count,
MAX(pv.view_time) AS last_view_time
FROM TUMBLE(page_views, view_time, INTERVAL '10 minutes') pv
WHERE pv.page_url LIKE '/products/%'
GROUP BY pv.user_id, pv.page_url, window_start, window_end
HAVING COUNT(*) >= 3
AND pv.user_id NOT IN (
SELECT customer_id FROM orders
WHERE order_time >= NOW() - INTERVAL '10 minutes'
);
Exactly-Once Output
When sinking to Kafka or other systems, RisingWave provides exactly-once processing guarantees through its internal checkpointing mechanism. The Kafka sink supports at-least-once delivery by default, and you can achieve effective exactly-once semantics by combining RisingWave's consistent snapshots with idempotent Kafka producers or downstream deduplication.
What Does a Complete Pipeline Look Like?
Here is an end-to-end example: a real-time e-commerce analytics pipeline that reads from three Kafka topics, joins the data, computes metrics, and outputs results to both a queryable materialized view and a downstream Kafka topic.
-- Step 1: Define sources from Kafka
CREATE TABLE orders (
order_id VARCHAR,
customer_id VARCHAR,
product_id VARCHAR,
quantity INT,
price DECIMAL,
order_time TIMESTAMP,
status VARCHAR
) WITH (
connector = 'kafka',
topic = 'ecommerce.orders',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
CREATE TABLE products (
product_id VARCHAR PRIMARY KEY,
product_name VARCHAR,
category VARCHAR,
brand VARCHAR
) WITH (
connector = 'kafka',
topic = 'ecommerce.products',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT UPSERT ENCODE JSON;
CREATE TABLE customers (
customer_id VARCHAR PRIMARY KEY,
name VARCHAR,
email VARCHAR,
tier VARCHAR,
region VARCHAR
) WITH (
connector = 'kafka',
topic = 'crm.customers',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT UPSERT ENCODE JSON;
-- Step 2: Build an enriched view joining all three
CREATE MATERIALIZED VIEW order_details AS
SELECT
o.order_id,
o.customer_id,
o.order_time,
o.quantity,
o.price,
o.quantity * o.price AS total_amount,
c.name AS customer_name,
c.tier AS customer_tier,
c.region,
p.product_name,
p.category,
p.brand
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
-- Step 3: Compute real-time category metrics
CREATE MATERIALIZED VIEW category_metrics AS
SELECT
category,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value,
COUNT(DISTINCT customer_id) AS unique_customers
FROM order_details
GROUP BY category;
-- Step 4: Compute regional revenue per hour
CREATE MATERIALIZED VIEW regional_hourly_revenue AS
SELECT
region,
window_start,
window_end,
SUM(total_amount) AS revenue,
COUNT(*) AS order_count
FROM TUMBLE(order_details, order_time, INTERVAL '1 hour')
GROUP BY region, window_start, window_end;
-- Step 5: Sink category metrics to a Kafka topic for dashboards
CREATE SINK category_metrics_sink FROM category_metrics
WITH (
connector = 'kafka',
topic = 'analytics.category_metrics',
properties.bootstrap.server = 'kafka-broker:9092',
primary_key = 'category'
) FORMAT UPSERT ENCODE JSON;
This entire pipeline replaces what would be three or four separate Kafka Streams applications, each with their own state stores, deployment configurations, and monitoring setup. With RisingWave, you manage a single system, and every query is a standard SQL statement.
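Serving results is then an ordinary query against the views from psql or any other Postgres client, for example:
-- Top categories by revenue, maintained incrementally as orders stream in
SELECT category, total_orders, total_revenue, avg_order_value
FROM category_metrics
ORDER BY total_revenue DESC
LIMIT 10;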
FAQ
What is the difference between Kafka Streams and a streaming database?
Kafka Streams is an embeddable Java library that you include in your application code to process data from Kafka topics. A streaming database like RisingWave is a standalone service that connects to Kafka, processes data using SQL, and serves results through a PostgreSQL-compatible interface. The key difference is that Kafka Streams requires writing and deploying application code, while a streaming database lets you define processing logic as SQL queries.
Can I use RisingWave with my existing Kafka setup?
Yes. RisingWave connects to any Kafka-compatible broker (Apache Kafka, Confluent Cloud, Amazon MSK, Redpanda, WarpStream) as a consumer. It reads from your existing topics without requiring any changes to your Kafka infrastructure. You can also write results back to Kafka topics using sinks. RisingWave supports JSON, Avro, Protobuf, and Debezium encoding formats.
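For example, an Avro topic registered in a Confluent-compatible schema registry can be ingested without declaring columns by hand. In this minimal sketch, the topic name and registry URL are placeholders for your own setup:
CREATE SOURCE orders_avro
WITH (
connector = 'kafka',
topic = 'ecommerce.orders_avro',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE AVRO (
schema.registry = 'http://schema-registry:8081'
);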
Does streaming SQL support the same operations as Kafka Streams?
RisingWave supports all common stream processing operations: filtering, projection, aggregation, joins (inner, outer, temporal, interval, as-of), windowing (tumble, hop, session), and exactly-once state management. Some advanced Kafka Streams patterns like custom processors or interactive queries have SQL equivalents in RisingWave through materialized views and user-defined functions.
Is SQL-based stream processing production-ready?
Yes. RisingWave is used in production by companies processing millions of events per second. It provides fault tolerance through periodic checkpointing, horizontal scaling by adding compute nodes, and persistent state storage on S3-compatible object stores. You can get started with RisingWave Cloud for a fully managed experience or self-host the open-source version.
Conclusion
Processing Kafka data does not require writing Java. SQL-based streaming databases provide a faster, more accessible path to real-time data processing for teams that already know SQL. Here are the key takeaways:
- No JVM, no build tools, no deployment pipeline: Connect to Kafka with a SQL statement instead of building a Java application.
- Full SQL expressiveness: Joins, window functions, aggregations, subqueries, and CTEs all work on streaming data.
- Unified processing and serving: Query your streaming results directly using any PostgreSQL client. No separate serving layer needed.
- Simpler operations: One system to manage instead of multiple Kafka Streams applications, each with their own state stores and scaling requirements.
- Ecosystem flexibility: Read from Kafka, join with PostgreSQL CDC, and write results to Iceberg, all in SQL.
If you are currently maintaining Java-based Kafka Streams applications and spending more time on infrastructure than on business logic, SQL-based stream processing is worth evaluating.
Ready to try this yourself? RisingWave Cloud is free to start, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

