Source
In stream processing, a Source is a component that represents an external origin of data ingested into the stream processing system (such as RisingWave). It is the entry point for data into a streaming pipeline: Sources continuously read data from upstream systems and make it available as a stream for processing, transformation, and analysis within the engine.
Core Purpose
- Data Ingestion: The primary function of a Source is to enable data to flow into the stream processing system.
- Integration with External Systems: Sources allow the stream processor to connect and read data from a wide variety of upstream technologies, such as:
- Message Queues / Event Streaming Platforms (e.g., Apache Kafka, Apache Pulsar, Redpanda, Google Pub/Sub, AWS Kinesis)
- Databases via Change Data Capture (CDC) (e.g., MySQL, PostgreSQL, MongoDB using Debezium); see the sketch after this list.
- File Systems or Object Stores (typically batch sources, though some systems can tail files or treat object-store notifications as streams)
- Other data-generating applications or APIs.
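For the CDC case above, a minimal sketch (assuming RisingWave's postgres-cdc connector; the orders table, its columns, and all connection values are placeholders, and exact property names can vary by RisingWave version) might look like the following. Note that RisingWave ingests CDC changelogs through CREATE TABLE rather than CREATE SOURCE, so that upstream updates and deletes can be applied to the stored rows:

-- Sketch only: placeholder table, columns, and connection details.
CREATE TABLE orders_cdc (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL,
    PRIMARY KEY (order_id)          -- CDC ingestion requires a primary key
)
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres_host',
    port = '5432',
    username = 'postgres',
    password = 'secret',
    database.name = 'mydb',
    schema.name = 'public',
    table.name = 'orders'
);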
Key Characteristics and Operations
- Connectors: Sources are implemented using connectors specific to the external system. The connector handles the protocol, authentication, data fetching, and offset management required to read from the external system.
- Data Deserialization: The Source is responsible for deserializing data from the external system's format (e.g., JSON, Avro, or Protobuf bytes) into the stream processor's internal data representation. This often involves schema handling, potentially integrating with a Schema Registry; see the sketch after this list.
- Offset Management / Checkpointing: To ensure data is not lost and to support fault tolerance (e.g., resuming from where it left off after a failure), the Source connector typically tracks the "offset" or "position" up to which data has been successfully read and processed. This information is often part of the stream processor's checkpointing mechanism.
- Schema Definition: When defining a source, its schema (column names and data types) must be declared so the stream processor understands the structure of the incoming data.
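To illustrate deserialization and schema handling, the following sketch assumes a Kafka topic carrying Avro-encoded messages and a Confluent-compatible Schema Registry (topic name, broker address, and registry URL are placeholders); the column list is omitted because RisingWave can infer it from the registered schema:

CREATE SOURCE user_events
WITH (
    connector = 'kafka',
    topic = 'user_events',
    properties.bootstrap.server = 'broker1:9092',
    -- start from the earliest retained offset; ongoing read progress is tracked by RisingWave's checkpoints
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE AVRO (
    -- columns are inferred from the Avro schema registered for the topic
    schema.registry = 'http://schema-registry:8081'
);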
Sources in RisingWave
In RisingWave, you define a Source using the CREATE SOURCE SQL statement. This statement typically specifies:
- The name of the source within RisingWave.
- The connector type for the external system (e.g., kafka, pulsar, kinesis, postgres-cdc).
- Connection details for the external system (e.g., broker addresses, database connection string, topic names, credentials).
- The data format and encoding of the incoming messages (e.g., FORMAT PLAIN ENCODE JSON, FORMAT PLAIN ENCODE AVRO, FORMAT DEBEZIUM ENCODE JSON).
- The schema of the data, defining the columns and their types. This can sometimes be inferred if using a Schema Registry with formats like Avro or Protobuf.
Example (Conceptual CREATE SOURCE for Kafka):
CREATE SOURCE bid_events (
    auction_id BIGINT,
    bidder_id BIGINT,
    price DECIMAL,
    bid_timestamp TIMESTAMP
)
WITH (
    connector = 'kafka',
    topic = 'bids',
    properties.bootstrap.server = 'kafka_broker1:9092'
)
FORMAT PLAIN ENCODE JSON;
Once a Source is created, RisingWave's connectors continuously poll or listen to the specified external system, ingest new data records as they arrive, and make them available as a stream. This stream can then be queried directly or used as input for creating Materialized Views for further real-time processing and analytics.
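For instance, a minimal materialized view over the bid_events source defined above could maintain the highest bid per auction, updated incrementally as new events arrive on the stream:

-- The view stays up to date as RisingWave consumes new records from the Kafka topic.
CREATE MATERIALIZED VIEW max_bid_per_auction AS
SELECT
    auction_id,
    MAX(price) AS max_price
FROM bid_events
GROUP BY auction_id;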
Related Glossary Terms