Lookup Join

A Lookup Join is a common pattern in stream processing used for Stream Enrichment. In a Lookup Join, for each incoming event in a primary data stream, the system performs a synchronous "lookup" request to an external data store (like a database, key-value store, microservice, or even a local cache) to fetch additional related data. This fetched data is then joined with the original stream event to enrich it.

The lookup is typically based on one or more fields (keys) in the stream event.

How it Works

An event arrives on the input stream.
The stream processor extracts a key from the event (e.g., user_id, product_id).
The processor makes a request to an external system using this key (e.g., SELECT user_profile FROM user_database WHERE user_id = ?).
The external system returns the lookup data.
The processor joins the original stream event with the lookup data.
The enriched event is emitted to the output stream.

Example

Imagine a stream of user click events (user_id, page_url, timestamp). We want to enrich these clicks with user profile information (e.g., user_name, region) stored in an external user database.

For each click event, a lookup join would query the user database using the user_id from the click.
The returned user_name and region would be added to the click event.

Use Cases

Data Enrichment: Adding descriptive attributes, user profiles, product details, geographical information, etc., to raw event data.
Contextualization: Providing more context to events for better analysis or decision-making.
Personalization: Fetching user preferences to personalize downstream actions.
Fraud Detection: Looking up account history or risk scores for incoming transactions.

Advantages

Simplicity (Conceptually): The pattern is straightforward to understand.
Access to External Data: Allows direct access to potentially large or frequently updated datasets in external systems without needing to load all of that data into the stream processor's state.

Challenges and Considerations

Latency: Each lookup is a synchronous network call, which introduces latency to the stream processing pipeline. High latency in the external system can significantly slow down stream processing.
Throughput Bottleneck: The external system can become a bottleneck if it cannot handle the rate of lookup requests generated by the stream (i.e., one request per event). This can severely limit the overall throughput of the streaming application.
Load on External System: Lookup joins can impose a heavy read load on the external data store.
Availability: If the external system is unavailable, the lookup join will fail, potentially halting the stream processing pipeline or leading to data loss/un-enriched data if errors aren't handled carefully.
Data Freshness: The freshness of the enriched data depends on the freshness of the data in the external system at the moment of lookup.
Caching: To mitigate latency and load issues, caching strategies are often employed (e.g., caching frequently accessed lookup results within the stream processor). However, this introduces cache consistency challenges.

Lookup Joins in RisingWave (and Alternatives)

While traditional stream processing systems might offer explicit "lookup join" operators that directly query external systems per event, RisingWave promotes a more efficient and integrated approach for many enrichment scenarios using its Stream-Table Joins, especially when the "table" side is a Materialized View.

RisingWave's Approach for Enrichment:

Ingest Dimension/Lookup Data into a Table/Materialized View:
- Data from external databases or systems that would typically be used for lookups can be ingested into RisingWave as a regular table (for static data) or, more powerfully, as a Materialized View (if the dimension data itself changes over time and is available as a stream, e.g., via CDC).
- This brings the lookup data into RisingWave's managed state.
Perform a Stream-Table Join:
- The primary event stream can then be joined with this internal table/Materialized View using RisingWave's SQL JOIN syntax.
- CREATE MATERIALIZED VIEW enriched_stream AS SELECT s.*, t.* FROM event_stream s JOIN lookup_table_mv t ON s.key = t.key;

Advantages of RisingWave's Stream-Table Join over traditional Lookup Joins:

Lower Latency & Higher Throughput: Joins against data already within RisingWave's state store (Hummock) are significantly faster than per-event external network calls. The join is performed as part of the incremental dataflow.
Reduced Load on External Systems: External systems are not hit with a high volume of individual lookup requests. Data is ingested once (or continuously for changing dimension data) and then reused for joins.
Consistency and Temporal Correctness: RisingWave's state management and event-time processing capabilities can provide more consistent and temporally accurate joins, especially if the lookup data is versioned or changes over time (handled by Materialized Views).
Scalability: RisingWave's distributed architecture scales these internal joins.

When a traditional external lookup might still be considered (outside RisingWave's typical model):

If the external dataset is extremely large and volatile, and ingesting it entirely into RisingWave is impractical.
If the lookup logic is very complex and best handled by a dedicated external microservice. However, for many common enrichment tasks, bringing the dimension data into RisingWave as a table or MV and using its native join capabilities is the preferred, more performant, and robust solution. User-Defined Functions (UDFs) in RisingWave could also potentially be used to implement calls to external systems, but this reintroduces the latency and throughput concerns of traditional lookup joins.

(RisingWave Specific)