User-Defined Function (UDF)
A User-Defined Function (UDF) in a data processing system, including stream processing engines like RisingWave, is a function created by a user to extend the system's built-in functionality. UDFs allow users to encapsulate custom logic, complex calculations, or business rules that are not covered by standard SQL functions or operators.
Once defined and registered with the system, UDFs can be invoked within SQL queries just like built-in functions.
Key Characteristics
- Custom Logic: Enables users to implement domain-specific logic or algorithms directly within their queries.
- Extensibility: Provides a way to add new functionalities to the SQL language supported by the processing engine.
- Reusability: Defined once and can be used in multiple queries or by multiple users.
- Language Support: UDFs are typically written in a programming language supported by the stream processing engine. RisingWave, for example, supports UDFs written in languages like Python, Java, and Rust (often via WebAssembly or external function mechanisms).
- Invocation: Called from SQL queries, passing column values or literals as arguments.
Types of UDFs
While the exact classification can vary by system, UDFs generally fall into these categories:
-
Scalar Functions (UDFs):
- Input: Zero, one, or more scalar values (e.g., from columns of a single row).
- Output: A single scalar value for each input row.
- Example: A function that converts a string to uppercase, calculates a custom risk score based on several input fields, or parses a complex string.
- CREATE FUNCTION my_scalar_udf(input_col INT) RETURNS INT LANGUAGE python AS $$ return input_col * 2 $$;
- SELECT my_scalar_udf(column_a) FROM my_table;
-
User-Defined Aggregate Functions (UDAFs):
- Input: A set of rows (a group).
- Output: A single aggregate value for the entire group.
- Example: A function that calculates a custom statistical measure (e.g., geometric mean, a specialized percentile) over a group of rows.
- These are more complex to define as they typically involve initialize, accumulate, merge (for parallel processing), and finalize phases.
-
User-Defined Table Functions (UDTFs):
- Input: Zero, one, or more scalar values from a single input row, or sometimes a table itself.
- Output: A table (zero or more rows, each with one or more columns) for each input row or input table.
- Example: A function that splits a string into multiple rows, unnests a JSON array into rows, or generates a series of dates.
- SELECT col_a, output_col FROM my_table, LATERAL my_udtf(my_table.data_col) AS t(output_col);
Use Cases in Stream Processing
- Custom Data Transformations: Applying complex data cleaning, formatting, or enrichment logic not available as built-in functions.
- Business Rule Implementation: Embedding specific business calculations or decision logic directly into streaming queries.
- Integration with External Libraries: Calling out to specialized libraries (e.g., for machine learning inference, geospatial calculations) from within a UDF.
- Parsing Complex Data Types: Handling custom binary formats or intricate string structures.
- Extending Analytical Capabilities: Adding specialized statistical functions or industry-specific calculations.
RisingWave and UDFs
RisingWave supports external User-Defined Functions, allowing users to write functions in languages like Python, Java, or Rust. These functions are typically executed in a separate, managed runtime environment to which RisingWave offloads the UDF calls.
Key aspects of UDFs in RisingWave:
- External Functions: RisingWave often uses an "external function" approach where the UDF logic runs outside the main RisingWave process, communicating via a defined interface (e.g., gRPC).
- Language Support: Python is a commonly supported language, leveraging its rich ecosystem of libraries. Java and Rust are also options for performance-critical UDFs.
- Performance: While UDFs offer great flexibility, there can be performance overhead associated with inter-process communication or context switching, especially for row-by-row scalar UDFs called very frequently. Vectorized UDFs or careful design can mitigate this.
- Security and Isolation: Running UDFs externally can provide better isolation and security compared to in-process UDF execution.
- Definition: UDFs are typically defined using CREATE FUNCTION DDL statements, specifying the language, input/output types, and the function body or linkage.
CREATE FUNCTION my_python_scalar_udf(s TEXT)
RETURNS TEXT
LANGUAGE PYTHON
AS $$
def main(input_string):
return input_string.upper()
$$;
SELECT my_python_scalar_udf(product_name) FROM product_stream;
Considerations
- Performance: UDFs can introduce performance overhead. Prefer built-in functions when available and optimize UDF logic.
- Complexity: Managing UDF dependencies and environments can add complexity.
- Portability: UDFs are often specific to the processing engine and language, reducing query portability.
- Security: If UDFs execute arbitrary code, security implications need to be considered. External execution and sandboxing help.
Despite these considerations, UDFs are an indispensable tool for tailoring stream processing pipelines to specific business needs and extending the capabilities of SQL-based systems.