How Financial Broker Leader Leveraged RisingWave Streaming Database to Build Feature Store for Fraud Detection
With RisingWave in place, engineers no longer need to navigate between different systems to maintain data pipelines. Debugging streaming queries has become remarkably straightforward, thanks to the accessibility of all the materialized views created inside.
Company A is an American financial service company that offers commission-free trading of stocks, ETFs, options, and cryptocurrencies, making investing accessible to a broad audience. With a substantial user base that spans across the globe, Company A has solidified its position as a leading fintech innovator. Through its sleek and intuitive platform, it empowers users to invest in a wide array of assets, including stocks, ETFs, options, and cryptocurrencies, all without the burden of traditional brokerage fees. As a forward-thinking and disruptive force in finance, Company A continues to shape the future of wealth creation, making it an enticing prospect for investors seeking a transformative investment experience.
Use Case and History
Company A is dedicated to combating fraud and minimizing risk to ensure the safety of its platform. To achieve this, Company A has harnessed the power of machine learning technologies for fraud detection, utilizing various types of signals. To streamline and standardize the feature engineering process, its infrastructure team has established an internal feature store. This feature store facilitates easier access, reuse, and deployment of features for both training and inference in machine learning models by data scientists and machine learning engineers.
The infrastructure team has undergone several phases in the development of its feature store platform:
- Iteration 1: Initially, they swiftly adopted a popular open-source feature store. They chose this solution to expedite the platform's delivery, allowing data scientists to easily harness it for machine learning application development.
- Iteration 2: However, after using the open-source feature store for some time, they encountered various issues, including resource-intensive usage due to its JVM-based nature, extended bootstrapping times, limited debugging capabilities within the Java processes, and uncertainty regarding the project's maintenance and future direction. Consequently, they decided to take matters into their own hands and build custom components.
- Iteration 3: In pursuit of improved stability, the team opted to redesign and reconstruct the feature store platform using standard open-source technologies. This redesign involved leveraging several systems, such as Apache Spark for batch feature transformation, Apache Flink for processing streaming feature transformation, Apache Kafka as a buffer in the data source and between Flink jobs, and DynamoDB for serving features. By harnessing these open-source software solutions, the team delivered a dependable feature store that could scale to meet the needs of data scientists.
While the current feature store functions adequately, there are some notable pain points, particularly within the streaming pipeline:
- Learning Curve: Learning and using Flink proves to be quite challenging. Notably, very few engineers in the team had prior experience with Apache Flink, necessitating them to learn the system from scratch.
- Debugging Difficulty: Debugging Flink jobs presents a significant challenge. A primary reason for this difficulty is that Flink does not provide easy access into its internal states, leaving users unaware of what occurs within a streaming job.
- Complex Data Stack: The team must simultaneously manage both Flink and Kafka. A typical pipeline demands that a Flink job deliver results to a Kafka topic for further consumption by downstream Flink jobs. For example, when calculating a feature, users might initially compute a pre-aggregation using Flink, send the pre-aggregated results to Kafka, and then create another Flink job to further aggregate these results and generate the final outcome. This pipeline can become excessively lengthy, challenging to debug, and may introduce consistency issues if a particular component experiences a failure.
In response to the challenges outlined above, the infrastructure team embarked on a quest to find solutions that could address these issues. RisingWave emerged as a compelling option for several compelling reasons:
- Familiarity and Flexibility: RisingWave's compatibility with PostgreSQL offers developers the flexibility to construct applications using one of the most widely used SQL variants. This compatibility also facilitates seamless integration with the existing ecosystem, allowing for smooth interaction with third-party business intelligence tools and client libraries.
- Queryable Materialized Views: RisingWave's materialized views function much like tables, not only persisting data but also enabling ad-hoc queries. This means that users can directly query results stored in materialized views in a consistent manner, simplifying the verification of program correctness.
- Unified System: RisingWave excels at managing materialized views and providing query services directly from them. Users can perform computations within RisingWave and subsequently query results within the same system. This eliminates the needs of maintaining multiple systems and removes concerns about data consistency issues across different systems.
RisingWave in Production
Following a thorough evaluation, RisingWave was deployed in production to replace the existing pipelines. With RisingWave in place, engineers no longer need to navigate between different systems to maintain data pipelines. Debugging streaming queries has become remarkably straightforward, thanks to the accessibility of all the materialized views created inside.
One initial concern was whether all of Flink's Java-based jobs could be rewritten using SQL. Fortunately, this proved to be a non-issue, as RisingWave offers UDF support. This means that users can write custom UDFs in Python and Java, seamlessly integrating them into the data processing workflow. This feature simplifies the process by connecting the written UDF methods with RisingWave using a straightforward SQL statement.
The infrastructure team has exciting plans to expand their utilization of RisingWave to construct more stream processing pipelines, subsequently eliminating unnecessary components, such as the data serving layer, from the existing data stack.
The data stack prior to the introduction of RisingWave appeared as follows:
The data stack following the implementation of RisingWave appears as follows:
For leaders in the financial services sector, real-time stream processing delivers significant business value. Gaining access to real-time insights empowers the company to combat fraud and enhance customer service.
By incorporating RisingWave as a critical component in their feature store, Company A has streamlined their data stack and bolstered their confidence in building data pipelines. This success story exemplifies how businesses can benefit from modern stream processing technology.