Reviewing 2024: Key Trends in AI and Data Infrastructure
2025 will be a pivotal year for both data infrastructure and AI infrastructure. The convergence of real-time processing capabilities and AI applications will define the next era of big data.
The big data landscape is changing rapidly, and 2024 in particular has brought transformative shifts to AI and data infrastructure. As someone deeply involved in the data and AI space, I've observed significant trends reshaping how organizations manage, process, and derive value from data. These developments fall broadly into two areas: AI Infrastructure and Traditional Data Infrastructure.
AI Infrastructure
Historically, big data processing focused on complex operations with structured data, leading to the prominence of engines like Apache Spark. However, AI's rapid rise has shifted the focus to unstructured data (images, videos, audio, and text), driving the need for AI-tailored data infrastructure.
For instance, companies like unstructured.io are focusing on ETL workflows for unstructured data, extracting and cleaning data from complex formats like PDFs and PowerPoint files. Despite the promise, this area faces significant technical challenges, especially when precision is crucial.
Another focal point in AI is vector search, a method for finding similar items by comparing their vector representations (embeddings). Since large language models (LLMs) gained prominence, vector search has remained a hot topic. As predicted last year, many traditional databases have now adopted vector search capabilities. For example, Postgres has significantly boosted its competitiveness in AI scenarios through the pgvector extension, and traditional players like Elastic are capturing larger market shares. This trend indicates that vector search is no longer exclusive to vector databases.
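To make this concrete, here is a minimal sketch of pgvector-style similarity search from Python, assuming a local Postgres instance with the pgvector extension available; the connection string, table, and toy 3-dimensional embeddings are illustrative placeholders.

```python
import psycopg2  # assumes Postgres with the pgvector extension installed

conn = psycopg2.connect("dbname=demo")  # hypothetical connection string
cur = conn.cursor()

# One-time setup: enable the extension and create a table with a vector column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute(
    "CREATE TABLE IF NOT EXISTS docs "
    "(id serial PRIMARY KEY, body text, embedding vector(3))"
)

# Store a document alongside its (toy, 3-dimensional) embedding.
cur.execute(
    "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
    ("hello world", "[0.1, 0.2, 0.3]"),
)
conn.commit()

# Nearest-neighbor search: <-> is pgvector's L2-distance operator.
cur.execute(
    "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
    ("[0.1, 0.2, 0.25]",),
)
print(cur.fetchall())
```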
Beyond that, Postgres has emerged as a solution for long-term memory storage in AI. Given the limitations of LLMs in retaining long-term conversational context, developers need a reliable persistent storage solution. Postgres, with its widespread use and reliability, has become a preferred choice in this domain.
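A minimal sketch of this pattern, with a hypothetical schema: each chat message is persisted as a row, and the most recent messages are replayed to rebuild the model's context.

```python
import psycopg2  # schema, table name, and connection details are illustrative

conn = psycopg2.connect("dbname=demo")
cur = conn.cursor()

# One row per chat message serves as the assistant's long-term memory.
cur.execute("""
    CREATE TABLE IF NOT EXISTS chat_memory (
        id         bigserial PRIMARY KEY,
        session_id text NOT NULL,
        role       text NOT NULL,        -- 'user' or 'assistant'
        content    text NOT NULL,
        created_at timestamptz DEFAULT now()
    )
""")

def remember(session_id, role, content):
    cur.execute(
        "INSERT INTO chat_memory (session_id, role, content) VALUES (%s, %s, %s)",
        (session_id, role, content),
    )
    conn.commit()

def recall(session_id, limit=20):
    # Pull the most recent messages, then reverse so the oldest comes first,
    # which is the order an LLM prompt expects.
    cur.execute(
        "SELECT role, content FROM chat_memory WHERE session_id = %s "
        "ORDER BY created_at DESC LIMIT %s",
        (session_id, limit),
    )
    return list(reversed(cur.fetchall()))
```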
As multi-modal AI becomes mainstream, multi-modal databases are attracting more attention. Interestingly, many breakthroughs in this space are coming from traditional players like Redis and MongoDB, while startups focusing on multi-modal processing have received relatively less market attention. This further highlights the dominance of established players in the AI infrastructure space.
In summary, while some entirely new opportunities have emerged in AI infrastructure (e.g., ETL for unstructured data), most subfields are still about established players enhancing their platforms to better support AI. Their technical expertise and market presence give them a commanding advantage.
Traditional Data Infrastructure
Three key trends are reshaping traditional data infrastructure: S3 as a primary storage layer, the rise of "small data," and open table formats.
While not new, using S3 as a primary storage layer has seen significant adoption recently. Analytical databases like ClickHouse have embraced this architecture, leveraging its scalability and cost efficiency.
Similarly, in the streaming domain, Confluent's acquisition of WarpStream—a Kafka alternative with S3 at its core—demonstrates the trend of rethinking storage for modern data pipelines.
Offloading colder, less frequently accessed data to S3 reduces operational costs but introduces latency. However, improvements in parallel processing and metadata management help mitigate this tradeoff.
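The core idea can be sketched in a few lines, assuming hypothetical bucket and partition names: cold partitions are pushed to S3 and pulled back only on demand.

```python
import boto3  # bucket name and key layout are illustrative placeholders

s3 = boto3.client("s3")
COLD_BUCKET = "analytics-cold-tier"

def offload_partition(local_path, partition_key):
    # Ship a cold partition to S3, trading access latency for storage cost.
    s3.upload_file(local_path, COLD_BUCKET, f"partitions/{partition_key}")

def fetch_partition(partition_key, local_path):
    # On a cache miss, pull the partition back down. Parallel range reads and
    # cached metadata are what keep this latency tolerable in practice.
    s3.download_file(COLD_BUCKET, f"partitions/{partition_key}", local_path)
```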
Flink has also integrated S3 as a state store, a mechanism for storing intermediate computation results during stream processing, enhancing its ability to handle large-scale stateful workloads. This development follows the architectural lead of RisingWave, a popular streaming database known for its innovative use of S3 to serve both real-time and historical queries.
These advancements are shaping a new era of hybrid architectures, blending traditional real-time systems with cost-effective object storage solutions, enabling businesses to achieve scalability without compromising functionality.
With the growing performance of single-node systems, "small data" has become a buzzword this year. This refers to datasets that can be effectively processed on a single machine, often due to advancements in hardware and optimized database designs. Embedded databases like DuckDB, often called the "SQLite for Analytics," have gained significant traction.
DuckDB's appeal lies in its seamless compatibility with Postgres, its robust Python API, and its ability to operate efficiently within local environments without requiring a distributed setup.
Additionally, its unique columnar storage engine and ability to handle analytical queries directly from Parquet files have positioned it as a lightweight yet powerful alternative to Snowflake for many use cases.
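For instance, here is a minimal sketch of that workflow using DuckDB's Python API; the file path and columns are placeholders.

```python
import duckdb  # runs in-process: no server or cluster to stand up

con = duckdb.connect()  # in-memory database

# An analytical query issued directly against Parquet files on disk.
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('data/events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```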
As companies increasingly prioritize cost-effective solutions for analytics, DuckDB's ability to deliver high performance at a fraction of the cost makes it a compelling choice.
In 2024, open table formats have emerged as a major highlight in traditional data infrastructure.
Databricks acquired Tabular and open-sourced Unity Catalog, while Snowflake introduced the Polaris open-source project. AWS also announced S3 Tables built on Iceberg at its re:Invent conference.
These innovations highlight the growing need for open standards to simplify data management and improve interoperability. Open table formats enable easy querying of S3 data and enhance portability, reducing vendor lock-in.
Additionally, the support for multi-language environments (e.g., SQL, Python, and Java) allows developers and data scientists from diverse ecosystems to work seamlessly. For instance, a data scientist using Python can directly query data formatted by a SQL-based ETL process, streamlining the analytical workflow. This flexibility is particularly critical as enterprises seek to leverage both data and AI applications to boost productivity.
Open formats also promote advanced features such as time travel (allowing queries on historical data snapshots), schema evolution (adapting to changes in data structure without downtime), and transactional consistency, further solidifying their role as a cornerstone in modern data infrastructures.
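As a sketch of what this looks like from Python, here is a PyIceberg read of an Iceberg table, including a time-travel query; the catalog name, table name, and snapshot ID are hypothetical.

```python
from pyiceberg.catalog import load_catalog  # assumes a configured Iceberg catalog

catalog = load_catalog("default")              # catalog name is a placeholder
table = catalog.load_table("analytics.events")

# Read the current state of a table that a SQL-based ETL job may have written.
current = table.scan().to_arrow()

# Time travel: re-read the table as of an earlier snapshot.
old_snapshot_id = 123456789  # hypothetical; inspect table.history() for real IDs
as_of = table.scan(snapshot_id=old_snapshot_id).to_arrow()
```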
Looking Ahead to 2025
As companies recover financially from the economic slowdown of the past two years, the focus is shifting from cost-cutting to efficiency gains. Productivity enhancement, powered by breakthroughs in large models, will be a top priority in 2025.
Real-time data adoption will be a key indicator of this shift. For example, Iceberg's CDC capabilities improve real-time data processing, and OpenAI's Realtime API advances real-time AI applications. These developments will likely drive increased demand for real-time data.
Moving forward, we can expect more databases to support AI-related features (e.g., Text2SQL), leading to a more seamless integration of data and AI. This integration will likely blur the lines between traditional data management and AI applications.
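As a rough illustration of the Text2SQL idea (not any particular database's implementation), one can prompt an LLM with a schema and a question; the model name, schema, and prompt below are all illustrative.

```python
from openai import OpenAI  # model, schema, and prompt are illustrative only

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = "orders(order_id int, customer_id int, amount numeric, created_at timestamptz)"

def text2sql(question: str) -> str:
    # Ask the model to translate a natural-language question into SQL for a
    # known schema. Production systems add validation, sandboxing, and retries.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Translate questions into SQL for this schema: {SCHEMA}. "
                        "Reply with SQL only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(text2sql("Total order amount per customer over the last 30 days"))
```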
Conclusion
I firmly believe 2025 will be a pivotal year for both data infrastructure and AI infrastructure. The convergence of real-time processing capabilities and AI applications will define the next era of big data.
Yingjun Wu
Founder and CEO at RisingWave Labs