Everything You Wanted to Know but Were Afraid to Ask About the Modern Data Stack

Everyone is talking about the modern data stack (MDS) nowadays. I am a data systems person: I started building core database systems in the big data era and have witnessed the birth and rise of cloud computing over the last decade. Yet the first time I came across the term “modern data stack,” I was confused. Is it just another buzzword that cloud service vendors created to grab attention? There are many articles online, but most of them read like marketing or sales material. After running a startup that builds core systems in the modern data stack domain for a while, I would like to share my thoughts. In this article, I will explain the “modern data stack” in simple terms and discuss why it can really matter to companies.

What is a modern data stack?

The most general (and perhaps least clear) explanation of the modern data stack is that it is a set of tools built around a data warehouse to simplify data integration. Its fundamental goal is to save data engineers and data analysts time. Here, data integration essentially means bringing data from different sources together so that it can be analyzed and acted upon. For example, an e-commerce website will have data sources such as user access data, user order data, and product information data. By combining these data in certain ways, we can gain knowledge such as which kinds of users are interested in which products, which types of products sell best, and so on. The modern data stack, in general, is all about helping people turn data into knowledge.
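To make the e-commerce example concrete, here is a minimal sketch of what "putting data together" can look like once the three sources sit side by side. It uses an in-memory SQLite database as a stand-in for a data warehouse; the table names, columns, sample rows, and function names are all hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse.
# All tables and rows below are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, product_id INTEGER);
    CREATE TABLE orders     (user_id INTEGER, product_id INTEGER, quantity INTEGER);
    CREATE TABLE products   (product_id INTEGER PRIMARY KEY, category TEXT);

    INSERT INTO page_views VALUES (1, 10), (1, 11), (2, 10), (3, 12);
    INSERT INTO orders     VALUES (1, 10, 2), (2, 10, 1), (3, 12, 5);
    INSERT INTO products   VALUES (10, 'books'), (11, 'toys'), (12, 'games');
""")


def units_sold_by_category(conn):
    """Which types of products sell best: join orders with product info."""
    rows = conn.execute("""
        SELECT p.category, SUM(o.quantity) AS units
        FROM orders AS o
        JOIN products AS p ON p.product_id = o.product_id
        GROUP BY p.category
        ORDER BY units DESC, p.category
    """)
    return rows.fetchall()


def viewed_but_not_bought(conn):
    """Which users showed interest in a product category without ordering."""
    rows = conn.execute("""
        SELECT v.user_id, p.category
        FROM page_views AS v
        JOIN products AS p ON p.product_id = v.product_id
        LEFT JOIN orders AS o
               ON o.user_id = v.user_id AND o.product_id = v.product_id
        WHERE o.user_id IS NULL
        ORDER BY v.user_id, p.category
    """)
    return rows.fetchall()


print(units_sold_by_category(conn))
print(viewed_but_not_bought(conn))
```

Neither question can be answered from a single source; the answers only appear once access, order, and product data can be joined in one place.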

An illustration of the modern data stack from Fivetran

The image above is from a talk given by Fivetran CEO George Fraser at Tableau Conference 2019. In the talk, George explains what a modern data stack looks like: the data warehouse takes input from various data sources, transforms it inside the warehouse, and provides well-modeled data to support business intelligence.

After reading the above definition of the modern data stack, you may ask: why is this architecture "modern"? Doesn't it sound like a marketing buzzword? It is true that the term "modern data stack" sounds so marketing-oriented that some people might even dislike it. Before we dive deeper into the modern data stack, let's first explore who invented the term.

Who invented the term "modern data stack"?

Frankly speaking, it is difficult to find a clear answer about the term's roots. However, it is certain that the term was popularized by Fivetran, dbt, and the venture capital firms behind them.

Searching the internet, we find that the term first became popular around 2020. The most famous talk/article comes from the second half of 2020, by dbt CEO Tristan Handy. His article, "The Modern Data Stack: Past, Present, and Future", divides the development of the modern data stack into three eras: Cambrian Explosion I (2012-2016), Deployment (2016-2020), and Cambrian Explosion II (2020-2025). I suggest reading the original article for the exact definitions of the three eras. In the same year, a16z, a well-known venture capital firm, published a report titled "Emerging Architectures for Modern Data Infrastructure: 2020". The report describes the data infrastructure used by various tech companies in the United States in 2020 from a VC’s perspective.

The search popularity of “modern data stack” on Google Trends

In fact, after a deeper search, you will find that George Fraser, CEO of Fivetran, mentioned this concept in a speech in 2019. On Twitter, Tristan also said that he proposed a similar idea of the “modern BI stack” back in 2016.

Reply from dbt CEO Tristan Handy on Twitter.

Why do we need a modern data stack?

Knowing the history of the modern data stack, we can go back to the previous question: why is this architecture "modern"? Why do we need a "modern" data stack? Was the previous data stack not perfect? To answer these questions, I think we should focus on "data", not "modern". It is data that lies at the heart of the modern data stack, not technology. The essence of this terminology is that it changes the way companies use technology: companies no longer build a stack for a certain technology product, but build a stack for their own data.

Let’s think back to how companies managed their data 20 years ago. Back then, corporate data was basically stored in database systems sold by Oracle, IBM, and Microsoft. These deployments typically shared the following characteristics.

First, the database systems sold by tech giants such as Oracle, IBM, and Microsoft were expensive, and not every company could afford them.
Second, database vendors provided consulting and support services, not operation and maintenance services. Enterprises usually needed to hire a dedicated DBA team to operate and maintain these databases and to build applications on top of them.
Third, neither the size of data nor the number of applications was large.

For these reasons, when an enterprise tried to manage its own data, it was not really building a framework for the data. Instead, it was building a framework for the database.

Times are different now. Over the past 20 years, the scale of enterprise data and the number of enterprise applications have grown exponentially. A number of database products have emerged to meet the needs of modern enterprises. The development and popularization of cloud computing have further encouraged more enterprises to use databases. In turn, the growth of the market has sustained development in the database field. Compared with 20 years ago:

The price of databases and the barrier to using them have dropped substantially.
Database vendors sell not only software but also services.
Databases now compete less on performance and more on ease of use.

When facing massive data and many applications, if data software is cheap enough and performs well enough, companies can spend less time on the dirty work of dealing with databases and focus on how to make data management easier. That is exactly the original intention behind the modern data stack: making data management easier.

How to make data management easier?

As mentioned above, the advocates of the modern data stack include dbt, Fivetran, a16z, other data software companies, and many venture capital firms. What they all want to change is the way companies prepare data for analytics: from the traditional ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). Below is an illustration, also from George Fraser's talk at Tableau Conference 2019, explaining the difference between ETL and ELT. ETL means that data from a source goes through three steps in order on its way into the data warehouse: extraction, transformation, and loading. In contrast, ELT means that only extraction and loading are needed to get data into the cloud data warehouse; transformation is then carried out directly inside the data warehouse.

The difference between ETL and ELT

Turning traditional ETL into ELT moves the complex computation on data from outside the data warehouse to the inside. This move can simplify data management as a whole. This may still sound a bit confusing: how does ELT simplify our data management? Can you turn your company's "old tech stack" into a "modern data stack" in one simple step?

Imagine what data management looks like without ELT. Suppose an employee wants to analyze raw data. They use ETL tools to build a pipeline that cleans and processes the data into table X in the data warehouse, and then they analyze table X. During the analysis, they find that a column was never extracted, or that the data was processed incorrectly, or that a unit was wrong. What should they do? The only way is to re-fetch the data from the data source, rebuild the ETL pipeline, and re-analyze the data. However, the data in the data source is often temporary: enterprises usually keep source data for only seven or 30 days. In other words, by the time they want to re-extract the data, rebuild the pipeline, and redo the analysis, the original data may already be gone.

ELT solves this problem in a brute-force way: store all the original data in the data warehouse. Once the original data is in the data warehouse, it is no longer lost when the source expires, and all historical records can be found. When we want to process data, we do not build a separate processing pipeline; we write SQL directly and use the computing power of the data warehouse itself.
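The ELT workflow can be sketched in a few lines. This is an illustrative example under stated assumptions, not how any particular product works: an in-memory SQLite database plays the role of the data warehouse, and the raw_orders table, its columns, the sample values, and the rebuild_orders helper are all hypothetical. The raw rows are loaded unchanged, a first transformation gets a unit wrong and forgets a column, and the fix is just a different SQL statement run against the same raw data.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: copy source rows into the warehouse unchanged and untyped.
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT, coupon TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "1250", "SPRING"), ("2", "800", None)],
)


def rebuild_orders(conn, select_sql):
    """Transform: (re)build the modeled table from raw data with plain SQL."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute(f"CREATE TABLE orders AS {select_sql}")
    return conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()


# First attempt: forgets the coupon column and treats cents as dollars.
first = rebuild_orders(conn, """
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(amount_cents AS REAL) AS amount_usd
    FROM raw_orders
""")

# Fix: only the SQL changes; nothing is re-extracted from the source system.
fixed = rebuild_orders(conn, """
    SELECT CAST(order_id AS INTEGER)          AS order_id,
           CAST(amount_cents AS REAL) / 100.0 AS amount_usd,
           coupon
    FROM raw_orders
""")

print(first)
print(fixed)
```

Because raw_orders is never modified, the mistake in the first transformation costs one rewritten query rather than a new extraction from a source that may no longer hold the data.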

It sounds really simple, and doesn't seem like something that only appeared in the 2020s. Why was ELT not used 20 years ago? Why has ELT been promoted only in recent years? I think there are several reasons for this.

Firstly, 20 years ago the amount of data was relatively small, data formats were simpler than today, and processing requirements were modest. Even if developing applications was troublesome, the problems could be solved by adding more engineers.
Secondly, database systems 20 years ago were so expensive that companies either could not afford them or did not want to spend a lot of money on storing large amounts of data.
Thirdly, the performance of database systems at that time was still relatively weak.

Today, all three conditions have changed. Engineers are expensive and applications are complex and diverse, so it is no longer possible to solve problems simply by relying on engineers. The emergence of cloud computing has greatly reduced the cost of data warehouses, making it possible to store massive amounts of historical data cheaply. And performance has improved greatly, so processing large amounts of data inside the data warehouse is no longer difficult.

If we think about it carefully, we find that the essence of ELT is to make the data warehouse the center of data management and to use the warehouse's capabilities to deal with data problems as much as possible. Back to the topic of this article: building the stack around the data warehouse and making extensive use of its capabilities is exactly the approach the modern data stack advocates. ELT is only one part of the modern data stack, which also covers data visualization, metadata management, data discovery, data sharing, and more. We will not cover these topics in this article.

What does the modern data stack have to do with cloud computing?

The modern data stack and cloud computing are strongly correlated, but one did not necessarily cause the other. Cloud computing has certainly catalyzed the development of the modern data stack. However, the essence of the modern data stack is to simplify data management, while the essence of cloud computing is to reduce the cost of data management, so their starting points differ. Nevertheless, without a low-cost solution, there would be no way to make data management easier.

Will there be a one-size-fits-all system on the modern data stack?

The one-size-fits-all system here refers to a system that can do everything: it can support operational processing, analytical processing, stream processing, data visualization, data sharing, data governance, etc. All tasks are done with one system. I think it's an ideal vision, but practically speaking, we haven't reached the stage where one size can fit all. The key question here is not whether it is technically feasible, but whether the product can be widely accepted.

I think this question can be approached from two perspectives: product competition and user needs.

From the product competition perspective, modern data software competes on more than “hard” factors such as performance and price. Vendors have extended the competition to “soft” factors such as security and usability. It is often difficult for a one-size-fits-all system to compete with a product that specializes in a certain task. A good example is Zoom. Although many office tools, including Slack and Microsoft Teams, can do video conferencing, Zoom has become a deserving winner in this field through its specialization and ease of use.
From the user needs perspective, users rarely need all the features of a one-size-fits-all system. Small and medium-sized enterprises tend to use only two or three functions, and naturally they may prefer the system that does those two or three functions best. In a large company, users do use a variety of functions, but when a vendor tries to convince them to replace their existing systems, they are often deterred by the high cost of migration.

Of course, various viewpoints justify the existence of a one-size-fits-all system. The most common argument is that enterprise-level users often only want to solve problems but do not want to spend time choosing solutions. The existence of a one-size-fits-all system can significantly save their time in selecting products. Furthermore, such a system can provide a unified user experience that is more suitable for enterprise users.

I don't think there is a standard answer to this question, and I believe that the future trend (at least in the next 3-5 years) is coexistence. Nevertheless, one-size-fits-all systems will remain a minority. A Swiss Army knife, a fruit knife, and a utility knife each have their own niche, so the Swiss Army knife will not take over the market for other kinds of knives.

The modern data stack serves users' data, and it is changing a situation in which users were bound to a few tech giants. Its goal is to greatly reduce the difficulty of managing data, allowing users to care more about the data itself than about the software. Of course, the modern data stack is still evolving quickly. In the next article, I will explain the current landscape of the modern data stack and what's next in the near future.