If data teams are the “top chefs” of information in an organization, metrics are their “recipes” for valuable data insights.
The ultimate goal of organizing, modeling, and serving data is to make data useful for your stakeholders. Denormalization is key to making data useful, and yet the concept is rarely discussed and lacks the thoughtful approaches that have been developed for building clean, structured core data sets.
Data modeling has seen great improvements over the last decade, both in tooling and in the thoughtfulness of its approaches, and we have all benefited greatly. But the next generation of data tools will revolutionize the process of denormalization, and those who understand these trends will be the first to benefit.
To understand denormalization, we first need to define normalization:
- Normalization is the organization of data so that it is consistent across all records and fields. Normalization aims to reduce redundancy and ensure consistency in a data warehouse.
Normalization, when done well, enables broader and easier data consumption. To consume that data, however, a data consumer must be able to combine the various datasets and apply aggregations. This process is called denormalization.
- Denormalization is the process of combining and synthesizing cleaned and organized datasets from a variety of sources, which helps make data useful for analyses.
Normalization is akin to organizing ingredients in preparation for cooking, while denormalization would be the process of taking those ingredients and preparing a meal. Keeping a clean and organized pantry and fridge can make the experience of preparing your favorite meal a smoother and more enjoyable experience. It’s not enough to organize your kitchen. To make use of the various ingredients, you need to know how to combine them.
Data analysts are chefs, often capable of improvising with their ingredients, while business users typically look for consistent, reliable results and want to follow recipes.
Both normalization and denormalization are critical, and every data team does both in some form. While normalization is well defined and frequently debated within analytics engineering circles, little has been done in the modern data stack to improve the methods we use to serve denormalized datasets. The outcome is a sprawl of logic across the data stack.
Analysts spend most of their days denormalizing and the majority of a company's business logic is defined in this process. Most data analysts know these concepts, although they may use words like core data sets or data marts for normalized datasets and summary or metric tables for denormalized datasets.
Normalization vs. denormalization: What is the difference?
The value of data doesn’t come from clean and organized storage. That storage is necessary for success, but normalization alone won’t make data useful. Data becomes useful through denormalization: the process of distilling information into data points or metrics that can be used to derive insights.
Similarly, the value of food is not in the ingredients or the process by which we combine those ingredients, but in the end-product, the dish. Denormalization is the process of combining the ingredients, while the ordering and expression of joins, aggregations, and filters are the recipe required to achieve the desired end-product.
Before we can fully understand denormalization at a practical level, we need to look at normalization in more detail. Normalized data is data modeled to limit redundancy while ensuring integrity. To provide some examples:
- Data Redundancy (example): When a user’s address is stored in both a users table and a transactions table. In a normalized data set, the engineer would store the user’s address only in the users table and join the users table to the transactions table.
- Data Integrity (example): When the same value appears in several forms, such as US, USA, and United States. In a normalized data set, the engineer standardizes all of these variants to a single canonical value.
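The two examples above can be sketched in a few lines of plain Python. The table and field names here are hypothetical, chosen only to mirror the users/transactions example: redundancy is removed by splitting the address out of the transactions rows, and integrity is enforced by mapping every country variant to one canonical value.

```python
# Denormalized source rows: the address is repeated on every transaction,
# and the country field appears in inconsistent forms.
transactions_raw = [
    {"txn_id": 1, "user_id": "u1", "amount": 30, "address": "12 Main St", "country": "US"},
    {"txn_id": 2, "user_id": "u1", "amount": 55, "address": "12 Main St", "country": "USA"},
    {"txn_id": 3, "user_id": "u2", "amount": 20, "address": "9 Elm Ave", "country": "United States"},
]

# Integrity: map every variant to a single canonical value.
CANONICAL_COUNTRY = {"US": "US", "USA": "US", "United States": "US"}

# Redundancy: store each user's address exactly once, keyed by user_id...
users = {}
for row in transactions_raw:
    users[row["user_id"]] = {
        "address": row["address"],
        "country": CANONICAL_COUNTRY[row["country"]],
    }

# ...and keep only the foreign key on each transaction.
transactions = [
    {"txn_id": r["txn_id"], "user_id": r["user_id"], "amount": r["amount"]}
    for r in transactions_raw
]
```

Note that the normalized form loses nothing: joining `transactions` back to `users` on `user_id` reproduces the original rows, now with a consistent country value.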
Thoughtful normalization has historically been and remains an enormously important job for data and analytics engineers. Behind every application is a database, or even many databases, and these must be organized in an analytical data warehouse to best serve data consumers.
Normalization is sometimes referred to as a canonical form and is closely related to the historical concept of Master Data Management. The concept comes in degrees, from the simple First Normal Form (1NF) proposed in 1970, to today’s fairly common Third Normal Form (3NF), and beyond, all the way to the unrealistically complicated Sixth Normal Form (6NF). Normalization is good housekeeping, and the range of data modeling techniques proposed over the last five decades has been instrumental to the progress that data teams have realized in recent years.
Denormalization and redundancy
Denormalization happens across nearly every tool in the data stack: in EL tasks, in scheduled batch transformation jobs in the data warehouse, through orchestration tools, and as queries in downstream business intelligence (BI) and analytics tools. Nearly every interface that makes use of data goes through the process of denormalization.
In practice, applying joins, aggregations, and filters is relatively simple, but these operations can have significant second-order effects on logic governance and on the reusability (or, really, the lack of reusability) of datasets down the line. The challenge becomes reproducibility, consistency, and data discovery.
Here are a few important factors that we should consider as we think about a strategy for how organizations should apply denormalization:
- Denormalized data is easier to query than normalized data: Normalized data is both expensive and time-consuming to query. Since data is never duplicated, we must first join datasets to do just about anything interesting. For this reason, we create, consume, and manage large volumes of denormalized datasets, both in the data warehouse and in various downstream tools.
- Denormalization is everywhere: It’s quite common to publish clean summary tables of the most common metric and dimension pairs to end users. These denormalized datasets answer the most common questions, but, to dig deeper, we aggregate, filter, and join datasets in analytical or BI tools. Many questions ultimately require going back to the normalized tables, merging new raw datasets with the summary tables, or some combination of both. This happens in any number of tools that consume data.
- Denormalized datasets can be challenging to manage at scale: Managing data derived from normalized datasets is painful for a variety of reasons. One of the most prominent is, ironically, redundancy, the very thing normalization is supposed to eliminate. If you ever wonder why organizations like Airbnb have more than a hundred thousand tables sitting in their data warehouse, it’s because there are many applications of data, and each requires different joins, slices, or representations of the company’s core normalized datasets. In practice, teams must produce these derived datasets to extract value from the data, and they pay the price of managing the redundancy.
- Governance over denormalized data is non-trivial: There are incomprehensibly many combinations of data that could be interesting for analysis, but we can only pre-compute so many of them. Some of these queries are worth maintaining and updating consistently, despite the discipline and resources required to ensure accuracy. Unfortunately, many derived datasets are forgotten and become a liability as time goes on, as the definitions of their metrics gradually deviate from the current zeitgeist. Without proper context, and unaware of quality problems, analysts will later consume these datasets, and the results will be questionable. This leads to a general distrust in data.
- Denormalized data is difficult or impossible to repurpose: Aggregations are information compression, and filters are simply data removal. It is therefore generally not feasible to go from denormalized datasets back to their more useful normalized forms, or to other related datasets. By itself, this isn’t problematic; if the denormalized dataset is creating value, that’s great. The challenge comes from the sheer volume of denormalized datasets required to support a line of questions and answers that turns data into insights.
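A minimal sketch of the mechanics behind these points, continuing the hypothetical users/transactions example in plain Python: denormalization is a join of normalized tables followed by an aggregation into a summary table. The aggregation compresses information, which is exactly why the summary cannot be taken back to its per-transaction form.

```python
# Normalized inputs: each fact stored once, related by a foreign key.
users = {
    "u1": {"country": "US"},
    "u2": {"country": "DE"},
}
transactions = [
    {"user_id": "u1", "amount": 30},
    {"user_id": "u1", "amount": 55},
    {"user_id": "u2", "amount": 20},
]

# Denormalization: join each transaction to its user's country (the join),
# then roll revenue up by country (the aggregation).
revenue_by_country = {}
for txn in transactions:
    country = users[txn["user_id"]]["country"]  # the join
    revenue_by_country[country] = revenue_by_country.get(country, 0) + txn["amount"]  # the aggregation
```

The resulting summary answers "revenue by country" and nothing else; recovering individual transactions, or answering "revenue by user," requires going back to the normalized tables and running a different recipe.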
Metrics and denormalization: How do they relate?
Metrics defined in relation to denormalization: Metrics are recipes for denormalizing data, synthesized into a universal and comprehensible language for data.
Metrics at their core are formulas that combine normalized data sets into a meaningful and digestible set of values or a time series. In our food analogy, metrics are the recipes that take the various ingredients and create something of value. While you can make a risotto or a paella with a partially overlapping set of ingredients, no one would be happy if they ordered one and received the other. The way we combine the ingredients is important.
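The "metric as recipe" idea can be sketched as a small, reusable definition: a filter plus an aggregation, named once and applied to a normalized table to produce a time series. Everything here (the table, the field names, the `metric_time_series` helper) is hypothetical, illustrating the shape of the idea rather than any particular tool.

```python
from collections import defaultdict

# A normalized transactions table (hypothetical).
transactions = [
    {"date": "2023-01", "amount": 30, "status": "complete"},
    {"date": "2023-01", "amount": 55, "status": "refunded"},
    {"date": "2023-02", "amount": 20, "status": "complete"},
]

def metric_time_series(rows, value, keep, time_key="date"):
    """Apply a metric recipe: filter rows, then sum a value per time period."""
    series = defaultdict(float)
    for row in rows:
        if keep(row):
            series[row[time_key]] += value(row)
    return dict(series)

# "Net revenue" metric: sum of amounts, excluding refunded transactions.
net_revenue = metric_time_series(
    transactions,
    value=lambda r: r["amount"],
    keep=lambda r: r["status"] != "refunded",
)
```

Because the recipe is defined once and applied on demand, the same definition can serve a dashboard, an ad-hoc query, or a scheduled summary table, instead of being re-implemented in each tool.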
Data teams are tasked with providing a clean interface to address the most common questions, so they create summary tables, data marts, and views. Most companies have an unmanageably large number of potential combinations of metrics and dimensions, preventing them from materializing every useful dataset in advance. Instead, the logic to construct metrics creeps into every tool that builds or consumes data within a company’s stack. Precomputation is preferable from a performance and ease-of-consumption perspective, but it’s also limiting for end consumers who are not comfortable adding new data models, and it lacks the flexibility to answer any question that comes up.
As data interfaces improve, consumers are able to ask more questions. They end up building derivative data sets off of the core data sets to find answers. In addition to copying data to build these denormalized datasets and metrics, we often copy logic that should be consistent across multiple systems. The result is a proliferation of repeated logic across tools and, worse, multiple people building on each other’s logic, making the end result nearly a black box to the end consumer.
Based on conversations with a few hundred data analysts over the last two years, I will assert that this is one of the few remaining parts of the data stack with few established best practices, let alone the tooling to enable them. The problem limits both the breadth of data consumption and the depth consumers are willing to go to inform their decisions. Managing these datasets manually puts a maintenance burden on data and analytics engineers, preventing them from accomplishing more meaningful work.
While this post is meant to provide an intellectual foundation for talking about denormalization and its relation to metrics, the argument is that the Modern Data Stack lacks the tools and processes needed to support denormalization well. When we talk about the missing layer of the modern data stack, we are actually pointing to the challenges of managing denormalized data sets.