By Nick Handel
Today is the start of Coalesce festivities. dbt Labs has made it clear that news is coming about the evolution of dbt metrics and the Public Preview launch of their Semantic Layer, and we're excited. This could be a formative moment for simple, safe, accessible data.
Even if you haven’t heard of the concept of the semantic layer before, you’ve likely either used one (as part of a BI tool like Looker, PowerBI, etc.), or your brain has gotten good at translating English to SQL, and you are one. Put simply, semantic layers map data to language. To be a bit more precise, a user maintains some logic (for instance, “Revenue” maps to `sum(price) from transactions where …`), but this logic is usually not simple, and there are hundreds or thousands of chunks of this logic to maintain.
Semantic layers are valuable. They have the potential to consolidate duplicate logic, make analytics teams more efficient, create better governance and metadata, and, most importantly, make data more accessible for consumers.
Companies have been using semantic layers as parts of their BI tools for some time. This creates a clean interface, but also locks the value of the semantic layer into individual BI tools, and unfortunately, people don’t just consume data through one interface. So, we all agreed that the next generation of semantic layers should be universal. That takes buy-in from downstream tools. dbt has the influence to effect this change, but that’s not why I’m really excited.
There’s one more major problem we’re not talking about - maintaining these things is horrible. And that is where the real opportunity of dbt metrics lies.
If you’ve maintained a semantic layer before, sorry…
To get all the benefits of a semantic layer, you have to specify lots of logic, and today that requires a fair amount of manual configuration. This information is organized in a spec, and anyone who uses LookML, MetricFlow, or other semantic layers has spent time with these configuration files.
The worst part about semantic layers is maintaining them. It takes time to get up and running and even more time whenever a log changes or a new product line is launched. Not having one takes even more time: rather than maintaining definitions, you’re chugging away, copying and pasting and making slight modifications to your SQL. When definitions aren’t updated, trust is lost, and when they aren’t deprecated, consumption interfaces become the same scary place that existed before the semantic layer.
Let’s just talk about the Welcome to Looker Project. It's 2902 lines of code across 18 files. Welcome indeed.
If you look at an older Looker project… or worse, your Looker project, you typically see a tangled mess of legacy views and explores mixed in with newer objects. Just comment out the dimension, we'll come back to it! But did you? You did not. If you can’t figure out what’s correct in there, how can you expect that of your data consumers?
Some diligent teams have managed to fight the entropy of a semantic layer, but they have only accomplished this through sheer will and frankly admirable persistence. Companies are hooked on the value of the semantic layer but left with a mess of objects or unsustainable diligence. There is a void between the creation and the consumption of data. And in that void lies pain.
So then we have to ask - how could we make semantic layers easier to maintain?
dbt metrics: Making semantic layers sustainable since 2022
This is where dbt metrics comes in. There is no better time to define a consumption interface than during the data set creation process. While dbt aims to be DRY (don’t repeat yourself), there are still significant amounts of repetition happening in most consumption interfaces today. With dbt metrics, the complete chain of logic, from raw data to the final state, can now be expressed in a single place. That means less maintenance, more trust, and clean consumption interfaces.
The basic idea is that there is a new node in the dbt DAG, a metric. Currently, a metric is either an aggregation or a metric derived from other aggregations. A metric is composed of simple components like a name, an aggregation type, an expression, and some filters, but this will likely evolve to include more useful options. Each new piece of logic can open up new types of metrics or ways of querying those metrics.
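To make those components concrete, here is a minimal Python sketch of a metric as a small object that compiles to SQL. It is an illustration only, not the dbt metrics spec: the `Metric` and `DerivedMetric` classes, the `to_sql` method, the `order_id` column, and the `status = 'completed'` filter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Hypothetical stand-in for a metric node; not the dbt metrics spec."""
    name: str
    agg: str    # aggregation type, e.g. "sum" or "count"
    expr: str   # column or SQL expression to aggregate
    table: str  # model the metric is computed from
    filters: list[str] = field(default_factory=list)

    def to_sql(self) -> str:
        sql = f"select {self.agg}({self.expr}) as {self.name} from {self.table}"
        if self.filters:
            sql += " where " + " and ".join(self.filters)
        return sql


@dataclass
class DerivedMetric:
    """A metric expressed as arithmetic over other metrics."""
    name: str
    expr: str              # arithmetic over the names of the input metrics
    inputs: list[Metric]

    def to_sql(self) -> str:
        # Each input is a one-row subquery; cross joining them yields a
        # single row whose columns are the input metric names.
        subqueries = ", ".join(
            f"({m.to_sql()}) as {m.name}_q" for m in self.inputs
        )
        return f"select {self.expr} as {self.name} from {subqueries}"


# Hypothetical definitions: the filter and the order_id column are assumptions.
revenue = Metric("revenue", "sum", "price", "transactions",
                 filters=["status = 'completed'"])
orders = Metric("orders", "count", "order_id", "transactions")
aov = DerivedMetric("aov", "revenue * 1.0 / orders", [revenue, orders])

print(revenue.to_sql())
print(aov.to_sql())
```

The point of the sketch is that the definition lives in one place: changing the filter on `revenue` changes every derived metric built on it, with no SQL to copy and paste.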
Lots of the useful logic relates to the tables (or models) that the metrics come from. If you look at the data source object in MetricFlow’s spec or the View in LookML, you will see a ton of information that dbt has some awareness of or could easily extend the specs to be more aware of. This includes context on when data is updated, how a table is built (incremental, SCD, etc.), what entities exist in tables and how they relate to other tables, and so much more.
Imagine a world where the context on data sources is not manually configured, but instead, context is passed from the transformation layer to the semantic layer directly. Analytics teams still define how these data sets should be consumed, but they could be doing this with much less work. dbt as a transformation layer and as a source of semantics enables an otherwise impossible seamless connection between creation and consumption.
The path is through open collaboration
Semantic layers are hard to build. You can tell by the differences between the specs of newer semantic layers that there are a lot of ideas and not a ton of consensus.
The best product teams know how to decide when a product is ready, and it nearly always feels too early for everyone working on it. Inevitably, if there is interest in metrics, dbt will hear a volley of feature requests, from performance improvements to new metric types. The good news is they have proven to be willing to put ideas out in the open and respond to feedback.
As consumers of logic from dbt, we want more logic from dbt metrics, and so we have to start these conversations. Consumers want their BI tools to be flexible, and the key to making dbt metrics more flexible will be in how the dbt metrics spec evolves. Some areas where we’re eager to contribute our ideas:
- Joins - Companies with larger data models will inevitably need to join data. Leaving joins out was a reasonable decision in V1, but the community has made the need for them clear. With the addition of a concept for entities, dbt could construct joins using a single Universal Semantic Model rather than multiple sub-graphs like in Looker’s explores.
- Abstractions - Time and Dimensions could be pulled out of the metrics spec and defined as reusable objects across many metrics. This would reduce the volume of YAML and make metrics more sustainable for larger organizations.
- Metric Types - dbt metrics should continue to add support for new metric types, like conversion, percentile, cumulative, and metrics built on slowly changing dimension (SCD) tables. These will be important in reducing the need to go around the semantic layer and back to old methods that produce non-DRY code.
- Query Performance & Legibility - There are endless edge cases in constructing SQL. Each new feature will compound the complexity. We’ve found constructing SQL with a graph yields performant, flexible, and legible SQL.
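As a rough illustration of the Joins idea above, the sketch below treats data sources as nodes in a graph, connects any two sources that share an entity, and uses a breadth-first search to find the shortest chain of joins. This is an assumption about how entity-based joins could work, not a description of how dbt, MetricFlow, or Looker build SQL. The `SOURCES` mapping, the entity names, and the `join_path` and `joins_to_sql` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical data sources: each lists the entities (join keys) it contains.
SOURCES = {
    "transactions": {"transaction_id", "customer_id"},
    "customers": {"customer_id", "region_id"},
    "regions": {"region_id"},
}


def join_path(start, goal, sources):
    """Breadth-first search for the shortest chain of joins from start to goal.

    Returns a list of (left_source, right_source, shared_entity) steps and
    raises ValueError if the two sources are not connected.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for other in sorted(sources):
            if other in seen:
                continue
            shared = sources[current] & sources[other]
            if shared:
                seen.add(other)
                # min() picks one shared entity deterministically.
                queue.append((other, path + [(current, other, min(shared))]))
    raise ValueError(f"no join path from {start} to {goal}")


def joins_to_sql(start, path):
    """Render the join steps as a SQL from-clause."""
    sql = f"from {start}"
    for left, right, entity in path:
        sql += f" join {right} on {left}.{entity} = {right}.{entity}"
    return sql


path = join_path("transactions", "regions", SOURCES)
print(joins_to_sql("transactions", path))
```

Because the joins are derived from one shared set of entities, adding a new source only requires declaring its entities; no per-explore join definitions need to be written by hand.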
This may not happen overnight, and semantic layers generally have a long way to go, but we have to start somewhere. While semantic layers are hard, these problems are solvable, but it will take patience and perseverance.
But for now, it’s Coalesce
As believers in the opportunity to centralize transformation and semantic logic, we want to push semantic layers forward. The opportunity is undoubtedly worth the challenge; if there were ever a community capable of seizing this opportunity, it’s this one.
See y’all at Coalesce!