Our data stacks aren’t built for change
Data Talks


Nick Handel

If you’ve been working in data for long enough, you start to see patterns. Not that long ago, we were stitching together Airflow tasks and SSHing our way into Hadoop gateway nodes. Or maybe you weren’t—and for that you should be thankful! These tools unlocked “big data,” but they were clunky, so we built tools with better interfaces: the modern data stack.

Each time we solve a problem, our teams get a bit further with a bit less. Like anything, the best part of data is also the worst part—our answers only ever drive more questions.

Paradigms come and go, and the paradigm we exist in today centers on building DAGs. Modern tools have made this process controlled and scientific. But we have a new problem: our DAGs are growing into wild, tentacled, bloated messes, and we are left to maintain them.

We have to take a step back and ask whether the problem is not our tools, but the abstraction we’re betting the farm on.

Our DAGs are static representations of our organization

The problem is that our DAGs are static representations of our organization’s needs. We build the tables that represent our products and the types of questions we must answer. But we don’t exist in static environments—our organizations, ideas, and questions are constantly evolving. Every new question or change spawns a new appendage.

Let’s illustrate this. Your company just added a new feature. Your business partner has questions about adoption. So they ask you: Who uses the feature? How did they get there? Are they coming back? This is the fun part, so you dive in.

You write some queries, find something interesting, and share your learnings. Your business partner has more questions, and a loop of exploration and refinement begins. Then, they ask you to help them track this analysis. Now you must answer, “What should I add to my DAG?”

You have a few answers to the question, “Who uses the feature?”: monthly, last7, last28, etc. Which do you choose? You’ll inevitably calculate them all in the future, so you save a model and share the table with your business partner… they know some SQL—what could go wrong?
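Those variants differ only in their lookback window, which makes them natural candidates for a single parameterized definition rather than three hand-written models. A minimal sketch of that idea—the table and column names (`feature_events`, `user_id`, `event_date`) are illustrative, not from any real schema:

```python
from datetime import date, timedelta

def active_users_sql(window_days: int, as_of: date) -> str:
    """Render the 'who uses the feature' metric for any lookback window."""
    start = as_of - timedelta(days=window_days - 1)
    return (
        "SELECT COUNT(DISTINCT user_id) AS active_users "
        "FROM feature_events "
        f"WHERE event_date BETWEEN '{start.isoformat()}' AND '{as_of.isoformat()}'"
    )

# One definition, three variants; a new window is one more entry, not a new model.
WINDOWS = {"monthly": 30, "last7": 7, "last28": 28}
queries = {name: active_users_sql(days, date(2023, 3, 31))
           for name, days in WINDOWS.items()}
```

The point isn’t the string templating—it’s that the metric is defined once and the variants are derived, so “which one do I materialize?” stops being a fork in your DAG.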

They decide they need another dimension and try to write a join, but the number comes out more than an order of magnitude off. They quickly get confused by all the options, and the docs site shows a spaghetti of lineage explaining which other tables exist, but not how to consume them.

Or, they don’t try at all. They say, “Can you just pull this number for me? Monthly revenue. Interesting… slice it by product category. How about weekly? No wait, show me by country. Oh, and show me customer tickets too. Now make that a rolling 7-day value.” At best, they’re thankful and patient with you, but they feel guilty because you haven’t done anything else for a week. You try to add these dimensions to the dataset, but you know they’ll just have new questions next time.

Configuration as code

DevOps engineers were in a similar place not so long ago. Any time a system needed to change, it was a painful, manual process filled with ad-hoc scripts and cloud console tinkering. It wasn’t just painful; it was also costly: websites went down much more frequently.

The solution was the “infrastructure as code” movement, where tools like Terraform introduced higher-level abstractions, making change management as easy as a PR and building new services as easy as referencing existing systems. DevOps engineers moved up a level of abstraction and got a ton of time back that they could re-allocate to higher-value work.

If we, as data professionals, want to focus on higher impact work, we need a similar revolution in our stacks.

The warehouse must be as dynamic as our organization

It turns out, we are bumping up against the edges of something bigger, but most of us don’t see it yet. We’re so used to working with our DAGs that we cannot free ourselves of the idea that “if I just add this model, it’ll all be better.” What if our manual expression of SQL is holding us back?

What we have all been getting wrong is that the definition of a metric is not just the final node of the DAG or BI tool. Instead, metric definitions are dependent on all the processes upstream—the expression of joins across source tables, filters to datasets, aggregations of facts throughout the data model. But today, we define these pipelines manually, and those joins, filters, and aggregations are repeated many, many times.

What if those components could be contained in logical abstractions and reused across many metrics?

  • We could build derived tables using lists of objects, the metrics, dimensions and entities, rather than manually expressing the SQL to render those objects in every data model. Multiple models with overlapping objects could be refactored with only a few lines of code by redefining one of those objects.
  • Every application could leverage those definitions. A request for any metric by any dimension would be compiled to SQL and served. The user could explore the entire data model freely and safely.
  • We could build smarter abstractions on top of this layer: we could cache data and route requests to the right table, hiding the data catalog’s complexity, or reorganize our DAGs for efficiency without manual refactors. We could build more complicated datasets, such as those needed for experimentation.
  • Translation between logic expressed in abstractions and business concepts could unlock self-serve experiences that do not confine users to unsatisfying analysis or overwhelm them with flexibility.
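To make the idea above concrete, here is a deliberately tiny, hypothetical sketch of that compile step: metrics and dimensions declared once as plain data, and any (metric, dimensions) request rendered to SQL. Every name here (`orders`, `amount`, `country`, `category`) is illustrative, and real joins, filters, and caching are elided.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str   # how the metric is requested
    expr: str   # aggregation over the fact table
    table: str  # fact table the aggregation reads from

# Declared once, reused by every query that references them.
METRICS = {
    "revenue": Metric("revenue", "SUM(amount)", "orders"),
    "order_count": Metric("order_count", "COUNT(*)", "orders"),
}
# dimension name -> column on the fact table (joins elided for brevity)
DIMENSIONS = {"country": "country", "category": "category"}

def compile_query(metric: str, dims: list[str]) -> str:
    """Compile a request for any metric by any dimensions to SQL."""
    m = METRICS[metric]
    cols = [DIMENSIONS[d] for d in dims]
    select = ", ".join(cols + [f"{m.expr} AS {m.name}"])
    sql = f"SELECT {select} FROM {m.table}"
    if cols:
        sql += " GROUP BY " + ", ".join(cols)
    return sql
```

`compile_query("revenue", ["country"])` and `compile_query("order_count", [])` both read from the same definitions, so redefining a metric or dimension in one place updates every query that uses it—the few-lines-of-code refactor described above.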

The opportunity in front of us

The implications of elevating the analyst above the abstractions of today are significant. We have made so much progress, yet, we are still leaving so much value on the table.

We dream of novel applications for our data. We may even plan to get to them “shortly” in just a few quarters, but somehow by the time we get there, they are as far away as ever.

The ideas for what we could accomplish by moving up a level of abstraction are ambitious and would meaningfully alter our paths as data teams. This evolution will require thoughtful design and significant contributions from a larger group, but I believe we are closer than you might think.