Metrics Store: My introduction
I believe we each get a few moments in our careers where we experience something so groundbreaking and significant that it impacts our personal trajectory. Airbnb’s creation of their metrics store, as part of their experimentation tool, was one of these moments for me. To share why the metrics store has so meaningfully impacted my day-to-day, I've accumulated a few stories about its effect on me and on my team. The core of what this tool did well was denormalization, by definition a repetitive task, and it did it at scale, consistently, and reliably.
As an onboarding task during my first week as a Data Scientist on Airbnb’s Growth team, I was asked to analyze an experiment that had run a few weeks before I joined. The experiment tested the optimal send time for a customer email. In the prior few months, Airbnb had been investing in creating a clean, and primarily normalized, section of its data warehouse they called core_data. My task was relatively straightforward -- take the experiment assignment logs and join them to various datasets published by the central data engineering team.
Using the cleaned-up datasets in core_data simplified my task, and when I needed additional datasets, I was able to ask the other data scientists on my team where to find them and how to query them. By the end of my first week, I enthusiastically turned around my analysis, querying core_data along with a few rarely touched datasets of email data.
I produced a report and some insights about what experiments we could attempt next. I soon realized that I wouldn’t be able to support the rate of experimentation needed by the growth product team, whose six engineers could produce multiple product improvements each week. Ideally, each product change would be measured and only launched following a product experiment. Given the time it took to analyze each experiment, it wasn’t going to be possible to measure every product change, and the bottleneck was going to be me.
Serendipitously, for my career path, Airbnb’s data tools team had a project in the works to launch a tool they called ERF (experiment reporting framework). Their project plan included three elements:
- Defining metrics centrally
- Building better tooling around tracking experiment assignment
- Creating an interface for product teams to report results from their experiments
In my second week, a data engineer released a beta version of this tool. As someone who almost immediately felt like a bottleneck, I eagerly picked up the tool, along with a few other data scientists who had felt that pain for much longer.
I jumped in and began defining metrics around some of the core datasets the company had produced that were relevant to me, and some metrics on the email datasets I was working on. The tool took in a query defining a data source and used some simple logic to define how to construct each metric built on top of that data source. The output was a data pipeline that ran each night and returned an experiment readout the next day.
This tool sounds simple and, at the time, it was relatively simple. Metric logic was defined in YAML: SQL and simple abstractions as input, aggregated experiment datasets as output. The tool grew more robust while I grew as a data scientist at Airbnb. Over the next two years, I owned the analysis and deeper exploration of more than 150 product experiments, five to ten times the number I could have achieved without ERF. My role transitioned from writing manual, repetitive queries investigating a few metrics to running tens of experiment analyses on hundreds of metrics every day and then digging into the surprising results.
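The core mechanic described above, a declarative metric definition compiled into repetitive aggregation SQL, can be sketched in a few lines. Everything here is a hypothetical illustration of the pattern, not ERF's actual definition format or code:

```python
# Hypothetical metric definition: in the real tool this lived in YAML,
# here it is represented as the parsed equivalent.
metric = {
    "name": "nights_booked",
    "source": "core_data.bookings",  # hypothetical source table
    "expr": "SUM(s.nights)",
    "join_key": "user_id",
}

def compile_metric_sql(m: dict) -> str:
    """Generate the per-variant aggregation query for one metric.

    This is the repetitive denormalization work the tool automated:
    the same join-and-aggregate shape, stamped out for every metric.
    """
    return (
        f"SELECT a.treatment, {m['expr']} AS {m['name']}\n"
        f"FROM assignments a\n"
        f"LEFT JOIN {m['source']} s ON a.{m['join_key']} = s.{m['join_key']}\n"
        f"GROUP BY a.treatment"
    )

print(compile_metric_sql(metric))
```

Define the metric once, and every experiment's nightly pipeline can reuse the same generated query shape, which is what made running hundreds of metrics per experiment feasible.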
Identifying otherwise missed opportunities
One of my early assignments was analyzing email experiments. In one of those experiments, we tried sending an email at various times after a user abandoned their session at the final booking screen. We tried a few different delays: one day, three days, and seven days. The results followed our hypothesis: sending the email sooner was more impactful. But the magnitude of the impact was curious.
Had I calculated only a few metrics and seen their impact, I wouldn't have been surprised by the result. But, since I was able to use ERF to add hundreds of metrics to the analysis, the results surprised me to a degree that required additional exploration. Both the magnitude and the breadth of the impact were significant. The one-day variant moved nearly every metric, even metrics that didn't have anything to do with the booking flow, like customer service contacts, hosts updating their calendars, and many more. The results raised an alarm for me; I worried there had to be something wrong.
After digging in, I discovered nothing was wrong with the experiment; the actual driver was two characters in our code base. Airbnb was logging users out after 48 hours of inactivity. This raised the question: what would happen if we increased the inactivity window before session expiration?
Establishing trust across teams
Next, we started running experiments increasing the length of user sessions after inactivity. Of course, the security team had some very reasonable questions about how these changes to authentication would impact some of their core metrics. Luckily, they were also keen to adopt the company's new metrics store in the experimentation framework and had recently begun to define some of their key metrics in the shared space.
In the metrics store, we were able to select their key metrics and add them to the configs for the experiment I was tracking. The next day, the pipelines had run, and we had a clear answer: their metrics were not materially impacted in the short term. We had the green light to keep exploring. As a data scientist, you often act as the arbiter of debates between product teams, and distrust in data is rampant, a constant pain that limits decision-making.
Without the metrics store, I would have contacted their data scientist to find out where the best, cleanest available datasets were for producing their key metrics. I probably would have spent some time exploring their metric definitions, calculating their metrics, and then returned my analysis to both my team and their data team. If there were differences between my construction of their metrics and their own method, they may have gone unnoticed and ultimately created anomalies in the experiment analysis.
With the metrics store, I could confidently and easily add the metrics they were tracking to my experiment. Without me needing to know the definitions, data sources, and all the caveats of building those metrics, my analysis would be constructed in a neutral space, where the various stakeholders had already agreed on the nuances. I could safely turn around an analysis that gave them a clear picture of how my team's product changes were impacting their core metrics as defined by them.
One of the most painful and challenging positions in data analysis is holding a belief rooted in data, and then sharing that data, only to discover that it falls apart under the scrutiny of a domain expert. It was unfair from the start: you didn’t spend every day querying that data. Some amount of this is inevitable; analysis is hard. However, calculating a metric precisely is one of the most difficult problems to solve at the scale of the modern data organization. Inevitably, the outcome is two analysts sitting in a room or on a call debating the nuances of a few hundred lines of SQL.
Optimizing at global, rather than local, scale
The end result of these learnings on session extension was one of the most positively impactful experiments run on Airbnb's north star metric in 2015. At its core, this was changing a number in a single line of code. This one learning churned out tens of experiments and led to several significant, positive movements in the company’s core metrics. Most of this seems obvious now, but anyone who works on product experimentation knows that what seems obvious is either trivially true or shockingly wrong, but always provides insights.
It turns out that several teams at Airbnb had built features that relied on a user being logged in. By altering when we logged out users, we were also increasing how many user sessions happened while logged in. A key metric for the product team’s vision was the percentage of bookings that happened through the instant book feature.
Over time, Airbnb's data analysts defined thousands of metrics and nearly every team in the company adopted the tool. The data tools team continued to expand functionality. As someone who had invested most of my time at the company consuming from the tool, I was also vocal about features that would increase my impact. One particular feature added a page that allowed anyone to see the outcome of every experiment currently being run that impacted a specific metric.
While it wasn't my goal to optimize this metric, the PM for the team who did own it messaged me after navigating to the page that showed the impact of every experiment on their metric. At the time, hosts had settings that enabled instant bookings only for previous bookers. Increasing the ratio of logged-in users browsing the site was not a point of optimization for their team, but it turned out to have a dramatic impact on their key metric.
It turns out that the product experiments we were running could reveal insights to other teams. I never intentionally performed this analysis and probably wouldn't have, given the optimizations that mattered to my team, but the results led to an additional dimension of optimization that, again, had a meaningful impact on the company.
Without a metrics store, I doubt I could have caught that logging out inactive users was affecting my experiments. The ease of analyzing trusted metrics and turning around my experiments and analysis enabled me to widen my view and bring invaluable evidence. Put simply, I most likely would not have had the time to be curious and dig into the unusual pattern I was observing. With metrics tooling, the sheer amount of clean, summarized data presented to me revealed a path that deserved deeper analysis, and I had the time to do it.
Generalized metrics stores
After defining hundreds of metrics in the experimentation framework and then consuming tens or hundreds of thousands of lines of automatically generated queries, there is no doubt that my impact as a data scientist at Airbnb was dramatically greater than it would have been without this tool. It shaped my career for a number of years, propelled me to new opportunities, and gave me the platform to have an impact that I am proud of.
Although the majority of my time using this tool was through the lens of experimentation, there were a number of occasions where we launched fake experiments to use the automatically generated pipelines to build these denormalized metric datasets. Separately, another data scientist built an API on top of the ERF tool, which allowed data scientists to generate SQL using the metric definitions. It didn’t take hold at the time, but the methods carried through to the evolution of Airbnb’s metrics tooling today and the tooling we are now working on at Transform.
A metrics store helps analytically minded organizations spread consistency and empower broader use of data across the organization. At its core, every tool in the data space is working to make the data worker more productive or to build more trust in the data itself, and the metrics store is no different. The distinction between this tool and most other related tools is that a well-implemented metrics store can improve trust and productivity across many downstream tools.
At Transform, our mission is to make data accessible. We believe that the metrics store is a foundational piece of technology that will allow modern data organizations to improve their interfaces to data. The keys to that impact are centered around the empowerment of data consumers, both in producing insights faster and in creating insights that are trusted more broadly. While a few of the largest organizations have invested significant resources to build this tooling, most organizations still lack access to the tooling that enabled this impact.
If you're interested in learning more about what we're building, please contact us at email@example.com.