Best practices for building an enterprise data catalog

Keeping track of organizational assets isn’t always easy. And it gets even more complicated as companies scale. This challenge is exacerbated by widespread digitalization across industries—led by hybrid and full-cloud deployments.

Additionally, companies will oversee a staggering quantity of data—175 zettabytes—by 2025. Much of this proprietary data has business value. To boost discoverability, companies are investing in data catalogs to create an inventory of all of their data assets:

  • Structured, tabular data
  • Unstructured data (documents, files, webpages, mixed-media content, mobile data, emails, social media data)
  • Reports and query results
  • Visualizations and dashboards
  • Machine learning training models
  • Database connections

How easily team members can access their data, even with valid privileges and credentials, has always varied between companies. Data catalogs help analysts, scientists, and others rapidly tap into targeted data stores. Accordingly, the assets within data catalogs showcase project milestones, support analytical processes, and facilitate remote productivity. Catalogs also hold granular metadata describing each asset’s structure, administration, and business context.

At Transform, we recognize the importance of these data stores; they have immense value to the companies that rely on them. Accordingly, we also believe that a newer type of data repository—the metrics catalog—can fundamentally change how organizations centrally harness their data for the better.

This guide explores best practices central to building an enterprise data catalog. These guidelines are meant to promote a holistic approach, establish common requirements for cataloging projects, and describe general rules for problem-solving. Additionally, we’ll discuss the core differences between data catalogs and metrics catalogs.

Leave no data types behind

Compared to structured data that resides in tables, unstructured data was previously much harder to search through and discover. The rise of purpose-built AI analytics tools has made it much easier to tap into this diverse information. It’s now possible to build a data catalog incorporating all data types, including, but not limited to:

  • Sales figures
  • Customer data
  • Industry-dependent data (like patient records in healthcare or social trends in political campaigns)
  • IoT device data
  • Digital metrics and time-stamped data
  • Open data, publicly available through external parties
  • Infrastructure and monitoring data
  • Logs
  • Real-time data (patient vitals, GPS location, etc.)

Your data catalog should also incorporate metadata. Metadata boosts discoverability by a wide margin, since this data describes a certain data asset, and because those unique descriptors are typically searchable.

Three types of metadata to consider

There are three core types of metadata worth adding to your catalog:

1. Technical metadata. Describes how data is structured and stored, whether in tables, columns, rows, indexes, or other forms. It gives professionals hints about how they can use the data and whether any manipulation is necessary beforehand.

2. Process metadata. Records when and how data was created and covers who accesses the data and from where. Functioning as an “activity log” of sorts, it also describes access permissions, data lineage, and general reliability. Process metadata is also used for troubleshooting purposes.

3. Business metadata. Describes the business value of certain data, and whether or not it’s suitable for a given purpose. Business metadata has cross-team relevance, generally enables discussion about certain assets, and even links back to compliance.
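To make the three metadata types concrete, here is a minimal sketch of how a single catalog entry might carry all of them. The class names, field names, and sample values are hypothetical and chosen only for illustration; real catalog tools define their own schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    # How the asset is structured: format, columns, indexes.
    data_format: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> type
    indexes: list[str] = field(default_factory=list)


@dataclass
class ProcessMetadata:
    # When and how the asset was created, and who has touched it.
    created_at: datetime
    created_by: str
    lineage: list[str] = field(default_factory=list)     # upstream asset names
    access_log: list[str] = field(default_factory=list)  # "user@location" entries


@dataclass
class BusinessMetadata:
    # What the asset is for, who owns it, and which rules apply to it.
    description: str
    owner_team: str
    compliance_tags: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    process: ProcessMetadata
    business: BusinessMetadata

    def search_text(self) -> str:
        """Flatten the searchable descriptors into one lowercase string."""
        parts = [
            self.name,
            self.technical.data_format,
            *self.technical.columns,  # unpacks the column names
            self.business.description,
            self.business.owner_team,
            *self.business.compliance_tags,
        ]
        return " ".join(parts).lower()


# Hypothetical sample entry.
entry = CatalogEntry(
    name="orders_2024",
    technical=TechnicalMetadata("parquet", {"order_id": "int", "total": "float"}),
    process=ProcessMetadata(datetime(2024, 1, 5, tzinfo=timezone.utc), "etl_job"),
    business=BusinessMetadata("Completed customer orders", "Sales", ["SOX"]),
)
```

Keeping the three groups separate makes it easier to decide which descriptors feed the search index (here, names, formats, descriptions, and compliance tags) and which stay as audit detail.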

Diagram showing flow from data sources like AWS and GCP through a Data Catalog and onto users.
How data flows from sources to end users

Overall, transparency and searchability are key within a data catalog. The repository should act as a single source of truth, applicable to all authorized users and business units.

These catalogs are visual interfaces in their own right. The user interface should ideally be clean and include other discoverability features like filters to minimize search times. Many third-party tools will group assets into folders based on department, team, or individual user. It’s also important to keep track of text-based conversations, which can contain vital project information.

What if data resides in numerous locations? AWS S3, Azure Database, Dropbox, and Google Cloud Storage are popular resting places for company data. Creating a data catalog means unifying this information, and therefore pulling data from different cloud services. This is a common challenge with multi-cloud deployments—and even hybrid infrastructure—and requires clear understanding of your data’s overall footprint.

Manage sensitive data carefully

Maintaining data security and privacy is non-negotiable regardless of your industry. Companies must meet compliance requirements (HIPAA, SOX, FISMA, etc.) and supplemental cybersecurity guidelines (NIST and SOC 2) when designing their systems. You must keep these things in mind when designing your data catalog. Should you opt for third-party cataloging, remember to evaluate if and how these vendors meet those obligations. Do they exceed them?

How do we define sensitive data? Personally identifiable information (PII) is highly sensitive, especially in certain industries like health and finance, since it’s linked to specific individuals. Accounts and identities can be compromised if login credentials, addresses, phone numbers, payment information, and records escape the walled garden.

Going through data piece by piece is possible, but it’s incredibly time-consuming. Human error also comes into play. The introduction of new AI-driven detection algorithms has accelerated the data sorting process immensely. These algorithms are effective for analyzing images (e.g., passports, social security cards, medical records, bills), while optical character recognition (OCR) can comb through unstructured and structured text alike. These core technologies have existed for some time, and generally can pick out sensitive information with great accuracy.
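For a sense of what automated detection does at its simplest, the sketch below scans text for a few PII formats using regular expressions. This is a rule-based stand-in, not the AI-driven detection described above, and the three patterns are simplified assumptions that will miss many real-world variants.

```python
import re

# Simplified, assumed patterns; production detectors cover far more formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Map each pattern label to the matches found in the text."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


# Made-up sample text for illustration.
sample = "Contact jane.doe@example.com or 555-867-5309. SSN on file: 123-45-6789."
flagged = find_pii(sample)
```

A catalog could run a check like this during ingestion and tag any flagged asset for review before it becomes searchable.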

Security through anonymity

You should anonymize sensitive data wherever possible. Should this information ever fall into the wrong hands, its utility will be quite limited. That’s true for both external and internal bad actors. As noted in this article by Imperva, you can do this by:

  • Encrypting data
  • Masking data by adding values, substituting characters, or shuffling them around
  • Using pseudonyms and fake identifiers
  • Eliminating certain identifiers to avoid unnecessary specificity
  • Performing data swapping
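Three of the techniques above (masking, pseudonyms, and eliminating identifiers) can be sketched with Python’s standard library. The function names, the "user_" prefix, and the sample key are hypothetical. Treat this as an illustration under those assumptions rather than a vetted anonymization scheme.

```python
import hashlib
import hmac


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Substitute every character except the last `visible` ones."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a stable pseudonym derived from a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def drop_identifiers(record: dict, fields: set) -> dict:
    """Remove fields that add unnecessary specificity."""
    return {key: val for key, val in record.items() if key not in fields}


# Hypothetical key and record; a real key would live in a secrets manager.
key = b"example-key-not-for-production"
record = {"name": "Jane Doe", "card": "4111111111111111", "zip": "94107", "plan": "pro"}

safe = drop_identifiers(record, {"zip"})
safe["card"] = mask(safe["card"])
safe["name"] = pseudonymize(safe["name"], key)
```

Because the pseudonym is keyed, the same person maps to the same identifier across datasets, which keeps joins working while the real name stays out of the catalog.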

Basic housekeeping

Identify redundant data: Because companies manage such massive quantities of data, identifying *redundant data* is also essential. There’s the cost aspect—storage capacity costs money—as well as the management aspect. When copies of data exist in separate locations, modifications must be applied in all locations. This is especially true with SQL databases. Excess redundancy increases data catalog size, undermines efficiency, and can even cause corruption in some cases. Detecting these resources and consolidating them is crucial.
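One common way to detect exact copies is to compare content hashes instead of file names. The sketch below assumes the asset contents are already available as bytes, and the storage locations are invented for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Fingerprint an asset by its contents, not its name or location."""
    return hashlib.sha256(data).hexdigest()


def find_redundant(assets: dict[str, bytes]) -> list[list[str]]:
    """Group asset locations whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for location, data in assets.items():
        by_hash[content_hash(data)].append(location)
    return [sorted(locations) for locations in by_hash.values() if len(locations) > 1]


# Hypothetical locations and contents.
assets = {
    "s3://bucket-a/report.csv": b"id,total\n1,9.99\n",
    "gcs://bucket-b/report_copy.csv": b"id,total\n1,9.99\n",
    "dropbox://team/notes.txt": b"meeting notes",
}
duplicates = find_redundant(assets)
```

This only catches exact duplicates. Near-duplicates, such as the same table exported in two formats, need fuzzier comparison.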

Naming and readability: Since your data catalogs are going to be used by humans, the items contained within must be human readable. Contextually naming all items gives them added significance and meaning, and makes your repositories more scannable.

Consistent naming: Files should also be named consistently, so users know what to expect while browsing. Data names that are too lengthy (say, over 25 characters) are too complex; simplicity is preferable. Avoid using special characters, as these aren’t highly readable and can make search results unpredictable. Version numbers and standardized date formats—as applicable to a given resource—also provide helpful context for team members.
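Naming rules like these are easy to check automatically. The sketch below uses the 25-character threshold mentioned above and assumes a lowercase convention of letters, digits, and underscores. That convention is our assumption, so adapt it to whatever your team standardizes on.

```python
import re

MAX_NAME_LENGTH = 25  # example threshold from the text
ALLOWED = re.compile(r"[a-z0-9_]+")  # assumed convention: lowercase, digits, underscores


def check_name(name: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if not ALLOWED.fullmatch(name):
        problems.append("must use only lowercase letters, digits, and underscores")
    return problems
```

For example, `check_name("sales_q3_2024_v2")` returns an empty list, while `check_name("Q3 Sales (final)!")` reports the character problem.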

It’s useful to employ an SEO-centric approach to your cataloging. But instead of optimizing for Google’s search results, your goal is to serve the most *relevant* data based on user search terms.
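As a rough illustration of relevance ranking, the sketch below scores catalog entries by how many search terms appear in their name and description. The weighting (name matches count double) and the sample entries are assumptions made for the example. Real catalog search engines use far more sophisticated ranking.

```python
def relevance(query: str, entry: dict) -> int:
    """Score an entry by search-term hits in its name and description."""
    name = entry["name"].lower()
    description = entry["description"].lower()
    score = 0
    for term in query.lower().split():
        if term in name:
            score += 2  # assumed weighting: a name hit counts double
        if term in description:
            score += 1
    return score


def search(query: str, entries: list[dict]) -> list[str]:
    """Return the names of matching entries, most relevant first."""
    scored = [(relevance(query, entry), entry["name"]) for entry in entries]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in matches]


# Hypothetical catalog entries.
entries = [
    {"name": "server_logs", "description": "Infrastructure monitoring logs"},
    {"name": "customer_profiles", "description": "Customer contact details"},
    {"name": "orders_2024", "description": "Completed customer orders by day"},
]
results = search("customer orders", entries)
```

Entries with no matching terms are dropped entirely, so users only see assets that relate to what they typed.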

Employ proper data validation

When we use data to support business operations and drive projects forward, data quality is critical. Data validation processes cleanse company data—essentially guaranteeing that each piece is accurate, unmodified, and uncorrupted. Unfortunately, data loss isn’t uncommon during transit, which happens when data moves from cloud storage to the catalog. Validation helps you spot incomplete resources and update them accordingly to fix any inconsistencies.

Again, automation can work wonders here by streamlining otherwise tedious tasks, and this work is usually best left to external tooling. Automated processes can detect missing or corrupt data, then replace, modify, or delete troublesome items. Manual validation scripting, by comparison, can be woefully inefficient.
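To show the kind of checks such tooling performs, here is a small sketch of two of them: comparing checksums to catch corruption in transit, and flagging records with missing required fields. The function names and sample records are hypothetical, and a dedicated validation tool would cover far more cases.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Fingerprint a payload so two copies can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def validate_transfer(source: bytes, received: bytes) -> bool:
    """True when the received copy matches the source byte for byte."""
    return checksum(source) == checksum(received)


def find_incomplete(records: list[dict], required: set[str]) -> list[int]:
    """Return the indexes of records with a missing or empty required field."""
    incomplete = []
    for index, record in enumerate(records):
        if any(record.get(name) in (None, "") for name in required):
            incomplete.append(index)
    return incomplete


# Hypothetical records arriving in the catalog.
records = [
    {"id": 1, "total": 9.99},
    {"id": 2, "total": None},  # value lost in transit
    {"id": 3},                 # field missing entirely
]
needs_repair = find_incomplete(records, {"id", "total"})
```

The flagged indexes tell you which resources to re-fetch or repair before anyone relies on them.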

The financial benefits of sound validation can’t be overstated. By some estimates, inadequate data validation costs companies anywhere from $2.5 trillion to $3 trillion annually.

Bad data can lead to poor decision-making, which may noticeably impact your organization’s bottom line.

Data catalogs vs. metrics catalogs: What’s the difference?

Simply speaking, *data catalogs* excel at organizing and surfacing key data assets across your organization, while giving context around their unique uses. Meanwhile, metrics catalogs focus on surfacing key business metrics, providing a centralized place to see how a metric is performing alongside the business context. Additionally, you can dig into the lineage behind a metric to understand how it was defined or if it has changed over time.

Like data catalogs, metrics catalogs are centered around enablement. These bits of data allow organizations to see how they’re performing, but only if they’re widely available to those who can harness them.

Chart in Transform showing messages sent over the last 365 days with an annotation around a spike in users from Greece.
Example of a chart for "Messages Sent" metric in Transform's Metrics Catalog. 

Thankfully, metrics catalogs (like the Transform metrics catalog) offer democratized access to metrics-based knowledge, meaning knowledge with business significance, without alienating team members based on department or technical acumen. Everyone from DevOps to marketing can benefit from having information at their fingertips, without pulling teeth to actually understand it. Metrics are essential for helping organizations tell a story.

List of annotations in Transform's Metrics Catalog, including a multi-person message bug, a training webinar increase, and changes in messaging around product instructions.
Example of annotations in Transform's metrics catalog. These provide context into why a metric might have changed. 

That said, which metrics matter most will depend on your business. Let’s say you’re an e-commerce store. By far, some of the most critical metrics to track would be order values, discounts, volumes, and customer activity relative to key dates or time of day. Tracking consumer habits and performance of goods and services is important. Monitoring bounce rates, click-through, and other metrics involving user interaction is also helpful.
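A few of these e-commerce metrics can be computed from raw order records in a handful of lines. The sketch below uses made-up orders and hypothetical field names, and it is not how Transform defines metrics.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical order records.
orders = [
    {"placed_at": datetime(2024, 11, 29, 9, 15), "value": 40.0, "discount": 5.0},
    {"placed_at": datetime(2024, 11, 29, 9, 50), "value": 60.0, "discount": 0.0},
    {"placed_at": datetime(2024, 11, 29, 20, 5), "value": 20.0, "discount": 2.0},
]


def average_order_value(orders: list[dict]) -> float:
    """Mean order value across all orders."""
    return mean(order["value"] for order in orders)


def discount_rate(orders: list[dict]) -> float:
    """Total discounts as a share of total order value."""
    total_value = sum(order["value"] for order in orders)
    return sum(order["discount"] for order in orders) / total_value


def orders_by_hour(orders: list[dict]) -> Counter:
    """Order volume grouped by the hour of day the order was placed."""
    return Counter(order["placed_at"].hour for order in orders)
```

Defining each metric once, in one place, is what lets every team read the same number instead of recomputing it slightly differently.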

Metrics are distinct from data in this sense because they’re largely quantitative. They offer an objective overview of business success based on hard numbers. These statistics apply to professionals of all backgrounds. They’re easier to model, average, or manipulate in ways that don’t apply to resources like documents.

This is where the benefits of metrics catalogs come into play. It’s much easier to expose this information to something like AI algorithms or automated workflows, which can then generate insights much faster than one could manually.

Conclusion

Building a data catalog certainly isn’t easy. The endeavor requires strong planning, plus a deep understanding of what types of data you oversee and where it’s stored. You’re shooting for unification while keeping user experiences in mind. It’s not unlike creating helpful documentation for a software product. Good cataloging will positively impact your daily operations.

Are you looking to catalog your business’s key data and metrics? Transform’s Metrics Catalog introduces a centralized approach to data organization. Transform’s centralized metrics store helps democratize access to data in the language that people already use—KPIs and metrics. Keep teams up-to-date on metrics changes and build institutional knowledge with business stakeholders.

The metrics layer, known as the Metrics Store, sits between your data storage, orchestration, and analytics layers. Transform supports metrics definitions as code and data lifecycle management. It also integrates seamlessly with any downstream platform, which unlocks rich contextual visualizations based on your critical metrics. This flexibility makes it much easier to build your perfect data catalog.

Want to learn more? Access our demo to see if Transform is right for you.

This post is authored by Tyler Charboneau. Tyler is a hardware-software devotee and researcher. He specializes in simplifying the complex while speaking effectively to all audiences.

Guest Author

This author is a friend of Transform.