Thought Leadership

The Great Data Paradox - Governance vs. Self-Service

by Josh Berry on 9/9/2022

Delivering trustworthy analytics is hard. Data teams support a multitude of internal stakeholders across various functional business lines. They all demand data and want self-service, but the data team is responsible for ensuring the answers from that data are consistent and accurate. 

Self-service analytics and data governance can seem like a paradox. On the one hand, if I give the business ungated access to our enterprise data, they will initially be happy. But if the answers from that data are inconsistent or inaccurate, I know I’ll be the one who ultimately gets blamed. 

Every modern data leader has faced this conundrum; you are not alone. 

This blog discusses why the data industry needs self-service with guardrails and how to deploy standardized, self-service metrics within your organization. We should be unsatisfied with solving for only one or the other. We should have our cake and eat it too! The lessons presented in this blog are heavily influenced by our experience in the data ecosystem over the past decade and by conversations with hundreds of data teams who have all struggled with the same challenge.

A Brief History of the Issue: Remember When Metric Definitions Were Centralized?

For those who aren’t old enough to remember, OLAP cubes were invented in 1993 and were all the rage around 2005. In those days, there was a surge in the need for Business Intelligence (BI), which put the power in the hands of the users to generate their own reports and write their own queries. Back then, the problem was, quite simply, processing power. Since one rogue query could bring down the company’s operational database, everything was offloaded into a separate database to isolate the computation. And since systems weren’t powerful enough to serve hundreds of similar queries, metrics were pre-computed along with dimensions into a “cube.”

OLAP cube, historically used to calculate metrics.

One of the nice benefits of this was that there was only one official definition of a given metric. Want to look at the metric “Customer Satisfaction” over time? The cube already has that, pre-calculated. Want to focus on a single Product Category and look at Revenue over time? Again, because of the structure of the pre-computed cube, you only ever received one correct answer. Therefore, we enjoyed centralized logic definitions without even knowing it was something to be desired.

Well, Modern Data Tooling Has Created “The Great Data Paradox”

As modern computing advanced, database systems amassed computing power that nobody in 1993 could have imagined. Traditional BI, which required specially built applications to operate between the business user and the data, gave way to Self Service BI, which promised to empower more business users to slice and dice data on their own. The promise, which was largely delivered, was that the business could “have the data when they need it.” 

Subsequently, the prolific rise of self-service BI tools made governance a challenge. In 2005, data governance was relatively simple. You had one OLAP system, and you knew exactly which applications and users had access to each metric. Fast forward to today, and a company’s data is spread across perhaps five or six different locations, each designed to deliver on the promise of self-service. Not only does governance become more challenging in a modern data environment, but we find ourselves with an unexpected modern data paradox – governance and self-service appear to oppose one another. Since the charter of data leaders is to standardize key data assets to provide governance and to empower the democratization of self-service, this represents a massive challenge that can seem impossible to solve.

In summary, we see that OLAP provided the governance we wanted, while BI gave us the self-service we needed. The paradox is that these two aspects of data management appear to be at odds with one another. 

How governance inhibits self-service:

  • Policies - governance policies restrict who can see what, so users are tempted to “make it work” with whatever they have (rather than finding the correct metric)
  • Standards - definitions are often established too broadly, e.g., “churn” gets defined at too high a level and never acknowledges the need for “voluntary” and “involuntary” as two different metrics. This happens because the standard-makers are disconnected from real analytics.
  • Discovery - governance usually stops at policy and restriction and does not go so far as to make things discoverable. Cataloging efforts are often “on the side” and produce a nightmare of outdated documents and wiki pages.

How self-service defies governance:

  • Silos - almost all self-service tools push for the “semantics” to be defined in the tool itself, not upstream in the database.
  • Documentation - someone can define a metric in a self-service tool, but the actual definition in terms of the source data is lost and not exposed. Can a SQL programmer understand the definition built into a dashboard? Can a business user understand the definition built into SQL? Unfortunately, no. 

So… how does this play out in the real world?

Let us examine the journey of a large telecommunications company. In 2010, analysts did most of their work through Traditional BI and Essbase (OLAP), which meant that self-service was low, but standardization was high. Any analyst in the company could query “what was West Division’s churn rate last April?” and get the same answer.

Also, around that time, the company had started moving data into modern systems, where analysts who knew SQL could write their own queries. This increased self-service only slightly, because most analysts were not strong SQL programmers. But the SQL-equipped analysts gave business leaders something they craved – the ability to get data and answers quickly without relying on IT. So when the market for self-service BI exploded, with Tableau at the forefront, the business was already hungry. Suddenly, in a matter of a few years, there were no fewer than six different places to analyze data (not counting directly querying databases), each providing some sort of self-service function across the enterprise. OLAP, with its implicit benefits in governance and standardization, got phased out to make way for self-service with BI.

For a subscription-based company, churn is typically the metric that rules them all. In the days of OLAP, everyone sliced and diced churn the same way and got the same answer. In the self-service world, Sales examined churn across different sales channels, Finance sliced churn based on product tier, and Marketing filtered churn based on Designated Marketing Area (DMA). Quite often, an executive, who is constantly being fed data from multiple departments, would question an analysis because the numbers didn’t seem to line up with each other. 

The root cause is that self-service enabled micro-adjustments to common metrics, which were made with the best intentions but ended up contributing to massive disarray across the entire enterprise. For example, when a subscription ends, it is churn. Plain and simple. But what if the customer dies? Or moves to France? Finance views that as a loss of revenue, so obviously they would still count those as churn. However, the retention department doesn’t count that as churn, since nothing could have been done to prevent it. Their goal is to reduce churn, so they create a definition that helps them focus on that. Without knowing it, the enterprise was operating on three churn metrics: Total Churn, which consists of Voluntary and Involuntary Churn. Imagine a dozen intricacies like this across hundreds of metrics, departments, and lines of business. The micro-adjustments to each metric, while useful for individual business units, sow chaos into a data-driven organization.
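To make the divergence concrete, here is a minimal SQL sketch of how two departments can compute “churn” from the same data and get different numbers. The table and column names (subscriptions, end_date, cancel_reason) are hypothetical, but the pattern is typical:

    -- Finance: any ended subscription is lost revenue, i.e., Total Churn.
    SELECT DATE_TRUNC('month', end_date) AS churn_month,
           COUNT(*) AS total_churn
    FROM subscriptions
    WHERE end_date IS NOT NULL
    GROUP BY 1;

    -- Retention: only preventable cancellations count, i.e., Voluntary Churn.
    SELECT DATE_TRUNC('month', end_date) AS churn_month,
           COUNT(*) AS voluntary_churn
    FROM subscriptions
    WHERE end_date IS NOT NULL
      AND cancel_reason NOT IN ('deceased', 'relocated')
    GROUP BY 1;

Both queries are reasonable on their own; the problem is that each department’s WHERE clause quietly becomes its own definition of “churn.”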

To try to control the chaos, an official Data Governance organization was formed, which came with its own technology purchases, implemented a “semantic layer,” and attempted to tame the beast. This is where the modern data paradox rears its ugly head. On the one hand, governance did a great job documenting all of the different metrics and their variations, and coordinating the debates about official definitions. On the other hand, actual governance was nearly impossible to implement across so many different systems and organizations.

Here is a diagram highlighting the paradox in action:

A representation of the modern data paradox, using Collibra, Tableau, Looker, Power BI, and Snowflake as examples.

The obvious simplification is consolidating self-service BI tools and connecting everything to one semantic layer. However, why should we have to choose?

Companies that find themselves in this situation are presented with a false choice between flexible Self-Service BI and metric governance. How your company tries to control the chaos matters - what you decide to do next will impact the future of your company’s data culture. Although each company we’ve worked with is unique, we’ve compiled a list of the most common pitfalls companies fall into when trying to deal with this problem.

Common Pitfalls We’ve Seen In The Wild

Pitfall #1 - Focus only on governance and standards 

The first great pitfall we see organizations fall into is swinging the pendulum all the way in the other direction. Usually, this takes the form of:

  • An internal “request data” portal or form
  • Tedious external data cataloging that is highly manual to maintain 
  • Extreme role-based permissions to access data

The most common way companies overcompensate with governance is through over-implementing business processes. Yes, it is certainly important to guard access to data so that users aren’t able to see protected or sensitive data where it is not needed. However, this can easily be overengineered to the point that users find it impossible to discover the data they need.

The governance process should be directly tied to metrics and your semantic layer, not raw databases or tables. For example, an analyst might be looking for a way to analyze Headcount but cannot find the official source. Because of a lack of access to the source of truth, the analyst may attempt to create the metric through creative means using the sources they do have access to.

A common complaint amongst analysts is that they can’t request access to something they don’t already know exists. This arises when the formal request process requires the analyst to specify exactly which schema or table they want access to, yet the analyst has no underlying knowledge of what exists. Instead, users should be able to understand very clearly which metrics exist, even if they don’t currently have access to the data. This makes it more likely that the user will properly request access instead of trying to work around the process by inventing their own definition.

Another mistake companies make is implementing extreme restrictions on what metrics and data a user can access based on their functional area. For example, a company might restrict Marketing metrics to only the marketing department, so the Call Center Workforce team would not be able to access them.

The reality is that a business does not operate in the same silos we use to organize our people. If the Call Center Workforce team were able to add Marketing data to their analyses, they might realize that outbound marketing volume significantly impacts inbound call volume, and consequently their staffing needs. Having access to that data allows the Workforce team to plan and schedule more effectively, which leads to a direct improvement in operating efficiency.

The whole premise of why Self-Service is good is that it enables business users to make better data-driven decisions quickly. We agree that all of this access creates a new standardization problem, but trying to solve it by restricting access to data is like amputating your leg because of a broken foot.

Pitfall #2 - Selecting a single self-service tool as your semantic layer

The second pitfall we see is that companies recognize self-service is part of the problem, attempt to centralize all self-service into a single tool, and use finance or budgeting authority to eliminate the others.


At smaller companies, it might be entirely feasible that only a single self-service tool is needed. At larger companies, however, analysts will have a wide range of skill sets and will create more value through a variety of self-service tools, because each tool has its own inherent strengths and weaknesses.

We are not saying that consolidating self-service tools isn’t ever a good thing. In fact, you may benefit from volume discounts through consolidation. Our point is that the consolidation of self-service tools should not be primarily driven by a lack of metric standardization. 

Today, there are so many specialized tools, platforms, and systems that declaring a single one the only choice ends up inhibiting analytics. Some teams might choose one tool over another because of its unique capability to design dashboards for mobile devices. Other tools are stronger at geospatial mapping. No matter what the Microsoft sales rep tells you, choosing a single tool to do everything will always leave some users behind.

The temptation to rip analysts’ favorite tools out of their hands and consolidate everything onto a single platform has grown stronger in recent years, driven by the recent resurgence of the semantic layer. The current story you hear from BI companies is that by utilizing their platform correctly, you can define, store, and centralize semantic definitions in one place. However, this is a pitfall: if the definition is contained in the silo itself, then by definition that is not centralization.

For example, marketing could set up metrics and semantic definitions in Looker. As long as everyone is using Looker, that’s fine. But what about Call Center Workforce Planning, who uses Tableau instead? What about the power analyst who runs SQL directly on the warehouse?

The Market Has Already Realized This

Here is a recent image from Looker’s marketing material to illustrate the point in their own words:

Looker’s representation of the modern data paradox and semantic layer.
Source: Bring governance and trust to everyone with Looker’s universal semantic model, by Adam Wilson


Interestingly enough, Google announced that sometime in 2022 they would release a new feature called “BI Connectors” (see the yellow box in the top right). Their intent is to solve the very conundrum we’re discussing. However, if you use one of your BI tools as the semantic layer, then it isn’t really a semantic layer unless all of your end users (including other BI tools) are downstream. In this approach, the users who write queries against the database are simply ignored.

We are not exclusively picking on Looker. The same story is happening across most of the popular Self-Service BI tools. Today, the modern BI tools are in an arms race to consolidate and own the semantic logic.

A business information tool's dream of owning the semantic layer in the modern data stack.

Beyond ignoring the fact that there will always be power analysts accessing the database directly, this proposed solution presents a major challenge for cooperation between data engineering and analysts. Data engineers, whose job is to build database objects that the business finds useful, cannot easily access the semantic logic from inside a BI tool in order to build it more efficiently upstream in the data layer. Furthermore, in order to calculate these metrics on a continual basis, the company must begin to invest in a second compute layer that will eventually rival the data warehouse itself.

Therefore, a better approach is to consolidate the semantic logic in the data warehouse environment itself - to leverage existing computational resources, impact all end users equally, and share a common language (SQL) with the data engineers.

Introducing a Better Approach - Define Once, Compute Efficiently, Use Everywhere

Our approach to the modern data paradox.

We designed Rasgo around a guiding principle: the data community needs standards and governance without sacrificing self-service for the business. We help you properly standardize your critical business metrics while empowering self-service consumption of these metrics within the business.

At Rasgo, we believe the semantic layer should live within the cloud data warehouse, centralized with your raw and analytic-ready tables. Because metrics are driven by SQL code, all version control and storage of the SQL logic that creates metrics are centralized within the git repository. 
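As a minimal sketch of what this can look like, a standardized metric may be materialized as a version-controlled view in the warehouse; the schema, table, and column names below are hypothetical:

    -- Stored in git, deployed to the warehouse, and read by every downstream
    -- consumer (BI tools, SQL users, Python users), so there is one definition.
    CREATE OR REPLACE VIEW semantic.monthly_voluntary_churn AS
    SELECT DATE_TRUNC('month', end_date) AS churn_month,
           division,
           COUNT(*) AS voluntary_churn
    FROM analytics.subscriptions
    WHERE end_date IS NOT NULL
      AND cancel_reason NOT IN ('deceased', 'relocated')
    GROUP BY 1, 2;

Because the logic lives in plain SQL next to the data, data engineers, analysts, and downstream tools all read the same definition.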

When it comes to maintaining a semantic layer of physical metric data, dbt has emerged as the clear leader in this space. You can read about their plans here.

However, this solution focuses solely on the semantic layer for physical data, supporting the data engineers; it does not solve for the business users who can’t read yaml files or navigate the data warehouse. Data consumers in the business need dbt yaml files translated into business-contextualized metrics so they can interact with and consume them in a self-service fashion.

Rasgo is designed to work seamlessly with dbt. While analytics engineers refer to SQL code and yaml objects, business users can use a graphical UI to search, interact with, and analyze the standardized metrics with the metadata that matters to them.

Additionally, business users can use the semantic layer as a foundation to create their own metric definitions, which can be automatically synchronized with a code repository like GitHub.

Rasgo connects the data and analytic engineers with the business users in a way that lets them speak to each other using the same language. When a business user refers to Voluntary Churn, the engineer sees exactly how that metric is defined in terms of SQL. This allows the metric standardization process to be a collaborative effort that reduces the bottleneck on data engineering teams.

Ad hoc analytics are best performed directly on the data warehouse. When Operational BI is the goal (such as a permanent dashboard that displays weekly sales performance), modern BI tools such as Tableau and Looker are great. However, most companies quickly become inundated with hundreds or even thousands of dashboards that were built for one-time analysis and now clutter up the environment. This is because the majority of analyses are ad hoc, not permanent.

Therefore, Rasgo was built to provide SQL users, python users, and no-code users the ability to analyze data directly on the data warehouse. For example, an analyst can quickly generate a heatmap or a funnel chart with a few clicks, while Rasgo translates those clicks to SQL and automatically runs them against the trusted semantic tables. Insights and visualizations can then be shared via URL, or their source data can be extracted directly from the warehouse to their tool of choice for further analysis.
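For illustration, the query behind such a heatmap might look something like the sketch below, reading from the trusted semantic view rather than re-deriving the metric from raw tables. The view name is hypothetical, the actual SQL Rasgo generates will differ, and DATEADD assumes Snowflake-style syntax:

    -- Voluntary churn by division and month for the trailing year,
    -- sourced from the governed semantic view.
    SELECT division,
           churn_month,
           voluntary_churn
    FROM semantic.monthly_voluntary_churn
    WHERE churn_month >= DATEADD('month', -12, CURRENT_DATE)
    ORDER BY division, churn_month;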

The choice of Self-Service BI tools should be respected and embraced, but those tools should not contain the critical definitions of key metrics.

Rasgo is not opinionated as to whether you should consolidate BI tools or have multiple. When an analyst uses Rasgo to discover a standardized metric, Rasgo simply connects the user with the appropriate object in the data warehouse and stays out of the way. This ensures that whatever is done inside a BI tool can always be traced back to a trusted source.

Conclusion

In the rush to get rid of OLAP and modernize our data stacks, we left behind one of the most important benefits that it gave us: metric standardization. We designed Rasgo to add governance and standardization to the data without creating a new silo or compromising the flexibility of Self-Service BI. By integrating with dbt and the data warehouse itself, data teams can avoid the “Great Data Paradox” and experience the best of both worlds.

Sign Up For Your Free 30-Day Trial!