The Case for Data Contracts
Preventative data quality rather than reactive data quality
Companies lose large amounts of time and money to data quality issues in data that feeds a value-generating use (i.e., quality issues with “production data”). After data is produced, it travels through multiple uncoordinated software vendors that store and manipulate it, often resulting in poor data quality that breaks the systems designed to use the data. The solution is data contracts: a set of enforceable rules and automatic data fixes that ensure data arrives in its expected format at the point where it is consumed.
The case for data contracts: a technological and market analysis
One way to think about the world of data infrastructure is to carve it up into three estates: the hyperscalers (AWS, GCP, and Azure), the data lakehouses (Databricks and Snowflake), and an amorphous “Third Estate” made up of a sprawling array of data tooling whirligigs and doodads that move, change, and analyze data between when data is created within an organization (“production”) and when data is used (“consumption”). The Third Estate is a fragmented market, and this fragmentation leads to data quality pain points caused by the lack of functional interoperability between vendors. In other words, there is no way to prevent dropping the baton during the data infrastructure relay race.
Data quality can be divided into two subcategories: data observability and preventative data quality. Up until now, the latter has not been solved because hyperscalers, lakehouses, and larger vendors within the Third Estate are too focused on fully penetrating their own slice of the market to tackle a solution. We propose data contracts as a solution to the issue of preventative data quality: programmatic enforcement of data quality between all tools within the Third Estate ecosystem to guarantee that no value is lost from data quality issues.
The three categories of data infrastructure
Here is an oversimplified but hopefully clear framework for what market structure should pop into mind when you hear about the data infrastructure market. We mention this here to get everyone on the same page for how we think about the data world:
Hyperscalers: The hyperscalers are the big dogs: Amazon Web Services, Google Cloud Platform, and Microsoft Azure. They provide cloud data storage and processing for enterprises that no longer want to host their own servers on-prem. These players mainly care about winning contracts to migrate companies’ data storage and compute from the companies’ own servers (on-premises) to the hyperscalers’ servers (the cloud). For the hyperscalers, who offer virtually every product under the sun, selling all the accoutrements of data warehousing or the various functions within the Third Estate is nice and dandy – the software margins sure are great – but they are just as happy to co-sell with someone like Snowflake if it means winning the underlying cloud migration. Better to win the migration (worth $$$$) than sour a potential customer by pushing point solutions ($–$$$).
Lakehouses: Snowflake and Databricks are data platforms that are duking it out in the lakehouse arena, which aims to combine the best features of data lakes (legacy Databricks) and data warehouses (legacy Snowflake). Data warehouses offer a platform to provide fast and flexible analysis of large sets of structured data, while data lakes focus on developing, training, and deploying machine learning models and big data processing workflows. Snowflake has historically been SQL based, and Databricks Python and Scala. Now, they aim to provide a unified platform for data management and analysis where raw data is stored in a lake and processed with warehousing techniques. Unifying analytics and machine learning workflows is one of the holy grails of the current modern data stack. For the most part, these vendors are too busy scrapping to care about the rest of the data infrastructure stack.
The Third Estate: Software within the Third Estate covers numerous functions, and within each function, there are numerous companies that differentiate amongst themselves through marginally incremental technological innovation. Markets within the Third Estate include message brokers; change data capture (CDC); extract, transform, and load tools (ETL); reverse extract, transform, and load tools (RETL); customer data platforms (CDPs); metrics layers; transform tools; and data management tools. Whether or not you are familiar with every branch of this family tree is less relevant than understanding that the number of these tools exploded over the last five to ten years, often financed at eye-popping, record-setting valuations.
Doing things to data between producer and consumer
Recall that on the left and right borders of the Third Estate, you have (1) data producers, which generate data from events such as a human entering data into Salesforce, a person clicking through a website (clickstream data), or a credit card transaction being recorded; and (2) data consumers, which include analytics, machine learning models, and production applications.
When producers produce data, consumers can’t just eat that data raw; it has to be refined, like oil. So in between producers and consumers, you have a system of pipes, boxes, and machines that move, store, and configure data from multiple producers. And lastly, you have a few observability tools that make sure you can peek inside the whole system to see when things go wrong. This system is the Third Estate, and it is how you get data from what it looks like when produced, to what it needs to look like for consumption to be possible.
I put together the below landscape picture to describe how I characterize the Third Estate: everything between producer and consumer except for the lakehouses. If you are not familiar with this market, feel free to skip over the image. While lakehouses also exist between producer and consumer, I think they are worthy of their own category.
It is imperative that data systems do not break
Each type of data consumption requires a different constellation of tools within the Third Estate to achieve the business objective that the data consumer sets out to accomplish. Here are a couple of examples:
An analytics system that produces dashboards to give business information to management: Transaction data is produced and written into a database (e.g., Postgres). From there, the data is “moved” through an EL/ETL pipeline (e.g., data replication) into a lakehouse. Either while that data is moving to the lakehouse, or once inside the lakehouse, that data is transformed into something useful (e.g., combined with other data). From there, that data might move through another pipeline into a business analytics tool that creates reports for management to consider when making business decisions.
An automated preventative maintenance system on a factory floor: Data from sensors on a factory floor is produced and goes directly into a message broker. From there, the message broker sends the data to a streaming database, which analyzes the data in motion. The results of that analysis power a preventative maintenance data application that can maintain machinery before it breaks down.
An e-commerce recommendation engine: Clickstream data is captured by Google Analytics and placed into a message broker such as RabbitMQ or Apache Kafka that moves the data to a lakehouse of choice. The data ultimately powers machine learning algorithms that power the recommendation and search ranking systems of an e-commerce website that connect potential customers with the products they are looking for.
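The examples above share the same shape: produce, move, transform, consume. The first (dashboard) pipeline can be sketched in a few lines; all names and figures here are hypothetical stand-ins, not any vendor's API, and each function boundary is a handoff where quality can silently degrade.

```python
# Hypothetical sketch of the dashboard pipeline: transaction rows are produced,
# replicated, transformed, and consumed by a report. Every arrow between stages
# is a vendor handoff where data quality can break.

def produce() -> list[dict]:
    # Stand-in for rows written to a production database (e.g., Postgres).
    return [
        {"customer": "Acme Corp", "amount_usd": 120.0},
        {"customer": "Globex", "amount_usd": 80.0},
    ]

def replicate(rows: list[dict]) -> list[dict]:
    # Stand-in for an EL/ETL pipeline copying rows into a lakehouse.
    return [dict(row) for row in rows]

def transform(rows: list[dict]) -> dict:
    # Stand-in for in-lakehouse transformation: total revenue per customer.
    totals: dict[str, float] = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount_usd"]
    return totals

def report(totals: dict) -> str:
    # Stand-in for the BI tool management reads.
    top = max(totals, key=totals.get)
    return f"Top customer: {top} (${totals[top]:.2f})"

print(report(transform(replicate(produce()))))  # Top customer: Acme Corp ($120.00)
```

If `produce` one day emits `amount` instead of `amount_usd`, nothing in this chain objects until `transform` throws (or worse, silently computes nonsense) – which is exactly the fragility the next section describes.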
In all three of these examples, a material amount of “stuff” is done to data between the point at which it is created and the point at which it is consumed. Data must be moved and changed such that consumers can ultimately derive business value from it.
The entire system is remarkably fragile: should something bad happen to the data at any point in this system, downstream data consumers can break. A dashboard that shows priority customers for a CEO to reach out to might start returning the wrong names. A preventative maintenance system might fail, causing expensive machinery failures and missed production quotas. An e-commerce recommendation engine might go down, causing millions in forgone sales.
However, data systems do break
The problem with the Third Estate is that changes to data quality cause real destruction of business value.
Imagine one of the following happens:
A software developer changes some code, doesn’t tell anyone, and now his program spits out data with a changed schema. Models break, but everyone wishes they didn’t.
Someone using a sales system decides to move data from one field to another, but doesn’t tell anyone, and now data looks weird because models did not break, but everyone wishes they did!
In the first example, nothing meaningful about the data changed – perhaps only a field name or a superficial formatting detail. Maybe instead of writing “Netflix” to a production database, a software program started writing “netflix”, causing Netflix’s content recommendation engine to break down. Since no real functional change occurred, most teams would have preferred that this change not cause a break.
In the second example, an entire field of data is emptied, with an entirely new field taking its place. This 100% should cause a model breakage because the data you need no longer exists where you need it! But instead, the model might just read #NULLs where the old field of data resided and will most likely not break. The result would be incorrect analytics built off incorrect information – it would have been better had the model just broken in the first place so you could fix it before getting bad analytics!
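Both failure modes can be made concrete with a tiny, hypothetical contract-style check (the field names and canonical spellings below are illustrative assumptions, not any vendor's implementation): superficial drift like “netflix” is auto-corrected instead of breaking a consumer, while a silently emptied required field fails loudly rather than flowing downstream as #NULLs.

```python
# Hypothetical illustration of the two failure modes above:
#   (1) superficial drift ("netflix" -> "Netflix") is normalized, not fatal;
#   (2) a missing/NULL required field is rejected loudly instead of silently
#       producing bad analytics downstream.

KNOWN_TITLES = {"netflix": "Netflix"}  # canonical spellings (assumed)

def check_record(record: dict) -> dict:
    # Failure mode 2: field moved/emptied -> fail fast, don't pass NULLs along.
    if record.get("title") is None:
        raise ValueError("contract violation: 'title' is missing or NULL")
    # Failure mode 1: superficial drift -> auto-fix instead of breaking.
    fixed = dict(record)
    fixed["title"] = KNOWN_TITLES.get(record["title"].lower(), record["title"])
    return fixed

print(check_record({"title": "netflix"}))  # {'title': 'Netflix'}
```

This is exactly the behavior most teams say they want: invisible fixes for changes that should never have broken anything, and loud breaks for changes that should.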
You can observe these data quality issues today, but we think you should be able to prevent them
The best you can do right now in the data quality space is observe issues. Vendors in the space include companies like Great Expectations and Monte Carlo. What there aren’t vendors for yet is prevention of data quality issues.
Preventative data quality software – data contracts – is ultimately the point of this essay: we believe you should be able to prevent the unforeseen value destruction that typically arises from undesired changes to data schema and semantics. You can’t prevent such changes from ever occurring in the first place, and you can’t blame data consumers for needing their data to be just-so in order to go about their business delivering value to the enterprise. We therefore reason that data contracts should enforce interactions between vendors throughout the Third Estate to prevent data quality issues along every step of the value chain.
Data contracts should be a technological implementation, not organizational
At their simplest, data contracts are an agreement between data producer and data consumer that data schema and semantics need to be “just-so” so that downstream data consumers can continue to run uninterrupted without data changes breaking down their systems. Because data contracts are nothing more than a way for producers and consumers to agree about what data should look like, contracts could be entirely organizational, or entirely technological.
An organizational data contract would look like a data producer and consumer within an organization getting together over coffee to hash out a simple and enforceable set of data quality standards. A technologically implemented data contract system would be a software platform that programmatically enforces data quality standards as data flows through the Third Estate.
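What a “technological” contract means in practice can be sketched in a few lines (field names are hypothetical; a production system would cover far more than types): the consumer declares its expectations once, in code, and every record crossing the boundary is checked against that declaration rather than against a handshake agreement.

```python
# Minimal, hypothetical sketch of a programmatically enforced data contract:
# the consumer declares required fields and types, and each record is validated
# against that declaration at the producer/consumer boundary.

CONTRACT = {                 # consumer-declared expectations (illustrative)
    "order_id": int,
    "customer": str,
    "amount_usd": float,
}

def enforce(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of violations; an empty list means the record conforms."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"{field}: expected {expected_type.__name__}, "
                              f"got {type(record[field]).__name__}")
    return violations

print(enforce({"order_id": 1, "customer": "Acme", "amount_usd": "12.50"}))
# ['amount_usd: expected float, got str']
```

A real contract would also encode semantics (value ranges, enumerations, freshness), but even this skeleton catches the schema drift described above before it reaches a consumer.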
Organizational data contracts are infeasible both organizationally and technologically. Organizationally, enterprises often have complicated webs of producers and consumers, where consumers seldom, if ever, have line of sight to quickly discover exactly who produces the data elements they rely on. Technologically, because of the jumble of infrastructure that sits between producers and consumers – maintained by neither producer nor consumer – a sustainable handshake agreement between the two is virtually unobtainable.
Thus, we are interested in the idea that data contract software can sit between producers and consumers and programmatically enforce the data quality agreements required to keep business-critical data consumers up and running 24/7.
Monetization and market opportunity: data contracts are for production use cases
The most important use case for data contracts is when data is being used for any production use case, whether that be models or analytics. Recall the three data system examples previously; when there is a breakdown in these data systems due to non-enforcement of data quality requirements, the cost to the organization could be in the millions of dollars.
In terms of monetization, data contract vendors should charge on a usage-based pricing model. This model could be based on many forms of consumption, such as number of topics ingested, volume of data ingested, underlying compute required to complete contract checks, etc. The specific structure should be tailored to contract developer or customer needs and could be implemented with third-party usage-based pricing software vendors (another area I have written about, for any interested readers).
The typical enterprise will use about 1/3 of the data that it produces for production use cases. Since data contracts are applicable across streaming and batch data, the TAM for data contracts is roughly 1/3 of the event streaming TAM (e.g., Confluent’s TAM) and 1/3 of the batch data movement TAM (e.g., Fivetran’s TAM) (note that this uses a simplified assumption that data used by the enterprise is evenly distributed across batch and streaming ingestion sources).
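The TAM arithmetic above reduces to a one-line formula. The dollar figures below are purely illustrative placeholders (the essay does not supply market sizes); only the one-third production share and the even batch/streaming split come from the text.

```python
# TAM sketch: roughly 1/3 of the event streaming TAM plus 1/3 of the batch data
# movement TAM, under the stated even-distribution assumption. Input figures
# are hypothetical placeholders, not sourced estimates.

PRODUCTION_SHARE = 1 / 3  # share of produced data used in production (from text)

def data_contracts_tam(streaming_tam: float, batch_tam: float) -> float:
    return PRODUCTION_SHARE * streaming_tam + PRODUCTION_SHARE * batch_tam

# Placeholder inputs in $B, purely illustrative:
print(round(data_contracts_tam(streaming_tam=60.0, batch_tam=30.0), 1))  # 30.0
```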
Data engineers spend a lot of time fixing issues like these. At larger companies, entire teams of data engineers run around putting out these fires, often spending >50% of their time on mundane data quality tasks. At smaller organizations, data engineers who should be worrying about literally anything else are pulled away from higher-value functions to instead play a technological game of Clue to get to the bottom of data quality issues. And on top of the headache for data engineers, most business analysts and ML engineers we talked to over the last couple of months who owned dashboards or models told us it might take them weeks to get problems like these fixed because of the volume of these issues.
For mid to large size companies, these outages are often worth hundreds of thousands to millions of dollars in both lost revenue from broken production pipelines and the cost to get them fixed. These problems haven’t been solved yet because most companies have not yet had the technical knowledge to solve them; only companies with the most sophisticated engineering teams have built in-house data contract solutions so far. Now that knowledge of the technology is growing, developer communities are buzzing:
Data contract architecture
Data contract architecture begins with being able to enforce the schema and semantics of a data stream. Organizations with cutting-edge data infrastructure teams have already implemented different versions of the data contract system by building in-house solutions.
In the implementation of data contract software, consumers should define the necessary data schema and semantics programmatically so that business value flows uninterrupted. While some organizations have producers define data products and let consumers choose whether or not to subscribe to them, we believe consumers should define their data needs in virtually all cases so as to extract maximum value from the data; this is a value-accretive competitive advantage.
These definitions must accommodate the full range of acceptable data outcomes that a consumer can tolerate to create the maximum possible “wiggle room” for software engineers to work with as they make changes to their software that might require underlying changes to data produced.
The linked workflow (credit to Aurimas) is an example of what a lightweight data contract system might look like. Data feeds into a Kafka topic and uses the Kafka schema registry to enforce data schema. To run semantic enforcement on the data, a stream processor is used and data is checked against the schema registry before moving along to production use cases.
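The two enforcement steps in that workflow can be modeled without any Kafka dependencies (this is a hedged, library-free sketch, not Confluent's API): the schema registry step becomes a type check and the stream processor's semantic step becomes a value-range check, with failing records routed to a dead-letter destination instead of reaching production consumers.

```python
# Hypothetical model of the workflow above: schema enforcement (registry-style
# type check) followed by semantic enforcement (stream-processor-style value
# check). Records failing either check land in a dead-letter list. All field
# names and thresholds are illustrative assumptions.

SCHEMA = {"sensor_id": str, "temp_c": float}           # registry-style schema
SEMANTICS = {"temp_c": lambda v: -50.0 <= v <= 150.0}  # semantic rule (assumed)

def process_stream(records):
    valid, dead_letter = [], []
    for rec in records:
        schema_ok = all(isinstance(rec.get(f), t) for f, t in SCHEMA.items())
        semantics_ok = schema_ok and all(ok(rec[f]) for f, ok in SEMANTICS.items())
        (valid if schema_ok and semantics_ok else dead_letter).append(rec)
    return valid, dead_letter

good, bad = process_stream([
    {"sensor_id": "a1", "temp_c": 21.5},
    {"sensor_id": "a2", "temp_c": 999.0},  # fails the semantic check
])
print(len(good), len(bad))  # 1 1
```

The key design point is the ordering: schema is cheap and checked first, semantics only run on schema-valid records, and nothing semantically invalid ever moves along to production use cases.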
For any production grade analytics or model, consumers should define their data needs with respect to schema, semantics, et al. such that by the time data is consumed by a production system, there is a full guarantee of data quality.
What data contracts would look like if we could wave a magic wand
The above is a bare-bones data contract, but sophisticated contract software should also allow software engineers to send a subset of sample data through the data stack to generate a data diff – essentially, a snapshot of how data now differs from previous data (an example company doing this now is Datafold). Consumers can test whether the diff will break their models or analytics, and then either accept or reject the changes, or accommodate the new data if the changes proposed by software engineers are non-negotiable.
This system should be wrapped in a machine learning model to accomplish one of two goals: (1) train the model on past records of acceptable data quality (e.g., Netflix vs. netflix) to create self-healing pipelines that correct faulty records (some organizations have tackled this issue), or (2) train the model on the decision of whether to accept or reject data to automate change approvals.
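Goal (1) can be illustrated with a toy self-healing step. A real system might train a learned model on past accepted records; here the standard library's `difflib` stands in purely to show the idea, and the known-good list is an invented example.

```python
# Toy "self-healing pipeline" step: correct a faulty record by matching it
# against historically accepted values. difflib substitutes here for what
# would be a trained model in a real system; KNOWN_GOOD is illustrative.

import difflib

KNOWN_GOOD = ["Netflix", "Hulu", "Disney+"]  # past accepted values (assumed)

def self_heal(value: str, known_good=KNOWN_GOOD, cutoff: float = 0.8) -> str:
    """Replace a faulty value with its closest known-good match, if any."""
    match = difflib.get_close_matches(value, known_good, n=1, cutoff=cutoff)
    return match[0] if match else value

print(self_heal("netflix"))  # Netflix
```

The `cutoff` threshold is the interesting dial: set too low, the pipeline "heals" genuinely new values into old ones; set too high, it stops healing at all – which is why goal (2), learning accept/reject decisions, is the harder and more valuable half.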
Ultimately – and I believe, feasibly – the data contract software paradigm should be thought of as a way to prevent data quality issues and eventually either automate fixes or automatically accept or reject data quality changes before they hit production grade data consumers.
Characteristics of the data infrastructure market that make market entry of a data contract vendor feasible
If you’re now convinced of the value in data contracts (or at least intrigued) you might be wondering why existing vendors don’t enter the space with great excitement.
The Third Estate, as a collection of markets, includes essentially everything between producer and consumer other than lakehouses. Unlike the markets of the hyperscalers and lakehouses, which are relatively crystalized in terms of who the relevant vendors are, the Third Estate market is still in a pre-consolidation phase where individual point solutions vie for control of individual markets through marginally unique product strategies.
The breadth of selection within the Third Estate’s markets is immense, in part due to the marginally incremental technological differentiation of each of these tools and the wealth of venture funding in this sector over the last 5-10 years. Developers and data engineering teams have adopted these tools en masse with little hesitation, but the Third Estate as it stands now still has significant limitations to data quality.
However, with the recession now upon us and softness showing up in enterprise IT budgets and infrastructure software earnings revisions, the failures inherent in a fragmented tooling ecosystem will eventually give way to either consolidation or the rise of preventative data quality software (data contracts) that can safeguard against data quality issues causing production breakages.
Multiple market tailwinds make high data quality in production use cases higher stakes than ever
Numerous issues exist that can be ameliorated by implementing data contracts within an organization’s data infrastructure systems. A more comprehensive list can be found here, but below are some examples we heard in conversations with data teams:
Data infrastructure systems are expanding in scope to encompass more financial data, and maintaining regulatory compliance through these systems will be increasingly important (e.g., CEO and CFO liability for financial reporting accuracy under Sarbanes-Oxley)
As it becomes easier to share data across organizations (e.g., Snowflake stable edges, which have exploded in popularity) and as organizations increasingly monetize data as a secondary revenue source (e.g., Snowflake marketplace), it will be of paramount importance to only provide high quality data to customers
The machine learning market broadly continues to grow at a 30%+ CAGR toward a projected market value in the hundreds of billions of dollars; high quality data will continue to underpin successful new ML use cases
Increasing speculation that data teams will, in aggregate, be restructured from highly centralized (e.g., data engineers responsible for all infrastructure between producer and consumer) to highly decentralized (e.g., data producers take ownership of the data they produce, including its quality, in a world of “data-as-product”). This philosophy of data team structure is highly congruous with data contracts
Market dynamics within the data infrastructure market that make significant competition unlikely
Companies within each market within the Third Estate are fairly content fighting their own battles given the gargantuan size of the markets they are playing in, the relative under-penetration of these markets, or both.
As an example case study, consider Fivetran and Matillion, two EL / ETL vendors that continue to jockey for control of this sub-category within the Third Estate.
Fivetran, as an oversimplification, takes data from its source and moves it wholesale into a lakehouse where a different vendor (dbt Labs) transforms that data into something useful, whereas Matillion takes data from its source, transforms it into something useful, and then moves it into a lakehouse. These are two different ways to play in the EL/ETL market, both of which have appeal to different sets of users, and both of which have immense remaining whitespace to focus on.
Similarly, companies within the Third Estate beyond just the EL / ETL markets are growing so rapidly with their core products that there is little to no incentive to deviate from entering new verticals within their core competency, improving user experience, and fine-tuning their consumption-based pricing strategies (a fairly undeveloped model that tends to be a black box). Organically, companies ranging from the Series B to D are growing anywhere from 500% to 50% per annum (respectively), and so have little reason to take their eyes off the ball.
Moreover, the limited inorganic growth in modern data infrastructure markets also points to vendors not intending to leave their swim lanes. Consider Fivetran, which recently acquired HVR. Fivetran’s acquisition came in conjunction with a Series D financing near the peak of the market cycle, with correspondingly remarkable valuations for both. Most notable, however, is the industrial logic behind the acquisition. HVR enables Fivetran to perform another type of data replication. Rather than branch into other categories, Fivetran chose to consolidate its presence within data replication by acquiring a CDC vendor while continuing to let dbt gain share in its own swim lane of data transformations.
Even if these companies did want to enter the data contracts space, it is a daunting prospect given the complexity and cost, which would deter an initial batch of would-be contracts vendors currently focused on their paths to profitability. Should these companies want to acquire their way into the space, there are virtually no vendors available for sale (the next best thing would be a strategic investment into a data observability tool), and even if there were acquisition opportunities, leverage capacity is low and financing opportunities are particularly unattractive these days.
The only real potential market entrants to data contracts are data observability and testing tools. Companies like Soda or Monte Carlo are aware of data contracts to varying degrees of implementation enthusiasm, and represent vendors worth watching.
I’m extremely excited about data infrastructure broadly and would love to chat with any infrastructure software investors and entrepreneurs working on building technology in the space. Data contracts are particularly exciting to me but there are other ways to invest behind the trend of “containerization” or decentralization of data infrastructure. I’m available by LinkedIn (Colin Campbell) or by email (email@example.com). My background is venture capital, growth equity, and private equity and includes data analytics software investing.
Thank you to my classmates at Stanford GSB who brought a wealth of experience on this topic, to the many industry participants who covered how their data infrastructure stacks work and their wish lists of pain point solutions, and to key figures within the data infrastructure world, Chad Sanderson and Ananth Packkildurai.