The SME Blog

Data Quality in a Data Product World

Written by George Barrett | Sep 10, 2024 4:15:00 AM

While fundamentally it is nothing new, the concept of treating data like a product has gained prominence in the last decade. “Data products” has been added to the lexicon of data engineers and business stakeholders alike. What was the catalyst? I think the two main drivers were advancements in the data warehousing and data lake space, along with the rise in business demand for data as a strategic, marketable, and sellable asset. Couple that with the fact that data’s journey no longer ends with dashboarding and visualization. Machine learning and AI use cases have exploded with the popularization of generative AI and advanced computing methods. Data has also become more deeply ingrained in applications, and data sharing and marketplaces are popping up. All of this results in widespread data democratization, allowing more people to access more data and make more data-driven decisions. And while all this “more” can certainly lead to more value, it also brings more responsibility. One of those responsibilities is data quality.

Quality control exists for all products, not just those related to data. Imagine ordering a supreme pizza from your favorite delivery place. How would you react if your pizza came with no sausage on it? Or way too many onions? What if half the pizza had the meat and the other half had the veggies? What if there was pineapple on your supreme pizza? My gut tells me that if this happened multiple times, you would have a new favorite pizza place relatively quickly. Now imagine that instead of being a pizza eater, you are a data product consumer. Instead of topping issues, you get null values, duplicate rows, incorrect counts, and values that are out of range or just flat-out wrong. If that is the case, you would probably call the data unreliable and stop trusting it. But if the data is coming from your company’s data warehouse, the supposed “single source of truth”, what data, if any, are you supposed to use for your decision-making?
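To make the topping analogy concrete, here is a minimal sketch of the kinds of checks a data consumer (or producer) could run to catch those exact defects. The table, column names, and file path are hypothetical, and it assumes pandas is available; the point is only that the checks themselves are simple.

```python
import pandas as pd

# Hypothetical data product: an "orders" table with order_id and amount columns.
df = pd.read_parquet("orders.parquet")  # assumed location of the data product

issues = {
    # Null values where the contract says none are allowed
    "null_order_ids": int(df["order_id"].isna().sum()),
    # Duplicate rows for what should be a unique key
    "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
    # Values that are out of range or just flat-out wrong
    "negative_amounts": int((df["amount"] < 0).sum()),
}

for check, count in issues.items():
    print(f"{'PASS' if count == 0 else 'FAIL'}: {check} = {count}")
```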

In most cases, data quality existed at some point in every company’s data pipelines and engineering practices. What changed? One word: scale. More pipelines, more logic, more projects, more data consumers, and more data sources. There’s that word “more” again. But a lot of the time this does not translate to more data engineers. And trust me, as someone who has been on data engineering projects in the past, among the first things to get rushed when deadlines become more pressing are testing and documentation. We’d prefer to get the data product out the door on time. It’s understandable why this happens, but that doesn’t mean it should become acceptable.

I believe it is the responsibility of every CIO, CDO, and data team manager to champion data quality in their organization. There is simply too much at stake not to. As mentioned before, poor data quality dismantles trust between data producers and data consumers. Other issues include making decisions on inaccurate data, compliance and privacy risks, and operational inefficiencies. Left unaddressed, these problems will only amplify in their negative consequences as more and more data products are requested and created. If this sounds scary, it’s because it is. But it doesn’t have to be overwhelming.

Data quality starts with the foundation of a company-wide data culture and a strong data governance framework. Roles should be defined so that everyone knows who owns which responsibilities and processes for managing data quality. Without this foundation, your program is unlikely to sustain its success over time. Data quality is also not a single stop along the track. It should be evaluated and measured at every step of the data pipeline, from the source systems all the way down to the consumption layer. Not every step will use the same rules and tests, but each should at least evaluate the results of the logic and transformations applied at that step. As for how often data quality is tested, this should be automated via the data pipeline. In 2024, we have the technology and frameworks to ensure continuous monitoring and to establish data quality KPIs that measure progress and accountability. Speaking of technology, an investment in new tools can definitely assist with automation, monitoring, observability, and policy implementation, especially at scale. But oftentimes you do not need to buy a tool with a high price tag to get started on your data quality journey. Again, it comes back to data culture and governance.
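As one illustration of the “you don’t need an expensive tool” point, here is a minimal sketch of a pipeline step that runs a handful of quality rules after a transformation and reports a pass rate you could track as a KPI. The rule names, columns, and gate behavior are hypothetical; this is plain Python with pandas, not any particular vendor framework.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class Rule:
    name: str
    check: Callable[[pd.DataFrame], bool]  # True means the rule passed

def run_quality_step(df: pd.DataFrame, rules: list[Rule]) -> float:
    """Run every rule against the step's output and return the pass rate (a simple KPI)."""
    results = {rule.name: rule.check(df) for rule in rules}
    for name, passed in results.items():
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
    return sum(results.values()) / len(results)

# Hypothetical rules for one transformation step in the pipeline.
rules = [
    Rule("no_null_keys", lambda d: d["customer_id"].notna().all()),
    Rule("unique_keys", lambda d: not d["customer_id"].duplicated().any()),
    Rule("row_count_reasonable", lambda d: len(d) > 0),
]

df = pd.DataFrame({"customer_id": [1, 2, 3]})  # stand-in for the step's real output
pass_rate = run_quality_step(df, rules)
if pass_rate < 1.0:  # fail the pipeline step rather than ship bad data downstream
    raise RuntimeError(f"Data quality gate failed (pass rate {pass_rate:.0%})")
```

The same pattern can be repeated at each stage of the pipeline, with rules tailored to the logic applied at that stage, and the pass rates rolled up into the kind of data quality KPIs described above.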

Speaking candidly, I believe data quality has been neglected in recent years. The generative AI boom came along and spurred dozens, if not hundreds, of new use cases off of the same (and sometimes additional) data that was there before the boom. I think that instead of using generative AI to build these new use cases, perhaps we should start by seeing how it can help build and foster that data-driven culture through implementing strong data governance. It’s not the most exciting project, and yes, it will mean that new data products may have to wait before hitting the shelves. But the benefit of waiting until you know your data is accurate and can be trusted by your data consumers is the difference between increased opportunities and increased emergencies over the next couple of years.
