Highlights:
- Datafold’s focus has been on automated testing during the transformation step with Data Diff.
- Developers and data analysts who want to compare databases quickly and effectively without building a DIY diff tool can use Datafold’s solution, which marks a significant advancement.
Datafold, a data reliability company, recently announced the release of a new open-source cross-database diffing tool called data-diff. The new solution is an open-source extension of Datafold’s original Data Diff tool for comparing data sets. Open-source data-diff uses high-performance algorithms to verify the consistency of data across databases.
In the current data stack, businesses gather data from sources, put it in a warehouse, and then transform it to be used for analysis, activation, or data science use cases. Datafold’s focus has been on automated testing during the transformation step with Data Diff. This ensures that a change to a data model does not break a dashboard or feed a predictive algorithm the wrong data. With the release of open-source data-diff, Datafold can now assist with the extract and load phases of the process. Open-source data-diff ensures that the loaded data matches the source from which it was derived. Because every part of the data stack needs testing for data engineers to produce reliable data products, Datafold now provides them with coverage across the entire extract, load, and transform (ELT) process.
“Data-diff fulfills a need that wasn’t previously being met,” said Gleb Mezhanskiy, Datafold founder and CEO. “Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process. Although multiple vendors and open-source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.”
Mezhanskiy continued, “Data-diff solves this problem elegantly by providing an easy way to validate the consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open-source tool, it can be easily embedded into existing workflows and systems.”
Addressing an important need
Today’s organizations utilize data replication to combine data from several sources into data lakes or data warehouses for analytics. They combine data for search, move data from legacy systems to contemporary databases, and connect operational systems with real-time data pipelines.
Data synchronization between various systems and apps is now simpler than ever, thanks to solutions like Fivetran, Airbyte, and Stitch. Most data synchronization scenarios demand a 100% data integrity guarantee. In reality, however, records can occasionally be lost in any connected system owing to failed packets, general replication problems, or configuration errors. Validation checks with a data diffing tool are therefore needed to ensure data integrity.
Developers and data analysts who want to compare databases quickly and effectively without building a DIY diff tool can use Datafold’s solution, which marks a significant advancement. Currently, data engineers compare data using a variety of techniques, from straightforward row counts to in-depth row-level analysis. Row counts are quick but not comprehensive, while row-level analysis ensures full validation but is slow. Open-source data-diff is both quick and offers full-scale validation.
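To make the trade-off between the two techniques concrete, here is a minimal sketch in Python. The tables, keys, and helper names are hypothetical and stand in for a real source database and its replica; the sketch is not taken from data-diff itself.

```python
# Hypothetical source and replica tables, keyed by primary key.
source = {1: "alice", 2: "bob", 3: "carol"}
replica = {1: "alice", 2: "bobby", 3: "carol"}  # row 2 changed during replication


def row_counts_match(a, b):
    """Quick check: only compares how many rows each table has."""
    return len(a) == len(b)


def row_level_diff(a, b):
    """Full check: return sorted keys whose rows are missing or different."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


counts_ok = row_counts_match(source, replica)
changed_keys = row_level_diff(source, replica)
print(counts_ok, changed_keys)
```

The row count check passes even though row 2 differs, while the row-level check finds it but has to touch every row on both sides.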
Building and managing data quality with open-source data-diff
Data-diff, which is now available, employs checksums to quickly and effectively confirm complete consistency between two separate data sources. This technique enables a row-level comparison of 100 million records to be completed in a matter of seconds without compromising the granularity of the resulting comparison.
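The article does not describe how data-diff uses checksums internally, so the following is only a rough sketch of one general way a checksum-based comparison can work: hash a range of rows on each side, and split the range further only when the hashes disagree. The in-memory dictionaries, the `checksum` and `diff_keys` helpers, and the `min_size` threshold are all assumptions made for illustration; in a real setting each checksum would typically be computed inside the database so that only the hash is transferred.

```python
import hashlib


def checksum(rows, lo, hi):
    """Hash every (key, value) pair with lo <= key < hi, in key order."""
    h = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        h.update(f"{key}:{rows[key]}|".encode())
    return h.hexdigest()


def diff_keys(a, b, lo, hi, min_size=4):
    """Return keys in [lo, hi) whose rows differ between tables a and b."""
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # segment matches: no need to compare its rows one by one
    if hi - lo <= min_size:
        # Small mismatching segment: fall back to a row-by-row comparison.
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    mid = (lo + hi) // 2
    return diff_keys(a, b, lo, mid, min_size) + diff_keys(a, b, mid, hi, min_size)


# Hypothetical data: a 1,000-row source and a replica with two problems.
source = {k: f"row-{k}" for k in range(1000)}
replica = dict(source)
replica[137] = "corrupted"  # one changed row
del replica[842]            # one missing row

mismatches = diff_keys(source, replica, 0, 1000)
print(mismatches)
```

Because matching segments are dismissed with a single hash comparison, row-level work is limited to the small ranges that actually contain differences, which is how a checksum approach can stay fast while still reporting individual rows.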
Data-diff was made available by Datafold under the MIT license. Datafold includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto, and Oracle. To create connectors for more data sources and particular business applications, Datafold intends to invite contributors.