Highlights –
- Data versioning is crucial since it speeds up the creation of data products while cutting down on errors.
Generally speaking, a version refers to a “form wherein some details differ when compared with earlier ones.” The difference is that the new product is improved, upgraded, or customized. In the digital world, versioning is cheap and commonplace: whether for small objects or an entire system, multiple versions of almost anything can be maintained. Many things are versioned automatically, such as the software running on our phones or the documents we edit.
But why do we version so many things? Because versioning produces an indispensable record of the incremental changes made and when they occurred. This log is valuable for businesses because it helps them understand why a specific data point has the value it does.
Data is one area where versioning is still in its early stages. Why? Because data can be enormous, possibly bigger than anything else in the digital age, and maintaining numerous copies of something so vast is close to impossible.
But with the right data structures and abstractions, a versioning system can be built for data of any size, mapping data objects to the versions they belong to. Versioning is essential for business-critical data environments. Before we discuss this aspect, let us define what data versioning is, why it is important, and some of its applications.
What is Data Versioning?
At the basic level, data versioning means establishing a unique reference for a dataset. This reference can take the form of a query, an ID, or a date-time identifier. This broad definition qualifies various strategies under the umbrella term “data versioning.” It could be as simple as saving a complete copy of the data with a new name or file path every time you wish to create a new version. It also covers more sophisticated versioning solutions that optimize storage usage across versions and expose operations to manage them.
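For illustration, here is a minimal sketch of the simplest strategy above: saving a full copy of the data under a new, uniquely identifiable path. The directory layout and helper name are hypothetical, not part of any particular tool.

```python
# Naive copy-based versioning: every version is a full copy of the dataset
# stored under a timestamped path that serves as its unique reference.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def save_version(dataset_path: str, versions_root: str = "versions") -> Path:
    """Copy the dataset to a timestamped path and return the new version's location."""
    version_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    destination = Path(versions_root) / version_id / Path(dataset_path).name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(dataset_path, destination)
    return destination

# Usage: each call produces a new, independently addressable version.
# save_version("data/users.parquet")  # -> versions/20240101T120000Z/users.parquet
```

This is wasteful on storage, which is exactly why the more sophisticated solutions mentioned above deduplicate data between versions.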
Importance of Data Versioning
The reasons why data versioning is crucial are many: it speeds up the creation of data products while cutting down on errors. Suppose a petabyte of production data gets deleted by mistake. It is much simpler to restore an earlier version of the dataset than to rerun a backfill process, which can take an entire day to complete.
Or suppose you want to know which records have changed in a table that lacks a trustworthy last-updated field or Change Data Capture (CDC) log. This can be accomplished by taking snapshots of the data at different versions and then querying the differences. As these two examples show, reducing the cost of errors and uncovering how data has changed over time both enhance a data team’s development speed. Data versioning is the catalyst that makes this possible.
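As a sketch of the snapshot-diff idea, the following compares two saved snapshots of a table to surface added or modified records. The column names and the pandas-based approach are assumptions for illustration.

```python
import pandas as pd

def changed_records(old: pd.DataFrame, new: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Return rows in `new` that are absent from `old` or whose values differ."""
    merged = new.merge(old, on=key, how="left", suffixes=("", "_old"), indicator=True)
    value_cols = [c for c in new.columns if c != key]
    differs = merged["_merge"].eq("left_only")          # brand-new records
    for col in value_cols:
        differs |= merged[col].ne(merged[f"{col}_old"])  # modified records
    return merged.loc[differs, list(new.columns)]

# Usage with two snapshots saved by a versioning scheme like the one above:
# old = pd.read_parquet("versions/20240101T000000Z/users.parquet")
# new = pd.read_parquet("versions/20240102T000000Z/users.parquet")
# changed = changed_records(old, new)
```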
Applications of Data Versioning
Let’s now get a clearer picture of how data versioning helps in different situations, with a few examples.
Versioning Data for Machine Learning
Imagine you work for a company that uses Machine Learning (ML) techniques to enhance grainy video footage. Customers use the improved footage for various commercial purposes. The data scientists develop a new algorithm that improves the classification precision of the outputs. However, the new algorithm cannot be rolled out to all users at once; instead, consumers need to be able to switch between the classifications of the two algorithms for a while.
To support this, the outputs of each algorithm version can be saved to separate paths in the object store. You could get away with this hardcoded-file-paths technique for one algorithm with only two versions. However, as the number of algorithms and versions grows, developers are more likely to make mistakes, such as forgetting to update a version’s path or the specific parameters used for a particular version. Data versioning introduces helpful abstractions to manage the outputs more sensibly.
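A minimal sketch of what such an abstraction might look like follows; the class, registry, bucket name, and path scheme are illustrative assumptions rather than any specific product’s API.

```python
# A registry mapping (algorithm, version) pairs to output paths and parameters,
# so nothing is hand-edited when a new version ships.
class OutputStore:
    def __init__(self, root: str = "s3://footage-outputs"):
        self.root = root
        self.params: dict[tuple[str, int], dict] = {}

    def register(self, algorithm: str, version: int, params: dict) -> None:
        """Record the exact parameters used for a specific algorithm version."""
        self.params[(algorithm, version)] = params

    def path_for(self, algorithm: str, version: int) -> str:
        """Deterministic output path; no hardcoded strings to forget to update."""
        if (algorithm, version) not in self.params:
            raise KeyError(f"Unknown version: {algorithm} v{version}")
        return f"{self.root}/{algorithm}/v{version}/"


store = OutputStore()
store.register("enhancer", 1, {"upscale": 2})
store.register("enhancer", 2, {"upscale": 4, "denoise": True})

# Consumers can switch between the two algorithms' classifications by version:
legacy_path = store.path_for("enhancer", 1)  # s3://footage-outputs/enhancer/v1/
latest_path = store.path_for("enhancer", 2)  # s3://footage-outputs/enhancer/v2/
```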
Versioning Data in Analytics
The fundamental step in analytics is the creation of metrics that encode the logic to assess user and business behavior. Essential concepts like “Active Users,” “Sessions,” and “Churn Rate” frequently have their logic written in SQL and computed on a regular schedule.
An issue that frequently arises in analytics is that a query that ran successfully yesterday may start generating errors because the underlying data changed. The best way to identify the root cause is to run exactly the same query over the data as it was when the error first appeared. Having versioned data makes the debugging process easier and speeds up the resolution of data issues.
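As one concrete illustration, here is a sketch that assumes the table is stored in a format with time-travel support, such as Delta Lake; the table path, timestamp, and metric query are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as it existed when the query last ran successfully...
snapshot = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")  # moment before the error appeared
    .load("s3://warehouse/events")
)
snapshot.createOrReplaceTempView("events_then")

# ...and run exactly the same metric query over that historical snapshot.
active_users = spark.sql("""
    SELECT COUNT(DISTINCT user_id) AS active_users
    FROM events_then
    WHERE event_date >= date_sub(current_date(), 30)
""")
```

Comparing this result against the query’s output on today’s data pinpoints whether the logic or the data changed.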
Data Versioning Best Practices
Here are some practices that have proven useful when implementing data versioning.
Use TTLs to expire old versions
Some versions, say those older than 30 days or a year, might not need to be kept on file. However, no one wants to take on the role of housekeeper and be in charge of deleting outdated data, so old versions accumulate. A TTL (Time To Live) policy is an excellent way to automatically remove earlier versions of data once you know they are no longer relevant after a certain period. If your versioning system does not support TTLs, running a version clean-up script on a schedule produces the same result.
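Here is a minimal sketch of such a clean-up script, assuming the timestamped directory layout from the earlier example and a 30-day retention window:

```python
# Mimic a TTL policy: delete any version directory older than the cutoff.
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

TTL = timedelta(days=30)

def expire_old_versions(versions_root: str = "versions") -> None:
    cutoff = datetime.now(timezone.utc) - TTL
    for version_dir in Path(versions_root).iterdir():
        created = datetime.strptime(
            version_dir.name, "%Y%m%dT%H%M%SZ"
        ).replace(tzinfo=timezone.utc)
        if created < cutoff:
            shutil.rmtree(version_dir)  # version is past its TTL; remove it

# Run on a schedule (e.g. a daily cron job) so old versions never accumulate.
```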
Version in terms of meaning rather than time
There is nothing wrong with generating new versions of a dataset on an hourly or daily cadence. In most cases, locating the version created nearest to the time you’re looking for is good enough.
Data versions are more meaningful, though, when versioning is tied to the beginning and/or end of data pipeline jobs. For example, you can create a version whenever an ETL script finishes running. Or, if you’re about to send your “highly engaged” users an email, you could save a version of the dataset first. This lets you attach more valuable metadata to your versions beyond the time they were created and speeds up the process of determining what went wrong.
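A small sketch of this idea: tag a version when a job finishes and attach metadata describing why the version exists. The metadata fields and file layout are illustrative assumptions.

```python
# Attach meaning to a version: record which job produced it and why it matters.
import json
from datetime import datetime, timezone
from pathlib import Path

def tag_version(version_path: str, job: str, reason: str, **extra) -> None:
    """Write a small metadata file alongside a version."""
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "job": job,        # e.g. the ETL script that just finished
        "reason": reason,  # why this version exists, beyond its timestamp
        **extra,
    }
    Path(version_path, "metadata.json").write_text(json.dumps(metadata, indent=2))

# tag_version("versions/20240102T000000Z", job="daily_user_etl",
#             reason="pre-send snapshot for highly-engaged-users email")
```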
Utilize versioning for teamwork
One major challenge in data environments is avoiding stepping on your peers’ toes. Data assets are frequently treated as a kind of shared folder that anybody may read from, write to, or modify. One way to avoid conflicts is to create personal versions of the data while developing. This reduces the chance that a modification you make will unintentionally impact another team member.
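As a sketch, a personal copy can be derived from the shared dataset with a simple naming convention; the scheme below is an assumption, and tools such as lakeFS offer the same idea as first-class branches.

```python
# Derive a per-developer working copy so the shared data stays untouched.
import getpass
import shutil
from pathlib import Path

def personal_copy(shared_path: str, dev_root: str = "dev") -> Path:
    user = getpass.getuser()
    destination = Path(dev_root) / user / Path(shared_path).name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(shared_path, destination)
    return destination  # develop and test against this path, not the shared one
```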
Wrapping it up
There is a noticeable trend in the data industry toward treating “Data as a Product.” While this is a good trend for data organizations, many data teams need to level up the way they operate to support it. Practices like CI/CD, testing, and version control are what you need if you want your data team to take on these types of projects confidently.