We live in an age of information explosion. The information we receive every day is more than a person would have received in a month 50 years ago!
This information deluge is easy to get lost in. Not only is it a LOT of information to organize, but it’s also often noisy, inaccurate, and sometimes even fake… All of this makes it challenging for us to process.
This is true for machine learning models too! Sure, they can process much more data than we can, but they still must contend with volume, noise, and inconsistencies.
This is one reason why data quality is so important.
True enough, having “enough” data is critical. But even an endless amount of data will have its limits if it is all garbage…
A large dataset is helpful to represent as many variants as possible. But if the data quality is not high enough to represent the variants accurately, it’s counterproductive.
Training a model with a “noisy” dataset is like trying to put together an IKEA crib with just the pictures. Inevitably it will be wrong and will have to be redone.
So, we spend time cleaning and preparing data to reduce the noise. Different types of data require a focus on different types of data quality.
For text data, we might fix misspellings, gibberish words, weird symbols, grammar errors, etc. For image data, we might adjust sharpness, resolution, noise, etc.
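To make that concrete, here is a minimal text-cleaning sketch in Python. It only illustrates the idea: the regex rules and the sample string are my own assumptions, not a one-size-fits-all recipe.

```python
import re

def clean_text(text: str) -> str:
    """A minimal, illustrative text cleaner: lowercase, drop odd symbols, collapse whitespace."""
    text = text.lower()                            # normalize case
    text = re.sub(r"[^a-z0-9.,!?'\s]", " ", text)  # replace weird symbols with spaces
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace
    return text

# Example usage with a made-up noisy string.
print(clean_text("This   sentence has WEIRD   symbols ### and    gaps!!"))
```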
Tabular data also requires cleaning, but it’s an entirely different beast because tabular data is heterogeneous. Remember when I said, “Data scientists and machine learning engineers spend 40% to 60% of their time preparing data”? Tabular data is an example of why.
So, turning data into quality data is a process. It takes time. And it takes a certain expertise to do it both “right” and efficiently.
Generally, the first step in the process is to clean and prepare the data. This typically involves things like handling missing values, removing duplicate records, fixing incorrect data types, and standardizing formats.
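As a rough sketch of what that can look like for tabular data, here is a small pandas example. The table, column names, and specific fixes are hypothetical; they just illustrate the kinds of operations involved.

```python
import pandas as pd

# Hypothetical raw table; the columns and values are made up for illustration.
raw = pd.DataFrame({
    "age":    [34, None, 29, 29, 120],
    "city":   ["NYC", "nyc ", "Boston", "Boston", "NYC"],
    "income": ["52,000", "61000", None, None, "48,500"],
})

df = raw.copy()
df["city"] = df["city"].str.strip().str.upper()                               # standardize formats
df["income"] = df["income"].str.replace(",", "", regex=False).astype(float)   # fix data types
df["age"] = df["age"].fillna(df["age"].median())                              # handle missing values
df = df.drop_duplicates()                                                     # remove duplicate rows
print(df)
```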
This cleaning phase is pretty standard. Although it is time-consuming, it is fairly straightforward.
However, to ensure we feel good about the data going into our model, we need to do a deeper analysis and more pre-processing.
Depending on the data and the model we are building, we might have to do different things. But generally speaking, we will usually optimize for three critical elements: accuracy, balance, and consistency.
We hope (maybe pray) our data will accurately describe the pattern we want our model to learn. Unfortunately, this often just doesn’t work out.
There are many accuracy issues we need to guard against, but a common one is extreme values. These could come from genuinely anomalous cases or from manual entry mistakes.
For example, a person listed as sixty feet tall is in all probability inaccurate. This is likely a manual mistake that can be easily fixed. Even when these extreme values come from real cases, we still want to fix them, because they are usually rare edge cases that we don’t want to influence our model.
Usually, accuracy errors can be fixed easily; finding them in a large dataset is another issue.
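One simple, common way to surface extreme values in a large dataset is an interquartile-range (IQR) filter. The sketch below uses a made-up height column (in inches) with a sixty-foot “person” slipped in; the 1.5×IQR fences are a conventional choice, not a rule.

```python
import pandas as pd

# Hypothetical height column in inches; 720 is the "sixty feet tall" record.
heights = pd.Series([65, 70, 62, 68, 720, 71, 66, 69])

q1, q3 = heights.quantile([0.25, 0.75])          # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # conventional 1.5 * IQR fences

outliers = heights[(heights < lower) | (heights > upper)]
print(outliers)  # flags the 720-inch entry for review or correction
```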
When our data comes from multiple groups, we hope (again, likely pray) the amount of data from each group is balanced. However, due to data availability or sampling methods, we almost always get imbalanced data.
The impact of imbalanced data is that our model cannot perform equally well for every group, because some groups provide more training signal than others. So imbalanced data produces imbalanced model predictions.
Of course, in some cases, the imbalance is part of the problem itself, like fraud detection in credit card applications. There tend to be far fewer cases of actual fraud, so the data is naturally skewed.
In situations like this, we need to apply techniques that let our model learn from a very imbalanced dataset, because that is what the real data looks like, and our model is supposed to be data driven.
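One such technique is to reweight the rare class during training rather than resample the data. Below is a hedged sketch using scikit-learn’s class_weight="balanced" option on a synthetic, fraud-like dataset; the class proportions and the choice of model are just assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fraud-like problem: roughly 2% positive class.
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare class instead of resampling the rows.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```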
Sometimes the data we are working with is recorded across a number of years. Checking the data consistency across this timeline is critical.
For instance, when we are predicting future revenue for a business, our training label (revenue) is calculated by the business’s price model. But that price model might have been different a few years ago.
Therefore, we need to make the historical data consistent with the newer data. If we train on this data without any preprocessing to establish consistency, we will end up with a model that is not very robust.
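A minimal sketch of what that preprocessing can look like: restating old revenue figures under the current price model before training. The cutoff year and conversion factor below are invented for illustration; in practice they would come from the actual price models.

```python
import pandas as pd

# Hypothetical revenue history; the cutoff year and factor are assumptions.
df = pd.DataFrame({
    "year":    [2019, 2020, 2023, 2024],
    "revenue": [100.0, 110.0, 150.0, 160.0],
})

PRICE_MODEL_CHANGE_YEAR = 2022
OLD_TO_NEW_FACTOR = 1.25  # assumed conversion between the old and new price models

# Restate historical revenue under the current price model so labels are comparable.
old_rows = df["year"] < PRICE_MODEL_CHANGE_YEAR
df.loc[old_rows, "revenue"] = df.loc[old_rows, "revenue"] * OLD_TO_NEW_FACTOR
print(df)
```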
Just as critically, ensuring consistency is an ongoing effort.
Once we train a model and make predictions in real time, we need to keep checking the consistency between new and old data. If the distribution of new data changes, our model will start generating trash. So, detecting changes in data distribution and retraining our model over time is essential.
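One simple way to monitor this is to compare the distribution of an incoming feature against the distribution seen at training time, for example with a two-sample Kolmogorov–Smirnov test. The sketch below uses synthetic data and an arbitrary significance threshold; it is a starting point, not a full drift-monitoring system.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # a feature as seen at training time
incoming = rng.normal(loc=0.4, scale=1.0, size=5000)   # the same feature in production

stat, p_value = ks_2samp(reference, incoming)
if p_value < 0.01:                                     # arbitrary threshold for illustration
    print(f"Distribution shift detected (KS={stat:.3f}); consider retraining.")
else:
    print("No significant shift detected.")
```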
As mentioned in my previous post about data quantity, without sufficient quantity and quality of data, you’ll never have a truly powerful and useful model.
My hope is that this statement is now a little clearer.
Both data quantity and quality are core to building a machine learning model. Yet, I believe, data quality is even more important than quantity.
Creating quality data is a process, but the robust, impactful models it enables make the journey well worth it. And so, I wish you mazel tov in this pursuit!