A major part of most data projects is making sure that the inputs have been properly cleaned. Poorly formatted input data can quickly lead to a cascade of problems. Worse, errors can go completely undetected if the faults in the data don’t lead to faults in the process.
On the flip side, data cleaning can eat up a lot of your time. It’s a good idea to think about why that is and how you might remedy the issue.
Why Data Cleaning Is So Time-Consuming
A big problem in preparing data for use is the frequent mismatch between the source format and the format expected by the system processing the information. Something as simple as dealing with semicolons and quotes in a CSV file will still add to the time required to clean data for a project.
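The CSV case above can be handled with Python’s standard `csv` module, which lets you declare the delimiter and quote character explicitly rather than splitting on commas by hand. The sample data here is hypothetical: a semicolon-delimited file whose fields contain commas and embedded quotes, which naive splitting would mangle.

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, with a comma and doubled
# quotes inside quoted fields.
raw = 'name;comment\n"Smith, J.";"said ""hello"" twice"\n'

# Declaring the dialect explicitly keeps embedded commas and quotes intact.
rows = list(csv.reader(io.StringIO(raw), delimiter=";", quotechar='"'))
# rows[1] -> ['Smith, J.', 'said "hello" twice']
```

When the delimiter isn’t known in advance, `csv.Sniffer` can often guess it from a sample of the file, though it’s worth keeping an explicit fallback for files it misreads.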
It’s hard to anticipate all the ways things can be wrong with source data. User-contributed data, for example, may not be highly curated. Rather than getting highly clean inputs, you may get all kinds of characters that have the potential to interfere with reading and processing.
Security features also can drive the need for data cleaning. Web-submitted data is often automatically scrubbed to prevent SQL injection attacks. While doing data cleaning, it’s often necessary to reverse this process to get at what the original inputs looked like.
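One common form of this scrubbing is HTML-escaping user input before storage. A minimal sketch of reversing it, assuming the field was escaped with standard HTML entities (the stored string here is a made-up example):

```python
import html

# Hypothetical stored value: escaped on the way into the database to
# neutralize markup and injection attempts.
stored = "O&#39;Brien said &quot;5 &lt; 10&quot;"

# Reverse the escaping to recover the original input for cleaning.
original = html.unescape(stored)
# -> 'O\'Brien said "5 < 10"'
```

Real systems may apply other transformations (backslash-escaping, double-encoding), so it pays to inspect a sample of stored values before assuming any single reversal is enough.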
Cultural differences can present major problems in cleaning data, too. Even simple things like postal codes can create trouble. A US ZIP code is either 5 digits or 5 digits followed by a dash and four more. Canadian postal codes use both letters and numbers, and they contain spaces.
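The two formats just described can be captured with a pair of regular expressions. This is a simplified sketch: it doesn’t enforce the letters Canada Post actually excludes (D, F, I, O, Q, U), and it treats the space as optional since it is often dropped in practice.

```python
import re

# US: 5 digits, optionally followed by a dash and 4 more (ZIP+4).
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")

# Canada: alternating letter/digit (e.g. K1A 0B1), space optional here.
CA_POSTAL = re.compile(r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$")

assert US_ZIP.match("90210")
assert US_ZIP.match("90210-1234")
assert not US_ZIP.match("9021")
assert CA_POSTAL.match("K1A 0B1")
assert CA_POSTAL.match("k1a0b1")
```

A pipeline that validates one country’s format against the other’s pattern will silently discard perfectly good records, which is exactly the kind of loss the next section is about.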
End-users of web applications often enter inputs regardless of whether they fit the database’s format. In the best scenario, the database software rejects the entry and alerts the user. There are also scenarios, though, where the input is accepted and ends up in a field that’s mangled, deprecated or just blank.
Considerations During Data Cleaning
A major question that has to be asked at the beginning of an effort is how much data you can afford to lose. For example, a dataset of fewer than 1,000 entries is already dangerously close to being statistically too small to yield relevant results. One approach is to simply toss out all the bad lines. But if a quarter of the entries are visibly flawed and another 10% have machine-readability issues, you’ve shaved off more than a third of the dataset before processing even starts. Pre-processing may remove still more data through steps like outlier and duplicate removal.
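The toss-out approach, together with tracking how much it costs you, can be sketched in a few lines. The rows and the `is_clean` check here are hypothetical; the point is to measure the loss fraction before committing to the strategy.

```python
def is_clean(row):
    # Hypothetical validity check: three fields and a numeric id.
    return len(row) == 3 and row[0].isdigit()

rows = [
    ["1", "alpha", "beta"],
    ["bad line"],              # wrong field count
    ["2", "gamma", "delta"],
    ["x", "eps", "zeta"],      # non-numeric id
]

kept = [r for r in rows if is_clean(r)]
loss = 1 - len(kept) / len(rows)
# Half the toy dataset is gone -- a loss rate worth knowing up front.
```

If the computed loss pushes the surviving dataset below the size needed for meaningful analysis, that’s the signal to invest in repairing rows rather than discarding them.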
Barring extreme limits on time or capabilities, your goal should be to preserve as much of the original information as practicable. There are several ways to tackle the task, including doing:
- Manual checks of the data to identify obvious problems
- Dry runs of processing and post-processing work to see how mangled or accurate the output is
- Standard pre-processing methods to spot common problems, such as unescaped or escaped characters and HTML entities
- Machine learning work to recognize patterns in poorly formatted data
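The third item, spotting common problems in a pre-processing pass, might look like the sketch below. It flags issues rather than fixing them, so you can see how much data each repair would save; the specific checks and thresholds are illustrative, not a complete list.

```python
import re

# Patterns for two common contaminants: HTML entities left over from
# web-escaping, and non-printing control characters.
ENTITY = re.compile(r"&[a-zA-Z]+;|&#\d+;")
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def flag_problems(value):
    """Return a list of problem labels found in a single field."""
    problems = []
    if ENTITY.search(value):
        problems.append("html-entity")
    if CONTROL.search(value):
        problems.append("control-char")
    if value != value.strip():
        problems.append("stray-whitespace")
    return problems

flag_problems("caf&eacute;\x07 ")
# -> ['html-entity', 'control-char', 'stray-whitespace']
```

Running a pass like this over a sample of the data gives you a frequency count per problem type, which in turn tells you which cleaning steps are worth automating.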
While it might be possible to acquire pre-cleaned data from vendors, you’ll still need to perform the same checks; inputs that your own process hasn’t verified shouldn’t be trusted.
Data cleaning is one step in a larger process, but it has the potential to wreck everything and force you into a reset. Worse, there’s the very real possibility that the issues go undetected and somehow end up in a final work product. Rest assured that someone will read your work product and see what’s wrong with it.
It’s best to position the cleaning of data early in your larger set of processes. This means planning out how data will be processed and understanding what can’t be properly digested by downstream databases, analytics packages and dashboards. While some problems, such as issues with minor punctuation marks, can be handled in post-processing, you shouldn’t assume this will happen. Always plan to clean data as early as possible.
With a structured process in place, you can operate with cleaner datasets. This will save time and money, and it will also reduce storage overhead. Most importantly, it will ensure you keep the largest dataset possible and get the most relevant analysis that can be derived from it.