Now it’s time to start thinking about the specific quality issues within our data. There is a whole field built around multiple aspects of Data Quality Measurement and Management. Because it was written by data nerds, of course it has a fancy title:
THE SIX PRIMARY DIMENSIONS FOR DATA QUALITY ASSESSMENT.
(BTW, that right there is an 8-word title for a concept that’s only six words long. Hooray for Data Nerds!)
I am going to zero in on the most common problems you’re likely to encounter with dirty data, and show you the fastest possible way to:
Spot them, and
Evaluate them.
Most Common Data Problems
I’ve picked the most common data problems you’re likely to see in the average organization. There may be others not on this list, but the ones below cover most of what you’ll run into. If you do come up with a new one, drop us a line. We’re always on the hunt for new species of data problem!
While you’re reading these below, focus on remembering how to spot them. We’ll get to fixing them in a later chapter.
Missing Values
Many times, because data was not entered into the system correctly or certain files have been corrupted, records arrive with missing values. For example, if an address does not include a zip code at all, the remaining information can be of little value, since the geographic part of the record is hard to pin down.
Some systems will output a “NULL” or “0” value when exporting a blank field. These end up in your data file, and they are equivalent to a blank field; if you check only for blank values, you will undercount the real number of missing values. Your system may use other conventions too, such as strings of “0”s or “x”s, so watch for those patterns as well.
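To see how this plays out, here is a minimal sketch (plain Python, no libraries) of counting truly missing values. The sentinel list is an assumption; swap in whatever placeholder conventions your own source systems actually use.

```python
# Placeholder values that really mean "missing" -- an assumed list;
# adjust it to match the conventions of your source system.
SENTINELS = {"", "null", "0", "00000", "xxxxx", "n/a"}

def count_missing(values):
    """Count fields that are blank or hold a known placeholder."""
    return sum(1 for v in values if str(v).strip().lower() in SENTINELS)

zips = ["90210", "", "NULL", "0", "10001", "xxxxx"]
print(count_missing(zips))  # 4 of the 6 zip codes are effectively missing
```

Notice that a naive check for blank cells alone would have found only one missing value here, not four.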
Partial or Incomplete Values
Sometimes there is data in a field, but it’s not all the data you’re expecting. Incomplete values can be as simple as an initial instead of a name in the Last Name field. These are most often caused by human data entry error but they could also come from failed writes or a mis-mapping of fields.
Another example is a single data field containing an incomplete value (such as a phone number without the area code). The other type is an incomplete record, for example a customer address record that is missing its zip code. It’s important to differentiate between the two, because each must be addressed differently.
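The two kinds of incompleteness above can be flagged with a few lines of code. A sketch follows; the field names and the phone pattern are assumptions for illustration, not a standard.

```python
import re

# Matches a full US-style phone number with area code, e.g. "(212) 555-1234".
# An assumed pattern -- tune it to the formats your data actually uses.
PHONE_FULL = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def check_record(record):
    """Return a list of completeness problems for one customer record."""
    problems = []
    if not record.get("zip"):                  # incomplete record: field absent/blank
        problems.append("missing zip")
    phone = record.get("phone", "")
    if phone and not PHONE_FULL.match(phone):  # incomplete value: field present but partial
        problems.append("phone missing area code?")
    return problems

print(check_record({"phone": "555-1234", "zip": ""}))
```

The point of returning separate messages is exactly the distinction made above: a missing field and a partial value call for different fixes.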
Duplicate Records
Multiple copies of the same record take a toll on computation and storage, but they may also produce skewed or incorrect insights when they go undetected. One key cause is human error, someone simply entering the data multiple times by accident, or it can be an algorithm that has gone wrong.
The remedy for this problem is called “data deduplication.” It’s a blend of human insight, data processing, and algorithms: likelihood scores flag records that look like a close match, and common sense confirms which of them really are duplicates.
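A toy sketch of that likelihood-scoring idea, using the standard library’s SequenceMatcher as the similarity measure. Real deduplication tools use far more robust scoring; the threshold here is an assumption you would tune against records you’ve reviewed by hand.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score how alike two records are, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    "Jane Smith, 12 Oak St",
    "Jane Smyth, 12 Oak Street",   # likely the same person, typed differently
    "Bob Jones, 4 Elm Rd",
]
THRESHOLD = 0.8  # assumed cut-off between "close match" and "different record"

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score >= THRESHOLD:
            print(f"possible duplicate ({score:.2f}): {records[i]} / {records[j]}")
```

The pairs this prints are candidates, not verdicts; the “common sense” step, a human confirming the match, still comes last.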
Mis-Formatted Data (Data in the Wrong Format)
If data is stored in inconsistent formats, the systems used to analyze or store it may not interpret it correctly. For example, if an organization maintains a database of its customers, the format for storing basic information should be pre-determined. Names (first name, last name), dates of birth (US or UK style), and phone numbers (with or without country code) should each be saved in exactly one agreed format. Otherwise, data scientists can spend a considerable amount of time simply unraveling the many versions of the data that got saved.
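As a sketch of what that unraveling looks like in practice, here is a small normalizer that funnels several date-of-birth formats into one canonical form. The format list is an assumption; list the formats your own systems actually produce.

```python
from datetime import datetime

# Assumed input formats, tried in order. Note that US vs UK dates can be
# genuinely ambiguous ("01/02/1990"), so order matters and some values
# may need manual review no matter what.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable -- flag for manual review

print(normalize_date("03/25/1984"))  # -> 1984-03-25
```

Agreeing on one storage format up front, as the paragraph above suggests, is what makes this kind of cleanup unnecessary in the first place.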
Text Encoding Artifacts
As data is moved from system to system and written to files, the text encoding can change along the way. When a file is then read with the wrong encoding, strange non-English symbols appear amongst your data in place of familiar apostrophes and quotation marks.
Unstructured Data
Exactly like it sounds: data with no structure. No columns, no headings; it barely fits in a spreadsheet. Just a big file of text information. If you downloaded every comment ever made on Facebook and threw them into one file, that would be some unstructured data. Unfortunately, there’s no way to clean unstructured data until you’ve structured it. It’s a whole other topic.
Delimiter and Offset Issues
This type of data problem is practically invisible in the raw CSV file but looks absolutely horrible in a spreadsheet; luckily, it is not all that difficult to fix. It occurs when the delimiter structure of your input file is off, which can be caused by a single comma or semicolon in the wrong place. That misplaced delimiter causes the spreadsheet to misinterpret the data structure. If you’ve ever loaded a file that looked fine for the first 500 rows, and then suddenly everything shifted by one column and the rest of the file looked like gibberish, this error was likely the culprit.
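One quick way to spot this before the spreadsheet mangles things is to count fields per row and report any row that disagrees with the header. A sketch with the standard library’s csv module; the sample data is made up.

```python
import csv
import io

# A tiny made-up file: the last row has a stray extra comma,
# giving it four fields where the header promises three.
sample = "name,city,zip\nAna,Austin,78701\nBo,Dallas,TX,75201\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)
for lineno, row in enumerate(reader, start=2):
    if len(row) != len(header):
        print(f"line {lineno}: expected {len(header)} fields, got {len(row)}")
```

Running a check like this against the raw file pinpoints the exact line where the offset begins, which is the hard part; removing the stray delimiter is usually the easy part.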