What is Data Cleaning?
Lets face it, most data you’ll encounter is going to be dirty. Dirty data yields inaccurate results, and is worthless for analysis until it’s cleaned up. Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. There are several methods for data cleansing depending on how it is stored along with the answers being sought.
Data cleansing is not simply about erasing information to make space for new data, but rather finding a way to maximize a data set’s accuracy without necessarily deleting information.
Data cleansing is the process of spotting and correcting inaccurate data. Organizations rely on data for many things, but few actively address data quality. Whether it’s the integrity of customer addresses or ensuring invoice accuracy. Ensuring effective and reliable use of data can increase the intrinsic value of the brand. Business enterprises must assign importance to data quality.
For one, data cleansing includes more actions than removing data, such as fixing spelling and syntax errors, standardizing data sets, and correcting mistakes such as missing codes, empty fields, and identifying duplicate records. Data cleaning is considered a foundational element of data science basics, as it plays an important role in the analytical process and uncovering reliable answers.
Most importantly, the goal of data cleansing is to create datasets that are standardized and uniform to allow business intelligence and data analytics tools to easily access and find the right data for each query.
Why Clean Data?
A data driven marketing survey conducted by Tetra data found that 40% of marketers do not use data to its full effect. Managing and ensuring that the data is clean can provide significant business value.
Improving data quality through data cleaning can eliminate problems like expensive processing errors, manual troubleshooting, and incorrect invoices. Data quality is also a way of life because important data like customer information is always changing and evolving.
Business enterprises can achieve a wide range of benefits by cleansing data and managing quality which can lead to lowering operational costs and maximizing profits.
How To Clean Data
To be considered high-quality, data needs to pass a set of quality criteria. Those include:
- Valid: The degree to which the measures conform to defined business rules or constraints. When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used (e.g., spreadsheets, where it is very hard to limit what a user chooses to enter into a cell, if cell validation is not used). Data constraints fall into the following categories.
- Data-Type Constraints – e.g., values in a particular column must be of a particular datatype, e.g., Boolean, numeric (integer or real), date, etc.
- Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.
- Mandatory Constraints: Certain columns cannot be empty.
- Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.
- Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person’s gender may be Female, Male or Unknown (not recorded).
- Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the “state” column is required to belong to one of the US’s defined states or territories: the set of permissible states/territories is recorded in a separate States table. The term foreign key is borrowed from relational database terminology.
- Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.
- Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient’s date of discharge from hospital cannot be earlier than the date of admission.
- Data cleaning to correct validity issues can often be done programattically.
- Accuracy: Quite simply, is the data right. The conformity of a measure to a standard or a true value. Accuracy is very hard to achieve through data cleaning in the general case, because it requires accessing an external source of data that contains the true value: such “gold standard” data is often unavailable. Accuracy has been achieved in some data cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state), and also help verify that street addresses within these zip codes actually exist.
- Completeness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleaning methodology: one cannot infer facts that were not captured when the data in question was initially recorded. In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates “unknown” or “missing”, but supplying of default values does not imply that the data has been made complete.
- Consistency: The degree to which a set of measures are equivalent in across systems. Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies – e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).
- Uniformity: The degree to which a set data measures are specified using the same units of measure in all systems. It is often quite easy to ensure this through data cleaning early in the process, but as the process moves along, and data is transformed and changed, it becomes far more difficult. In datasets pooled from different locales, weight may be recorded either in pounds or kilos, and must be converted to a single measure using an arithmetic transformation. This also points out the problem of naked metrics, where values like weight may be recorded as an integer value. Unless there is another column directly next to it, or a notation in the column heading, it can be next to impossible to determine whether a value is in kilos or pounds, or celsius or Fahrenheit.
The term integrity encompasses accuracy, consistency and some aspects of validation but is rarely used by itself in data-cleansing contexts because it is insufficiently specific.
How Can I Use Data Cleaning?
Regardless of the type of analysis or data visualizations you need, data cleansing is a vital step to ensure that the answers you generate are accurate. When collecting data from several streams and with manual input from users, information can carry mistakes, be incorrectly inputted, or have gaps.
Data cleaning helps ensure that information always matches the correct fields while making it easier for business intelligence tools to interact with data sets to find information more efficiently. One of the most common data cleaning examples is its application in data warehouses.
After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data
A successful data warehouse stores a variety of data from disparate sources and optimizes it for analysis before any modeling is done. To do so, warehouse applications must parse through millions of incoming data points to make sure they’re accurate before they can be slotted into the right database, table, or other structure.
Organizations that collect data directly from consumers filling in surveys, questionnaires, and forms also use data cleaning extensively. In their cases, this includes checking that data was entered in the correct field, that it doesn’t feature invalid characters, and that there are no gaps in the information provided.
More on Data Cleaning:
Check out our Index of Business Intelligence Terms for more definitions.