Categories
Business Intelligence Data Quality

Chapter 3 – Cleaning Data with Excel and Google Sheets

How Spreadsheets Became The #1 BI Tool in the World

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Microsoft Excel and Google Sheets are the first choice of many users when it comes to handling large amounts of data. They’re readily available, easy to learn and support universal file formats. When it comes to using a spreadsheet application like Excel or Google Sheets, the point is to present data in a neat, organized manner which is easy to comprehend. They’re also on nearly everyone’s desktop, and were probably the first data-centric software tool any of us learned.

In this eBook, we are going to tell you some of the tips as to how to clean and prep up your data using Excel and Google Sheets, and make it accurate and consistent, and make it look elegant, precise, and user-friendly.

spreadsheet software

Risks of Cleaning Data In Spreadsheets

While spreadsheet tools are quite adequate for many small to mid-level data chores, there are some important risks to be aware of. Spreadsheets are desktop-class, file-oriented tools which means their entire data contents are stored in volatile RAM while in use and on disk while you’re not using them. That means that between saves, the data is stored in RAM, and can be lost.

Spreadsheet tools also lack any auditing, change control, and meta-data features that would be available in a more sophisticated data cleaning tool. These features act as backstops for any unintended user error. Caution must be exercised when using them as multiple hours of work can be erased in a microsecond.

Unnoticed sorting and paste errors can also tarnish your hard work. If the data saves to disk while in this state, it can be very hard, if not impossible, to undo the damage and revert to an earlier version.

Spreadsheets also lack repeatable processes and automation. If you spend 8 hours cleaning a data file one month, you’ll have to repeat nearly all of those steps the next time another refreshed data file comes along. More purpose-designed tools like Inzata Analytics allow you to record and script your cleaning activities via automation. Data is also staged throughout the cleaning process, and rollbacks are instantaneous. You can set up data flows that automatically perform cleaning steps on new, incoming data. Basically, this lets you get out of the data cleaning business almost permanently.

Performance and Size Limits in Spreadsheet Tools

Most folks don’t bother to check the performance limits in Spreadsheet tools before they start working with them. That’s because the majority won’t run up against them. However, if you start to experience slow performance, it might be a good idea to refer to the limits below to measure where you are and make sure you don’t start stepping beyond them. Like I said above, spreadsheet tools are fine for most small data, which will suit the majority of users.

Excel Limits

Excel is limited to 1,048,576 rows by 16,384 columns in a single worksheet.

  • A 32-bit Excel environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process.
  • 64-bit Excel is not subject to these limits and can consume as much memory as you can give it. A data model’s share of the address space might run up to 500 – 700 megabytes (MB), but could be less if other data models and add-ins are loaded.

Google Sheets Limits

  • Google Spreadsheets are limited to 5,000,000 cells, with a maximum of 256 columns per sheet. (Which means the rows limit can be as low as 19,231, if your file has a lot of columns!)
  • Uploaded files that are converted to the Google spreadsheets format can’t be larger than 20 MB and need to be under 400,000 cells and 256 columns per sheet.

In real-world experience, running on midrange hardware, Excel can begin to slow to an unusable state on data files as small as 50mb-100mb. Even if you have the patience to operate in this slow state, remember you are running at redline. Crashes and data loss are much more likely!

If you believe you will be working with larger data, why not check out a tool like Inzata, designed to handle profiling and cleaning of larger datasets?

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Categories
Data Quality

Ebook Excerpt: Chapter 2 – Why Clean Data?

Chapter 2: Why Clean Data?

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Data cleansing is the process of spotting and correcting inaccurate data. Organizations rely on data for many things, but few actively address data quality. Whether it’s the integrity of customer addresses or ensuring invoice accuracy. Ensuring effective and reliable use of data can increase the intrinsic value of the brand.  Business enterprises must assign importance to data quality.

A data-driven marketing survey conducted by Tetra data found that 40% of marketers do not use data to its full effect. Managing and ensuring that the data is clean can provide significant business value.

Improving data quality can eliminate problems like expensive processing errors, manual troubleshooting, and incorrect invoices. Data quality is also a way of life because important data like customer information is always changing and evolving.

Business enterprises can achieve a wide range of benefits by cleansing data and managing quality which can lead to lowering operational costs and maximizing profits.

Who are the heroes who allow the organization to seize and enjoy all these benefits? I affectionately refer to these poor souls as PWCD’s, or People Who Clean Data[1].

These brave people, and hopefully you are reading this because you hope to be one of them, are the noblest. They often get little recognition even though they clean up the messes of hundreds, if not thousands of other people every day. They are the noble janitors of the data world. And I salute them.

Top 5 Benefits of Data Cleaning

1.  Improve the Efficiency of Customer Acquisition Activities

Business enterprises can significantly boost their customer acquisition and retention efforts by cleansing their data regularly. With the high throughput of the prospecting and lead process, filtering, cleansing, enriching having accurate data is essential to its effectiveness. Throughout the marketing process, enterprises must ensure that the data is clean, up-to-date and accurate by regularly following data quality routines. Clean data can also ensure the highest returns on email or postal campaigns as chances of encountering outdated addresses or missed deliveries are very low. Multi-channel customer data can also be managed seamlessly which provides the enterprise with an opportunity to carry out successful marketing campaigns in the future as they would be aware of the methods to effectively reach out to their target audience.

2.  Improve Decision-Making Processes

The cornerstone of effective decision making in a business enterprise is data. According to Sirius Decisions, data in an average B2B organization doubles every 12-18 months and though the data might be clean initially, errors can creep in at any time. In fact, in nearly all businesses where data quality is not managed, data quality decay is constantly at work. Each time new records are added; duplicates may be created. Things happening outside your organization, like customers moving and changing emails and telephone numbers will, over time, degrade data quality.

Yet the majority of enterprises fail to prioritize data quality management, or even acknowledge they have a problem! In fact, many of them don’t even have a record of the last time quality control was performed on their customer’s data. More often than not they merely discard or ignore data they believe to be of poor quality, and make decisions through other means. Here you can see that data quality is a massive barrier toward digital transformation and business intelligence, much less every company’s desire to become more Data-Driven.

Accurate information and quality data are essential to decision making. Clean data can support better analytics as well as all-round business intelligence which can facilitate better decision making and execution. In the end, having accurate data can help business enterprises make better decisions which will contribute to the success of the business in the long run.

3.  Streamline Business Practices

Eradicating duplicate and erroneous data can help business enterprises to streamline business practices and avoid wasteful spending. Data cleansing can also help in determining if particular job descriptions within the enterprise can be changed or if those positions can be integrated somewhere else. If reliable and accurate sales information is available, the performance of a product or a service in the market can be easily assessed.

Data cleansing along with the right analytics can also help the enterprise to identify an opportunity to launch new products or services into the market at the right time. It can highlight various marketing avenues that the enterprises can try. In practically any other business process you can name, decisions are made every day, some large, but many small. It is this systematic pushing of high-quality information down the chain of command, into the hands of individual contributors that helps them improve decisions made at all levels of the organization. Called Operational Intelligence, it is used more commonly for quick lookups and to inform the thousands of decisions that are made every day inside the organization.

4.  Increase Productivity

Having a clean and properly maintained enterprise dataset can help organizations ensure that the employees are making the best use of their time and resources. It can also prevent the staff of the enterprise from contacting customers with out-of-date information or create invalid vendor files in the system by conveniently helping them to work with clean records thereby maximizing the staff’s efficiency and productivity. High-quality data helps reduce the risk of fraud, ensuring the staff has access to accurate vendor or customer data when payments or refunds are initiated.

5.  Increase Revenue

Business enterprises that work on improving the consistency and increasing the accuracy of their data can drastically improve their response rates which results in increased revenue. Clean data can help business enterprises to significantly reduce the number of returned mails. If there are any time-sensitive information or promotions that the enterprise wants to convey to their customers directly, accurate information can help in reaching the customers conveniently and quickly.

Duplicate data is another aspect which can be effectively eradicated by data cleansing. According to Sirius Decisions, the financial impact of duplicate data is directly proportional to the time that it remains in the database.

Duplicate data can significantly drain the enterprise’s resources as they will have to spend twice as much on a single customer. For example, if multiple mails are sent to the same customer, they might get annoyed and might completely lose interest in the enterprise’s products and services.

[1] “People with Crappy Data” is an alternate interpretation coined by some of my clients.

Polk County Schools Case Study in Data Analytics

We’ll send it to your inbox immediately!

Polk County Case Study for Data Analytics Inzata Platform in School Districts

Get Your Guide

We’ll send it to your inbox immediately!

Guide to Cleaning Data with Excel & Google Sheets Book Cover by Inzata COO Christopher Rafter