Business IntelligenceData PrepData Quality

Chapter 3 – Cleaning Data with Excel and Google Sheets

By May 16, 2019 July 24th, 2019 No Comments

Chapter 3: How Spreadsheets Became The #1 BI Tool in the World

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Microsoft Excel and Google Sheets are the first choice of many users when it comes to handling large amounts of data. They’re readily available, easy to learn and support universal file formats. When it comes to using a spreadsheet application like Excel or Google Sheets, the point is to present data in a neat, organized manner which is easy to comprehend. They’re also on nearly everyone’s desktop, and were probably the first data-centric software tool any of us learned.

In this eBook, we are going to tell you some of the tips as to how to clean and prep up your data using Excel and Google Sheets, and make it accurate and consistent, and make it look elegant, precise, and user-friendly.

spreadsheet software

Risks of Cleaning Data In Spreadsheets

While spreadsheet tools are quite adequate for many small to mid-level data chores, there are some important risks to be aware of. Spreadsheets are desktop-class, file-oriented tools which means their entire data contents are stored in volatile RAM while in use and on disk while you’re not using them. That means that between saves, the data is stored in RAM, and can be lost.

Spreadsheet tools also lack any auditing, change control, and meta-data features that would be available in a more sophisticated data cleaning tool. These features act as backstops for any unintended user error. Caution must be exercised when using them as multiple hours of work can be erased in a microsecond.

Unnoticed sorting and paste errors can also tarnish your hard work. If the data saves to disk while in this state, it can be very hard, if not impossible, to undo the damage and revert to an earlier version.

Spreadsheets also lack repeatable processes and automation. If you spend 8 hours cleaning a data file one month, you’ll have to repeat nearly all of those steps the next time another refreshed data file comes along. More purpose-designed tools like Inzata Analytics allow you to record and script your cleaning activities via automation. Data is also staged throughout the cleaning process, and rollbacks are instantaneous. You can set up data flows that automatically perform cleaning steps on new, incoming data. Basically, this lets you get out of the data cleaning business almost permanently.

Performance and Size Limits in Spreadsheet Tools

Most folks don’t bother to check the performance limits in Spreadsheet tools before they start working with them. That’s because the majority won’t run up against them. However, if you start to experience slow performance, it might be a good idea to refer to the limits below to measure where you are and make sure you don’t start stepping beyond them. Like I said above, spreadsheet tools are fine for most small data, which will suit the majority of users.

 

Excel Limits

Excel is limited to 1,048,576 rows by 16,384 columns in a single worksheet.

  • A 32-bit Excel environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process.
  • 64-bit Excel is not subject to these limits and can consume as much memory as you can give it. A data model’s share of the address space might run up to 500 – 700 megabytes (MB), but could be less if other data models and add-ins are loaded.

 

 

Google Sheets Limits

  • Google Spreadsheets are limited to 5,000,000 cells, with a maximum of 256 columns per sheet. (Which means the rows limit can be as low as 19,231, if your file has a lot of columns!)
  • Uploaded files that are converted to the Google spreadsheets format can’t be larger than 20 MB and need to be under 400,000 cells and 256 columns per sheet.

In real-world experience, running on midrange hardware, Excel can begin to slow to an unusable state on data files as small as 50mb-100mb. Even if you have the patience to operate in this slow state, remember you are running at redline. Crashes and data loss are much more likely!

If you believe you will be working with larger data, why not check out a tool like Inzata, designed to handle profiling and cleaning of larger datasets?

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)