The 3 Things You Need To Know If You Work With Data In Spreadsheets
Microsoft Excel and Google Sheets are the first choice of many users when it comes to working with data. They're readily available, easy to learn, and support universal file formats; for many of us, a spreadsheet was the first data-centric software tool we ever learned. The point of a spreadsheet is to present data in a neat, organized way that's easy to comprehend. Whether you're using Excel or Google Sheets, you want your data cleaned and prepped: accurate, consistent, elegant, precise, and user-friendly.
But there's a downside. Popular as they are, spreadsheets are far from the perfect tool for working with data. Here are the top 3 things you need to be aware of if you work with data in spreadsheets.
While spreadsheet tools are quite adequate for many small to mid-sized data chores, there are some important risks to be aware of. Spreadsheets are desktop-class, file-oriented tools: their entire data contents live in volatile RAM while in use, and on disk only after you save. That means that between saves, your data exists only in RAM and can be lost in a crash.
Risk #1: Beware of Performance and Data Size Limits in Spreadsheet Tools
Most people don't check the performance limits of spreadsheet tools before they start working with them, because the majority will never run up against them. However, if you start to experience slow performance, it's a good idea to check the limits below and make sure you aren't stepping beyond them. As noted above, spreadsheet tools are fine for most small data, which suits the majority of users.
But at some point, if you keep working with larger and larger data, you’re going to run into some ugly performance limits. When it happens, it happens without warning and you hit the wall hard.
Excel Limits
- Excel is limited to 1,048,576 rows by 16,384 columns in a single worksheet.
- A 32-bit Excel environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process.
- 64-bit Excel is not subject to these limits and can consume as much memory as you can give it. A data model's share of the address space might run up to 500–700 megabytes (MB), but could be less if other data models and add-ins are loaded.
Google Sheets Limits
- Google Spreadsheets are limited to 5,000,000 cells, with a maximum of 256 columns per sheet. (Which means the row limit can be as low as 19,531 if your file has a lot of columns!)
- Uploaded files that are converted to the Google spreadsheets format can’t be larger than 20 MB and need to be under 400,000 cells and 256 columns per sheet.
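If you want to know in advance whether a file will fit, you can check its dimensions against these limits before opening it. Below is a minimal Python sketch using only the standard library; the limit constants simply restate the figures above (vendors do change them over time), and the function names are our own invention:

```python
import csv

# Limits as quoted in this article -- double-check current vendor docs.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384
SHEETS_MAX_CELLS = 5_000_000
SHEETS_MAX_COLS = 256

def fits_in_spreadsheet(n_rows, n_cols):
    """Report whether a dataset of this shape fits each tool's hard limits."""
    return {
        "excel": n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS,
        "google_sheets": (n_rows * n_cols <= SHEETS_MAX_CELLS
                          and n_cols <= SHEETS_MAX_COLS),
    }

def csv_shape(path):
    """Count rows and columns of a CSV without loading it all into memory."""
    n_rows, n_cols = 0, 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            n_rows += 1
            n_cols = max(n_cols, len(row))
    return n_rows, n_cols
```

For example, a file with 1,000,000 rows and 300 columns would still open in Excel, but at 300 million cells it is far beyond the Google Sheets ceiling.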
In real-world experience on midrange hardware, Excel can slow to an unusable state on data files as small as 50–100 MB. Even if you have the patience to keep working in this state, remember you're running at redline: crashes and data loss are much more likely!
(If you’re among the millions of people who have experienced any of these, or believe you will be working with larger data, why not check out a tool like Inzata, designed to handle profiling and cleaning of larger datasets?)
Risk #2: There’s a real chance you could lose all your work just from one mistake
Spreadsheet tools lack the auditing, change-control, and metadata features found in more sophisticated data cleaning tools. Those features are designed to act as backstops against unintended user error; without them, multiple hours of work can be erased in an instant, so exercise caution.
Accidental sorting and paste errors can also ruin your hard work, and sort errors are incredibly difficult to spot. If you forget to include a critical column in the sort, you've just corrupted your entire dataset. If you're lucky enough to catch it, you can undo it; if not, that dataset is ruined, along with all of the work you just did. And if the file is saved to disk in this state, the damage can be very hard, if not impossible, to undo.
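To see why a partial sort is so destructive, here's a small Python sketch (the names and sales figures are made up) contrasting a whole-row sort with a single-column sort of the kind a spreadsheet will happily perform in place:

```python
# Hypothetical mini-dataset: each inner list is one row (name, sales).
rows = [
    ["carol", 300],
    ["alice", 100],
    ["bob",   200],
]

# Safe sort: reorder whole rows, keeping each name with its sales figure.
safe = sorted(rows, key=lambda r: r[0])

# Unsafe "sort": sorting only the name column -- what happens when you
# select a single column in a spreadsheet and sort it in place.
names = sorted(r[0] for r in rows)
sales = [r[1] for r in rows]            # left in the original order
broken = [[n, s] for n, s in zip(names, sales)]

# In `broken`, alice is now paired with carol's 300: silent corruption.
```

The broken table still looks perfectly tidy, which is exactly why these errors slip past review.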
Risk #3: Spreadsheets Aren’t Really Saving You Any Time
Spreadsheets are fine if you only have to clean or prep data once, but that's rarely the case. Data is always refreshing; new data is always coming online. Spreadsheets lack any kind of repeatable process or intelligent automation.
If you spend 8 hours cleaning a data file one month, you’ll have to repeat nearly all of those steps the next time a refreshed data file comes along.
Spreadsheets can be pretty dumb sometimes. They lack the ability to learn, and they rely 100% on human intelligence to tell them what to do, which makes them very labor-intensive.
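For contrast, here's a minimal sketch of what a repeatable cleaning process looks like in code: each step is a plain function, so the same sequence can be rerun on every refreshed file instead of being redone by hand. The field names and steps here are hypothetical examples:

```python
# Each cleaning step takes one record (a dict) and returns a cleaned copy.

def strip_whitespace(record):
    """Trim stray spaces from every string field."""
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()}

def lowercase_email(record):
    """Normalize the (hypothetical) email field to lowercase."""
    record = dict(record)
    if "email" in record:
        record["email"] = record["email"].lower()
    return record

# The pipeline itself is just an ordered list of steps.
CLEANING_STEPS = [strip_whitespace, lowercase_email]

def clean(records):
    """Apply every cleaning step, in order, to each record."""
    for step in CLEANING_STEPS:
        records = [step(r) for r in records]
    return records
```

When next month's file arrives, cleaning it is one call to `clean()` rather than eight hours of manual edits, and adding a new rule means appending one function to the list.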
More purpose-designed tools like Inzata Analytics allow you to record and script your cleaning activities via automation. AI and Machine Learning lets these tools learn about your data over time. Your data is also staged throughout the cleaning process, and rollbacks are instantaneous. You can set up data flows that automatically perform cleaning steps on new, incoming data. Basically, this lets you get out of the data cleaning business almost permanently.