
The 5 Classic Excel Tricks Every Analyst Should Know

While numerous analytics and business intelligence packages now dot the landscape, for many people Excel remains their weapon of choice when it comes to data analysis. Working in Excel means more than just reading through a spreadsheet. Excel is a powerful system in its own right, and every analyst should know the following 5 tricks that will allow them to get the most out of their work.

1. Data Cleaning in Excel

One of the most important jobs when looking at a spreadsheet is data cleaning. Fortunately, there are several built-in solutions for data cleaning in Excel.

Before you can get any work done, you’ll want to make sure the cells are properly formatted. To accomplish this, you can use the conversion tools that are available when you right-click on any cell. This gives you a slew of options, but the two big ones are:

  • Number stored as text
  • Convert to number

Respectively, those two allow you to either treat a number as text or to make sure a cell with numbers in it can be read as a number. Especially if you’ve imported something in scientific notation, this can simplify a typically painstaking task.
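For example, assuming the imported values sit in column A starting at row 2 (a hypothetical layout), a pair of simple formulas handles the conversion in either direction, since changing the cell format alone doesn’t always fix the underlying value:

  • =VALUE(A2) turns the text “1234” into the number 1234
  • =TEXT(A2, "0") turns the number in A2 back into a text string, using whatever format code you supply

Copy the formula down a helper column, then paste the results back over the originals as values.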

2. PivotTable and Unpivot

When it comes to data analysis in Excel, the simplest methods are often the best ones. Few are quite as good as PivotTable and Unpivot, operations that are two sides of the same coin.

Suppose you have a table of data that needs to be condensed into a simpler second table. For example, you might want to tally all of your social media visitors by region. With a column labeled “region” in the original table, a PivotTable will build a second table that condenses those rows into one summary row per region. If you need to accomplish the reverse, turning a summary table back into rows, simply use Unpivot.
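To make the idea concrete, here is a minimal sketch assuming regions are in column A and visit counts in column B (a hypothetical layout): putting Region in the PivotTable’s Rows area and Sum of visits in its Values area produces, for each region, the same number you would get from a formula such as

  =SUMIF(A:A, "East", B:B)

The PivotTable simply repeats that tally automatically for every region it finds, and can be refreshed when the source data changes.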

3. INDEX and MATCH

Finding a specific entry can be a genuine pain, especially if you’re dealing with rows and columns that are inconsistent. INDEX and MATCH are the tools that let you describe where a value sits rather than hard-coding its address. For example, =INDEX(B6:O6, 3) returns the third entry in the range B6:O6, which is the value in cell D6. It might not seem like a big deal when you first hear about it, but INDEX can massively reduce headaches when dealing with tables that are constantly changing.

MATCH is much easier to understand: it returns the position of a value within a range. If you need to find, for example, where the February entry sits in a row of headers, =MATCH("Feb", B6:Z6, 0) hunts down the position of the matching cell in the range provided.
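The two functions really shine when combined. As a small sketch, assuming the month labels run across row 6 and the values you care about sit in row 7 (a hypothetical layout):

  =INDEX(B7:Z7, MATCH("Feb", B6:Z6, 0))

MATCH finds which column holds “Feb”, and INDEX returns the value at that same position in the row below, so you never have to hard-code a column letter.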

4. SUMIF and COUNTIF

The SUM and COUNT tools are among the first weapons analysts learn when using Excel. You can take them to the next level with their IF versions. SUMIF, for example, only adds a value to the total when it meets the condition you supply, which lets you tally entries only if they match specific criteria. It’s also possible to go one level higher and use SUMIFS to set multiple criteria across multiple ranges.
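As a minimal sketch, assuming regions in column A, months in column B, and visit counts in column C (a hypothetical layout):

  • =COUNTIF(A2:A100, "East") counts how many rows belong to the East region
  • =SUMIF(A2:A100, "East", C2:C100) sums the visit counts for just those rows
  • =SUMIFS(C2:C100, A2:A100, "East", B2:B100, "Feb") sums visits only where the region is East and the month is Feb

Note that SUMIFS takes the sum range first, followed by as many range/criteria pairs as you need.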

5. VLOOKUP

Hunting through giant spreadsheets can cross a line at which it ceases to be humanly possible. A lot of people end up either building massive chains of formulas, or they might even break out VBA to do the job programmatically.

This is where VLOOKUP enters the game, because programmatic solutions are rarely necessary in Excel. It’s a fairly straightforward tool that operates as =VLOOKUP(target, table, index). Whatever you want to look up is the target, and the table is the range you mean to search. VLOOKUP always searches the first column of that table; the index tells Excel which column of the table to return the value from, counting from one at the left.
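For example, assuming a hypothetical product table in A2:D500 with names in the first column and prices in the third:

  =VLOOKUP("Widget", A2:D500, 3, FALSE)

searches the first column for “Widget” and returns the matching value from the third column. The optional fourth argument, FALSE, forces an exact match instead of the default approximate one, which is almost always what you want.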



Big Data Time Killers That Could Be Costing You

Putting big data systems to work across varying companies and industries reveals one thing they all have in common: almost all forms of big data work end up being time-demanding. This cuts into productivity in many ways, with the most obvious being that less time can be allocated to analysis.

To address the problem, the first step is to identify the varieties of time killers that often occur during these projects. Let’s take a look at four of the most significant as well as solutions to avoid them.

Data Acquisition and Preparation

One of the most easily recognized time killers is the effort that goes into simply collecting data and preparing it for use. This occurs for a host of reasons, including:

  • Difficulty finding reliable sources
  • Inability to license data
  • Poorly formatted information
  • The need for redundancies in checking the data
  • The processing time required to go through massive datasets

Solutions run the gamut from paying third parties for data to creating machine learning systems that can handle prep work. Every solution has an upfront cost in terms of either money or time, but the investment can pay off generously if you’re going to reuse the same systems well into the future.

Lack of Coordination

Another problem is that lack of coordination can lead to various parties within a company repeating the same efforts without knowing it. If an organization lacks a well-curated data lake, someone in another division might not realize they could have easily acquired the necessary information from an existing source. Not only does this cost time, but it can become expensive as storage requirements are needlessly doubled.

Similarly, people often forget to contribute to archives and data lakes when they wrap projects up. You can have the most advanced system in the world, but it means nothing if the culture in your company doesn’t emphasize the importance of cataloging datasets and making them available for future use.

Not Knowing How to Use the Analytics Tools

Even the best data scientists will find themselves poking and prodding to get a system to work. Some of this issue is inherent to the job, as data science tends to reward curious people who are self-taught and forward-thinking. Unfortunately, this is time spent on work that a company shouldn’t be paying for.

Likewise, a lack of training can lead to inefficient practices. If you’ve ever used a computer program for years only to learn that there was a shortcut for doing something you had handled repeatedly over that time, you know the feeling. This wasted time adds up and can become considerable in the long run.

Here, the solution is simple. The upfront cost of training is necessary to shorten the learning curve. A company should establish standards and practices for using analytics tools, and there should be at least one person dedicated to passing on this knowledge through classes, seminars, and other training sessions.

Poorly Written Requirements for Projects

When someone sits down with the project requirements, they tend to skim the broad strokes, identify problem areas, and then get to work. A poorly written document can leave people floundering for weeks before they even figure out what’s wrong. In the best-case scenario, they come back to you and address the issue. In the worst-case scenario, they never catch the issue, and it eventually skews the final work product.

Requirements should include specifics like:

  • Which tools should be used
  • Preferred data sources
  • Limits on the scope of analysis
  • Details regarding must-have features

It’s always better to go overboard with instructions and requirements than to not provide enough specifics.

Conclusion

It’s easy during a big data project to get focused on collecting sources, processing data, and producing analysis. How you and your team go about doing these things, though, is just as important as getting them done. Every business should have processes in place for weeding out the time killers in projects and ultimately making them more streamlined. This may include project reviews in which team members are prompted to describe the issues they encountered. By taking this approach, you can reduce the amount of time spent on mundane tasks and increase the amount of work that goes into analysis and reporting.



Disparate Data: The Silent Business Killer

Data can end up in disparate spots for a variety of reasons. Deliberate actions can be taken in the interest of not leaving all your eggs in one basket. Some organizations end up in a sort of data drift, rolling out servers and databases for different projects until each bit of data is its own island in a massive archipelago.

Regardless of how things got this way at your operation, there are a number of dangers and challenges to this sort of setup. Let’s take a look at why disparate data can end up being a business killer.

Multiple Points of Failure

At first blush, this can seem like a pro. The reality, however, is that cloud computing and cluster servers have made it possible to keep your data in a single pool while not leaving it subject to multiple points of failure.

Leaving your data in disparate servers poses a number of problems. First, there’s a risk that the failure of any one system might wipe information out for good. Second, it can be difficult to collect data from all of the available sources unless you have them accurately mapped out. Finally, you may end up with idle resources operating and wasting energy long after they’ve outlived their utility.

It’s best to get everything onto a single system. If you want some degree of failure tolerance beyond using clouds or clusters, you can set up a separate archive to store data at specific stages of projects. Once your systems are brought up to speed, you’ll also begin to see significant cost savings as old or excess servers go offline.

Inconsistency

With data spread out across multiple systems, there’s a real risk that things won’t be properly synchronized. At best this ends up being inefficient. At worst it may lead to errors getting into your finished work products. For example, an older dataset from the wrong server might end up used by your analytics packages. Without the right checks in place, the data could be analyzed and put into reports, producing flawed business intelligence and decision-making.

Likewise, disparate data can lead to inconsistency in situations where multiple teams are working. One group may have its own datasets that don’t line up with what another team is using. By centralizing your efforts, you can guarantee that all teams will be working with the same data.

Bear in mind that inconsistency can get very far out of hand. If you need to audit data for legal purposes, for example, you may find data that has been retained too long, poorly anonymized or misused. With everything centralized, you’ll have a better chance of catching such problems before they create trouble.

Security Risks

More systems means more targets. That opens you up to more potential spots where hackers might get their hands on sensitive data. Similarly, you’re stuck with the challenge of patching multiple servers when exploits are found. In the worst scenario, you may not even notice a breach because you’re trying to juggle too many balls at the same time. Simply put, it’s a lot of work just to end up doing things the wrong way.

Turf Wars and Company Culture

When different departments are in control of different data silos, it’s likely that each group will start to see the data within its control as privileged. It’s rare that such an attitude is beneficial in a company that’s trying to develop a data-centric culture. Although you’ll want access to be limited to appropriate parties, there’s a big difference between doing that in a structured and well-administered manner versus having it as the de facto reality of a fractured infrastructure.

Depending on how culturally far apart the departments in a company are, these clashes can create major friction. One department may have an entirely different set of security tools. This can make it difficult to get threat monitoring onto a single, network-wide system that protects everyone.

Conflicts between interfaces can also make it difficult for folks to share. By building a single data pool, you can ensure greater interoperability between departments.

Conclusion

Consolidating your data systems allows you to create a tighter and more efficient operation. Security can be improved rapidly, and monitoring of a single collection of data will allow you to devote more resources to the task. A unified data pool can also foster the right culture in a company. It takes an investment of time and effort to get the disparate data systems under control, but the payoff is worth it.



The Five Ways Dirty Data Costs Businesses Money

Dirty data in their systems costs U.S. companies anywhere from $2.5 to $3.1 trillion each year. Errors and omissions in master data in particular are notorious for causing costly business interruptions. And there’s no insurance you can buy against dirty data. It’s a simple fact of life many businesses grudgingly live with, but barely acknowledge. Our goal in this piece is to help you understand the areas where dirty data causes profit leakages in businesses, how to recognize them, and a little about what you can do about them.

Here are five ways dirty data could be costing your business money….

1. Wrong Conclusions and Time Wasted

Stop me if you’ve heard this one before:

Analyst goes into a meeting with their first bright, shiny new dashboard from the new multi-million dollar Data Warehouse.

A few minutes in, one executive starts to notice an anomaly in the data being presented. Something doesn’t add up, so they pull up their system and check it. Yes, definitely a mismatch.

Smelling blood in the water, other employees start to pile on until the poor analyst is battered and beaten, all for just doing their job.

This scenario plays out every day in companies across the US when even slightly dirty data is unknowingly used for analytics. The way most businesses detect the problem is to run right smack into it.

Apart from this disastrous meeting, which has been a waste of time, the BI team might spend months debating their findings with a disbelieving business audience. The net result: lots of time wasted, incorrect conclusions from analysis, and eventually nobody really trusts the data.

2. Operational Interruption

Dirty data and operational mistakes go hand in hand to cost businesses trillions every year.

The address was wrong, so the package wasn’t delivered.
The payment was applied to the wrong duplicate account.
The callback never reached the client because the number was wrong.

On the bright side, operational errors due to bad data often get addressed first and often, because they’re so visible. They are the squeaky wheel of the organization, and for good reason.

If you’re trying to improve operational efficiency, make sure you start with data that is as clean as possible. And protect your data to keep it clean. Don’t import records into your CRM until you’ve checked them. Your operation is a clean, pristine lake with a healthy ecosystem. Dirty data is toxic pollution that will disrupt that natural harmony.

3. Missed Opportunities

In our opinion, this one is the costliest of all by far, but it flies below the radar since it’s rooted in opportunity cost. It deserves far more attention.

When a company lapses into and accepts a culture of “We have dirty data”, lots of great ideas for new initiatives never get off the ground, which results in billions of dollars in missed commercial opportunity every year.

New business ideas and innovations for current practices are constantly shot down because “We don’t have that data.” Or “We have that data, but it’s only 50% accurate.” Even worse, sometimes these innovative new ventures proceed into the incubator stage with high startup costs, only to explode on the launchpad because the data can’t be trusted or turns out to be dirty.

4. Poor Customer Experience

Every executive will admit customers are the #1 priority. Customers are also, of course, real people. But to your front-line sales and service reps, the ones actually interacting with customers by phone and email, the crucial link is the data your company holds about each customer.

Think about it: the outcome of every service call, product order, or subscription purchase is based in large part on the data your company has on its customers. If that data is inconsistent across customers, or just downright dirty and inaccurate, bad things start to happen. Products ship to the wrong address. The wrong products are recommended. Returns go to the wrong place. Sales calls go to old, disconnected numbers. Inaccurate bills go out, and payments are applied incorrectly.

If your business is one with multiple departments and business lines, clients can start to feel pretty underappreciated when one department knows their birthday and children’s names, and another can barely look up their account number by last name.

5. Time Wasted Cleaning It Up

Cleaning up dirty data is the first step in eradicating it. But it is a terribly time consuming process, and often very manual. Cleaning dirty data that’s been neglected for years can take years itself and is tedious and costly. Appending and fixing contact information can cost as much as one dollar per record. The average cost to clean up one duplicate record ranges from $20-$100 with everything factored in. Cleaning up thousands of duplicates and incomplete rows must be done carefully to avoid compounding the errors.
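Spreadsheets can at least help with the triage before the heavy lifting starts. As a minimal sketch, assuming customer email addresses sit in column B (a hypothetical layout), a helper column like

  =IF(COUNTIF(B$2:B2, B2)>1, "possible duplicate", "")

copied down the sheet flags second and later occurrences of each address while leaving the first one untouched, so suspected duplicates can be reviewed before anything is merged or deleted.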

The cleanup of dirty data starts with a good understanding of the root causes, and it takes time to forensically analyze what went wrong and when. Cleaning up dirty data is one step in a larger process, but it has the potential to wreck everything and force you into a reset. Worse, there’s the very real possibility that the issues go undetected and somehow end up in a final work product. Rest assured that someone will read your work product and see what’s wrong with it.

Often what’s wrong with the data is not fully understood and some cleaning efforts actually make it worse. (Ever have a sort error on a table column get loaded back into production? Fun times.)

It’s best to position the cleaning of data early in your larger set of processes. This means planning out how data will be processed and understanding what can’t be properly digested by downstream databases, analytics packages and dashboards. While some problems, such as issues with minor punctuation marks, can be handled in post-processing, you shouldn’t assume this will happen. Always plan to clean data as early as possible.

Luckily, we are seeing new strides in Artificial Intelligence that make this process easier and reduce the time from years down to days and weeks.

Automated Data Profiling (https://qastaging.wpengine.com/products-data-analysis-tools/) can shave months off the “finding out what’s wrong” phase of a data cleanup, giving a statistical readout of each issue by category so the problems can be prioritized and addressed in the right order.

Automated Data Enrichment (https://qastaging.wpengine.com/products-data-analysis-tools/data-enrichment/) and data append help with deduplication and merging of duplicate records.

Finally, Automated Data Modeling (https://qastaging.wpengine.com/products-data-analysis-tools/aipowered-data-modeling-augmented-data-management/) helps to round out the view of key entities, resulting in a more consistent customer experience, for example.


Data Cleaning: Why it’s Taking Up Too Much Time

A major part of most data projects is making sure that the inputs have been properly cleaned. Poorly formatted input data can quickly lead to a cascade of problems. Worse, errors can go completely undetected if the faults in the data don’t lead to faults in the process.

On the flip side, data cleaning can end up eating up a lot of your time. It’s a good idea to think about why that is and how you might be able to remedy the issue.

Why Data Cleaning is So Time-Consuming

A big problem when it comes to fixing data up for use is that there are often mismatches between the source format and the format used by the system processing the information. Something as simple as dealing with the use of semicolons and quotes in a CSV file will still add to the time required to clean data for a project.

It’s hard to anticipate all the ways things can be wrong with source data. User-contributed data, for example, may not be highly curated. Rather than getting highly clean inputs, you may get all kinds of characters that have the potential to interfere with reading and processing.

Security features also can drive the need for data cleaning. Web-submitted data is often automatically scrubbed to prevent SQL injection attacks. While doing data cleaning, it’s often necessary to reverse this process to get at what the original inputs looked like.
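As a rough sketch of what that reversal can look like in a spreadsheet, assuming the scrubbed text sits in cell A2 and only a few common HTML entities are involved (both assumptions are hypothetical):

  =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2, "&quot;", """"), "&#39;", "'"), "&amp;", "&")

Each nested SUBSTITUTE swaps one escaped entity back to its original character. Real scrubbing schemes vary, so the exact entities you need to reverse depend on the system that did the escaping.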

Cultural differences can present major problems in cleaning data, too. Even simple things like postal codes can create trouble. A US ZIP code is always either a 5-digit input or 5 digits followed by a dash and four more digits. In Canada, postal codes use both letters and numbers, and there are spaces.
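A hedged sketch of a validity check for the US case, assuming the raw codes sit in column A as text (a hypothetical layout): Google Sheets can lean on regular expressions, while a plain Excel formula has to approximate the rule with length and digit tests.

  • Google Sheets: =REGEXMATCH(TO_TEXT(A2), "^\d{5}(-\d{4})?$")
  • Excel: =AND(LEN(A2)=5, ISNUMBER(VALUE(A2))) covers the basic 5-digit case; the ZIP+4 variant needs a longer test or a cleanup tool

Canadian codes would need a different pattern entirely, which is exactly why mixed-locale data adds cleaning time.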

End-users of web applications often enter inputs regardless of whether they fit the database’s format. In the best scenario, the database software rejects the entry and alerts the user. There are also scenarios, though, where the input is accepted and ends up in a field that’s mangled, deprecated or just blank.

Considerations During Data Cleaning

A major question that has to be asked at the beginning of an effort is how much data you can afford to lose. For example, a dataset of fewer than 1,000 entries is already dangerously close to becoming statistically too small to yield relevant results. One approach is to simply toss out all the bad lines. But if a quarter of the entries are visibly flawed and 10% more have machine-readability issues, it’s not long before you’re shaving off a third of the dataset before processing even starts. Pre-processing may shave off even more data through things like the removal of outliers and duplicates.
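One low-tech way to see what a toss-out rule would actually cost, sketched here for a sheet with data in columns A through F (a hypothetical layout), is a helper column that labels each row before anything is deleted:

  =IF(COUNTBLANK(A2:F2)>0, "incomplete", "ok")

Counting the “incomplete” labels against the total row count tells you what fraction of the dataset you would be giving up, so the decision is made with numbers rather than a guess.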

Barring extreme limits on time or capabilities, your goal should be to preserve as much of the original information as practicable. There are several ways to tackle the task, including doing:

  • Manual checks of the data to identify obvious problems
  • Dry runs of processing and post-processing work to see how mangled or accurate the output is
  • Standard pre-processing methods to spot common problems, such as unescaped or escaped characters and HTML entities
  • Machine learning work to recognize patterns in poorly formatted data

While it might be possible to acquire pre-cleaned data from vendors, you’ll still need to perform the same checks because you should never trust inputs that haven’t been checked.

Planning Ahead

Data cleaning is one step in a larger process, but it has the potential to wreck everything and force you into a reset. Worse, there’s the very real possibility that the issues go undetected and somehow end up in a final work product. Rest assured that someone will read your work product and see what’s wrong with it.

It’s best to position the cleaning of data early in your larger set of processes. This means planning out how data will be processed and understanding what can’t be properly digested by downstream databases, analytics packages and dashboards. While some problems, such as issues with minor punctuation marks, can be handled in post-processing, you shouldn’t assume this will happen. Always plan to clean data as early as possible.

With a structured process in place, you can operate with cleaner datasets. This will save time and money, and it will also reduce storage overhead. Most importantly, it will ensure you keep the largest dataset possible and get the most relevant analysis that can be derived from it.


What is Data Cleaning?


Let’s face it: most data you’ll encounter is going to be dirty. Dirty data yields inaccurate results and is worthless for analysis until it’s cleaned up. Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. There are several methods for data cleansing, depending on how the data is stored and the answers being sought.

Data cleansing is not simply about erasing information to make space for new data, but rather finding a way to maximize a data set’s accuracy without necessarily deleting information.

Data cleansing is the process of spotting and correcting inaccurate data. Organizations rely on data for many things, but few actively address data quality, whether it’s the integrity of customer addresses or the accuracy of invoices. Ensuring effective and reliable use of data can increase the intrinsic value of the brand, so business enterprises must assign importance to data quality.

Since there are many different types of data quality issues, each one requires different tactics to clean it up.

For one, data cleansing involves more than removing data: it includes fixing spelling and syntax errors, standardizing data sets, correcting mistakes such as missing codes and empty fields, and identifying duplicate records. Data cleaning is considered a foundational element of data science basics, as it plays an important role in the analytical process and in uncovering reliable answers.

Most importantly, the goal of data cleansing is to create datasets that are standardized and uniform to allow business intelligence and data analytics tools to easily access and find the right data for each query.

Why Clean Data?

A data-driven marketing survey conducted by Tetra data found that 40% of marketers do not use data to its full effect. Managing and ensuring that the data is clean can provide significant business value.

Improving data quality through data cleaning can eliminate problems like expensive processing errors, manual troubleshooting, and incorrect invoices. Data quality is also a way of life because important data like customer information is always changing and evolving.

Business enterprises can achieve a wide range of benefits by cleansing data and managing quality which can lead to lowering operational costs and maximizing profits.

How To Clean Data

To be considered high-quality, data needs to pass a set of quality criteria. Those include:

  • Validity: The degree to which the measures conform to defined business rules or constraints. When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used (e.g., spreadsheets, where it is very hard to limit what a user chooses to enter into a cell if cell validation is not used). Data cleaning to correct validity issues can often be done programmatically. Data constraints fall into the following categories:
    • Data-type constraints: Values in a particular column must be of a particular data type, e.g., Boolean, numeric (integer or real), date, etc.
    • Range constraints: Typically, numbers or dates should fall within a certain range; that is, they have minimum and/or maximum permissible values.
    • Mandatory constraints: Certain columns cannot be empty.
    • Unique constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.
    • Set-membership constraints: The values for a column come from a set of discrete values or codes. For example, a person’s gender may be Female, Male or Unknown (not recorded).
    • Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the “state” column is required to belong to one of the US’s defined states or territories; the set of permissible states/territories is recorded in a separate States table. The term foreign key is borrowed from relational database terminology.
    • Regular-expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.
    • Cross-field validation: Certain conditions that involve multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must equal 100 (since they are all percentages). In a hospital database, a patient’s date of discharge from hospital cannot be earlier than the date of admission.
  • Accuracy: Quite simply, is the data right. The conformity of a measure to a standard or a true value. Accuracy is very hard to achieve through data cleaning in the general case, because it requires accessing an external source of data that contains the true value: such “gold standard” data is often unavailable. Accuracy has been achieved in some data cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state), and also help verify that street addresses within these zip codes actually exist.
  • Completeness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleaning methodology: one cannot infer facts that were not captured when the data in question was initially recorded. In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates “unknown” or “missing”, but supplying of default values does not imply that the data has been made complete.
  • Consistency: The degree to which a set of measures is equivalent across systems. Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies – e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).
  • Uniformity: The degree to which a set of data measures is specified using the same units of measure in all systems. It is often quite easy to ensure this through data cleaning early in the process, but as the process moves along and data is transformed and changed, it becomes far more difficult. In datasets pooled from different locales, weight may be recorded either in pounds or kilos and must be converted to a single measure using an arithmetic transformation. This also points to the problem of naked metrics, where values like weight may be recorded as a bare integer. Unless there is another column directly next to it, or a notation in the column heading, it can be next to impossible to determine whether a value is in kilos or pounds, or Celsius or Fahrenheit.

The term integrity encompasses accuracy, consistency and some aspects of validation but is rarely used by itself in data-cleansing contexts because it is insufficiently specific.
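Because this guide leans on Excel and Google Sheets elsewhere, here is a hedged sketch of how a few of these criteria translate into ordinary spreadsheet formulas, assuming a record in row 2 with an ID in column A, an amount in column B, a gender code in column C, and admission and discharge dates in columns D and E (all hypothetical):

  • Data type: =ISNUMBER(B2)
  • Range: =AND(B2>=0, B2<=100)
  • Mandatory: =NOT(ISBLANK(A2))
  • Uniqueness: =COUNTIF(A:A, A2)=1
  • Set membership: =ISNUMBER(MATCH(C2, {"Female","Male","Unknown"}, 0))
  • Cross-field: =E2>=D2

Each formula returns TRUE when the row passes the corresponding check, which makes failures easy to filter or count. More elaborate rules, such as foreign keys or regular-expression patterns, are usually better handled in a database or a dedicated data preparation tool.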

How Can I Use Data Cleaning?

Regardless of the type of analysis or data visualizations you need, data cleansing is a vital step to ensure that the answers you generate are accurate. When collecting data from several streams and with manual input from users, information can carry mistakes, be incorrectly inputted, or have gaps.

Data cleaning helps ensure that information always matches the correct fields while making it easier for business intelligence tools to interact with data sets to find information more efficiently. One of the most common data cleaning examples is its application in data warehouses.

After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry; it is performed at the time of entry rather than on batches of data.

A successful data warehouse stores a variety of data from disparate sources and optimizes it for analysis before any modeling is done. To do so, warehouse applications must parse through millions of incoming data points to make sure they’re accurate before they can be slotted into the right database, table, or other structure.

Organizations that collect data directly from consumers filling in surveys, questionnaires, and forms also use data cleaning extensively. In their cases, this includes checking that data was entered in the correct field, that it doesn’t feature invalid characters, and that there are no gaps in the information provided.

More on Data Cleaning:

https://en.wikipedia.org/wiki/Data_cleansing

https://searchdatamanagement.techtarget.com/definition/data-scrubbing


The 3 Things You Need To Know If You Work With Data In Spreadsheets

Microsoft Excel and Google Sheets are the first choice of many users when it comes to working with data. They’re readily available, easy to learn and support universal file formats. When it comes to using a spreadsheet application like Excel or Google Sheets, the point is to present data in a neat, organized manner which is easy to comprehend. They’re also on nearly everyone’s desktop, and were probably the first data-centric software tool any of us learned. Whether you’re using Excel or Google Sheets, you want your data cleaned and prepped. You want it accurate and consistent, and you want it to be elegant, precise, and user-friendly.

But there is a downside. While spreadsheets are popular, they’re far from the perfect tool for working with data. We’re going to explore the Top 3 things you need to be aware of if you work with data in spreadsheets.

While spreadsheet tools are quite adequate for many small to mid-level data chores, there are some important risks to be aware of. Spreadsheets are desktop-class, file-oriented tools which means their entire data contents are stored in volatile RAM while in use and on disk while you’re not using them. That means that between saves, the data is stored in RAM, and can be lost.

Risk #1: Beware of Performance and Data Size Limits in Spreadsheet Tools

Most people don’t check the performance limits in Spreadsheet tools before they start working with them. That’s because the majority won’t run up against them. However, if you start to experience slow performance, it might be a good idea to refer to the limits below to measure where you are and make sure you don’t start stepping beyond them. Like I said above, spreadsheet tools are fine for most small data, which will suit the majority of users.

But at some point, if you keep working with larger and larger data, you’re going to run into some ugly performance limits. When it happens, it happens without warning and you hit the wall hard.

Excel Limits

Excel is limited to 1,048,576 rows by 16,384 columns in a single worksheet.

  • A 32-bit Excel environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process.
  • 64-bit Excel is not subject to these limits and can consume as much memory as you can give it. A data model’s share of the address space might run up to 500 – 700 megabytes (MB), but could be less if other data models and add-ins are loaded.

Google Sheets Limits

  • Google Spreadsheets are limited to 5,000,000 cells, with a maximum of 256 columns per sheet. (Which means the row limit can be as low as roughly 19,531, if your file uses every available column!)
  • Uploaded files that are converted to the Google spreadsheets format can’t be larger than 20 MB and need to be under 400,000 cells and 256 columns per sheet.

In real-world experience, running on midrange hardware, Excel can begin to slow to an unusable state on data files as small as 50 MB to 100 MB. Even if you have the patience to operate in this slow state, remember you are running at redline. Crashes and data loss are much more likely!

(If you’re among the millions of people who have experienced any of these, or believe you will be working with larger data, why not check out a tool like Inzata, designed to handle profiling and cleaning of larger datasets?)

Risk #2:  There’s a real chance you could lose all your work just from one mistake

Spreadsheet tools lack the auditing, change control, and metadata features that would be available in a more sophisticated data cleaning tool. These features are designed to act as backstops for unintended user error. Caution must be exercised when using spreadsheets, as multiple hours of work can be erased in a microsecond.

Accidental sorting and paste errors can also tarnish your hard work. Sort errors are incredibly difficult to spot. If you forget to include a critical column in the sort, you’ve just corrupted your entire dataset. If you’re lucky enough to catch it, you can undo it; if not, that dataset is now ruined, along with all of the work you just did. If the data saves to disk while in this state, it can be very hard, if not impossible, to undo the damage.

Risk #3:  Spreadsheets Aren’t Really Saving You Any Time

Spreadsheets are fine if you just have to clean or prep data once, but that is rarely the case. Data is always refreshing, and new data is coming online. Spreadsheets lack any kind of repeatable process or intelligent automation.

If you spend 8 hours cleaning a data file one month, you’ll have to repeat nearly all of those steps the next time a refreshed data file comes along.

Spreadsheets can be pretty dumb sometimes. They lack the ability to learn. They rely 100% on human intelligence to tell them what to do, making them very labor intensive.

More purpose-designed tools like Inzata Analytics allow you to record and script your cleaning activities via automation. AI and machine learning let these tools learn about your data over time. Data is also staged throughout the cleaning process, and rollbacks are instantaneous. You can set up data flows that automatically perform cleaning steps on new, incoming data. Basically, this lets you get out of the data cleaning business almost permanently.

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)


Chapter 3 – Cleaning Data with Excel and Google Sheets

How Spreadsheets Became The #1 BI Tool in the World

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Microsoft Excel and Google Sheets are the first choice of many users when it comes to handling large amounts of data. They’re readily available, easy to learn and support universal file formats. When it comes to using a spreadsheet application like Excel or Google Sheets, the point is to present data in a neat, organized manner which is easy to comprehend. They’re also on nearly everyone’s desktop, and were probably the first data-centric software tool any of us learned.

In this eBook, we are going to share tips on how to clean and prep your data using Excel and Google Sheets, making it accurate, consistent, elegant, precise, and user-friendly.


Risks of Cleaning Data In Spreadsheets

While spreadsheet tools are quite adequate for many small to mid-level data chores, there are some important risks to be aware of. Spreadsheets are desktop-class, file-oriented tools which means their entire data contents are stored in volatile RAM while in use and on disk while you’re not using them. That means that between saves, the data is stored in RAM, and can be lost.

Spreadsheet tools also lack the auditing, change control, and metadata features that would be available in a more sophisticated data cleaning tool. These features act as backstops for unintended user error. Caution must be exercised when using spreadsheets, as multiple hours of work can be erased in a microsecond.

Unnoticed sorting and paste errors can also tarnish your hard work. If the data saves to disk while in this state, it can be very hard, if not impossible, to undo the damage and revert to an earlier version.

Spreadsheets also lack repeatable processes and automation. If you spend 8 hours cleaning a data file one month, you’ll have to repeat nearly all of those steps the next time another refreshed data file comes along. More purpose-designed tools like Inzata Analytics allow you to record and script your cleaning activities via automation. Data is also staged throughout the cleaning process, and rollbacks are instantaneous. You can set up data flows that automatically perform cleaning steps on new, incoming data. Basically, this lets you get out of the data cleaning business almost permanently.

Performance and Size Limits in Spreadsheet Tools

Most folks don’t bother to check the performance limits in Spreadsheet tools before they start working with them. That’s because the majority won’t run up against them. However, if you start to experience slow performance, it might be a good idea to refer to the limits below to measure where you are and make sure you don’t start stepping beyond them. Like I said above, spreadsheet tools are fine for most small data, which will suit the majority of users.

Excel Limits

Excel is limited to 1,048,576 rows by 16,384 columns in a single worksheet.

  • A 32-bit Excel environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process.
  • 64-bit Excel is not subject to these limits and can consume as much memory as you can give it. A data model’s share of the address space might run up to 500 – 700 megabytes (MB), but could be less if other data models and add-ins are loaded.

Google Sheets Limits

  • Google Spreadsheets are limited to 5,000,000 cells, with a maximum of 256 columns per sheet. (Which means the row limit can be as low as roughly 19,531, if your file uses every available column!)
  • Uploaded files that are converted to the Google spreadsheets format can’t be larger than 20 MB and need to be under 400,000 cells and 256 columns per sheet.

In real-world experience, running on midrange hardware, Excel can begin to slow to an unusable state on data files as small as 50 MB to 100 MB. Even if you have the patience to operate in this slow state, remember you are running at redline. Crashes and data loss are much more likely!

If you believe you will be working with larger data, why not check out a tool like Inzata, designed to handle profiling and cleaning of larger datasets?

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)


Ebook Excerpt: Chapter 2 – Why Clean Data?

Chapter 2: Why Clean Data?

(Excerpt from The Ultimate Guide to Cleaning Data in Excel and Google Sheets)

Data cleansing is the process of spotting and correcting inaccurate data. Organizations rely on data for many things, but few actively address data quality, whether it’s the integrity of customer addresses or the accuracy of invoices. Ensuring effective and reliable use of data can increase the intrinsic value of the brand, so business enterprises must assign importance to data quality.

A data-driven marketing survey conducted by Tetra data found that 40% of marketers do not use data to its full effect. Managing and ensuring that the data is clean can provide significant business value.

Improving data quality can eliminate problems like expensive processing errors, manual troubleshooting, and incorrect invoices. Data quality is also a way of life because important data like customer information is always changing and evolving.

Business enterprises can achieve a wide range of benefits by cleansing data and managing quality which can lead to lowering operational costs and maximizing profits.

Who are the heroes who allow the organization to seize and enjoy all these benefits? I affectionately refer to these poor souls as PWCDs, or People Who Clean Data[1].

These brave people, and hopefully you are reading this because you hope to be one of them, are the noblest. They often get little recognition even though they clean up the messes of hundreds, if not thousands of other people every day. They are the noble janitors of the data world. And I salute them.

Top 5 Benefits of Data Cleaning

1.  Improve the Efficiency of Customer Acquisition Activities

Business enterprises can significantly boost their customer acquisition and retention efforts by cleansing their data regularly. Given the high throughput of the prospecting and lead process, filtering, cleansing, and enriching records to keep the data accurate is essential to its effectiveness. Throughout the marketing process, enterprises must ensure that the data is clean, up-to-date and accurate by regularly following data quality routines. Clean data can also ensure the highest returns on email or postal campaigns, as the chances of encountering outdated addresses or missed deliveries are very low. Multi-channel customer data can also be managed seamlessly, which provides the enterprise with an opportunity to carry out successful marketing campaigns in the future, as they would be aware of the methods to effectively reach their target audience.

2.  Improve Decision-Making Processes

The cornerstone of effective decision making in a business enterprise is data. According to Sirius Decisions, data in an average B2B organization doubles every 12-18 months, and though the data might be clean initially, errors can creep in at any time. In fact, in nearly all businesses where data quality is not managed, data quality decay is constantly at work. Each time new records are added, duplicates may be created. Things happening outside your organization, like customers moving and changing emails and telephone numbers, will degrade data quality over time.

Yet the majority of enterprises fail to prioritize data quality management, or even acknowledge they have a problem! In fact, many of them don’t even have a record of the last time quality control was performed on their customer data. More often than not, they merely discard or ignore data they believe to be of poor quality and make decisions through other means. Here you can see that data quality is a massive barrier to digital transformation and business intelligence, much less every company’s desire to become more data-driven.

Accurate information and quality data are essential to decision making. Clean data can support better analytics as well as all-round business intelligence which can facilitate better decision making and execution. In the end, having accurate data can help business enterprises make better decisions which will contribute to the success of the business in the long run.

3.  Streamline Business Practices

Eradicating duplicate and erroneous data can help business enterprises to streamline business practices and avoid wasteful spending. Data cleansing can also help in determining if particular job descriptions within the enterprise can be changed or if those positions can be integrated somewhere else. If reliable and accurate sales information is available, the performance of a product or a service in the market can be easily assessed.

Data cleansing along with the right analytics can also help the enterprise to identify an opportunity to launch new products or services into the market at the right time. It can highlight various marketing avenues that the enterprises can try. In practically any other business process you can name, decisions are made every day, some large, but many small. It is this systematic pushing of high-quality information down the chain of command, into the hands of individual contributors that helps them improve decisions made at all levels of the organization. Called Operational Intelligence, it is used more commonly for quick lookups and to inform the thousands of decisions that are made every day inside the organization.

4.  Increase Productivity

Having a clean and properly maintained enterprise dataset can help organizations ensure that the employees are making the best use of their time and resources. It can also prevent the staff of the enterprise from contacting customers with out-of-date information or create invalid vendor files in the system by conveniently helping them to work with clean records thereby maximizing the staff’s efficiency and productivity. High-quality data helps reduce the risk of fraud, ensuring the staff has access to accurate vendor or customer data when payments or refunds are initiated.

5.  Increase Revenue

Business enterprises that work on improving the consistency and increasing the accuracy of their data can drastically improve their response rates, which results in increased revenue. Clean data can help business enterprises significantly reduce the number of returned mailings. If there is any time-sensitive information or promotion that the enterprise wants to convey to its customers directly, accurate information helps reach those customers conveniently and quickly.

Duplicate data is another aspect which can be effectively eradicated by data cleansing. According to Sirius Decisions, the financial impact of duplicate data is directly proportional to the time that it remains in the database.

Duplicate data can significantly drain the enterprise’s resources as they will have to spend twice as much on a single customer. For example, if multiple mails are sent to the same customer, they might get annoyed and might completely lose interest in the enterprise’s products and services.

[1] “People with Crappy Data” is an alternate interpretation coined by some of my clients.


The Immense Value Behind Data Enrichment with Secondary Data

Techopedia defines data enrichment as “processes used to enhance, refine or otherwise improve raw data.” Raw data is just the seed, and data enrichment is the light needed to grow it into a strong, useful, and valuable mechanism for your business.

Ultimately, the goal of data enrichment is to boost the data that you are currently storing with secondary data. Whether it is at the point of capture or after the data is accumulated, adding insights from reliable information sources is where the real value is gained. In other words, data enrichment is a journey of transforming your raw, commodity data into a true asset for your organization, project, or research.

Refining raw data should include the following steps:

  • Removing errors such as null or duplicate values
  • Using data profiling to clarify the content, relationships, and structure of the data
  • Improving the data quality overall to increase its reliability and analytical value
  • Strategically adding additional attributes, relationships, and details that uncover new insights around your customers, operations, and competition from secondary data

Data refinement avoids the negative outcomes of attempting to work with bad data. Low quality data can have serious negative impacts on your project. It can needlessly increase costs, waste precious time, cripple important decision making, and even anger clients or customers.

During or after the refinement of your data, enriching it with advanced data dimensions such as detailed time frames, geography details, weather history, and even a wide variety of customer demographics from multiple secondary data libraries is key to unleashing its true value to your company, customers, and shareholders.

  • What if you could predict which clients are most likely to buy, and exactly how much they will spend, just from their initial lead profile?
  • What if you could identify the key success characteristics of a new market or store location, just from viewing the demographics of the area?
  • How much easier would day-to-day decisions become if you could consider all of the factors involved, instead of just a few?

You will acquire a better and more complete understanding of your prospects and target market. You will learn more about your market by appending business information to the records that you capture and store, pinpointing key sociodemographic groups of business prospects, or improving efficiencies across your business units.

Most would agree that data enrichment with secondary data is valuable, but why do fewer than 10% of companies do it? The simplest answer is “it’s hard.” It’s time-consuming and labor-intensive to gather and maintain all of these various enrichments. It’s hard to thread and blend data together AND keep it all accurate and organized. Let’s face it, most business professionals barely have time to analyze the data in front of them, much less go out and find other sources.

Let’s Talk About Inzata

Inzata is a data analytics platform designed to change all of that. Inzata offers a growing list of more than 25 separate enrichments, ranging from geospatial and location enrichments to weather data and advanced customer demographics with street-level accuracy.

Data enrichment is a core function of Inzata; it’s designed as an integral part of our Agile Analytics™ workflow, which uses technology to turn raw data into digital gold.

Secondary data is the key ingredient of data enrichment. Advanced customer demographics, for example, are arguably the strongest enrichment a company can use to add an immense amount of value to its customer data. Unlike any other data analytics platform, Inzata has over 150 customer demographics covering the entire nation built right into the platform for one-click access at all times. Some of these enrichments include:

  • Income brackets
  • Employment
  • Occupation
  • Housing occupant/valuation
  • Marital Status
  • Education level
  • Industry facts

Enriching your customer data in this way greatly increases the value and precision of your analysis and allows you to answer much more complex questions about your business. Inzata makes enriching your data as simple as selecting which attributes you want to add and instantly adding them to your data.

These enrichments are absolutely priceless for companies with big data on their hands. Being able to slice and dice your large datasets by these detailed demographics and behavioral characteristics makes them more precise, more manageable, and better able to tell you what’s actually going on inside your business. Think of enrichment as a force-multiplier for your big data initiative: you simply know more about your customers and your transactions. Failing to enrich a mass amount of simple customer data for your own benefit is like choosing a 2005 flip-phone over a 2018 smartphone.

A Harvard Business Review1 article mentions two very important statistics that easily prove why data enrichment is absolutely crucial:

  • On average, 47% of newly created data records have at least one critical & work-impacting error.
  • Only 3% of the data quality scores in their study can be rated “acceptable” using the loosest-possible standard.

Any business can easily avoid falling into these negative statistics by investing in the correct data analytics platform that provides powerful enrichments for top-notch data refinement and enhancement through a variety of secondary data sources.

Inzata’s platform is the first and only of its kind to include one-click enrichments for any data, from any source, for any business. Stay ahead of the curve in data analytics and invest in the best, invest in Inzata.

Sources

1Only 3% of Companies’ Data Meets Basic Quality Standards, https://hbr.org/2017/09/only-3-of-companies-data-meets-basic-quality-standards