Big Data · Data Analytics · Data Preparation

Data Wrangling vs. Data Cleaning: What’s the Difference?

There are many mundane tasks and time-consuming processes that data scientists must go through in order to prepare their data for analysis. Data wrangling and data cleaning are both significant steps within this preparation. However, due to their similar roles in the data pipeline, the two concepts are often confused with one another. Let’s review the key differences and similarities between the two as well as how each contributes to maximizing the value of your data.

What is Data Wrangling?

Data wrangling, also referred to as data munging, is the process of converting and mapping data from one raw format into another. The purpose is to prepare the data in a way that makes it accessible for effective use further down the line. Not all data is created equal; therefore, it’s important to organize and transform your data in a way that can be easily accessed by others.

While an activity such as data wrangling might sound like a job for someone in the Wild West, it’s an integral part of the classic data pipeline and ensuring data is prepared for future use. A data wrangler is a person responsible for performing the process of wrangling.
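To make this concrete, here’s a minimal sketch in Python using pandas (the store names and sales figures are made up): wrangling a “wide” raw export into the “long” format that analysis tools further down the line usually expect.

```python
import pandas as pd

# A raw "wide" export with hypothetical, made-up sales figures
raw = pd.DataFrame({
    "store": ["North", "South"],
    "jan_sales": [100, 80],
    "feb_sales": [120, 90],
})

# Wrangle it into the "long" format most BI tools and models expect
tidy = raw.melt(id_vars="store", var_name="month", value_name="sales")
tidy["month"] = tidy["month"].str.replace("_sales", "")
print(tidy)
```

Nothing here is “cleaned” — the values are untouched; only the shape of the data changes, which is exactly what distinguishes wrangling from cleaning.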

Benefits of Data Wrangling

Beyond being an essential part of preparing your data for use, data wrangling yields many benefits, including:

  • Easier access to data
  • Faster time to insights
  • More efficient data-driven decision making

What is Data Cleaning?

Data cleaning, also referred to as data cleansing, is the process of finding and correcting inaccurate data from a particular data set or data source. The primary goal is to identify and remove inconsistencies without deleting the necessary data to produce insights. It’s important to remove these inconsistencies in order to increase the validity of the data set.

Cleaning encompasses a multitude of activities, such as identifying duplicate records, filling empty fields, and fixing structural errors. These tasks are crucial for ensuring your data is accurate, complete, and consistent, and cleaning results in fewer errors and complications further downstream. For a deeper dive into the best practices and techniques for performing these tasks, see our Ultimate Guide to Cleaning Data.
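As a rough illustration, here’s how those three tasks might look in Python with pandas, using a tiny hypothetical customer table:

```python
import pandas as pd

# Hypothetical customer records exhibiting all three problems:
# a duplicate record, an empty field, and structural errors
df = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Cole", "  cara day "],
    "email": ["ann@x.com", "ann@x.com", None, "cara@x.com"],
})

df = df.drop_duplicates()                        # duplicate records
df["email"] = df["email"].fillna("unknown")      # empty fields
df["name"] = df["name"].str.strip().str.title()  # structural errors
print(df)
```

Note that nothing is reshaped: the table keeps its format, and only the values become more accurate and consistent.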

Benefits of Data Cleaning

There is a wide range of benefits that come with cleaning data that can lead to increased operational efficiency. Properly cleansing your data before use leads to benefits such as:

  • Elimination of errors
  • Reduced costs associated with those errors
  • Improved data integrity
  • Higher-quality information for decision making

When comparing the benefits of each, it’s clear that the goals behind data wrangling and data cleaning are consistent with one another. They each aim at improving the ease of use when it comes to working with data, making data-driven decision making faster and more effective as a result.

What’s the Difference Between Data Wrangling and Data Cleaning?

While the methods might be similar in nature, data wrangling and data cleaning remain very different processes. Data cleaning focuses on removing inaccurate data from your data set, whereas data wrangling focuses on transforming the data’s format, typically by converting “raw” data into another format more suitable for use. Data cleaning enhances the data’s accuracy and integrity, while wrangling prepares the data structurally for modeling.

Traditionally, data cleaning is performed before any data wrangling is applied. This makes the two processes complementary rather than opposing methods. Data needs to be both wrangled and cleaned prior to modeling in order to maximize the value of insights.

Big Data · Business Intelligence · Data Analytics · Data Preparation

How to Cure Your Company’s Spreadsheet Addiction

Spreadsheets are essential to the vast majority of business professionals today. From quick calculations and generating reports to basic data entry, spreadsheets have a way of working themselves into our daily tasks. What’s not to love about spreadsheets? They’re quick, easy to use, require little to no training, and are quite powerful tools overall. However, developing too much of a dependency on them can be problematic, especially when they’re used as a convenient workaround for other solutions.

This dependency problem is commonly referred to as spreadsheet addiction. While referring to this phenomenon as an addiction might seem a bit extreme, many organizations find themselves heavily reliant on individual spreadsheets to perform core functions. This heavy usage causes many problems and can ultimately hinder a company’s growth. Let’s explore the potential causes of this addiction as well as possible treatment plans.

What’s Wrong with Using Spreadsheets?

While Excel and Google Sheets can be quite effective in their own right, heavy reliance on spreadsheets can create risk and cause a number of negative effects. 

A few examples of potential problems they create are:

  • Things Start to Break – As the size of the dataset increases, things within your spreadsheet inevitably start to break. Once this starts to occur, it can be seemingly impossible to identify the source of the problem. You’ll likely drain more time and resources into finding and fixing the issue than on your actual analysis. These breaking points also create the risk of errors and other data corruption.
  • Static and Outdated Information – Exporting data from your CRM or ERP system instantly causes the information to become outdated. Spreadsheets don’t allow you to work with your data in real time, and it’s extremely difficult to implement any form of automation within your sheets. This creates more work for users and introduces inaccuracy.
  • Impedes Decision Making – Spreadsheets are notoriously riddled with errors, which can be costly when it comes to decision making. You wouldn’t want to base any kind of decision on a forecast that is more likely to be inaccurate than not. Reducing discrepancies, specifically human error, will improve decision making overall.

Treatment Methods for Spreadsheet Addiction

Regardless of the severity of your company’s spreadsheet dependency, effective treatment is no small feat. Change doesn’t happen overnight and you should approach your treatment plan as an iterative process. While this list is not exhaustive, here are a few pillars to consider when promoting change.

Evaluation of Symptoms

First, you must identify how spreadsheets are currently being used. It’s important to start with a comprehensive overview of your situation in order to form an effective plan of action. 

To assess your addiction level, start by asking yourself questions such as:

  • How are spreadsheets used by an individual user and across departments?
  • What company processes involve the use of spreadsheets?
  • How are spreadsheets used as a method of collaboration?

Drill down into how spreadsheets are being used from the company level down to the individual users. Determine not only how they are being used by departments but also their frequency, purpose, and role in daily operations.

Assign Ownership

Next, assign an individual to lead or build a small team to take ownership of the project. If no one feels they are directly responsible for driving change, the bystander effect will inevitably rear its ugly head on the situation. Assigning responsibility for the transition away from spreadsheets will facilitate the flow of change.

New Solutions and Systems

A new solution requires a widespread change to business processes and will ultimately impact day-to-day operations. This is the main reason spreadsheet addiction is prolonged in businesses: everyone is more comfortable with their usual way of doing things. But implementing the proper tools and systems is vital to decreasing dependency on spreadsheets.

Use your evaluation to identify your company’s business intelligence needs and acquire tools accordingly. While this transition might require an upfront cost, investing in the proper data and analytics tools will reduce labor costs and improve efficiency in the long run.

Promote User Buy-In 

Buy-in from employees across all levels is crucial to the acceptance of new solutions and business processes. Spreadsheet addiction will prevail if users aren’t comfortable using the new systems put into place. Any change requires learning, so it’s essential to offer training and support resources to aid the shift.

In the end, accept that some tasks and projects will always be done in Excel or Google Sheets. The important thing is that not everything, or even the majority of work, is done through these platforms. Though beating spreadsheet addiction might come with some withdrawal pains, driving change now will foster greater efficiency in the long run.


Big Data · Data Preparation · Data Visualization

3 Useful Excel Tricks You Wish You Knew Yesterday

Microsoft Excel has become an essential staple for workplace professionals across every industry, especially when quickly working with data and performing basic analyses. With hundreds of functions, it can be overwhelming to try to learn them all as well as know which are most effective. But if used correctly, these functions can help save you an immense amount of time and headache. Let’s explore a few classic Excel tricks every analyst should have in their toolbox.

1. Heatmap

It’s easy to get lost in the hundreds of rows and columns that make up most spreadsheets. You might be left asking yourself, what do all of these numbers mean? This is where data visualizations are key in helping you understand your data and generate insights quickly. 

One effective way to do this is with the use of color and Excel’s heatmap function. To put it simply, heat maps are a visual representation of your data through the use of color. These charts are perfect for comprehensive overviews of your data as well as additional analysis.

This trick can be broken down into three simple steps:

  1. Select the cells and values you want to include.
  2. Under the Home tab, click on Conditional Formatting.
  3. Select Color Scales from the drop-down menu and choose your desired color scale.

Following these steps, you can now see your data highlighted in a gradient based on its value. This can visually assist, for example, in identifying critical dips in sales or inventory by highlighting those cells in red. Overall, heat maps are extremely versatile and can be used to understand data intuitively. They also make for a great visual addition to any dashboard or report!
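If you ever need the same effect outside of Excel, the gradient idea is easy to sketch in Python with pandas. The sales numbers, thresholds, and color names below are arbitrary illustrative choices, not Excel’s exact Color Scales algorithm:

```python
import pandas as pd

# Made-up quarterly sales by region
sales = pd.DataFrame(
    {"Q1": [50, 200, 120], "Q2": [80, 40, 160]},
    index=["North", "South", "East"],
)

def to_color(value, lo, hi):
    """Map a value onto a simple red-yellow-green scale."""
    t = (value - lo) / (hi - lo)  # normalize to the 0..1 range
    return "red" if t < 0.33 else "yellow" if t < 0.67 else "green"

lo, hi = sales.values.min(), sales.values.max()
colors = sales.apply(lambda col: col.map(lambda v: to_color(v, lo, hi)))
print(colors)
```

The principle is the same as Excel’s: normalize each cell against the range of the whole selection, then color it by where it falls.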

2. Remove Duplicates

The last thing you want in your data is duplicate entries or values, which pose a risk of inaccuracy in your analysis. Fortunately, removing these duplicates is quite simple in Excel; here are the steps for one method.

To begin, we need to first identify if there are any duplicates present in your spreadsheet. We can do this by highlighting any duplicates through Excel’s Conditional Formatting function. 

  1. Under the Home tab, click on Conditional Formatting.
  2. Select Highlight Cells Rules from the drop-down menu, then select Duplicate Values.
  3. Determine your formatting preferences and click OK.

Any duplicates present in your data will be highlighted based on the color preferences you determined earlier. Now that you’ve detected the duplicates, you can easily remove them by going to the Data tab and clicking Remove Duplicates. Excel will then tell you how many duplicates were detected and the total removed from your sheet. Duplicate free in only a few simple clicks! This trick can help you minimize discrepancies as well as save time trying to manually detect and delete duplicate values.
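The same detect-then-remove flow can be sketched in Python with pandas, using hypothetical order data:

```python
import pandas as pd

# Hypothetical order data with one fully duplicated row
df = pd.DataFrame({"order_id": [1, 2, 2, 3], "amount": [10, 25, 25, 40]})

dupes = df.duplicated()         # flag duplicates, like Conditional Formatting
print(f"{dupes.sum()} duplicate(s) detected")

deduped = df.drop_duplicates()  # remove them, like Data > Remove Duplicates
```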

3. Filling All Empty Cells 

Chances are your dataset contains a few empty cells, whether due to incomplete data or any number of other reasons. In order to avoid issues in your analysis or when creating models, it’s important to fill these cells ahead of time.

Follow these steps to identify and fill all empty cells at once:

  1. Under the Home tab, click Find & Select.
  2. Select Go To Special from the drop-down menu and select Blanks from the provided menu options.
  3. Fill an empty cell with your desired value or text (e.g. N/A) and press CTRL + ENTER.

With this function, all of your empty cells can be identified and filled in a matter of seconds. This trick will save you time sifting through columns trying to manually detect and fill empty cells. Additionally, it can be especially helpful when working with large data sets used for creating models.
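For comparison, here’s the same find-and-fill idea sketched in Python with pandas on a small hypothetical table:

```python
import pandas as pd

# Hypothetical data with a couple of blanks
df = pd.DataFrame({"region": ["West", None, "East"], "rep": ["Ada", "Ben", None]})

# Count every blank, then fill them all at once with a placeholder --
# the one-line equivalent of Go To Special > Blanks plus CTRL + ENTER
print(df.isna().sum().sum(), "empty cells found")
df = df.fillna("N/A")
```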


Big Data · Business Intelligence · Data Analytics · Data Preparation

Why Data Warehouse Projects Fail

As organizations move towards becoming more data-driven, the use of data warehouses has become increasingly prevalent. While this transition requires companies to invest immense amounts of time and money, many projects continue to fail. Let’s take a look at the most common reasons why data warehouse projects fail and how you can avoid them. 

There’s No Clear Big Picture

In most cases, these projects don’t fail due to technical challenges. While there might be some obstacles when it comes to loading and connecting data, the leading pitfalls are predominantly organizational. Stakeholders commonly feel that there is a lack of clarity surrounding the warehouse’s goals and primary objectives.

Companies often see this most prevalently in the division between technical teams and the ultimate end user. You don’t want your architects or engineers to be on a different page than your analysts. Therefore, it’s important to communicate the high-level goals behind the project to all members of your team before putting processes into place.

Before beginning, the team should have definitive answers to questions like:

  • What are our data goals?
  • What insights are we looking for to satisfy our business needs?
  • What types of questions do we need the data to answer?

Developing a clear understanding of the big picture early on will help you avoid uncertainty around strategy, resource selection, and designing processes. Knowing the company’s “why” behind taking on the initiative will also allow those involved to recognize the purpose of their efforts.

The Heavy Load of Actually Loading the Data 

Beyond the organizational obstacles, there are also many hurdles on the technical side of things. Before data can be loaded into the warehouse, it has to be prepped and properly cleaned. This poses an initial challenge, as cleaning data is notoriously time-consuming. IT leaders are often frustrated by the wasted hours spent preparing data to be loaded.

The primary concern is organizations’ ability to easily move and integrate their data. Movement and ease of access to data are crucial in order to generate any kind of insights or business value. According to a recent study conducted by Vanson Bourne and SnapLogic, 88% of IT decision-makers experience problems when loading data into their data warehouse.

The most common data loading inhibitors were found to be:

  1. Legacy Systems – Migrating data from legacy technology can be time-consuming. However, the primary issue here is that these systems can be difficult to access, making any kind of data movement restrictive.
  2. Unstructured and Semi-Structured Data – Complex data types are tough to manage in any situation. Inconsistencies in structure and formatting drain time and technical resources, preventing effective loading.
  3. Data Siloed in Different Infrastructures – Disconnection of data sources prevents integration across the organization. Many companies have hundreds of separate data sources as they continually grow across departments and with the addition of various projects. 
  4. Resistance to Sharing Data Across Departments – Oftentimes departments act as their own separate entities and aren’t willing to share. The sales team may not want finance to have access to their customer data due to misaligned goals. 

All of these warehouse factors drain an organization’s time and resources, contributing to a lengthier and more costly project overall. Additionally, improperly loading data can cause a number of problems in itself such as errors and data duplication.

Low End User Acceptance

So you’ve successfully moved your data into the warehouse, now what? Another issue that commonly contributes to the failure of data warehouse projects is end user acceptance. As much as new technologies can be exciting, people are inevitably creatures of habit and might not embrace new tools right away. This is where education and training come into play. Onboarding users is vital to the success of any project.

Establishing a data-driven culture is the first step to promoting user acceptance and engagement. End users should be encouraged to indulge in their data curiosities. Implementing a form of self-service analytics will increase the ease of use for non-technical users and help them quickly gain access to information. These transitional efforts will not only help with the success and use of your data warehouse but also drive better decision making throughout the organization in the long run.


Overall, there are a variety of reasons that contribute to the failure of data warehouse projects. Whether those pitfalls are organizational or on the technical side of things, there are proven ways to properly address them in order to maximize investment and foster successful insights. 


Data Modeling · Data Preparation

The Marketing Analytics Tool You Need in 2019

The Purpose of Marketing Analytics Tools

The field of marketing is a very large, intense, and sometimes complicated web of customer data from a wide variety of sources. Not only do you have the umbrella categories of marketing tools such as CRMs, paid ad managers, social media, website analytics, etc., you also have the numerous tools that fall under each of those categories, leaving you with upwards of 10 data sources to attempt to collectively analyze without spending a week’s (maybe even a month’s) worth of time creating unattractive and unreliable pie charts in Excel that need to be updated every day. Marketing analytics can be difficult to conquer…unless you have the right knowledge and marketing analytics tool.

Isn’t my CRM the Marketing Analytics Tool I Need?

Thankfully, CRMs such as HubSpot and Salesforce are very good at keeping their data organized properly, and HubSpot has decent reporting tools. The problem is that these marketing analytics tools report only their own data. Yes, some CRMs, such as HubSpot, let you integrate your Facebook ads and Google Analytics, but those sources aren’t included in their reporting tools when it comes to comparing marketing emails to social media ad performance to website traffic… see what I mean yet?

Advertising Data

Paid ads managers and social media, such as Google, Facebook, LinkedIn, Twitter, YouTube, etc., each have their own reporting and marketing analytics tools, and while some of them are decently detailed, some of them also – for lack of a better phrase – totally suck! Not only that, but you’re limited to only analyzing their data. So, what if you want to know if your YouTube video views spiked when you ran your new Twitter campaign? What if you want to know if your LinkedIn profile engagement decreased because of a low quality Facebook campaign? Sure, go ahead and bounce around from website to website… go ahead and waste an immense amount of time.

Website Data…Yikes!

Your company’s website data is the kicker, and what will ultimately prove my point. Not only are you interested in your website visits, specific page visits, traffic sources, and about 35 other things, but you’re also A/B testing your landing pages, wondering why your pricing page has an 80% drop-off rate, and why half of your visits are from people who live in a country you’ve never heard of. My point is, every business has a wide variety of questions about their website traffic and conversions, or lack thereof. Is it being affected by your email campaigns? Or by your Facebook and LinkedIn ads? Or by your YouTube videos? Or by your unknowingly low-quality landing pages? There are so many questions, and even more data sources. Utilizing the right marketing analytics tool along with a pinch of automation is the key to answering your vast list of questions about your website traffic behavior.

How Do I Combine All of This Data?!

The solution? An end-to-end marketing analytics tool that collects your data from each and every source, in real time, modeling it into a single dashboard that can answer virtually any question you have about your marketing and customer data. Given the opportunity to pull any single piece of data and compare it to another, your questions and answers about what you are doing right – and more importantly, what you are doing wrong – are endless. Make the choice to stop wasting your time and money on bad marketing decisions and analyze your data with immense precision and speed, using Inzata. Compare HubSpot to Facebook to YouTube to Google to MailChimp to Salesforce to Twitter to LinkedIn to any source you can think of, in real time, with ease. Skyrocket your marketing team’s performance with Inzata.

Data Modeling · Data Preparation

Data Lake? More Like Data Swamp!

Pooling all of the data sources a business or organization has into a data lake that everyone can access is an idea that inspires a lot of interest. The idea is to make the data lake a resource that drives innovation and insights by allowing clever team members to test ideas across many sources and variables. Unfortunately, a lack of good data curation techniques can turn that lake into a data swamp in no time at all.

An Example

Let’s say you want to build a database that contains information about all the employees at your company. There are two data sources: one includes an employee’s name, salary, birthday and current address, while the second includes their name, current city of residence, listed hobbies from their application and salary.

You want to bring these collections of information together. That’s the data ingestion process.

There will be transformation needs, as you’ll have to break down information like the address into its constituent pieces, such as street, city, state and ZIP code. Similarly, the street address itself may be one or two lines long, depending on things like whether there’s an apartment number or a separate P.O. box. There may also be more advanced issues, such as differences in formatting across countries.
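As a rough sketch of that breakdown in Python, assuming a simplified, hypothetical “street, city, ST 12345” format (real addresses — apartments, P.O. boxes, other countries — need far more care):

```python
import re

def split_address(address):
    """Break a simple US-style address into street, city, state, ZIP.

    Assumes the hypothetical format "street, city, ST 12345"; anything
    else is flagged for human review instead of being guessed at.
    """
    match = re.match(r"^(.*),\s*(.*),\s*([A-Z]{2})\s+(\d{5})$", address.strip())
    if not match:
        return None
    street, city, state, zip_code = match.groups()
    return {"street": street, "city": city, "state": state, "zip": zip_code}

parts = split_address("12 Oak St, Tampa, FL 33602")
```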

Schema issues also present problems. For example, let’s say you have an entry in your first source for “John Jones” and another for “John J. Jones” or something similar. How do you decide what constitutes a match? More importantly, what criteria can be used to ensure actual matches are obtained through the kinds of automated processes that are common during data ingestion?
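One crude way to sketch such a matching rule in Python — the `name_key` helper below is purely illustrative, not a production-grade matcher:

```python
def name_key(name):
    """Crude matching key: first and last name, lowercased, with middle
    initials ignored. A real pipeline would also compare other fields
    (birthday, salary) before declaring a match."""
    parts = name.replace(".", "").lower().split()
    return (parts[0], parts[-1]) if len(parts) >= 2 else tuple(parts)

# "John Jones" and "John J. Jones" reduce to the same key
is_match = name_key("John Jones") == name_key("John J. Jones")
```

Even this toy rule shows the trade-off: loosening the criteria catches more real matches but risks merging two different people.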

In the best-case scenario, good data curation practices are in place from the start. Some sort of unique identifier is employed across all your data tables that matches people based on, for example, employee ID numbers that are never reused. In the worst-case scenario, you simply have a bunch of mush that’s going to have to be stabbed at in the dark.

The Role of Human Curation

Even if your organization employs best practices, such as unique IDs for entries, date stamps and preservable identifiers across transforms, there are going to be curation needs in virtually every data set. Even if you get super lucky and all the data lines up perfectly based on those ID tags, many other things can go wrong.

For example, what happens if there’s a scrubbed or foreign character in an entry? HTML entities, for instance, are often escaped by security layers prior to database insertion to prevent injection attacks.
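Python’s standard library can walk that particular transform back; a tiny sketch with a made-up stored value:

```python
import html

# An entity-escaped value as it might sit in the database
stored = "O&#x27;Brien &amp; Sons"

# Walk the transform back before analysis
restored = html.unescape(stored)
print(restored)
```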

Data sources can also introduce problems. Perhaps you’ve been importing information from a CSV file, and you don’t notice one or two entries that throw the alignment off by one or two columns. Worse, instead of getting a runtime error from your code or your analytics package, everything appears to be fine. Without a person scanning through the data, you won’t notice the flaw until someone pulls one of the broken entries. In the absolute worst scenario, critical computational data ends up being passed along and produces a flawed work product.
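A simple width check at import time can catch this kind of silent misalignment before it spreads. Here’s a sketch in Python using a hypothetical inline CSV:

```python
import csv
import io

# Hypothetical CSV where the second data row is missing a column
raw = "id,name,city\n1,Ann,Tampa\n2,Bob\n3,Cara,Miami\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)
bad_rows = [lineno for lineno, row in enumerate(reader, start=2)
            if len(row) != len(header)]
print("rows with wrong width:", bad_rows)
```

Flagging the offending line numbers up front gives a human curator something concrete to inspect, rather than a subtly shifted table.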

Providing Access

Okay, you’ve gotten all that business straightened out. Curation superstar that you are, everything aligns beautifully, automated processes flag issues and humans are double-checking everything. Now you have to put usable information into your employees’ hands.

First, you need to know the technical limits of everyone you employ. If someone can’t write an SQL query, you need to have data in additional formats, such as spreadsheets, that will allow them to load it into their own analytics packages. Will you walk back those transforms in the output process? If so, how do you confirm they will be accurate renderings of the original input?

Likewise, the data needs to be highly browseable. This means ensuring that servers are accessible, and they also need to contain folders with structures and names that make sense. For example, the top level folders in a system may place an emphasis on generalizing their contents, such as naming them “employees” and “customers” for easier reading.

Data curation is a larger cultural choice for an organization. By placing an emphasis on structure the whole way from ingestion to deployment, you can ensure that everyone has access and quickly begin deriving insights from your data lake.
