The era of Big Data is upon us, and with it, business leaders are finding new insights in their data to drive tactical and strategic decisions. Data visualization tools are widely available from many vendors, including Tableau, Qlik (QlikView), and Microsoft (Power BI). The question is no longer ‘are you using Big Data?’ but rather, ‘why not?’

 
Visualization vendors make it sound easy: make your data accessible to our tools, push a button, and wondrous visual displays uncover never-before-seen insights into your business. One fact, however, is always downplayed: you actually have to prepare data for analysis. Great visualization tools are worthless if the data is not organized and prepped beforehand. For many data analysis projects and initiatives, the data prep itself can be the most time-consuming and repetitive step.
 
Here are, from our point of view, the top three challenges of the data prep process, and how to overcome them.
 
Frustration #1: Merging Data from Different Sources
 
Analysts want to jump right into the analytics and uncover the promised insights, but first they have to follow the processes for data loading and making the data available to the analytics engine. Easily done if all of the necessary data is in a single data set; but it rarely is.
 
Data exists in many different systems, from finance to engineering to CRM, and in both public and private sources. The number one challenge of data prep is the munging and merging that must take place as you combine data from different systems. And it’s never easy. Simple nuances in the data, like differing formats, are often the toughest part.
 
The data, the data structures, and even the definition of what the data reflects vary from one system to another, and the primary challenge of the data transformation is to merge it all together in a consistent manner.
 
Consider time stamps: one file stores date and time in a single column, while another keeps them in separate columns that must be combined first. Something as simple as how phone numbers and zip codes are formatted can wreak havoc on your results. This is the unspoken reality for the data analyst or scientist: the data is often not your friend.
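To make these nuances concrete, here is a minimal pandas sketch of the two problems above: one source keeps a combined timestamp while another splits date and time, and zip codes arrive in mixed formats (strings with and without leading zeros, or plain integers). The table and column names are invented for illustration.

```python
import pandas as pd

# File A: one combined timestamp column; zip stored as strings.
sales_a = pd.DataFrame({
    "order_id": [101, 102],
    "timestamp": ["2024-03-01 09:30:00", "2024-03-01 14:05:00"],
    "zip": ["02134", "2134"],          # note the dropped leading zero
})

# File B: date and time in separate columns; zip stored as integers.
sales_b = pd.DataFrame({
    "order_id": [103, 104],
    "date": ["2024-03-02", "2024-03-02"],
    "time": ["08:15:00", "16:40:00"],
    "zip": [2134, 90210],
})

# Rebuild a single timestamp column in file B so the files can be stacked.
sales_b["timestamp"] = pd.to_datetime(sales_b["date"] + " " + sales_b["time"])
sales_a["timestamp"] = pd.to_datetime(sales_a["timestamp"])

# Normalize zip codes: force strings and re-pad lost leading zeros.
for df in (sales_a, sales_b):
    df["zip"] = df["zip"].astype(str).str.zfill(5)

combined = pd.concat(
    [sales_a[["order_id", "timestamp", "zip"]],
     sales_b[["order_id", "timestamp", "zip"]]],
    ignore_index=True,
)
```

Tedious, but every one of these steps has to happen before any visualization tool can merge the two sources correctly.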
 
At Inzata, we watched customers struggling with this challenge, and we have a better way. We noticed that a lot of the work was repetitive, and often involved simple (for a human) operations on the data. So we developed Artificial Intelligence that could perform these tedious tasks, requiring only occasional guidance from a human. We call it AI-Assisted Data Modeling, and it takes you from raw, disorganized data to beautiful analytics and visualizations, in about 30 minutes.
 
 
Frustration #2: Lack of Common Ground between the Analyst and IT
 
The analyst is a subject matter expert in her field, the IT pro knows the systems. But quite often, they don’t know much about the other’s role, and can’t speak the same language on requirements to prepare data for analysis. The analyst requests data from the IT pro, and files get sent and delivered in email and dropboxes.
 
In many cases, the data munging process becomes one of trial and error (request a file, work on it, discover needed changes, request a new file) until finally, after many iterations, Microsoft Power BI, QlikView, Tableau, or whatever other analytics tool is in use delivers the right content.
 
But what if you could work with data in its native source format, coming directly from your source systems, with no ETL middleware or IT guy to call? Inzata lets you organize your connections to source systems (both inside your company and in the cloud) and reads in data in its native physical form.
 
Inzata then helps you rapidly create business data models mapped to this native structure. Your days of transforming raw data and generating new files for analysis are behind you.
 
Everything else you do in Inzata is driven by these virtual data models, so your users and analysts only see highly organized data, structured and merged in a way they can understand, because it resembles your actual business. Updates are no problem: when new data is ready from source systems, it automatically flows into the Inzata dataset and your reports update in real time.
 
Field names are in English, not computer-ese, and oriented around the things you care about. Customers. Transactions. Employees. These are the “things” you interact with in Inzata, just like in the real world. Data is displayed visually, no code to write. Data aggregations, rollups and filters become their own reusable objects. Creating a new dashboard is as simple as dragging those elements to a blank canvas and giving it a name.
 
What if something changes in the source system, such as renamed fields or newly added columns? In the past this would wreck everything: reports would stop working and you’d have to start over from scratch. Inzata anticipates these slowly changing dimensions and detects when they happen. It automatically reallocates data elements to accommodate the changes, and everything keeps working. The end result: you don’t need to worry about changes in source systems wrecking your reports and dashboards.
 
Frustration #3: Missing Audit Trail
 
This part is very important for anyone who uses (or is thinking about using) a data prep tool, whether Excel or anything similar that outputs files.
 
Insights gained through data analytics can give decision makers reason to make significant changes to the business. A lot is riding on these decisions, and there has to be confidence in the accuracy of the data and insights. But after the data is merged from several sources, goes through various transformations, and gets reloaded, it becomes hard to trace backward from the insight to the original data. Honestly, are you going to remember the exact transform steps you performed on a dataset you haven’t touched in three months? The lack of an audit trail weakens the confidence that the team can have in the outputs.
 
As a data scientist, you’ve worked hard to develop your reputation and credibility. Shouldn’t your tools also be up to the challenge?
 
By their very nature, file-based Data Prep tools cannot deliver this kind of confidence and auditing, because they are only in possession of the data for a short time. There’s nothing to link the final file with the original data, or the process it underwent in the tool. They don’t maintain chain-of-custody to protect the data.
 
Inzata does.
 
From the moment your data enters Inzata’s ecosystem, every activity is meta-tagged. We track everything that happens to your data: who touches it, who accesses it, and what transformations or enrichments it goes through. We also have intelligent temporal shifting, which is a fancy way of saying we invented time travel. (At least for your data, that is.)
 
Here’s how: Inzata stores each incremental state of your data. If you want to see exactly how a report looked 3 months ago, we can show it to you. If you need to re-create a list-query exactly as it would have looked 5 days ago, we can do that.
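As a toy illustration of what “storing each incremental state” buys you, here is a minimal as-of lookup: every saved state is kept with its load timestamp, and a point-in-time query binary-searches for the latest state at or before the requested moment. This is a conceptual sketch, not Inzata’s storage engine, and it assumes states are saved in chronological order.

```python
import bisect
from datetime import datetime

class TemporalStore:
    """Keep every historical state of a dataset and answer as-of queries."""

    def __init__(self):
        self._times = []   # save timestamps, assumed chronological
        self._states = []  # the state recorded at each timestamp

    def save(self, when: datetime, state: dict) -> None:
        self._times.append(when)
        self._states.append(dict(state))   # copy: history is immutable

    def as_of(self, when: datetime) -> dict:
        """Return the latest state saved at or before `when`."""
        i = bisect.bisect_right(self._times, when)
        if i == 0:
            raise KeyError("no state recorded that early")
        return self._states[i - 1]

store = TemporalStore()
store.save(datetime(2024, 1, 1), {"active_customers": 120})
store.save(datetime(2024, 4, 1), {"active_customers": 150})

# "Show me this report as it looked in mid-February."
january_view = store.as_of(datetime(2024, 2, 15))
```

Because no state is ever overwritten, any past report or list-query can be reproduced exactly as it stood on a given date.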
 
Conclusion
 
Data preparation, with all the challenges it entails, is the dirty little secret of big data analytics. The potential insights into your business are valuable, but the process can be so frustrating at times that projects die on the vine. It’s time to spend as much effort evaluating data transformation tools that can take the human out of the equation as you do evaluating data analytics tools.