Categories
Big Data Data Analytics

How To Use Big Data to Improve Your Customer Service

Customer experience is everything.

Recent research has revealed that 90 percent of buyers are willing to pay a premium for a better customer experience. The key, however, is understanding what an improved experience actually means to a customer.

The rise of analytics has positioned companies to achieve closer customer analysis—on a far greater scale than feedback surveys or social media comments. With access to a mix of complex data sets from an array of sources, companies now have better insight into customer behavior, leading to higher sales numbers and better customer service.

With that in mind, here are five ways you can use this new emphasis on data to deliver better customer service.

1. Know Your Target Audience Better

In the past, data collected on customer interactions was drawn primarily from observation and direct engagement. These sources provided some level of insight but were difficult to aggregate—making it a challenge to get a comprehensive view. Today, companies are able to examine thousands of data points on each customer to better understand and segment their best customers.

For example, companies have used big data to figure out how millennial buying habits differ from those of previous generations. For a single product, companies now understand why the product…

Read More on Dataflow

Categories
Big Data Data Analytics Data Enrichment

Where to Get Free Public Datasets for Data Analytics Experimentation

Many companies believe that they have to create their own datasets in order to see the benefits of data analytics, but this is far from the truth. There are hundreds of thousands of free datasets on the internet that anyone can access. These datasets can be useful for anyone who is looking to learn how to analyze data, create data visualizations, or just improve their data literacy skills.

Data.gov

In 2015, the United States government pledged to make all government data available for free online. Data.gov lets you search over 200,000 datasets from a variety of sources, covering agriculture, finance, public safety, education, the environment, energy, and many other subjects.
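Beyond the search box on the site, Data.gov's catalog can also be queried programmatically. A minimal sketch using Python's requests library, assuming the standard CKAN package_search endpoint that powers the catalog (worth verifying before you rely on it):

```python
# Minimal sketch: searching the Data.gov catalog programmatically.
# Assumes the standard CKAN "package_search" endpoint at catalog.data.gov;
# verify the endpoint and fields before building anything on top of it.
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "education", "rows": 5},  # search term and number of results
    timeout=30,
)
resp.raise_for_status()

for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
    # Each dataset lists downloadable resources (CSV, JSON, etc.)
    for res in dataset.get("resources", []):
        print("   ", res.get("format"), res.get("url"))
```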

Google Trends

With Google Trends, users can find search term data on any topic in the world. You can check how often people search for your company, and you can even download the datasets for analysis in another program. Google offers a wide variety of filters, allowing you to narrow down your search by location, time range, category, or even specific search type (e.g., image or video results).
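If you prefer to pull that data programmatically rather than exporting CSVs from the web interface, one option is the unofficial third-party pytrends wrapper — an assumption about tooling, since the CSV export works just as well. A minimal sketch:

```python
# Sketch using the unofficial third-party "pytrends" wrapper (pip install pytrends).
# The web interface's CSV export works just as well; this simply automates it.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)      # language and timezone offset
pytrends.build_payload(
    kw_list=["data analytics"],               # search terms to compare
    timeframe="today 12-m",                   # last 12 months
    geo="US",                                 # limit to a region
)

interest = pytrends.interest_over_time()      # pandas DataFrame, one column per term
print(interest.tail())
interest.to_csv("data_analytics_trends.csv")  # save for analysis elsewhere
```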

Amazon Web Services Open Data Registry

 

Amazon offers just over 100 datasets for public use, covering a wide range of topics, such as an encyclopedia of DNA elements, satellite data, and trip data from taxis and limousines in New York City. Amazon also includes “usage examples,” with links to work that other organizations and groups have done with the data.

Data.gov.uk

Just like the United States, the United Kingdom posts all of its data for public use free of charge. The same is true of many other countries, such as Singapore, Australia, and India. With so many countries offering their data to the public, it shouldn’t be hard to find a good dataset to experiment with.

Pew Internet

The Pew Research Center’s mission is to collect and analyze data from all over the world. They cover all sorts of topics like journalism, religion, politics, the economy, online privacy, social media, and demographic trends. They are nonprofit, nonpartisan and nonadvocacy. While they do their own work with the data they collect, they also offer it to the public for further analysis. To gain access to the data, all you need to do is register for a free account, and credit Pew Research Center as the source for the data.

Reddit Comments

Some members of r/datasets on Reddit have released a dataset of all comments on the site dating back to 2005. The datasets are categorized by year and are free for anyone to download. It could be a fun project to analyze the data and see what can be discovered about Reddit commenters.
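As a quick illustration of what such a project might look like, here is a minimal sketch that counts the most active commenters in one dump file. The filename and schema are assumptions — the actual layout depends on which release you download.

```python
# Minimal sketch: counting the most active commenters in a Reddit comment dump.
# Assumes a newline-delimited JSON file (one comment object per line) with an
# "author" field; real filenames and schema depend on the release you download.
import json
from collections import Counter

authors = Counter()
with open("comments_2015.json", encoding="utf-8") as f:   # hypothetical filename
    for line in f:
        comment = json.loads(line)
        authors[comment.get("author", "[unknown]")] += 1

for author, count in authors.most_common(10):
    print(f"{author}: {count} comments")
```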

Earthdata

Another great source for datasets is Earthdata, which is part of NASA’s Earth Science Data Systems Program. Its purpose is to process and record earth science data from aircraft, satellites, and field measurements.

UNICEF

UNICEF’s data page is a great source for data sets that relate to nutrition, development, education, diseases, gender equality, immunization and other issues relating to women and children. They have about 40 datasets open to the public.

National Climatic Data Center

The National Climatic Data Center is the largest archive of environmental data in the world. Here you can find an archive of weather and climate data sets from all around the United States. The National Climatic Data Center also has meteorological, geophysical, atmospheric, and oceanic data sets.

Read More Here

Categories
Big Data Data Analytics

How to Increase Diversity in the Tech Workplace

Diversity in the workplace is something that all tech companies should strive for. When appropriately embraced in the technology sector, diversity has been shown to improve financial performance, increase employee retention, foster innovation, and help teams develop better products. Data marketing teams with equitable hiring practices in regard to gender are one example.

While the benefits of a diverse workplace can help any company thrive, figuring out exactly how to improve diversity within tech workplaces can be a challenge. However, building a diverse team is not impossible, and the rewards make diversification efforts well worth it.

Diversity Is Less Common Than You Might Think

Though the tech industry is far more diverse today than it has been in the past, diversity still remains an issue across the sector. Even if those heading tech companies don’t engage in outright racism by fostering a hostile work environment towards people of color or discouraging the hiring of diverse groups, many tech companies still find themselves with teams that look and think alike. Homogeneity creates complacency, insulates a workforce from outside perspectives, and ultimately prevents real innovation and creativity from taking place.

Tech companies can be complicit in racism through hiring practices, segregation of existing…

Read More on Dataflow

Categories
Big Data

Predicting Housing Sale Prices via Kaggle Competition

 

Kaggle Competition / GitHub Link

Intro

The objective of this Kaggle competition was to accurately predict the sale prices of homes in Ames, IA, using a provided training dataset of 1400+ homes & 79 features. This exercise allowed experimentation with and exploration of different strategies for feature engineering & advanced modeling.

EDA

To get familiar with the problem, some initial research was done on the town of Ames. As a college town, home to Iowa State University, everything (including real estate) can be tied to the academic calendar. The locations of airports & railroads were also noted, as well as which neighborhoods are rural vs. mobile homes vs. dense urban. Another interesting discovery was the Asbestos Disclosure Law, requiring sellers to notify buyers if the material is in or on their homes (such as roof shingles), which may have a direct impact on a home’s price.

To get acquainted with the dataset, features were divided into Categorical & Quantitative groups, where some could be considered both. A function was written to visualize each through either box plots (abc, categorical) or scatter plots (123, quantitative) to gain quick insights such as NA / 0-values, value/count distribution, evidence of a relationship with the target, or obvious outliers.
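The original plotting function isn’t reproduced in the post; a rough sketch of the idea, assuming a pandas DataFrame (here called train) with a SalePrice target, might look like this:

```python
# Sketch of a quick-look plotting helper: box plots for categorical features,
# scatter plots for quantitative ones, each against the target (SalePrice).
# Assumes a pandas DataFrame; the project's actual code is not shown in the post.
import matplotlib.pyplot as plt
import seaborn as sns

def quick_plot(df, feature, target="SalePrice"):
    fig, ax = plt.subplots(figsize=(8, 4))
    if df[feature].dtype == "object":              # categorical ("abc") -> box plot
        sns.boxplot(x=feature, y=target, data=df, ax=ax)
    else:                                          # quantitative ("123") -> scatter plot
        ax.scatter(df[feature], df[target], alpha=0.4)
        ax.set_xlabel(feature)
        ax.set_ylabel(target)
    ax.set_title(f"{feature}: {df[feature].isna().sum()} NAs")
    plt.show()

# Example usage:
# quick_plot(train, "Neighborhood")
# quick_plot(train, "GrLivArea")
```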

To dig in a little deeper, two functions, catdf() and quantdf(), were scripted to create a dataframe of summary details for each type of feature:

The CATDF dataframe includes the number of unique factors, the full set of factors, the mode, the mode percentage & the number of NAs. It also ran a simple linear regression with only the feature & sale price, and returned the score when the feature was converted into dummy variables, into a binary mode-vs.-rest variable, or into a quantitative variable (e.g. Poor = 1 through Excellent = 5). This would also suggest an action item for the specific feature depending on results.
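The post doesn’t include the function body. A simplified sketch of the kind of summary catdf() builds — only the dummy-variable regression score is shown here, not the binary or ordinal variants — assuming a pandas DataFrame train with a SalePrice column:

```python
# Simplified catdf()-style summary for categorical features: unique count, mode,
# mode share, NA count, and a quick R^2 from regressing SalePrice on the dummified
# feature. The original implementation (with binary/ordinal scores) is not shown.
import pandas as pd
from sklearn.linear_model import LinearRegression

def catdf(train, features, target="SalePrice"):
    rows = []
    for col in features:
        s = train[col]
        dummies = pd.get_dummies(s, prefix=col, dummy_na=True)
        r2 = LinearRegression().fit(dummies, train[target]).score(dummies, train[target])
        rows.append({
            "feature": col,
            "n_unique": s.nunique(),
            "mode": s.mode().iat[0] if not s.mode().empty else None,
            "mode_pct": s.value_counts(normalize=True).iloc[0] if s.notna().any() else 0.0,
            "n_na": s.isna().sum(),
            "dummy_r2": round(r2, 3),
        })
    return pd.DataFrame(rows).set_index("feature")
```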

The QUANTDF dataframe includes the range of values, the mean, the number of outliers, NA & 0-values, the Pearson correlation with sale price, and a quick linear regression score. It also flags any high correlation with other variables, to alert us to potential multicollinearity issues. This proved particularly useful when comparing the TEST vs. TRAIN datasets – for example, patio sizes were overall larger in the TEST set, which could affect the overall modeling performance if that feature were used.
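Similarly, a sketch of a quantdf()-style summary (again assuming a DataFrame train; the outlier count and quick regression score from the original are omitted here for brevity):

```python
# Simplified quantdf()-style summary for quantitative features: range, mean,
# NA/zero counts, Pearson correlation with SalePrice, and a multicollinearity flag.
# The original implementation is not shown in the post.
import pandas as pd

def quantdf(train, features, target="SalePrice", corr_flag=0.8):
    corr_matrix = train[features].corr()
    rows = []
    for col in features:
        s = train[col]
        high_corr = [
            other for other in features
            if other != col and abs(corr_matrix.loc[col, other]) > corr_flag
        ]
        rows.append({
            "feature": col,
            "min": s.min(),
            "max": s.max(),
            "mean": round(s.mean(), 2),
            "n_na": s.isna().sum(),
            "n_zero": int((s == 0).sum()),
            "corr_with_target": round(s.corr(train[target]), 3),
            "high_corr_with": ", ".join(high_corr),
        })
    return pd.DataFrame(rows).set_index("feature")
```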

Feature Engineering & Selection

The second step was to add, remove, create & manipulate features that could provide value to our modeling process.

We attempted to produce multiple versions of our dataset, to see which “version” of our feature engineering proved most beneficial when modeling.

Here are our dataset configurations, created to compare model performance:

NOCAT – only quantitative features were used
QCAT – quantitative features + converted ordinal categoricals (1-5 for Poor-Excellent)
DUMCAT – all original features, with all categoricals dummified
OPTKAT – some new features & categoricals converted based upon CATDF suggested actions
MATCAT – all feature engineering (+ a few extras), intelligent ordinality; usually our best

Missingness

Missingness was handled differently depending on the dataset configuration (see above). Particularly in the MATCAT dataset, significant time and energy was spent carefully choosing appropriate replacements for missing values, under the general assumption that if a home had a null value for an area-size feature, the home did not include that area on its lot (i.e. if Pool Square Footage was null, we assumed the property did not contain a pool). Some earlier versions of the models, such as our initial simple linear regression model, used mean imputation for numeric columns (after outlier removal) and mode imputation for categorical values prior to dummification.
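A condensed sketch of those two imputation strategies — the domain-driven “null area means the feature is absent” fill, and the simpler mean/mode baseline — with made-up column lists rather than the project’s actual ones:

```python
# Sketch of the two imputation approaches described above, with hypothetical column lists.
# 1) Domain assumption: a null area-size means the home has no such area -> fill with 0,
#    and a null quality code means the feature is absent -> fill with "None".
# 2) Simpler baseline: mean imputation for numeric columns, mode for categoricals.
import pandas as pd

area_cols = ["PoolArea", "GarageArea", "TotalBsmtSF"]       # hypothetical examples
quality_cols = ["PoolQC", "GarageQual", "BsmtQual"]         # hypothetical examples

def impute_domain(df):
    df = df.copy()
    df[area_cols] = df[area_cols].fillna(0)
    df[quality_cols] = df[quality_cols].fillna("None")
    return df

def impute_simple(df):
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iat[0])
    return df
```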

Feature Combinations

Upon analyzing the dataset, it was clear that several features needed to be combined prior to modeling. The dataset contained square-footage values for multiple different types of porches and decks (screened-in, 3Season, OpenPorch, and PoolDeck), which combined neatly into a single Porch square-footage variable. The individual features were then removed from the dataset.

Other features were converted from square-footage values to binary categories, denoting whether or not the home contained that item, feature, or room.
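A minimal sketch of both steps — combining the porch/deck areas into one feature and recoding other area features as has/has-not flags — using the porch names from the post and otherwise hypothetical column names:

```python
# Sketch of the two feature-engineering steps above: combine separate porch/deck
# square footages into one feature, then recode other area columns into binary
# "has it / doesn't have it" flags. Column names beyond the porches are hypothetical.
import pandas as pd

porch_cols = ["ScreenPorch", "3SsnPorch", "OpenPorchSF", "PoolDeck"]   # as described above

def engineer(df):
    df = df.copy()
    existing = [c for c in porch_cols if c in df.columns]
    df["PorchSF"] = df[existing].fillna(0).sum(axis=1)   # combined porch square footage
    df = df.drop(columns=existing)                        # drop the individual features

    # Convert selected square-footage features into binary has/has-not indicators
    for col in ["PoolArea", "Fireplaces", "GarageArea"]:  # hypothetical examples
        if col in df.columns:
            df["Has" + col] = (df[col].fillna(0) > 0).astype(int)
    return df
```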

The function written to create the MATCAT dataset allows the user to apply scaler transformations and Box-Cox transformations for heavily skewed features. These transformations generally improved the models’ accuracy, especially for the linear models.
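A sketch of that kind of transformation step; the post doesn’t say which implementation was used, so this stands in with scipy’s boxcox1p at a fixed lambda plus standard scaling:

```python
# Sketch of scaling plus a Box-Cox-style transform for heavily skewed features.
# Uses scipy's boxcox1p with a fixed lambda as a stand-in; the original code may
# have estimated lambda or used a different transform entirely.
from scipy.special import boxcox1p
from sklearn.preprocessing import StandardScaler

def transform_skewed(df, numeric_cols, skew_threshold=0.75, lmbda=0.15):
    df = df.copy()
    skewness = df[numeric_cols].skew()
    skewed = skewness[abs(skewness) > skew_threshold].index
    for col in skewed:
        df[col] = boxcox1p(df[col], lmbda)   # log-like transform that tolerates zeros
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    return df
```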

Additionally, the MATCAT dataset uses intelligent ordinality when handling NA values for categorical features being converted to numeric. We found that in certain cases, having a poor-quality room was more detrimental to a home’s sale price than not having that room or item at all. For instance, in our dataset, homes without a basement have a higher average sale price than homes with a basement of the lowest quality. In cases like this, NA values were given the numeric value closest to matching the average sale price of homes with NA for that category.
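A simplified sketch of that “intelligent ordinality” idea for a single feature — BsmtQual is used purely as an illustration, and the project’s actual per-feature mappings are not shown in the post:

```python
# Simplified sketch of "intelligent ordinality" for one feature (BsmtQual as an example):
# map the quality codes to numbers, then slot NA (no basement) in at the code of the
# level whose average sale price is closest to that of the NA homes.
def ordinal_with_smart_na(df, col="BsmtQual", target="SalePrice",
                          order=("Po", "Fa", "TA", "Gd", "Ex")):
    mapping = {level: i + 1 for i, level in enumerate(order)}   # Po=1 ... Ex=5
    means = df.groupby(col)[target].mean()                       # avg price per level
    na_mean = df.loc[df[col].isna(), target].mean()              # avg price of NA homes
    closest = (means - na_mean).abs().idxmin()                   # nearest level by price
    return df[col].map(mapping).fillna(mapping[closest])
```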

Other feature selection strategies used were:

    • Starting with all of the features, running a while-loop VIF analysis to remove anything with VIF > 5 (sketched after this list)
    • Starting with a single feature, adding new features only if they contribute to a better AIC/BIC score
    • Converting selected features to PCA and modeling with the new vectors
    • Using Ridge/Lasso to remove features through penalization
    • Using the RandomForest importance listing to select a top subset for decision tree splits
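A sketch of the first strategy, the while-loop VIF elimination, using statsmodels and assuming a fully numeric feature DataFrame X (the project’s actual loop is not shown):

```python
# Sketch of the while-loop VIF elimination strategy (first bullet above):
# repeatedly drop the feature with the highest variance inflation factor until
# every remaining feature has VIF <= 5. Assumes a fully numeric DataFrame X.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_prune(X, threshold=5.0):
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X                      # all remaining features are below the threshold
        X = X.drop(columns=worst)         # drop the most collinear feature and repeat
```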

Models & Tuning

Linear Modeling – Ridge, Lasso & ElasticNet were used, with GridSearchCV optimizing alpha and l1_ratio. Since many significant features have a clear linear relationship with the target variable, these models gave a higher score than the non-linear models.
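A sketch of that tuning setup for ElasticNet; the actual parameter grids and scoring metric are not given in the post, so the values below are placeholders:

```python
# Sketch of the linear-model tuning described above: ElasticNet with GridSearchCV
# over alpha and l1_ratio. Assumes preprocessed numeric X_train / y_train; the
# grid values and scoring metric are placeholders, not the project's actual ones.
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1, 10],
    "l1_ratio": [0.1, 0.5, 0.9],          # 0 -> ridge-like, 1 -> lasso-like
}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```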

Non-Linear Modeling – Random Forest (RF), Gradient Boosting (GBR) and XGBoost were used, with GridSearchCV optimizing MaxFeatures for RF, as well as MaxDepth & SubSample for GBR. Performance was not improved by our optimized dataset, since that dataset had been optimized for linear regression only. In addition, it was difficult to avoid over-fitting when using the GBR model.
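The analogous setup for the tree-based models, again with placeholder grids rather than the project’s actual values:

```python
# Sketch of the non-linear tuning described above: max_features for Random Forest,
# max_depth & subsample for Gradient Boosting. Grid values are placeholders.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=500),
    {"max_features": ["sqrt", 0.3, 0.5]}, cv=5)

gbr_search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=500),
    {"max_depth": [2, 3, 4], "subsample": [0.6, 0.8, 1.0]}, cv=5)  # subsample < 1 limits over-fitting

# rf_search.fit(X_train, y_train)
# gbr_search.fit(X_train, y_train)
```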

Model Stacking – H2O.ai is an open-source AutoML platform, and when it was asked to predict sale price based on our MATCAT dataset, the AutoML regressor used various models (RF, GLM, XGBoost, GBM, deep neural nets, stacked ensembles, etc.) that ultimately led to our best Kaggle score. While this model’s findings are more difficult to interpret than those of traditional machine learning techniques, the AutoML model neutralizes the major disadvantages of any specific model while taking the best of each family.
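A minimal sketch of that AutoML step with the H2O Python client; the file names and settings are assumptions, since the post does not list the exact configuration used:

```python
# Sketch of the AutoML step: H2O's AutoML trains and stacks many model families
# automatically. File names and settings are hypothetical.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("matcat_train.csv")            # hypothetical file name
predictors = [c for c in train.columns if c != "SalePrice"]

aml = H2OAutoML(max_models=20, seed=42)
aml.train(x=predictors, y="SalePrice", training_frame=train)

print(aml.leaderboard.head())                          # RF, GLM, XGBoost, GBM, ensembles...
preds = aml.leader.predict(h2o.import_file("matcat_test.csv"))
```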

Collaboration Tools

In addition to our standard collaboration tools (GitHub, Slack, Google Slides), we also used Trello to organize our thoughts on the different features & Google Colab to work on the same Jupyter notebook file. This allowed us to work together from virtually anywhere & at any time.

Categories
Big Data Data Analytics

An Analysis of Facebook’s Cryptocurrency Libra and What it Means for Our World

After months of speculation, Facebook has revealed its Libra blockchain and the Libra coin to the world. The highly-anticipated cryptocurrency ran into immediate opposition in Europe and the United States. The French Finance Minister Bruno Le Maire said it was “out of the question” that Libra would “become a sovereign currency”. Meanwhile, Markus Ferber, a German member of the European Parliament, said that Libra has the potential to become a “shadow bank” and that regulators should be on high alert. In addition, both Democrats and Republicans raised their concerns, with Representative Patrick McHenry, the senior Republican on the House Financial Services Committee, calling for a hearing on the initiative.

It was to be expected that when the social media giant, which saw numerous scandals in 2018, launched a cryptocurrency, there would be opposition. Many people, organisations and governments no longer trust Facebook with their social media data, let alone their financial information. The main concern from regulators and lawmakers around the world is that Facebook is already too massive and careless with users’ privacy to launch an initiative like Libra.

However, before we judge too quickly, let’s first dive into the Libra blockchain as well as the Libra coin to understand it…

Read More on Dataflow

Categories
Big Data Data Analytics

Why are Consumers So Willing to Give Up Their Personal Data?

Data privacy is a hot-button topic. Most people can agree that it’s important to keep personal data private, but are you really doing much to keep your data safe?

Consumers are fervent in their fight to protect their data, but they do little to keep it safe. It’s known as the privacy paradox, and it may be hurting consumer efforts to keep their information out of third-party hands.

What makes consumers so willing to give up their personal data?

Data in Exchange for Something Valuable

According to recent research, most internet users (75%) don’t mind sharing personal information with companies – as long as they get something valuable in return.

A recent Harris Poll also found that 71% of adults surveyed in the U.S. would be willing to share more personal data with lenders if it meant receiving a fairer loan decision. Lenders typically ask for information about the applicant’s personal financial history, but the poll suggests that borrowers may be prepared to give up even more information.

Research suggests that consumers are well aware that data exchange is a sensitive matter, and they’re willing to be participants in the “game.” But they want the game to be fair. In other words,…

Read More on Dataflow

Categories
Big Data Data Analytics

Why The Future of Finance Is Data Science

The entire process of working is going through fast changes with every advance in technology. Top financial advisors and leaders now see the future as completely reliant on data science.

Automation is occurring in all industries, and while some jobs will become streamlined, that does not necessarily mean lowering the number of employees. With new technology, people need to reexamine software and data storage, and even hand over some responsibilities to artificial intelligence.

Statistics vs. Data Analytics

Statistics are a vital part of learning about the customer base and seeing exactly what is occurring within a finance company and how it can be improved. There is a difference between analytics and statistics.

Vincent Granville, data scientist and data software pioneer, explains this in the simplest terms: “An estimate that is slightly biased but robust, easy to compute, and easy to interpret, is better than one that is unbiased, difficult to compute, or not robust. That’s one of the differences between data science and statistics.”

Data science did evolve from a need for better data, and once big data arrived, the standard statistical models could not handle it. “Statisticians claim that their methods apply to big data. Data scientists claim that their methods do not apply to small data,” Vincent…

Read More on Dataflow

Categories
Big Data Data Analytics

Does Big Data Have a Role in 3D Printing?

Most modern technologies complement each other nicely. For example, advanced analytics and AI can be used together to achieve some amazing things, like powering driverless vehicle systems. Big data and machine learning can be used collaboratively to build predictive models, allowing businesses and decision-makers to react to and plan for future events.

It should come as no surprise, then, that big data and 3D printing have a symbiotic nature as well. The real question is not “if” but rather “how” they will influence one another. After all, most 3D prints come from a digital blueprint, which is essentially data. Here are some of the ways in which big data and 3D printing influence one another:

On-Demand and Personalized Manufacturing

One of the things 3D printing has accomplished is transforming the modern manufacturing market to make it more accessible and consumer-friendly. There are many reasons for this.

First, 3D printing offers localized additive manufacturing, which means teams can create and develop prototypes or concepts much faster. The technology can also be adapted to work with a variety of materials, from plastic and fabric to wood and concrete.

Additionally, the production process itself is both simplified and sped up considerably. One only needs the proper digital formula…

Read More on Dataflow

Categories
Big Data Data Analytics

New IT Process Automation: Smarter and More Powerful

If we think of the newest trends in IT service automation, or try to follow the recent research, or listen to the top speakers at conferences and meetups — they will all inevitably point out that automation increasingly relies on machine learning and artificial intelligence.

It may sound like a case where these two concepts are used as buzzwords to declare that process automation follows the global trends, and that is partially true. In theory, machine learning can enable automated systems to test and monitor themselves, to provision additional resources when necessary to meet timelines, as well as to retire those resources when they’re no longer needed, and in this way to enhance IT processes and software delivery.

Artificial intelligence, in turn, refers to completely autonomic systems that can interact with their surroundings in any situation and reach their goals independently.

However, most organizations are in the very early days of actual implementations of such solutions. The idea behind the need for AI and related technologies is that many decisions are still the responsibility of developers in areas that could be effectively addressed by adequate training of computer systems. For example, it is the developer who decides what needs to be executed, but identifying…

Read More on Dataflow

Categories
Big Data Data Analytics

Setting up an Analytics Team for Success = Get Fuzzy!

Building on our month focussed on controversial topics, let’s turn to what will set your team up for success.

Different contexts can require different types of analytics team. A lot of the advice that I offer within the Opinion section of this blog is based on a lifetime leading teams in large corporates. So, I’m pleased to partner with guest bloggers from other settings.

So, over to Alan to explain why getting “fuzzy” is the way for an analytics team to see success in the world of startups…

Get fuzzy! Why it is needed

My co-founders and I have recently had to face up to this challenge of creating a new data analytics team, having set up our new firm, Vistalworks, earlier in 2019. Thinking about this challenge, reflecting on what we know, and getting to the right answer (for us) has been an enlightening process.

With 70-odd years of experience between us, we have plenty of examples of what not to do in data analytics groups, but the really valuable question was what we should do, and what conditions we should set up to give our new team the best chance to be successful.

As we talked through this issue, my main personal observation was that successful data analytics teams, of whatever size, have…

Read More on Dataflow
