Data Has 99 Problems and Quality is Just One

The data economy is booming and showing no sign of slowing down. To keep up, make sure your data addresses these 4 key challenges to data quality.

Data quality. Two words that I hear in every conversation, at every conference, and in the title of article after article in the market research space. But what is encapsulated in those two words? Data quality is an important topic and, arguably, when discussing survey data and the issues around fraudsters, bots, speedsters, and professional survey takers, it could be considered “the” problem. I see why, in a world of consumable insights, we throw a range of issues into a “data quality box”: it makes the problems the industry faces easier to divert or put on someone else’s plate.

Here’s the industry-shaking challenge. As companies branch further into “Big Data” or even “Small Data”, data quality overshadows and often oversimplifies the problems of the current data landscape, to the point that most people ignore other issues in research or push them into the background. To get quality insights, you need quality data behind them, which means creating a robust research methodology that also considers the other big problems with big data.

Here are 4 other equally important problems facing the data economy and the market research industry.

1. Data Security (and No, Throwing it on a Blockchain Isn’t a Cure)

When it comes to security, it can seem easier to protect one centralized data lake, the Fort Knox of data, than to secure multiple individual data silos. But with hackers shooting arrows at a single central target, your system is a sitting duck. If you are collecting, storing, and securing your data in one place, you’re always one arrow away from seeing your company featured in a data breach headline.

Here are some alternative methods to centralized repositories to think about:

  • Build encryption models that utilize local storage to doubly encrypt your data.
  • Push machine learning models to the data and work towards a system that can handle on-device analytics and edge processing.
  • Track data, whether it be through new models like blockchain or others, so contributors know where their data went and you know exactly where the data came from so you can quickly identify a breach and trace the trail.
  • Give users choices in protecting their data and make them part of the process.
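
As a rough illustration of the tracking idea above, here is a minimal sketch of a tamper-evident log, where each entry's hash covers the previous entry. It is not a blockchain and not any particular product's method; the function names (`record_event`, `verify`) and the field names are hypothetical, chosen only for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
FIELDS = ("contributor", "action", "recipient", "prev")

def _digest(body):
    """Hash the entry fields in a stable (sorted-key) JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def record_event(log, contributor_id, action, recipient):
    """Append a data-sharing event that is chained to the previous one."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {
        "contributor": contributor_id,  # a pseudonymous ID, not PII
        "action": action,
        "recipient": recipient,
        "prev": prev_hash,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry

def verify(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

With a log like this, a contributor can see where their data went, and a changed or missing entry shows up as a failed `verify` call, which helps narrow down where a trail was broken.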

2. Data Wrangling and Cleansing

For data from things like steps on a Fitbit, YouTube subscriptions, or even Spotify playlists:

  • Fingerprint data sources to make data duplicates obsolete without needing to know who the individual is.
  • Create averages and algorithms for missing data values. For example, when people charge their Fitbits, how do we begin to account for missing values that don’t reflect a person’s activity? Build models with your own datasets to define average usage across data sources and reject incomplete data that doesn’t meet your thresholds. This will reduce the work you or your client have to take on later.
  • Start normalizing your data. Make data readable by the average person who isn’t a data analyst or data scientist. Creating a robust ontology across data source types will reduce your and your clients’ work and make data easier to handle. While doing this, think about how you can best visualize the end result.
  • Double-validate your data. Try to reduce your reliance on survey responses, or at least validate them against permissioned data, to ensure accuracy.
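
Two of the ideas above, fingerprinting and completeness thresholds, can be sketched in a few lines. This is one possible reading, not a prescribed method: it fingerprints the content of each record (no name or account ID is involved) and rejects days with too few readings. The record layout, the function names, and the 24-readings-per-day and 75% thresholds are all assumptions made up for the example.

```python
import hashlib
import json

def fingerprint(record):
    """Hash a record's content only; no identity field is included."""
    payload = {
        "source": record["source"],
        "day": record["day"],
        "values": record["values"],
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def drop_duplicates(records):
    """Keep the first record for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique

def is_complete(record, expected_readings=24, min_coverage=0.75):
    """Accept a record only if enough readings are present (not None)."""
    present = sum(value is not None for value in record["values"])
    return present / expected_readings >= min_coverage
```

In practice the thresholds would come from your own datasets, for example the typical number of hourly readings a device produces on a day when it was worn rather than left on the charger.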

3. Data Privacy

At the root of privacy is the power and choice to share data, to deny its use at both broad and granular levels, to delete data or revoke access to it, and to minimize the information needed to get to an insight.

  • Privacy should not be an afterthought. We need to move towards secret identities with verified attributes instead of appending data from data brokers whose job is to collect information individuals have no control over. By taking part in this practice to avoid PII, the industry keeps a very dark and inaccurate system alive.
  • Bad actors can almost always reverse the de-identification of an individual. Finding creative ways to validate the accuracy of respondents and their responses without referencing PII, and aggregating raw data wherever possible, is essential to preserving the privacy of your participants.
  • Think harder. Shifting big data to small data is possible, but it requires individuals and companies to get smarter about the pieces of information they need versus want in order to see the best results. Instead of asking users for every data point imaginable because you aren’t quite sure of what you are looking for, take more time thinking through the project and outcomes you desire and then go after the bits of data you need.
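
To make “aggregating raw data wherever possible” concrete, here is a minimal sketch that reports only group averages and suppresses any group too small to hide an individual. The function name, the field names, and the minimum group size of 5 are assumptions for illustration; this alone does not guarantee anonymity.

```python
from collections import defaultdict

def aggregate(rows, group_key, value_key, min_group_size=5):
    """Return the mean of value_key per group, omitting small groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: sum(values) / len(values)
        for group, values in groups.items()
        if len(values) >= min_group_size  # suppress groups that are too small
    }
```

The design choice here is that raw rows never leave the function: a client sees one number per group, and a group with only a couple of participants is dropped rather than exposed.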

4. Data Accessibility: A World of APIs and Silos

The market research industry (especially the big fish like Nielsen) should stop cutting backdoor deals with companies like Netflix and use its power and influence to open up APIs that have been conveniently shut down to protect walled gardens. At the end of the day, it’s the users’ viewership, so shouldn’t they be the central part of the process of accessing and sharing this data?

As an industry, those with power should embrace working directly between brands (who often hold massive amounts of consumer data) and panelists (who have access to the most complete sets of their own data) so that data silos can be opened up to individual users to share as they choose for the purposes of research.

Let’s look at the YouTube API. YouTube gives access to subscriptions, likes, and playlists, but leaves out the most important data of all: watch history. Yet, with recent legislation, a Google account holder can now download their entire history without needing Google’s permission. To protect the company’s assets, this copy of the data is made intentionally difficult to use, which makes it less secure in the process.

We all know large companies make deals behind the scenes or break company terms and conditions to access this data, but what if the industry came together to empower individuals with their data? As personal data becomes more accessible to the people generating it, will they begin to choose which research firms or brands are worthy of their information? Only with true competition in the data marketplace (between the creators and the controllers) can data-driven insights become more ubiquitous.
