Avoiding Garbage In, Garbage Out: The Importance of Data Quality - Part 1

May 1, 2016, 12:51 PM by Sean Howard

Quality analytics depends on many things, not the least of which are an incisive understanding of the business problem to be analyzed and an experienced, knowledgeable team of data pros who have the right tools and techniques to conduct the analysis.

But the single most important ingredient to effective analytics is quality data. In this three-part series we will look at what constitutes quality data and how to go about ensuring that your analytics are based on the best data possible. Essentially, data quality comes down to three factors: input data, methodology and quality control. In this initial post we will examine input data.


Understanding Input Data

As the saying goes: garbage in, garbage out. High-quality input data are fundamental to producing reliable models and datasets. Regardless of how good your models are, if the input data used to build and implement them are bad—incomplete, outdated, biased or otherwise inaccurate—the resulting predictions or datasets have little chance of being reliable.

Now, no data are perfect, which is something you don’t often hear stated so bluntly by a data provider. But the reality is data are subject to how, where, when, and from whom they were captured. Any of these aspects can be a source of bias or error. So it is imperative to understand the pedigree of input data and determine how “clean” the data really are before embarking on any analytics effort.

Data, no matter how “up-to-date,” always present a snapshot in time, and that time is, by necessity, in the past. Knowing when (recency and frequency) and how (process) the data were collected is critical to determining the degree of data “cleanliness” and also assists researchers in making informed choices about methodology and what types of analysis may or may not be appropriate.

The recency of input data greatly determines how well they reflect the current state of affairs. Data that are 5 years old are bound to be less representative of the present than data that are 5 minutes old, all other things being equal. The frequency of data collection is also important because it influences the types of models that a researcher can use and how often those models can be calibrated and their predictions tested. As researchers, we have to use history to predict the future. There is no changing this fact. And as researchers, it is our job to determine how well historical data reflect the present or predict the future—and make adjustments where necessary. This is where the skill, experience and domain knowledge of the researcher are critical. Building most models is quite straightforward; the real challenge is using the results intelligently.

The second critical component to understanding input data is knowing how the data were collected. The data collection process is always flawed, which leads to errors, outliers and biases in the resulting data. While in many cases there is little researchers can do about the flaws in the collection method, it is vital that they be aware of those flaws. For example, data on purchase behaviour collected via a survey will be quite different from data collected at the point of sale (POS). What people say they did is typically quite different from what they actually did. Consequently, the way that researchers work with data from a survey versus a POS system should also be quite different. In some cases, the "where," "how," "when" and "from whom" of data collection greatly limit the types of techniques and analysis that we can undertake.

When we receive data at EA, we put them through a series of checks and questions before applying them. Here are some of the elements we examine and the level of granularity to which we drill down to assess the input data:

  1. How many unique records are there?
  2. How many duplicate records are there? Should there be duplicates?
  3. How many fields are available in the dataset and what data types are they?
    1. For string fields, should they have a specific structure? For example, a column representing postal codes should contain six characters, and those characters should follow a specific structure (in Canada, alternating letters and digits).
      1. How many records conform to the prescribed format?
      2. Is there a methodology that can be used to clean up records that do not conform to the prescribed format?
      3. Should records that have been cleaned be handled differently in later analysis?
    2. For numeric variables, what is the range, variance and central tendency?
      1. Are they logical? For example, if 99 percent of the data range between 0 and 100, but 1 percent of the data are negative or over 1,000, does this make sense?
      2. Are these outlier values real, or are they artifacts of the data collection process, data entry errors or processing errors?
      3. If outliers are detected, how should they be handled? Should they be excluded from all analysis or replaced with an alternative estimate? (The answer depends on the nature of the outlier, the purpose of the analytical exercise and the type of model being used.)
      4. Do the variables sum up to meaningful totals?
    3. For categorical variables, are all categories represented?
      1. Are categories labelled consistently and correctly?
  4. Are there any missing values? Are there null or blank entries for specific cells in the dataset?
    1. Do some records have more missing values than others?
    2. Do some fields have more missing values than others?
    3. How should missing values be handled? Should they be excluded from analysis or replaced with an alternative estimate? (The answer depends on the purpose of the analytical exercise and the type of model being used.)
  5. How representative are these data?
    1. Is there a known bias in the way that the data were collected? For example, because online surveys require that participants have Internet access, the results cannot be generalized to the entire population.
    2. Which geographies are represented in the data?
    3. How well do the data reflect the relative distribution of people or households across a geography?
    4. How well do attributes compare to similar attributes from other authoritative data sources? For example, if a customer database includes age, how well do the ages match the ages of the total population?
    5. If there are known gaps or biases in the data, is there enough information available to correct for those gaps and biases?
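Several of the checks above can be automated. The following is a minimal sketch in Python—not EA's actual pipeline—using a small hypothetical dataset of records with an id, a postal code and an age. It illustrates duplicate counting, string-format validation and cleanup (Canadian postal codes alternate letters and digits), a numeric range check for outliers, and a missing-value tally; all record values and thresholds are assumptions for illustration.

```python
# Sketch of a few input-data checks: duplicates, string structure,
# numeric outliers and missing values. Dataset is hypothetical.
import re
from collections import Counter

# Hypothetical input records; None marks a missing value.
records = [
    {"id": 1, "postal_code": "M5V 3L9", "age": 34},
    {"id": 2, "postal_code": "M5V 3L9", "age": 41},
    {"id": 1, "postal_code": "m5v3l9",  "age": 34},    # duplicate id, messy code
    {"id": 3, "postal_code": "12345",   "age": None},  # bad format, missing age
    {"id": 4, "postal_code": "K1A 0B1", "age": 1200},  # implausible age
]

# Checks 1-2: unique vs. duplicate records (keyed on "id" here).
id_counts = Counter(r["id"] for r in records)
duplicates = {k: v for k, v in id_counts.items() if v > 1}

# Check 3.1: Canadian postal codes follow letter-digit-letter digit-letter-digit.
POSTAL_RE = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")

def clean_postal(code):
    """Normalize case and spacing, then test the prescribed format."""
    norm = code.replace(" ", "").upper()
    return norm if POSTAL_RE.match(norm) else None

bad_format = [r["id"] for r in records if clean_postal(r["postal_code"]) is None]

# Check 3.2: flag ages outside a plausible window (assumed 0-115).
outliers = [r["id"] for r in records
            if r["age"] is not None and not 0 <= r["age"] <= 115]

# Check 4: missing values per field.
missing_age = sum(1 for r in records if r["age"] is None)

print("duplicate ids:", duplicates)       # {1: 2}
print("bad postal formats:", bad_format)  # [3]
print("age outliers:", outliers)          # [4]
print("missing ages:", missing_age)       # 1
```

In practice, each flag raised here leads back to the questions in the checklist—whether a cleaned record should be treated differently downstream, and whether an outlier should be excluded or replaced—rather than being resolved mechanically.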

Answering these questions helps researchers understand the input data and begin to map out an approach for using the data to build reliable datasets and models. In the next instalment of this three-part series we will look at the important role that methodology plays in ensuring data quality.

Click here to read part 2.

Sean Howard is Vice President, Demographic Data at Environics Analytics