Quality analytics depends on many things, not the least of which are an incisive understanding of the business problem to be analyzed and an experienced, knowledgeable team of data pros who have the right tools and techniques to conduct the analysis.
But the single most important ingredient in effective analytics is quality data. In this three-part series we will look at what constitutes quality data and how to go about ensuring that your analytics are based on the best data possible. Essentially, data quality comes down to three factors: input data, methodology and quality control. In this initial post we will examine input data.
Understanding Input Data
As the saying goes: garbage in, garbage out. High-quality input data are fundamental to producing reliable models and datasets. Regardless of how good your models are, if the input data used to build and implement them are bad—incomplete, outdated, biased or otherwise inaccurate—the resulting predictions or datasets have little chance of being reliable.
Now, no data are perfect, which is something you don’t often hear stated so bluntly by a data provider. But the reality is data are subject to how, where, when, and from whom they were captured. Any of these aspects can be a source of bias or error. So it is imperative to understand the pedigree of input data and determine how “clean” the data really are before embarking on any analytics effort.
Data, no matter how “up-to-date,” always present a snapshot in time, and that time is, by necessity, in the past. Knowing when (recency and frequency) and how (process) the data were collected is critical to determining the degree of data “cleanliness” and also assists researchers in making informed choices about methodology and what types of analysis may or may not be appropriate.
The recency of input data greatly determines how well they reflect the current state of affairs. Data that are 5 years old are bound to be less representative of the present than data that are 5 minutes old, all other things being equal. Further, the frequency of data collection is also important because it influences the types of models that a researcher can use and how often those models can be calibrated and their predictions tested. As researchers, we have to use history to predict the future. There is no changing this fact. And as researchers, it is our job to determine how well historical data reflect the present or predict the future, and to make adjustments where necessary. This is where the skill, experience and domain knowledge of the researcher are critical. It is quite straightforward to build most models. The real challenge is intelligently using the results.
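As a rough illustration of the recency and frequency checks described above, the Python sketch below summarizes a list of collection dates. The function name `collection_profile`, its parameters and the sample dates are hypothetical, chosen only for this example; they do not represent any particular dataset or the checks EA actually runs.

```python
from datetime import date
from statistics import median


def collection_profile(collected_on, as_of):
    """Summarize recency and collection frequency for a list of collection dates."""
    dates = sorted(collected_on)
    # Recency: days between the most recent collection and the reference date.
    recency_days = (as_of - dates[-1]).days
    # Frequency: typical gap, in days, between consecutive collections.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    typical_gap_days = median(gaps) if gaps else None
    return {"recency_days": recency_days, "typical_gap_days": typical_gap_days}


# Hypothetical collection dates, for illustration only.
profile = collection_profile(
    [date(2020, 1, 1), date(2020, 4, 1), date(2020, 7, 1)],
    as_of=date(2020, 7, 31),
)
print(profile)
```

A large `recency_days` value suggests the snapshot may no longer reflect current conditions, while `typical_gap_days` gives a sense of how often a model built on these data could realistically be recalibrated and tested.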
The second critical component to understanding input data is knowing how the data were collected. The data collection process is always flawed, which leads to errors, outliers and biases in the resulting data. While in many cases there is little researchers can do about the flaws in the collection method, it is vital that they be aware of those flaws. For example, data on purchase behaviour collected via a survey will be quite different from data collected at the point of sale (POS). What people say they did is typically quite different from what they actually did. Consequently, the way that researchers work with data from a survey versus a POS system should also be quite different. In some cases, the “where,” “how,” “when” and “from whom” of data collection greatly limit the types of techniques and analysis that we can undertake.
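One simple way to become aware of such flaws is to screen a field for missing values and outliers before using it. The sketch below is a minimal example using only the Python standard library; the function name `screen_values`, the sample purchase amounts and the choice of Tukey's interquartile-range rule are assumptions made for illustration, not a description of any specific provider's process.

```python
from statistics import quantiles


def screen_values(values, k=1.5):
    """Report the share of missing values and flag outliers with the IQR rule.

    Assumes `values` is non-empty and has at least two non-missing entries.
    """
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)
    # Tukey's rule: flag points more than k * IQR beyond the quartiles.
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing_share": missing_share, "outliers": outliers}


# Hypothetical purchase amounts; None marks a missing record.
report = screen_values([20, 22, 19, None, 21, 23, 20, 400, 22, None])
print(report)
```

In this made-up sample the value 400 is flagged and two of the ten records are missing. A flagged value is not necessarily an error; whether to keep, correct or drop it is a judgment call that depends on how the data were collected.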
When we receive data at EA, we put them through a series of checks and questions before applying them. Here are just some of the elements we examine, and the level of granularity we drive down to, to help us assess the input data:
Answering these questions helps researchers understand the input data and begin to map out an approach for using the data to build reliable datasets and models. In the next instalment of this three-part series we will look at the important role that methodology plays in ensuring data quality.
Sean Howard is Vice President, Demographic Data at Environics Analytics