Understanding Data Quality: The Problem of Understanding Data
We all want good quality in the goods and services we consume, but what does “quality” mean when it comes to data?
This is a very broad topic and Anchoring Data will be providing posts on different aspects of data quality going forward. It is also a very important topic. We all make decisions based on data in our personal and professional lives. In doing so we are consciously or unconsciously judging, or merely accepting, the quality of the data involved. Data quality has a direct impact on our lives and, like it or not, we have to be concerned about it.
Now, you might think that “data quality” simply means the data is right and not wrong, and while that is a valid point, data quality involves a lot of other factors too. In fact, data gets blamed for being of poor quality when it actually is not. One of the main reasons this happens is that people do not understand the data, or are not able to understand the data, they are trying to use. They will typically say the data is “bad”. Such statements eventually get back to the technical administrators of data who may then try to correct nonexistent problems in the data. It all adds to the “data mess” we see in many organizations.
Quality vs. Understanding
Let us suppose there is such a thing as “perfect” data. It is quite possible to have access to perfect data but not understand what the data actually represents. Trying to use the data without understanding it will almost certainly create problems due to misapplication of the data.
Of course, there probably is no such thing as “perfect” data, and the situation is even worse with the imperfect data in the real world.
A good example of this is the US Bureau of Labor Statistics, which publishes two estimates of employment every month. One, the Establishment Survey, counts a person once for each job they hold, so someone with two jobs is counted twice. The other, the Household Survey, just counts a person once if they are employed.
The Establishment Survey is the “Headline Jobs Number” that appears every month in the media and causes the Stock Market to go up or down. It is also used by a lot of financial analysts. Yet all these people seem blissfully unaware that the Establishment Survey is overstating the number of people who are employed.
It seems fairly obvious that if someone is making decisions based on data like the Establishment Survey they should be aware of what the data means. And yet, it seems very often the data is just taken at face value.
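To make the difference between the two counting rules concrete, here is a small sketch. The records and function names are made up for illustration, and this is not how the Bureau of Labor Statistics produces its estimates (both surveys are sample-based). It only shows how counting jobs and counting people give different answers from the same underlying facts.

```python
# Hypothetical records: one entry per (person, job) pair.
jobs = [
    ("alice", "barista"),
    ("alice", "tutor"),
    ("bob", "driver"),
    ("carol", "nurse"),
    ("carol", "caterer"),
]


def count_jobs(records):
    """Count every job held, so a person with two jobs is counted twice."""
    return len(records)


def count_employed_people(records):
    """Count each person once, however many jobs they hold."""
    return len({person for person, _job in records})


job_count = count_jobs(jobs)
people_count = count_employed_people(jobs)
print("jobs:", job_count, "people:", people_count)
```

Anyone reading the first number as “people employed” would be off by the number of second jobs, even though neither number is wrong.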
The Role of Definitions
All this is quite easy to say, but in practice it can be very difficult. How do you go about understanding data? Well, data should be defined, but alas a lot of data is not defined, or has very poor definitions.
Technical people, like programmers, do not enjoy creating documentation, and quite frankly are often not good at it. There used to be technical writers who would write data definitions, but these positions have largely been eliminated in recent years. There are more technical roles, like data modelers who design databases, and who also capture data definitions. However, their definitions are often in specialized tools that are inaccessible to ordinary users.
Furthermore, we cannot expect too much from a definition. A definition is important as it is an explanation of the essence of something – for data this would be what it means to non-technical businesspeople. Other information is usually missing, e.g. if a data element is calculated, the actual calculation is often not in the definition.
Even more alarming is that for data, the definition can be what the data is supposed to be, not what it actually is. For instance, the data element “Address Line 3” might be defined as “To be used for the third line of a street address, but not the city, higher geopolitical unit, or post code of the address” and in reality be used to house fax numbers. Such things can happen with data.
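One way to catch a gap like this is to look at the actual values rather than trusting the definition. The sketch below is only an illustration: the sample values are invented, and the pattern for “looks like a fax number” is a deliberately loose assumption that would need tuning for real data.

```python
import re

# Loose, illustrative pattern (an assumption, not a standard): an optional
# "fax" label followed by at least seven digits or phone-style separators.
FAX_LIKE = re.compile(r"^\s*(fax[:.\s]*)?\+?[\d\s().-]{7,}\s*$", re.IGNORECASE)


def looks_like_fax(value):
    """Return True if the value resembles a phone or fax number."""
    return bool(FAX_LIKE.match(value))


def share_fax_like(values):
    """Fraction of non-empty values that resemble a fax number."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    return sum(looks_like_fax(v) for v in non_empty) / len(non_empty)


# Hypothetical contents of an "Address Line 3" column.
address_line_3 = [
    "Suite 400",
    "Fax: 555-123-4567",
    "+1 (212) 555 0199",
    "",
    "Building C",
]

print(share_fax_like(address_line_3))
```

A high share would suggest the column is being used for something other than what its definition says.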
What’s In the Data?
The philosophers tell us that definitions give us intension but not extension. What is “extension”? It is the range of instances that go into data. Suppose we have a database table that contains information about Employees. On closer inspection we find that this table contains US employees but excludes employees in Puerto Rico, Guam, and the Marshall Islands. We also find that the table only gets updated every 6 months.
In this example we may have a perfect definition of what an employee is, but the “extension” of the table has limitations that could be important. Unfortunately, people who look only at the definitions can get a false sense of security from them.
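Limits on extension like these can often be found by profiling the table itself. The following is a minimal sketch under stated assumptions: the rows, the `region` and `last_updated` field names, and the list of expected regions are all hypothetical, and a real table would be queried rather than held in a list.

```python
from datetime import date

# Hypothetical rows from an Employees table.
rows = [
    {"employee_id": 1, "region": "CA", "last_updated": date(2024, 1, 15)},
    {"employee_id": 2, "region": "NY", "last_updated": date(2023, 12, 1)},
    {"employee_id": 3, "region": "TX", "last_updated": date(2024, 1, 15)},
]

# Regions the user assumes the table covers (an assumption to be tested).
expected_regions = {"CA", "NY", "PR", "GU"}


def missing_regions(rows, expected):
    """Regions the user expects but that never appear in the data."""
    present = {row["region"] for row in rows}
    return sorted(set(expected) - present)


def days_since_last_update(rows, today):
    """Age in days of the most recently updated row."""
    latest = max(row["last_updated"] for row in rows)
    return (today - latest).days


print(missing_regions(rows, expected_regions))
print(days_since_last_update(rows, date(2024, 7, 1)))
```

Neither check says anything about whether the definition of “employee” is good. They only show what is actually in the table and how fresh it is.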
Where Did the Data Come From?
Yet another issue that is becoming increasingly important is the provenance of the data. A good deal of data is now purchased by organizations, or scraped from the Internet by them. Years ago, the only data most organizations had was what they produced from their operational systems.
Obviously, some sources of data are going to be more reliable than others. Many data vendors have established reputations that they want to preserve and so they are going to be careful about the quality of the data they sell. Other sources could be a lot dodgier, and we now have the prospect of datasets assembled by AI, which might hallucinate data into existence.
Knowing something of where the data comes from is very likely to affect how it will be used, but, again, this is largely ignored and the data is just taken at face value.
Don’t Blame Data Quality if You Don’t Understand the Data
Alas, people will often not bother to make the effort to get the level of understanding they need to use the data. When things go wrong, which they almost certainly will, these people are not going to blame themselves. Instead, they are going to blame the data and say it was of poor quality.
If anyone has a need to use data, they should think through the assumptions they are making about the data and its intended use, and understand the data to the level that is needed. This requires time and effort, and might be difficult. Maybe it is not even possible to fully understand the data to the level required. But at least an informed decision can be made whether to accept the risk or not.
In the end, understanding the data we are using is a personal responsibility, and it is disingenuous to blame the data if we ignore this responsibility.


