When it comes to analytics, in particular for product ideation and optimisation, listening to what the data does not say is often as important as listening to what it does. There can be various types of "silences" in data that we must get past to take the right actions. Here I will focus on the most common.
Frequently, very large data sets will have a proportionately small number of items that will not "parse" (be converted from raw data into observations with well-defined meaning) in the standard way. A common response is to ignore them under the assumption that there are too few to really matter. The problem is that these items often fail to parse for similar reasons and therefore bear relationships to each other. So, even though it may be only 0.1 percent of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems. Do not allow syntactically inconsistent data to be silent.
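A minimal sketch of this idea, using a hypothetical line-oriented feed and a made-up "failure signature": rather than discarding rejects, group them by why they failed, which often reveals one shared cause.

```python
import re
from collections import Counter

# Hypothetical raw records: "id,date,value" lines with ISO dates expected.
ROW = re.compile(r"^(\d+),(\d{4}-\d{2}-\d{2}),(-?\d+(?:\.\d+)?)$")

raw = [
    "101,2024-03-01,42.5",
    "102,2024-03-01,17",
    "103,03/01/2024,9.9",   # non-ISO date: fails to parse
    "104,03/02/2024,3.1",   # same failure mode
    "105,2024-03-02,88",
]

parsed, rejects = [], []
for line in raw:
    (parsed if ROW.match(line) else rejects).append(line)

# Characterise the rejects instead of dropping them: a crude "failure
# signature" groups them, exposing a coherent sub-population.
def signature(line):
    return "slash-date" if re.search(r"\d{2}/\d{2}/\d{4}", line) else "other"

reject_profile = Counter(signature(r) for r in rejects)
print(reject_profile)  # here, all rejects share one fixable cause
```

In real pipelines the signature would be derived from the parser's error, but the principle is the same: the rejects are data about the data.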
In real data sets, we often find semantic discrepancies (differences in meaning) from one item to the next where we expect similarity. A common example is the "omission value". Some items may have a zero, some may have the special value NULL, some may have blanks, and some may have user-entered values such as "?" or "N/A". Do these all mean the same thing for our analysis, or not?
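One common treatment, sketched here with an assumed (hypothetical) token list: collapse the many spellings of "no data" into a single sentinel, while deliberately leaving zero alone, since deciding whether 0 means "measured as zero" or "not recorded" is an analysis decision, not a formatting one.

```python
# Hypothetical inventory of "omission" spellings seen in one column.
OMISSION_TOKENS = {"", "null", "n/a", "na", "?", "-"}

def normalise(value):
    """Map the many spellings of 'no data' to a single sentinel (None).

    The integer 0 is deliberately NOT treated as missing: whether a zero
    means 'none recorded' or 'measured as zero' must be decided per analysis.
    """
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in OMISSION_TOKENS:
        return None
    return value

column = [0, "NULL", "  ", "?", "N/A", 17, None]
cleaned = [normalise(v) for v in column]
print(cleaned)  # 0 and 17 survive; the omission spellings collapse to None
```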
Another place semantic gaps often form is in the relationships between data items/records. For example, if we expect to see the VIN of a new car linked to the final assembly plant in which it was produced, then what does it mean if the pointer to that plant is invalid (contains one of these "omission values" or a value that simply doesn't resolve to an existing plant in our dataset)? Presumably all cars have to undergo final assembly in some plant, so the silence here is trying to tell us something that we should not ignore.
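The broken-pointer check described above can be sketched as a referential-integrity pass over two hypothetical tables (the plant names and IDs below are invented for illustration). The point is to surface the dangling records as findings rather than silently drop them.

```python
# Hypothetical tables: cars reference assembly plants by plant_id.
plants = {"P01": "Detroit", "P02": "Osaka"}

cars = [
    {"vin": "VIN001", "plant_id": "P01"},
    {"vin": "VIN002", "plant_id": "P99"},   # does not resolve to any plant
    {"vin": "VIN003", "plant_id": None},    # an "omission value"
    {"vin": "VIN004", "plant_id": "P02"},
]

# Every car was assembled somewhere, so each unresolved pointer is a
# finding to investigate, not a row to discard.
dangling = [c["vin"] for c in cars if c["plant_id"] not in plants]
print(dangling)
```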
Assume we are given the data printed on car window stickers for vehicles actually sold in the first quarter of the year. We may find that 41 percent of them were blue while the next most prevalent colour was only 18 percent of the vehicles sold. We might conclude that customers bought more blue cars because they preferred blue. In drawing that conclusion, however, we have allowed all the other data to be silent. With a bit more thought, we realise that we only have sticker prices, so direct evidence of discounting is not available.
However, this data set does have the model year, so we are able to see that 70 percent of the blue cars sold were from the previous model year. It is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold. So maybe blue wasn't so popular after all. Do not allow relevant data to be silent when drawing inferences.
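The reasoning in this example can be made concrete with a toy data set (the counts below are invented to roughly match the percentages in the text): raw colour counts suggest blue leads, but slicing the blue cars by model year shows most are prior-year stock.

```python
from collections import Counter

# Hypothetical window-sticker records for Q1 sales: (colour, model_year).
sales = (
    [("blue", 2023)] * 29 + [("blue", 2024)] * 12 +  # 41 blue, ~70% prior year
    [("white", 2024)] * 25 +
    [("red", 2024)] * 18 +
    [("grey", 2024)] * 16
)

by_colour = Counter(colour for colour, _ in sales)
blue_years = [year for colour, year in sales if colour == "blue"]
prior_year_share = blue_years.count(2023) / len(blue_years)

print(by_colour.most_common(1))   # blue leads on raw counts...
print(prior_year_share)           # ...but ~70% of the blues are old stock
```

The headline number ("41 percent blue") and the confounder ("70 percent of blues are last year's model") come from the same table; only the second question was asked of it.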