Marketechnics Report: What Do You Mean I Have Dirty Data?

Commentary by Bill Bittner

One of the big themes at Marketechnics this year was how to clean up your data.

Software vendors are providing various tools to synchronize retailer data, both externally with the manufacturer and internally among their own disparate systems. It is my contention
that companies do not have dirty data, but rather misunderstood data.

In the world of retailing, probably the most misunderstood data entity is the “item.” It is often used generically, yet in different circumstances it acquires very specific meaning.
Retail computer applications must distinguish between all the various nuances of item and understand their implication.

In fashion retail, we talk about men’s jeans as an item, but must realize they come in a variety of styles, colors and sizes. If we are using the term item to really mean SKU,
we must know these attributes and the brand in order to define an SKU.

In general merchandise, the concept of “kitting” recognizes that certain combinations of items can be created as another item to encourage purchase of all the things in the kit.
Computer applications must understand the effect on inventory when a kit is constructed at store level or broken apart into its components. The kit requires its own UPC and must
be considered a special type of item until the overwrap is removed.
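
The inventory effect of kitting can be sketched in a few lines. This is only an illustration under my own assumptions: the item names, the `KIT_COMPONENTS` table and the `build_kits` / `break_kits` functions are all hypothetical, not taken from any particular retail package.

```python
from collections import Counter

# Hypothetical kit definition: kit item -> component items and quantities per kit.
KIT_COMPONENTS = {
    "GRILL-KIT": {"TONGS": 1, "SPATULA": 1, "BRUSH": 1},
}


def build_kits(on_hand: Counter, kit: str, qty: int) -> None:
    """Assemble `qty` kits at store level: components go down, kits go up."""
    components = KIT_COMPONENTS[kit]
    short = {c: n * qty - on_hand[c]
             for c, n in components.items() if on_hand[c] < n * qty}
    if short:
        raise ValueError(f"not enough components to build {kit}: {short}")
    for c, n in components.items():
        on_hand[c] -= n * qty
    on_hand[kit] += qty


def break_kits(on_hand: Counter, kit: str, qty: int) -> None:
    """Remove the overwrap from `qty` kits: kits go down, components go up."""
    if on_hand[kit] < qty:
        raise ValueError(f"only {on_hand[kit]} of {kit} on hand")
    on_hand[kit] -= qty
    for c, n in KIT_COMPONENTS[kit].items():
        on_hand[c] += n * qty
```

The point of the sketch is that the kit carries its own on-hand quantity while it exists, and that both directions of the conversion are explicit transactions rather than data "corrections."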

Probably the most fundamental error a technology provider can make is to equate UPC or GTIN with item. In the supermarket, there are a variety of physical units with different
UPCs that can represent the same item. Manufacturer-sponsored promotions in the form of pre-priced, cents-off or bonus packs of an item are all the same item from a replenishment
perspective, but are not the same when considering forecasting and pricing. Various packaging configurations, such as multi-packs or cases, are also the same item, just instances
where it is packaged differently.
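
One way to picture the distinction is a link table from UPC to item that also records what kind of package each UPC is. The sketch below is an assumption-laden illustration: the `UpcLink` record, the placeholder UPC strings and the `COLA-12OZ` item are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UpcLink:
    upc: str      # placeholder strings, not real UPCs
    item_id: str  # hypothetical retailer item used for replenishment
    variant: str  # e.g. "regular", "bonus_pack", "multi_pack"
    eaches: int   # retail units of the base item inside this package


LINKS = {
    link.upc: link
    for link in [
        UpcLink("UPC-REGULAR", "COLA-12OZ", "regular", 1),
        UpcLink("UPC-BONUS", "COLA-12OZ", "bonus_pack", 1),
        UpcLink("UPC-6PACK", "COLA-12OZ", "multi_pack", 6),
    ]
}


def replenishment_demand(units_sold):
    """Roll UPC-level sales up to the item, expressed in eaches."""
    totals = defaultdict(int)
    for upc, units in units_sold.items():
        link = LINKS[upc]
        totals[link.item_id] += units * link.eaches
    return dict(totals)


def forecasting_demand(units_sold):
    """Keep promotional and pack variants apart for forecasting and pricing."""
    totals = defaultdict(int)
    for upc, units in units_sold.items():
        link = LINKS[upc]
        totals[(link.item_id, link.variant)] += units
    return dict(totals)
```

The same scan data feeds both views; what changes is which association the application chooses to honor.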

Downsizing is an FMCG practice of preserving a price point by reducing the net contents of a retail unit. Thus, the pound can of coffee is now down to 13 ounces but it is still
the same item and, although it shouldn’t, has often kept the same UPC.

Greeting cards and plant seeds represent yet a different type of item. Most retailers do not keep inventory on each individual greeting card and type of plant seed. They have
a price point item that represents the selling price of a particular brand and type of greeting card. All the “$2.99 Cards” are represented by one retailer item number. While the
manufacturer will monitor individual UPCs to determine how well the specific UPC sells, the retailer will only track the generic price point. As cards go in and out of season,
the price point item that defines them can literally have thousands of UPCs associated with it. Markdowns on individual UPCs for price point items are a challenge. Some retailers
may assign separate item numbers to holiday card price points so they can be discounted when the holiday passes.
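
A minimal sketch of the price point idea follows. Everything in it is assumed for illustration: the item numbers, the placeholder UPC strings, and the choice to hold prices in cents and round markdowns down.

```python
# Hypothetical price point items and their shelf prices, in cents.
PRICE_POINT_ITEMS = {
    "CARD-299": 299,        # everyday "$2.99 Cards"
    "CARD-299-XMAS": 299,   # separate item number for holiday cards
}

# Many manufacturer UPCs map to one retailer price point item.
UPC_TO_PRICE_POINT = {
    "CARD-UPC-0001": "CARD-299",
    "CARD-UPC-0002": "CARD-299",
    "CARD-UPC-9001": "CARD-299-XMAS",
}


def price_for(upc, markdown_pct=None):
    """Return (retailer item, price in cents) for a scanned card UPC.

    `markdown_pct` maps a price point item to a whole-number percent off.
    The markdown is applied to the price point item, never to a single UPC.
    """
    item = UPC_TO_PRICE_POINT[upc]
    pct = (markdown_pct or {}).get(item, 0)
    return item, PRICE_POINT_ITEMS[item] * (100 - pct) // 100
```

With the holiday cards on their own item number, a post-holiday markdown such as `{"CARD-299-XMAS": 50}` leaves the everyday cards untouched.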

Variable weight products create another set of nuances as the UPC does not represent an item at all but defines a category and contains the extended retail price. Variable weight
UPCs are not GTINs because their definition is not global. Industry groups issue recommendations for use of UPC ranges, but individual retailers also make their own assignments.
The same item selling in different stores can have different UPCs. Attempts to reference these UPCs as the same item will fail.
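
To make this concrete, here is a sketch that assumes one common layout for a 12-digit variable weight code: a leading 2, a five-digit code, a price check digit, a four-digit price in cents, and a final check digit. That layout is an assumption, as are the store names, codes and the `resolve` function; retailers do use other layouts, and check digits are not validated here.

```python
from typing import NamedTuple


class VariableWeightScan(NamedTuple):
    code: str         # five-digit code, meaningful only within one retailer or store
    price_cents: int  # extended retail price printed into the barcode


def parse_variable_weight(upc: str) -> VariableWeightScan:
    """Split a variable weight code, assuming the layout 2 CCCCC x PPPP k."""
    if len(upc) != 12 or not upc.isdigit() or upc[0] != "2":
        raise ValueError(f"not a variable weight code under the assumed layout: {upc!r}")
    return VariableWeightScan(code=upc[1:6], price_cents=int(upc[7:11]))


# Hypothetical assignments: the same item carries different codes in different stores.
STORE_CODE_TO_ITEM = {
    ("STORE-A", "01234"): "GROUND-BEEF",
    ("STORE-B", "04321"): "GROUND-BEEF",
}


def resolve(store: str, upc: str):
    """Return (item, price in cents); the lookup must be keyed by store as well as code."""
    scan = parse_variable_weight(upc)
    return STORE_CODE_TO_ITEM[(store, scan.code)], scan.price_cents
```

Keying the lookup on the store as well as the code is the whole point: an application that treats the barcode alone as the item identifier has no way to tell that the two scans are the same product.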

Finally, many organizations seem to still confuse selling units and logistics units. Selling units are “anything that is merchandised separately.” If I am going to merchandise
the same item in cases and multi-packs and “eaches,” then my replenishment applications must regard each of them as separate items and replenish them appropriately. On the other
hand, if I am only going to merchandise an item in eaches, then my applications must replenish it using the logistics unit that makes the most sense for the presentation stock
I have in the store. Suburban stores with large shelf space may replenish in full cases while city stores with small shelf space might replenish in smaller case packs or even
use break-pack options to minimize handling. Demand based on retail units must be linked to logistics units based on what is carried in the warehouse and what is merchandised
in the store. This conversion must be as flexible as possible so that substitutions can be used to minimize out-of-stocks.
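
The conversion from retail demand to logistics units might look something like the sketch below. The store profiles, pack names and the rule of taking the first preferred pack the warehouse can fully supply are all assumptions made for the example.

```python
import math

# Hypothetical pack options per store profile, in order of preference:
# (logistics unit, eaches per unit).
STORE_PACKS = {
    "SUBURBAN": [("CASE-24", 24), ("CASE-12", 12)],
    "CITY": [("CASE-12", 12), ("BREAK-PACK-6", 6)],
}


def plan_order(store, demand_eaches, warehouse_on_hand):
    """Convert demand in eaches into an order in logistics units.

    Tries each pack in the store's preference order and substitutes the
    next one when the warehouse cannot cover the full quantity.
    Returns (unit, quantity), or None if no pack can cover the demand.
    """
    for unit, eaches_per_unit in STORE_PACKS[store]:
        needed = math.ceil(demand_eaches / eaches_per_unit)
        if warehouse_on_hand.get(unit, 0) >= needed:
            return unit, needed
    return None
```

For example, a suburban store needing 30 eaches would order two 24-count cases, but would fall back to three 12-count cases if the warehouse had only one 24-count case left.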

Item is just one area where I believe the simple declaration of dirty data has more subtle implications. My experience with many retail packages has been that their simplified
view of the business has made them difficult to use and resulted in artificial constraints on the business. Instead of blaming bad data, I feel applications must better understand
the relationship between various retail entities, and be able to use and preserve those associations within the context of their function.

Moderator’s Comment: Are there other nuances to the term item? Have you had situations where you believe the term “dirty data” was really “misunderstood
data”? Is there a way to avoid these situations?

I believe the subtlety of the item definition is just one example of how we must improve the ability of computer applications to make inferences from the
real world around them. We must also understand the association between “business partners,” as the same individual may be a supplier, employee, customer, and someone’s cousin.
The association between weather and events in the store is important for predicting “snow days.” (Is that spike in sales really bad data?)

Understanding of these relationships is critical and can be leveraged across many store locations to improve results. It is worth the extra effort that
may be necessary to make associations between UPCs or other entity identifiers so that the relationship can be taken into consideration by the application designers.

Bill Bittner – Moderator

5 Comments
M. Jericho Banks PhD
19 years ago

Another nuance to “item” has to do with romantic relationships. For example, Nancy and Mark are dating and are an “item.” (Well, you asked, didn’t you?) In this context, “dirty data” takes on a whole new connotation as well, which we’d be wise not to pursue in this forum.

I learned binomial nomenclature (kingdom, phylum, class, order, family, genus, species) in high school and used it extensively later as I collected undergrad minors in entomology, zoology, and botany. I always marveled that, no matter how many thousands, even millions, of species and subspecies we discovered, we had a unique, predictable, intuitive, and reasonable way to name them.

Same deal with Dewey’s Decimal System.

So, why do we find it so difficult to identify products for sale “by the each” in a predictable, intuitive, and reasonable way – especially now that we have computers, which weren’t around when Mr. Dewey and Mr. Binomial (joke) were designing their systems? A logical naming convention is required, and there are some out there. I was associated with one a few years back that solved the problem of unique product identification, but it never caught on. We’ve settled on the ubiquitous UPC, but the numerical portion of its base makes it less intuitive to the human eye when printed in reports. There’s gotta be a better way.

James Tenser
19 years ago

Simply put, the retailer who plans to apply advanced analytics or decision support tools had better know enough about the quality of its POS data. The old cliché, GIGO, applies here.

As Bill makes clear, retail data quality has at least two dimensions. There is the accuracy and completeness of the information itself (scan data) and the structure of the data (to permit useful inferences to be derived). Proper assignment of UPC codes goes a long way toward permitting useful analyses.

I would add a third consideration too – velocity. Data delayed is decision denied. A retail store is a dynamic environment that cannot be well-managed based on stacks of green-bar paper. A perfect report on last quarter’s item movement offers little to help you fix an out of stock this week.

Thomas A. Smith
19 years ago

While I agree with some of your comments, a common mistake is made in underestimating the real business value in the collection and correlation of specific and unique item data and relying on the term “Dirty Data” to do the dirty work. In other words, using a smoke screen to avoid commitment.

Many a business manager has excused themselves declaring their item data is not dirty and that item data in the enterprise has been painstakingly “cleaned.” Still others hold that the attribute terms are ill defined or perhaps poorly explained, something that can never be overcome by software.

While I agree that the term “Dirty Data” is not an accurate term or measurement of value in the supply chain, what is of value is the actual collection and understanding of the data values. In other words, the value is in the cleansing activities and around understanding the information represented by the data.

“What is of value, is the actual collection and understanding of the data.”

In circumstances where it has been asserted that applications must better understand the relationship between various retail entities, and be able to use and preserve those associations within the context of their function; functionally, just the opposite is true. Retailers and manufacturers both need to better understand the implications that Data Sync has for making them successful in the years to come.

“Streamlining your item data collection not only prepares you for the future of electronic commerce but makes your organization more effective managers of Product Specification Sheets, Sales Literature and Marketing Material.”

The term “cleaning up” is a term that should be used in the context of true collaboration. Item clean up does not imply that your item is dirty; rather, item clean up is more an exercise in understanding the information required for data synchronization and that the information required is generally not found in one’s enterprise or even in one place. It’s found all over the business, on personal computers, in files, and in sales literature. To make matters worse, the information is not frequently found in the same format or even the same type of file. In fact, “Cleaning up Dirty Data” is all about understanding; understanding one’s own business and understanding the business of one’s trading partners, thereby enhancing the relationship between them.

“Using the right set of standards and tools, you can manage all of these activities with the same synchronized content.”

Retail and manufacturing companies alike must learn to understand the value and uses of standardized data and formats, where differences in understanding appear, and where corrective actions must be initiated. Too often companies rely on required and optional signals from trading partners to decide which attributes are important or which to populate in their GDS utility, making them reactive in the supply chain instead of proactive.

The fact remains that all attributes are important, but all attributes may not be important to everyone. It’s incumbent on the participants of such initiatives to collect and deliver as much “clean” information as possible to their trading partners to foster the relationship between them. Understanding fosters a greater commitment and, therefore, a better value in the relationship between them and, ultimately, a better trading community among us.

Bernice Hurst
19 years ago

Perhaps the biggest challenge in any database is communication between the people who will compile and use the data and the people designing the system in which it will be stored and retrieved. I spent years, in several different environments, attempting to achieve a simple method of correlation and have to admit to only very limited success. Human incompatibility is at the core of any dirty or misunderstood data and probably will be forevermore.

Franklin Benson
19 years ago

The accusation of “dirty data” seems to happen only with bad news. It might be privately thought, on occasion, with good news, but the public utterances of the phrase only seem to happen with news that is bad “beyond belief.”

I think a Social Psychologist would have a more erudite answer as to why that is, but an economist or an analyst should be able to tell you that dirty data happens with the same frequency in good news as it does in bad.

There may be good ways of convincing people that data is good, or that it is dirty, and in fact I would be highly interested in finding out what those ways are. As an analyst, I know that saying “told you so” doesn’t work, and generally does more harm than good.

Really, saying “told you so” represents a failure anyway. It means that the analyst was unable to present the information with enough credibility or enough detail to convince the decision makers. But having to convince the decision makers of the cleanliness or dirtiness of the data in the first place takes it to a whole other level.