Garbage In, Garbage Out: Getting Good Data Out of Your BI Systems
Introduction
Recently, a big topic in BI has been the importance of “data quality” – in essence, cleaning up the data that is presented to the data warehouse so that the errors, inconsistencies, and so on are eliminated as much as possible. What many users do not realize is that this is just a part of a larger problem: How do you get the most out of the business-relevant data floating around, inside and outside the organization? As it happens, I performed a survey and study on the subject last year.
Analysis
What does this study recommend? The short answer is that you should start thinking of your organization as being in the business of gathering data, turning it into information, and using that information as effectively as possible. In other words, think of your organization as trying to get as much high-quality, potentially useful information as possible to your BI solution, analyzing that information, and then using the analysis to make decisions as rapidly as possible. Then develop a set of metrics that tell you how well you are doing and where the weak points in the process are.
This set of metrics measures what I call data usefulness. I define data usefulness as the ability to deliver all needed accurate, consistent, and appropriate data to the right user in a timely fashion. My survey in the study convinces me that there are significant and growing problems at every point in the process of converting data into useful information. Table 1 shows my take on the typical steps in the data-delivery process, the metrics by which the effectiveness of each step should be judged, and the problems that many are seeing today at each step. The key take-away point is that fixes to one or two steps will not in the long run fix the overall data-usefulness problem. Rather, organizations of all sizes need to take a comprehensive, long-term approach to ensuring data usefulness.
The Data Delivery Cycle
| Step | Metric | Example | Problem |
| --- | --- | --- | --- |
| Data entry | Accuracy | Percent of data items with errors | Majority of businesses report more than 15% of items with errors |
| Data consolidation | Consistency | Number of data items with multiple records and no master record | Majority of businesses report more than half their data inconsistent |
| Data aggregation | Scope | Percent of data sources on which a cross-data-source query can be performed | Majority of businesses report they can’t do cross-database query on more than 2/3 of company data |
| Information targeting | Fit | Percent of time data delivered that is not appropriate to end user | Majority of businesses report more than 60% of the time, data delivered to executives inappropriate |
| Information delivery | Timeliness | Time taken to deliver (entry to arrival on screen) to average user | Majority of businesses report a week or more average time to deliver |
| Information analysis | Analyzability | Percent of time user can’t immediately do online analysis of data received | Majority of businesses report can’t do immediate online analysis more than ½ the time |
| Process adjustment | Agility | Percent of new outside data sources not available within ½ year | Majority of businesses report more than ¾ of relevant new Web information not made available inside the company within ½ year |
The Data Delivery Cycle also shows that, by users’ own estimation, more than 2/3 of the data that flows into the organization is not used effectively. In fact, if you include the inability to flow new sources of data into BI, more than ¾ of the useful data out there never gets used right.
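As a rough illustration of how a few of the metrics in the Data Delivery Cycle might be tracked, here is a minimal Python sketch covering accuracy, consistency, and timeliness. The `Record` fields, the idea of a pre-computed error flag, and the sample data are all hypothetical and are not taken from the study; a real implementation would depend on how your systems flag errors and identify master records.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Record:
    # All fields are hypothetical, chosen only to illustrate the metrics.
    entity_id: str          # key identifying the real-world item
    source: str             # system the record came from
    has_error: bool         # set by some validation step assumed to exist
    is_master: bool         # True if this is the designated master record
    entered_at: datetime    # when the data was entered
    delivered_at: datetime  # when it arrived on the user's screen


def error_rate(records):
    """Accuracy: percent of data items with errors."""
    if not records:
        return 0.0
    return 100.0 * sum(r.has_error for r in records) / len(records)


def items_without_master(records):
    """Consistency: number of items with multiple records and no master record."""
    by_entity = defaultdict(list)
    for r in records:
        by_entity[r.entity_id].append(r)
    return sum(
        1
        for group in by_entity.values()
        if len(group) > 1 and not any(r.is_master for r in group)
    )


def average_delivery_days(records):
    """Timeliness: mean time from entry to arrival on screen, in days."""
    if not records:
        return 0.0
    seconds = sum((r.delivered_at - r.entered_at).total_seconds() for r in records)
    return seconds / len(records) / 86400


# Made-up sample data: two customers, each recorded in two systems.
records = [
    Record("cust-1", "crm", False, False, datetime(2009, 3, 1), datetime(2009, 3, 8)),
    Record("cust-1", "billing", True, False, datetime(2009, 3, 1), datetime(2009, 3, 9)),
    Record("cust-2", "crm", False, True, datetime(2009, 3, 2), datetime(2009, 3, 4)),
    Record("cust-2", "billing", False, False, datetime(2009, 3, 2), datetime(2009, 3, 6)),
]
```

Calling `error_rate(records)`, `items_without_master(records)`, and `average_delivery_days(records)` on a regular schedule would give one trend line per step. This fits the point that a fix to a single step does not by itself show whether overall data usefulness is improving.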
Conclusion
There is a great deal more in this study about how to tackle particular parts of the problem, what vendor products and solutions deliver the biggest bang for the buck, and how to monitor the process – material that is far too long to cover here. However, users can take certain steps immediately, without knowing that much about data usefulness. Those steps are:
- If your company now includes 2 or more (usually acquired) organizations whose operational data is at least partially kept separately, put an EII (Enterprise Information Integration) or “data virtualization” tool in front of your BI, so you can query both sets of data at once. Examples: Composite Software, IBM Information Server.
- Get a dashboard BI add-on for senior executives, so that you can fine-tune the relevance of the information presented to them, and perhaps be able to display more current information than standard BI reporting tools. Examples: BI solution vendors such as SAP BusinessObjects and MicroStrategy.
- Make sure you have an Excel option for your BI solution, so useful information can be delivered as spreadsheets as well as hard copy. Examples: Oracle OLAP Option and 1010data.
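To make the first step concrete, here is a small Python sketch of querying two separately kept data stores at once. It uses SQLite's `ATTACH DATABASE` purely as a stand-in. It does not show how EII or data virtualization products work internally, since those tools federate queries across different kinds of systems. The table names, columns, and figures are made up for illustration.

```python
import sqlite3

# Two separate stores: "main" stands in for the parent company's data,
# "acquired" for the separately kept data of an acquired organization.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS acquired")

conn.execute("CREATE TABLE main.orders (customer TEXT, amount REAL)")
conn.execute("CREATE TABLE acquired.orders (customer TEXT, amount REAL)")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [("Acme", 100.0), ("Globex", 250.0)],
)
conn.executemany(
    "INSERT INTO acquired.orders VALUES (?, ?)",
    [("Acme", 40.0), ("Initech", 75.0)],
)
conn.commit()


def total_by_customer(connection):
    """Run one query that spans both stores and totals orders per customer."""
    rows = connection.execute(
        """
        SELECT customer, SUM(amount)
        FROM (
            SELECT customer, amount FROM main.orders
            UNION ALL
            SELECT customer, amount FROM acquired.orders
        )
        GROUP BY customer
        ORDER BY customer
        """
    ).fetchall()
    return dict(rows)
```

Here the data stays in its original store and only the query spans both, which is the general idea behind putting such a tool in front of your BI instead of first copying everything into one place.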
One caution: today as in the past, the first reaction to problems with data usefulness is to attempt to put everything to do with data in the BI organization, and all data in the BI data store. Experience has shown that this never, or only temporarily, works. There’s just too much data floating around out there, and more new types of data coming all the time. Improving the quality of data coming into the BI solution is one thing; attempting to force everyone to enter all data into the BI solution is quite another. Don’t waste your time and money on the latter.
Data Delivery Cycle source: Infostructure Associates, March 2009
2 Comments
I would also make the following observations:
1. "Bad data" has its place too - especially in tying the business processes to the systems, and to the places in the critical path where the perception no longer meets the reality. By simply asking IT to "mask" out, change, or quality cleanse the data set we are saying that we don't care why, how, or where the problems come from. I've worked in organizations like Dept of Defense and Lockheed Martin long enough to know that Cycle Time Reduction, Critical Path Analysis, and Business Process Alignment take a FIRM understanding of where the errors in the data are occurring and why.
The suggestions in this article are good; however, don't throw the baby out with the bathwater. It is one thing to "demand" better quality data, but it's an entirely different set of analytics to actually try to understand WHY it's bad in the first place, and then set about correcting the business processes through gap analysis and alignment with the systems.
Without analytics of dirty data, you would never get there.
2. Simply cleaning the data as it streams out of the source systems into your analytics solution (be it: Operational Data Store, Data Warehouse, or Business Intelligence system) is not enough. Clearly many of the business users force IT to simply "clean it up". The RIGHT place and the RIGHT time to clean data is AT THE SOURCE, again - find the source of the problem and fix it. But to understand the source, one must analyze the patterns of the bad data and trace it back.
3. I wholeheartedly agree, better data = better business decisions. It's merely the method that is applied in most businesses that I tend to disagree with. I DO NOT believe in the statement: Garbage In = Garbage Out. I think there are diamonds of discovery lying within the "bad data" as it were, and it's high time we began to treat ALL data as a business asset.
Yes, bad data costs us money - but the broken business processes and the misaligned perceptions upstream cost us ten-fold more.
Find and fix the PROBLEM, and new data will be clean upon arrival.
One more note: Simply cleansing/changing/merging and altering BAD data brings us way out of compliance with auditors... be cautious when assigning the term "Garbage in = garbage out"; following this mentality can get your business into a lot of hot water quickly.
Hope this helps,
Dan Linstedt
Dan, thanks for these valuable observations. - Richard