Garbage In, Garbage Out: Getting Good Data Out of Your BI Systems

Introduction

Recently, a big topic in BI has been the importance of “data quality” – in essence, cleaning up the data that is presented to the data warehouse so that the errors, inconsistencies, and so on are eliminated as much as possible. What many users do not realize is that this is just a part of a larger problem: How do you get the most out of the business-relevant data floating around, inside and outside the organization? As it happens, I performed a survey and study on the subject last year.

Analysis

What does this study recommend? The short answer is that you should start thinking of your organization as being in the business of gathering data, turning it into information, and using that information as effectively as possible. In other words, think of your organization as trying to get as much high-quality, potentially useful information into your BI solution as possible, analyzing that information, and then using the analysis to make decisions as rapidly as possible. Then develop a set of metrics that tells you how well you are doing and where the weak points in the process are.

This set of metrics measures what I call data usefulness. I define data usefulness as the ability to deliver all needed accurate, consistent, and appropriate data to the right user in a timely fashion. My survey convinces me that there are significant and growing problems at every point in the process of converting data into useful information. Table 1 shows my take on the typical steps in the data-delivery process, the metrics by which the effectiveness of each step should be judged, and the problems that many organizations are seeing today at each step. The key takeaway is that fixes to one or two steps will not, in the long run, fix the overall data-usefulness problem. Rather, organizations of all sizes need to take a comprehensive, long-term approach to ensuring data usefulness.

Table 1: The Data Delivery Cycle

| Step | Metric | Example | Problem |
|---|---|---|---|
| Data entry | Accuracy | Percent of data items with errors | Majority of businesses report more than 15% of items with errors |
| Data consolidation | Consistency | Number of data items with multiple records and no master record | Majority of businesses report more than half their data inconsistent |
| Data aggregation | Scope | Percent of data sources on which a cross-data-source query can be performed | Majority of businesses report they can’t do cross-database query on more than 2/3 of company data |
| Information targeting | Fit | Percent of time data delivered that is not appropriate to end user | Majority of businesses report more than 60% of the time, data delivered to executives inappropriate |
| Information delivery | Timeliness | Time taken to deliver (entry to arrival on screen) to average user | Majority of businesses report a week or more average time to deliver |
| Information analysis | Analyzability | Percent of time user can’t immediately do online analysis of data received | Majority of businesses report can’t do immediate online analysis more than ½ the time |
| Process adjustment | Agility | Percent of new outside data sources not available within 1/2 year | Majority of businesses report more than ¾ of relevant new Web information not made available inside the company within ½ year |

The Data Delivery Cycle also shows that, by users’ own estimation, more than two-thirds of the data that flows into the organization is not used effectively. In fact, if you include the inability to bring new sources of data into BI, more than three-quarters of the useful data out there is never used effectively.
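As a rough illustration of how the percentage-style metrics in the Data Delivery Cycle might be tracked, here is a minimal Python sketch. The `StepMetric` class, the `weakest_steps` helper, the threshold, and all counts are hypothetical and are not taken from the study. Time-based metrics such as timeliness would need a different unit and are left out.

```python
from dataclasses import dataclass


@dataclass
class StepMetric:
    step: str    # step in the data-delivery cycle
    metric: str  # quality being measured at that step
    bad: int     # hypothetical count of items or events that fail the metric
    total: int   # hypothetical count of all items or events observed

    @property
    def failure_rate(self) -> float:
        """Fraction of observations that fail the metric (0.0 if none observed)."""
        return self.bad / self.total if self.total else 0.0


def weakest_steps(metrics, threshold):
    """Return the steps whose failure rate exceeds the threshold, worst first."""
    flagged = [m for m in metrics if m.failure_rate > threshold]
    return sorted(flagged, key=lambda m: m.failure_rate, reverse=True)


# Made-up counts for illustration only; they are not survey results.
sample = [
    StepMetric("Data entry", "Accuracy", bad=150, total=1000),
    StepMetric("Data consolidation", "Consistency", bad=40, total=100),
    StepMetric("Information targeting", "Fit", bad=5, total=50),
]

for m in weakest_steps(sample, threshold=0.12):
    print(f"{m.step}: {m.failure_rate:.0%} failing on {m.metric.lower()}")
```

Ranking the steps by failure rate is only one possible way to surface the weak points in the process; an organization would choose its own thresholds and weighting.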

Conclusion

There is a great deal more in this study about how to tackle particular parts of the problem, which vendor products and solutions deliver the biggest bang for the buck, and how to monitor the process; that material is far too long to cover here. However, users can take certain steps immediately, without knowing that much about data usefulness. Those steps are:

  1. If your company now includes 2 or more (usually acquired) organizations whose operational data is at least partially kept separately, put an EII (Enterprise Information Integration) or “data virtualization” tool in front of your BI, so you can query both sets of data at once. Examples: Composite Software, IBM Information Server.
  2. Get a dashboard BI add-on for senior executives, so that you can fine-tune the relevance of the information presented to them and perhaps display more current information than standard BI reporting tools allow. Examples: BI solution vendors such as SAP Business Objects and MicroStrategy.
  3. Make sure you have an Excel option for your BI solution, so useful information can be delivered as spreadsheets as well as hard copy. Examples: Oracle OLAP Option and 1010data.
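To make the idea behind step 1 concrete, here is a small Python sketch of a single query spanning two separately kept data stores. It uses SQLite's `ATTACH DATABASE` purely as a stand-in; it does not represent how Composite Software, IBM Information Server, or any other EII product works, and the `customers` schema and rows are invented for illustration.

```python
import sqlite3

# Two in-memory databases stand in for the separately kept operational
# data of two merged organizations (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS acquired")

conn.execute("CREATE TABLE main.customers (name TEXT, revenue REAL)")
conn.execute("CREATE TABLE acquired.customers (name TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [("Acme", 120.0), ("Globex", 80.0)],
)
conn.executemany(
    "INSERT INTO acquired.customers VALUES (?, ?)",
    [("Acme", 30.0), ("Initech", 55.0)],
)

# One query reads from both sources; neither data set is copied into the other.
combined = conn.execute(
    """
    SELECT name, SUM(revenue) AS total_revenue
    FROM (
        SELECT name, revenue FROM main.customers
        UNION ALL
        SELECT name, revenue FROM acquired.customers
    )
    GROUP BY name
    ORDER BY name
    """
).fetchall()

print(combined)
```

The point of the sketch is that the BI layer sees one combined result while each organization's data stays in its own store, which is the property step 1 is after.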

One caution: today as in the past, the first reaction to problems with data usefulness is to attempt to put everything to do with data in the BI organization, and all data in the BI data store. Experience has shown that this never, or only temporarily, works. There’s just too much data floating around out there, and more new types of data coming all the time. Improving the quality of data coming into the BI solution is one thing; attempting to force everyone to enter all data into the BI solution is quite another. Don’t waste your time and money on the latter.

Disclosures and References

Data Delivery Cycle source: Infostructure Associates, March 2009

Dan Linstedt
President, Empowered Holdings, LLC

I would also make the following observations:

1. "Bad data" has its place too, especially in tying the business processes to the systems, and to the places in the critical path where the perception no longer meets the reality. By simply asking IT to "mask" out, change, or quality cleanse the data set, we are saying that we don't care why, how, or where the problems come from. I've worked in organizations like the Dept of Defense and Lockheed Martin long enough to know that Cycle Time Reduction, Critical Path Analysis, and Business Process Alignment take a FIRM understanding of where the errors in the data are occurring and why.

The suggestions in this article are good; however, don't throw the baby out with the bathwater. It is one thing to "demand" better quality data, but it's an entirely different set of analytics to actually try to understand WHY it's bad in the first place, and then set about correcting the business processes through gap analysis and alignment with the systems.

Without analytics of dirty data, you would never get there.

2. Simply cleaning the data as it streams out of the source systems into your analytics solution (be it an Operational Data Store, Data Warehouse, or Business Intelligence system) is not enough. Clearly, many business users force IT to simply "clean it up". The RIGHT place and the RIGHT time to clean data is AT THE SOURCE. Again: find the source of the problem and fix it. But to understand the source, one must analyze the patterns of the bad data and trace them back.

3. I wholeheartedly agree: better data = better business decisions. It's merely the method applied in most businesses that I tend to disagree with. I DO NOT believe in the statement "Garbage In = Garbage Out". I think there are diamonds of discovery lying within the "bad data", as it were, and it's high time we began to treat ALL data as a business asset.

Yes, bad data costs us money, but the broken business processes and the misaligned perceptions upstream cost us tenfold more.

Find and fix the PROBLEM, and new data will be clean upon arrival.

One more note: simply cleansing, changing, merging, and altering BAD data brings us way out of compliance with auditors. Be cautious when applying the term "Garbage In = Garbage Out"; following this mentality can get your business into a lot of hot water quickly.

Hope this helps,
Dan Linstedt

Richard Furlong
CEO/Founder, public-sector-lists.com

Dan, thanks for these valuable observations. - Richard
