Share what you know with millions of people
Focus is the best place to turn what you know into remarkable content
0
What are the top 3 pros and cons of data warehouses?
Data Warehouses are around since the mid 80s. And still millions of $$ are spend, and still we see many failures. Why do we need data warehouses? Or why don't we?
What are the top 3 pros and cons of Data Warehouses?
Events
- Dos and Don'ts of Small Business Marketing May 29 @ 11 am PT
- Lead Nurturing 202: The Next Generation May 31 @ 11 am PT
- The Tricks to Paid Media June 6 @ 11 am PT
- Display Advertising for Brand Awareness June 20 @ 11 am PT





6 Answers
There are many more than 3 pros and cons of data warehouses, and Hrvoje's choice is as good as any - although they seem to focus more on the original justification for data warehouses than current practice. I'd like to approach Gertjan's question from the viewpoint of what he seems to imply in the lead-in: do we STILL need data warehouses, especially in the light of the many failures we see?
The link to project failures is unfortunate, I feel. Data warehouse projects fail because they are complex projects with multiple external dependencies - not because they are data warehouse projects per se. The fundamental purpose of a data warehouse is to consolidate and reconcile data from disparate sources to create a trustworthy source for BI analyses. This is a complex and interdependent process, no matter whether it is done via a warehouse or not. Today, there are other possible approaches to meeting this need, virtualization, for example, but all suffer from similar complexity and interdependency issues. And are also prone to failure.
This leads to the question of whether we need a data warehouse any more. My belief is that we do, but that its size will decrease, certainly in proportion to other data needs and most likely in absolute terms too. Functions like financial reporting and accounting do need a shared and consistent view of the business - "a single version of the truth" - and a data warehouse is the ideal place to create and store it. However, there are many other functions where speed of access, huge volumes of data, or simply a need to see multiple versions of reality are more important. We see this in Hadoop systems, DW appliances and spreadsheets among other things. So, the role of the traditional data warehouse becomes more limited.
The growing challenge, of course, is how to manage the quality and integrity of data over such varied and distributed data sources.
I would tend to say: we do need a data warehouse, and will continue to need a data warehouse. I would also tend to disagree with Mr. Devlin, in that I believe Data Warehouses size will not shrink, but in fact grow (this is what has been happening). The need is for RAW data warehouses, not data warehouses that house "business massaged data sets".
There is still a need for business massaged data sets, and that need is growing: but it is separating from the data warehouse, and is typically known as Master Data system (hopefully with Master Data Mangement in place on top of that). The MDM solution (people, technology, process, data), will be the "single-version-of-the-truth" that the business is so desperately seeking. Leaving the Raw Data Warehouse to do it's job: record an integrated snapshot of all data coming in or leaving the organization - that is necessary from an auditability standpoint.
Again, in my opinion an EDW is needed, will continue to be needed, and will grow (not shrink). In my opinion, the RAW data Warehouse is rising, and the "massaged" or business altered SVOT (single view of the truth) will be a perception / requirement placed on Master Data Sets.
This is what I see happening all over the world, at customers I visit - from Financial companies to Oil & Gas, to Insurance.
Regards,
Dan Linstedt
Hi Gertjan,
from my point of view and experience here are 3 pros and cons:
Pros:
1. Reporting and analysis are made on data warehouse data, not directly on ERP, transactional data. That way we avoid unneeded load on ERP system.
2. With data warehouse you are able to consolidate data from many different sources, several databases, xml data, excel spreadsheet etc...
3. In data warehouse you can update data in the way maybe impossible otherwise, you can enforce different business rules without touching EPR systems and you can prepare data by aggregating them in order for reporting to be fast.
Cons:
1. Warehouse Data is usually not up-to date. Most of the time, implementation is made so that transactional data are transferred into data warehouse during night jobs.
2. In order to have a data warehouse, you usually need additional hardware and software expenses
3. With data warehouse the whole system become more complex.
As a researcher and business modeler, I feel that data warehouses are necessary for regression analysis and even predictive behavior Whether they are current or not may not be so vital, although currency can improve demand forecasting and predictive behavior.
What I find most disturbing about data warehouses are there unstructured nature. They are more patterned after form-based queries than statistical analysis. Fortunately, with SQL commands, one can construct tables that are prudent or useful.
In summary, data warehouses for statistical analysis are helpful, but the information could be structured better for the purpose of being a data set There are query or report languages that can structure them for better data manipulation, but that usually adds CPU and network time to any project.
Hi Barry,
I am not so sure I do agree with your definition. Perhaps it is my perception of what Data Warehouse means. Let's see if I define it the same way...
The raw DW that I see being implemented includes both structured and unstructured data (although the unstrcutured is in fact searched and the results are "structured" in order to link to the structured world), and the raw DW is being fed by real-time messaging as well as batch. I've also seen the trend to absorb more and more external data from the outside world, in regards to XML. I've continued to see a rise in growth of the data sets from all of these sources going in to "traditional Data Warehouse."
However the traditional data warehouse (definition) has changed: it is now a passive integration point (integrated by business key), stored in raw format, and accepting all data as raw - regardless of arrival mechanisms. It has become the one-place in the organization that exists now as a system-of-record (no longer a single version of the truth). I also see huge growth in this area.
However, as you point out: "Video, docs, images, audio" usually do not get stored directly in the data warehouse (raw or not). This is true. These things must be "mined" for patterns before storing just the results in the data warehouse. However, I still am not understanding how the DW as you defined it, aligns with the DW as I've defined it, and I think I am missing your point (my apologies) on why this resource is shrinking, especially when my experience in the field is the opposite. I welcome any clarifications you would care to share with me.
Consistency and reconciliation needs are definitely highly focused, and "less but better data" is needed at this level to make better decisions. However, in order to run statistics, or mine the information, more "raw data" is needed on the raw DW to make "less but better results". Also, I am referring to the fact that most of the customers I talk to, state they are headed in the direction of "virtual data marts" with a raw DW in the back-end, and with virtual data marts in place, they can begin to make self-service BI a reality.
From that regard, the virtual data marts in fact, do have less data, and are more specific and are more targeted. But, the engines producing these "more informed decisions" need more raw data to make the information "better" with better correlated results.
So, sizes of data in the data marts is shrinking, while sizes of the data in the warehouse are growing... This is what I am experiencing in the customer base. Please let me know if I've completely missed your point (which has been known to happen). By the way, I do see a rise in master data systems, and yes, they too are smaller than the raw DW (because the MD systems do not store history nor multiple copies of the data), but they are just another publisher and subscriber on the enterprise service bus.
Thank-you kindly,
Dan Linstedt
To answer the original question: here are my top 3 picks:
Top 3 reasons to have a data warehouse:
* Auditability, Raw Data Set is required in the DW for this
* Compliance with internal and international regulations
* Historical snapshots of "old, retired systems", or systems which are external and outside our control.
Pros of having a Data Warehouse
* The ability to capture all source systems, and integrate by business key across lines of business
* Unloading of work from the source systems (answering ad-hoc deep time based queries for comparison analysis)
* The ability to incorporate unstructured, and structured data in the same place
* The ability to "see" and explore the gaps between source systems, business perception, business requirements, and source applications.
* Data is an asset on the books, and historical data is an asset as long as it is needed to meet compliance and auditory regulations.
and many more....
Cons of having a Data Warehouse
* Can be difficult to manage properly for an enterprise view
* Can be challenging to load properly
* Can be brittle (if the right architecture for the data model is not chosen)
* Can require a new knowledge set, new implementation skills (compared to those required to build and operate operational systems)
* Can require lots of money invested in high speed infrastructure (CPU, RAM, Disk, Network, processing power & applications).
* Can require a specific data model for the DW to counteract the original "brittleness" of the wrong designs in your data warehouse.
Do we need one? Yes, as long as the enterprise requires a "consolidated and integrated view" across all lines of business, including external data and unstructured data.
Is there anything stopping us from making it "virtual"? Yes, the amount of money required to put all that data in RAM. This is why you see "virtual marts" arriving on seen, the marts (once a true DW / Raw DW has been put in place) can be virtualized because their size and scope can be managed and paired down.
What causes the failures?
In my view, one of the main contributing factors is the lack of the proper data model architecture, and the lack of recognition that the DW must house RAW data integrated by business key. In other words, by putting "business rules" upstream of the data warehouse, and by choosing the wrong data model, the system can become brittle, and too complex to manage. When the business users require changes/additions, and the system requires re-engineering, it can become too expensive or take too long - causing the business to "shut it down", but then the business realizes they still need BI - so they try and try to re-focus, and rebuild.
However, if they rebuild based on the same principles, they will end up right back where they are now... Shutting it down and going through the same failure as before.
It's time for a change, in order to succeed two things must happen:
1) Recognition that the DW must be a RAW DW, push the business rules / interpretation down-stream (post-data warehouse) or between the DW and the data marts
2) Selection of the right Data Model for the RAW DW layers.
Anyhow, this is what I am seeing in the market place, and many of the customers I have visited who have broken the cycle, are now succeeding on an enterprise scale within their organization.
These are my two cents anyhow.
Thank-you kindly,
Dan Linstedt
Answer This Question