How do you define "big data"?
More and more I'm hearing the term "big data" tossed around by analysts, vendors and thought leaders alike without much insight into what the term means. How would you go about explaining the concept of "big data" in layman's terms?
11 Answers
I think Glen hit on the key point that has historically been used as a key component of the definition of big data - unstructured content (data). Big data is not necessarily just about the size of the data itself (which is obviously a key aspect), but also includes the mechanism by which the data is accessed/manipulated. We have had very large databases for many years. Along with those databases we have some pretty powerful systems/tools to manipulate them. But they have typically (not always though) been structured data models - processed either randomly or sequentially.
In today's world we have huge amounts of unstructured data from lots of sources, which taken in total is almost impossible to manage. That's what drove the development of architectural models for big data such as the Google File System, which is the basis for the Hadoop Distributed File System (HDFS, part of the Apache Software Foundation's Hadoop project). HDFS is basically a block architecture where files are split into large fixed-size blocks that are distributed and replicated across machines for fault tolerance. Along with HDFS, another element of the Hadoop project is MapReduce, a software framework (Java in this case) for distributed processing of large datasets on compute clusters. Essentially, HDFS is spread across a large number (hundreds or thousands) of nodes in a cluster, and each MapReduce task processes a portion of the overall data by performing distributed operations on key/value pairs.
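To make the key/value idea a bit more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in Python. It is not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are made up for illustration, and a real cluster would run the map and reduce steps in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) key/value pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # Each split stands in for the portion of data one map task would read.
    mapped = []
    for split in splits:
        mapped.extend(map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: two small "splits" of a larger dataset.
splits = ["big data is big", "data about data"]
counts = word_count(splits)
print(counts)
```

The point of the pattern is that each map call only ever sees its own split, so the work can be spread over as many machines as there are splits.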
There is a whole ecosystem of projects related to Hadoop including data serialization (Avro), data collection (Chukwa), structured data storage for large tables (HBase), data summarization and ad hoc query (Hive), distributed cluster coordination (ZooKeeper), etc. And like any open source initiative, a support structure has grown up around it - Amazon Elastic MapReduce, Hortonworks, MapR Technologies, Cloudera and Karmasphere, to name a few. A little over a year ago IBM announced a $100 million investment in big data and large-scale analytics.
But Hadoop/MapReduce is not the only game in town. Cassandra combines the data model of Google's BigTable with the data distribution and clustering approach of Amazon's Dynamo. HPCC Systems (a spin-off from LexisNexis Risk Solutions), Twitter Storm (stream processing - the next big thing) and Microsoft Azure Table storage are gaining interest and mind share.
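For anyone wondering what Dynamo-style data distribution looks like in practice, it is usually described in terms of consistent hashing: nodes and keys are hashed onto the same ring, and a key is stored on the next few nodes clockwise from its position. The following is a rough Python sketch under that assumption. The `HashRing` class, the node names and the key are hypothetical, and it leaves out things real systems add, such as virtual nodes.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Hash a string to a position on the ring (MD5 used only for spreading keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: one position per node, no virtual nodes."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def nodes_for(self, key):
        """Return the nodes that would hold the key, walking clockwise from its hash."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Hypothetical three-node cluster keeping two copies of each key.
ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
print(owners)
```

The appeal of this layout is that adding or removing a node only moves the keys adjacent to it on the ring, rather than reshuffling everything.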
So another question is what's the difference between big data and data mining? Well, essentially nothing. And you can add "knowledge discovery" as another similar term into the mix. In the end, it's all about finding patterns. The main difference is in the data itself. Historically it was presented in tabular form. Now it's delivered in network or "graph" form. What makes these latter forms different is that they are made up of different formats - text, images, videos, email messages, spreadsheets, etc. - and many times they include the complex connections and relationships (e.g., Facebook graphs) between the various data elements. So as you can see, it's no longer just about reading rows from a table!
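As a small illustration of the tabular versus graph distinction, here is a Python sketch with made-up data. The names, the `rows` and `edges` structures and the `within_hops` helper are all hypothetical; the point is only that a relationship question like "who is within two connections of Alice?" is a traversal, not a row lookup.

```python
from collections import defaultdict, deque

# Tabular form: every record has the same fixed columns.
rows = [
    {"user": "alice", "city": "Austin"},
    {"user": "bob", "city": "Boston"},
    {"user": "carol", "city": "Chicago"},
    {"user": "dave", "city": "Denver"},
]

# Graph form: the interesting information is in the connections.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first search: everyone reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    seen.discard(start)
    return seen


friends_of_friends = within_hops(graph, "alice", 2)
print(sorted(friends_of_friends))
```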
What's interesting is that there is a lot of discussion about the next generation of data management solutions. As with anything, there are camps with different perspectives on the viability of technological solutions, and one of those camps believes that the Hadoop/MapReduce architectural model (BigTable) won't be sufficient for the even more staggering amounts of data that will be created in the future. This should be an exciting area to watch.
My understanding of "big data" from one of my high tech clients is that it refers to huge databases such as Facebook, LinkedIn, Twitter and the like. I do not know if there are specific parameters in terms of terabytes but others will likely know far more than I.
John is correct in his basic breakdown of what is considered big data. It also includes large databases that companies use for business intelligence purposes. This data can be modelled and manipulated to get very detailed information in varied forms depending on what the team or department is looking for from their data.
I hope this helps
Steve
I agree with you, Steve. The term big data actually refers to a huge database where data mining can be used to access very important historical information in companies with very large databases.
I tend to think of big data primarily as petabyte-sized collections plus the network and computing capacity required to access and provide data confidentiality, accessibility, and integrity services.
For example, consider the longitudinal (multi-year, multi-episode) clinical records for a regional healthcare organization. That includes not only the rather small textual datasets from clinical encounters but also the very large datasets from medical imaging.
What makes it really big, though, is the network bandwidth required to access it as well as the high-volume mechanisms for queries, record locator services, and privacy protection policy enforcement.
An interesting article on Big Data can be found here:
http://www.mckinseyquarterly.com/Are_you_ready_for_the_era_of_big_data_2864
It actually goes further than many contributions in the discussion on Big Data here and deals with the impacts and benefits that big data techniques can have on companies. It also looks at differences between industries.
Any inspirational thoughts after reading this paper?
Big Data essentially means data sets that are either too large or growing too rapidly (or both) for organizations to manage via traditional means. The popularity of this term has been driven by the explosion in social (external) data, but by no means is the definition limited to this.
Kudos to @Robert for his definition. Totally agree. One thing I would add is that big data addresses the ability to process very large (terabyte+) data sets without first developing complex data warehouses and multi-dimensional models, and to accomplish this in unprecedented time.
Great definitions.
In my experience, coming from what used to be the weak stepchild of the data world: text, Big Data is best identified with the level of flexibility and integration required to create, organize and access all of the content associated with a particular universe of applications.
This was not always the case: business and numeric data was stored in RDBMS systems, text was stored in documents using "word processing" protocols, and the two were rarely integrated in a single, streamlined query and response transaction. A couple of decades ago, the RDBMS world tried (not very successfully) to include text (which, by the way, has accounted for more than 75% of all computerized data since the mid-70s) by selecting "interesting elements" in the text and copying them into relational tables for access in database queries. For its part, the text world tried a number of things to search and respond to queries directly from the text content (also with limited success.)
Then, in the late 90s, with the rise of XML for text, we finally began to see an information market that demanded integrated responses including everything that both database and text repositories knew about a query subject, and with that demand came the rise of "big data."
Now, really for the first time, the using audience wanted a seamless information world that spanned both structured and "unstructured" content (the epithet database people used to call text). This new world required major evolution in how content was designed, created or captured, managed and delivered, giving rise to entirely new areas of software development and placing significant new demands on the organizations responsible for information management.
Today, if you are still thinking of databases as structured and text as unstructured, you are probably not thinking at the "big data" level.
I found this video clip (with PDF download) to be the most informative about what big data is in layman's terms. Zettabyte is my new favourite term! :)
http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/inde...
One definition of Big Data, framed around the problems (use cases) that large enterprises face, refers to large, diverse, schema-based and schema-free (unstructured), highly distributed and complex data. The key focus is on diverse and complex. How do I do an analysis across data that can be SQL, NoSQL, YouTube, Twitter, legacy and much more? Scale, complexity and distribution are what Big Data generally refers to.