Big Data

The term Big Data captures the explosion of data that has become available in the last few years. The internet of things, in which objects such as RFID readers, mobile devices, cameras, and other types of machines form networks and generate data, has contributed greatly to this explosion. It has been estimated that 90% of all the data that has ever existed was created in just the last two years. In 2008, the world’s total data was believed to be 0.8 zettabytes (see Table 1 below). By 2020, it is estimated to reach 35 zettabytes. In the not so distant future, our data will be measured in brontobytes.

The explosion is a matter not just of volume but also of velocity and type. Velocity is the rate at which data changes or updates, whereas type refers to its structure, or rather its lack of structure. Unstructured data, most of which is text, does not have a predefined format. The rise of social media such as Facebook and Twitter has contributed to the rise in unstructured data, whose growth accounts for much of the growth in Big Data.

Table 1. Number Quantities

Unit         Power of 1000   Bytes

kilobyte     1000^1          1,000
megabyte     1000^2          1,000,000
gigabyte     1000^3          1,000,000,000
terabyte     1000^4          1,000,000,000,000
petabyte     1000^5          1,000,000,000,000,000
exabyte      1000^6          1,000,000,000,000,000,000
zettabyte    1000^7          1,000,000,000,000,000,000,000
yottabyte    1000^8          1,000,000,000,000,000,000,000,000
brontobyte   1000^9          1,000,000,000,000,000,000,000,000,000
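
The arithmetic behind Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names bytes_in and growth_factor are hypothetical, and the 0.8 and 35 zettabyte figures are the estimates quoted in the text, not measured values.

```python
# Unit names from Table 1, ordered from smallest to largest.
# Each unit is 1000 times the previous one (decimal, not binary, units).
UNITS = [
    "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte",
    "exabyte", "zettabyte", "yottabyte", "brontobyte",
]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (hypothetical helper)."""
    exponent = UNITS.index(unit) + 1  # kilobyte is 1000^1, megabyte 1000^2, ...
    return 1000 ** exponent


def growth_factor(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start` (hypothetical helper)."""
    return end / start


# One zettabyte expressed in bytes (1 followed by 21 zeros).
print(bytes_in("zettabyte"))

# Growth implied by the estimates in the text: 0.8 ZB (2008) to 35 ZB (2020).
print(growth_factor(0.8, 35))
```

Under these estimates, the world's data would grow by a factor of roughly 44 between 2008 and 2020.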

Many companies (Oracle, Amazon Web Services, IBM, HP, and many others) have entered the market to meet the needs of customers requiring data management services. As a result, Gartner’s hype cycle for emerging technologies (shown in Figure 1) places Big Data past its peak of inflated expectations but still 5-10 years from plateauing. The internet of things now occupies the peak in the Gartner cycle.


Figure 1. Gartner Hype Cycle for Emerging Technologies

Big Data has raised many challenges for businesses. Data, a great deal of which is unstructured and therefore unruly, must be collected, stored, cleaned up, curated, and analyzed. The resulting information must then be reported to decision makers in an understandable way, which has driven the growing importance of data visualization, now a discipline in its own right.

To deal with Big Data, an open source software framework known as Hadoop was created. Hadoop is a distributed system. It is made up of many nodes, and so provides fault tolerance, redundancy, and flexibility. In its original incarnation, it consisted of two primary components: the Hadoop Distributed File System (HDFS), which stores data across the nodes, and MapReduce, which processes it. MapReduce is built around two functions. The mapping function filters and sorts data, which is subdivided into smaller packets and sent to component nodes for processing. After completing their work, these nodes send their results to the reducing function, which integrates and summarizes them. On top of MapReduce sat various applications that allowed the data to be processed and analyzed.
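
The map and reduce steps described above can be illustrated with a word count, the customary introductory MapReduce example. The sketch below is a single-process Python simulation, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and on a real cluster each input split would be handled by a different node.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group the values emitted for each key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key into a single total."""
    return key, sum(values)


def word_count(splits):
    """Run map, shuffle, and reduce in sequence over a list of text splits."""
    mapped = []
    for split in splits:  # on a cluster, each split would go to a separate node
        mapped.extend(map_phase(split))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data is big", "data is everywhere"]
print(word_count(splits))
# prints {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to map_phase depends only on its own split, the map work can be spread across many nodes, and only the grouped intermediate pairs need to be brought together for the reduce step.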

Hadoop’s second major version retains HDFS, sometimes called HDFS2, and adds YARN, a cluster resource management system that takes over the scheduling and resource management previously built into MapReduce. MapReduce now runs as one application on top of YARN, alongside other processing engines, which greatly increases Hadoop’s flexibility. In addition, more applications have been created to sit atop the Hadoop stack to aid in the processing, movement, analysis, and manipulation of data. Some of these applications are Pig, which allows MapReduce programs to be written in a high-level scripting language; Mahout, a machine learning library; R connectors allowing statistical analysis with the language R; Hive, which allows SQL-like queries to be submitted; and HBase, a distributed, non-relational (column-oriented) database.

Apache Hadoop is open source, and many companies have adapted it and market their services to aid client companies with their data management and data analysis needs. In particular, Oracle employs Hadoop and has its own set of applications that sit atop the Hadoop stack to aid in data processing. SAP is able to integrate with Hadoop to run analytics, and Amazon Web Services has its own Elastic MapReduce service based on Hadoop.