Big Data Visualization

As Big Data has grown in volume and complexity, communicating the information mined from it has become more important. Data visualization is now conceptualized as telling a narrative, and this concept underlies the work of authors who have addressed the topic. (See, for instance, Cole Nussbaumer Knaflic’s Storytelling with Data: A Data Visualization Guide for Business Professionals.) Edward R. Tufte’s book The Visual Display of Quantitative Information, now in its second edition (2001), pioneered the display of quantitative information and is still a best seller in mathematical analysis. It has served as an inspiration for other authors in the field.

Data visualization is also taught in various massive open online courses (MOOCs). Coursera offers data visualization courses as part of its data mining and data analysis specializations, and Udacity includes the topic in its Data Analyst nanodegree.

Among the tools in this area are packages for the R statistical programming language. One is ggplot2, which produced the graph shown in Figure 1. This versatile package allows R to produce sophisticated graphs and visualizations. Figure 1 revisits the iris dataset discussed in the section on regression and shows sepal length plotted against petal length for three species of iris.


Figure 1. A Scatterplot Produced in R’s ggplot2 Package

Another R package, lattice, can also produce polished graphs. Revisiting the automobile problem we explored under regression analysis, we used lattice to produce the following graphs. Figure 2 uses the sample of automobile data to estimate a probability density function for miles per gallon (mpg). As can be seen, the central tendency of the function appears to be about 18 mpg, and the distribution is slightly skewed toward the higher end.


Figure 2. Density Plot Produced in lattice
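
To make the idea behind a density plot concrete, the following is a minimal sketch of a Gaussian kernel density estimate. It is written in Python with the standard library only, purely for illustration; it is not the lattice code behind Figure 2. The mpg values, the bandwidth of 2.5, and the names gaussian_kde, grid, and peak_mpg are all assumptions made for this example.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian "bump" centred on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical mpg values, for illustration only (not the chapter's data).
mpg = [14.5, 15.8, 16.4, 17.3, 18.1, 18.7, 19.2, 21.0, 22.8, 24.4, 27.3, 30.4]
density = gaussian_kde(mpg, bandwidth=2.5)  # bandwidth chosen by hand

step = 0.5
grid = [step * i for i in range(91)]  # evaluation points from 0.0 to 45.0 mpg
values = [density(x) for x in grid]

# Location of the highest point of the estimated curve.
peak_mpg = grid[values.index(max(values))]

# A density should integrate to roughly 1; check with a simple Riemann sum.
area = sum(values) * step

print(f"estimated peak near {peak_mpg} mpg, area under curve = {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual observations, and a larger one gives a smoother curve. Plotting packages usually choose the bandwidth automatically.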

Figure 3 looks at mpg by the number of cylinders and the number of gears an automobile has. This chart employs a box and whisker plot, which shows the minimum, first quartile, median, third quartile, and maximum. As would be expected, within each gear number, automobiles with more cylinders get lower mpg. Higher mpg values are obtained with more gears, and the 4-cylinder, 4-gear combination displays the most widely dispersed mpg.


Figure 3. Box and Whisker Plot Produced in lattice
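
The five values that a box and whisker plot summarizes can be computed directly. The sketch below uses Python's standard statistics module on hypothetical mpg values for a single group. The helper name five_number_summary and the data are assumptions for illustration, and the quartile convention used here ("inclusive") may differ slightly from the one a given plotting package applies.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points between the quarters.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical mpg values for one cylinder/gear group (illustration only).
mpg = [21.4, 21.5, 22.8, 24.4, 26.0, 27.3, 30.4, 32.4, 33.9]

minimum, q1, median, q3, maximum = five_number_summary(mpg)
iqr = q3 - q1  # interquartile range: the height of the "box"

print(f"min={minimum} Q1={q1} median={median} Q3={q3} max={maximum} IQR={iqr:.1f}")
```

The box spans the first to the third quartile, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.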

Figure 4 displays mpg density by number of cylinders. The 4-cylinder distribution exhibits much greater variability than do the 6- and 8-cylinder distributions. All display a slight amount of skew, and, again as would be expected, the greater the number of cylinders, the lower the mpg. Figure 5 revisits the estimated mpg density functions by number of cylinders, but it conveys the information much more effectively than Figure 4 does because its density functions are drawn on the same scale.


Figure 4. Density Distributions by Number of Cylinders Using lattice


Figure 5. Density Plot in lattice by Number of Cylinders
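
The effect of a common scale can be sketched numerically. The example below, again in standard-library Python with hypothetical mpg values per cylinder count, compares the axis limits each group would get on its own with one shared set of limits, and also reports each group's standard deviation as a measure of variability. The names groups, axis_limits, and the 5 percent padding are assumptions for illustration only.

```python
import statistics

# Hypothetical mpg values keyed by number of cylinders (illustration only).
groups = {
    4: [21.4, 22.8, 24.4, 26.0, 27.3, 30.4, 33.9],
    6: [17.8, 18.1, 19.2, 19.7, 21.0, 21.4],
    8: [10.4, 13.3, 14.7, 15.2, 15.8, 16.4, 18.7, 19.2],
}


def axis_limits(values, pad=0.05):
    """Return (low, high) axis limits with a little padding on each side."""
    low, high = min(values), max(values)
    margin = pad * (high - low)
    return low - margin, high + margin


# Sample standard deviation per group: larger means more dispersed mpg.
spread = {cyl: statistics.stdev(v) for cyl, v in groups.items()}

# Separate scales: each panel is stretched to fit only its own data.
separate_limits = {cyl: axis_limits(v) for cyl, v in groups.items()}

# Shared scale: one set of limits computed from all observations together.
all_values = [x for v in groups.values() for x in v]
shared_limits = axis_limits(all_values)

for cyl in sorted(groups):
    low, high = separate_limits[cyl]
    print(f"{cyl} cylinders: sd={spread[cyl]:.2f}, own axis {low:.1f} to {high:.1f}")
print(f"shared axis {shared_limits[0]:.1f} to {shared_limits[1]:.1f}")
```

With separate limits, every panel fills its own width, so a narrow distribution can look as wide as a dispersed one. With shared limits, differences in location and spread between the groups are visible at a glance.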

So far, we’ve looked at graphics produced in the statistical language R. However, other options exist. Microsoft Excel offers a full range of scatterplots, bar charts, pie charts, histograms, and three-dimensional graphs. Although it is not as flexible as R, it is nonetheless fully functional. One software package, Tableau, specializes in data visualization and has won high accolades within the field. It uses design principles and data analysis to allow creation of not only graphs and charts but also dashboards. Gartner’s Magic Quadrant for Business Intelligence has named Tableau a leader in data analysis for three years in succession.