Data Analysis

In their book Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, Foster Provost and Tom Fawcett treat data science as a set of underlying concepts, which they explore in depth. Data analysis is an important part of data science, and one of the key concepts the authors define is the ability to think data-analytically. This involves “identifying appropriate data and consider[ing] appropriate methods” through which to analyze and use the data.

Modern businesses have access to massive amounts of data, both structured and unstructured, and simply determining which data are relevant and which are not constitutes a challenge. Once the relevant data have been identified, ways must be found to collect and store the raw data, process them into a form that can be analyzed, and then explore and analyze them using the most appropriate techniques, which must themselves be determined. Typically, a firm’s upper management will not have extensive expertise in the computer architectures used to store and manipulate the firm’s data or in the statistical techniques used to derive meaning from the data and convert them into information. Thus, data visualization has come to the fore. Infographics and visual presentations that make the data “tell a story” communicate information to upper management in meaningful ways while sparing them the details of the smaller steps required to obtain it.

More specifically, data analysis can be divided into different categories. Data exploration represents an early stage in the process, during which a researcher attempts to discern patterns, determine which parts of the data are of value, and determine what additional data may be needed to further the goal for which the data were collected.
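As a rough illustration of the exploration stage, the sketch below counts missing values and computes a mean and standard deviation for each column of a small data set. The records, the column names, and the `summarize` helper are all hypothetical and invented for this example; a real project would normally rely on a dedicated data analysis library.

```python
from statistics import mean, stdev

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "monthly_bill": 42.5, "support_calls": 1},
    {"age": 51, "monthly_bill": None, "support_calls": 0},
    {"age": 27, "monthly_bill": 61.0, "support_calls": 4},
    {"age": 45, "monthly_bill": 38.0, "support_calls": None},
]

def summarize(rows):
    """Return per-column missing counts and basic statistics."""
    summary = {}
    for col in rows[0]:
        # Keep only the values that are actually present in this column.
        present = [r[col] for r in rows if r[col] is not None]
        summary[col] = {
            "missing": len(rows) - len(present),
            "mean": mean(present),
            "stdev": stdev(present) if len(present) > 1 else 0.0,
        }
    return summary

summary = summarize(records)
for column, stats in summary.items():
    print(column, stats)
```

A summary of this kind helps show which columns are complete enough to be useful and where more data may need to be collected.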

Inferential statistics seeks to test hypotheses about the data and is a form of confirmatory analysis. Predictive analysis seeks to predict future outcomes from past data, as in forecasting; linear regression is one commonly used technique. Categorical analysis, which is often carried out with machine learning methods, is another form of prediction, in which the outcome is a category rather than a number. A frequently seen example concerns “churn,” in which mobile phone users switch carriers.
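To make the forecasting idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The quarterly sales figures are made up for illustration and happen to lie exactly on a line, and the `fit_line` helper is an assumption of this example rather than part of any particular library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical past data: sales observed in quarters 1 through 5.
quarters = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(quarters, sales)

# Forecast the next, not yet observed, quarter from the fitted line.
forecast_q6 = slope * 6 + intercept
print(slope, intercept, forecast_q6)
```

With real data the points would not fall exactly on the line, and the forecast would carry uncertainty that this sketch does not estimate.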

Identifying users who are more likely to switch when their contracts expire is an important concern of carriers, as the new user market is saturated. Methods can also be classified as supervised or unsupervised. Methods that have a specific target, such as identifying mobile phone users likely to churn, are examples of supervised learning, whereas clustering, which has no predetermined target but simply groups data based on the data’s own characteristics, is unsupervised learning. With the rise of social media such as Facebook and Twitter, textual analysis has become important. Text represents unstructured data, and its analysis concerns identifying patterns within text.
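The unsupervised case described above can be sketched with a small k-means clustering routine, one common clustering method. The usage figures, the starting centers, and the `kmeans_1d` helper are hypothetical; note that no target label such as "churned" is supplied, so the groups emerge only from the values themselves.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data from given starting centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly call minutes for six mobile phone users.
monthly_minutes = [50, 60, 55, 400, 420, 410]

centers, clusters = kmeans_1d(monthly_minutes, centers=[0, 500])
print(centers)
print(clusters)
```

A supervised method would instead be trained on records that already carry the outcome of interest, for example whether each user eventually switched carriers.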

Firms specializing in helping businesses satisfy their data collection, storage, and analysis needs have arisen. Some specialize in hardware, and others in software. Cloud services incorporate both hardware and software, with software as a service (SaaS), infrastructure as a service (IaaS), and information technology management as a service (ITMaaS) all falling under the general term cloud services. Companies that purchase these services from other companies gain major advantages: they do not have to make massive IT expenditures or keep up with rapidly changing technology themselves.

Among the major companies involved in data analysis are Hewlett-Packard (HP), IBM, EMC, Oracle, Microsoft, Google, SAP, SAS, and Amazon. Most of these offer computing platforms, which enable data storage as well as analytics. SAS is an exception, offering only software; it is among the most widely used statistical computing packages worldwide. Some companies, such as EMC, specialize in storage, whereas others, Teradata for instance, specialize in analytics.

Still others, such as Google, are more cloud based but do offer some data analytics capabilities. Amazon Web Services offers several platforms for Big Data analysis, including a Hadoop-based MapReduce service. Lesser-known companies include Splunk, MemSQL, Palantir Technologies, Trifacta, Datameer, Tamr, Neo Technology, and DataStax.