Data Science

In their book Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, Foster Provost and Tom Fawcett define data science as “a set of fundamental principles that guide the extraction of knowledge from data.” Provost and Fawcett identify nine tasks that data science addresses, and these form one taxonomy with which to classify data science techniques.

Regression, which the authors describe as “value estimation,” enables a researcher to predict future values of a variable of interest (the dependent variable) based on the values of various other measures (independent variables). The most commonly employed form is linear regression, which uses the least squares algorithm to minimize the sum of squared errors. Regression may involve any number of independent variables.
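As an illustrative sketch, assuming Python with NumPy is available, a single-variable least squares fit might look like the following (the data are invented for the example):

```python
import numpy as np

# Hypothetical data: e.g., advertising spend (independent variable)
# and sales (dependent variable).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column; np.linalg.lstsq finds the
# coefficients that minimize the sum of squared errors ||Xb - y||^2.
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"fitted model: y = {intercept:.2f} + {slope:.2f}x")
print("predicted value at x = 6:", intercept + slope * 6)
```

The same design-matrix approach extends directly to multiple independent variables by adding further columns to X.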

Closely related to regression is similarity matching, which attempts to identify individuals having similar characteristics. Techniques of this type include nearest neighbor models, where “nearness” can be defined using Euclidean distance between individuals, between centroid measures (e.g., medians, means, and modes), or between the distributions or densities of subpopulations. Scores derived by weighting significant individual measures can also be used to match individuals. Clustering techniques are based on similarity matching. An exploratory technique, cluster analysis attempts to identify subgroups within a sample population. Several clustering techniques exist, including agglomerative and divisive hierarchical clustering and k-means.
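A minimal k-means sketch, assuming scikit-learn is available (the two-dimensional points are synthetic, chosen to form two visible groups):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],   # subgroup near (1, 1)
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])  # subgroup near (5, 5)

# k-means assigns each point to the nearest of k centroids (by Euclidean
# distance) and iteratively recomputes each centroid as its cluster's mean.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", model.labels_)
print("centroids:\n", model.cluster_centers_)
```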

Classification techniques attempt to predict the class to which an individual belongs. A specific technique of this type is logistic regression, which estimates the probability that an individual belongs to a particular class.

Although logistic regression is most commonly used where there are two classes (e.g., customers who renew mobile phone contracts versus those who do not), multinomial logistic regression, a variation, handles situations where more than two response options exist.
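A hedged sketch of binary logistic regression with scikit-learn, echoing the contract-renewal example; the features and data are fabricated for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [months as a customer, support calls last year]
X = np.array([[24, 1], [36, 0], [2, 7], [5, 5], [48, 2], [3, 9]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = renewed contract, 0 = did not renew

clf = LogisticRegression().fit(X, y)

# The fitted model yields both a predicted class and a class probability.
new_customer = np.array([[12, 3]])
print("predicted class:", clf.predict(new_customer)[0])
print("estimated P(renew):", clf.predict_proba(new_customer)[0, 1])
```

With labels taking more than two distinct values, scikit-learn extends the same estimator to the multiclass setting.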

Co-occurrence grouping, also called frequent itemset mining, market basket analysis, and association rule discovery, comprises still another set of techniques. An example is Amazon showing customers the books that other customers purchased after viewing the one currently being viewed. Techniques such as this are employed in recommender systems. Profiling is similar in that it attempts to identify the most commonly observed behavior of a group of individuals.
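The co-occurrence computation can be as simple as counting how often items appear together in the same basket. A toy sketch in plain Python (the baskets are invented):

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_b", "book_c"},
    {"book_a", "book_b"},
]

# Count every pair of items appearing in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Support" is the fraction of baskets containing the pair; frequent
# pairs feed "customers who bought X also bought Y" suggestions.
for pair, count in pair_counts.most_common(3):
    print(pair, f"support = {count / len(baskets):.2f}")
```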

Link prediction, commonly applied to data represented as networks, attempts to predict connections between entities, as for instance when Amazon suggests books to a customer based on books that customer has previously viewed. Techniques of this type also underlie recommender systems.
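A toy illustration of link prediction, scoring unlinked pairs in a small invented network by their number of common neighbors (one simple heuristic among many):

```python
from itertools import combinations

edges = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")}
nodes = {n for edge in edges for n in edge}

# Build each node's neighbor set from the edge list.
neighbors = {n: {m for edge in edges if n in edge for m in edge if m != n}
             for n in nodes}

# Score each currently unconnected pair; a high score suggests a link
# (e.g., a recommendation) is plausible.
for u, v in combinations(sorted(nodes), 2):
    if (u, v) not in edges and (v, u) not in edges:
        print(f"{u}-{v}: {len(neighbors[u] & neighbors[v])} common neighbors")
```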

A task motivating many data science techniques is data reduction. Principal components analysis (PCA) attempts to form a small set of uncorrelated linear combinations of the original variables. These combinations can then function as independent variables in other techniques, such as regression. Employing components rather than the original variables eliminates multicollinearity and reduces the number of variables required for the analysis. Factor analysis is a related technique in which underlying, latent variables are identified; these can likewise serve as input variables in further analyses.
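A brief PCA sketch with scikit-learn; the three columns are synthetic, with two of them deliberately near-collinear so the reduction is visible:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)
data = np.column_stack([x1, x2, x3])

# Two uncorrelated linear combinations capture most of the variance.
pca = PCA(n_components=2)
components = pca.fit_transform(data)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced data shape:", components.shape)  # (100, 2)
```

The columns of components could then replace the original, multicollinear variables in a subsequent regression.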

The final task listed by Provost and Fawcett is causal modeling, wherein a researcher attempts to identify causal relationships between variables. Although regression can indicate such relationships, care should be exercised in interpreting its results, as a latent variable omitted from the model may be the underlying cause of both the predictor and the predicted values. Controlled experiments are particularly useful in identifying causal relationships.
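The latent-variable caveat is easy to demonstrate by simulation. In the synthetic sketch below, a variable z drives both x and y, so the two correlate strongly despite neither causing the other:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=10_000)      # latent common cause (unobserved in practice)
x = z + rng.normal(size=10_000)  # "predictor"
y = z + rng.normal(size=10_000)  # "outcome"

# x and y correlate strongly although neither causes the other,
# which is why a regression of y on x cannot establish causation.
print("corr(x, y):", np.corrcoef(x, y)[0, 1])  # roughly 0.5
```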

Aside from tasks, data science techniques can be further classified according to the kind of data involved and the kind of technique applied. Data, for instance, may be structured or unstructured, where the latter denotes data lacking a predefined form. Whereas an annual income, a grade point average, or a pay grade has a predefined form, text collected from a Facebook or Twitter posting does not. Unstructured data typically consists of text, but it need not.

Similarly, machine learning techniques such as logistic regression and other scoring models can be classified as supervised or unsupervised. Logistic regression is a supervised learning technique because the solution, that is, the status or group with which each past individual is associated, is known. From the association between particular values of the independent variables and each individual’s ultimate status or group, the algorithm can thereby “learn” how to predict the status or group of future individuals. In unsupervised learning, by contrast, the algorithm has no response variable to learn from and must base its outcome solely on the characteristics of the input variables. Unsupervised techniques include cluster analysis, principal components analysis, and factor analysis.
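The distinction shows up directly in how models are fit; a compact contrast, assuming scikit-learn (the data are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [8.2, 8.8]])
y = np.array([0, 0, 1, 1])  # known outcomes, available only when supervised

supervised = LogisticRegression().fit(X, y)  # learns from labeled outcomes
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels

print("supervised prediction:  ", supervised.predict([[7.9, 9.1]])[0])
print("unsupervised assignment:", unsupervised.predict([[7.9, 9.1]])[0])
```

Note that the unsupervised model’s cluster indices carry no inherent meaning; they merely partition the inputs by similarity.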

With the ever-increasing ability to collect, store, curate, and analyze both structured and unstructured data, including that generated by the internet of things, the field of data science is expanding and will continue to expand dramatically. Moreover, increases in computational speed have made previously impractical techniques feasible; advances in genomic research, for example, are attributable to increases in computing capability. Data and the ability to mine it for usable information can now be viewed as important strategic assets, making data science an integral part of a firm’s operations.