Statistical data analysis is made possible because variability exists in any collection of items, measurements, or phenomena. A company’s sales, both quantities and amounts, differ from month to month, as do its expenditures. The number of service requests arriving at a website differs from hour to hour, as do the times between their arrivals.
In almost all such cases, the measure of interest follows a probability distribution, such as the normal. The exponential distribution governs interarrival times, and its counterpart, the Poisson, describes the number of arrivals per unit of time. Analysis of stochastic processes is based on these latter two distributions. Moreover, probability distributions have parameters, of which the normal’s mean and standard deviation are examples.
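The relationship between these two distributions can be checked by simulation. The following is a minimal sketch, with a made-up arrival rate of 4 requests per hour: interarrival times drawn from an exponential distribution yield per-hour counts whose mean and variance both approximate the rate, which is the signature of a Poisson distribution.

```python
import random
from statistics import mean, pvariance

# Sketch only: the arrival rate of 4 requests per hour is invented.
random.seed(42)
lam = 4.0          # hypothetical arrival rate, requests per hour
n = 200_000        # number of simulated interarrival times

# Exponential interarrival times with mean 1/lam.
inter = [random.expovariate(lam) for _ in range(n)]

# Turn interarrival times into absolute arrival times.
arrivals = []
t = 0.0
for x in inter:
    t += x
    arrivals.append(t)

# Count the arrivals that fall in each full one-hour window.
hours = int(arrivals[-1])
counts = [0] * hours
for a in arrivals:
    h = int(a)
    if h < hours:
        counts[h] += 1

print(f"mean interarrival time: {mean(inter):.3f}")           # close to 1/lam = 0.25
print(f"mean arrivals per hour: {mean(counts):.2f}")          # close to lam = 4
print(f"variance of hourly counts: {pvariance(counts):.2f}")  # also close to lam
```

That the variance of the hourly counts matches their mean is what distinguishes the Poisson from other count distributions.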
With probability distributions, inferences can be made from the characteristics of a data sample to the entire population from which the sample was drawn. In hypothesis testing, a researcher poses a hypothesis concerning the data, collects a sample, and calculates a test statistic from it. Because a probability distribution can be associated with the test statistic, the probability of its observed value can be calculated under the assumption that the hypothesis is true. The test statistic is based on determining how much of the variability observed in the sample is accounted for by the model assumed to govern the data.
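As a concrete sketch of computing a test statistic, the one-sample t statistic measures how far a sample mean lies from a hypothesized population mean, in standard-error units. The sample values and the hypothesized mean of 100 below are invented for illustration:

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative sketch: a one-sample t test of H0: population mean = 100.
# The sample values are made up for the example.
sample = [102.1, 98.4, 105.0, 101.3, 99.8, 103.7, 100.9, 104.2]
mu0 = 100.0

n = len(sample)
xbar = mean(sample)
s = stdev(sample)                    # sample standard deviation
t = (xbar - mu0) / (s / sqrt(n))     # distance of xbar from mu0 in standard errors

print(f"sample mean = {xbar:.3f}, t = {t:.3f}, df = {n - 1}")
```

Under the null hypothesis this statistic follows a t distribution with n − 1 degrees of freedom, so the probability of a value at least this extreme, the p-value, can be read from that distribution.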
For instance, in the regression model we looked at earlier, the statistical software generated the following table:
Table 1. Results of a Regression Analysis
|          | Estimate | Std. Error | t value | Pr(>\|t\|) |
|----------|----------|------------|---------|------------|

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The test statistics here appear in the column titled “t value,” whose values are associated with the regression model’s parameters. The related null hypotheses state that these parameters are 0, that is, that the regression model does not provide an effective representation of the data. To review, this model is as follows:
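The model equation did not survive in this copy; given that mpg is the dependent variable and Weight and Displacement are the predictors, as the discussion of Table 1 assumes, a two-predictor linear regression has the general form:

```latex
\text{mpg} = \beta_0 + \beta_1\,\text{Weight} + \beta_2\,\text{Displacement} + \varepsilon
```

where $\varepsilon$ is a normally distributed error term and $\beta_0$, $\beta_1$, $\beta_2$ are the parameters whose hypothesis tests Table 1 reports.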
Table 1 gives the results of hypothesis tests for each of the regression model’s parameters: the intercept and the coefficients of the independent variables. (Typically, the intercept is of no consequence, as it does not speak to the relationship between an independent variable and the dependent variable.) The Pr(>|t|) column gives p-values: probabilities of observing values at least as extreme as the corresponding test statistics in the t value column, given that the corresponding null hypothesis, that the parameter equals 0, is true. The last row gives levels of significance, which a researcher can set prior to performing the research. For instance, a p-value less than a significance level of .05 indicates rejection of the null hypothesis; a p-value greater than .05 indicates failure to reject it. At the .05 level, the variable Weight is taken to have a non-zero coefficient in the regression model, meaning that it has a predictive relationship with the dependent variable mpg, whereas Displacement, with a p-value of .06362, cannot be accepted as a good predictor of mpg. At a significance level of .001, both independent variables fail to be accepted. Thus, the level of significance chosen reflects how sure the researcher wants to be of the results. Statistical data analysis employs several types of test statistics, the t, F, and others among them. In all cases, they represent how much of the variability observed in the sample is accounted for by the assumed model, relative to the variability left unexplained.
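The reject/fail-to-reject decision described above is simply a comparison of the p-value against the chosen significance level. The following sketch uses the Displacement p-value quoted above (.06362); the helper function `decide` is our own name, not part of any library:

```python
# Sketch of the decision rule; the function name `decide` is invented here.
def decide(p_value: float, alpha: float) -> str:
    """Reject H0 (the parameter equals 0) when the p-value falls below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

p_displacement = 0.06362   # p-value for Displacement, as quoted in the text

print(decide(p_displacement, 0.05))   # fail to reject H0
print(decide(p_displacement, 0.10))   # reject H0
```

The same p-value leads to opposite conclusions at the .05 and .10 levels, which is why the researcher fixes the significance level before looking at the results.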
The inferential statistics discussed above are known as classical statistics. A differing perspective, once considered somewhat heretical, has now gained widespread acceptance: Bayesian analysis, or Bayesian statistics. It is based on Bayes’ theorem:
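The theorem, stated here for a hypothesis H and observed data D, is the standard identity relating conditional probabilities:

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```

Here $P(H)$ is the prior, $P(D \mid H)$ the likelihood, $P(D)$ the evidence, and $P(H \mid D)$ the posterior: the updated belief about H after seeing D.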
In contrast to classical statistics, Bayesian statistics allows external knowledge, known as prior information, to be incorporated into an estimate. Moreover, as new information becomes available, this branch of statistics allows the estimate to be updated to reflect the new knowledge. Bayesian analysis has entered the mainstream, with hierarchical linear regression as an example.
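A minimal sketch of this updating process uses the conjugate Beta-Binomial pair; the prior parameters and the trial data below are invented for illustration. A prior belief about a success rate is combined with new trial results to produce an updated, posterior estimate:

```python
# Sketch of Bayesian updating via the conjugate Beta-Binomial pair.
# Prior parameters and trial data are invented for illustration.
alpha, beta = 2.0, 2.0        # weak prior belief about a success rate, centered at 0.5

successes, trials = 7, 10     # new information: 7 successes in 10 trials

# Conjugate update: Beta(alpha, beta) prior + binomial data
# -> Beta(alpha + successes, beta + failures) posterior.
alpha_post = alpha + successes
beta_post = beta + (trials - successes)

prior_mean = alpha / (alpha + beta)
post_mean = alpha_post / (alpha_post + beta_post)

print(f"prior mean = {prior_mean:.3f}")       # 0.500
print(f"posterior mean = {post_mean:.3f}")    # 0.643
```

If more data arrives later, the posterior simply serves as the next prior, which is exactly the incremental updating behavior described above.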
Both classical and Bayesian analysis, and mathematical modeling in general, are finding new and exciting applications in the era of Big Data. Computer architectures such as Hadoop are finding ways of bringing analytic tools to the data rather than moving the data into the analytic software. These analytic tools are part of the Hadoop technology stack.