In principal components analysis (PCA), the original variables in a sample are transformed into a new, uncorrelated set of variables formed as linear combinations of the originals. PCA can act as an exploratory technique, allowing underlying factors to be identified for factor analysis, and it can also make data manipulation easier. Because each of the new variables, known as components, can reflect more than one of the original variables, using them in further analyses can reduce the dimensionality of the problem. Fitting a regression to components may be much simpler than fitting it to the original variables because the components do not suffer from multicollinearity, whereas the original variables may. Moreover, since the components are linear combinations of the original variables, they can be used to test the normality of the original variables: if the principal components are not normally distributed, then the original variables are also not normally distributed.
As stated above, a component is a linear combination of the original variables. The components are denoted C1, C2, etc., as shown below, and the number of components generated will equal the number of variables, p.
C1 = α11x1 + α12x2 + ... + α1pxp
C2 = α21x1 + α22x2 + ... + α2pxp
⋮
Cp = αp1x1 + αp2x2 + ... + αppxp
Stated in terms of matrices and vectors, the equations above can be written as

C = Ax

where C = [C1, C2, ..., Cp]T, x = [x1, x2, ..., xp]T, and A is the p x p matrix of coefficients whose (i, j)th element is αij.
Note that all p of the components will most likely not be retained. Each variable xj from the sample will have a sample variance S²j, and the variance of component Ci, i = 1, ..., p, is therefore as follows:

var(Ci) = Σj Σk αij αik Sjk

where Sjk is the sample covariance between xj and xk (with Sjj = S²j).
Performing a PCA involves calculating the eigenvalues and their associated eigenvectors. For the sake of simplicity, we’ll employ only two random variables in the following discussion. In SAS, the PRINCOMP procedure (proc PRINCOMP) is used to calculate the principal components. A similarly named function, princomp(), is available in R, which also offers prcomp(). Components can be calculated from either the covariance matrix or the correlation matrix, although the correlation matrix is preferred when the variable scales differ (inches vs. pounds, for instance). Below is a step-by-step explanation of obtaining the components for two variables.
1. Center the observations by subtracting the sample mean from each variable value.
x'jn = xjn - x̄j
where
x'jn = the centered value of the jth variable, j = 1, ..., p, for observation n, n = 1, ..., N,
xjn = the nth observation of the jth variable,
x̄j = the sample mean of the jth variable.
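As a minimal sketch of this centering step (the data here are hypothetical, not from the text):

```python
# Step 1: center a variable by subtracting its sample mean.
x1 = [10.0, 12.0, 14.0, 16.0]           # hypothetical observations of one variable
x1_bar = sum(x1) / len(x1)              # sample mean
x1_centered = [x - x1_bar for x in x1]  # centered values x'1n
```

After centering, the values sum to zero, which simplifies the correlation computation in the next step.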
2. Find the correlation matrix, R, for the two centered variables, x'1 and x'2:

R = | 1  r |
    | r  1 |

where r = the correlation between x'1 and x'2.
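The correlation r can be computed directly from the data; a small Python sketch with hypothetical values:

```python
import math

def correlation(x, y):
    """Sample (Pearson) correlation between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x1 = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
x2 = [2.0, 1.0, 4.0, 3.0]
r = correlation(x1, x2)
R = [[1.0, r], [r, 1.0]]   # the 2 x 2 correlation matrix
```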
3. Solve the following for λ, the eigenvalues, which also represent the component variances:

det(R - λI) = 0

where I is the 2 x 2 identity matrix. Solving this yields λ1 = 1 + r and λ2 = 1 - r. The two λs are the eigenvalues; there will be as many eigenvalues as there are variables.
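These roots can be checked numerically; a quick sketch with a hypothetical value of r:

```python
# For the 2 x 2 correlation matrix R = [[1, r], [r, 1]],
# det(R - lam*I) = (1 - lam)**2 - r**2, so the roots are 1 + r and 1 - r.
r = 0.6  # hypothetical correlation
lam1, lam2 = 1 + r, 1 - r

def char_poly(lam, r):
    """det(R - lam*I) for R = [[1, r], [r, 1]]."""
    return (1 - lam) ** 2 - r ** 2

# Both roots make the determinant vanish, and the eigenvalues sum to
# the trace of R, i.e., the number of variables (2).
```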
4. Employ the eigenvalues to find the corresponding eigenvectors. We do so by solving the following:

Rα1 = λ1α1

where α1 = [α11, α21]T. Solving the resulting equations yields α11 = α21. Requiring that α²11 + α²21 = 1 yields α11 = 1/√2 and α21 = 1/√2. Solving Rα2 = λ2α2, where α2 = [α12, α22]T, yields α12 = -(1/√2) and α22 = 1/√2. Thus, the eigenvalues and their corresponding eigenvectors are as follows:
λ1 = 1 + r, the first eigenvector = [1/√2, 1/√2]T
λ2 = 1 - r, the second eigenvector = [-1/√2, 1/√2]T
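These eigenpairs can be verified by direct multiplication; a quick check in Python with a hypothetical r:

```python
import math

r = 0.6                      # hypothetical correlation
R = [[1.0, r], [r, 1.0]]
s = 1 / math.sqrt(2)
v1 = [s, s]                  # eigenvector for lam1 = 1 + r
v2 = [-s, s]                 # eigenvector for lam2 = 1 - r

def matvec(M, v):
    """Multiply a 2 x 2 matrix by a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

# R v1 should equal (1 + r) v1, and R v2 should equal (1 - r) v2.
Rv1, Rv2 = matvec(R, v1), matvec(R, v2)
```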
The principal components for this simple example are then given by the following:

C1 = (1/√2)x'1 + (1/√2)x'2
C2 = -(1/√2)x'1 + (1/√2)x'2
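Applied to centered data, the component scores are just these weighted sums; a sketch with hypothetical centered observations:

```python
import math

s = 1 / math.sqrt(2)
# Hypothetical centered observations (x'1n, x'2n).
centered = [(-3.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (3.0, 1.0)]
c1 = [s * (a + b) for a, b in centered]  # scores on C1 = (x'1 + x'2)/sqrt(2)
c2 = [s * (b - a) for a, b in centered]  # scores on C2 = (x'2 - x'1)/sqrt(2)
```

Because the input variables are centered, each column of component scores also has mean zero.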
5. Lastly, the variance of component i is as follows:

var(Ci) = α²i1S²1 + α²i2S²2 + 2αi1αi2rS1S2
The sum of the component variances equals the sum of the variances of the original variables. Thus, if the sample variances are S²1 = 109.63 and S²2 = 55.40, the total variance in the original variables equals 165.03. Assume that var(C1) = 147.44 and var(C2) = 17.59. The sum of these is also 165.03.
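This bookkeeping is easy to check in code, using the numbers above:

```python
# Sample variances of the two original variables (from the text).
s1_sq, s2_sq = 109.63, 55.40
# Component variances (from the text).
var_c1, var_c2 = 147.44, 17.59

total_original = s1_sq + s2_sq        # total variance in the variables
total_components = var_c1 + var_c2    # total variance in the components
# PCA redistributes variance across components; it does not create or destroy it.
```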
6. Not all components will be selected. Software typically reports the component variances, the proportion of total variance each explains, and the cumulative proportion of total variance, and various thresholds for the cumulative proportion have been proposed (e.g., 70%, 80%, 90%). A scree plot of the eigenvalues, so named because of its resemblance to the scree seen at the bottom of a cliff, also provides a rule of thumb: when the plot begins to level off toward the horizontal, the number of components at that point should be selected.
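The cumulative-proportion rule is straightforward to implement; a sketch with hypothetical eigenvalues and an assumed 80% threshold:

```python
eigenvalues = [4.2, 1.9, 0.9, 0.5, 0.3, 0.2]  # hypothetical, sorted descending
total = sum(eigenvalues)

cumulative = []  # cumulative proportion of total variance
running = 0.0
for lam in eigenvalues:
    running += lam
    cumulative.append(running / total)

# Keep the smallest number of components whose cumulative
# proportion of total variance reaches the threshold.
threshold = 0.80
k = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)
```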
Please see the next section for an example using R.