In principal components analysis (PCA), the original variables in a sample are transformed into a new set of uncorrelated variables formed as linear combinations of the originals. PCA can act as an exploratory technique, allowing underlying factors to be identified for factor analysis, and it can also make data manipulation easier. Because each of the new variables, known as components, can reflect more than one of the original variables, using a subset of them in further analyses can reduce the dimensionality of the problem. Fitting a regression to components may be much simpler than fitting it to the original variables, because the components do not suffer from multicollinearity whereas the original variables may. Moreover, since components are linear combinations of the original variable set, they can be used to test normality of the original variables: if the principal components are not normally distributed, then the original variables are also not normally distributed.
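The claim that components are uncorrelated, and therefore free of multicollinearity, is easy to check numerically. The following is a minimal sketch in Python with numpy (the data and variable names are illustrative, and numpy is assumed to be available):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated predictors: a classic multicollinearity setup.
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Eigendecomposition of the sample covariance matrix gives the loadings.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Project the centered data onto the eigenvectors to get the components.
C = (X - X.mean(axis=0)) @ eigvecs

# The original variables are highly correlated; the components are not.
print(np.corrcoef(x1, x2)[0, 1])            # close to 1
print(np.corrcoef(C[:, 0], C[:, 1])[0, 1])  # close to 0
```

The sample correlation between the components is zero up to floating-point error, which is why a regression fitted to components sidesteps the multicollinearity in the original predictors.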

As stated above, a component is a linear combination of the original variables. The components are *C*_{1}, *C*_{2}, etc., as shown below, and the number of components generated will equal the number of variables, *p* in this case.

*C*_{1} = α_{11}*x*_{1} + α_{12}*x*_{2} + ... + α_{1p}*x*_{p}

*C*_{2} = α_{21}*x*_{1} + α_{22}*x*_{2} + ... + α_{2p}*x*_{p}

.
.
.

*C*_{p} = α_{p1}*x*_{1} + α_{p2}*x*_{2} + ... + α_{pp}*x*_{p}

Stated in terms of matrices and vectors, the equations above become **C** = **Ax**, where **C** = [*C*_{1}, ..., *C*_{p}]^{T}, **x** = [*x*_{1}, ..., *x*_{p}]^{T}, and **A** is the *p* × *p* matrix whose (*i*, *j*) entry is α_{ij}.

Note that not all *p* of the components will necessarily be retained. Each variable *x*_{j} from the sample will have a sample variance *S*^{2}_{j}, and the variance of component *C*_{i}, *i* = 1, . . ., *p*, is therefore as follows:

var(*C*_{i}) = Σ_{j=1}^{p} Σ_{k=1}^{p} α_{ij}α_{ik}*S*_{jk}

where *S*_{jk} is the sample covariance between *x*_{j} and *x*_{k}, with *S*_{jj} = *S*^{2}_{j}.

Performing a PCA involves calculating the eigenvalues and associated eigenvectors. For the sake of simplicity, we’ll employ only two random variables in the following discussion. In SAS, the *PRINCOMP* procedure is used to calculate the principal components. A similarly named function, *princomp()*, is used in R, which also offers *prcomp()*. Components can be calculated from either the covariance or the correlation matrix, although the correlation matrix is preferred when the variable scales differ (inches vs. pounds, for instance).
Below is a step-by-step explanation of obtaining the components for two variables.
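Before working through the steps one at a time, the whole procedure can be condensed into a short function. This is a sketch in Python with numpy, roughly analogous in spirit to R's *princomp(X, cor = TRUE)*; minor details, such as the divisors used in variance estimates, differ between implementations:

```python
import numpy as np

def pca_from_correlation(X):
    """PCA of the columns of X via the correlation matrix.

    Returns eigenvalues (component variances, descending),
    eigenvectors (loadings), and the component scores."""
    Xc = X - X.mean(axis=0)                # step 1: center
    R = np.corrcoef(Xc, rowvar=False)      # step 2: correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)   # steps 3-4: eigen-decomposition
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = Xc / Xc.std(axis=0, ddof=1)        # standardize before projecting
    return eigvals, eigvecs, Z @ eigvecs

# Illustrative simulated data with two correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[:, 1] += 0.5 * X[:, 0]                   # induce positive correlation
lam, vecs, comps = pca_from_correlation(X)
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(np.allclose(lam, [1 + r, 1 - r]))    # True for two variables
```

For two variables the eigenvalues come out as 1 + *r* and 1 − *r*, exactly as derived in the steps below.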

1. Center the observations by subtracting the sample mean from each variable value.

*x*'_{jn} = *x*_{jn} − x̄_{j}

where

*x*'_{jn} = the centered value of the *n*th observation of the *j*th variable, *j* = 1, …, *p*, and *n* is the observation number, *n* = 1, …, N,

*x*_{jn} = the *n*th observation of the *j*th variable,

x̄_{j} = the sample mean of the *j*th variable.
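Step 1 is straightforward to verify numerically. A minimal Python/numpy illustration with a small hypothetical sample:

```python
import numpy as np

# Hypothetical sample: N = 5 observations of one variable.
x = np.array([61.0, 64.0, 67.0, 70.0, 73.0])

# Step 1: subtract the sample mean from each observation.
x_centered = x - x.mean()

print(x_centered)         # [-6. -3.  0.  3.  6.]
print(x_centered.mean())  # 0.0
```

After centering, each variable has mean zero, which is what the later matrix computations assume.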

2. Find the correlation matrix, **R**, for the two random variables, *x*'_{1} and *x*'_{2}:

**R** = [1  *r*; *r*  1]

where *r* = the correlation between *x*'_{1} and *x*'_{2}, so that **R** is a 2 × 2 matrix with ones on the diagonal and *r* off the diagonal.

3. Solve the following for λ, the eigenvalues, which also represent the component variances:

det(**R** − λ**I**) = 0

where **I** is the 2 × 2 identity matrix. Solving this yields λ_{1} = 1 + *r* and λ_{2} = 1 − *r*. The two λs are eigenvalues; there will be as many eigenvalues as there are variables.
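The result in step 3 can be checked numerically for any hypothetical value of *r*, here 0.6:

```python
import numpy as np

# A 2x2 correlation matrix with a hypothetical correlation r = 0.6.
r = 0.6
R = np.array([[1.0, r],
              [r, 1.0]])

# The eigenvalues of R solve det(R - lambda*I) = 0.
eigvals = np.linalg.eigvalsh(R)  # returned in ascending order

print(eigvals)  # [0.4 1.6], i.e. 1 - r and 1 + r
```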

4. Employ the eigenvalues to find the corresponding eigenvectors. We do so by solving the following:

**R**α_{1} = λ_{1}**α**_{1}

where **α**_{1} = [α_{11}, α_{21}]^{T}. Solving the resulting equations yields α_{11} = α_{21}. Requiring that α^{2}_{11} + α^{2}_{21} = 1 then yields α_{11} = 1/√2 and α_{21} = 1/√2. Solving **R**α_{2} = λ_{2}**α**_{2}, where **α**_{2} = [α_{12}, α_{22}]^{T}, yields α_{12} = −(1/√2) and α_{22} = 1/√2. Thus, the eigenvalues and their corresponding eigenvectors are as follows:

λ_{1} = 1 + *r*, the first eigenvector = [1/√2, 1/√2]^{T}

λ_{2} = 1 − *r*, the second eigenvector = [−1/√2, 1/√2]^{T}
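These eigenvectors can also be confirmed numerically. The hypothetical *r* = 0.6 below is arbitrary (for a 2 × 2 correlation matrix the eigenvectors do not depend on *r*), and a solver may flip the sign of an eigenvector:

```python
import numpy as np

r = 0.6  # hypothetical correlation
R = np.array([[1.0, r], [r, 1.0]])
eigvals, eigvecs = np.linalg.eigh(R)  # ascending eigenvalue order

# The larger eigenvalue, 1 + r, pairs with [1/sqrt(2), 1/sqrt(2)];
# the smaller, 1 - r, with [-1/sqrt(2), 1/sqrt(2)] (up to sign).
v_big = eigvecs[:, 1]
v_small = eigvecs[:, 0]
s = 1 / np.sqrt(2)

print(np.allclose(np.abs(v_big), [s, s]))    # True
print(np.allclose(np.abs(v_small), [s, s]))  # True
```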

The principal components for this simple example are then given by the following:

*C*_{1} = (1/√2)*x*'_{1} + (1/√2)*x*'_{2}

*C*_{2} = −(1/√2)*x*'_{1} + (1/√2)*x*'_{2}

5. Lastly, the variance of component *i* is as follows:

var(*C*_{i}) = α^{2}_{i1}*S*^{2}_{1} + α^{2}_{i2}*S*^{2}_{2} + 2α_{i1}α_{i2}*rS*_{1}*S*_{2}

The sum of the component variances equals the sum of the variances of the original variables. Thus, if the sample variances *S*^{2}_{1} = 109.63 and *S*^{2}_{2} = 55.40, the total variance in the variables equals 165.03. Assume that the variance of *C*_{1} = 147.44 and the variance of *C*_{2} = 17.59. The sum of both is also 165.03.
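The equality of the two variance totals can be demonstrated on simulated data (the variable names and scale parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: two correlated variables.
x1 = rng.normal(scale=10.0, size=300)
x2 = 0.5 * x1 + rng.normal(scale=5.0, size=300)
X = np.column_stack([x1, x2])

S = np.cov(X, rowvar=False)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)
C = (X - X.mean(axis=0)) @ eigvecs    # component scores

# The component variances are the eigenvalues, and their sum equals
# the sum of the original sample variances (the trace of S).
print(np.allclose(C.var(axis=0, ddof=1), eigvals))  # True
print(np.isclose(eigvals.sum(), np.trace(S)))       # True
```

The total variance is preserved; PCA merely redistributes it so that the leading components carry as much of it as possible.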

6. Not all components will be selected. Software typically reports each component's variance, its proportion of the total variance, and the cumulative proportion of total variance, and various thresholds for the cumulative proportion have been proposed (e.g., 70%, 80%, 90%). A scree plot of the eigenvalues, so named because of its resemblance to the scree seen at the bottom of a cliff, also provides a rule of thumb: when the plot begins to level off toward the horizontal, the number of components at that point should be selected.
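A minimal sketch of the cumulative-proportion rule, using hypothetical eigenvalues from a five-variable PCA:

```python
import numpy as np

# Hypothetical eigenvalues (component variances) from a 5-variable PCA.
eigvals = np.array([2.9, 1.2, 0.5, 0.3, 0.1])

prop = eigvals / eigvals.sum()  # proportion of total variance
cum = np.cumsum(prop)           # cumulative proportion

# Keep the smallest number of components whose cumulative
# proportion reaches a chosen threshold, e.g. 80%.
k = int(np.searchsorted(cum, 0.80)) + 1

print(cum.round(2))  # [0.58 0.82 0.92 0.98 1.  ]
print(k)             # 2
```

Here two components account for 82% of the total variance, so an 80% threshold would retain two; a scree plot of the same eigenvalues would level off at roughly the same point.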

Please see the next section for an example using R.