In the previous chapter we built a linear model to predict the amount of body fat, given measurements of the thickness of the triceps skin fold, the thigh circumference, and the midarm circumference. We saw that the dataset used for this model suffers from multicollinearity, meaning that some of the predictors (or linear combinations of predictors) are correlated with one another. Intuitively speaking, multicollinearity means that some variables don’t contribute much to the expressivity of the data: we could omit them and end up with a dataset that is almost as informative as the original one.
To find out which combinations of variables contribute most to the variability of our data, we will turn to principal component analysis, one of the mainstays of statistical data analysis. Principal component analysis will allow us to identify the main sources of variability in our dataset, and will tell us which combinations of variables can be omitted with only little impact on the data itself. This allows us to reduce the number of features in the dataset, and makes principal component analysis into what is called a technique for dimensionality reduction. This is useful for a number of reasons: a dataset with fewer features is easier to visualize and to analyze, uninformative features (which often amount to little more than noise) can be dropped, and, as we will see at the end of this chapter, reducing the number of predictors can resolve the multicollinearity problems that plague linear regression.
Principal component analysis (PCA) finds a low-dimensional approximation to the dataset which retains as much as possible the variability of the original dataset. To understand what is meant by this, let’s revisit the body fat dataset of chapter 1. In this dataset, the features are the measurements of the thickness of the triceps skin fold, the thigh circumference, and the midarm circumference. The total body fat is the outcome, but we will not consider it for the time being.
bodyfat <- read.csv("datasets/02-PCA/bodyfatNKNW.txt", sep = " ")
bodyfat_predictors <- bodyfat[, c("triceps.skinfold.thickness", "thigh.circumference", "midarm.circumference")]

library(GGally)
ggpairs(bodyfat_predictors)
From the scatter matrix, we see that triceps.skinfold.thickness and thigh.circumference are highly correlated: if you know one, you can predict the other one reasonably well. This makes it feel like a waste to analyze both: since they carry essentially the same information, we could just as well throw one of them away, or replace both by a linear combination. Let's do the latter, and introduce a new variable z1 which is the sum of both. In terms of this variable and midarm.circumference, which we leave unchanged, the dataset has only two features, and these are only mildly correlated, as shown in the scatter plot below.
p <- with(bodyfat_predictors, {
  z1 <- triceps.skinfold.thickness + thigh.circumference
  df <- data.frame(
    z1 = z1, midarm.circumference = midarm.circumference
  )
  ggpairs(df)
})
p
We have succeeded in our aim of reducing the number of features in our dataset from 3 to 2, but a number of questions immediately pop up: is there anything special about the z1 variable, and can we find it without looking at a scatter plot? And how much of the variability of the original dataset do we lose along the way?

It will turn out that the variable z1, which we constructed in an ad-hoc way, is remarkably close to the first principal component of the dataset, and we will discover a systematic way to compute all principal components. We will also see that, by discarding more or fewer principal components, we can retain as much of the variability of the original dataset as needed.
PCA works by forming linear combinations of the original variables in such a way that the variance of the resulting combination is maximal. There is another way of computing the principal components, by minimizing the reconstruction error, and it can be shown that both approaches give the same principal components (Bishop (2006), section 12). In this course, we will develop the first approach further.
We assume that we have \(N\) observations \(\mathbf{x}_{i}\), where each \(\mathbf{x}_i\) is a column vector in \(\mathbb{R}^D\). We assemble these observations into an \(N \times D\) data matrix \(\mathbf{X}\), where each row of \(\mathbf{X}\) is an observation \(\mathbf{x}_i\): \[ \mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \\ \end{bmatrix}. \] Keep in mind that the columns of \(\mathbf{X}\) are the features (also referred to as independent variables or predictors) of the dataset. In the case of the body fat dataset, \(\mathbf{X}\) is a \(20 \times 3\) matrix, since there are 20 observations, with 3 features for each observation. We will refer to the \(j\)th column of \(\mathbf{X}\) by the notation \(\mathbf{X}_j\), where \(j = 1, \ldots, D\). Note that \(\mathbf{X}_j\) is a column vector, with \(N\) entries, and there are \(D\) such columns. For the body fat dataset, the columns correspond to the following features:
triceps.skinfold.thickness
thigh.circumference
midarm.circumference
The first principal component, \(\mathbf{Z}_1\), is a linear combination of the features, chosen so that the amount of variation in \(\mathbf{Z}_1\) is maximized. Let's unpack these ideas one at a time. The fact that \(\mathbf{Z}_1\) is a linear combination means that it can be written as \[ \mathbf{Z}_1 = v_1 \mathbf{X}_1 + \cdots + v_D \mathbf{X}_D = \sum_{j = 1}^D v_j \mathbf{X}_j, \] where the \(v_j\) are coefficients that we have to determine. These coefficients are sometimes referred to as the principal component loadings, since \(v_j\) expresses how much of \(\mathbf{X}_j\) is added (“loaded”) to the principal component.
We can write this expression for \(\mathbf{Z}_1\) in a compact way by assembling all the coefficients \(v_j\) into a vector \(\mathbf{v}\), called the loadings vector: \[ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_D \end{bmatrix}. \] With this expression, \(\mathbf{Z}_1\) can be written as a matrix-vector product: \[ \mathbf{Z}_1 = \mathbf{X} \mathbf{v}. \tag{1.1}\] We will often use this way of expressing \(\mathbf{Z}_1\) as a matrix-vector product, since it makes subsequent calculations easier.
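To make Equation 1.1 concrete, here is a minimal sketch in R, reusing the bodyfat_predictors data frame loaded earlier; the vector v below is made up purely for illustration (it roughly mimics the ad-hoc z1 variable from the beginning of the chapter) and is not yet the optimal loadings vector.
# Sketch: project the data onto an arbitrary unit-length vector.
X <- as.matrix(bodyfat_predictors)   # 20 x 3 data matrix
v <- c(1, 1, 0) / sqrt(2)            # made-up loadings vector with unit norm
Z1_candidate <- X %*% v              # same as v[1]*X[,1] + v[2]*X[,2] + v[3]*X[,3]
head(Z1_candidate)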
Before we go on to determine the loadings \(v_j\), let's focus on the geometry behind Equation 1.1. Each component of \(\mathbf{Z}_1\) can be written as \[ (\mathbf{Z}_1)_i = \mathbf{x}_i^T \mathbf{v}. \] This is the dot product of the \(i\)th observation \(\mathbf{x}_i\) with the loadings vector \(\mathbf{v}\). This dot product tells us how much of the vector \(\mathbf{x}_i\) is parallel to \(\mathbf{v}\), as shown in Figure 1.1. For example, a data point that is at right angles to \(\mathbf{v}\) will have dot product 0 (no component at all along \(\mathbf{v}\)), while one that is parallel to \(\mathbf{v}\) will have a dot product that is maximal in magnitude.
In other words, we obtain \(\mathbf{Z}_1\) by taking a fixed vector \(\mathbf{v}\) and projecting all of our data points on the line through the origin in the direction of \(\mathbf{v}\). If we choose another vector, \(\mathbf{w}\), we obtain a different projection, as indicated on Figure 1.2.
Our goal is now to find the loadings vector \(\mathbf{v}\) so that the variance of the projected dataset is maximal. To make this problem well-posed, we will assume that \(\mathbf{v}\) has unit norm: \[ \mathbf{v}^T \mathbf{v} = 1. \tag{1.2}\] If we did not impose this constraint, we could increase the amount of variance simply by making \(\mathbf{v}\) longer.
The mean of the projected data is given by \[ \bar{\mathbf{Z}}_1 = \frac{1}{N}\sum_{j=1}^N (\mathbf{Z}_1)_j = \frac{1}{N}\sum_{j=1}^N \mathbf{x}_j^T \mathbf{v} = \mathbf{\bar{x}}^T \mathbf{v}, \] where \(\mathbf{\bar{x}}\) is the (sample) mean of the original data points. In other words, the mean \(\bar{\mathbf{Z}}_1\) is just the mean \(\bar{\mathbf{x}}\) of the data points, projected onto \(\mathbf{v}\).
The variance of the projected data is given by \[ \sigma^2 = \frac{1}{N} \sum_{i=1}^N\left( (\mathbf{Z}_1)_i - \bar{\mathbf{Z}}_1\right)^2 = \frac{1}{N} \sum_{i = 1}^N\left( \mathbf{x}_i^T \mathbf{v} - \mathbf{\bar{x}}^T \mathbf{v}\right)^2. \] This expression can be rewritten as a matrix product: \[ \sigma^2 = \mathbf{v}^T \mathbf{S} \mathbf{v}, \tag{1.3}\] where \(\mathbf{S}\) is the covariance matrix, given by \[ \mathbf{S} = \frac{1}{N} \sum_{i = 1}^N \left( \mathbf{x}_i \mathbf{x}_i^T - \mathbf{\bar{x}}\mathbf{\bar{x}}^T \right). \tag{1.4}\]
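As a quick numerical check, the sketch below verifies that the variance of the projected data equals \(\mathbf{v}^T \mathbf{S} \mathbf{v}\), continuing with the made-up example vector from above. Note that the formulas in the text use the \(1/N\) convention, whereas R's var() and cov() divide by \(N - 1\); this constant factor does not change which \(\mathbf{v}\) maximizes the variance.
# Sketch: the variance of the projected data equals v' S v.
X <- as.matrix(bodyfat_predictors)
v <- c(1, 1, 0) / sqrt(2)             # the made-up example vector from above
S <- cov(X)                           # sample covariance matrix (1/(N - 1) convention)
c(var(X %*% v), t(v) %*% S %*% v)     # both entries are the same number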
We are now ready to translate our problem into a mathematical form, so that we can solve it. To find the first principal component \(\mathbf{Z}_1\), we want to find a loadings vector \(\mathbf{v}\) so that the projected variance \(\sigma^2\), given in Equation 1.3, is maximized. In addition, we want \(\mathbf{v}\) to have unit length, as in Equation 1.2. In mathematical terms: \[ \mathbf{v} = \operatorname{argmax} \mathbf{v}^T \mathbf{S} \mathbf{v}, \quad \text{such that $\mathbf{v}^T \mathbf{v} = 1$}. \]
We can solve this optimization problem using the theory of Lagrange multipliers. If we introduce a Lagrange multiplier \(\lambda\) for the unit-length constraint, then the desired vector \(\mathbf{v}\) is given by \[ \mathbf{v} = \operatorname{argmax} L(\mathbf{v}) \] where \(L\) is given by \[ L(\mathbf{v}) = \mathbf{v}^T \mathbf{S} \mathbf{v} - \lambda( \mathbf{v}^T\mathbf{v} - 1). \]
A necessary condition for \(\mathbf{v}\) to be a maximum of \(L\) is that the gradient vanishes at \(\mathbf{v}\). Taking the gradient of \(L\) with respect to \(\mathbf{v}\) and setting the resulting expression equal to zero gives \[ \mathbf{S} \mathbf{v} = \lambda \mathbf{v}. \tag{1.5}\] This is a very important result: it tells us that the \(\mathbf{v}\) we are looking for is an eigenvector of the matrix \(\mathbf{S}\), with corresponding eigenvalue \(\lambda\). This will hold true generally, not just for the first principal component: finding the principal components of a data set will involve solving an eigenvalue problem, and selecting the largest eigenvalues.
Last, we have to find the Lagrange multiplier \(\lambda\). This can be done by multiplying Equation 1.5 from the left by \(\mathbf{v}^T\) to get \[ \mathbf{v}^T \mathbf{S} \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda, \] where we have used the unit-length constraint Equation 1.2.
We see that the Lagrange multiplier \(\lambda\) is precisely the variance \(\sigma^2\) of the first principal component \(\mathbf{Z}_1\). For this reason, we will refer to the eigenvalue \(\lambda\) as the amount of retained variance, since it expresses how much variance is captured by projecting the entire dataset onto the direction \(\mathbf{v}\).
To sum up, the first principal component \(\mathbf{Z}_1\) is a linear combination of the original features (columns) of our dataset, chosen so that the variance of \(\mathbf{Z}_1\) is maximal. We can find \(\mathbf{Z}_1\) by looking for the largest eigenvalue \(\lambda\) of the covariance matrix, with unit length eigenvector \(\mathbf{v}\), and projecting the data matrix \(\mathbf{X}\) onto \(\mathbf{v}\).
Now that we’ve computed the first principal component, how do we compute the others? It probably won’t come as a surprise that the next principal components, \(\mathbf{Z}_2\), \(\mathbf{Z}_3\), and so on, involve the amount of variation that is left in the data after \(\mathbf{Z}_1\) has been removed, and that they involve the second, third, … largest eigenvalues.
Assume that we have computed \(\mathbf{Z}_1\) as in the previous section, and denote its loadings vector by \(\mathbf{v}_1\). Recall that \(\mathbf{v}_1\) points in the direction of largest variance.
To find the next principal component, we consider the variability in the dataset that is not already accounted for by \(\mathbf{Z}_1\). More precisely, we look for a loadings vector \(\mathbf{v}_2\) which is orthogonal to \(\mathbf{v}_1\), has unit length, and maximizes the amount of variability \(\mathbf{v}_2^T \mathbf{S} \mathbf{v}_2\). By a reasoning similar to that of the previous section, one can show that this \(\mathbf{v}_2\) is an eigenvector associated with the second largest eigenvalue of \(\mathbf{S}\). The projection of the dataset onto this loadings vector then gives us the second principal component: \[ \mathbf{Z}_2 = \mathbf{X}\mathbf{v}_2. \] This procedure can be repeated to find all \(D\) principal components, and results in the following algorithm (implemented in the short code sketch after the list):
1. Compute the covariance matrix \(\mathbf{S}\) of the data.
2. Compute the eigenvalues \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D\) of \(\mathbf{S}\), together with corresponding unit-length eigenvectors \(\mathbf{v}_1, \ldots, \mathbf{v}_D\).
3. Obtain the \(k\)th principal component as \(\mathbf{Z}_k = \mathbf{X} \mathbf{v}_k\); the variance retained by \(\mathbf{Z}_k\) is \(\lambda_k\).
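The sketch below carries out these steps for the bodyfat predictors. It is only meant to illustrate the algorithm; the centering step is included so that the result matches the prcomp() output used later in this chapter, possibly up to the arbitrary sign of each component.
# Sketch: PCA via the eigenvalue decomposition of the covariance matrix.
X <- as.matrix(bodyfat_predictors)
X_centered <- scale(X, center = TRUE, scale = FALSE)  # subtract the column means

S <- cov(X)                       # step 1: covariance matrix
eig <- eigen(S)                   # step 2: eigenvalues (in decreasing order) and eigenvectors
Z <- X_centered %*% eig$vectors   # step 3: the columns of Z are the principal components

eig$values                        # variance retained by each component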
Typically, we do not have to compute the principal components by hand: most data analysis packages will do this for us, either via a built-in command, as in R (prcomp) or Minitab, or via an extra package, such as scikit-learn (Python) or xlstat (Excel). It is instructive, however, to know the principles behind PCA, so that you can interpret and understand the results.
For this example, we will calculate the principal components in two different ways. We will first compute the principal components by hand, by solving an eigenvalue problem. This is possible because the data are two-dimensional, and solving the characteristic equation for a two- or three-dimensional matrix can be done by hand. This is not practical for real-world datasets, which often contain dozens, thousands, or even millions of features, and we will therefore also cover computing the principal components using R.
Our dataset consists of 100 observations, where each observation has two components. The dataset is shown below and has been carefully constructed so that the covariance matrix \(\mathbf{S}\) and mean \(\mathbf{\bar{x}}\) are exactly equal to \[ \mathbf{S} = \begin{bmatrix} 5 & 2 \\ 2 & 2 \\ \end{bmatrix}, \quad \text{and} \quad \mathbf{\bar{x}} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{1.6}\]
library(mvtnorm)
S <- matrix(c(5, 2,
              2, 2), byrow = TRUE, nrow = 2)

adjust_cov_mean <- function(z, mean, sigma) {
  # Adjust NxD matrix z so that sample mean and sample
  # covariance are exactly `mean` and `sigma`

  # whiten the z's
  z_mean <- colMeans(z)
  z <- t(apply(z, 1, function(row) row - z_mean))
  R <- chol(cov(z))
  z <- t(solve(t(R), t(z)))

  # impose desired covariance, mean
  R <- chol(sigma)
  z <- t(apply(z %*% R, 1, function(row) row + mean))

  z
}

z <- rmvnorm(100, mean = c(0, 0))
z <- adjust_cov_mean(z, c(1, 1), S)
df <- as.data.frame(z)
colnames(df) <- c("X", "Y")

ggplot(df, aes(x = X, y = Y)) +
  geom_point()
To find the loading vectors, we must find the eigenvalues of \(\mathbf{S}\), which we can do via the characteristic equation: \[
\det(\mathbf{S} - \lambda \mathbf{I}) = 0.
\]
Substituting the expression given in Equation 1.6 for the covariance matrix and expanding the determinant gives \[
\det \begin{bmatrix}
5 - \lambda & 2 \\
2 & 2 - \lambda \\
\end{bmatrix} = (5 - \lambda)(2 - \lambda) - 4 = 0.
\] The roots of this equation are \(\lambda_1 = 6\) and \(\lambda_2 = 1\). The corresponding eigenvectors are \[
\mathbf{v}_1 = \frac{1}{\sqrt{5}} \begin{bmatrix}
2 \\ 1
\end{bmatrix}, \quad
\mathbf{v}_2 = \frac{1}{\sqrt{5}} \begin{bmatrix}
-1 \\ 2
\end{bmatrix}.
\] These are our loading vectors, and they indicate the direction in which the data varies the most (for \(\mathbf{v}_1\)) and the “second-most” (for \(\mathbf{v}_2\)). Figure 1.3 shows the dataset again, now with the two loading vectors superimposed. Each loading vector has been rescaled by multiplying it with the square root of the corresponding eigenvalue. Why the square root? The eigenvalue itself represents the variance in that direction, while its square root represents the standard deviation, which is on the same scale as the data.
vx <- 2/5^0.5
vy <- 1/5^0.5

segment <- function(lx, ly, color = "cornflowerblue") {
  geom_segment(aes(x = 1, y = 1, xend = 1 + lx, yend = 1 + ly),
               arrow = arrow(length = unit(0.5, 'cm')),
               color = color, lwd = 1.5, alpha = 0.8,
               lineend = "round")
}

l1_sqrt <- 6^0.5
l2_sqrt <- 1

ggplot() +
  geom_point(data = df, aes(x = X, y = Y)) +
  segment(l1_sqrt*vx, l1_sqrt*vy, color = "cornflowerblue") +
  segment(-l2_sqrt*vy, l2_sqrt*vx, color = "chocolate") +
  xlim(c(-5, 6)) + ylim(c(-2.5, 4.5))
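Before turning to prcomp, we can sanity-check the hand computation by letting R solve the same eigenvalue problem; this is a quick sketch that reuses the matrix S defined in the code above. Keep in mind that eigen() may return the eigenvectors with the opposite sign, which does not change the principal components.
# Sanity check: eigen() should reproduce the eigenvalues 6 and 1
# and the loading vectors found by hand (possibly up to a sign).
eigen(S)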
To compute the principal components directly via R, we can use the prcomp command, as shown below. This is generally the preferred option over computing the eigenvalue decomposition by hand: prcomp is more flexible (allowing one, for example, to scale the data before computing the principal components) and is also numerically more stable.1 For our simple dataset, the end result is the same:
pca <- prcomp(df)
pca
Standard deviations (1, .., p=2):
[1] 2.44949 1.00000
Rotation (n x k) = (2 x 2):
PC1 PC2
X 0.8944272 -0.4472136
Y 0.4472136 0.8944272
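These numbers agree with the hand computation: the reported standard deviations are the square roots of the eigenvalues (\(\sqrt{6} \approx 2.449\) and \(\sqrt{1} = 1\)), and the columns of the rotation matrix are the loading vectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\). A short sketch makes the comparison explicit:
# The squared standard deviations are the eigenvalues of S ...
pca$sdev^2
# ... and the columns of the rotation matrix are the loading vectors.
pca$rotation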
Prior to doing principal component analysis, the data are often standardized by subtracting the mean for each feature and dividing by the standard deviation. If the original predictors in our dataset are given by \(\mathbf{X}_i\), \(i = 1, \ldots, D\), this means that we introduce standardized variables \(\mathbf{Y}_i\), given by \[ \mathbf{Y}_i = \frac{\mathbf{X}_i - \bar{\mathbf{X}}_i}{\sqrt{S^2_i}}, \] where \(S^2_i\) is the variance of \(\mathbf{X}_i\). The resulting variables \(\mathbf{Y}_i\) will have mean 0 and variance 1.
Standardizing the predictors means that they will be comparable in magnitude: variables whose variance is small will gain in importance and large variables will decrease in importance, roughly speaking. This may change the PCA output significantly!
As a rule of thumb, you should standardize variables that are measured in different units (e.g. seconds, meters, Watt, …), since the unit can be rescaled without affecting the physical meaning of the variable, or its relation to other variables (e.g., rescaling a variable expressed in meters by a factor of 1000 is the same as expressing that variable in kilometers). By contrast, variables that are measured in the same units or that are unitless should not be rescaled (or should be rescaled collectively).
For an example of the latter, think about pixel intensities in an image dataset (such a dataset is in fact analyzed in Section 2.2). Pixels near the edge of the image presumably are part of the background, don’t vary that much, and are relatively unimportant, whereas pixels in the center are likely to be part of the image subject, vary a lot, and carry a lot of information. Standardizing the pixels would make the pixels near the edge just as variable as the pixels in the center, which would greatly amplify the noise in the image at the expense of the useful information in it!
Standardizing the predictors is also referred to as scaling.
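In R, the easiest way to standardize is to let prcomp do it via its scale. argument, as in the sketch below. It is shown here for the bodyfat predictors purely as an illustration; the rest of this chapter keeps working with the unscaled data, as above.
# Sketch: PCA on standardized predictors (each column gets mean 0 and variance 1).
# Equivalent to prcomp(scale(bodyfat_predictors)).
pc_scaled <- prcomp(bodyfat_predictors, center = TRUE, scale. = TRUE)
summary(pc_scaled)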
In this section we will discuss a number of useful results that follow from PCA. We will use the bodyfat dataset as an illustration throughout.
pc <- prcomp(bodyfat_predictors)
pc
Standard deviations (1, .., p=3):
[1] 7.2046011 3.7432587 0.1330841
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
triceps.skinfold.thickness 0.6926671 0.1511979 0.7052315
thigh.circumference 0.6985058 -0.3842734 -0.6036751
midarm.circumference 0.1797272 0.9107542 -0.3717862
We can also ask R to print a summary of the principal component analysis for us. This will give us a table with the proportion of variance explained by each principal component, as well as the cumulative proportion (the amount of variance retained by that principal component and all previous ones).
summary(pc)
Importance of components:
PC1 PC2 PC3
Standard deviation 7.2046 3.7433 0.13308
Proportion of Variance 0.7872 0.2125 0.00027
Cumulative Proportion 0.7872 0.9997 1.00000
Perhaps the most useful visualization of the PCA is the so-called score plot, which is nothing but a scatter plot of the first two principal components. It often happens that the score plot is sufficient to discern patterns in the data, such as clusters.
Figure 1.4 shows a score plot for the bodyfat dataset. While no obvious patterns in this dataset stand out, the plot does show that the principal components are uncorrelated, and this is a good confirmation of what we already know on theoretical grounds. It is customary to put the percentages of variance explained on the axis labels of the score plot, so that the person interpreting it can have an idea of how well the first two principal components describe the data.
# Explained variation for first and second component
total_var <- sum(pc$sdev^2)
pct_var_1 <- pc$sdev[1]^2 / total_var
pct_var_2 <- pc$sdev[2]^2 / total_var

df <- data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2])
ggplot(df, aes(x = PC1, y = PC2)) +
  geom_point() +
  xlab(paste0("PC1 (", round(pct_var_1 * 100, 2), "% var. explained)")) +
  ylab(paste0("PC2 (", round(pct_var_2 * 100, 2), "% var. explained)"))
As an aside, above we noted that the principal components are uncorrelated with one another. We can also verify that this is the case numerically. The result is not exactly zero, but it is very small:
cov(pc$x[,1], pc$x[,2])
[1] -2.604824e-16
A loadings plot shows the relations of the original variables and the principal components. This is often a useful visualization to see how each variable contributes to the principal components. Moreover, one can show that the loadings are proportional to the Pearson correlations between the principal components and the variables. Hence, if a loading is positive (negative), that variable will be positively (negatively) correlated with that principal component.2
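We can inspect these correlations directly, as in the small sketch below, which reuses the pc object computed above; the sign of each correlation matches the sign of the corresponding loading in pc$rotation.
# Pearson correlations between the original variables and the principal
# components; compare the sign pattern with the loadings in pc$rotation.
cor(bodyfat_predictors, pc$x)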
At the beginning of this chapter we hypothesized that the variables thigh.circumference and triceps.skinfold.thickness would contribute about equally to the first principal component. From Figure 1.5 we see that this is indeed the case: both variables have loadings approximately equal to \(0.7\) when we consider the first principal component. We also see that the second principal component is mostly made up of the variable midarm.circumference.
Unfortunately there is no command in base R or ggplot to create a loadings plot – you have to make one yourself.
library(tidyr)
library(tibble)
pc <- prcomp(bodyfat_predictors)
df <- as.data.frame(pc$rotation) %>%
  rownames_to_column(var = "Variable") %>%
  pivot_longer(c("PC1", "PC2", "PC3"),
               names_to = "Component",
               values_to = "Loading")

ggplot(df, aes(x = as.factor(Variable), y = Loading,
               group = Component, color = Component)) +
  geom_line() +
  geom_point() +
  xlab("Variable")
We now know how to calculate the \(D\) principal components for a given \(D\)-dimensional dataset, and we’ve seen that the principal components correspond to the directions in which the dataset varies the most. The real power of PCA, and the reason why it is so ubiquitous in data analysis, is that we can now selectively discard principal components that are not informative. By doing this, we obtain a dataset with fewer features, which is hence easier to analyze, and where the discarded features do not contribute too much to the expressivity of the data. This is what makes PCA into a dimensionality reduction method.
The question remains which principal components to discard. There are no universally accepted rules for this, but there are a couple of rules of thumb that can help us make an informed choice. Most of these rules take into account the proportion of the total variance retained by the first \(K\) principal components, defined as \[
S_K = \frac{\sum_{i=1}^K\lambda_i}{\sum_{j=1}^D \lambda_j}.
\] Recall that this quantity can be computed directly within R by using the summary command (look for the "Cumulative Proportion" row):
summary(pc)
Importance of components:
PC1 PC2 PC3
Standard deviation 7.2046 3.7433 0.13308
Proportion of Variance 0.7872 0.2125 0.00027
Cumulative Proportion 0.7872 0.9997 1.00000
The number \(K\) of principal components can be chosen so that a fixed amount of variance (for example, 95%) is retained. To see this idea in action, let’s apply it to the body fat dataset. The relative amount of variance explained by each principal component can then be calculated as follows:
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
var_explained
[1] 0.787222422 0.212508963 0.000268615
and the cumulative amount of variance explained as follows:
total_var <- cumsum(var_explained)
total_var
[1] 0.7872224 0.9997314 1.0000000
Note that these numbers agree with the output of the summary command. For what follows, it will be easier to have access to these quantities as straightforward R vectors.
We see that the first two principal components explain 78.7% and 21.3% of the total variance, respectively, and together they explain more than 99.97% of variance in the dataset. It therefore seems reasonable to discard the last principal component, which contributes less than 0.03% of variance.
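If we want to apply the 95% rule programmatically, a one-liner does the job; this is a sketch that uses the total_var vector computed above, and for the bodyfat data it selects the first two components.
# Smallest number of components whose cumulative explained variance is at least 95%.
K <- which(total_var >= 0.95)[1]
K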
For datasets with many features, a scree plot or elbow plot can be helpful to identify high-variance principal components. In a scree plot, the amounts of variance are plotted in descending order, so that one can identify at a glance the amount of variability contributed by each principal component. Some scree plots also include the total amount of variability, \(S_K\), as a function of \(K\).
R can make a pretty basic scree plot for you, via the screeplot command.
screeplot(pc, main = "Principal components", type = "lines")
With a bit more work, you can build your own scree plot, which is often preferable if you want to customize the plot, for example to include the cumulative variance, as in Figure 1.6.
gg_screeplot <- function(pc, n = length(pc$sdev)) {
  sdev <- pc$sdev[1:n]
  var_explained <- sdev^2 / sum(sdev^2)
  total_var <- cumsum(var_explained)
  df_var <- data.frame(
    n = seq(1, n), v = var_explained, t = total_var)
  ggplot(df_var) +
    geom_line(aes(x = n, y = v, color = "Per component")) +
    geom_point(aes(x = n, y = v, color = "Per component")) +
    geom_line(aes(x = n, y = t, color = "Cumulative")) +
    geom_point(aes(x = n, y = t, color = "Cumulative")) +
    ylim(c(0, 1)) +
    scale_color_manual(
      name = "Explained variance",
      breaks = c("Per component", "Cumulative"),
      values = c("Per component" = "cornflowerblue",
                 "Cumulative" = "chocolate")
    ) +
    scale_x_continuous(breaks = n) +
    xlab("Principal component") +
    ylab("Explained variance (%)")
}
gg_screeplot(pc)
Scree plots are often also referred to as “elbow plots”, since the plot typically (but not always) shows an “elbow” where the explained variance levels off. However, spotting the exact location of the elbow is very subjective, and it is typically more appropriate to take the point where the remaining principal components only contribute some small percentage of variation, for example 5%.
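This 5% rule of thumb is also easy to apply programmatically; the sketch below reuses the cumulative proportions in total_var, and for the bodyfat data it again keeps the first two components.
# Smallest K such that the components that are left out together contribute
# less than 5% of the total variance.
K <- which(1 - total_var < 0.05)[1]
K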
The biplot consists of a loadings plot overlaid on a score plot. By default, it shows the first two principal components as a scatter plot, together with a set of vectors, one for each original variable, showing the contribution of that variable to the first two principal components. Biplots are relatively complex, but it is worth understanding what they encode.
biplot(pc)
The numbers on the plot represent the data points from the original dataset, expressed relative to the first two principal components. This is the part of the biplot that is like a score plot. The red arrows, on the other hand, represent the original variables in the dataset (as shown by the labels attached to them) and are read off against the top and right-hand axes, which show how much each variable contributes to the first two principal components. The red arrows therefore carry the same information as a loadings plot (in a different form), restricted to the first two principal components.
At one glance, we see that triceps.skinfold.thickness and thigh.circumference are the most important contributors to the first principal component, and that the second principal component is almost entirely made up of midarm.circumference. This confirms our intuition from the beginning of this chapter, as well as the conclusions that we drew from the loadings plot.
Once we have performed PCA, we can build a linear model using the principal components that we have chosen to retain. This is known as principal component regression. Let’s see this in action for the bodyfat dataset. We start from a model where we include the first principal component only:
<- pc$x[, "PC1"]
pc1 <- bodyfat$bodyfat
outcome
<- lm(outcome ~ pc1)
model_1 summary(model_1)
Call:
lm(formula = outcome ~ pc1)
Residuals:
Min 1Q Median 3Q Max
-5.1357 -1.8821 0.2682 1.7107 3.4992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.19500 0.58688 34.411 < 2e-16 ***
pc1 0.61366 0.08358 7.343 8.13e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.625 on 18 degrees of freedom
Multiple R-squared: 0.7497, Adjusted R-squared: 0.7358
F-statistic: 53.91 on 1 and 18 DF, p-value: 8.128e-07
Adding the second principal component results in a bigger model, but the coefficient of the second principal component is not significant. We therefore do not include it in our final model and we stay with model_1.
<- pc$x[, "PC2"]
pc2 <- lm(outcome ~ pc1 + pc2)
model_12 summary(model_12)
Call:
lm(formula = outcome ~ pc1 + pc2)
Residuals:
Min 1Q Median 3Q Max
-3.9876 -1.8822 0.2562 1.3209 4.0285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.19500 0.56604 35.678 < 2e-16 ***
pc1 0.61366 0.08061 7.613 7.12e-07 ***
pc2 -0.23785 0.15514 -1.533 0.144
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.531 on 17 degrees of freedom
Multiple R-squared: 0.7801, Adjusted R-squared: 0.7542
F-statistic: 30.15 on 2 and 17 DF, p-value: 2.564e-06
In this way, we obtain a simple least squares model, with one predictor and one outcome. Quite a simplification from our original linear model, which had three predictors and suffered from multicollinearity. This reduction from three predictors to one may not seem like a big deal, but in the next chapter we will see examples of how PCA can be used to reduce datasets with many hundreds of variables to just 2-3 components, while maintaining the essential information in the dataset.
In this section we have built the linear model "by hand", but in reality you would probably not want to do this, for two reasons: you would have to glue the PCA and the regression together yourself for every model that you try, and judging the number of components to keep from in-sample summaries (such as the coefficient p-values above) is less reliable than choosing it via cross-validation.
Luckily, there are R packages that will do the PCA, select the optimal number of components, and build the regression model all for us. Here we use the pls package. We tell it to consider all principal components, and to assess the error through cross-validation. The details of this will be discussed in class, but the important line in the output is labeled RMSEP. It shows a measure of prediction error as a function of the number of components taken into account. We see that the first principal component accounts for almost all of the reduction in this error; adding further components gives at most a marginal improvement.
library(pls)
pcr_model <- pcr(bodyfat ~ ., data = bodyfat, validation = "CV")
summary(pcr_model)
Data: X dimension: 20 3
Y dimension: 20 1
Fit method: svdpc
Number of components considered: 3
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps
CV 5.239 2.612 2.597 2.681
adjCV 5.239 2.602 2.583 2.656
TRAINING: % variance explained
1 comps 2 comps 3 comps
X 78.72 99.97 100.00
bodyfat 74.97 78.01 80.14
There are various rules to select the optimal number of components to consider. One such rule is the "1-sigma rule" (Hastie, Tibshirani, and Friedman 2009), which selects the model with the smallest number of components whose cross-validation error is within one standard deviation of that of the best-performing model. In our case, this again results in a model with only one predictor.
selectNcomp(pcr_model, method = "onesigma", plot = TRUE)
[1] 1
R has two commands to compute the principal components: prcomp and princomp. The former computes the principal components using the so-called singular value decomposition (SVD) and is preferred for numerical stability. The latter, princomp, uses the eigenvalue decomposition and is provided mainly for backwards compatibility with S-PLUS.↩︎
There is a closely related plot type, the profile plot, which differs from the loadings plot in that it has the Pearson correlations on the \(y\)-axis. Otherwise the two plot types are identical.↩︎
These course notes are made available under the Creative Commons BY-NC-SA 4.0 license.