So if you’re an ecologist of any sort, you’ve probably used, and have definitely come across, principal component analyses (PCA). These analyses compress a large number of correlated variables into a few variables that capture most of the variation in the full dataset. This is achieved by constructing linear combinations of the original variables (called principal component axes), subject to a couple of constraints:

1. The first linear combination should capture as much of the variation in the dataset as possible.

2. Subsequent linear combinations should also be as variable as possible, and must be uncorrelated with the previous linear combinations.
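To make the two constraints concrete, here is a minimal sketch in Python with NumPy. It assumes a PCA done by eigen-decomposition of the covariance matrix, and the data are made up for illustration (four variables that share a common signal). It is one way of checking the constraints numerically, not the only way to run a PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 4 correlated variables.
shared = rng.normal(size=(200, 1))
X = shared + 0.5 * rng.normal(size=(200, 4))

# Centre the variables, then eigen-decompose their covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores are the linear combinations of the original variables.
scores = Xc @ eigvecs

# Constraint 1: each axis has variance equal to its eigenvalue, largest first.
score_var = scores.var(axis=0, ddof=1)

# Constraint 2: the axes are mutually uncorrelated (off-diagonals near zero).
score_cov = np.cov(scores, rowvar=False)

print(np.round(score_var, 3))
print(np.round(score_cov, 3))
```

The printed covariance matrix of the scores should be diagonal up to rounding, with the variances on the diagonal in decreasing order.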

For example, imagine that you’re measuring body dimensions of frogs. Longer individuals probably also have longer legs and larger heads. For such a dataset of morphological variables, the first principal component axis is usually a linear combination in which the coefficients of all the variables are roughly equal in magnitude and have the same sign. Such an axis is usually interpreted as measuring the overall “body size” of the organism. Subsequent axes are then interpreted as different body shape variables, some of which can be biologically interesting. For instance, in datasets that include morphological measurements of both males and females, a shape axis might point to differences in dimensions between the sexes.
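A small simulation can illustrate the “body size” axis. The sketch below invents a frog dataset in which a single latent size factor drives three traits; the trait names, weights, and noise level are all assumptions chosen for illustration, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical frogs: one latent "size" factor drives all three traits.
size = rng.normal(size=(n, 1))
weights = np.array([[1.0, 0.9, 0.8]])  # body length, leg length, head width
traits = size @ weights + 0.3 * rng.normal(size=(n, 3))

# PCA via eigen-decomposition of the covariance matrix.
cov = np.cov(traits, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]

# The sign of an eigenvector is arbitrary; orient PC1 so its coefficients sum to > 0.
if pc1.sum() < 0:
    pc1 = -pc1

print(np.round(pc1, 2))
```

With data built this way, the three coefficients of the first axis share a sign and are of similar magnitude, which is the pattern usually read as overall size.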

How different is the information conveyed by the two different definitions of loadings? For highly correlated datasets, such as the ones we’re most likely to conduct PCA on, they don’t seem vastly different. This claim is based on my calculations of both definitions of loadings for three or four morphometric datasets I have lying around: the two loadings were perfectly correlated for each dataset. But for less correlated datasets, the answer might be different. Here is a graph of the relationship between the two types of loadings (described as “coeff” and “cor” respectively below) for PCAs conducted on randomly generated normal variables:

Someone more mathematically savvy than me should calculate this relationship explicitly for a number of datasets with varying correlation structures, so that we can assess whether this shift in definition of PCA loadings has implications for how we’ve been interpreting the biological relevance of these axes. Given how widely used PCAs are, it’s well worth knowing what these implications might be.
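As a possible starting point for that kind of calculation, here is a minimal sketch that computes both kinds of loadings for one dataset. It rests on assumptions of mine: that “coeff” means the coefficients of the linear combination (the eigenvector entries), that “cor” means the correlation between each original variable and the scores on each axis, and that the PCA is run on the covariance matrix. The function name `both_loadings` and the simulated data are hypothetical. Under those assumptions, the correlation for variable j on axis k equals the coefficient multiplied by the standard deviation of axis k and divided by the standard deviation of variable j, and the sketch checks that numerically.

```python
import numpy as np

def both_loadings(X):
    """Return (coeff, cor, eigvals, sd) for a covariance-matrix PCA of X."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # largest-variance axis first
    eigvals, coeff = eigvals[order], eigvecs[:, order]
    scores = Xc @ coeff
    p = X.shape[1]
    # Correlation of each original variable (rows) with each axis (columns).
    cor = np.corrcoef(np.hstack([Xc, scores]), rowvar=False)[:p, p:]
    sd = Xc.std(axis=0, ddof=1)
    return coeff, cor, eigvals, sd

# Hypothetical data: independent normal variables with unequal spreads.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) * np.array([1.0, 2.0, 0.5, 3.0, 1.5])

coeff, cor, eigvals, sd = both_loadings(X)

# Check the relationship: cor[j, k] = coeff[j, k] * sqrt(eigvals[k]) / sd[j].
predicted = coeff * np.sqrt(eigvals) / sd[:, None]
print(np.allclose(cor, predicted))
```

Running the same function on datasets with different correlation structures, and on standardized versus unstandardized variables, would show how far the two sets of loadings can diverge in practice.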