Principal component analysis explained simply

1. Principal components capture the most variation in a dataset

  1. When the data points are projected onto the first principal component (PC1), the total distance among the projected points is maximal. This means they can be distinguished from one another as clearly as possible. We want to compare samples, remember? A line that blurs data points together won’t help.
  2. The total distance from the original points to their corresponding projected points is minimal. This means the representation stays as close to the original data as possible.
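To make these two properties concrete, here is a minimal sketch on made-up 2D data. It assumes NumPy and measures "distance" as squared distance, which is the quantity PCA optimizes. The helper name `spread_and_error` and the toy numbers are illustrative only. The sketch compares PC1 against 180 other candidate lines through the center of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 200 points in 2D, stretched along a diagonal.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
Xc = X - X.mean(axis=0)  # PCA works on centered data

# PC1 is the first right singular vector of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

def spread_and_error(direction):
    """Project the points onto a line and return
    (spread of the projected points, distance lost by projecting)."""
    d = direction / np.linalg.norm(direction)
    coords = Xc @ d                    # position of each point along the line
    projected = np.outer(coords, d)    # projected points, back in 2D
    spread = np.sum(coords ** 2)       # how far apart the projections are
    error = np.sum((Xc - projected) ** 2)  # original-to-projection distance
    return spread, error

pc1_spread, pc1_error = spread_and_error(pc1)

# Try many other lines through the center, one per degree.
angles = np.linspace(0, np.pi, 180, endpoint=False)
others = [spread_and_error(np.array([np.cos(a), np.sin(a)])) for a in angles]

print("PC1 has the largest spread:", all(s <= pc1_spread + 1e-9 for s, _ in others))
print("PC1 has the smallest error:", all(e >= pc1_error - 1e-9 for _, e in others))
```

For any line, spread plus error equals the same total, so maximizing one automatically minimizes the other. That is why the two conditions above pick out the same line.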

2. PCA deals with the curse of dimensionality by capturing the essence of data into a few principal components.

  1. Principal components help reduce the number of dimensions down to 2 or 3, making it possible to see strong patterns.
  2. Yet you didn’t have to throw away any genes in doing so. Principal components take all dimensions and data points into account.
  3. Since PC1 and PC2 are perpendicular to each other, we can rotate them so that one lies horizontal and the other vertical. These become the axes of our PCA plot.
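The three points above can be sketched in a few lines. This is an illustration under stated assumptions, not the article's own pipeline. The expression matrix is random made-up data (30 mice by 50 genes), and the PCs are computed with NumPy's SVD.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expression matrix: one row per mouse, one column per gene.
n_mice, n_genes = 30, 50
expression = rng.normal(size=(n_mice, n_genes))

# Center each gene, then take the principal directions from the SVD.
centered = expression - expression.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# 50 dimensions down to 2: each mouse gets an (x, y) position for the plot.
coords = centered @ Vt[:2].T

print("coordinates shape:", coords.shape)
print("PC1 and PC2 perpendicular:", np.isclose(pc1 @ pc2, 0.0))
print("genes contributing to PC1:", np.count_nonzero(pc1), "of", n_genes)
```

No gene is dropped: every column of the matrix has a weight in `pc1` and `pc2`. Using the two columns of `coords` as the x and y values of a scatter plot is the "rotation" that makes PC1 and PC2 the plot axes.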

3. Dimensions vary in the weights they have on each principal component.

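  • Mouse #i: its expression values across all n genes, (x_i1, x_i2, …, x_in)
  • Eigenvector #j: one weight per gene, (w_j1, w_j2, …, w_jn)
  • Principal component j-th of sample i: the weighted sum w_j1·x_i1 + w_j2·x_i2 + … + w_jn·x_in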
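As a rough sketch of how the weights work: the value of principal component j for mouse i is a weighted sum of that mouse's (centered) gene values, with the weights taken from eigenvector j. The gene names and numbers below are invented for illustration, and NumPy is assumed.

```python
import numpy as np

# Made-up expression values: 5 mice (rows) x 4 genes (columns).
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
expression = np.array([
    [10.0, 2.0, 7.0, 1.0],
    [11.0, 2.5, 6.0, 1.2],
    [ 3.0, 8.0, 6.5, 0.9],
    [ 2.5, 9.0, 7.2, 1.1],
    [ 6.0, 5.0, 6.8, 1.0],
])
centered = expression - expression.mean(axis=0)

# Rows of Vt are the eigenvectors (one weight per gene).
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

i, j = 2, 0                      # mouse #i, eigenvector #j (0-based here)
weights = Vt[j]

# Principal component j of sample i, written out as a weighted sum.
score_ij = sum(w * x for w, x in zip(weights, centered[i]))

# The same thing for every mouse and every PC at once.
all_scores = centered @ Vt.T

# Genes ranked by how heavily they weigh on this PC.
ranked = sorted(zip(genes, weights), key=lambda gw: abs(gw[1]), reverse=True)
for name, w in ranked:
    print(f"{name}: {w:+.3f}")
```

Genes near the top of `ranked` have the heaviest influence on that PC. The sign of a weight is arbitrary up to flipping the whole eigenvector, so it is the absolute value that matters.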

4. How to read a PCA plot

  • Mice that have similar expression profiles are now clustered together. Just glancing at this plot, we can see that there are 3 clusters of mice.
  • If 2 clusters of mice are different based on PC1, like the blue and orange clusters in this plot, such differences are likely to be due to the genes that have heavy influences on PC1.
  • If 2 clusters are different based on PC2, like the red and blue clusters, then the genes that heavily influence PC2 are likely to be responsible.
  • Keep in mind that PCs are ranked by how much of the variation in the data they describe. PC1 captures the most variation, and PC2 captures the second most. Therefore, when the two axes are drawn at similar lengths, a difference between clusters along the PC1 axis is actually larger than a similar-looking distance along the PC2 axis.
  • Is this plot meaningful? Check the proportion of variance, or the diagnostic scree plot. PCA is worthwhile if the top 2 or 3 PCs cover most of the variation in your data. Otherwise, you should consider other dimension reduction techniques, such as t-SNE and MDS.
Proportion of variance graphs, good and bad
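Here is a minimal sketch of that check, assuming NumPy and two invented datasets. The "good" one is driven by 2 hidden factors plus a little noise, and the "bad" one is pure noise with no structure. The function name `variance_ratios` is hypothetical.

```python
import numpy as np

def variance_ratios(data):
    """Proportion of total variance captured by each PC, largest first."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(2)

# "Good" case: 20 genes that all follow 2 hidden factors, plus small noise.
factors = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 20))
good = factors @ mixing + 0.1 * rng.normal(size=(100, 20))

# "Bad" case: 20 unrelated genes, so no direction stands out.
bad = rng.normal(size=(100, 20))

good_top2 = variance_ratios(good)[:2].sum()
bad_top2 = variance_ratios(bad)[:2].sum()

print(f"good data: top 2 PCs cover {good_top2:.0%} of the variance")
print(f"bad data:  top 2 PCs cover {bad_top2:.0%} of the variance")
```

Plotting the values returned by `variance_ratios` as a bar chart gives the scree plot. In the structured case the first two bars dominate. In the noise case the bars fall off slowly, and a 2D PCA plot would hide most of what is going on.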


At BioTuring, we dream, we think, we code, and we deliver important algorithms and software — to tackle biomedical challenges.
