Correlation

Correlation is a measure of how two variables are related to each other (i.e. how they trend or move together).

It is important to note that correlation does not imply causality.

Correlation is naturally high for variables that are interdependent by construction (for example, when one variable is computed from the other). Make sure the variables are measured independently of each other before you draw any conclusions from their correlation.

Table of contents

Pearson’s correlation coefficient $r$
Non-parametric correlation coefficients
1. Spearman’s rank correlation coefficient $\rho$
2. Kendall rank correlation coefficient $\tau$

Pearson’s correlation coefficient $r$

Pearson’s correlation coefficient, commonly denoted as $r$, measures the strength and direction of the linear relationship between two quantitative variables.

It is the most common measure of correlation.

Population Pearson’s Correlation Coefficient

The population Pearson’s correlation coefficient $r$ of two random variables $X$ and $Y$ is defined as:

$$ r = \frac{\operatorname{Cov}[X, Y]}{\sigma_X \sigma_Y} $$

It is the covariance of $X$ and $Y$ divided by the product of their standard deviations. This normalization removes the units of measurement and confines the value of $r$ to the range $[-1, 1]$.
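One way to see where the bound comes from is the Cauchy-Schwarz inequality applied to the centered variables. This is a sketch rather than a full proof; $\mu_X$ and $\mu_Y$ denote the means of $X$ and $Y$ (notation introduced here), and both standard deviations are assumed to be non-zero:

$$ \lvert \operatorname{Cov}[X, Y] \rvert = \lvert \mathrm{E}[(X - \mu_X)(Y - \mu_Y)] \rvert \le \sqrt{\mathrm{E}[(X - \mu_X)^2] \, \mathrm{E}[(Y - \mu_Y)^2]} = \sigma_X \sigma_Y $$

Dividing both sides by $\sigma_X \sigma_Y$ gives $\lvert r \rvert \le 1$.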

We say that the correlation is stronger when $\lvert r\rvert$ is closer to $1$, and weaker when $\lvert r\rvert$ is closer to $0$.

Sample Pearson’s Correlation Coefficient

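For a sample of $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$, the coefficient is computed as:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$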
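The sample formula translates almost term by term into code. The following is a minimal sketch in plain Python; the function name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Illustrative data (made up for this example).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
print(f"r = {r:.3f}")
```

For this data the numerator is $6$ and the two sums of squares are $10$ and $6$, so $r = 6 / \sqrt{60} \approx 0.775$.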

Pearson’s $r$ measures linear relationship

As mentioned above, Pearson’s $r$ is a measure of linear relationship.

Therefore, $r$ is not suitable for measuring a non-linear relationship.

A positive $r$ does not by itself establish a positive linear relationship, because the underlying relationship may not be linear in the first place. Always inspect a scatter plot first to check.

Not a linear regression

Even though Pearson’s $r$ is a measure of linear relationship, it is not the same as the slope of the regression line.

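Whether the regression line fitted to the scatter plot of $X$ and $Y$ has a slope of $0.1$ or $10$ says nothing about the magnitude of $r$. The slope depends on the units and spread of the variables, whereas $r$ reflects how tightly the points cluster around the line. The two share only their sign, and $r$ is $0$ exactly when the slope is $0$.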

Assumes bivariate normal distribution

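Because Pearson’s $r$ is a parametric measure, it assumes that $X$ and $Y$ follow a bivariate normal distribution, which in particular requires each of them to be normally distributed.

If this assumption does not hold, $r$ is not a good measure of correlation.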

Very sensitive to outliers

Pearson’s $r$ can change drastically in the presence of outliers. Even a single extreme point may shift it substantially.

You should always check for outliers before using $r$.

Non-parametric correlation coefficients

If either $X$ or $Y$ is not normally distributed, then Pearson’s $r$ is not a good measure of correlation.

In this case, we can use non-parametric correlation coefficients.

Spearman’s rank correlation coefficient $\rho$

To be added

Kendall rank correlation coefficient $\tau$

To be added