Friday, July 15, 2016

Unizor - Statistics - Correlation - Introduction





Notes to a video lecture on http://www.unizor.com


Introduction to
Statistical Correlation


The foundation of statistical correlation (as of any other statistical subject) lies in the Theory of Probabilities. Please refer to the lectures on correlation in the chapter "Probability - Random Variables - Correlation".

The Theory of Probabilities introduces the concept of correlation between two random variables as a measure of dependency between them.
In particular, the correlation coefficient of two random variables is a number between −1 and +1; it equals zero for independent random variables and equals +1 or −1 for linearly dependent random variables (as in the case η=A·ξ).

Statistical correlation is a methodology to evaluate the dependency between two random variables, represented by their statistical data, in the absence of information about their distribution of probabilities.

Recall the definition of a correlation coefficient between two random variables ξ and η:
R(ξ,η) =
Cov(ξ,η)/√[Var(ξ)·Var(η)]

where covariance between these random variables is defined as
Cov(ξ,η) =
E[(ξ−E(ξ))·(η−E(η))] =
E(ξ·η)−E(ξ)E(η)


As we see, the correlation coefficient is expressed in terms of the expectation and variance of each random variable and the expectation of their product.
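
To illustrate, here is a minimal Python sketch (using NumPy; the variable names and the synthetic data are assumptions made just for this example). It generates a linearly dependent pair η=A·ξ and checks that the correlation coefficient computed by the formula above comes out close to −1 for a negative factor A:

import numpy as np

# Synthetic sample of ksi; eta is linearly dependent on it: eta = A*ksi
rng = np.random.default_rng(0)
A = -2.5                                  # negative factor, so we expect R close to -1
ksi = rng.normal(size=10_000)
eta = A * ksi

# R = Cov(ksi,eta)/sqrt(Var(ksi)*Var(eta)), where Cov = E(ksi*eta) - E(ksi)*E(eta)
cov = np.mean(ksi * eta) - np.mean(ksi) * np.mean(eta)
R = cov / np.sqrt(np.var(ksi) * np.var(eta))
print(R)                                  # approximately -1.0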

Suppose we have statistical data on the mutual distribution of two random variables, ξ and η. It is extremely important that our data represent the mutual distribution, which means that we know the simultaneous values of both variables under the same external conditions (for example, at the same time or at the same temperature).
Then we can calculate not only their separate expectations E(ξ),E(η) and variances Var(ξ),Var(η), but also the expectation of their product E(ξ·η).

Assume we have statistical data on these random variables obtained in the course of N mutual observations (that is, observations made under the same conditions, like at the same time or under the same pressure). For example, we register the closing price of IBM Corporation on the New York Stock Exchange (random variable ξ) and the Nasdaq 100 index (random variable η) on N sequential days:
ξ: (X1, X2, ..., XN)
η: (Y1, Y2, ..., YN)

Based on this information, we can calculate the sample mean and variance of each of them, as well as the sample mean of their product:
E(ξ)=(ΣXk)/N
Var(ξ)={Σ[Xk−E(ξ)]²}/(N−1)
E(η)=(ΣYk)/N
Var(η)={Σ[Yk−E(η)]²}/(N−1)
E(ξ·η)=(ΣXk·Yk)/N

Using the results of the above calculations, we can evaluate the covariance and the correlation coefficient of our two random variables.
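
For instance, here is a minimal Python sketch that follows the formulas above literally; the numbers are made up purely for illustration and are not real market data:

import numpy as np

# Paired observations (Xk, Yk), k = 1..N, recorded under the same conditions
X = np.array([152.3, 153.1, 151.8, 154.0, 155.2, 154.7])        # e.g. daily closing prices
Y = np.array([4510.0, 4525.0, 4498.0, 4560.0, 4587.0, 4571.0])  # e.g. index values on the same days
N = len(X)

mean_X, mean_Y = X.sum() / N, Y.sum() / N
var_X = ((X - mean_X) ** 2).sum() / (N - 1)   # sample variances, as defined above
var_Y = ((Y - mean_Y) ** 2).sum() / (N - 1)
mean_XY = (X * Y).sum() / N                   # sample mean of the product

cov_XY = mean_XY - mean_X * mean_Y            # Cov = E(ksi*eta) - E(ksi)*E(eta)
R = cov_XY / np.sqrt(var_X * var_Y)           # sample correlation coefficient
print(cov_XY, R)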

A reasonable question arises now: how, based on the correlation coefficient, can we make a judgment about the dependency between our two random variables, and what is the reason for this dependency?

As we know, independent random variables have a correlation coefficient of zero. This does not mean that, if the correlation is zero, our random variables are necessarily independent. However, it is considered reasonable to assume that a sample correlation close to zero (traditionally, between −0.1 and +0.1) indicates no observable dependency between the random variables.

At the other end of the spectrum, we know that linearly connected random variables (η=A·ξ) have a correlation coefficient of +1 (for a positive factor A) or −1 (for a negative factor A). This does not mean that, if the correlation is +1 or −1, our random variables are necessarily linearly connected. However, it is considered reasonable to assume that a sample correlation close to 1 in absolute value (traditionally, in the interval [−1,−0.9] or [0.9,1]) indicates a very strong dependency between the random variables.

Correlation coefficients that are neither very small nor very large are interpreted subjectively and differently in different practical situations. Some textbooks recommend qualifying a correlation in excess of 0.7 in absolute value as "strong", from 0.4 to 0.7 as "moderate", from 0.1 to 0.4 as "weak", and a correlation of less than 0.1 in absolute value as "absent".
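
Such a qualification can be written down, for example, as a small Python helper; the function name and the treatment of the boundary values are our own choices for this sketch, not a standard convention:

def describe_correlation(r):
    # Qualify a correlation coefficient using the textbook thresholds quoted above
    a = abs(r)
    if a > 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if a < 0.1:
        return "absent"
    if a < 0.4:
        return "weak"
    if a < 0.7:
        return "moderate"
    return "strong"

print(describe_correlation(0.85))    # strong
print(describe_correlation(-0.25))   # weak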

A couple of words about causality and correlation. These are two completely different concepts, though not unrelated under certain circumstances. Generally speaking, we can hypothesize that, if two random variables are strongly correlated, one of them might be a cause of the other. But not necessarily! First of all, which of the two is the cause and which is the consequence? Secondly, they both might be consequences of some other variable that is a cause of both.

If we have to decide which of two strongly correlated random variables is the cause, timing can help: the one observed earlier might be a cause (though not necessarily), while the other, observed later, cannot be a cause even theoretically.
