Friday, July 22, 2016

Unizor - Statistics - Correlation Problems 1





Notes to a video lecture on http://www.unizor.com


Statistical Correlation -
Problems 1


Assume two different experiments with numerical results performed under the same conditions.
Statistical results of these experiments are used to determine if there is a dependency between them.

For example, we measure the amount of salt added to the cold water (random variable S) and the time it takes to boil it (random variable T).

The results of numerous experiments are collected in a table. Each row corresponds to an observed value of the first experiment (S), each column corresponds to an observed value of the second experiment (T), and the cell where a row and a column cross holds the number of times that pair of results occurred together.
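As a rough illustration of this layout, such a table could be held as a nested list of counts. The numbers and the names s_values, t_values and counts below are hypothetical, chosen only to show the structure; they are not data from the lecture.

```python
# Hypothetical joint frequency table (illustrative numbers only).
s_values = [1, 2]          # row labels: observed values of S
t_values = [10, 20, 30]    # column labels: observed values of T
counts = [
    [5, 3, 0],   # row for S = 1
    [1, 4, 7],   # row for S = 2
]

# Row sums: how many times each value of S was observed.
s_counts = [sum(row) for row in counts]
# Column sums: how many times each value of T was observed.
t_counts = [sum(col) for col in zip(*counts)]
# Total number of observations.
n = sum(s_counts)

print(s_counts, t_counts, n)
```

The row and column sums are the per-value counts that are later used to compute the mean values of S and T.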

Using these results, calculate the correlation coefficient between the random variables representing these two experiments.

Problem A

Values of random variables S and T are in the table below.

        T=101   T=102   T=104
S=1       40       0       0
S=2        0      35       0
S=4        0       0      25

Calculate correlation coefficient R(S,T) of these two random variables.

Solution

To calculate the correlation coefficient R(S,T) we need to calculate the following mathematical expectations E() and variances Var():
E(S) - mean value of S
E(T) - mean value of T
E(S·T) - mean value of S·T
Var(S) - variance of S
Var(T) - variance of T
and put them into the formula for the correlation R(S,T):
R(S,T) = Cov(S,T) / √(Var(S)·Var(T))

where
Cov(S,T) = E(S·T) − E(S)·E(T)
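The formula can also be checked numerically. The following is a minimal sketch, not part of the lecture: the function name correlation_from_table and its argument layout are assumptions made for illustration. It takes the row labels, the column labels and the table of counts, and applies the same definitions of E(), Var() and Cov().

```python
from math import sqrt

def correlation_from_table(s_values, t_values, counts):
    """Correlation R(S,T) from a table where counts[i][j] is the number
    of times S == s_values[i] and T == t_values[j] occurred together."""
    # Flatten the table into (s, t, count) cells.
    cells = [
        (s, t, counts[i][j])
        for i, s in enumerate(s_values)
        for j, t in enumerate(t_values)
    ]
    n = sum(c for _, _, c in cells)  # total number of observations

    # Mean values E(S), E(T) and E(S*T).
    e_s = sum(s * c for s, _, c in cells) / n
    e_t = sum(t * c for _, t, c in cells) / n
    e_st = sum(s * t * c for s, t, c in cells) / n

    # Covariance and variances, following the definitions in the text.
    cov = e_st - e_s * e_t
    var_s = sum(c * (s - e_s) ** 2 for s, _, c in cells) / n
    var_t = sum(c * (t - e_t) ** 2 for _, t, c in cells) / n

    return cov / sqrt(var_s * var_t)

# Problem A table: rows S = 1, 2, 4; columns T = 101, 102, 104.
r_a = correlation_from_table(
    [1, 2, 4],
    [101, 102, 104],
    [[40, 0, 0], [0, 35, 0], [0, 0, 25]],
)
print(round(r_a, 6))
```

The sketch does not guard against a zero variance (a table where S or T takes only one value), in which case the correlation is undefined and the division fails.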

S=1 in 40+0+0=40 cases,
S=2 in 0+35+0=35 cases,
S=4 in 0+0+25=25 cases,
Total number of observations
N = 40+35+25 = 100
Therefore, mean
E(S) =
= (1·40+2·35+4·25)/100 = 2.1


T=101 in 40+0+0=40 cases,
T=102 in 0+35+0=35 cases,
T=104 in 0+0+25=25 cases,
Total number of observations is still the same
N = 40+35+25 = 100
Therefore, mean
E(T) =
=(101·40+102·35+104·25)/100
= 102.1


There are only three possible combinations of simultaneous values of S and T:
S·T=1·101=101 in 40+0+0=40 cases,
S·T=2·102=204 in 0+35+0=35 cases,
S·T=4·104=416 in 0+0+25=25 cases,
Total number of observations is still the same
N = 40+35+25 = 100
Therefore, mean
E(S·T) =
=(101·40+204·35+416·25)/100
= 215.8


Now we can calculate the covariance between S and T:
Cov(S,T) =
= 215.8−2.1·102.1 = 1.39


Next is the calculation of the variances of S and T.

Var(S) =
= [40·(1−2.1)² +
+ 35·(2−2.1)² +
+ 25·(4−2.1)²] / 100 =
= 1.39


Var(T) =
= [40·(101−102.1)² +
+ 35·(102−102.1)² +
+ 25·(104−102.1)²] / 100 =
= 1.39


The correlation coefficient between S and T is
R(S,T) = 1.39 / √(1.39·1.39) = 1

This result might have been predicted since, obviously, within the framework of our experiments there is a linear dependency between S and T:
T = 100 + S
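
The fact that Cov(S,T), Var(S) and Var(T) all came out equal to 1.39 is not a coincidence. A short derivation for the relation T = 100 + S is sketched here, using only the standard facts that adding a constant changes neither a variance nor a covariance, and that Cov(S,S) = Var(S):

```latex
\begin{aligned}
\mathrm{Cov}(S,T) &= \mathrm{Cov}(S,\,100+S) = \mathrm{Cov}(S,S) = \mathrm{Var}(S),\\
\mathrm{Var}(T)   &= \mathrm{Var}(100+S) = \mathrm{Var}(S),\\
R(S,T) &= \frac{\mathrm{Cov}(S,T)}{\sqrt{\mathrm{Var}(S)\cdot\mathrm{Var}(T)}}
        = \frac{\mathrm{Var}(S)}{\sqrt{\mathrm{Var}(S)\cdot\mathrm{Var}(S)}} = 1.
\end{aligned}
```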

Problem B

As we know, there is a limit to the amount of salt that can be dissolved in water. As we add salt, its concentration in the water can reach a maximum, after which new salt no longer dissolves. This is called saturation.
After that, the temperature of boiling will no longer increase, since the concentration of salt remains the same.

Assume we make three experiments, as in the problem above, but after the second experiment the water has reached the point of saturation.

Values of random variables S and T are in the table below.

        T=111   T=112   T=112
S=11      40       0       0
S=12       0      35       0
S=14       0       0      25

Calculate correlation coefficient R(S,T) of these two random variables.

Solution

S=11 in 40+0+0=40 cases,
S=12 in 0+35+0=35 cases,
S=14 in 0+0+25=25 cases,
Total number of observations
N = 40+35+25 = 100
Therefore, mean
E(S) =
= (11·40+12·35+14·25)/100 =
= 12.1


T=111 in 40+0+0=40 cases,
T=112 in 0+35+0=35 cases,
T=112 in 0+0+25=25 cases,
Total number of observations is still the same
N = 40+35+25 = 100
Therefore, mean
E(T) =
=(111·40+112·35+112·25)/100
=(111·40+112·60)/100 =
= 111.6


There are only three possible combinations of simultaneous values of S and T:
S·T=11·111 in 40+0+0=40 cases,
S·T=12·112 in 0+35+0=35 cases,
S·T=14·112 in 0+0+25=25 cases,
Total number of observations is still the same
N = 40+35+25 = 100
Therefore, mean
E(S·T) =
= (11·111·40 +
+ 12·112·35 +
+ 14·112·25)/100 =
= 1350.8


Now we can calculate the covariance between S and T:
Cov(S,T) =
= 1350.8−12.1·111.6 = 0.44


Next is the calculation of the variances of S and T.

Var(S) =
= [40·(11−12.1)² +
+ 35·(12−12.1)² +
+ 25·(14−12.1)²] / 100 =
= 1.39


Var(T) =
= [40·(111−111.6)² +
+ 35·(112−111.6)² +
+ 25·(112−111.6)²] / 100 =
= 0.24


The correlation coefficient between S and T is
R(S,T) = 0.44 / √(1.39·0.24) ≅
≅ 0.76


The dependency between S and T is no longer linear, which is why the correlation is smaller than 1.
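
The arithmetic of Problem B can be double-checked with a short script. This is only a sketch under the assumption that each non-empty cell of the table is expanded into that many individual (S, T) observations; the variable names are illustrative and not part of the lecture.

```python
from math import sqrt

# Problem B data as (S, T, count) triples; empty cells are omitted.
cells = [(11, 111, 40), (12, 112, 35), (14, 112, 25)]

# Expand into one (S, T) pair per observation.
pairs = [(s, t) for s, t, count in cells for _ in range(count)]
n = len(pairs)

# Mean values E(S), E(T) and E(S*T).
mean_s = sum(s for s, _ in pairs) / n
mean_t = sum(t for _, t in pairs) / n
mean_st = sum(s * t for s, t in pairs) / n

# Covariance and variances, as defined in the solution.
cov = mean_st - mean_s * mean_t
var_s = sum((s - mean_s) ** 2 for s, _ in pairs) / n
var_t = sum((t - mean_t) ** 2 for _, t in pairs) / n

r = cov / sqrt(var_s * var_t)
print(round(cov, 2), round(var_s, 2), round(var_t, 2), round(r, 2))
```

The intermediate values should agree with the hand calculation above (0.44, 1.39 and 0.24), and the resulting correlation rounds to 0.76.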
