Monday, June 27, 2016

Unizor - Statistical Regression - Linear Regression

Notes to a video lecture on http://www.unizor.com


Linear Regression

In many cases we suspect that one random variable is dependent on another.
Sometimes the dependence can be expressed as a formula. Consider two people measuring the temperature at the same place and at the same time, but one measures it in degrees Fahrenheit (TF), while the other measures it in degrees Celsius (TC).
Obviously, we are talking about two random variables precisely related to each other by the formula:
TC = (5/9)·(TF−32)
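
As a quick illustration, here is a minimal Python sketch of this exact conversion (the function name is ours, chosen for illustration):

def fahrenheit_to_celsius(tf):
    # exact dependence from the formula above: TC = (5/9)·(TF−32)
    return (5.0 / 9.0) * (tf - 32.0)

print(fahrenheit_to_celsius(212.0))   # prints 100.0, the boiling point of water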

In most cases, however, the dependence between random variables is more complex and, most often, cannot be expressed as a formula or a function. The reason for this is that there are many factors influencing values of random variables, all contributing to the results of random experiments.

Consider such random variables as volcanic activity and the average annual temperature on the planet. There must be some kind of dependency between them, but so many other factors besides volcanic activity affect the average annual temperature that all we can definitely say is that some dependency exists; it cannot be expressed as a definitive formula.

In other cases we can talk about a cause that is completely under our control and an effect that we can measure.
For instance, the cause can be the amount of money parents spend on the education of their child, and the effect is the amount of knowledge (measured in some units) the child obtains as a result of this education. The cause (the amount of money spent on education) is under the control of the parents, while the effect (knowledge) certainly depends on many additional factors besides the money, which justifies considering it a random variable.

Another example is forecasting the air temperature one hour from now based on the current temperature, taking into account the time of day (morning or afternoon) and the month of the year. During this one-hour period the temperature will change up or down, depending on known factors (time of day and season) and on random fluctuations that we don't know about. So we can assume there is a strong dependency between the current and future temperatures and, if we knew this dependency, we could predict with some precision the temperature in one hour based on the current temperature, the time of day and the season.

Of course, the more causes influencing the effect we are aware of, the better our understanding of the dependency between them, and, therefore, the better equipped we are to achieve or predict a certain effect.
For example, the positive result of medical treatment of some illness depends on a whole complex of measures: drugs, medical procedures, food, genetic factors, physical exercises, etc. All these factors are important and, if we knew how each of them (being either known or largely under our control) contributes to the success of the treatment, we would be able to treat a patient very efficiently and with a good outcome.

Let's bring some Mathematical Statistics into the picture.
First of all, we simplify the problem by considering only the case of one contributing factor X (which we consider as a cause: an independent variable under our control or known in advance) and one observed effect Y of this cause, a random variable that depends on X and on some unknown factors summarized in a random variable ε that shifts the value of Y randomly up or down, with a Normal distribution of probabilities with expectation of zero.
We further assume that the dependency between X and Y is linear with unknown coefficients a and b. So, the entire dependency can be expressed as
Y = a·X + b + ε
This type of relationship is called linear regression.
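
To make the model concrete, here is a small Python sketch that simulates observations from such a model; the particular values of a, b, the range of X and the standard deviation of ε are arbitrary choices for illustration:

import random

a, b = 2.0, 5.0      # "true" coefficients, normally unknown to us
n = 100              # number of observations
xs = [random.uniform(0.0, 10.0) for _ in range(n)]       # values of X
# ε is Normal with expectation zero; standard deviation 1.0 is arbitrary
ys = [a * x + b + random.gauss(0.0, 1.0) for x in xs]    # observed Y = a·X + b + ε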

Our purpose is to determine the coefficients a and b of the linear regression based on known values of the independent variable X: x1, x2, ..., xn and observed values of the dependent variable Y: y1, y2, ..., yn.
The basis for determining these coefficients should be the maximum closeness of values {a·xi+b} to values {yi}. Since the difference between Y and a·X+b is a Normal random variable ε, the best coefficients a and b are those that minimize the empirical variance of the random variable ε.

The next simplification can be achieved by replacing the random variables X and Y with the random variables X−E(X) and Y−E(Y).
Consider averaging transformations:
E(Y) = E(a·X + b + ε) =
= a·E(X) + b + E(ε) =
(since we assumed that ε is a Normal random variable with mathematical expectation equal to zero)
= a·E(X) + b

Therefore,
b = E(Y) − a·E(X)
and
Y = a·X + E(Y) − a·E(X) + ε
or
Y−E(Y) = a·[X−E(X)] + ε

As you see, if we knew E(X) and E(Y), our problem would be easier, since there would be only one parameter a left to determine to minimize the variance of ε.
We do not know the precise values of the mathematical expectations of X and Y, but we do have their statistics, which means that we can approximate these values with the arithmetic means of the statistical observations:
E(X) ≅ (x1+x2+...+xn)/n = U
E(Y) ≅ (y1+y2+...+yn)/n = V

Hopefully, we do not lose much precision by stating that
Y−V = a·(X−U) + ε
(where U and V are the known arithmetic means of the available statistical values of X and Y)

Let's replace the statistical values
x1, x2, ..., xn with
X1=x1−U, X2=x2−U, ..., Xn=xn−U
and replace
y1, y2, ..., yn with
Y1=y1−V, Y2=y2−V, ..., Yn=yn−V
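
Continuing the simulated data from the Python sketch above, this centering step might look like this:

U = sum(xs) / len(xs)      # arithmetic mean of x-values, approximates E(X)
V = sum(ys) / len(ys)      # arithmetic mean of y-values, approximates E(Y)
Xs = [x - U for x in xs]   # centered values Xk = xk − U
Ys = [y - V for y in ys]   # centered values Yk = yk − V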

Our problem is to find such a coefficient a that the values
Y1−a·X1, Y2−a·X2, ..., Yn−a·Xn
have the smallest sample variance.

Since the arithmetic mean of these values is zero (remember, we subtracted U and V from the original statistical values to get them) and since the number of experiments n is constant, we just have to minimize the sum of squares of these numbers. This sum of squares is a quadratic function of a, and we can easily find its minimum.

s² = Σ(Yk−a·Xk)² =
= a²·ΣXk² − 2a·ΣXk·Yk + ΣYk²

This quadratic polynomial in a has a positive leading coefficient ΣXk², so its minimum is where the derivative 2a·ΣXk² − 2·ΣXk·Yk equals zero, that is at the point
a = ΣXk·Yk / ΣXk²

Knowing the coefficient a, we can determine the coefficient b from the equation
E(Y) = a·E(X) + b
(approximating E(X) and E(Y) with the means U and V, so that b = V − a·U).
That fully determines the coefficients of the linear regression.
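
Continuing the Python sketch, the two coefficients can be computed from the centered data exactly as derived above:

a_hat = sum(X * Y for X, Y in zip(Xs, Ys)) / sum(X * X for X in Xs)  # a = ΣXk·Yk / ΣXk²
b_hat = V - a_hat * U   # b = E(Y) − a·E(X), with E(X), E(Y) approximated by U, V
print(a_hat, b_hat)     # should come out close to the true values a=2.0, b=5.0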

How can we determine the quality of approximation or prediction of values of the random variable Y based on values of X and the regression formula we have just determined?
Knowing the regression coefficients and the sample data for X and Y, we can derive the sample data for the error
ε = Y−a·X−b.
From these sample data for ε we can calculate its sample variance and standard deviation σ. Taking 2σ as a margin of error with 95% certainty, we can compare it with the empirical mean value of the random variable Y and determine the ratio of the margin of error to Y's mean: 2σ/E(Y). If it is small (say, 0.05 or less), we can be satisfied with our regression analysis and conclude that the regression formula adequately represents the dependency between X and Y and can be used for prediction of future values of Y based on observed values of X.
Obviously, which ratio of the margin of error to the mean value of Y should be considered satisfactory is an individual issue and should be decided based on circumstances, thus introducing an element of subjectivity into this theory.
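
As a sketch of this quality check in Python, continuing the example above (the 0.05 threshold is the illustrative value mentioned in the text):

import math

residuals = [y - (a_hat * x + b_hat) for x, y in zip(xs, ys)]      # sample data for ε
sigma = math.sqrt(sum(e * e for e in residuals) / len(residuals))  # standard deviation σ of ε
ratio = 2.0 * sigma / (sum(ys) / len(ys))   # margin of error 2σ relative to Y's empirical mean
print("adequate" if ratio <= 0.05 else "too imprecise", ratio)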
