Thursday, June 30, 2016

Unizor - Statistics - Linear Regression - Problem 2





Notes to a video lecture on http://www.unizor.com


Linear Regression - Problem 2

Consider the relationship between the total distance traveled by cars in the US during certain periods and the number of fatalities in car accidents during the same periods.
Intuitively, they must be related.

Assume a linear regression model:
Y = a·X + b + ε
where the independent variable X represents the number of miles all cars in the US traveled during some decade (10-year period) and Y represents the number of fatalities during the same period.

The real data for X and Y were taken from Car Accidents.
For each decade, starting with 1925-1934 and ending with 2005-2014, the total number of miles traveled and the number of fatalities are below.

Decade       Miles  Fatalities
1925-1934     1831      278716
1935-1944     2632      314372
1945-1954     4354      325536
1955-1964     7120      382171
1965-1974    10997      513981
1975-1984    17093      470533
1985-1994    20871      435194
1995-2004    27049      421979
2005-2014    29868      363216

What are the coefficients of the linear regression of the number of fatalities on the total miles traveled?
How big is the error of this regression?

Solution

So, we have n=9 values of miles traveled X
x1, x2, ...xn
and the corresponding values of the number of fatalities Y
y1, y2, ...yn.

We have introduced two averages of the sample data:
U=(x1+x2+...+xn)/n = 13535
V=(y1+y2+...+yn)/n = 389522

Using new variables
Xk = xk − U and Yk = yk − V
(where index k is from 1 to n=9)
we came up with the best possible value for the coefficient a in the formula for linear regression as
a = ΣXk·Yk / ΣXk² ≅ 3.3944

Since averaging nullifies the effect of random error ε, we consider the following equation as true to find coefficient b:
V = aU+b
from which we derive
b = V − aU ≅ 343579

Now we have a regression equation:
Y = 3.3944·X + 343579 + ε
from which we can calculate sample data for the random error ε as
ε = Y − a·X − b

The resulting values of ε should sum to approximately zero (and they do: the sum is −1.836, which is good).
Their sample standard deviation equals 68508, which is a rather big number relative to the average value of Y, 389522. This means that the quality of our regression is rather poor and it is not trustworthy for any predictions.
Intuitively, this can be seen from the fact that, starting in 1975, the number of fatalities goes down while the number of miles traveled continues to increase. This might be related to better accident protection devices, like airbags, that manufacturers started installing in cars.
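
The calculation above can be retraced with a short script. This is only a sketch of the arithmetic described in these notes, not the spreadsheet used in the lecture; the variable names are chosen here for illustration, and the error values are computed with the rounded coefficients quoted above.

```python
from statistics import mean, stdev

# Decade totals copied from the table in these notes
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

U = mean(miles)        # average of X
V = mean(fatalities)   # average of Y

# Centered variables Xk = xk - U and Yk = yk - V
X = [x - U for x in miles]
Y = [y - V for y in fatalities]

# Coefficients of the regression line
a = sum(Xk * Yk for Xk, Yk in zip(X, Y)) / sum(Xk * Xk for Xk in X)
b = V - a * U

# Sample values of the error, using the rounded coefficients from the notes
a_rounded, b_rounded = round(a, 4), round(b)
errors = [y - a_rounded * x - b_rounded for x, y in zip(miles, fatalities)]

sigma = stdev(errors)   # sample standard deviation (n - 1 in the denominator)
ratio = 2 * sigma / V   # two-sigma margin of error relative to the average of Y

print(f"U = {U}, V = {V}")
print(f"a = {a:.4f}, b = {b:.0f}")
print(f"sum of errors = {sum(errors):.3f}")
print(f"sigma = {sigma:.0f}, 2*sigma/V = {ratio:.2f}")
```

The ratio 2σ/V comes out at roughly 0.35, far above the 0.05 level suggested as satisfactory in the Linear Regression lecture, which agrees with the conclusion that this regression is a poor predictor.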

Tuesday, June 28, 2016

Unizor - Linear Regression - Problem 1



Linear Regression - Problem 1

Consider a linear regression model described in the previous lecture:
Y = a·X + b + ε
where the independent variable X is represented by sample data
x1, x2, ...xn
and the observed values of the dependent variable Y are
y1, y2, ...yn.

We have introduced two averages of the sample data:
Ave(x)=(x1+x2+...+xn)/n = U
Ave(y)=(y1+y2+...+yn)/n = V

Using new variables
Xk = xk − U and Yk = yk − V
(where index k is from 1 to n)
we came up with the best possible value for the coefficient a in the formula for linear regression as
a = ΣXk·Yk / ΣXk²

Problem

Using the sample averaging function Ave() applied as
Ave(x)=(x1+x2+...+xn)/n
Ave(y)=(y1+y2+...+yn)/n
Ave(xy)=(x1y1+...+xnyn)/n
Ave(x²)=(x1²+...+xn²)/n
prove that the expression for the coefficient a in the formula for linear regression in terms of the original sample data xk and yk looks like
a = [Ave(xy)−Ave(x)Ave(y)] / [Ave(x²)−Ave²(x)]
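
Before attempting the proof, it may help to check numerically that the two expressions agree. The sketch below does this for a small made-up sample; the data values and function names are hypothetical, and a numerical match is of course not a proof.

```python
def ave(values):
    """Sample averaging function Ave()."""
    return sum(values) / len(values)

def slope_centered(xs, ys):
    """a = ΣXk·Yk / ΣXk² with Xk = xk − U and Yk = yk − V."""
    U, V = ave(xs), ave(ys)
    numerator = sum((x - U) * (y - V) for x, y in zip(xs, ys))
    denominator = sum((x - U) ** 2 for x in xs)
    return numerator / denominator

def slope_from_averages(xs, ys):
    """a = [Ave(xy) − Ave(x)Ave(y)] / [Ave(x²) − Ave²(x)]."""
    ave_xy = ave([x * y for x, y in zip(xs, ys)])
    ave_x2 = ave([x * x for x in xs])
    return (ave_xy - ave(xs) * ave(ys)) / (ave_x2 - ave(xs) ** 2)

# Hypothetical sample, used only to compare the two formulas
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.1, 4.9, 9.2, 14.8, 23.5]

a1 = slope_centered(xs, ys)
a2 = slope_from_averages(xs, ys)
print(a1, a2)   # the two values should agree up to rounding error
```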

Monday, June 27, 2016

Unizor - Statistical Regression - Linear Regression







Linear Regression

In many cases we suspect that one random variable is dependent on another.
Sometimes the dependence can be expressed as a formula. Suppose two people measure the temperature at the same place and at the same time, but one measures it in degrees Fahrenheit (TF), while the other measures it in degrees Celsius (TC).
Obviously, we are talking about two random variables precisely related to each other by the formula:
TC = (5/9)·(TF−32)

In most cases, however, the dependence between random variables is more complex and, most often, cannot be expressed as a formula or a function. The reason for this is that there are many factors influencing values of random variables, all contributing to the results of random experiments.

Consider such random variables as volcanic activity and average annual temperature on the planet. There must be some kind of dependency, but so many other factors besides volcanic activity affect the average annual temperature that all we can definitely say is that there is some dependency between these two, but it cannot be expressed as a definitive formula.

In other cases we can talk about cause, being completely under our control, and effect that we can measure.
For instance, the cause can be the amount of money some parents spend on the education of their child, and the effect is the amount of knowledge (measured in some units) the child obtains as a result of this education. The cause (amount of money spent on education) is under the control of the parents, while the effect (knowledge) certainly depends on many additional factors besides the money, which justifies considering it as a random variable.

Another example is forecasting air temperature in one hour from now based on current temperature, considering we take into account the time of day (morning or afternoon) and month of the year. During this one hour period temperature will change up or down, depending on known factors (time of the day and season) and some random fluctuations that we don't know about, so we can assume that there is a strong dependency between current and future temperatures and, if we know this dependency, we can predict with some precision the temperature in one hour based on the current temperature, time of the day and a season.

Of course, the more causes influencing the effect we are aware of, the better we understand the dependency between them and, therefore, the better equipped we are to achieve or predict a certain effect.
For example, the positive results of medical treatment of some illness depend on the whole complex of measures - drugs, medical procedures, food, genetic factors, physical exercises etc. All these factors are important and, if we knew how each of these factors (that are either known or pretty much under our control) contributes to success of the entire treatment of the illness, we would be able to treat a patient very efficiently and with good outcome.

Let's bring some Mathematical Statistics into the picture.
First of all, we simplify the problem by considering only the case of one contributing factor X (which we consider as a cause, an independent variable under our control or known in advance) and one observed effect Y of this cause - a random variable that depends on X and on some unknown factors summarized in a random variable ε that shifts the value of Y randomly up or down, with Normal distribution of probabilities and expectation of zero.
We further assume that the dependency between X and Y is linear with unknown coefficients a and b. So, the entire dependency can be expressed as
Y = a·X + b + ε
This type of relationship is called linear regression.

Our purpose is to determine coefficients a and b of linear regression based on known values of the independent variable X - x1, x2, ...xn - and observed values of the dependent variable Y - y1, y2, ...yn.
The basis for determining these coefficients should be the maximum closeness of values {a·xi+b} to values {yi}. Since the difference between Y and a·X+b is a Normal random variable ε, the best coefficients a and b are those that minimize the empirical variance of random variable ε.

The next simplification can be achieved by replacing random variables X and Y with random variables X−E(X) and Y−E(Y).
Consider averaging transformations:
E(Y) = E(a·X + b + ε) =
= a·E(X) + b + E(ε) =
(since we assumed that ε is a Normal random variable with mathematical expectation equal to zero)
= a·E(X) + b

Therefore,
b = E(Y) − a·E(X)
and
Y = a·X + E(Y) − a·E(X) + ε
or
Y−E(Y) = a·[X−E(X)] + ε

As you see, if we knew E(X) and E(Y), our problem would be easier, since there is only one parameter a to be determined to minimize the variance of ε.
We do not know precise values for mathematical expectations of X and Y, but we do have their statistics, which means that we can approximate these values with arithmetic means of statistical observations:
E(X) ≅ (x1+x2+...+xn)/n = U
E(Y) ≅ (y1+y2+...+yn)/n = V

Hopefully, we do not lose much precision stating that
Y−V = a·(X−U) + ε
(where U and V are known arithmetic means of available statistical values of X and Y)

Let's replace statistical values
x1, x2, ...xn with
X1=x1−U, X2=x2−U, ...Xn=xn−U
and replace
y1, y2, ...yn with
Y1=y1−V, Y2=y2−V, ...Yn=yn−V

Our problem is to find such a coefficient a that values
Y1−a·X1, Y2−a·X2, ...Yn−a·Xn
have the smallest sample variance.

Since arithmetic mean of these values is zero (remember, we subtracted U and V from original statistical values to get these) and since the number of experiments n is constant, we just have to minimize the sum of squares of these numbers. This sum of squares is just a quadratic function of a and we can easily find its minimum.

s² = Σ(Yk−a·Xk)² =
= a²·ΣXk²−2a·ΣXk·Yk+ΣYk²

Minimum of this quadratic polynomial is at point
a = ΣXk·Yk / ΣXk²
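
The location of this minimum can be checked by differentiating the quadratic with respect to a, a step these notes leave implicit. A sketch in the same symbols:

```latex
\begin{aligned}
s^2(a) &= a^2\sum_k X_k^2 \;-\; 2a\sum_k X_k Y_k \;+\; \sum_k Y_k^2,\\[4pt]
\frac{d\,s^2}{da} &= 2a\sum_k X_k^2 \;-\; 2\sum_k X_k Y_k \;=\; 0
\quad\Longrightarrow\quad
a = \frac{\sum_k X_k Y_k}{\sum_k X_k^2}.
\end{aligned}
```

The coefficient of a² is ΣXk², which is positive unless all xk are equal, so this stationary point is indeed a minimum.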

Knowing coefficient a, we can determine coefficient b from equation
E(Y) = a·E(X) + b
That fully determines the coefficients of linear regression.

How can we determine the quality of approximation or prediction of values of random variable Y based on values of X and the regression formula we have just determined?
Knowing the regression coefficients and sample data for X and Y, we can derive the sample data for the error
ε = Y−a·X−b.
From these sample data for ε we can calculate its sample variance and standard deviation σ. Having 2σ as a margin of error with 95% certainty, we can compare it with the empirical mean value of random variable Y and determine the ratio of the margin of error to Y's mean, 2σ/E(Y). If it's small (say, 0.05 or less), we can be satisfied with our regression analysis and conclude that the formula of regression adequately represents the dependency between X and Y and can be used for prediction of future values of Y based on observed values of X.
Obviously, which ratio of a margin of error to a mean value of Y should be considered as satisfactory is an individual issue and should be defined based on circumstances, thus introducing an element of subjectivity into this theory.
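
As an illustration of this quality check, here is a minimal sketch that fits the regression and computes the ratio 2σ/mean for synthetic data. The data are generated with assumed coefficients a=2, b=100 and Normal noise with standard deviation 1; none of these numbers come from the lecture, and the function names are hypothetical.

```python
import random
from statistics import mean, stdev

def fit_line(xs, ys):
    """Coefficients a and b of linear regression, as derived in these notes."""
    U, V = mean(xs), mean(ys)
    a = (sum((x - U) * (y - V) for x, y in zip(xs, ys))
         / sum((x - U) ** 2 for x in xs))
    return a, V - a * U

def error_ratio(xs, ys, a, b):
    """Ratio of the two-sigma margin of error to the sample mean of Y."""
    errors = [y - a * x - b for x, y in zip(xs, ys)]
    return 2 * stdev(errors) / mean(ys)

# Synthetic data: Y = 2·X + 100 + ε with ε ~ Normal(0, 1) (assumed values)
random.seed(1)
xs = [float(x) for x in range(1, 51)]
ys = [2.0 * x + 100.0 + random.gauss(0.0, 1.0) for x in xs]

a, b = fit_line(xs, ys)
ratio = error_ratio(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, 2*sigma/mean = {ratio:.4f}")
print("acceptable" if ratio <= 0.05 else "poor")
```

With noise this small relative to the values of Y, the ratio stays well under 0.05, so the regression would be accepted under the rule of thumb above. The threshold itself remains a judgment call.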

Thursday, June 16, 2016

Unizor - Statistics - Averages





Unizor - Creative Minds through Art of Mathematics - Math4Teens


Statistical Averages

This lecture is, actually, a refresher of things studied in the course of Theory of Probabilities, but applied to Statistics.

As we noted, the purpose of Mathematical Statistics is, using the past observations, to evaluate the probability of certain events. More precisely, using the results of random experiments, we need to know the distribution of probabilities of a random variable.

In many cases we can assume that the random variable we deal with is distributed Normally, and the Normal distribution, as we know, is fully defined by its mathematical expectation and variance. In other, not Normal, cases of distribution these two parameters also play a very important role, representing a concentration point of the value of a random variable and its spread around this concentration point. So, it's not surprising that evaluation of these parameters is a very important statistical task.

The typical approach to evaluation of mathematical expectation is the calculation of the arithmetic average of the sample data. Let's examine the validity of this approach in the case where we conduct a certain number of experiments with our random variable, provided the experiments are independent of each other and the conditions of the experiments do not change.

To make these two requirements less abstract, let's exemplify cases when these conditions are not present.
For instance, you roll a die. If, for the next roll, you position the die exactly on the number you got on the previous roll, you establish a connection between the rolls and the experiments are not independent.
As far as conditions of experiment changing, we can examine the bow and arrow target shooting at an outside range with wind affecting the results. Obviously, conditions are changing unpredictably with the wind and the results will not exactly correspond to the level of skills of participants.

So, we assume independence of the random experiments and stability of conditions.

Assume we conduct N experiments with the same random variable ξ and obtain N values Xi (i = 1, 2,...N).
How can we interpret the arithmetic average of these data
m = (X1+X2+...+XN)/N ?

The simplest approach is to consider a set of independent random variables ξ1, ξ2,...ξN distributed identically with ξ. Now assume a new random variable
η = (ξ1+ξ2+...+ξN)/N
Obviously, our average m of the N results of experiments with random variable ξ can be considered as a single result of an experiment with a random variable η.
It is an extremely important point that this approach is the base for all subsequent logical statements, because there is no better approximation of the random variable with unknown distribution of probabilities than its real experimental value.

But why is a single result of an experiment with η better than N results of experiments with ξ? The reason is simple - random variable η has the same mean value as ξ, but much smaller variance; that is, its values have a much smaller spread around its mean value and, therefore, are a better approximation of this mean value.
The proof of this is based on the known properties of mathematical expectation and variance of a sum of random variables discussed in the lectures Probability - Random Variables - Expectation of Sum and Probability - Random Variables - Variance of Sum. 
In particular, the expectation of a sum of random variables equals the sum of their expectations, and the variance of a sum of independent random variables equals the sum of their variances.

In our case, if expectation of random variable ξ is μ = E(ξ), expectation of random variable η is
E(η)=E[(ξ1+ξ2+...+ξN)/N]=
=[E(ξ1)+E(ξ2)+...+E(ξN)]/N=
(since all ξi are independent and identically distributed with ξ)
= N·E(ξ)/N = E(ξ) = μ
So, expectations of ξ and η are the same.

If variance of ξ is σ² = Var(ξ), variance of η is
Var(η)=Var[(ξ1+...+ξN)/N]=
=Var(ξ1+...+ξN)/N²=
(since all ξi are independent and identically distributed with ξ)
= N·Var(ξ)/N² = 
=Var(ξ)/N = σ²/N
So, variance of η is N times smaller than variance of ξ.

So, the distribution of random variable η is more tightly concentrated around its mean value than the distribution of ξ around exactly the same mean value. This is the main reason why the average of N observations of values of ξ is a good approximation for the mean value of ξ. And the quality of this approximation improves as N grows.
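
The two properties derived above (same expectation, variance N times smaller) can be observed in a simulation. The sketch below uses rolls of a fair die as the random variable ξ; the choice of a die, N=25 and 5000 trials are assumptions made here purely for illustration.

```python
import random
from statistics import mean, variance

random.seed(0)
N = 25          # number of experiments averaged into one value of η
TRIALS = 5000   # number of sample values generated for ξ and for η

def roll():
    """One experiment with ξ: a roll of a fair die."""
    return random.randint(1, 6)

# Sample values of ξ itself
single = [roll() for _ in range(TRIALS)]

# Sample values of η = (ξ1 + ... + ξN) / N
averages = [sum(roll() for _ in range(N)) / N for _ in range(TRIALS)]

var_single = variance(single)
var_average = variance(averages)

print(f"mean of ξ ≈ {mean(single):.3f}, mean of η ≈ {mean(averages):.3f}")
print(f"Var(ξ) ≈ {var_single:.3f}, Var(η) ≈ {var_average:.4f}")
print(f"Var(ξ)/Var(η) ≈ {var_single / var_average:.1f} (theory: N = {N})")
```

Both sample means should land near 3.5, while the ratio of the two sample variances should land near N, up to random fluctuation.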

In addition, it makes sense to add another important detail. As the number of experiments N grows, the distribution of random variable η more and more resembles the Normal distribution (see Probability - Normal Distribution - Normal is a Limit in this course).
That allows us, knowing the sample mean and sample variance, to evaluate the range of the value of μ=E(ξ) with some certainty.
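
A minimal sketch of such an evaluation, assuming the two-sigma rule for 95% certainty and approximating Var(η) by the sample variance divided by N. The sample values below are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mean_interval_95(sample):
    """Two-sigma interval for μ, using Var(η) ≈ (sample variance) / N."""
    n = len(sample)
    m = mean(sample)
    half_width = 2 * stdev(sample) / sqrt(n)
    return m - half_width, m + half_width

# Hypothetical results of N = 10 independent experiments
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.1, 5.0, 4.9]

low, high = mean_interval_95(sample)
print(f"sample mean = {mean(sample):.2f}")
print(f"with about 95% certainty: {low:.3f} ≤ μ ≤ {high:.3f}")
```

The interval shrinks in proportion to 1/√N, so quadrupling the number of experiments roughly halves its width.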

Tuesday, June 14, 2016

Unizor - Statistical Distribution - Task D - Precipitation





Statistical Distribution
Problem 4 - Precipitation


To effectively learn from problem solving, try to solve these problems just by yourself, then listen to a lecture and then try to solve them again by yourself.

In order to determine if the claims about climate change are true or false, the monthly precipitation data at New York's Central Park have been gathered from the official Web site New York Historical Monthly Precipitation for the period from 1906 to 2015.
Based on these data, determine for each decade (10-year period) the mean, standard deviation and 95% certainty interval of the level of precipitation for each month of the year as well as for annualized levels.
Make a judgement about validity of the claims of New York climate change by comparing mean precipitation level in different months between two decades, 1906-1915 and 2006-2015 separated by a century. Consider a comparison of a difference between levels with zero.

Solution

The raw data with some additional calculations in a spreadsheet format can be downloaded from Monthly Precipitation NY (data).
This spreadsheet contains original data about monthly precipitation for each month and annualized precipitation during 1906-2015 period.
In addition, these data are supplemented with the following calculations for each month and for annualized precipitation over the entire period:
=MIN(...)
=MAX(...)
=SLOPE(...)
=AVERAGE(...)
=STDEV(...)
=VAR(...)
As we see, different months have different slopes during the entire period, some positive, some negative, some more, some less (the highest increase of precipitation is in April, the lowest - in January, the highest decrease in precipitation is in October, the lowest - in March). The annualized precipitation slope is positive, about 2 inches per century (about 4% growth in 100 years).
The degree of certainty of this conclusion is another issue, which we consider below.

The spreadsheet with original data and calculations per decade can be downloaded from Monthly Temp NY (analysis).
Examine it. Here are some conclusions.

Let random variables ξ1906, ξ1907, ...ξ1915 represent January precipitation during the years 1906-1915 (calculations for other months and annualized precipitation are similar).
It is reasonable to assume that all of them are independent and have identical normal distribution with mathematical expectation μ190# and variance σ²190#, which represent our mathematical model of the precipitation during these ten years.

Random variable
ξ1 = (ξ1906+...+ξ1915)/10
represents average January precipitation during the 1906-1915 decade. Its expectation is μ1=μ190# and its variance equals σ²1=σ²190#/10.

Analogously, independent and identically distributed random variables ξ2006, ξ2007, ...ξ2015 represent January precipitation during the years 2006-2015, each having mathematical expectation μ200# and variance σ²200#, which represent our mathematical model of the precipitation during these ten years.

Random variable
ξ2 = (ξ2006+...+ξ2015)/10
represents average January precipitation during the 2006-2015 decade. Its expectation is μ2=μ200# and its variance equals σ²2=σ²200#/10.

On the simplest level, to determine the validity of the claims about climate change during the 20th century, we can compare mathematical expectations
μ1 = E(ξ1) and μ2 = E(ξ2)
that is, mean precipitations in the beginning of the 20th and 21st centuries.
If they are different (or, which is the same, if μ2−μ1 ≠ 0), we have a confirmation of a shift in the precipitation. Based on this, we can calculate the absolute and relative increase or decrease of precipitation during this 100-year period.

In the analysis spreadsheet mentioned above we have 10 sample values of January precipitation, one for each year of the 1906-1915 decade
x1906, x1907, ...x1915
and 10 sample values of January precipitation, one for each year of the 2006-2015 decade
x2006, x2007, ...x2015.
This gives one sample value for random variable ξ1 (average precipitation during the 1906-1915 period)
X1 = (x1906+...+x1915)/10
and one sample value for random variable ξ2 (average precipitation during 2006-2015)
X2 = (x2006+...+x2015)/10
which we have to compare.

The difference X2−X1 is a single sample value of a normal random variable ξ2−ξ1, the mathematical expectation of which (μ2−μ1) we want to compare with 0. This difference is the best possible estimate of μ2−μ1. If we knew its variance, we could evaluate the range of possible values μ2−μ1 can take with whatever level of certainty we need.
For example, if σ² is the variance of ξ2−ξ1, we can say with 95% certainty that
|(μ2−μ1)−(X2−X1)| ≤ 2σ

Unfortunately, we don't know the variance of ξ2−ξ1.
We do, however, know that, since ξ1 and ξ2 are assumed to be independent random variables,
Var(ξ2−ξ1) = Var(ξ2)+Var(ξ1)

Now we have two ways to evaluate the variances of random variables ξ1 and ξ2.
We can assume that variance does not change with time and calculate a sample variance based on all January data from 1906 to 2015 - a questionable assumption, but a good estimate since the sample data are quite representative.
Alternatively, we calculate sample variances separately for 1906-1915 decade and 2006-2015 decade using only 10 sample values for each - better assumption, but sample data are not as numerous.

Let's use both methods and compare the results.

Assuming the variance does not change with time, the calculations based on the entire set of data from 1906 to 2015 show the sample standard deviation of January precipitation to be σ=1.645 and the sample variance to be σ²=2.707.
The sample variance of the decade averages ξ1 or ξ2 is 2.707/10=0.271.
The difference ξ2−ξ1 has sample variance 0.271·2=0.541 and its standard deviation (square root of variance) is 0.736.

With this standard deviation and the sample average January precipitation during the 1906-1915 and 2006-2015 periods, correspondingly, 3.911 and 3.547, we can state with 95% certainty that
|(μ2−μ1)−(3.547−3.911)| ≤
≤ 2·0.736
That is,
−0.364−1.472 ≤ μ2−μ1 ≤
≤ −0.364+1.472

or
−1.836 ≤ μ2−μ1 ≤ 1.108
As we see, we cannot say with 95% certainty that there is a positive or negative movement of the decade average January precipitation from the 1900's to the 2000's in New York.
Even if we reduce the level of certainty to 68% (single sigma rule), our left margin would still be negative and our right margin positive, which means that we cannot say there is a difference between mean January precipitations in the 1900's and 2000's.
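
The interval arithmetic above can be written as a small function. This is a sketch using the January figures quoted in these notes (variance 2.707, decade averages 3.911 and 3.547); the function name and its parameters are hypothetical.

```python
from math import sqrt

def difference_interval(mean1, mean2, var1, var2, n=10, sigmas=2.0):
    """Interval for μ2 − μ1 around the sample difference of two decade averages.

    var1 and var2 are variances of a single year's value in each decade;
    each decade average of n independent years has variance var/n.
    """
    var_diff = var1 / n + var2 / n   # Var(ξ2−ξ1) = Var(ξ2) + Var(ξ1)
    half_width = sigmas * sqrt(var_diff)
    diff = mean2 - mean1
    return diff - half_width, diff + half_width

# Same variance assumed for both decades, 95% certainty (two sigma)
low, high = difference_interval(3.911, 3.547, 2.707, 2.707)
print(f"{low:.3f} ≤ μ2−μ1 ≤ {high:.3f}")

# Single sigma rule, about 68% certainty
low1, high1 = difference_interval(3.911, 3.547, 2.707, 2.707, sigmas=1.0)
print(f"{low1:.3f} ≤ μ2−μ1 ≤ {high1:.3f}")

# A shift is confirmed only if zero lies outside the interval
print("shift confirmed" if low > 0 or high < 0 else "no shift confirmed")
```

Because the function takes a separate variance for each decade, it also covers the case where the two decades are given different sample variances.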

The same approach applied to other monthly and annualized precipitations produces different results. Here is a list of conclusions that we can make about the existence of a statistically significant (with 95% certainty) change in precipitation in New York from the 1900's to the 2000's:
Jan - NO
Feb - NO
Mar - NO
Apr - NO
May - NO
Jun - YES (increase)
Jul - NO
Aug - NO
Sep - NO
Oct - NO
Nov - NO
Dec - NO
Year - YES (increase)

Let's try a different approach and consider that variances do change with the time. Then our only choice is to evaluate separately variance of the January precipitation during 1906-1915 and during 2006-2015 periods based only on 10 values available during each decade.
Sample variance of the January precipitation during 1906-1915 is
Var(ξ1906) = 2.816.
Sample variance of the average January precipitation during this period is
Var(ξ1) = 2.816/10 = 0.282.
Sample variance of the January precipitation during 2006-2015 is
Var(ξ2006) = 1.232.
Sample variance of the average January precipitation during this period is
Var(ξ2) = 1.232/10 = 0.123.
Sample variance of the difference between average decade precipitations is
Var(ξ2−ξ1) = 0.282+0.123 = 0.405.
Standard deviation (square root from the above variance) is
σ = 0.636 (not much different from the value σ=0.736 obtained when the variance is calculated from the entire data set).
Finally, the 2σ interval around the sample mean is:
−0.364−1.273 ≤ μ2−μ1 ≤
≤ −0.364+1.273

or
−1.637 ≤ μ2−μ1 ≤ 0.909
The conclusion about absence of average January precipitation trend in New York is the same.
The same approach for other months and annualized precipitation produces different results. Here is a list of conclusions that we can make about the existence of a statistically significant change in precipitation in New York from the 1900's to the 2000's:
Jan - NO
Feb - NO
Mar - NO
Apr - NO
May - NO
Jun - YES (increase)
Jul - NO
Aug - NO
Sep - NO
Oct - NO
Nov - NO
Dec - YES (increase)
Year - YES (increase)

As we see, this second approach to evaluating the variance Var(ξ2−ξ1) produced almost the same result as the first one; the only difference is for the month of December.

CONCLUSION
We can state with 95% certainty that the decade average June precipitation during the 2000's is higher than the decade average June precipitation during the 1900's. December produced two different results under the two methodologies we used. The annualized decade average precipitation is also higher during the 2000's than during the 1900's.
The slope upward of about 2" in annual precipitation per century is observed.
Obviously, we cannot state with any mathematical precision which factors contributed more or less to this increase. Politicians are making their careers fighting over these issues, which is completely outside of this presentation.

Wednesday, June 8, 2016

Unizor - Statistical Distribution - Task C - Rising Sea Level








Our random variable takes continuous range of values with theoretically known fixed boundaries - from a to b.



Problem



In order to determine if the claims about climate change are true or false, sea level data (in mm) have been gathered daily at noon for the period from 1933 to 2014 at Crescent City, California, United States (latitude 41.74500, longitude -124.18300) and for the period from 1911 to 2014 at Atlantic City, New Jersey, United States (latitude 39.35000, longitude -74.42000) from the official Web site Sea Level Center of University of Hawaii.



Based on these data about the sea levels at California and New Jersey coasts, determine for each of these two places and for each decade (10-year period) the mean, standard deviation, slope, range and 95% certainty interval of average sea level.

Create a density of probabilities model - probability mass function - of average sea level by decade (divide the range into 5 intervals).

Make a judgement about validity of the claims that sea level was rising.
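
The per-decade quantities requested above can be sketched as follows. The sea level values in this example are made up for illustration and are NOT the measured data; the function names are hypothetical, and the 95% interval assumes the two-sigma rule for the mean of independent observations.

```python
from math import sqrt
from statistics import mean, stdev

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator

def decade_summary(years, levels):
    """Mean, standard deviation, slope, range and two-sigma interval of the mean."""
    m, s = mean(levels), stdev(levels)
    half_width = 2 * s / sqrt(len(levels))
    return {
        "mean": m,
        "stdev": s,
        "slope": slope(years, levels),
        "range": (min(levels), max(levels)),
        "interval95": (m - half_width, m + half_width),
    }

def interval_shares(values, low, high, bins=5):
    """Share of values in each of `bins` equal intervals of [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        k = min(int((v - low) / width), bins - 1)   # value `high` goes to the last interval
        counts[k] += 1
    return [c / len(values) for c in counts]

# Hypothetical yearly average sea levels (mm) for one decade
years = list(range(2001, 2011))
levels = [7000, 7010, 7005, 7020, 7015, 7030, 7025, 7040, 7035, 7050]

summary = decade_summary(years, levels)
shares = interval_shares(levels, min(levels), max(levels))
print(summary)
print(shares)   # empirical probability mass function over 5 intervals
```

A positive slope together with a 95% interval that moves upward from decade to decade would support the claim of a rising sea level; a slope near zero and overlapping intervals would not.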





Solution



The summary data for California coast with some additional calculations in a spreadsheet format can be downloaded from Sea Level in CA.

The corresponding spreadsheet for New Jersey coast is at Sea Level in NJ.



These two spreadsheets contain summary data about average sea level by decades.

These data include:

=AVERAGE(...)

=STDEV(...)

=SLOPE(...)

=MIN(...)

=MAX(...)

They are further supplemented with distribution counts of sea levels for each decade in five intervals.

Examine them.

Here are some conclusions.



CONCLUSION

In short, the sea level in California was stable, while in New Jersey it has risen by about 400 mm in about 100 years. Both statements have 95% certainty.