Thursday, June 30, 2016

Unizor - Statistics - Linear Regression - Problem 2





Notes to a video lecture on http://www.unizor.com


Linear Regression - Problem 2

Consider the relationship between the total distance cars in the US traveled during certain periods and the number of fatalities in car accidents during the same periods.
Intuitively, they must be related.

Assume a linear regression model:
Y = a·X + b + ε
where the independent variable X represents the number of miles all cars in the US traveled during some decade (10-year period) and Y represents the number of fatalities during the same period.

The real data for X and Y were taken from Car Accidents.
For each decade, starting with 1925-1934 and ending with 2005-2014, the total number of miles traveled and the number of fatalities are below.

Decade       Miles   Fatalities
1925-1934     1831     278716
1935-1944     2632     314372
1945-1954     4354     325536
1955-1964     7120     382171
1965-1974    10997     513981
1975-1984    17093     470533
1985-1994    20871     435194
1995-2004    27049     421979
2005-2014    29868     363216

What are the coefficients of the linear regression of the number of fatalities on the total miles traveled?
How big is the error of this regression?

Solution

So, we have n=9 values of miles traveled X:
x1, x2, ..., xn
and the corresponding values of the number of fatalities Y:
y1, y2, ..., yn.

We have introduced two averages of the sample data:
U=(x1+x2+...+xn)/n = 13535
V=(y1+y2+...+yn)/n = 389522

Using new variables
Xk = xk − U and Yk = yk − V
(where index k runs from 1 to n=9)
we came up with the best possible value for coefficient a in the formula for linear regression as
a = ΣXk·Yk / ΣXk² ≅ 3.3944

Since averaging nullifies the effect of random error ε, we consider the following equation as true to find coefficient b:
V = aU+b
from which we derive
b = V − aU ≅ 343579
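
The calculation above can be reproduced with a short script. This is only a sketch in plain Python: the variable names (data, xs, ys and so on) are chosen for illustration and are not part of the lecture, and the numbers are copied from the table of decades.

```python
# (decade, miles traveled, fatalities), copied from the table in these notes.
data = [
    ("1925-1934", 1831, 278716),
    ("1935-1944", 2632, 314372),
    ("1945-1954", 4354, 325536),
    ("1955-1964", 7120, 382171),
    ("1965-1974", 10997, 513981),
    ("1975-1984", 17093, 470533),
    ("1985-1994", 20871, 435194),
    ("1995-2004", 27049, 421979),
    ("2005-2014", 29868, 363216),
]

xs = [miles for _, miles, _ in data]
ys = [deaths for _, _, deaths in data]
n = len(xs)

# Sample averages U and V.
U = sum(xs) / n
V = sum(ys) / n

# Centered variables Xk = xk - U and Yk = yk - V.
X = [x - U for x in xs]
Y = [y - V for y in ys]

# Slope a = sum(Xk*Yk) / sum(Xk^2), intercept b = V - a*U.
a = sum(Xk * Yk for Xk, Yk in zip(X, Y)) / sum(Xk * Xk for Xk in X)
b = V - a * U

print(f"U = {U:.0f}, V = {V:.0f}")   # U = 13535, V = 389522
print(f"a = {a:.4f}, b = {b:.0f}")   # a = 3.3944, b = 343579
```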

Now we have a regression equation:
Y = 3.3944·X + 343579 + ε
from which we can calculate sample data for random error ε as
ε = Y − a·X − b

The resulting values of ε should sum to approximately zero (and they sum to −1.836, which is good; the small deviation from zero comes from rounding a and b).
Their sample standard deviation equals 68508, which is rather large relative to the average value of Y, 389522 (about 18% of it). This means that the quality of our regression is rather poor and it is not trustworthy for predictions.
Intuitively, this can be seen from the fact that, starting in 1975, the number of fatalities goes down while the number of miles traveled continues to increase. This might be related to better accident protection devices, like airbags, that manufacturers started installing in cars.
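
The error estimate can be checked in the same way. The sketch below assumes the rounded coefficients a = 3.3944 and b = 343579 from these notes and uses statistics.stdev from the Python standard library, which divides by n − 1; the variable names are again illustrative only.

```python
import statistics

# Miles traveled and fatalities per decade, copied from the table in these notes.
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

# Rounded regression coefficients as quoted in the notes.
a_rounded = 3.3944
b_rounded = 343579

# Residuals: eps = Y - a*X - b for each decade.
residuals = [y - a_rounded * x - b_rounded for x, y in zip(miles, fatalities)]

residual_sum = sum(residuals)              # close to zero
residual_sd = statistics.stdev(residuals)  # sample standard deviation (n - 1)
mean_fatalities = statistics.mean(fatalities)

print(f"sum of residuals = {residual_sum:.3f}")
print(f"sample standard deviation = {residual_sd:.0f}")
print(f"ratio to average Y = {residual_sd / mean_fatalities:.3f}")
```

The printed ratio puts the spread of the error next to the average number of fatalities, which is the comparison the notes use to judge the quality of the regression.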
