Notes to a video lecture on http://www.unizor.com
Linear Regression - Problem 2
Consider the relationship between the total distance that cars in the US traveled during certain periods and the number of fatalities in car accidents during the same periods.
Intuitively, they must be related.
Assume a linear regression model:
Y = a·X + b + ε
where the independent variable X represents the number of miles all cars in the US traveled during some decade (a 10-year period) and Y represents the number of fatalities during the same period.
The real data for X and Y were taken from Car Accidents.
For each decade, starting with 1925-1934 and ending with 2005-2014, the total number of miles traveled and the number of fatalities are below.
Decade | Miles | Fatalities |
1925-1934 | 1831 | 278716 |
1935-1944 | 2632 | 314372 |
1945-1954 | 4354 | 325536 |
1955-1964 | 7120 | 382171 |
1965-1974 | 10997 | 513981 |
1975-1984 | 17093 | 470533 |
1985-1994 | 20871 | 435194 |
1995-2004 | 27049 | 421979 |
2005-2014 | 29868 | 363216 |
What are the coefficients of the linear regression of the number of fatalities on the total miles traveled?
How big is the error of this regression?
Solution
So, we have n=9 values of miles traveled X
x1, x2, ..., xn
and the corresponding values of the number of fatalities Y
y1, y2, ..., yn.
We have introduced two averages of the sample data:
U=(x1+x2+...+xn)/n = 13535
V=(y1+y2+...+yn)/n = 389522
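As a quick check of these two averages, here is a minimal Python sketch. The list names miles and fatalities are our own choice for illustration; the numbers are copied from the table above.

```python
# Sample data copied from the table (one entry per decade, 1925-1934 ... 2005-2014).
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)            # number of decades in the sample
U = sum(miles) / n        # average of the X values
V = sum(fatalities) / n   # average of the Y values

print(n, U, V)  # 9 13535.0 389522.0
```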
Using new variables
Xk = xk − U and Yk = yk − V
(where index k runs from 1 to n=9)
we came up with the best possible value for the coefficient a in the formula for linear regression as
a = ΣXk·Yk / ΣXk² ≅ 3.3944
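The formula for a can be evaluated directly. Below is a small Python sketch under the same assumptions as the notes (least squares on centered data); the variable names are ours and only illustrative.

```python
# Sample data copied from the table.
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)
U = sum(miles) / n
V = sum(fatalities) / n

# Centered variables Xk = xk - U and Yk = yk - V.
X = [x - U for x in miles]
Y = [y - V for y in fatalities]

# a = sum(Xk * Yk) / sum(Xk^2)
a = sum(xk * yk for xk, yk in zip(X, Y)) / sum(xk * xk for xk in X)

print(round(a, 4))  # 3.3944
```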
Since averaging nullifies the effect of random error ε, we consider the following equation as true to find coefficient b:
V = aU+b
from which we derive
b = V − aU ≅ 343579
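The same can be done for b. The sketch below recomputes a without rounding and then applies b = V − aU; the names Sxy and Sxx are our shorthand for the two sums in the formula for a, not notation from the lecture.

```python
# Sample data copied from the table.
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)
U = sum(miles) / n
V = sum(fatalities) / n

Sxy = sum((x - U) * (y - V) for x, y in zip(miles, fatalities))  # sum of Xk*Yk
Sxx = sum((x - U) ** 2 for x in miles)                           # sum of Xk^2

a = Sxy / Sxx    # slope
b = V - a * U    # intercept, from V = a*U + b

print(round(a, 4), round(b))  # 3.3944 343579
```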
Now we have a regression equation:
Y = 3.3944·X + 343579 + ε
from which we can calculate sample data for the random error ε as
ε = Y − a·X − b
The resulting values of ε should sum to approximately zero. With the rounded coefficients above, the sum equals −1.836, which is good: it is negligible compared to the values of Y, and it differs from zero only because a and b were rounded.
Their sample standard deviation equals 68508, which is rather large relative to the average value of Y, 389522 (about 18% of it). This means that the quality of our regression is rather poor and it is not trustworthy for any predictions.
Intuitively, this can be seen from the fact that, starting in 1975, the number of fatalities goes down, while the number of miles traveled continues to increase. This might be related to better accident protection devices, like airbags, that manufacturers started installing in cars.
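The residual check above can be reproduced with a short Python sketch. It assumes the sample standard deviation uses the n−1 denominator, which is what statistics.stdev computes and what matches the value 68508; the variable names are ours.

```python
import statistics

# Sample data copied from the table.
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)
U = sum(miles) / n
V = sum(fatalities) / n
a = (sum((x - U) * (y - V) for x, y in zip(miles, fatalities))
     / sum((x - U) ** 2 for x in miles))
b = V - a * U

# Residuals with unrounded coefficients: their sum is zero up to float error.
eps = [y - a * x - b for x, y in zip(miles, fatalities)]

# Residuals with the rounded coefficients used in the notes.
eps_rounded = [y - 3.3944 * x - 343579 for x, y in zip(miles, fatalities)]

s = statistics.stdev(eps)  # sample standard deviation (n - 1 denominator)

print(round(sum(eps_rounded), 3))  # -1.836
print(round(s))                    # 68508
print(round(s / V, 2))             # 0.18
```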
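The observation about the trend can also be checked on the table itself. This is only an illustrative sketch with our own variable names: it finds the decade with the most fatalities and checks that fatalities fall in every later decade while miles keep growing.

```python
# Sample data copied from the table.
decades = ["1925-1934", "1935-1944", "1945-1954", "1955-1964", "1965-1974",
           "1975-1984", "1985-1994", "1995-2004", "2005-2014"]
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

# Index of the decade with the largest number of fatalities.
peak = max(range(len(fatalities)), key=lambda k: fatalities[k])

# Miles grow from every decade to the next one.
miles_always_grow = all(m1 < m2 for m1, m2 in zip(miles, miles[1:]))

# Fatalities decrease in every decade after the peak.
after_peak = fatalities[peak:]
fatalities_fall_after_peak = all(
    f1 > f2 for f1, f2 in zip(after_peak, after_peak[1:]))

print(decades[peak], miles_always_grow, fatalities_fall_after_peak)
# 1965-1974 True True
```

A single straight line cannot follow data that rises and then falls, which agrees with the large standard deviation of ε.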