## Thursday, June 30, 2016

### Unizor - Statistics - Linear Regression - Problem 2

Notes to a video lecture on http://www.unizor.com

Linear Regression - Problem 2

Consider the relationship between the total distance that cars in the US traveled during certain periods and the number of fatalities in car accidents during the same periods.
Intuitively, they must be related.

Assume a linear regression model:
Y = a·X + b + ε
where the independent variable X represents the number of miles all cars in the US traveled during some decade (10-year period) and Y represents the number of fatalities during the same period.

The real data for X and Y were taken from Car Accidents.
For each decade, starting with 1925-1934 and ending with 2005-2014, the total number of miles traveled and the number of fatalities are listed below.

| Decade | Miles | Fatalities |
|---|---|---|
| 1925-1934 | 1831 | 278716 |
| 1935-1944 | 2632 | 314372 |
| 1945-1954 | 4354 | 325536 |
| 1955-1964 | 7120 | 382171 |
| 1965-1974 | 10997 | 513981 |
| 1975-1984 | 17093 | 470533 |
| 1985-1994 | 20871 | 435194 |
| 1995-2004 | 27049 | 421979 |
| 2005-2014 | 29868 | 363216 |

What are the coefficients of the linear regression of the number of fatalities on the total miles traveled?
How large is the error of this regression?

Solution

So, we have n=9 values of miles traveled X
x1, x2, ..., xn
and the corresponding values of number of fatalities Y
y1, y2, ..., yn.

We have introduced two averages of the sample data:
U=(x1+x2+...+xn)/n = 13535
V=(y1+y2+...+yn)/n = 389522

Using new variables
Xk = xk − U and Yk = yk − V
(where index k runs from 1 to n=9)
we came up with the best possible value for coefficient a in the formula for linear regression as
a = ΣXk·Yk / ΣXk² ≅ 3.3944
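The arithmetic behind U, V and a can be sketched in a few lines of Python. This is only an illustration using the table values; the names `miles` and `fatalities` are chosen here for readability and do not come from the lecture.

```python
# Table values: miles traveled (X) and fatalities (Y) per decade.
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)

# Sample averages U and V.
U = sum(miles) / n
V = sum(fatalities) / n

# Centered variables Xk = xk - U and Yk = yk - V.
X = [x - U for x in miles]
Y = [y - V for y in fatalities]

# Slope a = sum(Xk*Yk) / sum(Xk^2).
a = sum(xk * yk for xk, yk in zip(X, Y)) / sum(xk * xk for xk in X)

print(U, V, round(a, 4))  # 13535.0 389522.0 3.3944
```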

Since averaging cancels the effect of the random error ε (its mean is assumed to be zero), we take the following equation as true and use it to find coefficient b:
V = aU+b
from which we derive
b = V − aU ≅ 343579
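As a rough check of both coefficients together, the same steps can be wrapped in a small helper. The function name `fit_line` is hypothetical and used only for this sketch; it applies the formulas a = ΣXk·Yk / ΣXk² and b = V − aU to the table values.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a*x + b fitted by least squares."""
    n = len(xs)
    u = sum(xs) / n                      # average of x values (U)
    v = sum(ys) / n                      # average of y values (V)
    sxy = sum((x - u) * (y - v) for x, y in zip(xs, ys))
    sxx = sum((x - u) ** 2 for x in xs)
    a = sxy / sxx                        # slope
    b = v - a * u                        # intercept from V = a*U + b
    return a, b


miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

a, b = fit_line(miles, fatalities)
print(round(a, 4), round(b))  # 3.3944 343579
```

By construction the fitted line passes through the point of averages (U, V), which is exactly the condition V = aU + b used to find b.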

Now we have a regression equation:
Y = 3.3944·X + 343579 + ε
from which we can calculate the sample values of the random error ε as
ε = Y − a·X − b

The resulting values of ε should sum to approximately zero. With the rounded coefficients above they sum to −1.836, which is negligible on the scale of the data; the small remainder comes from rounding a and b.
Their sample standard deviation equals 68508, which is rather large relative to the average value of Y, 389522 (about 18% of it). This means that the quality of our regression is rather poor and it cannot be trusted for predictions.
Intuitively, this can be seen from the fact that, starting in 1975, the number of fatalities goes down while the number of miles traveled continues to increase. This may be related to better safety devices, such as airbags, that manufacturers started installing in cars.
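The residual check described above can be reproduced with a short script. It is a sketch that assumes the rounded coefficients quoted in the text (a = 3.3944, b = 343579) and uses the sample standard deviation with the n−1 denominator, as computed by Python's `statistics.stdev`; the variable names are illustrative.

```python
from statistics import stdev

miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

# Rounded regression coefficients taken from the text.
a, b = 3.3944, 343579

# Sample values of the random error: eps = y - a*x - b.
residuals = [y - a * x - b for x, y in zip(miles, fatalities)]

total = sum(residuals)        # should be close to zero
spread = stdev(residuals)     # sample standard deviation (n-1 denominator)
mean_y = sum(fatalities) / len(fatalities)

print("sum of residuals:", round(total, 3))
print("standard deviation:", round(spread))
print("relative to average Y:", round(spread / mean_y, 3))
```

The printed sum and standard deviation should land close to the −1.836 and 68508 quoted in the notes, and the last line expresses the spread as a fraction of the average number of fatalities.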