*Notes to a video lecture on http://www.unizor.com*

__Linear Regression - Problem 2__

Consider the relationship between the total distance that cars in the US traveled during certain periods and the number of fatalities in car accidents during the same periods.

Intuitively, they must be related.

Assume a linear regression model:

**Y = a·X + b + ε**

where the independent variable **X** represents the number of miles all cars in the US have traveled during some decade (10-year period) and **Y** represents the number of fatalities during the same period.

The real data for **X** and **Y** were taken from Car Accidents.

For each decade, starting with 1925-1934 and ending with 2005-2014, the total number of miles traveled and the number of fatalities are given below.

| Decade | Miles | Fatalities |
|---|---|---|
| 1925-1934 | 1831 | 278716 |
| 1935-1944 | 2632 | 314372 |
| 1945-1954 | 4354 | 325536 |
| 1955-1964 | 7120 | 382171 |
| 1965-1974 | 10997 | 513981 |
| 1975-1984 | 17093 | 470533 |
| 1985-1994 | 20871 | 435194 |
| 1995-2004 | 27049 | 421979 |
| 2005-2014 | 29868 | 363216 |

What are the coefficients of the linear regression of the number of fatalities on the total miles traveled?

How big is the error of this regression?

*Solution*

So, we have *n=9* values of miles traveled **X**: *x_{1}, x_{2}, ..., x_{n}* and the corresponding values of the number of fatalities **Y**: *y_{1}, y_{2}, ..., y_{n}*.

We have introduced two averages of the sample data:

*U = (x_{1}+x_{2}+...+x_{n})/n = 13535*

*V = (y_{1}+y_{2}+...+y_{n})/n = 389522*

Using new variables

*X_{k} = x_{k} − U* and *Y_{k} = y_{k} − V*

(where index *k* runs from *1* to *n=9*)

we came up with the best possible value for the coefficient **a** in the formula for linear regression as

**a** = ΣX_{k}·Y_{k} / ΣX_{k}² ≅ 3.3944

Since averaging nullifies the effect of random error *ε*, we consider the following equation as true to find the coefficient **b**:

*V =* **a**·*U +* **b**

from which we derive

**b** = V − **a**·U ≅ 343579

Now we have a regression equation:

**Y = 3.3944·X + 343579 + ε**

from which we can calculate the sample values of the random error *ε* as

**ε = Y − a·X − b**

The resulting values of *ε* should sum to approximately zero, and they do: the sum is *−1.836*, a negligible amount that comes from rounding **a** and **b**.
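The calculation above can be reproduced with a short Python sketch. It is only an illustration: the variable names (`miles`, `fatalities`, `Xc`, `Yc`, `residuals_rounded`) are chosen here and do not come from the lecture, and the data are copied from the table by hand.

```python
# Sample data copied from the table (one entry per decade).
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

n = len(miles)
U = sum(miles) / n        # average of X
V = sum(fatalities) / n   # average of Y

# Centered variables X_k = x_k - U and Y_k = y_k - V.
Xc = [x - U for x in miles]
Yc = [y - V for y in fatalities]

# a = sum(X_k * Y_k) / sum(X_k^2), then b = V - a*U.
a = sum(xc * yc for xc, yc in zip(Xc, Yc)) / sum(xc * xc for xc in Xc)
b = V - a * U

# Residuals with the unrounded coefficients.
residuals = [y - a * x - b for x, y in zip(miles, fatalities)]

# Residuals with the rounded coefficients quoted in the text.
a_r, b_r = 3.3944, 343579
residuals_rounded = [y - a_r * x - b_r for x, y in zip(miles, fatalities)]

print(U, V)                    # the text reports 13535 and 389522
print(round(a, 4), round(b))   # the text reports 3.3944 and 343579
print(sum(residuals))          # zero up to floating-point error
print(sum(residuals_rounded))  # the text reports -1.836
```

With unrounded coefficients the residuals sum to zero up to floating-point error, so the small nonzero sum quoted in the text is purely an effect of rounding.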

Their sample standard deviation equals *68508*, which is a rather big number relative to the average value of **Y** = 389522 (about 18% of it). This means that the quality of our regression is rather poor and that it is not trustworthy for any predictions.

Intuitively, this can be seen from the fact that, starting in 1975, the number of fatalities goes down while the number of miles traveled continues to increase. This might be related to better accident protection devices, like airbags, that manufacturers started installing in cars.
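The size of the error can be checked with a minimal Python sketch. It assumes the rounded coefficients quoted in the text and the sample standard deviation with the *n−1* denominator, which is what `statistics.stdev` computes. The names `spread` and `ratio` are illustrative only.

```python
import statistics

# Sample data copied from the table (one entry per decade).
miles = [1831, 2632, 4354, 7120, 10997, 17093, 20871, 27049, 29868]
fatalities = [278716, 314372, 325536, 382171, 513981,
              470533, 435194, 421979, 363216]

# Rounded regression coefficients as quoted in the text.
a, b = 3.3944, 343579

# Sample values of the random error: eps = Y - a*X - b.
residuals = [y - a * x - b for x, y in zip(miles, fatalities)]

# Sample standard deviation of the residuals (n-1 denominator).
spread = statistics.stdev(residuals)

# Size of the spread relative to the average number of fatalities.
V = sum(fatalities) / len(fatalities)
ratio = spread / V

print(round(spread))    # the text reports 68508
print(round(ratio, 2))  # fraction of the average value of Y
```

Printing the individual residuals shows the same pattern as the table: the line underestimates fatalities in the middle decades and overestimates them at both ends.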