## Tuesday, February 9, 2016

### Unizor - Bernoulli Statistics - Sample Variance

Unizor - Creative Minds through Art of Mathematics - Math4Teens

Notes to a video lecture on http://www.unizor.com


The most frequently used procedure to better evaluate the variance of η is to use the sample variance, which can be calculated from the values x1, x2...xN obtained in N experiments.

The variance of a discrete random variable is a probability-weighted average of squares of its deviation from its mathematical expectation.
Inasmuch as we have accepted a sample average of values our random variable ξ took in N experiments
m = (x1+x2+...+xN) / N
as a substitution for E(ξ)=P, it seems reasonable to accept a sample average of squares of deviations from its sample mean
sN² = [(x1−m)²+(x2−m)²+...+(xN−m)²] / N
as a substitution for Var(ξ)=P(1−P).
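
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample of ten outcomes are made up for the example and do not come from the lecture.

```python
def sample_mean(xs):
    """Sample average m = (x1 + ... + xN) / N."""
    return sum(xs) / len(xs)


def biased_sample_variance(xs):
    """sN^2: average squared deviation from the sample mean (divides by N)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Hypothetical outcomes of N = 10 Bernoulli experiments (1 = success, 0 = failure).
sample = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

m = sample_mean(sample)
s_n2 = biased_sample_variance(sample)
print("m    =", m)
print("sN^2 =", s_n2)
```

For data consisting only of zeros and ones, sN² computed this way coincides (up to rounding) with m·(1−m), which mirrors the formula Var(ξ)=P(1−P) with m in place of P.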

Analogously to analyzing the bias and margin of error of the sample mean m as an estimate of E(ξ), we need to analyze two issues with the substitution of sN² for Var(ξ): its bias and its margin of error.

The estimate m of the probability P is unbiased, which is a very desirable property of any estimate.
Considering m as a single value of a random variable
η = (ξ1+ξ2+...+ξN) / N
and calculating the mathematical expectation of η, we had E(η)=P, which confirms that m is an unbiased estimate of E(ξ)=P.

It would be highly desirable for our estimate sN² of Var(ξ) to be unbiased as well.
To determine this, we have to consider sN² as a single value of a random variable
ζ = [(ξ1−η)²+(ξ2−η)²+...+(ξN−η)²] / N
and check whether its mathematical expectation E(ζ) equals Var(ξ)=P(1−P).
If it does, our estimate sN² would be an unbiased estimate of Var(ξ) and, using sN² instead of the upper bound 1/4 for the variance Var(ξ), we could more precisely evaluate the quality of using the sample mean m as an estimate of the unknown probability P.

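Let's do the calculations, keeping in mind that all ξi are independent random variables identically distributed as ξ and η is their arithmetic average.
Since all ξi are identically distributed, every term (ξi−η)² has the same expectation as (ξ1−η)², so the average of these N terms has the expectation E{(ξ1−η)²}. Substituting η=(ξ1+...+ξN)/N, we get
E(ζ) = E{[(ξ1−η)²+(ξ2−η)²+...+(ξN−η)²]/N} = E{(ξ1−η)²} = E{[Nξ1−(ξ1+...+ξN)]²}/N²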

Let's separately evaluate the expectation E{...} in the above expression without a factor N² in the denominator:
E{[Nξ1−(ξ1+...+ξN)]²} =
N²·E{ξ1²}−2N·E{ξ1·(ξ1+...+ξN)}+E{(ξ1+...+ξN)²}

Consider each expectation in the above expression separately:
E{ξ1²} = 1²·P+0²·(1−P) = P

Taking into consideration that all ξi are independent and identically distributed, we can use the property that the mathematical expectation of a product of two independent variables equals the product of their expectations. The product ξ1·ξ1=ξ1² has expectation P, and each of the remaining N−1 products ξ1·ξi (i≠1) has expectation P·P=P²:
E{ξ1·(ξ1+...+ξN)} = P + (N−1)P²

In the last component, if we square the sum of N variables, we get N² components added up: N of them are squares ξi², each with expectation P, and the remaining N²−N components are products ξi·ξj of two different, and therefore independent, variables, each with expectation P².
Therefore,
E{(ξ1+...+ξN)²} = N·P + (N²−N)·P² = N·P·[1+(N−1)P]
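
The three expectations above can be double-checked by brute force for a small N. The sketch below enumerates all 2^N outcomes and uses exact fractions, so there is no rounding. The helper name `expectation` and the values N=5 and P=3/10 are arbitrary choices for the illustration, and a check for one N and one P is not a proof.

```python
from fractions import Fraction
from itertools import product


def expectation(f, n, p):
    """Exact E[f(x1..xn)] for n independent Bernoulli(p) variables, by enumeration."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=n):
        k = sum(xs)                          # number of ones in this outcome
        prob = p ** k * (1 - p) ** (n - k)   # probability of this outcome
        total += prob * f(xs)
    return total


N = 5
P = Fraction(3, 10)  # arbitrary illustrative value of P

e_sq = expectation(lambda xs: xs[0] ** 2, N, P)           # E{ξ1²}
e_cross = expectation(lambda xs: xs[0] * sum(xs), N, P)   # E{ξ1·(ξ1+...+ξN)}
e_sum_sq = expectation(lambda xs: sum(xs) ** 2, N, P)     # E{(ξ1+...+ξN)²}

print(e_sq == P)
print(e_cross == P + (N - 1) * P ** 2)
print(e_sum_sq == N * P * (1 + (N - 1) * P))
```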

Combining all the results together to calculate the numerator of E(ζ) (recall that the denominator is N²), we get:
N²·E(ζ) = N²·P − 2N·[P+(N−1)P²] + N·P·[1+(N−1)P] =
= N·P·(N−1−N·P+P) = N·(N−1)·P·(1−P)
Hence,
E(ζ) = P·(1−P)·(N−1)/N
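
As a sanity check of this result, E(ζ) can be computed directly from its definition for a few small N and compared with P·(1−P)·(N−1)/N. The sketch below does this by enumerating all outcomes with exact fractions; the function name `expected_zeta` and the value P=3/10 are assumptions made for the example only.

```python
from fractions import Fraction
from itertools import product


def expected_zeta(n, p):
    """Exact E(ζ) for n independent Bernoulli(p) trials, straight from the definition."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = p ** k * (1 - p) ** (n - k)            # probability of this outcome
        eta = Fraction(k, n)                          # sample mean of this outcome
        zeta = sum((x - eta) ** 2 for x in xs) / n    # squared deviations divided by N
        total += prob * zeta
    return total


P = Fraction(3, 10)  # arbitrary illustrative value of P
for n in (2, 3, 5, 8):
    exact = expected_zeta(n, P)
    predicted = P * (1 - P) * Fraction(n - 1, n)
    print(n, exact, predicted, exact == predicted)
```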

As we see, the mathematical expectation of
ζ = [(ξ1−η)²+(ξ2−η)²+...+(ξN−η)²] / N
is not exactly the same as Var(ξ)=P·(1−P).
Granted, for large N the difference is small and tends to zero as N→∞, but the estimate is still biased, that is, it is not centered on the value we want to evaluate.

An easy solution to this problem is to use, instead of ζ, the random variable
θ = ζ·N/(N−1)
That is,
θ = [(ξ1−η)²+(ξ2−η)²+...+(ξN−η)²] / (N−1)
In this case E(θ) is exactly equal to Var(ξ):
E(θ) = E(ζ)·N/(N−1) = P(1−P)

This implies that an unbiased estimate of the unknown variance of a Bernoulli random variable ξ, based on a sample of its N values x1, x2...xN, should be
s² = [(x1−m)²+(x2−m)²+...+(xN−m)²] / (N−1)
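
In Python, the two versions of the sample variance are available in the standard `statistics` module: `pvariance` divides by N and `variance` divides by N−1. The sketch below compares them with a manual calculation of s²; the sample itself is hypothetical.

```python
import statistics

# Hypothetical outcomes of N = 10 Bernoulli experiments (1 = success, 0 = failure).
sample = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
n = len(sample)
m = sum(sample) / n

# Manual calculation of s² with N−1 in the denominator.
manual = sum((x - m) ** 2 for x in sample) / (n - 1)

biased = statistics.pvariance(sample)    # divides by N
unbiased = statistics.variance(sample)   # divides by N−1

print("divided by N:  ", biased)
print("divided by N-1:", unbiased)
print("manual s^2:    ", manual)
```

The two library results differ exactly by the factor N/(N−1), which is noticeable for a sample this small and becomes negligible as N grows.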

In the next lecture we will use this estimate of Var(ξ) in the problems presented in the previous lecture to obtain more realistic numbers for margin of error and certainty level.

Still open is the question of how good our evaluation of Var(ξ) really is. After all, if it is not a good evaluation, that is, if its deviation from its mean, measured by Var(θ), is too large, our calculations of margin of error and certainty level will not be as precise as we would like them to be.
The exact calculation of Var(θ), which measures the precision of s² as an evaluation of Var(ξ), is well known, but it lies outside the scope of this course. However, it is important to know that Var(θ) tends to zero as N→∞ on the order of 1/N, which seems intuitively correct.
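
Although the general formula is outside the scope of these notes, the behavior of Var(θ) can be observed numerically. The sketch below relies on a simplification that holds for values 0 and 1 only: if k of the N values equal 1, the sum of squared deviations from the mean k/N equals k·(N−k)/N, so θ=k·(N−k)/[N·(N−1)], and k has a binomial distribution. The function name `theta_moments` and the value P=3/10 are assumptions of this illustration, which is a numerical check rather than a derivation.

```python
from fractions import Fraction
from math import comb


def theta_moments(n, p):
    """Exact mean and variance of θ for n independent Bernoulli(p) trials."""
    mean = Fraction(0)
    second = Fraction(0)
    for k in range(n + 1):
        prob = comb(n, k) * p ** k * (1 - p) ** (n - k)   # P(exactly k ones)
        theta = Fraction(k * (n - k), n * (n - 1))        # value of θ given k ones
        mean += prob * theta
        second += prob * theta ** 2
    return mean, second - mean ** 2


P = Fraction(3, 10)  # arbitrary illustrative value of P
for n in (10, 20, 40, 80):
    mean, var = theta_moments(n, P)
    # If Var(θ) is on the order of 1/N, the last column should stay roughly constant.
    print(n, float(mean), float(var), float(n * var))
```

For this P, the mean stays equal to P·(1−P) for every N, while Var(θ) shrinks as N grows and N·Var(θ) stays within a narrow range.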