## Monday, March 28, 2016

### Unizor

Unizor - Creative Minds through Art of Mathematics - Math4Teens

Notes to a video lecture on http://www.unizor.com

Statistical Distribution Task A - Problem

The values our random variable ξ takes are discrete and theoretically known: they are X1, X2... XK. The probabilities of taking these values - p1, p2... pK - are unknown, and we need to evaluate them.

Problem
There are three parties that nominated their candidates for the positions of the President, the Vice President and the Defense Minister - White Party, Blue Party and Red Party.
The winner of the elections will become the President, the one who takes the second place will be the Vice President and the third place candidate will be the Defense Minister.
In order to predict the results of the presidential elections, a survey was organized among 4000 randomly chosen voters. The results of the survey are: 1480 people prefer a candidate from the White Party, 1320 people prefer a candidate from the Blue Party and 1200 people prefer a candidate from the Red Party.
What are the probabilities of winning for all candidates (PW, PB, PR) and what is the 95% certainty margin of error in each case (mW, mB, mR)?

Solution
The probabilities of winning are, obviously, approximated as empirical frequencies - random variables P'W, P'B and P'R with values obtained in a survey:
P'W = 1480/4000 = 0.37
P'B = 1320/4000 = 0.33
P'R = 1200/4000 = 0.30

As for a margin of error, let's calculate it in two ways:
(a) simple and crude, using the rule 2σ ≤ 1/√N;
(b) using sample variance.

Method (a)
Var = PW(1−PW)/4000
(where PW is unknown).
Crude evaluation of this is based on inequality
p(1−p) ≤ 1/4 = 0.25
for all p from 0 to 1 (the range of all probabilities).
Therefore, for method (a) of evaluation of margin of error we can use:
σ² ≤ 1/(4·4000);
σ ≤ 1/(2·√4000)
For 95% certainty we need 2σ interval:
2σ ≤ 1/√4000 ≅ 0.0158

As we see, the crude and simple method (a) of evaluating the margin of error gives
2σ ≤ 0.0158.
Therefore,
mW ≤ 0.0158
mB ≤ 0.0158
mR ≤ 0.0158

Here is how the probabilities fall into intervals with this margin of error:
PW ∈ [0.3542, 0.3858]
PB ∈ [0.3142, 0.3458]
PR ∈ [0.2842, 0.3158]

The White Party candidate is, with 95% certainty, the most likely to become President. The difference between Blue and Red is not sufficient to state with 95% certainty that their chances differ, since, as we see, their intervals intersect.
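The crude method (a) can be sketched in Python (the survey counts are from the problem above; variable names are illustrative):

```python
import math

# Survey counts from the problem and the sample size
N = 4000
counts = {"White": 1480, "Blue": 1320, "Red": 1200}

# Empirical frequencies approximate the winning probabilities
freq = {party: c / N for party, c in counts.items()}

# Crude 95% margin of error: 2*sigma <= 1/sqrt(N), since p(1-p) <= 1/4
margin = 1 / math.sqrt(N)

for party, p in freq.items():
    print(f"{party}: {p:.2f} +/- {margin:.4f}"
          f" -> [{p - margin:.4f}, {p + margin:.4f}]")
```

Running it reproduces the intervals above, since 1/√4000 ≅ 0.0158.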

Provided the same proportion of opinions (0.37, 0.33 and 0.30), we need (in the crude evaluation case) to make the number of experiments N large enough for 1/√N to be smaller than half the distance between values. The smallest distance is 0.33−0.30=0.03, so we have to satisfy the inequality:
1/√N ≤ 0.03/2
N ≥ 4/0.03² ≅ 4445
It means that, to be 95% certain in our prediction for all three participants, we need at least 4445 participants if the smallest difference between empirical frequencies is 0.03.
Generally, for 95% certainty, if the smallest difference between empirical frequencies is d, we need
1/√N ≤ d/2
N ≥ 4/d²
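This sample-size rule is easy to wrap in a small helper (a sketch; the function name is made up, and the bound 4/d² is rounded up to a whole number of participants):

```python
import math

def min_sample_size(d):
    """Smallest N satisfying 1/sqrt(N) <= d/2, i.e. N >= 4/d^2
    (the crude 95%-certainty rule derived above)."""
    return math.ceil(4 / d ** 2)

# Smallest observed gap between frequencies: 0.33 - 0.30 = 0.03
print(min_sample_size(0.03))   # 4445 after rounding up
```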

Method (b)
This method is based on evaluating the variance using the sample data.

The sample variance of the Bernoulli random variable β (equal to 1 if the surveyed person prefers the White Party candidate and 0 otherwise) is
VarW = [1480(1−0.37)² +
+ 2520(0−0.37)²] /3999 ≅
≅ 0.2331
This is only a little better than the crude evaluation based on the maximum of the variance, 0.25.

Sample variance of average of 4000 independent random variables identically distributed as β will be
Var(P'W) ≅ 0.000058275
The 2σ in this case is
2σ(P'W) ≅ 0.0153
The interval the corresponding probability falls into with 95% certainty is:
PW ∈ [0.3547,0.3853]

Analogous calculations for PB lead to the following results:
VarB = [1320(1−0.33)² +
+ 2680(0−0.33)²]/3999 ≅
≅ 0.2212
Therefore, the sample variance of the average of 4000 independent random variables identically distributed as β will be
Var(P'B) ≅ 0.0000553
The 2σ in this case is
2σ(P'B) ≅ 0.0149
The interval the corresponding probability falls into with 95% certainty is:
PB ∈ [0.3151,0.3449]
We see clearly that, with 95% certainty, the Blue Party candidate is less likely than the White Party candidate to become President.

Similarly, calculations for PR lead to the following results:
VarR = [1200(1−0.30)² +
+ 2800(0−0.30)²]/3999 ≅
≅ 0.2100
Therefore, the sample variance of the average of 4000 independent random variables identically distributed as β will be
Var(P'R) ≅ 0.0000525
The 2σ in this case is
2σ(P'R) ≅ 0.0145
The interval the corresponding probability falls into with 95% certainty is:
PR ∈ [0.2855,0.3145]

As you see, with a more precise (albeit slightly less certain) estimation of variance, there is a clear distinction between the Blue and Red parties.
With 95% certainty, the Red Party candidate is less likely than the Blue Party candidate to become the Vice President.
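The sample-variance method (b) for all three parties can be sketched in Python (names are illustrative):

```python
import math

N = 4000
counts = {"White": 1480, "Blue": 1320, "Red": 1200}

margins = {}
for party, nu in counts.items():
    p_hat = nu / N
    # Sample variance of the underlying Bernoulli variable beta:
    # nu ones deviate by (1 - p_hat), N - nu zeros deviate by p_hat
    s2 = (nu * (1 - p_hat) ** 2 + (N - nu) * p_hat ** 2) / (N - 1)
    # Variance of the sample average nu/N and the 2-sigma margin
    margins[party] = 2 * math.sqrt(s2 / N)
    print(f"{party}: {p_hat:.2f} +/- {margins[party]:.4f}")
```

The printed margins match the values above (0.0153, 0.0149, 0.0145).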

## Wednesday, March 23, 2016

### Unizor - Statistical Distribution - Task A - Quality

Unizor - Creative Minds through Art of Mathematics - Math4Teens

Notes to a video lecture on http://www.unizor.com

Statistical Distribution

Task A. The values our random variable ξ takes are discrete and theoretically known: they are X1, X2... XK. The probabilities of taking these values - p1, p2... pK - are unknown, and we need to evaluate them.
Examples: rolling a standard cubic die (6 known outcomes), a Bernoulli random variable (only two values - 1 and 0).

Our task is to evaluate the unknown probabilities of ξ to take different values based on experimental results with this random variable.

Let's recall the approach suggested in a general lecture on statistical distribution.
Assume that, as the result of N experiments, random variable ξ (that could theoretically take values X1, X2... XK) took value Xj in νj experiments, where j∈[1,K].
Obviously, ν1+ν2+...+νK=N.
Also notice that all νj are random variables, sum of which is constant (non-random) N.

Our best approximation for the probability pj of ξ to take value Xj is random variable νj/N - the empirical frequency of occurrence of this event.
So, our statistical distribution of random variable ξ looks like
pj = Prob{ξ=Xj} ≅ νj/N
where j∈[1,K].

The quality of this approximation should be better with larger number of experiments and we will attempt to evaluate this quality.
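A quick simulation illustrates the approach for a fair die (a sketch; in a real experiment the true probabilities 1/6 are of course unknown):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is reproducible
N = 60_000

# N experiments with xi = outcome of rolling a fair six-sided die
rolls = [random.randint(1, 6) for _ in range(N)]
nu = Counter(rolls)  # nu[j] = number of experiments where xi took value j

# Empirical frequencies nu_j / N approximate the unknown p_j = 1/6
for face in range(1, 7):
    print(face, nu[face] / N)
```

With N this large, every empirical frequency lands close to 1/6 ≅ 0.1667.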

Let's concentrate on one particular probability p1=Prob{ξ=X1} and its empirical approximation with a random variable ν1/N. The other probabilities will be similar.

Consider a new Bernoulli random variable β that takes the value 1 if variable ξ takes value X1 and 0 otherwise.
Prob{β=1}=Prob{ξ=X1}=p1
Prob{β=0}=1−p1
Obviously, as we know from the properties of Bernoulli random variables,
E(β)=p1
Var(β)=p1(1−p1)

Out of N experiments value β=1 occurred ν1 times and value β=0 occurred N−ν1 times.
Therefore, we can say that random variable ν1 (that we want to use to evaluate probability p1 of ξ to take value X1) equals to β1+β2+...+βN, where all βj are independent identically distributed random variables, distributed exactly as β (that is, they are equal to 1 if ξ=X1 and 0 otherwise).
Thus, we have reduced our problem - evaluating the quality of approximating the unknown probability p1 with the empirical frequency ν1/N - to an already researched problem: evaluating the probability of a Bernoulli random variable β taking the value 1, given multiple experiments with this random variable producing results
β1, β2,...βN.

As we know, the above mentioned sum
ν1=β1+β2+...+βN
has expectation N·p1 and variance N·p1(1−p1).
The empirical frequency
ν1/N = (β1+β2+...+βN)/N
(a sample average of the results of experiments) has a distribution close to Normal (according to Central Limit Theorem) with expectation p1 (and, therefore, is used as an unbiased approximation of p1) and variance p1(1−p1)/N, which we want to evaluate to determine the quality of approximation.

The quality of approximation of p1 with ν1/N is measured by its standard deviation from expected value.
Crude evaluation of this standard deviation, which depends only on the number of experiments N, equals, as we know from the properties of Bernoulli random variables, σmax = 1/(2√N). So, with 95% certainty, we can say that the approximation of probability p1 with empirical frequency ν1/N is an unbiased evaluation with a margin of error not exceeding 2σmax = 1/√N.

More precise (albeit, slightly less certain) evaluation of the quality of our approximation of p1 with ν1/N can be obtained using the sample variance of random variable ν1/N.
First, let's calculate the sample variance of β - an average of square deviations of its values βj from its sample average
Σ(βj)/N = ν1/N.
This sample variance of β equals to:
s²N−1 = [(β1−ν1/N)²+...+
+(βN−ν1/N)²]/(N−1)
Out of N experiments β1, β2,...βN the value β=1 occurred ν1 times and the value β=0 occurred N−ν1 times. Therefore, our sample variance of β equals to:
s²N−1 = [ν1·(1−ν1/N)² +
+ (N−ν1)·(0−ν1/N)²]/(N−1) =
= [ν1(N−ν1)]/[N(N−1)]

From this we can calculate a sample variance of our estimate of probability p1 ≅ ν1/N:
σ² = [ν1(N−ν1)]/[N²(N−1)]

Expanding this from evaluation of probability p1 to any pj, with 95% certainty we can say that νj/N can be used as an approximation of probability pj with margin of error
2σ = 2√{[ν(N−ν)]/[N²(N−1)]}
In the above formula νj is substituted with ν for brevity and because it's applicable to any νj.
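The final formula is easy to package as a helper function (a sketch; the name is made up):

```python
import math

def margin_of_error(nu, N):
    """95% margin 2*sigma = 2*sqrt(nu*(N - nu) / (N^2 * (N - 1)))
    for estimating a probability p by the empirical frequency nu/N."""
    return 2 * math.sqrt(nu * (N - nu) / (N ** 2 * (N - 1)))

# Survey example from Task A: 1480 "successes" out of 4000 experiments
print(round(margin_of_error(1480, 4000), 4))   # 0.0153
```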

## Tuesday, March 15, 2016

### Unizor - Probability - Density Distribution

Unizor - Creative Minds through Art of Mathematics - Math4Teens

As we mentioned, cumulative probability distribution function F(x)=Prob{ξ less than x} contains all the information about our random variable ξ, sufficient to determine the probability of any event associated with this random variable.
This cumulative function is applicable to both discrete and continuously distributed random variables. However, for discrete variables it makes more sense to deal with probability mass distribution function because it makes more visible which values are more probable than others without any calculations. There is no such probability mass distribution function for continuously distributed random variables, but some reasonable equivalent is very desirable to define.

Here comes the probability density function, which is to the cumulative distribution function as speed is to distance.
We will define this function completely analogously to how speed is defined - as a limit of an average rate of change over a shrinking interval.

Assume, our random variable ξ is defined by its cumulative probability distribution function F(x). Then we can define the probability of any event related to our random variable, for instance
Prob{ξ is between a and b} = F(b)−F(a)
Let's take two small non-intersecting intervals of equal length among the values ξ might take, for instance [a1;b1] and [a2;b2]. Comparing F(b1)−F(a1) with F(b2)−F(a2), we can make a judgement about which values - those within the first interval or the second - are more probable.
Analogously, value
[F(b)−F(a)]/(b−a)
is a good measure of an average increment of probability on interval [a;b] and can be compared with similar average increment of probability on different intervals.

More than that, if we divide the whole set of real values our random variable might take into many small equal intervals and calculate the average increment of probability on each interval, we will have a picture similar to the probability mass distribution function for discrete random variables, only in this case it will be a distribution of probabilities not among discrete values, but among different small intervals. And the smaller the intervals, the more precisely the average probability increments obtained by this procedure would inform us about the comparative distribution of probabilities among different intervals.

Ultimately, for each value x our random variable might take, we can consider a function called probability density derived from cumulative probability distribution function F(x) as
f(x) = lim{d↓0} [F(x+d)−F(x)]/d
which plays the role of a "speed" of changing the probabilities around any value x our random variable can take.
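The limit can be checked numerically for a simple made-up cumulative distribution, F(x) = x² on [0, 1], whose density is f(x) = 2x:

```python
def F(x):
    """A sample cumulative distribution function: F(x) = x^2 on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x * x

# [F(x+d) - F(x)] / d approaches the density f(x) = 2x as d shrinks
x = 0.5
for d in (0.1, 0.01, 0.0001):
    print(d, (F(x + d) - F(x)) / d)   # tends to f(0.5) = 1.0
```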

## Wednesday, March 9, 2016

### Unizor - Probability - Cumulative Distribution Function

Unizor - Creative Minds through Art of Mathematics - Math4Teens

Notes to a video lecture on http://www.unizor.com

Cumulative Distribution Function

Let's consider a case of continuously distributed random variable ξ.
For instance, ξ is the result of measuring the weight of a tennis ball (assuming our ability to measure it absolutely precisely).
As we noted in the previous lecture, where we have introduced a concept of a continuous distribution, the probability of this weight ξ to be any exact real number is zero. However, the probability of taking a value in some range, say from 55g to 58g, is not zero.

This property of continuous distributions is fundamentally different from the corresponding property of discrete distributions, where the probability of a random variable to take any specific value it can take is non-zero.

Our task is to describe the continuous distribution of a random variable, but a similar approach - specifying all values with their corresponding probabilities - would not work because there are infinitely many values and each specific value occurs with probability zero.

Recall that the probability is in many ways similar to a measure. For instance, the length of any individual point is zero, but the length of any segment is some non-zero positive number.

Returning to random variables with continuous distribution of probabilities and using the above analogy, assume for definiteness that our random variable ξ (the weight of a tennis ball) can take values in the range from 50 to 60 and we would like to be able to describe all the probabilities for all the intervals of the weight. The function of numerical argument x (the weight) that might be very helpful as a representation of all this knowledge is the probability that our random variable ξ takes the value, which is less than x:
Fξ(x) = Prob{ξ less than x}

Now, if we want to know the probability that the weight is between w1 and w2, we can calculate it as
P[w1,w2] = Fξ(w2)−Fξ(w1).

Since all values of ξ are concentrated between 50 and 60, this function equals to 0 for all x smaller or equal to 50 and is equal to 1 for all values greater or equal to 60.
Function Fξ(x) is obviously monotonically non-decreasing. As x increases from 50 to 60, Fξ(x) increases from 0 to 1.
This function is called a distribution function of our random variable ξ.
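Under the uniform assumption this distribution function is straightforward to code (a sketch using the 50-60 gram range from the example):

```python
def F(x):
    """Uniform distribution function for the ball's weight on [50, 60] grams."""
    if x <= 50:
        return 0.0
    if x >= 60:
        return 1.0
    return (x - 50) / 10

# Probability that the weight falls between 55g and 58g
print(F(58) - F(55))   # about 0.3
```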

This distribution function fully defines all the probabilities on all the intervals of values of our random variable ξ. If the probability of the weight of a tennis ball to be in central interval is proportional to the width of this interval (a so-called uniform distribution of probabilities), the graph of the corresponding distribution function would look like this:

(where a is the minimum weight of a tennis ball, which we assume is 50 gram and b is its maximum weight - 60 gram.)

In case the weight is distributed non-uniformly from a to b, the distribution function would grow faster within more probable intervals to accommodate greater probability concentrated in these areas.
Here is an example:

We can construct a distribution function for discrete random variable as well. Consider the mass distribution function for this variable. As argument x moves from its minimum value to a maximum, the distribution function remains constant in-between the points of concentration of mass and jumps up every time it goes over the next mass, adding this mass to a cumulative probability.
Graphically, it looks like this:

Here discrete random variable X takes values x1, x2 etc. with probabilities P(x1), P(x2) etc.
Every time an argument x goes over the next value xi that random variable X can take with the probability PX(xi), the distribution function grows by that probability value up to 1 at the point of maximum and then stays equal to 1.
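For a discrete variable the distribution function is just a running sum of the probabilities of all values strictly less than the argument. A sketch (the values and probabilities are borrowed from the target-shooting example in the lecture below):

```python
def cdf(x, values, probs):
    """F(x) = Prob{X < x}: sum the probabilities of all values below x."""
    return sum(p for v, p in zip(values, probs) if v < x)

# Discrete variable taking 0, 2, 5, 10 with these probabilities
values = [0, 2, 5, 10]
probs = [0.32, 0.28, 0.23, 0.17]

print(cdf(1, values, probs))    # 0.32: only the jump at x=0 has occurred
print(cdf(11, values, probs))   # about 1.0: all probability accumulated
```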

As we see, the probability distribution function is, in a way, a universal function applied to both discrete and continuous distributions and sufficient to recreate the full picture of probability distribution, that is to determine a probability of any event associated with the random variable this function belongs to.

## Monday, March 7, 2016

### Unizor - Probability - Mass Distribution Function

Unizor - Creative Minds through Art of Mathematics - Math4Teens
Notes to a video lecture on http://www.unizor.com

Probability Mass Function

Here we will consider random variables with discrete distributions of probabilities.

Let's assume that random variable ξ represents the results of the following experiment.
We shoot a target that consists of three concentric circles marked with numbers of points assigned to each circle: 10 for the smallest circle in the center, 5 for the middle circle, 2 for the largest outer circle and 0 for outside the largest circle.
So, our random variable ξ in this experiment has values {0,2,5,10}.

Each participant in the sharpshooting competition has different skills and, therefore, each of them can be represented by a random variable that takes the above values but with different distribution of probabilities for different people.

We can describe the sharpshooting skills of any particular participant by the probability of this particular participant to hit each circle.
For instance, for participant A these probabilities might be:
Prob{ξ=0}=0.32
Prob{ξ=2}=0.28
Prob{ξ=5}=0.23
Prob{ξ=10}=0.17
(the sum of all probabilities must be equal to 1, of course)

The above is a kind of verbal description of the values and probabilities for our random variable.
Mathematicians, however, always prefer to deal with something more resembling a function. So, they have come up with a concept of mass distribution function that serves exactly this purpose.
Our task is to represent the distribution of probabilities of this random variable as an algebraic function and graphically.

Consider a function fξ(x) defined for a discrete random variable ξ for all real arguments x as follows:
f(x) = Prob{ξ=x}

It is important to note that this is a function defined for all real arguments x, regardless of whether ξ can or cannot take this particular value of an argument.
For those values of an argument that random variable ξ can take the function's value is a corresponding probability.
For all other values of an argument the function's value is zero since the probability of ξ to take this value is zero.
This function is called mass distribution function (MDF). It is specific for each discrete random variable and completely describes in an algebraic form the distribution of probabilities of this random variable.

In the example above with target shooting the mass distribution function for participant A looks like this:
f(0)=0.32
f(2)=0.28
f(5)=0.23
f(10)=0.17
and for all other arguments x the function value is zero.
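Participant A's mass distribution function can be sketched as a small Python function backed by a dictionary:

```python
# Probabilities for participant A; any argument not listed gets zero
pmf_A = {0: 0.32, 2: 0.28, 5: 0.23, 10: 0.17}

def f(x):
    """Mass distribution function: f(x) = Prob{xi = x}."""
    return pmf_A.get(x, 0.0)

print(f(5))     # 0.23
print(f(3.7))   # 0.0 - xi never takes this value
print(sum(pmf_A.values()))   # the probabilities sum to 1
```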

Since mass distribution function is a function in algebraic sense, we can analyze it as any other function. The most important way of dealing with this function is to represent it graphically.

This function is equal to 0 for all negative x,
jumps up to 0.32 at x=0,
then it takes the value of 0 for all x between 0 and 2,
jumps up to 0.28 at x=2,
then it takes the value of 0 for all x between 2 and 5,
jumps up to 0.23 at x=5,
then it takes the value of 0 for all x between 5 and 10,
jumps up to 0.17 at x=10,
then it takes the value of 0 for all x greater than 10.

As a conclusion,
mass distribution function (MDF) is a convenient way to represent algebraically and, especially, graphically the distribution of probabilities of discrete random variables.
Always make sure that your mass distribution function takes values that sum up to 1 (or 100% if probability is expressed as a percent).

WARNING
Sum of two mass distribution functions of two discrete random variables IS NOT a mass distribution function of a random variable that is the sum of the two original variables. A superficial reason is that a sum of two mass distribution functions is not a mass distribution function at all, since its values sum to 2, not 1. Obviously, there are deeper probabilistic reasons as well.
Similarly, a product of a constant (not equal to 1) and a mass distribution function IS NOT a mass distribution function for a new random variable that is a product of the original variable and that constant.