Friday, February 19, 2016

Unizor - Statistical Distribution





Unizor - Creative Minds through Art of Mathematics - Math4Teens

Notes to a video lecture on http://www.unizor.com

Statistical Distribution

In this lecture we will attempt to analyze the real probabilistic laws governing the behavior of some random variable by observing the results of random experiments with it.
The task at hand is to estimate, for some random variable, the probabilities with which it takes certain values, given the values it has already taken in past experiments.

Let's separate our task into the following subtasks:

Task A. The values our random variable ξ takes are discrete and theoretically known (they are X1, X2... XK), but the probabilities of taking these values are unknown and we need to evaluate them.
Examples: rolling a standard cubical die (6 known outcomes), a Bernoulli random variable (only two values - 1 and 0).

Task B. The values our random variable ξ takes are discrete and unknown, so we have to determine with some certainty both the values it might take and the probabilities with which it takes them.
Examples: number of people injured in auto accidents during some randomly chosen month, amount of money some randomly chosen family in the United States spends on entertainment during a year.

Task C. Our random variable ξ takes a continuous range of values with theoretically known fixed boundaries - from a to b.
Examples: temperature of the water in a pot (from freezing to boiling), direction of the wind at a specific location (from 0° to 360° relative to the North).

Task D. Our random variable ξ takes a continuous range of values with one or both boundaries unknown.
Examples: weight of a Buddha statue randomly chosen among all statues in Bangkok (no reasonable theoretical upper boundary), distance between two randomly chosen European cities (no reasonable theoretical upper boundary, except the size of Europe).

All tasks are approached similarly. We always start from some N random experiments with our variable ξ and register the results of these experiments - the values it took.

Now we have to choose different paths to solve the four tasks mentioned above.

Task A
Assume that, as a result of N experiments, random variable ξ, which could theoretically take values X1, X2... XK, took value Xi in ni experiments, where i = 1, 2... K.
Obviously, n1+n2+...+nK=N.
Our best approximation for the probability that ξ takes value Xi is ni / N - the empirical frequency of occurrence of this event.
So, our statistical distribution of random variable ξ looks like
Prob{ξ=Xi} ≅ ni / N
The quality of this approximation improves as the number of experiments grows; how quickly it improves will be discussed separately.
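
The counting described for Task A can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name empirical_distribution and the ten recorded rolls are made up for the example and are not part of the lecture.

```python
from collections import Counter


def empirical_distribution(observations, known_values):
    """Estimate Prob{xi = X_i} as n_i / N for every theoretically known value X_i."""
    total = len(observations)  # N, the number of experiments
    if total == 0:
        raise ValueError("at least one observation is required")
    counts = Counter(observations)  # n_i for every value that actually occurred
    # Values that never occurred get frequency 0, since Counter returns 0 for them.
    return {value: counts[value] / total for value in known_values}


# Hypothetical record of N = 10 rolls of a die (made-up data for illustration).
rolls = [3, 1, 6, 3, 2, 5, 3, 6, 1, 4]
die_distribution = empirical_distribution(rolls, known_values=range(1, 7))
print(die_distribution)
```

With only 10 rolls the frequencies are a rough estimate; for a fair die they should drift toward 1/6 each as N grows.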

Task B
Assume that, as a result of N experiments, random variable ξ, which theoretically takes some unknown values with unknown probabilities, took values x1, x2... xM, respectively, m1, m2... mM times (the sum of all mi equals N).
We can reduce this task to one similar to the previous case by following these steps:
1. Choose the minimum xmin and maximum xmax values among empirical results xi.
2. Divide the range from xmin to xmax into K equal intervals:
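Δ1: [xmin=y0,y1),
Δ2: [y1,y2),
...
ΔK: [yK-1,yK=xmax]
Each interval includes its left end and excludes its right end, so that a value falling exactly on a boundary is counted only once; the last interval includes both ends, so that xmax is counted as well.
3. For each interval Δi between xmin and xmax, count the number of times our random variable ξ took a value within this interval. Call this number ni.
4. Use the empirical frequency with which the value of random variable ξ falls within each interval as a statistical distribution: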
Prob{ξ∈Δi} ≅ ni / N
Try to avoid cases with a small number of experiments (say, fewer than 100), since the evaluation of probabilities in these cases will be far from precise.
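
Steps 1 through 4 can be sketched in Python as follows. The function name interval_frequencies and the sample data are our own illustrative assumptions. We also assume that a value falling exactly on a boundary between two intervals is assigned to the interval on its right, with xmax assigned to the last interval, so that every observation is counted exactly once.

```python
def interval_frequencies(observations, num_intervals):
    """Split [x_min, x_max] into K equal intervals and return (edges, frequencies)."""
    total = len(observations)  # N
    if total == 0 or num_intervals < 1:
        raise ValueError("need at least one observation and one interval")
    x_min, x_max = min(observations), max(observations)  # step 1
    width = (x_max - x_min) / num_intervals              # step 2
    counts = [0] * num_intervals
    for x in observations:                               # step 3
        if width == 0:
            index = 0  # all observations are equal; everything lands in the first interval
        else:
            # Clamp so that x_max lands in the last interval rather than past it.
            index = min(int((x - x_min) / width), num_intervals - 1)
        counts[index] += 1
    edges = [x_min + i * width for i in range(num_intervals + 1)]
    frequencies = [n / total for n in counts]            # step 4
    return edges, frequencies


# Hypothetical monthly counts (made-up data; far fewer than the 100 recommended above).
observations = [12, 15, 15, 18, 20, 22, 22, 22, 30, 32]
edges, frequencies = interval_frequencies(observations, num_intervals=4)
print(edges)
print(frequencies)
```

The choice of K is left to the caller: too few intervals hide the shape of the distribution, too many leave most intervals nearly empty.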

Task C
Here we are dealing with a continuously distributed random variable with values theoretically in the range from a to b.
We do have N empirical values this random variable took in a series of experiments.
The approach we can suggest is similar to the previous one.
We divide the range from a to b into K equal parts and count how many times the values fell into each part, thus getting the discrete distribution that approximates the real continuous distribution of our random variable.
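
For Task C the only change is that the boundaries a and b come from theory rather than from the data. A minimal Python sketch, assuming a hypothetical function frequencies_on_known_range and made-up wind directions in degrees:

```python
def frequencies_on_known_range(observations, a, b, num_parts):
    """Divide the known range [a, b] into K equal parts and return n_i / N for each part."""
    total = len(observations)  # N
    if total == 0 or num_parts < 1 or not a < b:
        raise ValueError("need observations, at least one part, and a < b")
    width = (b - a) / num_parts
    counts = [0] * num_parts
    for x in observations:
        if not a <= x <= b:
            # A value outside the theoretical range signals bad data or a wrong range.
            raise ValueError(f"observation {x} lies outside [{a}, {b}]")
        # Clamp so that x == b is counted in the last part.
        index = min(int((x - a) / width), num_parts - 1)
        counts[index] += 1
    return [n / total for n in counts]


# Hypothetical wind directions in degrees relative to the North (made-up data).
directions = [5, 40, 85, 100, 120, 175, 190, 265, 300, 355]
wind_frequencies = frequencies_on_known_range(directions, a=0, b=360, num_parts=4)
print(wind_frequencies)
```

Because a and b are fixed in advance, the intervals stay the same from one series of experiments to the next, which makes the resulting frequencies directly comparable.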

Task D
If we have no theoretical knowledge about the range of values our random variable might take, we have no other choice but to use its minimum and maximum empirical values as artificial lower and upper boundaries (maybe, with some rounding).
When these boundaries are established, we continue the same way as in the previous task - divide the range between the boundaries into intervals and count the number of times the values fell into each interval to get empirical frequencies.
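
One possible reading of "with some rounding" is to widen the empirical range outward to a multiple of a convenient step. The Python sketch below follows that reading; the function names, the step of 10 and the statue weights are all our own assumptions, chosen only for illustration.

```python
import math


def rounded_boundaries(observations, step):
    """Round the empirical minimum down and the maximum up to a multiple of step."""
    lower = math.floor(min(observations) / step) * step
    upper = math.ceil(max(observations) / step) * step
    return lower, upper


def frequencies_between(observations, lower, upper, num_intervals):
    """Count observations in K equal intervals of [lower, upper] and return n_i / N."""
    width = (upper - lower) / num_intervals
    counts = [0] * num_intervals
    for x in observations:
        # Clamp so that a value equal to the upper boundary falls in the last interval.
        index = min(int((x - lower) / width), num_intervals - 1)
        counts[index] += 1
    return [n / len(observations) for n in counts]


# Hypothetical statue weights in kilograms (made-up data for illustration).
weights = [12.4, 57.9, 103.2, 8.7, 240.5]
lower, upper = rounded_boundaries(weights, step=10)
weight_frequencies = frequencies_between(weights, lower, upper, num_intervals=5)
print(lower, upper, weight_frequencies)
```

Keep in mind that boundaries obtained this way describe only the observed sample; a later experiment may well produce a value outside them.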
