Monday, August 1, 2016

Unizor - Statistics - Histogram





Notes to a video lecture on http://www.unizor.com


Histogram

A histogram is a graphical representation of statistical data that allows us to form an opinion about the probability distribution of an observed random variable.

Consider a case where we have no information about the probability distribution of our random variable and only observe the values it takes as we conduct one random experiment after another. The results of our experiments are some real numbers X1, X2, ..., XN, where N is the number of experiments.

We can rather primitively assign the probability of 1/N to each observed value and say that this approximates the distribution of probabilities of our random variable. If there are repeated values among our data, their probabilities would be added together.
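The assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name empirical_distribution and the sample values are hypothetical, not part of the lecture.

```python
from collections import Counter
from fractions import Fraction


def empirical_distribution(values):
    """Give each observation probability 1/N; repeated values accumulate."""
    n = len(values)
    counts = Counter(values)
    return {value: Fraction(count, n) for value, count in counts.items()}


# Hypothetical observations, chosen only for illustration.
sample = [2, 5, 2, 3, 5, 2, 7, 3]
distribution = empirical_distribution(sample)
for value in sorted(distribution):
    print(value, distribution[value])
```

Exact fractions are used here so that it is easy to see that the assigned probabilities add up to 1.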

It might work in simple cases like rolling a die. We will have only six possible values, the numbers from 1 to 6, and, as our experiments continue, the accumulated relative frequency of each number will approach the probability of that number occurring. For an ideal die these frequencies will each be close to 1/6.
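A small simulation can illustrate how the frequencies settle near 1/6. The sketch below assumes a pseudo-random generator is an acceptable stand-in for an ideal die; the function name die_frequencies, the seed, and the numbers of rolls are arbitrary choices made for this illustration.

```python
import random
from collections import Counter


def die_frequencies(num_rolls, seed=0):
    """Simulate num_rolls rolls and return the relative frequency of each face."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}


# The frequencies should move closer to 1/6 (about 0.167) as rolls accumulate.
for num_rolls in (60, 600, 60000):
    freqs = die_frequencies(num_rolls)
    print(num_rolls, [round(freqs[face], 3) for face in range(1, 7)])
```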

In more complicated practical cases we might not have a predefined set of values our random variable can take and, for a random variable with a continuous distribution, we theoretically cannot observe them all.

A histogram presents a practical solution to this problem.
First of all, knowing the results X1, X2, ..., XN of N experiments with our random variable, we have to determine the intervals of values into which we would like to group our data. For example, a reasonable approach is to take the range from the minimum to the maximum value and divide it into a certain number of equal intervals called bins. In some cases the intervals can have different widths, but in most cases they are of equal width.
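The equal-width grouping described above might look as follows in Python. This is a minimal sketch: the function name equal_width_bins and the sample data are hypothetical, and the choice to make every bin closed on the left and open on the right, except the last one, which also includes the maximum, is an assumption of this sketch.

```python
def equal_width_bins(values, num_bins):
    """Split [min, max] into num_bins equal intervals and count values in each."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal, so the range has zero width")
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    edges[-1] = hi  # avoid a rounding error in the last edge
    counts = [0] * num_bins
    for x in values:
        # The maximum would get index num_bins, so it is moved into the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return edges, counts


# Hypothetical data, chosen only for illustration.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
edges, counts = equal_width_bins(values, 4)
print(edges)
print(counts)
```

With floating-point data, a value lying exactly on a bin boundary can land in a neighboring bin because of rounding, which is usually harmless for a histogram.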

It is desirable to have a sufficient number of bins to separate the results into distinct groups, and a sufficiently large number of experiments to fill each bin with a substantial number of results.

Having done this grouping, we present it graphically as a set of adjacent bars, each representing a group with the width proportional to the interval that defines the corresponding bin (usually, they have equal width) and height proportional to the number of values in this bin.

Here is an example. Suppose we measure the body temperature of each person who comes to a doctor. Say we have accumulated 100 values of temperature ranging from 35°C to 40°C. We can divide this range into bins of half a degree each and, for each bin, record how many patients have a temperature in that interval. The results are in the table below:

Range (t in °C)        Quantity
35.0 ≤ t < 35.5        1
35.5 ≤ t < 36.0        3
36.0 ≤ t < 36.5        16
36.5 ≤ t < 37.0        38
37.0 ≤ t < 37.5        18
37.5 ≤ t < 38.0        15
38.0 ≤ t < 38.5        6
38.5 ≤ t < 39.0        1
39.0 ≤ t < 39.5        1
39.5 ≤ t ≤ 40.0        1

The data in this table can be presented graphically in the form of a histogram as follows.
On the X-axis we mark all points of division between bins: 35, 35.5, ..., 39.5, 40.
Then we construct a rectangle above each interval with a height corresponding to the number of people whose temperature falls into the corresponding range.
Thus, above segment [35;35.5] the rectangle will have a height of 1, above [35.5;36] a height of 3, above [37;37.5] a height of 18, etc.
The resulting bar chart is a histogram of the distribution of temperatures based on the statistical data we have.
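As a rough illustration of the construction just described, the sketch below prints the histogram as text, with one row per bin and one '#' per patient, so the bars run horizontally rather than vertically. The counts are those of the temperature example; the function name text_histogram and the output layout are choices made only for this sketch.

```python
# Points of division between bins (35.0, 35.5, ..., 40.0) and the bin counts
# from the temperature example.
edges = [35.0 + 0.5 * i for i in range(11)]
counts = [1, 3, 16, 38, 18, 15, 6, 1, 1, 1]


def text_histogram(edges, counts):
    """Return one text row per bin; the bar length equals the bin count."""
    rows = []
    for left, right, count in zip(edges, edges[1:], counts):
        rows.append(f"{left:4.1f}-{right:4.1f} | {'#' * count} {count}")
    return rows


print("\n".join(text_histogram(edges, counts)))
```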

Is this histogram an exact representation of the real distribution of probabilities of temperature? Absolutely not. But it is a good approximation, and the approximation will be more precise if we have more data distributed into a larger number of smaller bins.
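One way to read the histogram as an approximation of probabilities is to divide each bin count by the total number of observations. The sketch below does this for the counts of the temperature example; the variable names and the particular interval chosen for the estimate are only illustrative.

```python
# Bin counts from the temperature example, in half-degree bins starting at 35.0.
counts = [1, 3, 16, 38, 18, 15, 6, 1, 1, 1]
total = sum(counts)

# The relative frequency of each bin estimates the probability of that bin.
relative = [count / total for count in counts]

# Estimated probability of a temperature in [36.5, 37.5): the bins with
# indices 3 and 4.
estimate = relative[3] + relative[4]
print(round(estimate, 2))
```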

An important question is how to choose the intervals into which the entire range of observed statistical data is divided.
The recommended "rule of thumb" for N experimental results is to use √N intervals. So, if you have 100 experimental results, use √100 = 10 intervals of the same size in the range from the minimum to the maximum observed value.
Another recommended formula for the number of intervals is log₂(N)+1, which should work well for a larger number of experiments with presumably Normally distributed random variables.
There are other, more complex formulas, but they are outside the scope of this course.
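The two rules above can be compared with a short Python sketch. Neither rule says how to handle a result that is not a whole number, so rounding up is an assumption made here; the function names sqrt_rule and log_rule are also hypothetical.

```python
import math


def sqrt_rule(n):
    """Number of bins by the square-root rule, rounded up (assumption)."""
    return math.ceil(math.sqrt(n))


def log_rule(n):
    """Number of bins by the log2(N)+1 rule, with log2(N) rounded up (assumption)."""
    return math.ceil(math.log2(n)) + 1


for n in (100, 1000, 10000):
    print(n, sqrt_rule(n), log_rule(n))
```

For the 100 results of the temperature example, the square-root rule gives 10 bins, which matches the number of half-degree bins used there.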
