Sunday, 27 April 2014

Not-so-random post about randomness

One of the most common misconceptions about randomness is to confuse it with uniformity.
Can you tell which of the two distributions of dots below is the random one?

[Figure: Which is random?]

The most common answer would be the one on the right, which is wrong. The pattern on the right is in fact generated by applying a small wobble (or uncertainty) to a uniform, regularly spaced distribution. The pattern on the left, instead, has been generated using a pseudorandom [1] number generator in BASH (details in this blog post). Randomness allows "clumps" to form, and it is from those clumps that the unpredictable behaviour of randomness comes (unless, of course, you know the "shape" of the random distribution a priori).
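Here is a minimal sketch of how the two patterns can be produced with Bash's $RANDOM (this is not the original script from the linked post; the 100x100 extent, the grid step, the wobble size and the output file names are all arbitrary choices):

    #!/bin/bash
    # Left pattern: points drawn independently at random in a 100x100
    # square. Nothing prevents two points from landing close together,
    # so clumps and gaps appear naturally.
    for ((i = 0; i < 100; i++)); do
        echo "$((RANDOM % 100)) $((RANDOM % 100))"
    done > random.dat

    # Right pattern: a regular 10x10 grid where each point is displaced
    # by a small random wobble. The underlying order survives the jitter.
    for ((x = 5; x < 100; x += 10)); do
        for ((y = 5; y < 100; y += 10)); do
            dx=$((RANDOM % 7 - 3))    # wobble in [-3, 3]
            dy=$((RANDOM % 7 - 3))
            echo "$((x + dx)) $((y + dy))"
        done
    done > uniform.dat

The clumping appears even more clearly if we look at the distributions shown above in 1D rather than in 2D: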


[Figure: A random and a uniform distribution, shown in 1D]


It is clear that, in the case of a uniform distribution, we have a greater level of predictability. Imagine a series of events following the uniform distribution: you fill the histogram with events one by one until you reach the 5000 events that recreate the distribution above. The fuller the histogram gets, the more predictable it becomes: if one region lacks events, there is a higher chance that the next event will fall in that region.

You can picture a 2D uniform distribution in practice by imagining pouring a layer of marbles into a box, one by one. At the beginning their positions will look random, because the marbles can move around and take almost any position, but the fuller the box gets, the fewer spaces remain available, and soon it becomes easy to predict the only spots left for the next marble to go in.

The key point is that once a position is occupied, it cannot be occupied again. This is what allows for a level of predictability, and it is not true of random distributions. Randomness is not predictable, its profile is unknown by definition, and it allows the same position to be occupied more than once. This follows from the assumption that each random event is independent of the previous ones (a Poisson process). Maybe this is also why we tend to believe uniform distributions are "random": we are more familiar with them in practice, while genuinely random events are either mostly abstract or harder to visualize.
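The difference is easy to demonstrate in code. Here is a minimal sketch (the choice of 20 cells is arbitrary) comparing the two filling rules with Bash's $RANDOM:

    #!/bin/bash
    # Random filling: 20 independent draws from 20 cells. Repeats are
    # allowed, so some cells get hit twice while others stay empty.
    for ((i = 0; i < 20; i++)); do
        echo -n "$((RANDOM % 20)) "
    done
    echo

    # Marble-like filling: a random permutation of the same 20 cells.
    # Every cell is occupied exactly once; only the order is random.
    seq 0 19 | shuf | tr '\n' ' '
    echo

Running it a few times shows repeated numbers (the "clumps") in the first line, and never in the second.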

This explains people's tendency to misjudge randomness. It happens all the time with lotteries: if a number has not come up for a long time, we feel it is due soon, because we picture random events as uniform. In fact, every extraction is independent of the previous ones, so the probability of picking any given number is the same at every extraction: all numbers are equally probable, and none is ever "due".

An equivalent example is found in coin flipping. We feel that a run of repeated heads or tails in a sequence of coin flips is a rare event. In fact, the chance of obtaining heads or tails is still 50% no matter how many heads have already come up. For example, sequences of the kind THHTHTTHTH and TTTTTHHHHH are equally likely: each specific sequence of 10 tosses has probability (1/2)^10 = 1/1024.
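If you do not trust the argument, you can test the independence directly: simulate a long series of tosses and record what follows three heads in a row. A minimal sketch (the number of tosses is an arbitrary choice):

    #!/bin/bash
    # The coin has no memory: among 20000 simulated tosses, record the
    # outcome that follows a run of 3 heads. It is still heads ~50% of
    # the time.
    streak=0; after_heads=0; after_total=0
    for ((i = 0; i < 20000; i++)); do
        toss=$((RANDOM % 2))    # 1 = heads, 0 = tails
        if ((streak >= 3)); then
            after_total=$((after_total + 1))
            ((toss == 1)) && after_heads=$((after_heads + 1))
        fi
        if ((toss == 1)); then streak=$((streak + 1)); else streak=0; fi
    done
    echo "heads after 3 heads in a row: $after_heads/$after_total"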

I created a computer-generated set of tosses to work out how often runs of repeated heads or tails occur (details on how this is done are in here). These are the results for 100000 sequences of 15 tosses each, counting how many sequences contain a run of at least N identical outcomes:

 3 repetitions: 93894/100000
 4 repetitions: 64634/100000
 5 repetitions: 34667/100000
 6 repetitions: 16723/100000
 7 repetitions:  7789/100000
 8 repetitions:  3487/100000
 9 repetitions:  1584/100000
10 repetitions:   719/100000
11 repetitions:   332/100000
12 repetitions:   125/100000
13 repetitions:    53/100000
14 repetitions:    18/100000
15 repetitions:     7/100000

15 repetitions would mean a full set of heads or tails, which feels impossible, but it should happen about 0.006% of the time (the exact probability is 2/2^15, roughly 6 times in 100000). In fact, over 100000 sequences, it happened 7 times. Close enough.

The data shows that a run of three happens almost every time, and a run of four more than half of the time. Even a run of 6, almost half the sequence, is not that uncommon. I am sure most of us would refuse to call "random", in the common sense, anything with more than 4 repetitions, although this simple test shows that it is almost the norm.
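The linked post has the details of the actual code; a minimal Bash sketch of the same experiment could look like this (with fewer trials, since pure Bash is slow):

    #!/bin/bash
    # For each simulated sequence of 15 tosses, find the longest run of
    # identical outcomes, then count how many sequences contain a run
    # of at least k, for k = 3..15.
    trials=10000        # the post used 100000; reduced here for speed
    tosses=15
    declare -a count
    for ((t = 0; t < trials; t++)); do
        prev=$((RANDOM % 2)); run=1; max=1
        for ((i = 1; i < tosses; i++)); do
            cur=$((RANDOM % 2))
            if ((cur == prev)); then run=$((run + 1)); else run=1; fi
            ((run > max)) && max=$run
            prev=$cur
        done
        for ((k = 3; k <= max; k++)); do
            count[k]=$(( ${count[k]:-0} + 1 ))
        done
    done
    for ((k = 3; k <= tosses; k++)); do
        printf "%2d repetitions: %6d/%d\n" "$k" "${count[k]:-0}" "$trials"
    done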

Another great example of how bad we are at understanding randomness is given by this small application at Nick Berry's DataGenetics. It tries to distinguish a randomly generated sequence of tosses (from a real coin) from one you made up yourself. It is not infallible, as it is based on the Pearson chi-squared test (so it cannot always tell which is which), and after some trials it is easy to trick it into believing your sequence is random. Yet, I am sure it will give you a better understanding of randomness!
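To get a feel for the idea (this is not DataGenetics' actual algorithm, just a toy version of such a test), here is a sketch that checks whether the head/tail balance of a sequence is compatible with a fair coin, using a Pearson chi-squared statistic with one degree of freedom; the input sequence is made up:

    #!/bin/bash
    # Toy chi-squared test: compare the observed head/tail counts of a
    # sequence against the 50/50 expectation for a fair coin.
    sequence="HTHHTHTTHTHHHTTHTTHH"    # made-up example input
    h=$(grep -o "H" <<< "$sequence" | wc -l)
    t=$(grep -o "T" <<< "$sequence" | wc -l)
    awk -v h="$h" -v t="$t" 'BEGIN {
        e = (h + t) / 2.0                      # expected count per face
        chi2 = (h - e)^2 / e + (t - e)^2 / e
        # 3.84 is the 95% critical value for 1 degree of freedom
        verdict = (chi2 > 3.84) ? "suspicious" : "consistent with a fair coin"
        printf "chi2 = %.3f (%s)\n", chi2, verdict
    }'

A real test presumably also looks at features like run lengths, which is exactly where made-up sequences give themselves away: as seen above, we avoid long runs far more than a real coin does.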



[1] I have used the word pseudorandom because computer programs can only simulate randomness; true randomness can only be found in some not completely understood natural phenomena. Being a simulation, generated random numbers come in different "qualities". I am not sure about the quality of the generator used in BASH, but it is usually good to stick with a higher-quality random number generator for more serious business (e.g. the GSL).