Tuesday, August 17, 2010

How to distinguish fake coin tosses...

Dilip posted an interesting problem the other day: if you were a professor teaching probability theory, and asked your students to toss a coin 100 times and write down the sequence of heads and tails that they obtained, and some of them cheated and simply made up a sequence of heads and tails, how could you tell?

It is interesting because, generically, very few sequences are truly random; and the human mind is certainly incapable of randomness. But what signs of non-randomness could you look for? The answer that Dilip apparently had in mind is that truly random sequences will usually contain long "runs" of heads or tails (say, 6 or more heads, or 6 or more tails, in succession in 100 coin tosses), whereas people making up a sequence will perceive such runs as non-random and correct for them. But this is not a very reliable test by itself: the probability of a run of 6 or more (heads or tails) in 100 tosses is about 80%, so a truly random sequence will fail this test about a fifth of the time, while a smart student will probably throw in such runs deliberately. I argued that if one combined this with various other tests, one should be able to tell quite reliably. But at the moment I am unsure whether 100 tosses are enough for this. Certainly nobody could say for sure whether a sequence of 5 tosses was generated by a coin or a human. Meanwhile, I strongly doubt that any human could generate a sequence of 1000 random symbols (coin tosses, numbers, whatever) that would fool the statistical tests. But can one reliably tell which of the following two is "random" and which isn't?

Sequence 1:


Sequence 2:

Neither of these was generated by tossing a coin. One was made by me, by pressing "h" or "t" "randomly" on a keyboard (i.e., I, a human being aware of the usual "pitfalls", was trying to generate a random sequence, fairly rapidly and without thinking much about it). The other was made using the pseudorandom number generator in Python, which is based on the Mersenne twister. I would guess that the Mersenne twister is more random than I am: what I would be interested in knowing, from any experts reading this, is whether one of the above sequences can be demonstrated, statistically, to be so non-random that the chances are very high it was generated by me and not by the program. I am, moreover, interested in the method and not the answer (which you have a 50% chance of getting right by guessing). If you confidently identify the Mersenne twister-generated sequence as the non-random one, it is safe to say that the problem is with your test and not with the Mersenne twister.
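For concreteness, here is a sketch of how such a machine-generated sequence can be produced with Python's random module (which implements the Mersenne twister); the function name and seed are my own choices, not the ones used for the sequences above:

```python
import random

def random_tosses(n=100, seed=None):
    """Generate n pseudorandom 'h'/'t' tosses using the Mersenne twister."""
    rng = random.Random(seed)
    return "".join(rng.choice("ht") for _ in range(n))

seq = random_tosses(100, seed=42)
print(seq)
print(len(seq), seq.count("h"), seq.count("t"))
```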

The "bonus question" that came up in Dilip's blog is, what is the probability of observing a run of 6 or more heads or tails (let's call them 6-runs) in 100 coin tosses?

Kovendhan gave an approximate answer which seemed to work well in practice, but it turns out that he made two errors and a corrected calculation does poorly. The probability of a particular choice of 6 successive tosses (let's call it a 6-mer) being all heads or all tails is (1/2)^5 (it is (1/2)^6 for all heads, and the same for all tails). The probability that it is neither all-heads nor all-tails is then 1-(1/2)^5. There are 95 ways to choose 6 successive tosses in 100. The probability that none of these 95 is all-heads or all-tails is (1-(1/2)^5)^95 = about 0.049. The probability of at least one stretch of 6 identical tosses (all heads or all tails) would then seem to be 0.951 -- pretty near certain. The approximation consists of neglecting the fact that overlapping 6-mers are not independent: e.g., if your chain of tosses is HHTHTT, not only does this fail to give a 6-run, but it will also fail to do so on any of the next 3 tosses at least.
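In code, the corrected approximation is a two-line check (plain arithmetic, nothing assumed beyond the formulas above):

```python
# Naive approximation: treat the 95 windows of 6 consecutive
# tosses as if they were independent.
p_6mer = (1 / 2) ** 5          # a given window is all heads or all tails
p_none = (1 - p_6mer) ** 95    # no window is a 6-run (independence assumed)
print(round(p_none, 3))        # probability of no 6-run, approximately
print(round(1 - p_none, 3))    # probability of at least one 6-run
```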

Meanwhile, Kovendhan used (1/2)^6 instead of (1/2)^5 for the individual 6-run, and 94 for the number of 6-mers, which yields 0.772; but he reported 0.781 -- I'm not sure how he got that. My numerical experiments suggested the true number is a little above 0.8, which is close to Kovendhan's fortuitously incorrect calculation of his approximate method, but quite far from what his approximation should really give.
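The true probability is easy to estimate by simulation; a minimal sketch (the trial count and seed are arbitrary choices of mine):

```python
import random

def has_6run(seq):
    """True if seq contains 6 or more identical symbols in a row."""
    return "h" * 6 in seq or "t" * 6 in seq

rng = random.Random(0)
trials = 20000
hits = sum(
    has_6run("".join(rng.choice("ht") for _ in range(100)))
    for _ in range(trials)
)
print(hits / trials)  # comes out a little above 0.8
```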

The moral is, be aware of approximations and, if possible, have an estimate of their effect. Kovendhan's approximation is in fact very similar to Linus Pauling's when he calculated the zero-temperature entropy of ice. In the crystal structure of ice, each oxygen atom has four neighbouring oxygen atoms, in a locally tetrahedral arrangement. Along each of these O-O bonds is a hydrogen atom, but not centrally located: two H atoms must be closer to the central O atom, and two must be closer to the neighbouring O atoms. Globally, there are many ways to satisfy this; to count the ways, Pauling essentially assumed that the configurations of the local "tetrahedra" could be counted independently. This is like Kovendhan's assumption about 6-mers; unfortunately, while two tetrahedra in ice share at most one corner, two 6-mers in the toss sequence can share up to 5 tosses, which makes the 6-mers much less independent than the tetrahedra in ice.

I attempted an answer, which I give below. A commenter observed that a formula exists for the probability of a run of 6 heads, and the same formula gives the probability of a run of 6 tails; however, the probability of a run of 6 heads or 6 tails is trickier.

My approach (which will be recognisable by physicists, computer scientists and others) was this: supposing we can calculate the probability P(N) that there are no 6-runs in N tosses, how do we calculate P(N+5), the probability that there are no 6-runs in N+5 tosses? If we can do this, we can start from P(5) = 1 (there are no 6-runs in 5 tosses, obviously) and build it up from there.

Naively, a 6-run can be built up at any of the five tosses between N and N+5: for example, if the previous five tosses up until N were all heads, then tossing heads again will give a 6-run. So we must consider all possibilities for both the five tosses from N-4 to N, and the five tosses from N+1 to N+5: ten tosses in total. There are 2^10 = 1024 possibilities for these 10 tosses, so it looks like a counting problem. The number of possibilities of "successful" runs in these 10 tosses can be enumerated as follows (where "N" stands for "any", and one can replace H with T in all these examples):
HHHHH HNNNN (32 possibilities: 2^4 = 16 for the last 4 tosses, times 2 for replacing H with T)
THHHH HHNNN (16 possibilities)
NTHHH HHHNN (16 possibilities)
NNTHH HHHHN (16 possibilities)
NNNTH HHHHH (16 possibilities)
for a total of 96 possibilities. There are then 1024-96 = 928 cases where there is no run of 6 heads or tails. So the "naive" answer is P(N+5) = P(N)*928/1024. If we want P(100), we can use this to go all the way back to P(5) = 1:
P(100) = P(95)(928/1024) = P(90)(928/1024)^2 = ...
to get
P(100) = P(5)(928/1024)^19 = 0.154 roughly,
so the probability of at least one 6-run in 100 tosses is about 0.846.
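The naive recursion is trivial to evaluate numerically:

```python
# "Naive" recursion from the text: P(N+5) = P(N) * 928/1024,
# starting from P(5) = 1 and stepping up to P(100).
p = 1.0
for _ in range(19):          # 5 -> 100 in steps of 5
    p *= 928 / 1024
print(round(p, 3))           # probability of no 6-run
print(round(1 - p, 3))       # probability of at least one 6-run
```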

Unfortunately, this is still not correct, because the 1024 possibilities -- and the 96 possibilities with runs -- are not all equally probable: they are conditional on the premise that there is no 6-run up until the N-5th toss. So, for example, the case "HHHHH TNNNN" should be weighted by the fact that it would be disallowed for half the possible sequences prior to this (the ones that ended with H); the case THHHH HNNNN would disallow far fewer sequences (the ones that end in TTTTT, which is only 1 in 32 sequences).

Therefore, I considered as prior possibilities the five tosses numbered N-9 to N-5 (that is, the five tosses preceding the current ten-mer), classified by the run they end in. There are the following ten possibilities that should be distinguished: NNNHT, NNHTT, NHTTT, HTTTT, TTTTT, and the same five with H and T exchanged. Each of these has a "prior probability" (1/4 for NNNHT, 1/32 for HTTTT, etc.), and each disallows a fraction of the 1024 10-mers we are considering, as well as a fraction of the 96 10-mers that contain 6-runs. If you do the calculation separately for each of these 10 prior possibilities, and then weight the average by their prior probabilities, you end up with
P(no 6-run in 100 tosses) = (1481219/1614480)^19 = 0.195 roughly,
and P(at least 1 6-run) = 0.805 roughly.

This, as it turns out, is in excellent agreement with what I had previously obtained numerically.

Final question for anyone who has read this far: is this the exact answer? (I posted on Dilip's blog that I think it is, but don't go by that.) It is, however, "good enough" in my opinion, by two measures: the remaining error is small; and the issues have (I think) been adequately illustrated that the answer can be (laboriously) improved, if need be.

I'd be fascinated, however, if there is an exact answer on the lines of the "run of heads" answer linked above.
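For what it's worth, the exact probability can be computed by a different route: counting, by dynamic programming on the length of the terminal run, the number of H/T sequences with no 6-run. This is my own check, not the weighted 10-mer calculation above:

```python
from fractions import Fraction

def prob_no_run(n, run=6):
    """Exact probability that n fair tosses contain no run of
    `run` or more identical tosses."""
    # count[r-1] = number of H/T sequences of the current length with
    # no forbidden run, ending in a run of exactly r identical tosses.
    count = [0] * (run - 1)
    count[0] = 2                   # length 1: "H" and "T"
    for _ in range(n - 1):
        new = [0] * (run - 1)
        new[0] = sum(count)        # switch face: terminal run resets to 1
        for r in range(1, run - 1):
            new[r] = count[r - 1]  # same face: terminal run grows by 1
        count = new
    return Fraction(sum(count), 2 ** n)

p_no = prob_no_run(100)
print(float(p_no), float(1 - p_no))
```

Comparing its output with the 0.805 figure above would settle how close that estimate is to exact.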


km said...

I didn't see you blog about Deolalikar's P ≠ NP paper. Any thoughts?

Anonymous said...

You begin to lose me in the third section, but how about this for a precise answer?


Kapil Paranjape said...

A fairly standard test for randomness is Maurer's test which checks
how compressible the sequence is. The problem with that test is that
it takes too long on data-sets that are large enough and it is not
significant on data-sets that are small! Still, why not use it?

Anyway, I just took your seq1 and seq2 and applied 'gzip -9' on them.
Here is what I got:

58 /tmp/seq1.gz
62 /tmp/seq2.gz

I then applied 'lzma --best' on both.

47 seq1.lzma
48 seq2.lzma

This seems to indicate that the first sequence is less random than
the second.

as said...

I just looked up Feller (Vol. 1, Sec. XIII.8): he discusses the case of consecutive H or T and writes down an explicit formula for the mean time before we get such a run. I am confident that the formula for the probability exists too, and is hidden in a mid-term question heap in the Princeton basement.

Thank you for putting up a solution. One question, though: I was wondering if you were over-counting a little. When looking from N-5 to N+5, how many times would you be counting solutions like HHHHHHHHHH (10 heads)? Maybe I missed something.

Rahul Siddharthan said...

km -- it's a bit beyond me, and others are doing a great job -- in particular Richard Lipton. The current consensus seems to be that the proof is flawed, probably fatally so, and the remaining question is whether any interesting results can still be retrieved.

Anonymous: well, that article also talks about heads specifically, not "heads or tails". But the approach is similar to what I used, except I build it up 5 at a time rather than 1 at a time, and get a formula rather than write a computer program.

Kapil -- I'll hold off commenting yet on whether you are right, but -- as you know -- a number like 62 or 48 is not very significant without knowing how much it will vary. So for example, if you generated 1000 random 100-long sequences of H/T, gzipped them all, and calculated the mean/variance of the zipped files, and then found that one of my sequences was well within your expected range and the other was well outside, then that would be pretty convincing...
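The calibration I have in mind could be sketched like this (using Python's zlib rather than the gzip command, so absolute sizes will differ from Kapil's by the header overhead; the sample size is arbitrary):

```python
import random
import zlib
from statistics import mean, stdev

# Compress many truly random 100-toss sequences to learn what
# compressed sizes a genuinely random sequence typically gives.
rng = random.Random(0)

def random_tosses(n=100):
    return "".join(rng.choice("ht") for _ in range(n))

sizes = [len(zlib.compress(random_tosses().encode(), 9)) for _ in range(1000)]
mu, sigma = mean(sizes), stdev(sizes)
print(mu, sigma)

def z_score(seq):
    """How unusual seq's compressed size is, in standard deviations."""
    return (len(zlib.compress(seq.encode(), 9)) - mu) / sigma

# A pathologically non-random sequence should stand out strongly:
print(z_score("h" * 100))
```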

as - no, that is why I am writing sequences like THHHHHHNNN -- to avoid counting examples that were already counted in HHHHHHNNNN.

Rahul Siddharthan said...

km - ps: about Deolalikar, I do find it disappointing that, after having obtained a huge amount of feedback from eminent people (including at least two Fields Medallists) for his earlier manuscript, he has chosen to restrict his newer manuscript to a "small number of researchers", and -- at least on his web page -- fails to acknowledge the usefulness of any of the previous feedback.

If he is really on to something, I would think the open method would be best.

as said...

Thanks, I completely missed the distinction between HHHH.. and THHH...

p said...

crude answer:
sequence 1 has 60H and 40T and starts with T
sequence 2 has 54H and 46T and starts with H

very likely you did sequence 2 by hand, and subconsciously
1. tried to have close to 50% tails (in reality, a split that close to 50-50 is only to be expected in much larger samples)
2. began with H

Rahul Siddharthan said...

Prithwiraj: nice answer and you're right.

The next question is, can one write a computer program to do the job? The program should take an input sequence, and return the probability that it was generated by a human (or other nonrandom process); and its confidence should increase with the size of the input.

p said...

Thanks. I was hoping I would be right, or might've looked quite silly with that answer :)

For the algorithm:
Collect a large number of human-generated sequences and an equal number of computer-generated sequences in a database (the more the better). Then use Bayesian methods (which I am not yet very familiar with) to determine the probability that a given input sequence is random.
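A toy version of this suggestion can be sketched as follows. Since no database of human-generated sequences is to hand, the "human" here is simulated by a generator that avoids repeats -- a crude assumption purely for illustration, and all names and parameters are mine:

```python
import math
import random

rng = random.Random(0)

def machine_seq(n=100):
    return "".join(rng.choice("ht") for _ in range(n))

def human_seq(n=100, p_repeat=0.4):
    # Crude stand-in for a human who subconsciously avoids runs:
    # repeats the previous symbol only 40% of the time.
    out = [rng.choice("ht")]
    for _ in range(n - 1):
        if rng.random() < p_repeat:
            out.append(out[-1])
        else:
            out.append("h" if out[-1] == "t" else "t")
    return "".join(out)

def repeat_fraction(seqs):
    """Fraction of adjacent pairs that repeat, over a set of sequences."""
    reps = sum(sum(a == b for a, b in zip(s, s[1:])) for s in seqs)
    total = sum(len(s) - 1 for s in seqs)
    return reps / total

# "Train" on simulated data from both sources.
p_human = repeat_fraction([human_seq() for _ in range(500)])
p_machine = repeat_fraction([machine_seq() for _ in range(500)])

def log_odds_human(seq):
    """Log-likelihood ratio of seq's repeat/switch pattern under the
    two trained models; positive favours 'human'."""
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a == b:
            score += math.log(p_human / p_machine)
        else:
            score += math.log((1 - p_human) / (1 - p_machine))
    return score
```

With real human-generated training data one would of course add more features than the repeat rate alone.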

Anonymous said...

Another test of randomness, one that I think humans are particularly likely to fail, is that the next result should be independent of the previous result. Humans will have a hard time making this true: P(N=H | N-1=H) will almost certainly not equal P(H). On your data, sequence 2 has a probability of only 39/99 that the next result is the same as the previous one, much less than the ideal 49.5/99; sequence 1, on the other hand, has a 55/99 chance, much closer to the ideal. A quick simulation suggests that only 456/10000 pseudorandom runs have a conditional probability as far from ideal as 39, whereas 3090/10000 have one as extreme as 55. The implied p-value for sequence 1 being non-random is 0.31, whereas that for sequence 2 is 0.046, just low enough to reject sequence 2's randomness at the standard cutoff of 0.05. A longer sequence would of course narrow the spread of the truly random sequences, while probably not changing much for the human, so this method should do better and better for longer sequences. Alternatively, it should also perform better on more naive humans, who will tend even more strongly to avoid runs. It looks like 100 flips is a borderline case for sophisticated humans.
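This test can be sketched as follows (my own code, not the commenter's; the exact counts will vary with the seed):

```python
import random

def repeats(seq):
    """Number of adjacent positions where a toss equals the previous one."""
    return sum(a == b for a, b in zip(seq, seq[1:]))

# Null distribution of the repeat count for 100 truly random tosses
# (ideally Binomial(99, 1/2), estimated here by simulation).
rng = random.Random(1)
null = [
    repeats("".join(rng.choice("ht") for _ in range(100)))
    for _ in range(10000)
]

def two_sided_p(k):
    """Empirical two-sided p-value for observing k repeats out of 99."""
    d = abs(k - 49.5)
    return sum(abs(x - 49.5) >= d for x in null) / len(null)

print(two_sided_p(39), two_sided_p(55))
```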

Rahul Siddharthan said...

Anonymous, nice point, and am wondering how you came across this blogpost after so many years!

Matrixw said...

I did the gzip test. I randomly generated 10000 different 100-length strings and zipped them. The lengths of the zipped results follow a normal distribution with mean=43.7683 and std=1.852. The first string's zipped length is 39 (pdf=0.00783; about 1.66% of the test results) and the second's is 43 (pdf=0.19765; about 17.44% of the test results).

Also, I did the full test of P(X_n | X_(n-1)). The result, as transition counts, is:
(format: [[count(t|t), count(h|t)], [count(t|h), count(h|h)]])
for string 1:
[[38, 22], [22, 17]]
for string 2:
[[14, 31], [31, 23]]

Further, I propose another method to test randomness.
First, if we interpret the sequence as a sequence of 1-d points, we can count the density of all possible 1-d points.
For example, for the sequence [0,0,1,1] there are two possible 1-d points; 0 occurs twice and 1 occurs twice, so the result is uniform over all possible 1-d points.

Then, what if we interpret the sequence as higher-dimensional points? For example, we can treat [0,0,1,1] as two 2-d points, (0,0) and (1,1), and the result is not uniform over all possible 2-d points [(0,0), (0,1), (1,0), (1,1)].

If a sequence is truly random, it should look uniform under every such k-dimensional interpretation.

Therefore, I did the test of 2-d interpretation on your sequences.
(format: [P([0,0]), P([0,1]), P([1,0]), P([1,1])])
first sequence:
[0.5, 0.16, 0.18, 0.16] std=0.14
second sequence:
[0.22, 0.3, 0.3, 0.18] std=0.052

From those results, I think the second sequence is more likely to be "truly random".
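The 2-d version of this test can be sketched as follows (assuming a '0'/'1' alphabet and disjoint pairs, as in the [0,0,1,1] example above):

```python
from collections import Counter
from statistics import pstdev

def pair_test(seq):
    """Split seq into disjoint pairs and compare the pair frequencies
    with the uniform value of 1/4 each; returns (frequencies, spread)."""
    pairs = [seq[i:i + 2] for i in range(0, len(seq) - 1, 2)]
    counts = Counter(pairs)
    freqs = {p: counts[p] / len(pairs) for p in ("00", "01", "10", "11")}
    return freqs, pstdev(freqs.values())

freqs, spread = pair_test("0011")
print(freqs, spread)
```

A smaller spread means the pair frequencies are closer to uniform, i.e. closer to what a truly random sequence should give.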