**Detecting non-random signals in "The Terratin Incident"**

Today’s question, utterly nonsensical even by IDFC standards, is nevertheless one that MUST BE ANSWERED before the episode essays can continue.

Just how good is Captain Kirk at calculating probabilities?

Long have scholars studied this topic, presumably, and at last, having seen “The Terratin Incident”, I can lend my voice to the doubtless spirited debate. Our discussion must start with Kirk’s own comments on the subject. Let’s check the tapes.

> Meaningless at the moment, Bones, but it was sent twice. Odds against that occurring in a totally random transmission are too high to ignore.

So is Kirk right? Just how unlikely is it to register the phrase ‘TERRATIN’ twice in the space of the same random transmission?

**Inspecting Morse**

As usual, there are a few basic assumptions we’re going to have to make here. The first is that whatever code has been used to assemble this message is only capable of transmitting 36 characters – the 26 letters of the modern Latin alphabet, plus the ten digits. You’d want a code like this to not get too complicated, so I don’t think this is a terrible starting point. Morse code, for example (in its most basic form), is based around these same 36 objects. We might object here to the idea we would need so many characters – the ancient Romans managed just fine with 21, which could represent both words and numbers – but the fact ‘TERRATIN’ both requires reading in English, and has no actual underlying meaning in that language, argues against the transmission code in question being one which requires translating from some other alphabet.

The next assumption is that, if we mistake a random transmission for a message, each of the 36 characters is equally likely to be the next one to be (incorrectly) recognised. Put another way, the random background noise we mistake for an ‘R’ is no more or less likely to be heard than the random background noise we mistake for a ‘7’. This is actually a potentially problematic assumption, as my reference to Morse code demonstrates. The various characters Morse code can describe aren’t equally complex – some require more information to deliver than others. This is a built-in limitation of its delivery system, which has only two active building blocks, the dot and the dash. If you wanted to equate each of the 36 characters to a unique and equally long combination of dots and dashes, you’d need each such combination to be six blocks long – there are only 32 combinations of dots and dashes which are five symbols long.

Obviously, a sonic code which requires six sounds for each character is a rather inefficient one. Plus, there’s the danger you miss the start of a message, and so have no way of knowing where each six-symbol block starts and ends. Instead, then, Morse code uses pauses between each sequence corresponding to a single character, and allows those sequences to vary in size. An ‘E’ is generated by a single dot, for instance, whereas a ‘J’ starts with a dot but then requires three dashes. There are only 32 unique sequences of dots and dashes that are precisely five symbols long, but there are 62 unique sequences that are between one and five symbols long, and you can assign the most commonly-used characters to the shortest sequences – hence why the most common letter in English takes just one dot to express.
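If you’d rather check those counts than take my word for them, a couple of lines of Python will do it:

```python
# With two building blocks (dot and dash), there are 2**n sequences of exactly
# n symbols, and sum(2**k for k in 1..n) sequences of length at most n.
fixed_length_5 = 2 ** 5                              # 32 -- not enough for 36 characters
fixed_length_6 = 2 ** 6                              # 64 -- the shortest fixed length that fits
variable_up_to_5 = sum(2 ** k for k in range(1, 6))  # 2+4+8+16+32 = 62

print(fixed_length_5, fixed_length_6, variable_up_to_5)  # 32 64 62
```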

All of which is entirely neat and clever, but also potentially an issue for my purposes here. Any random signal mistakable for Morse code would need to feature three building blocks that could be mistranslated as dots, dashes and pauses. Under such circumstances, the frequency of the block we’re labelling ‘pause’ makes a big difference to how often different characters will appear in our ‘translated’ message. If the pauses happen very frequently, the sequence of characters you come up with won’t include many ‘J’s, because you won’t often get four non-pauses in a row, let alone the specific combination of the four that makes a ‘J’. On the other hand, long gaps between pauses mean you won’t get many ‘E’s, because the chances of following a pause with just one dot and then immediately getting another pause aren’t that high (another issue in this case would be the risk of getting six or more non-pauses between a pair of pauses, which wouldn’t correspond to any character in Morse).

Let me show you what I mean. I’ve simulated two random strings of one hundred elements, where each element is a dot, a dash or a pause. In the first example, the probability of each element being a pause is 75%. In the second example, the probability of each element being a pause is 25%. In each case dots and dashes are equally likely. I then translated them into Morse code, as you can see below:

First sequence (75% pauses): E, E, T, E, T, T, S, A, E, E, T, E, I, T, E, E, E, T, M, S, E

Second sequence (25% pauses): 8, ?, O, T, T, ?, G, O, N, E, Z, C, U, U, M, T

(The question marks in the second sequence represent combinations of dots and dashes that have no equivalent in Morse code, usually because they were simply too long.)
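(For anyone who fancies replicating the experiment, a sketch along the following lines will do it. The Morse table and the coin-flip choices are just one reasonable way to set things up, and the output will of course differ from run to run.)

```python
import random

# International Morse code for the 26 letters and 10 digits.
MORSE = {
    '.-': 'A', '-...': 'B', '-.-.': 'C', '-..': 'D', '.': 'E', '..-.': 'F',
    '--.': 'G', '....': 'H', '..': 'I', '.---': 'J', '-.-': 'K', '.-..': 'L',
    '--': 'M', '-.': 'N', '---': 'O', '.--.': 'P', '--.-': 'Q', '.-.': 'R',
    '...': 'S', '-': 'T', '..-': 'U', '...-': 'V', '.--': 'W', '-..-': 'X',
    '-.--': 'Y', '--..': 'Z', '.----': '1', '..---': '2', '...--': '3',
    '....-': '4', '.....': '5', '-....': '6', '--...': '7', '---..': '8',
    '----.': '9', '-----': '0',
}

def random_stream(n, pause_prob):
    """n symbols: a pause (' ') with probability pause_prob, else dot or dash 50:50."""
    return ''.join(' ' if random.random() < pause_prob else random.choice('.-')
                   for _ in range(n))

def decode(stream):
    """Split on pauses and translate each run of dots and dashes.

    Runs with no Morse equivalent come out as '?'.
    """
    return [MORSE.get(run, '?') for run in stream.split()]

for pause_prob in (0.75, 0.25):
    print(pause_prob, decode(random_stream(100, pause_prob)))
```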

In the first sequence, ten of the 21 characters are ‘E’ – that’s 47.6%, almost half the sequence. In the second sequence, the ratio is just one in 16, or 6.25%. The theoretical probabilities of the next character in each sequence being an ‘E’ are in fact 37.5% and 12.5% respectively, whereas the theoretical probabilities of the next character being a ‘J’ are 0.07% and 0.66% respectively. You may notice that in both cases, unlike an all-duck Shakespearean acting troupe, Juliets are less likely than Echoes. There are a couple of reasons for this, but the most immediate is that however frequent the pauses, the probability of getting *exactly* one symbol between a pair of pauses is always greater than the probability of getting *exactly* four.
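Those theoretical figures drop straight out of one small calculation, if we assume each symbol in the stream is independent: the chance the next decoded character is a given Morse pattern of length *k* is the chance the run between pauses lasts exactly *k* symbols, times the chance of that exact dot-dash pattern. In Python:

```python
def p_char(pattern, pause_prob):
    """Probability that the next decoded character is exactly `pattern`,
    given that a run of non-pauses has just started."""
    k = len(pattern)
    run_exactly_k = (1 - pause_prob) ** (k - 1) * pause_prob  # run length is k
    return run_exactly_k * 0.5 ** k                           # that exact dot/dash pattern

for q in (0.75, 0.25):
    print(f"pauses {q:.0%}: E = {p_char('.', q):.2%}, J = {p_char('.---', q):.2%}")
# pauses 75%: E = 37.50%, J = 0.07%
# pauses 25%: E = 12.50%, J = 0.66%
```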

We’re starting to wander into the weeds here, though, so let’s pull back and summarise. It’s not necessarily the case that in a coded signal (and therefore in random noise that we *mistake* for a coded signal) every character is equally likely. This would seem problematic for any exploration of whether Kirk is correct or not, because we have no idea of how the code system Spock believes is operating actually, well, *operates*.

Fortunately (well, fortunately for anyone actually enjoying this post), that needn’t stop us, because of a nifty little idea known as the Theorem of Total Probability. This piece of mathematical tastiness tells us that we don’t need to assume all characters are *actually* equally likely; we just need to assume that whatever biases exist among the characters are equally likely to favour any specific character. For a simple example of this in practice, consider a coin that’s just been handed to you by a Vulcan. If the Vulcan tells you the coin is fair, you would expect with probability 50% that the first time you toss it, it will come down heads. What happens, though, if the Vulcan tells you one side is three times as likely to land face-up as the other, but doesn’t say *which* side that is? You now know there is either a 75% chance of the coin coming down heads, OR a 75% chance of the coin coming down tails, but not which of those is correct. Enter the Theorem of Total Probability, which tells you that as long as you’re willing to believe the coin is equally likely to be biased in favour of heads or biased in favour of tails, you should still believe with probability 50% that the first toss will be heads.

You can think of this as an application of symmetry, if you like. If all biases are equally likely, there’s no sense worrying about them. It all comes out in the mathematical wash, so to speak.
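The Vulcan’s coin, worked through explicitly:

```python
# Weight each possible bias by how likely that bias is, then add up:
# P(heads) = P(biased to heads) * 0.75 + P(biased to tails) * 0.25
p_heads = 0.5 * 0.75 + 0.5 * 0.25
print(p_heads)  # 0.5
```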

**Space-Monkeys And Typewriters**

That’s the basics settled, then. So what do these assumptions tell us about the probability of seeing “TERRATIN” twice in the same transmission? Let’s start with the number of different combinations of eight characters. With 36 choices for the first character, and 36 choices for the second, there are 36 x 36 = 1296 combinations of two adjacent characters. If we carry this logic through, we can find the total number of combinations of eight characters by multiplying eight 36s together. This gives us a rather impressive total of just over two point eight *trillion* combinations, of which “TERRATIN” is just one. That means the probability of randomly generating eight characters and getting the word “TERRATIN” is one in two point eight trillion. To put that in perspective, the probability of playing the National Lottery every weekend next month and winning *twice* is around one in six point five trillion. Less likely, then, but not by all that much.
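The arithmetic, for anyone checking along at home:

```python
# All possible eight-character strings drawn from a 36-character alphabet.
combinations = 36 ** 8
print(combinations)      # 2821109907456 -- a shade over 2.8 trillion
print(1 / combinations)  # ~3.5e-13, the chance a random eight characters spell "TERRATIN"
```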

I actually ran a simulation of this, randomly generating one hundred million combinations of eight letters and digits (each letter and digit being equally likely, as discussed). You can see below the closest I managed to get to “TERRATIN”.

In short, then, spontaneously generating the word “TERRATIN” is very difficult, even when you try doing it a ridiculous number of times.

This is only part of the story, though. Yes, it turns out to be staggeringly unlikely that if we randomly generate eight characters, they’d spell out “TERRATIN”. But what if we generated sixteen characters, divided them into two sets of eight, and checked whether one or even both of them said “TERRATIN”? Now we have a truly sanity-blasting eight *septillion* (8,000,000,000,000,000,000,000,000) combinations, so it seems things are getting even worse. But the number of combinations including at least one incidence of “TERRATIN” has exploded too, because if the first set says “TERRATIN” the second set can say absolutely anything, and vice versa. So now, instead of one combination that says “TERRATIN”, we have five point six trillion combinations saying it at least once. The probability of this happening is now a mere one in one point four trillion. Do this with three sets of eight characters, and the probability of at least one saying “TERRATIN” becomes one in 940 billion, and so on.
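Counting that out explicitly – the “minus one” below stops us double-counting the single sixteen-character string that says “TERRATIN” twice:

```python
total_16 = 36 ** 16        # every possible sixteen-character string
hits_16 = 2 * 36 ** 8 - 1  # first half matches, or second half does, minus the overlap

print(f"{total_16:.1e}")                   # ~8.0e+24, eight septillion
print(f"{hits_16:.3e}")                    # ~5.642e+12, five point six trillion
print(f"one in {total_16 / hits_16:.3e}")  # one in ~1.4 trillion
```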

If we keep this up, then sooner or later, at least in theory, we should be able to find a sufficiently gargantuan number of randomly-generated eight-character sets that we’d expect at least one of them to say “TERRATIN”. In fact, that number is two trillion – that’s sixteen trillion individual characters. Not a short message, then. In truth, though, this is a huge overestimate, because we could find the word starting in one eight-character group and finishing in another. Two adjacent groups that read, for instance, “FTHBRTER” “RATIN6GW” would give us what we need, and so far, we haven’t taken that into account.

Instead of separating up our sequence, then, all we need do is recognise that every character represents the start of a sequence that either says “TERRATIN” or doesn’t. Considered this way, the two trillion sequences we needed above can come from two trillion and seven characters – though since the original two trillion value is just an approximation, getting precious about the need for the last starting value to have seven more characters after it wouldn’t really be sensible.
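Putting a number on “sooner or later” – this treats the overlapping windows as roughly independent, which they aren’t quite, but it doesn’t change the order of magnitude:

```python
import math

# Chance that one given eight-character window reads "TERRATIN":
p = 1 / 36 ** 8

# Smallest number of windows at which the chance of at least one hit passes 50%.
# Each character past the seventh starts one new window, so the character count
# is only seven more than this.
windows = math.log(0.5) / math.log(1 - p)
print(f"{windows:,.0f}")  # just under two trillion
```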

A rather more important issue is the amount of data this involves, and the amount of time it would take to process it. Two trillion characters is (very) roughly equivalent to four hundred thousand copies of Alan Moore’s million-word novel *Jerusalem*. How long would they take to download? I threw this question to the Twitter lions, having no knowledge of transfer rates myself, and the general opinion was that even using high-end contemporary broadband, that amount of data could easily take more than a day to download (this was about four years ago, for the record). Doubtless the *Enterprise* could do it much faster, but that’s still a ludicrous amount of information, far more than could reasonably be assumed to be an attempt to send a simple message. If Moore had mentioned he’d hidden a unique eight-letter code-word in every four hundred thousandth copy of *Jerusalem*, then I wouldn’t be able to feign surprise, but I also wouldn’t waste any time actually looking for it.
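My own back-of-envelope version of that download estimate, assuming one byte per character and a 100 Mbit/s line (a figure I’ve picked as representative of high-end broadband of the period, not one taken from the Twitter replies):

```python
chars = 2 * 10 ** 12    # two trillion characters, at one byte each
line_speed = 100e6 / 8  # 100 Mbit/s, converted to bytes per second

seconds = chars / line_speed
print(seconds / 86400)  # ~1.85 -- call it the better part of two days
```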

There’s a solution to all this as well, though. You’ve probably noticed that I’ve been focusing on the probability of seeing “TERRATIN” at least once, despite the whole point of Kirk’s observation being that it was surprising that it had occurred twice.

There’s a simple reason for this. What we’re studying here isn’t actually the chances of seeing the word “TERRATIN” repeated. It’s the chances of seeing *any* word repeated. Because presumably, any other word (or approximation to one), had it been repeated in the signal, would also have caught Spock’s attention.

**Repetition, Repetition, Repetition**

This is a common mistake when people think about the probability of an extreme event they’ve witnessed – it’s easy to forget that there might be plenty of other extreme events that would also have caused comment had they happened. We meet someone we know in a strange city, or even country, and ask ourselves “what were the chances?”. Well, the chances of meeting that particular person were pretty tiny, doubtless. The chances of meeting *someone* among the hundreds of people we can recognise on sight, though, are necessarily rather larger.

To (appropriately enough) take an example entirely at random, let’s say that instead of “TERRATIN”, the supposed message contained a repeat of the word “MARSNINI”. Presumably that would have been equally worthy of investigation. We could therefore think of the situation as follows – one random word appears at the start of the broadcast, and we then search for whether it appears again. This corresponds to the calculations I did above – once we happen to read “TERRATIN” or “MARSNINI”, we wouldn’t expect to see the word again in a random sequence unless it was trillions of characters long.

Now that we’re considering the case of any word being repeated, though, we can see this approach isn’t correct. That’s because every time you look at a new sequence of eight characters, it will either match an earlier sequence, or it will add one to the list of sequences that we’re looking to see occur again later on. This significantly shortens the length of transmission in which you’d expect to get a repetition, simply because of the speed at which you’re generating new sequences for later characters to be checked against.

One last example might help, I think. Once again I generated a sequence of characters, in which each letter and number was equally likely. The original sequence was 500 characters long, and I record below the eight-letter sub-sequences I obtained.

1. E U I Z T I Q I
2. U I Z T I Q I T
3. I Z T I Q I T V
4. G D M A L H I R
5. D M A L H I R M
6. M A L H I R M C
7. L M P Q P R W N
8. W S S S A C G T
9. S S S A C G T R
10. S S A C G T R F
11. B G V K I J F O

(Note the overlaps between sequences 1 to 3, 4 to 6, and 8 to 10. Each of these represents a sub-sequence of ten consecutive letters, which could therefore be considered as three different eight-letter sub-sequences.)

There’s an immediate and obvious problem here, which is that none of these eight-letter sub-sequences really looks like a word – “DMALHIRM” is probably the closest we’ve got. That’s not something I’m equipped to deal with, though. What would count as sufficiently recognisable as a pseudo-word is a topic way outside my wheelhouse; I’d imagine in any case that it would depend upon an individual’s experience of linguistics, if nothing else. I’m therefore going to ignore the issue of what would and would not be considered worth a follow-up, and simply note that by assuming every combination is viable, I know I’m underestimating the amount of time it will take to get a repeat. This will turn out not to actually matter, as we’ll see.

What’s important here is the speed at which new combinations are being generated. Eleven pseudo-words in 500 characters is actually on the low side – 40 would be closer to the average. This means that after 500 characters, we’d generally expect to either already have found a repetition (which remains incredibly unlikely), or that when checking the next 500 characters, there are already 40 different pseudo-words to look for matches with. If none of them appear, and none of the new pseudo-words match each other, that’s an average of 80 different ways to achieve a match in the next 500, and so on.

Using this approach, the number of random characters we need before a repetition becomes more likely than not falls to a mere fourteen million – not even three Alan Moore novels. Allow numbers to be included in our pseudo-words, and that shrinks further to just two million – less than half of a single ludicrously long novel.
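The two-million figure can be sanity-checked with the classic birthday-problem approximation, once again treating the overlapping eight-character windows as roughly independent (my shortcut, not an exact calculation):

```python
import math

# Birthday-problem rule of thumb: with N equally likely pseudo-words, a repeat
# becomes more likely than not after roughly sqrt(2 * N * ln 2) draws. Treating
# every eight-character window as a draw, each new character past the seventh
# opens one new window, so draws and characters are almost the same number.
N = 36 ** 8
windows_needed = math.sqrt(2 * N * math.log(2))
print(f"{windows_needed:,.0f}")  # just under two million
```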

For all that we’ve managed to downsize the length of random signal needed for a repetition to be likely, though, 400 pages of Alan Moore’s prose is still rather obviously too long to be mistaken for an interstellar communication. If the signal had instead been as long as, say, this post, the probability of a repetition would be somewhere around 0.005%. Even this essay feels like something too long to want to send through space, though (in addition to all the other reasons that would be an ill-advised idea).

In conclusion, then, Kirk is correct in his understanding of probability. To find one “TERRATIN” may be regarded as fortune; to find two looks like proof of intelligence. Forget Spinoza, this is where our original captain truly proves his academic chops.

We now return you to your regularly scheduled programme of episode deconstructions and terrible puns.
