Substitution ciphers

1. Caesar cipher
2. Breaking the Caesar cipher
3. Breaking other simple substitution ciphers
4. Vigenère Cipher
5. Substituting blocks of letters
6. Playfair cipher

1. Caesar cipher

Encryption has been around at least since the days of Julius Caesar. Caesar may have used different ciphers, but the one that bears his name, the Caesar cipher, is a simple substitution cipher: each letter gets replaced with the letter that comes 3 places after it in the alphabet. For example, an “A” would be replaced by a “D”, a “B” replaced by an “E”, and so forth.

Similar schemes, but shifting the alphabet by a different amount, could also be used. One common variant, called “ROT13”, shifts the letters by 13; since the modern English alphabet has 26 letters, shifting by 13 twice gives you the original message, which means that the encryption method is exactly the same as the decryption method.

Definitions: In cryptography, we talk about different texts. The message prior to encryption, or after decryption, is called the plaintext. After it is encrypted, it becomes the ciphertext. Data that does not get encrypted is called cleartext.

Choose an amount by which to shift the alphabet, and enter a message to be encrypted.

Try decrypting a message too. After you've encrypted a message, click the “Copy ciphertext to plaintext” button, and then change the shift amount (set it to the alphabet length minus the previous shift amount) to get the original message.
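The shifting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `caesar_shift` is made up for this example, and it assumes the modern 26-letter English alphabet, with spaces and punctuation left unchanged.

```python
import string

ALPHABET = string.ascii_uppercase  # modern English alphabet, 26 letters


def caesar_shift(text: str, shift: int) -> str:
    """Shift each letter of `text` by `shift` places, wrapping around after Z."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            index = (ALPHABET.index(ch) + shift) % len(ALPHABET)
            result.append(ALPHABET[index])
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)


ciphertext = caesar_shift("HELLO", 3)
print(ciphertext)                                   # KHOOR
print(caesar_shift(ciphertext, 26 - 3))             # HELLO
print(caesar_shift(caesar_shift("HELLO", 13), 13))  # HELLO
```

Decrypting with a shift of 26 - 3 = 23 is the same as shifting back by 3, which is why the alphabet length minus the original shift recovers the message.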

2. Breaking the Caesar cipher

While we don’t know exactly how effective the Caesar cipher was in Julius Caesar’s day, it was probably reasonably effective, no doubt aided by the low literacy rates of the time. However, the Caesar cipher is actually quite easy to decipher, even if you don’t know the key.

Suppose that you know that a message is encrypted using this method, but you don’t know by how much the alphabet is shifted. Under the Classical Latin alphabet, there would only be 23 possibilities for the shift. It would not be too difficult to try all 23 possibilities by hand. You don’t need to try to decrypt the whole message each time; just try a word or two at a time, and only continue decrypting when they end up being real words. With a modern English alphabet, you would have a little bit more work to do, but 26 tries still isn’t that much.

Moreover, depending on the language that the message is written in, you may be able to gather some hints from the message. For example, in English, there are only two 1-letter words: “I” and “a”. So if the encrypted message contains a 1-letter word, say “N”, then you would only need to try the shifts that would turn the “N” into “I” or “A”: in this case, -5 or 13. (A message might contain other 1-letter “words”: for example, the sender may use a shorthand form of writing some words, such as “B” for “be” or “U” for “you”, or may use initials in place of people’s names. But guessing that it is an “I” or an “A” is a good place to start.)
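As a small sketch of that arithmetic, the following computes the decryption shifts that turn a 1-letter ciphertext word into “I” or “A”. The function name `candidate_shifts` is hypothetical, and shifts are reported in the range 0 to 25, so a shift of -5 shows up as 21.

```python
import string

ALPHABET = string.ascii_uppercase


def candidate_shifts(cipher_letter: str, likely_plain: str = "IA") -> dict:
    """Map each likely plaintext letter to the shift (0-25) that produces it."""
    c = ALPHABET.index(cipher_letter.upper())
    return {p: (ALPHABET.index(p) - c) % 26 for p in likely_plain}


print(candidate_shifts("N"))  # {'I': 21, 'A': 13}
```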

For simplicity’s sake, we only offer the Modern English alphabet here. Sorry to all the fans of the Classical Latin alphabet.

Try shifting the alphabet by different amounts until the plaintext makes sense.

Trying to decrypt a message by trying all possibilities is called a brute force attack. If there are only a few possibilities to try, such as in this case, it may be easy to do. If there are a lot of possibilities, then it becomes more difficult, or even practically impossible: with some ciphers, it would take computers longer than the age of the universe to try all possibilities.
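A brute force attack on the Caesar cipher might look like the following sketch, which simply prints every possible shift and leaves it to the reader to spot the line that makes sense. The helper names `shift_text` and `brute_force` are invented for this example, and the 26-letter English alphabet is assumed.

```python
import string

ALPHABET = string.ascii_uppercase


def shift_text(text: str, shift: int) -> str:
    """Shift each letter by `shift` places; non-letters are left unchanged."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )


def brute_force(ciphertext: str) -> list:
    """Return every (shift, candidate plaintext) pair for a Caesar ciphertext."""
    return [(s, shift_text(ciphertext, -s)) for s in range(len(ALPHABET))]


for shift, candidate in brute_force("KHOOR ZRUOG"):
    print(shift, candidate)  # the line for shift 3 reads HELLO WORLD
```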

3. Breaking other simple substitution ciphers

Now that we know how to break the Caesar cipher, let’s consider substitution ciphers that aren’t based on shifting the alphabet. There’s no reason, when creating a cipher, that just because an “A” is substituted with a certain letter, then a “B” needs to be substituted with the next letter. We could decide to substitute “A” with “D”, “B” with “Q”, “C” with “J”, etc. With the modern English alphabet, there are 26! = 26×25×24×⋯×1, or 403,291,461,126,605,635,584,000,000, possible substitutions, which is more than you would want to try by hand. (Unless a message contains all 26 letters, you wouldn’t need to try all the possibilities, but it would probably still be a large number.)
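The size of that number is easy to check. The snippet below is just a quick arithmetic check using Python's standard `math.factorial`, compared against the 26 shifts of the Caesar cipher.

```python
import math

substitutions = math.factorial(26)  # 26 × 25 × 24 × ... × 1
print(f"{substitutions:,}")         # 403,291,461,126,605,635,584,000,000

caesar_shifts = 26
# How many times larger the search space is than for the Caesar cipher.
print(f"{substitutions // caesar_shifts:,}")
```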

In this case, a brute force attack is much harder. But it is often still possible to break the cipher by hand, using knowledge of the language that the message is written in. As mentioned above, a 1-letter word in an English message is likely to be “I” or “a”. Looking at other short words may also give you clues to other letters. Another method is to look at how often each letter appears, known as frequency analysis. The first known description of frequency analysis is by the Arab mathematician Al-Kindi (أبو يوسف يعقوب بن إسحاق الصبّاح الكندي) in the 9th century.

In English, for example, the letter “E” is the most common letter. So if you count how many times each letter appears in the message, the letter that appears the most frequently is likely to be an “E”, though it might also be a “T” (the second-most common letter), an “A” (the third-most common), or an “O” (the fourth-most common). It’s very unlikely that it would be a “Q” or a “Z” (the least common letters). In addition, you can consider the letters that start a word; in an English-language text, “T” is the most common starting letter for words. Or if you have a sufficiently long text, you can consider the frequency of the words themselves: the most common word in an English-language text is “the”, followed by “of”, “and”, “to”, “in”, and “a”. You can also consider double letters: in English, you will rarely see a word with a “jj” in it, whereas words that contain “ll”, “ss” or “ee” are more common. Peter Norvig has compiled many statistics on English texts at https://norvig.com/mayzner.html.

Frequency analysis tends to work better with longer texts. It is easy to find a single-word message, like “quiz”, “jazz”, “equinox”, or “thagomizer”, where some letters that are normally less common (“j”, “x”, or “z”) occur as frequently as, if not more frequently than, some letters that are normally more common (“a”, “e”, “t”). Longer messages tend to have letter frequencies that better match the overall letter frequency for the language.
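Counting letters is the mechanical part of frequency analysis, and a minimal sketch of it is shown below. The function name `letter_frequencies` is made up for this example; it ignores anything that is not one of the letters A to Z.

```python
from collections import Counter


def letter_frequencies(text: str) -> list:
    """Return (letter, percentage) pairs, most frequent letter first."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return [(letter, 100 * n / total) for letter, n in counts.most_common()]


# A one-word message where a normally rare letter dominates.
print(letter_frequencies("jazz"))  # [('Z', 50.0), ('J', 25.0), ('A', 25.0)]
```

Running the same function on a long ciphertext gives the counts that can be compared against the typical English frequencies.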

The graphs show the frequency of each letter in the English language, and the frequency of each letter in the ciphertext. Try to decrypt the message by typing in the letters that each letter in the ciphertext should be replaced with. As you enter your guesses, the letters that you have used will be crossed out in the frequency charts to help you see what letters still remain. If you guess a letter for more than one substitution, it will turn red.

As you correctly guess some letters, it should become easier to guess some of the other letters.

Note: Even though the most common letter in English is “E”, the most common letter in the ciphertext might not correspond to “E”, but the letter corresponding to “E” is likely to be among the most common letters. Feel free to make some guesses and revise them.

Most common short English words (in descending order of frequency):

1-letter: a, i
2-letter: of, to, in, is, it, be, by, on
3-letter: the, and, for, was, not, are

Letter frequency in English

Letter	Frequency
E	12%
T	9%
A	8%
O	8%
I	7%
N	7%
S	7%
R	6%
H	5%
L	4%
D	4%
C	3%
U	3%
M	3%
F	2%
P	2%
G	2%
W	2%
Y	2%
B	1%
V	1%
K	1%
X	0%
J	0%
Q	0%
Z	0%

Letter frequency in ciphertext

Non-alphabetic substitutions

We don't always need to substitute letters for other letters. In the Sherlock Holmes story The Adventure of the Dancing Men, letters are replaced by pictures of stick-man figures in different poses, and Holmes explains how he is able to decrypt the messages, including the use of frequency analysis. This leads us into the realm of steganography, which deals with hiding messages so that others are not aware of the presence of the message. This is an interesting topic of its own, but is unfortunately beyond the scope of this book. Some other things, such as Morse code, flag semaphore, maritime signal flags, or sign language alphabets are similar, though their purpose is usually to allow different means of communication, rather than for secrecy.

4. Vigenère Cipher

We can also try variations on substitutions to try to get around frequency analysis, with some degree of success. We can use the Caesar cipher, but change the amount that we shift the alphabet for different characters in our message. For example, we could shift the alphabet by 7 for the first letter in our message, shift by 18 for the second letter, and shift by 12 for the third letter. The amount that we shift is the key to the cipher, and someone who knows the key can decrypt the ciphertext by reversing the process. This method was first described by Giovan Battista Bellaso in 1553, but was misattributed to Blaise de Vigenère, and so is commonly referred to as the Vigenère Cipher. (Ironically, Vigenère did invent a cipher that is similar to, but stronger than, the one commonly known as the Vigenère Cipher, but his name is generally associated with the cipher described by Bellaso.)

The key is usually represented by a key word, with each letter representing a different shift amount: “A” represents no shift, “B” represents a shift of 1, “C” represents a shift of 2, and so on. Thus if the key word was “KEY”, then we would first shift by 10, then by 4, and then by 24. For messages longer than the key, we repeat the key as many times as needed to match the message length.
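The key-word scheme can be sketched as follows. The function name `vigenere` is invented for this example, and it makes two assumptions that the text does not specify: the key is a non-empty string of letters, and non-letters in the message pass through without using up a key letter.

```python
import string
from itertools import cycle

ALPHABET = string.ascii_uppercase


def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Shift each letter by the amount given by the next key letter (A=0, B=1, ...)."""
    shifts = cycle([ALPHABET.index(k) for k in key.upper()])  # repeat the key
    sign = -1 if decrypt else 1
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + sign * next(shifts)) % 26])
        else:
            out.append(ch)  # assumption: non-letters do not consume a key letter
    return "".join(out)


ciphertext = vigenere("HELLO", "KEY")
print(ciphertext)                                 # RIJVS
print(vigenere(ciphertext, "KEY", decrypt=True))  # HELLO
```

With the key word “KEY”, the letters of “HELLO” are shifted by 10, 4, 24, 10 and 4 in turn, so the two “L”s end up as different ciphertext letters.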

Enter a text to encrypt or decrypt, and a key.

Key	Shift	ABCDEFGHIJKLMNOPQRSTUVWXYZ
E	4	EFGHIJKLMNOPQRSTUVWXYZABCD

Since letters are shifted by different amounts, frequency analysis as we did before on the whole text will not allow you to decrypt the message. However, if you happened to know the key length, n, then you could perform n different frequency analyses, and crack each letter of the key separately. For example, if you knew the key length was 3, then you would take the 1st, 4th, 7th, 10th, etc. characters (that is, the characters encrypted by the first letter of the key) as one text, the 2nd, 5th, 8th, 11th, etc. characters as another text, and the 3rd, 6th, 9th, 12th, etc. characters as the last text.
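Splitting the ciphertext into one text per key letter could be sketched like this. The function name `split_by_key_position` is hypothetical, and the sketch assumes that only letters were encrypted, so everything else is dropped before splitting.

```python
def split_by_key_position(ciphertext: str, key_length: int) -> list:
    """Group letters by which key letter encrypted them (1st, 2nd, ..., nth)."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    # letters[i::key_length] takes every key_length-th letter, starting at i.
    return ["".join(letters[i::key_length]) for i in range(key_length)]


print(split_by_key_position("ABCDEFGHIJKL", 3))  # ['ADGJ', 'BEHK', 'CFIL']
```

Each of the resulting texts was encrypted with a single shift, so each can be attacked in the same way as a Caesar cipher.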

But how could you find the key length? One way is to guess the key length, do the frequency analysis as described above, and see which key length gives frequency statistics that are most like the statistics for English. This works better if the text is much longer than the key length.
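One way to put a number on “looks like English” is the index of coincidence, the chance that two letters picked at random from a text are the same. This statistic is a choice made for this sketch rather than something the text above prescribes; for typical English text it is roughly 0.067, against roughly 0.038 for uniformly random letters. The function names are invented for this example.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Probability that two letters picked at random from `text` are equal."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def average_ioc(ciphertext: str, key_length: int) -> float:
    """Average index of coincidence over the texts for each key position."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    streams = ["".join(letters[i::key_length]) for i in range(key_length)]
    return sum(index_of_coincidence(s) for s in streams) / key_length


# Toy ciphertext only; a real attack needs a much longer text.
for n in range(1, 4):
    print(n, average_ioc("ABABABAB", n))
```

The guessed key length whose average is closest to the English value is a good candidate, although multiples of the true key length tend to score well too.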

Try different key lengths and observe the frequency histograms to try to determine the correct key length. The key will be between 2 and 4 characters. Once you have guessed the key length, try different shift amounts to decrypt the text. Note that short texts will be harder to decrypt, especially if the key happens to be longer.

5. Substituting blocks of letters

Another variant on substitutions is to substitute multiple letters at a time, called a polygraphic substitution cipher. For example, we can divide the message into pairs of letters, and perform substitution on the pairs. If the message was “Hello, World”, we would divide it into “HE”, “LL”, “OW”, “OR”, “LD”. We would then have some sort of table to look up what each pair gets replaced with. Say, the “HE” gets replaced with “WG”, the “LL” with “FP”, the “OW” with “NA”, the “OR” with “LN”, and the “LD” with “OC”, giving an encrypted message of “WGFPN ALNOC”. This hides the fact that the original message had a double letter (the “ll” in “Hello”), three of a single letter (three “l”s), and two of another letter (two “o”s). If the message has an odd number of characters, we can pad the message with a character, which the recipient can remove after decrypting.

This method of encryption is a bit inconvenient, though, as it requires remembering the substitutions for 26²=676 pairs, if we are substituting pairs. If we are substituting larger groups of letters, we would need to remember more substitutions.
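To get a feel for the size of such a table, the sketch below builds a random table for all 676 pairs and uses it to encrypt and decrypt. Everything here is illustrative: the function names are made up, “X” is an assumed padding letter, and the seeded `random.Random` is only there to make the example repeatable, not to produce a secure key.

```python
import random
import string
from itertools import product

# All 26 × 26 = 676 letter pairs, from "AA" to "ZZ".
PAIRS = ["".join(p) for p in product(string.ascii_uppercase, repeat=2)]


def make_pair_table(seed: int) -> dict:
    """Build a random one-to-one table mapping each letter pair to another pair."""
    rng = random.Random(seed)  # seeded for repeatability only; not secure
    shuffled = PAIRS[:]
    rng.shuffle(shuffled)
    return dict(zip(PAIRS, shuffled))


def substitute_pairs(text: str, table: dict, pad: str = "X") -> str:
    """Replace each pair of letters in `text` using `table`."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    if len(letters) % 2:
        letters.append(pad)  # pad odd-length messages with an assumed filler
    return "".join(table[a + b] for a, b in zip(letters[::2], letters[1::2]))


table = make_pair_table(seed=1)
inverse = {v: k for k, v in table.items()}  # the table used for decrypting
ciphertext = substitute_pairs("Hello, World", table)
print(len(table))                             # 676
print(substitute_pairs(ciphertext, inverse))  # HELLOWORLD
```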

Decrypt the ciphertext, which was encrypted by substituting letter pairs. The substitution table gives the substitutions. The letters on the left represent the first letter of the pair, and the letters across the top represent the second letter of the pair. So, for example, if you want to decrypt the pair “AB”, you would find “A” on the left, and “B” on the top, and replace it with the pair where the row and column meet.

Substitutions for decrypting
	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S	T	U	V	W	X	Y	Z
A	AA	AB	AC	AD	AE	AF	AG	AH	AI	AJ	AK	AL	AM	AN	AO	AP	AQ	AR	AS	AT	AU	AV	AW	AX	AY	AZ
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z

TODO...

Breaking this type of cipher requires access to a much longer text, since letter pairs are less likely to occur more than once, which means that guessing one substitution does not give much information about the rest of the message. In the example above of the ciphertext “WGFPN ALNOC”, even if we were able to somehow discover that “WG” decodes to “HE”, there is no other “WG” in the message, which means that there isn’t much information available to decode the rest of the message. Frequency analysis can be used, but you would need quite a large text to get useful statistics.

6. Playfair cipher

Since remembering a substitution table with 676 entries is difficult, different methods can be used to determine the substitution for a given letter pair. One such method was invented by Charles Wheatstone in 1854, but is commonly referred to as the Playfair cipher, because Lord Playfair promoted its use.

The Playfair cipher only requires remembering a single ordering of the alphabet (the same as what you would need to do a simple substitution cipher) and writing it into a 5×5 square (to make 25 letters, “J” or “Q” can be dropped, or “I” and “J” can be combined). The ordering of the alphabet is usually just a key word (with duplicate letters removed), followed by the remaining letters, written into the square in some simple order, such as horizontally across the rows, vertically down the columns, spiralling from the centre, etc.
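Building the square from a key word could be sketched as below. This sketch picks one of the options mentioned above (combining “I” and “J”) and fills the square row by row; the function name `playfair_square` and the key word “PLAYFAIR” are chosen purely for illustration.

```python
import string


def playfair_square(key_word: str) -> list:
    """Key word (duplicates removed), then the remaining letters, row by row."""
    # Assumption: "I" and "J" share a cell, so "J" is written as "I".
    alphabet = string.ascii_uppercase.replace("J", "")
    seen = []
    for ch in key_word.upper().replace("J", "I") + alphabet:
        if ch in alphabet and ch not in seen:
            seen.append(ch)
    return ["".join(seen[r * 5:(r + 1) * 5]) for r in range(5)]


for row in playfair_square("PLAYFAIR"):
    print(" ".join(row))
# P L A Y F
# I R B C D
# E G H K M
# N O Q S T
# U V W X Z
```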

Pairs of letters are then encrypted using this square, and the following rules:

1. If both letters are the same, then the second letter is replaced with an “X” (or another uncommon letter), and then the remaining rules are followed with this new pair.
2. If both letters are on the same row of the table, then each letter is replaced by the letter to the right of it in the table (wrapping around to the first letter if it is the rightmost letter).
3. If both letters are in the same column of the table, then each letter is replaced by the letter below it in the table (wrapping around to the top row if it is the bottom-most letter).
4. If the letters are in different rows and columns, then they are opposite corners defining a rectangle in the table. Take the two letters that are the other two corners of this rectangle. Each of the original letters is replaced by the corner letter that is in the same row as it.

For example, consider the table:

A B C D E
F G H I K
L M N O P
Q R S T U
V W X Y Z

If we want to encode “HELLO WORLD”, we split it into the pairs “HE”, “LL”, “OW”, “OR”, “LD”, as before. Looking at the letters “H” and “E”, they are the corners of the 3×2 rectangle at the top-right of the table, with “C” and “K” being the other corners:

- - C D E
- - H I K
- - - - -
- - - - -
- - - - -

“K” is in the same row as “H”, so “H” is replaced by “K”. “C” is in the same row as “E”, so “E” is replaced by “C”. This gives a ciphertext of “KC”.

Next we consider the pair “LL”. Since they are the same letter, we replace the second one with an uncommon letter, say “Q”, and continue with the pair “LQ”. In the table, “L” and “Q” are in the same column, so we replace each with the letter below it in the table: “L” is replaced by “Q”, and “Q” is replaced by “V”. This gives a ciphertext of “QV”.

Similarly, the pair “OW” forms a rectangle with “MY” as the other two corners. “OR” forms a rectangle with “MT” as the other two corners. And “LD” forms a rectangle with “OA” as the other two corners. The full ciphertext, then, is “KCQVM YMTOA”. Note that in this example, the two “O”s happen to be replaced by the same letter (“M”). This is because the other letters in their pairs (“W” and “R”) are in the same column of the table. If you encrypted a pair with “O” and a letter from a different column, say “F”, then it would be replaced by a different letter (“L”).
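The worked example can be reproduced with a short sketch. It hard-codes the example table, uses “Q” as the filler for repeated letters as in the example, and assumes “X” as padding for odd-length messages; letters that are not in the square (here “J”) are simply dropped. The function names are made up for this illustration.

```python
SQUARE = ["ABCDE", "FGHIK", "LMNOP", "QRSTU", "VWXYZ"]  # the example table
POSITION = {ch: (r, c) for r, row in enumerate(SQUARE) for c, ch in enumerate(row)}


def encrypt_pair(a: str, b: str, filler: str = "Q") -> str:
    """Encrypt one pair of letters using the four rules."""
    if a == b:
        b = filler  # repeated letter: replace the second one
    (ra, ca), (rb, cb) = POSITION[a], POSITION[b]
    if ra == rb:  # same row: take the letter to the right, wrapping around
        return SQUARE[ra][(ca + 1) % 5] + SQUARE[rb][(cb + 1) % 5]
    if ca == cb:  # same column: take the letter below, wrapping around
        return SQUARE[(ra + 1) % 5][ca] + SQUARE[(rb + 1) % 5][cb]
    # rectangle: keep each letter's row, take the other letter's column
    return SQUARE[ra][cb] + SQUARE[rb][ca]


def playfair_encrypt(text: str) -> str:
    letters = [ch for ch in text.upper() if ch in POSITION]
    if len(letters) % 2:
        letters.append("X")  # assumed padding letter for odd lengths
    return "".join(encrypt_pair(a, b) for a, b in zip(letters[::2], letters[1::2]))


print(playfair_encrypt("HELLO WORLD"))  # KCQVMYMTOA
```

The output matches the ciphertext derived by hand above, apart from the grouping into blocks of five letters.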

To decrypt a text, you simply reverse the encryption steps. Step 1 requires some amount of guesswork to determine when it should be applied, but it is usually quite obvious. Steps 2 and 3 are the same except that the direction of shift is reversed. And Step 4 is exactly the same.
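Reversing the steps might look like the sketch below, again hard-coding the example table. It deliberately does not try to undo the repeated-letter rule, since that is the part that needs human guesswork; the function names are invented for this example.

```python
SQUARE = ["ABCDE", "FGHIK", "LMNOP", "QRSTU", "VWXYZ"]  # the example table
POSITION = {ch: (r, c) for r, row in enumerate(SQUARE) for c, ch in enumerate(row)}


def decrypt_pair(a: str, b: str) -> str:
    """Undo the row, column, and rectangle rules for one pair."""
    (ra, ca), (rb, cb) = POSITION[a], POSITION[b]
    if ra == rb:  # same row: take the letter to the left instead of the right
        return SQUARE[ra][(ca - 1) % 5] + SQUARE[rb][(cb - 1) % 5]
    if ca == cb:  # same column: take the letter above instead of below
        return SQUARE[(ra - 1) % 5][ca] + SQUARE[(rb - 1) % 5][cb]
    # rectangle: exactly the same swap as when encrypting
    return SQUARE[ra][cb] + SQUARE[rb][ca]


def playfair_decrypt(ciphertext: str) -> str:
    letters = [ch for ch in ciphertext.upper() if ch in POSITION]
    return "".join(decrypt_pair(a, b) for a, b in zip(letters[::2], letters[1::2]))


print(playfair_decrypt("KCQVM YMTOA"))  # HELQOWORLD
```

The result reads “HELQOWORLD”: the reader still has to recognise that the “Q” stands in for a repeated “L”.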

Enter a text to encrypt or decrypt, and fill out the 5×5 square.
