Before starting this article, let's take a look at a group of base64 encoded strings:
ZG==
YY==
aW==
ZF==
cm==
aM==
b2==
dc==
c2==
Zf==
Decoding each line and joining the results gives "daidrhouse", which seems fine. But look closely: the first and fourth lines both decode to "d", yet their encoded strings differ (ZG== versus ZF==).
Under standard base64 encoding, encoding each character of "daidrhouse" separately should produce the following:
ZA==
YQ==
aQ==
ZA==
cg==
aA==
bw==
dQ==
cw==
ZQ==
Compared with the first group, only the second character of each base64 string differs, yet the decoded content is the same. To see why, we need to look at how base64 encoding works.
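The claim that both groups decode to the same text can be checked with a few lines of Python. This is a minimal sketch that relies on the standard library's `base64.b64decode` in its default (lenient) mode; the variable and function names are chosen here for illustration only.

```python
import base64

# The group shown at the start of the article, and the standard encoding.
modified = ["ZG==", "YY==", "aW==", "ZF==", "cm==", "aM==", "b2==", "dc==", "c2==", "Zf=="]
standard = ["ZA==", "YQ==", "aQ==", "ZA==", "cg==", "aA==", "bw==", "dQ==", "cw==", "ZQ=="]

def decode_all(lines):
    # Decode each line on its own and join the resulting bytes.
    return b"".join(base64.b64decode(line) for line in lines).decode("ascii")

decoded_modified = decode_all(modified)
decoded_standard = decode_all(standard)
print(decoded_modified, decoded_standard)
```

Both calls return the same string even though every encoded line differs between the two lists.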
What is base64?
As the name suggests, base64 is a way of encoding binary content using 64 ASCII characters. You may have seen base64-encoded images embedded in web pages, or lyrics files transferred by QQ Music. Encoding binary data as ASCII characters makes it easier to read and transmit in certain scenarios. Of course, representing arbitrary binary data with only 64 characters comes at a cost in size: the encoded output is about 1/3 larger than the input, for reasons explained below.
Index Table
Base64 has a standard encoding table, which consists of 64 ASCII characters sorted and assigned indices.
Index | Character | Index | Character | Index | Character | Index | Character |
---|---|---|---|---|---|---|---|
0 | A | 16 | Q | 32 | g | 48 | w |
1 | B | 17 | R | 33 | h | 49 | x |
2 | C | 18 | S | 34 | i | 50 | y |
3 | D | 19 | T | 35 | j | 51 | z |
4 | E | 20 | U | 36 | k | 52 | 0 |
5 | F | 21 | V | 37 | l | 53 | 1 |
6 | G | 22 | W | 38 | m | 54 | 2 |
7 | H | 23 | X | 39 | n | 55 | 3 |
8 | I | 24 | Y | 40 | o | 56 | 4 |
9 | J | 25 | Z | 41 | p | 57 | 5 |
10 | K | 26 | a | 42 | q | 58 | 6 |
11 | L | 27 | b | 43 | r | 59 | 7 |
12 | M | 28 | c | 44 | s | 60 | 8 |
13 | N | 29 | d | 45 | t | 61 | 9 |
14 | O | 30 | e | 46 | u | 62 | + |
15 | P | 31 | f | 47 | v | 63 | / |
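The table above can be reproduced in code as a single 64-character string, where a character's position is its index. The sketch below builds it from Python's `string` constants; the name `ALPHABET` is an illustrative choice, not something defined by the article.

```python
import string

# The 64-character alphabet in index order: A-Z, a-z, 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

print(len(ALPHABET))        # number of symbols in the table
print(ALPHABET[25])         # character stored at index 25
print(ALPHABET.index("G"))  # index of the character G
```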
Sometimes, to avoid confusion with reserved characters (such as in URLs), `-` and `_` are used instead of `+` and `/` from the index table.
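Python's standard library exposes the URL-safe variant defined in RFC 4648, which substitutes `-` and `_` for the last two characters of the table. The short sketch below encodes two bytes picked so that indices 62 and 63 both appear in the output; the input value is an arbitrary choice for illustration.

```python
import base64

# 0xFB 0xFF splits into the 6-bit values 62, 63, 60, so both special
# characters show up in the encoded output.
data = b"\xfb\xff"

standard = base64.b64encode(data)
url_safe = base64.urlsafe_b64encode(data)
print(standard, url_safe)
```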
Encoding Method
Base64 processes the input in groups of 3 bytes (24 bits). If the last group has fewer than 3 bytes, it is padded with zero bits, and `=` is appended to the output to indicate the number of padded bytes (one `=` per missing byte). Each 24-bit group is then split into 4 groups of 6 bits. A 6-bit group has 64 possible values, so each one can be represented by one of the 64 characters in the index table. (This also explains why the size increases by 1/3 after encoding: every 3 input bytes become 4 output characters.)
Examples
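As a worked example of the encoding method, the following sketch implements it by hand: group into 3 bytes, pad with zeros, split into four 6-bit indices, and replace padded positions with `=`. The function name `b64encode_manual` and the constant `ALPHABET` are hypothetical names introduced for this illustration; in practice you would call `base64.b64encode`.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64encode_manual(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # number of missing bytes in this group (0, 1 or 2)
        # Pad the group with zero bytes and read it as one 24-bit integer.
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split the 24 bits into four 6-bit indices and look each one up.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters that consist only of padding bits are written as '='.
        if pad:
            chars[-pad:] = ["="] * pad
        out.append("".join(chars))
    return "".join(out)

print(b64encode_manual(b"Man"))  # full 3-byte group, no padding
print(b64encode_manual(b"Ma"))   # 2 bytes, one '='
print(b64encode_manual(b"d"))    # 1 byte, two '='
```

For a 1-byte input such as "d", only the top 2 bits of the second character come from the data; the remaining 4 bits are zero padding.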
Steganography Principle
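When decoding base64, the number of `=` at the end of the string determines the number of padded bytes to be removed. You may have noticed that when the last group holds only 1 or 2 bytes, some bits are ignored during decoding. With 1 byte (8 bits), the two data characters carry 12 bits, so the low 4 bits of the second character are ignored; with 2 bytes (16 bits), the three data characters carry 18 bits, so the low 2 bits of the third character are ignored. These are the bits marked in red in the following image.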
A standard encoder sets these red-marked bits to zero, but they are ignored during decoding. Modifying the content at these positions therefore does not affect the original data, which makes them a place to hide information.
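To make the idea concrete, here is a minimal sketch of how hidden bits could be written into the ignored positions of a single padded group. The helper `embed` is a hypothetical function written for this illustration, not part of any library; it assumes the cover data is 1 or 2 bytes long so that the encoded group ends in `=`.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def embed(cover: bytes, hidden_bits: str) -> str:
    """Encode `cover` (1 or 2 bytes) and store `hidden_bits` in the ignored low bits."""
    encoded = base64.b64encode(cover).decode("ascii")
    pad = encoded.count("=")
    if len(cover) > 2 or pad == 0:
        raise ValueError("cover must be 1 or 2 bytes so that padding is present")
    capacity = 2 * pad  # 4 spare bits for '==', 2 spare bits for '='
    if len(hidden_bits) != capacity:
        raise ValueError(f"expected exactly {capacity} bits")
    pos = 3 - pad  # position of the last data character
    # The standard encoder leaves the spare bits at zero, so OR-ing is enough.
    index = ALPHABET.index(encoded[pos]) | int(hidden_bits, 2)
    return encoded[:pos] + ALPHABET[index] + encoded[pos + 1:]

stego = embed(b"d", "0110")
print(stego, base64.b64decode(stego))
```

Embedding `0110` into the encoding of "d" turns `ZA==` into `ZG==`, the first line of the group at the start of the article, while the decoded byte is unchanged.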
Problem Solving
Now, let's try to solve the problem mentioned at the beginning of the article. What is hidden in that group of base64 encoded strings?
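Each of the 10 strings ends in `==`, so each one hides 4 bits in the low bits of its second character. For example, G (index 6, binary 000110) contributes 0110 and Y (index 24, binary 011000) contributes 1000; together they form 01101000, the ASCII code for "h". By concatenating all the red-marked bits (40 in total, or 5 bytes), we obtain the final result: "hello".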
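The extraction can be automated. The sketch below reads the ignored low bits from each line, concatenates them, and converts every 8 bits to a character. `hidden_bits` and `ALPHABET` are illustrative names introduced here, and the code assumes the standard `+` and `/` alphabet.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

lines = ["ZG==", "YY==", "aW==", "ZF==", "cm==", "aM==", "b2==", "dc==", "c2==", "Zf=="]

def hidden_bits(line: str) -> str:
    """Return the ignored low bits of the last data character as a bit string."""
    pad = line.count("=")
    if pad == 0:
        return ""  # no padding means no spare bits
    last = line.rstrip("=")[-1]
    nbits = 2 * pad  # '==' leaves 4 spare bits, '=' leaves 2
    value = ALPHABET.index(last) & ((1 << nbits) - 1)
    return format(value, f"0{nbits}b")

bits = "".join(hidden_bits(line) for line in lines)
# Convert each complete group of 8 bits into one byte.
usable = len(bits) - len(bits) % 8
message = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8)).decode("ascii")
print(bits)
print(message)
```

Lines without padding hide nothing and contribute an empty string, so the same loop also works on mixed input.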