Text Encoding Basics

I have tried to learn about text encoding every few years for the past ten years. Generally, I don’t have many issues with it, so I manage to forget the important details by the next time I need to fix something. I decided to short-circuit that process next time by jotting down some thoughts, and I figured I’d share them here as well.

So, what is text encoding? Well, internally, computers don't have any representation of strings. They operate on numbers, specifically binary numbers. Because they only store or "understand" numeric data, humans have developed schemes called encodings that map numbers to characters. When a computer needs to print a string (like the one you are reading right now), it has a list of numbers, and it looks up which symbol corresponds to each number, kind of like a child’s decoder ring might help them “decrypt” a message from their friend. For example, using ASCII encoding, the number 65 corresponds to the character A, and the number 98 corresponds to the character b. Notice that the uppercase A and the lowercase b are many numbers apart. That’s because uppercase and lowercase letters are mapped to different ranges of numbers in ASCII!
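You can peek at this mapping directly with JavaScript's built-in `charCodeAt` and `String.fromCharCode` functions (we'll use them again below):

```javascript
// Look up the number behind a character, and the character behind a number.
console.log('A'.charCodeAt(0))        // 65
console.log('b'.charCodeAt(0))        // 98
console.log(String.fromCharCode(65))  // "A"

// Uppercase and lowercase letters sit in separate ranges, exactly 32 apart.
console.log('a'.charCodeAt(0) - 'A'.charCodeAt(0))  // 32
```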

Because humans are involved, we don’t have just one way of mapping between numbers and characters. We have many — so many, in fact, that I am not going to list them all. There are a few that any developer is likely to encounter in their daily life: ASCII, UTF-16, and UTF-8.

ASCII

ASCII (pronounced “ASK-ee”) was first published in 1963 and became the de facto standard for encoding the English alphabet. It contains 128 characters encoded using 7 bits (0000000 through 1111111), and many other text encodings are at least somewhat compatible with it. ASCII was not unchallenged, though; because it represented only the Latin alphabet and a limited set of symbols, it was often supplemented with other character sets.
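You can see the 7-bit limit for yourself in a JavaScript console: `toString(2)` renders a number in binary, and every ASCII code fits in at most 7 binary digits.

```javascript
// Every ASCII code fits in 7 bits; toString(2) renders a number in binary.
console.log('A'.charCodeAt(0).toString(2))  // "1000001" (65 in binary, 7 digits)
console.log('~'.charCodeAt(0))              // 126, near the top of the 0–127 range
```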

UTF-16

Obviously, much of the world doesn’t use the English alphabet, so we needed other text encoding schemes that encompass more systems of writing. UTF-16 is one such standard. First released in 1996 (and developed for several years before that), it is the encoding used internally by Microsoft Windows. Interestingly, the web never really adopted UTF-16, possibly because it is incompatible with ASCII. UTF-16 is based on 16-bit (two-byte) code units and can represent 1,112,064 different characters, including the characters of many known languages and the now ubiquitous emoji.
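Because JavaScript strings are stored as UTF-16 under the hood, you can observe this in any browser console: characters beyond the first 65,536 code points (like emoji) are stored as two 16-bit code units, a so-called surrogate pair.

```javascript
// JavaScript strings are sequences of UTF-16 code units, so characters
// outside the first 65,536 code points take two units (a surrogate pair).
const emoji = '😀'
console.log(emoji.length)          // 2 — two 16-bit code units
console.log(emoji.charCodeAt(0))   // 55357 — just the first half of the pair
console.log(emoji.codePointAt(0))  // 128512 — the actual Unicode code point
```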

It is worth noting that even though ASCII and UTF-16 are “incompatible,” meaning you can’t take bytes encoded in one and read them as the other, UTF-16 does encode the first 128 characters with the same numeric values as ASCII. The two encodings use a different number of bits for those characters, but those strings of bits represent the same base 10 decimal numbers.

UTF-8

UTF-8 was developed around the same time as UTF-16, but it represents its data differently and preserves compatibility with ASCII. It is the dominant text encoding on the web and (like UTF-16) contains characters for many known languages. In my experience, many programs are shifting from UTF-16 to UTF-8 to reduce compatibility issues.
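One way to see UTF-8’s variable-width design is with the standard `TextEncoder` API (available in browsers and Node.js), which always encodes to UTF-8. ASCII characters come out as the exact same single byte, while other characters take two to four bytes:

```javascript
// TextEncoder always produces UTF-8 bytes.
const encoder = new TextEncoder()
console.log(encoder.encode('A'))   // Uint8Array [ 65 ] — one byte, same as ASCII
console.log(encoder.encode('é'))   // Uint8Array [ 195, 169 ] — two bytes
console.log(encoder.encode('😀'))  // Uint8Array [ 240, 159, 152, 128 ] — four bytes
```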

Practice

Now that you have a high-level understanding of text encoding, it will be helpful to see it in operation in the wild. Luckily, we can do that even in JavaScript in the browser you are reading this in right now. Open up your developer tools in your favorite browser, and paste this:

    console.log('a'.charCodeAt(0))
    console.log(String.fromCharCode(98))

This code snippet should output “97” and then b. The first line creates a string containing only one character: a lowercase a. It then uses the built-in JavaScript function charCodeAt(n) to get the character code at the n-th position in the string (remember, strings in JavaScript are 0-indexed). This should print out the ASCII code for a, 97!

The astute among you might have noticed that the function charCodeAt doesn’t return an ASCII code but, instead, returns a UTF-16 code. Remember that the first 128 decimal values of ASCII and UTF-16 match, so we can safely use the decimal as an ASCII code even though it is technically the UTF-16 code.

The next line of the example goes the opposite direction, from the code 98 to the character b.
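One caveat worth knowing: charCodeAt and String.fromCharCode operate on 16-bit code units, so for characters beyond the first 65,536 code points (like emoji), the newer codePointAt and String.fromCodePoint are the safer choices:

```javascript
// codePointAt/fromCodePoint understand full Unicode code points,
// including characters stored as surrogate pairs.
console.log(String.fromCodePoint(128512))  // "😀"
console.log('😀'.codePointAt(0))           // 128512
console.log('😀'.charCodeAt(0))            // 55357 — only half of the pair
```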

Helpful Resources


I am currently in the process of building my own static site generator! You can follow progress on that project here