2017-06-19

Demystifying encodings — part 1

Has a UnicodeEncodeError ever got on your nerves? I think it has happened to pretty much everyone. I thought it was a good time to start clearing up this mystery, and to have fun while doing it.

You are probably familiar with the famous hello program, which does nothing but print “Hello, world!” Now we are going to create the bon program, which prints “Bon appétit!” Here it is, but don’t type it in or copy/paste it:

import sys
 
sys.stdout.write("Bon appétit!\n")

We could have written it with a single print statement, but for several reasons I prefer this version.

Now, I told you to not type in this program or copy/paste it. Instead, we are going to create it by writing down the bytes that comprise it. The file bon.py, like any file, is a series of bytes. These bytes represent the characters in the file. The first character is “i”, which is represented by the byte 105; the second one is “m”, represented by 109; and so on. You can find the table for all characters (except é) at Wikipedia.
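If you would rather not look the numbers up, Python can tell you. This is just a small sketch using the built-in ord and chr functions; the variable names are mine.

```python
# Map each character of the first word of bon.py to its byte value.
first_word = "import"
codes = [ord(ch) for ch in first_word]
print(codes)  # [105, 109, 112, 111, 114, 116]

# The mapping works in the other direction too.
print("".join(chr(code) for code in codes))  # import
```

For these plain English letters the character number and the byte value are the same. That is not true for é, as we will see.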

In order to create the file, we will use another Python program. Here it is. You can copy and paste that one; name it createbon.py:

byte_values = [
    # i   m    p    o    r    t        s    y    s
    105, 109, 112, 111, 114, 116, 32, 115, 121, 115, 13, 10,
 
    13, 10,
 
    # s   y    s   .    s    t    d    o    u    t   .
    115, 121, 115, 46, 115, 116, 100, 111, 117, 116, 46,
 
    # w   r    i    t    e   (   "   B    o    n
    119, 114, 105, 116, 101, 40, 34, 66, 111, 110, 32,
 
    # a  p    p   +------+   t    i    t    !  \    n   "   )
    97, 112, 112, 195, 169, 116, 105, 116, 33, 92, 110, 34, 41, 13, 10,
]
 
with open('bon.py', 'wb') as f:
    f.write(bytearray(byte_values))

Type python createbon.py and it will create bon.py. Don’t try to run it yet; instead, open bon.py in a UTF-8 editor to see it. By default, on GNU/Linux systems any editor should work properly. On Windows I experimented with Notepad and it worked, but you might need to tell it it’s UTF-8.

You may have noticed that 32 is the space, and that I’m using the sequence 13 10 for a new line. 13 stands for “carriage return”, and 10 for “line feed”. On Windows both characters are needed to start a new line, whereas on Unix-like systems only line feed is used. Here I used the Windows version. Python programs on GNU/Linux will rarely use 13 10 as the newline; they will normally use only 10. If a file does happen to use 13 10, though, it will work without problems.
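Here is a quick way to convince yourself that both newline styles are accepted. It is only a sketch: the two-line program and the run helper are made up for the illustration, and it uses Python 3's built-in compile and exec on raw bytes.

```python
# The same tiny program with Windows (13 10) and Unix (10) line endings.
windows_source = b"x = 1\r\ny = x + 1\r\n"
unix_source = b"x = 1\ny = x + 1\n"


def run(source_bytes):
    """Compile and execute the given source bytes, then return the value of y."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["y"]


print(list(b"\r\n"))  # [13, 10]
print(run(windows_source), run(unix_source))  # 2 2
```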

Running bon.py is a different story. First of all, as written it only runs in Python 3 (Python 2 is covered at the end). Unless told otherwise, Python 3 assumes that the encoding of the input file is UTF-8, which it is.

If you try to run bon.py with Python 3 on GNU/Linux, it will most likely work; in some environments, however, it might fail and give you a UnicodeEncodeError instead:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 7: ordinal not in range(128)

Whether it succeeds or not, here is what happens: Python reads bon.py. When it reaches the “Bon appétit!\n” part, it reads the bytes and decodes them into a string of characters. Python assumes these bytes are the string encoded in UTF-8, which they are, so it is able to decode them into an internal representation of the string.
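You can reproduce this decoding step by hand. The following is a minimal sketch that takes the bytes of “Bon appétit!” from createbon.py and decodes them the way Python 3 does for a UTF-8 source file; the variable names are mine.

```python
# The bytes of "Bon appétit!" exactly as they appear in bon.py.
raw = bytes([66, 111, 110, 32, 97, 112, 112, 195, 169, 116, 105, 116, 33])

# Decode them into a string of characters.
text = raw.decode("utf-8")

# ascii() escapes é as \xe9, so this line prints safely on any terminal.
print(ascii(text))          # 'Bon app\xe9tit!'
print(len(raw), len(text))  # 13 12
```

There are 13 bytes but only 12 characters, because é takes two bytes in UTF-8.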

When the time comes to print the string to the standard output, Python must convert the string from the internal representation to an encoding that will be understood by the terminal inside which Python is being run. The way Python determines that encoding depends on the system on which it is running.
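If you are curious what Python has decided on your system, you can ask it. This is only a sketch: output_encoding_info is a name I made up, and the values you get will differ from machine to machine.

```python
import locale
import sys


def output_encoding_info():
    """Return the encodings Python reports for standard output and the locale."""
    return {
        # Encoding Python will use when writing strings to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding suggested by the operating system's locale settings.
        "locale": locale.getpreferredencoding(False),
    }


info = output_encoding_info()
print(info)
```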

Let’s see some examples only for the word “appétit”. If the terminal uses UTF-8, the bytes produced by the encoding will be 97, 112, 112, 195, 169, 116, 105, 116 (195 and 169 are the two bytes that encode é). If it uses ISO-8859-1, the bytes will be 97, 112, 112, 233, 116, 105, 116. If it uses UTF-16 (little-endian, without a byte order mark), the bytes will be 97, 0, 112, 0, 112, 0, 233, 0, 116, 0, 105, 0, 116, 0. In all cases, Python will send these bytes to the terminal, and the terminal will decode them into its internal representation, and in co-operation with the operating system and the windowing system it will paint the corresponding glyphs on the screen.
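These byte sequences are easy to check with str.encode. A small sketch follows; note that I use the codec name "utf-16-le" (little-endian, no byte order mark) to get exactly the UTF-16 bytes listed above, since the plain "utf-16" codec prepends a byte order mark.

```python
word = "app\u00e9tit"  # "appétit", with é written as an escape

# Encode the same string in three different encodings.
encoded = {
    encoding: list(word.encode(encoding))
    for encoding in ("utf-8", "iso-8859-1", "utf-16-le")
}

for encoding, byte_list in encoded.items():
    print(encoding, byte_list)
```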

If the terminal uses ASCII, Python will attempt to encode the string into ASCII. In addition, if the encoding used by the terminal is unknown to Python, Python will also use ASCII. But é cannot be encoded into ASCII, because the ASCII character set does not contain that character, so Python will raise a UnicodeEncodeError.
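You do not need an ASCII terminal to see this error. Encoding the string into ASCII by hand, as in the sketch below, fails in the same way; the variable names are mine.

```python
text = "Bon app\u00e9tit!\n"  # the string that bon.py writes

error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Keep a reference, because exc is cleared when the except block ends.
    error = exc
    print(exc)

# Position 7 is the index of é in the string (counting from 0).
print(error.start)
```

The printed message is the one shown earlier, and position 7 is where é sits in the string.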

Python 2

While bon.py is also a valid Python 2 program, Python 2 assumes that the encoding of the input file is ASCII. In ASCII every byte is a number between 0 and 127, and here we have two bytes greater than 127: 195 and 169, the two bytes that represent é. So Python 2 will stop and complain (“SyntaxError: Non-ASCII character '\xc3' in file /tmp/bon.py on line 3, but no encoding declared”).
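To see where the “\xc3” and the “line 3” in that message come from, here is a Python 3 sketch that scans the bytes of bon.py for the first value above 127. It is not what Python 2 does internally, only an imitation of the check, and first_non_ascii is a name I made up.

```python
# The contents of bon.py, rebuilt here so that the example is self-contained.
source = 'import sys\r\n\r\nsys.stdout.write("Bon app\u00e9tit!\\n")\r\n'.encode("utf-8")


def first_non_ascii(source_bytes):
    """Return (line_number, byte_value) for the first byte above 127, or None."""
    for line_number, line in enumerate(source_bytes.split(b"\n"), start=1):
        for value in line:
            if value > 127:
                return line_number, value
    return None


print(first_non_ascii(source))  # (3, 195)
print(hex(195))                 # 0xc3
```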

You can convince Python 2 to accept the file by adding a comment at the top that tells Python that the encoding of the file is UTF-8, but we will not bother with that here; read PEP 263 if you are interested in that.
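For the curious, this is roughly what such a comment looks like. The sketch below runs on Python 3 and uses the standard library's tokenize.detect_encoding, which follows the PEP 263 rules, to show what gets detected with and without the comment; detected_encoding is a helper name I made up.

```python
import io
import tokenize

# A PEP 263 encoding declaration, followed by the contents of bon.py.
declaration = b"# -*- coding: utf-8 -*-\r\n"
body = 'import sys\r\n\r\nsys.stdout.write("Bon app\u00e9tit!\\n")\r\n'.encode("utf-8")


def detected_encoding(source_bytes):
    """Return the source encoding Python 3 detects for the given bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


print(detected_encoding(declaration + body))  # utf-8, from the comment
print(detected_encoding(body))                # utf-8, the Python 3 default
```

In Python 3 the comment changes nothing, because UTF-8 is already the default; it is Python 2 that needs it.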

(Python 3 assumes UTF-8 by default, whereas Python 2 needs a PEP 263 comment. If you use one, it should always declare UTF-8; don’t use anything else, otherwise you will be in trouble when porting your code to Python 3.)