Python developer's guide to character encoding

Character encoding is a common source of problems in software development, and Python is no exception. In this article, we will dive deep into character encoding, discuss ways to work with text and bytes in your Python 3 project, and fix common encoding errors.

What is Character Encoding?

To a human, a text file contains characters that form words and sentences, such as the English letter "a" or the Latin letter "ā". To a computer, the same file contains only bits and bytes. Whatever the language, the computer stores each character as one or more bytes, and those bytes have to be translated back into human-readable text. The rules for this translation are called a character encoding.

A character encoding is a set of rules for mapping raw binary (0101110110) to readable characters (text) using a lookup table. Every character is assigned a unique number, which lets computers store and exchange text. Many different character encodings exist. When bytes are interpreted with the wrong one, they display as strange-looking characters, such as voil├Ā ‡å-ã, or as an unknown character, such as ��; even worse, the mismatch can raise an error that crashes your program.
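To make the garbled-text problem concrete, here is a minimal sketch. It assumes Latin-1 as the "wrong" encoding purely for illustration; the exact garbage you see in practice depends on which encoding is misapplied. The variable names are our own.

```python
original = "voilà"
raw = original.encode("utf-8")

# Latin-1 assigns a character to every byte value, so decoding never fails,
# but the two bytes that encode "à" come back as two unrelated characters.
garbled = raw.decode("latin-1")

# A byte sequence that is not valid UTF-8 can be decoded with errors="replace",
# which substitutes U+FFFD (the replacement character) for the bad byte.
replaced = b"voil\xe0".decode("utf-8", errors="replace")

# ascii() escapes non-ASCII characters so the result prints on any terminal.
print(ascii(garbled))
print(ascii(replaced))
```

The first result is the kind of strange-looking text described above, and the second ends in the unknown-character symbol.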

While working with characters in Python, you will encounter two main data types: strings and bytes. Character encoding revolves around encoding and decoding these data types.

Strings

Strings are sequences of characters displayed in human-readable form. In Python 2, the default string type stored raw bytes, with ASCII as the default encoding. While ASCII has been quite useful, it defines only 128 characters: the unaccented English letters, digits, punctuation, and control codes. Solving this problem led to the development of a universal standard, Unicode. Unicode is a universal character set that aims to cover the characters of every human writing system used on computers, with room for over one million code points. A code point is the unique number assigned to a character.

In Python 3, every string is a sequence of Unicode code points by default. A string only takes on a concrete byte form when it is encoded, and UTF-8 is the default encoding Python 3 uses for that.
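As a small illustration of code points, the sketch below uses the built-in ord() and chr() functions. The sample characters and variable names are chosen only for this example.

```python
# ord() returns the Unicode code point of a character; chr() is the inverse.
code_a = ord("a")            # inside the ASCII range of 0-127
code_a_macron = ord("ā")     # outside the ASCII range
char_back = chr(code_a_macron)

# A Python 3 string counts code points, not bytes.
word = "résumé"
length_in_characters = len(word)
length_in_utf8_bytes = len(word.encode("utf-8"))

print(code_a, code_a_macron, ascii(char_back))
print(length_in_characters, length_in_utf8_bytes)
```

The two lengths differ because each é needs more than one byte once the string is encoded.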

Bytes

The second data type is bytes, an immutable sequence of integers, each in the range 0 to 255. A byte is a group of eight bits that represents a unit of information. Every character in a computer is stored as one or more bytes. Imagine that you receive a file containing data in bytes. You will need to translate these bytes into readable text. Here, a character encoding does the work in both directions: it converts the characters you type into the correct bytes in the computer's memory, and it converts those bytes back into characters when the text is displayed.
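The "series of integers" view of bytes is easy to check for yourself. The following is a minimal sketch with made-up variable names:

```python
data = b"Hi"

as_integers = list(data)        # each byte is an integer from 0 to 255
first_byte = data[0]            # indexing a bytes object yields an int
rebuilt = bytes(as_integers)    # a bytes object can be built from integers
as_text = data.decode("utf-8")  # decoding turns the bytes into a string

print(as_integers, first_byte, rebuilt, as_text)
```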

Working with Character Encoding in Python

As stated previously, Python 3 works with two data types here: strings and bytes. Moving from strings to bytes is known as encoding, and moving from bytes back to strings is known as decoding. Now, let us dive into how this works.

Encoding Strings into Bytes

The process of converting strings (text) to bytes is known as encoding. In Python 3, strings represent human-readable text as Unicode characters. Unicode itself is not an encoding but an abstract standard that assigns code points to characters; the Unicode Transformation Formats (UTF) turn those code points into bytes. Although there are multiple character encodings, we will be working with UTF-8, which is the default encoding for Python 3. UTF stands for Unicode Transformation Format, and 8 refers to the 8-bit units the encoding works with. UTF-8 is a standard and efficient encoding that represents each character in one, two, three, or four bytes. Python 3 assumes source files are UTF-8 by default, so the encoding does not need to be declared in every Python file.

To encode a string into bytes, call the encode() method, which returns a bytes object holding the encoded form of the string.

>>> text = 'Hello World'
>>> text.encode('utf-8')
b'Hello World'

The output is the bytes representation of Hello World. Because every character in it is ASCII, the bytes look identical to the text. Next, let’s look at something more complex.

>>> text = 'parlé'
>>> text.encode('utf-8')
b'parl\xc3\xa9'

>>> text = 'résumé'
>>> text.encode('utf-8')
b'r\xc3\xa9sum\xc3\xa9'

In the code above, the letters 'parl' are ASCII characters, and UTF-8 encodes every ASCII character as a single byte with the same value it has in the ASCII table, so they appear unchanged. Characters outside ASCII, such as é, need more than one byte: é is encoded as the two bytes \xc3 and \xa9, as in the first example. In the second example, é occurs twice, so the two-byte sequence \xc3\xa9 appears twice and the six-character string becomes eight bytes. Other characters need three bytes, and UTF-8 can go up to four bytes per character.
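To see the one-to-four-byte range in action, here is a short sketch that measures the UTF-8 length of a few sample characters. The characters are picked for illustration only.

```python
samples = ["a", "é", "€", "😀"]

# Encode each character on its own and count the resulting bytes.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_counts.items():
    # ascii() escapes non-ASCII characters so the line prints on any terminal.
    print(f"U+{ord(ch):04X} {ascii(ch)}: {count} byte(s)")
```

The further a character sits from the ASCII range, the more bytes UTF-8 spends on it.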

Decoding Bytes into Strings

Converting a bytes object to a string is known as decoding. To decode bytes into a string, call the decode() method and specify the character encoding the bytes were written in. Of course, we are using UTF-8.

>>> text = b'parl\xc3\xa9'
>>> text.decode('utf-8')
'parlé'

>>> text = b'r\xc3\xa9sum\xc3\xa9'
>>> text.decode('utf-8')
'résumé'

When we call decode() on the encoded bytes, the output is our original string.

Remember, you do not have to pass 'utf-8' to encode() or decode() in Python 3, because it is the default for both methods. We are only specifying it here to show the encoding used.
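The default can be verified with a quick round trip. This is a minimal sketch; the variable names are ours.

```python
text = "résumé"

explicit = text.encode("utf-8")
implicit = text.encode()          # the encoding argument defaults to "utf-8"
round_trip = implicit.decode()    # decode() uses the same default

print(explicit == implicit, round_trip == text)
```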

Reading a Text File in Python

A file on disk does not contain readable text, only bytes. To read those bytes as text, open the file with the built-in open() function and call the read() method on the file object it returns. By default, open() treats the file as text and decodes its bytes into a string. If you do not pass an encoding, Python falls back to a default that can depend on the operating system and locale, so the same file may be read differently on different machines. For this reason, it is best practice to state the character encoding explicitly whenever you open a file for reading or writing.

Here is an example:

>>> with open("data.txt", mode="r", encoding="utf-8") as f:
...     message = f.read()
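The sketch below shows why the encoding argument matters when reading. It creates its own temporary file so that it can run anywhere; the file name and the choice of Latin-1 as the mismatched encoding are assumptions made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    # Put UTF-8 encoded bytes on disk.
    path.write_bytes("parlé".encode("utf-8"))

    # Reading with the encoding the file was written in gives the original text.
    with open(path, mode="r", encoding="utf-8") as f:
        correct = f.read()

    # Reading with a different encoding succeeds here but garbles the last letter.
    with open(path, mode="r", encoding="latin-1") as f:
        wrong = f.read()

    # Binary mode skips decoding and returns the raw bytes.
    with open(path, mode="rb") as f:
        raw = f.read()

print(ascii(correct), ascii(wrong), raw)
```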

Writing a Text File in Python

You can also use the open() function to write files. To write, pass mode="w", which creates the file if it does not exist and overwrites it if it does:

>>> with open("data.txt", mode="w", encoding="utf-8") as f:
...     f.write("Hi, I am having fun learning Python")
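When writing, it helps to remember that write() counts characters while the file on disk is measured in bytes. Here is a minimal sketch using a temporary file; the path and sample text are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"

    with open(path, mode="w", encoding="utf-8") as f:
        # In text mode, write() returns the number of characters written.
        chars_written = f.write("résumé")

    # The size on disk is the number of encoded bytes, not characters.
    size_on_disk = path.stat().st_size
    raw = path.read_bytes()

print(chars_written, size_on_disk, raw)
```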

Other Encodings Available in Python

As mentioned previously, there are multiple encodings available for Unicode characters in Python. We already discussed UTF-8, which is the most common and widely used. It is also the default encoding for Python 3, but there are others:

UTF-16: UTF-16 represents each Unicode character in two or four bytes, so the smallest representation of any character is two bytes. The major advantage UTF-8 has over UTF-16 is that the former uses one byte for an ASCII character, while the latter uses two bytes for the same character. As a result, a UTF-16-encoded English text file is roughly twice as large as a UTF-8-encoded version of the same file.

UTF-32: UTF-32 uses a fixed four bytes for every Unicode character. Because of this, UTF-32 uses more memory than UTF-8 and UTF-16 for most text. In exchange, the fixed width simplifies string manipulation: the length of a string in bytes is simply four times the number of characters, and any character can be located by its position without scanning. However, for every ASCII character, you use an additional three bytes compared to UTF-8.
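The size differences are easy to measure. The sketch below uses the little-endian codec names ("utf-16-le", "utf-32-le") so that the counts exclude the byte order mark that the plain "utf-16" and "utf-32" codecs prepend; the sample strings are our own.

```python
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Size of an ASCII-only string in each encoding.
ascii_sizes = {name: len("Hello".encode(name)) for name in encodings}

# Size of a single emoji, which lies outside the range UTF-16 covers in two bytes.
emoji_sizes = {name: len("😀".encode(name)) for name in encodings}

# The plain "utf-16" codec adds a two-byte byte order mark at the start.
with_bom = len("Hello".encode("utf-16"))

print(ascii_sizes)
print(emoji_sizes)
print(with_bom)
```

For English text UTF-8 is clearly the most compact, while for a character like the emoji all three encodings end up using four bytes.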

Pitfalls and How to Fix Them

Character encoding errors are troublesome, and no developer enjoys spending time hunting down this kind of bug. Let us look at common pitfalls in character encoding and how to fix likely errors. One common error raised when working with character encoding in Python 3 is the UnicodeEncodeError. There are a few causes of this error.

First, a UnicodeEncodeError may occur when a string contains characters that the target encoding cannot represent, such as emojis or accented letters written to an ASCII or Latin-1 destination. Unicode itself covers these characters and UTF-8 can encode all of them, but many older encodings support only a small subset. Second, encoding and decoding use the strict error handler by default, which raises an exception as soon as a character cannot be encoded or a byte sequence cannot be decoded.

There is also a UnicodeDecodeError, which occurs when the bytes we are reading are not valid in the character encoding Python is using to decode them, typically because they were written with a different encoding.
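Both errors can be reproduced in a few lines. This is a minimal sketch: the sample text and the choice of ASCII as the limited encoding are assumptions made for the demonstration.

```python
# Encoding fails when the target encoding has no byte for a character.
try:
    "parlé".encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding fails when the bytes are not valid in the chosen encoding.
# b"parl\xe9" is "parlé" in Latin-1, which is not a valid UTF-8 sequence.
try:
    b"parl\xe9".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

# Both exceptions record the codec and the position of the problem.
print(encode_error.encoding, encode_error.start)
print(decode_error.encoding, decode_error.start)
```

The position reported by the exception is often the quickest way to find the offending character in a long string.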

How Can I Avoid or Fix This Error?

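One way to fix a character encoding error is to pass errors="ignore" or errors="replace", which drops or substitutes the characters that cannot be encoded instead of raising an exception. Keep in mind that both options silently lose data, so where possible it is better to switch to an encoding that supports the characters, such as UTF-8. You can also pass the errors argument when opening a file. Here is an example: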

text = 'ф'

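# errors='ignore' drops any character the chosen encoding cannot represent.
# UTF-8 can encode 'ф', so in this case the text is written unchanged.
with open('message.txt', 'w', encoding='utf-8', errors='ignore') as f: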
    f.write(text)
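To see what the error handlers actually do, it helps to encode to a deliberately limited encoding. The sketch below assumes ASCII as the target and also shows "backslashreplace", another handler built into Python that keeps the information in escaped form.

```python
text = "parlé 😀"

# "ignore" silently drops every character ASCII cannot represent.
ignored = text.encode("ascii", errors="ignore")

# "replace" substitutes a question mark for each such character.
replaced = text.encode("ascii", errors="replace")

# "backslashreplace" writes an escape sequence instead of losing the character.
escaped = text.encode("ascii", errors="backslashreplace")

print(ignored)
print(replaced)
print(escaped)
```

Of the three, only the escaped form lets you recover what the original characters were.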

Conclusion

Computers do not recognize text; they store data in binary format. Character encoding is the key that converts this binary data to readable text. The early method of character encoding, ASCII, was insufficient because it could not represent non-English characters. This was resolved by the introduction of Unicode, which assigns a unique code point to every character. We also discussed how encoding and decoding work in Python 3 and the other Unicode encodings Python supports.

I hope this tutorial was helpful to you. Happy Coding! 🙂