A recent project required me to work with a number of character encodings. And, to quote a colleague who has done more than his share of this dirty work: ‘Character sets are a b****’. Yes, they are. This particular project had free text stored in one encoding, a dependency which required input in a different encoding, and my own need for utf-8 output. If I had known all these requirements up front, a simple pipeline diagram could have helped me control the flow, but the project was exploratory. I’ve probably made enough excuses by now, so I can admit that I soon no longer knew which datasets had which encoding until I ran them and hit the errors. Surely there must be a better way?
I found two attractive options which approach the problem in different ways: chardet, which relies on finite state machines, and charset_normalizer, which uses a brute-force approach to pick the encoding with the least amount of noise. Let’s briefly explore both.
chardet
The chardet package has a number of underlying finite state machines which track all the possible interpretations of a given byte sequence for a particular encoding. It reads byte by byte and outputs a dictionary with the recommended encoding, a confidence score, and the language. (Language is detected using a language model, but that part is apparently not particularly good; there are better options…) The primary entry point is the detect function. Here’s how we can use it:
import chardet
from pathlib import Path # my usual file interface
import urllib.request # another way to get text: from the internet
# file-based
input_file = Path('file.txt') # what's my encoding?
with open(input_file, 'rb') as fh: # get content of file as bytes since we don't know encoding
    bytestr = fh.read()
result = chardet.detect(bytestr)
print(result)
#> {'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
# web-based
url = '...' # pick a URL
raw = urllib.request.urlopen(url).read()
result = chardet.detect(raw)
print(result)
#> {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
In the code example above, we first read the input file in bytes mode ('rb'); the default mode, 'r', reads in text mode, which requires already knowing the encoding.
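chardet also offers an incremental interface, UniversalDetector, which fits the byte-by-byte design mentioned above: you feed it chunks and stop as soon as it is confident, which is handy for large files. A minimal sketch, with a placeholder file name:
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
with open('big-file.txt', 'rb') as fh: # placeholder file; read as bytes again
    for chunk in fh:
        detector.feed(chunk) # feed one chunk at a time
        if detector.done: # the detector signals once it is confident
            break
detector.close() # finalise the internal state
print(detector.result) # same dict shape as chardet.detect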
charset_normalizer
The charset_normalizer package has an annoyingly long name, but claims to be faster and more accurate than chardet, isn’t under chardet’s LGPL license ‘restriction’, and supports more character sets. It even includes a detect function for those who might want to migrate from chardet. The documentation says that it essentially takes a brute-force approach: it tries each encoding on a chunk of the text and retains the candidates whose noise falls below a particular threshold.
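As a quick sketch of that migration path, the drop-in detect call takes bytes and returns a chardet-style dictionary (the exact confidence values will differ from chardet’s):
from charset_normalizer import detect
result = detect(b'\xc3\xa9chantillon de texte') # utf-8 bytes for 'échantillon de texte'
print(result['encoding'])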
Besides detect, which works the same as in chardet, charset_normalizer has three basic interfaces: from_bytes, from_path, and from_fp. All of them return a CharsetMatches object, which basically behaves as a list of CharsetMatch objects (each of which exposes encoding information via .encoding). Here’s the context for each of these:
from charset_normalizer import from_path, from_fp, from_bytes
# use `from_bytes` to get from bytes already in Python
some_bytestr = b'\xd8\xa7\xd9\x84\xd8\xac\xd9\x88 \xd9\x85\xd8\xb4\xd9\x85\xd8\xb3 \xd8\xa8\xd8\xa7\xd9\x84\xd8\xae\xd8\xa7\xd8\xb1\xd8\xac'
cm_list = from_bytes(some_bytestr) # result is a list of CharsetMatch objects
# use `from_fp` for an opened filelike object
with open('what-am-i.txt', 'rb') as fh:
    cm = from_fp(fh).best() # calling `.best()` returns the best-matching CharsetMatch from the list
# use `from_path` to point to a file
path = 'what-am-i.txt'
cm = from_path(path).best()
The cm returned from all these functions (after .best() is called) provides a number of properties, of which encoding is the one I will focus on in my use cases. I have two particular use cases: 1) determine a file’s encoding and read it line by line, and 2) determine the file’s encoding and read the file as a single text block. These are related, so let’s look at how they work.
from pathlib import Path
from charset_normalizer import from_path
file = Path('path/to/something.txt') # file to read; might also be `for file in path.iterdir():`
# case 1: read file line-by-line
charset = from_path(file).best() # get the best character set
for line in str(charset).splitlines(keepends=True): # read line by line, retaining newlines
    mylib.parse(line)
# case 2: determine file encoding and read the entire text
charset = from_path(file).best() # get the best character set
text = str(charset) # this is the text of the file with the best encoding applied
encoding = charset.encoding # this is the selected encoding
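Since my own output needed to be utf-8, the natural follow-on to case 2 is to write the decoded text back out with an explicit encoding. Continuing from the snippet above, with a made-up output path:
out_file = Path('path/to/something.utf8.txt') # hypothetical output location
out_file.write_text(text, encoding='utf-8') # re-encode the decoded text as utf-8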
charset_normalizer also boasts a command line interface. Once the package is installed into your virtual environment, you can activate that environment and run normalizer.
# single file
normalizer /path/to/something.txt
#> long JSON output
normalizer -m /path/to/something.txt
#> utf_8
# multiple files
normalizer /path/to/something.txt /path/to/somewhere.txt
#> long JSON output for each file, separated by a newline
normalizer -m /path/to/something.txt /path/to/somewhere.txt
#> ascii
#> utf_8
Discussion
I ended up using charset_normalizer for my purposes. One slight annoyance is that, since utf-8 is a strict superset of ascii, a utf-8-encoded file that happens to contain only ASCII characters is reported as ascii. This makes sense given the design of the project, which is not so much to determine the correct encoding as to retrieve the correct characters from the file/text. A simple rule such as encoding = 'utf8' if cm.encoding == 'ascii' else cm.encoding avoids the issue.
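Putting that rule into the case-2 workflow from earlier, a minimal sketch with a hypothetical file name:
from pathlib import Path
from charset_normalizer import from_path
file = Path('maybe-ascii.txt') # hypothetical input
cm = from_path(file).best() # best-matching CharsetMatch
encoding = 'utf8' if cm.encoding == 'ascii' else cm.encoding # fold ascii into utf-8
text = str(cm) # decoded text, regardless of the reported label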