A recent project required me to work with a number of character encodings. And, to quote a colleague who has done more than his share of this dirty work: ‘Character sets are a b****’. Yes, they are. This particular project had free text stored in one encoding, a dependency which required input in a different encoding, and my own needs for utf-8
output. If I had known all these requirements up front, then a simple pipeline diagram could have helped me control the flow, but the project was exploratory. I’ve probably made enough excuses by now, so I can admit that I soon no longer knew which datasets had which encoding until I ran them and noticed the errors. Surely, by now, there must be a better way?
I found two attractive options which approach the problem in different ways: chardet, which relies on a finite state machine, and charset_normalizer, which uses a brute-force approach to pick the encoding with the least amount of noise. Let’s briefly explore both.
chardet
The chardet package has a number of underlying finite state machines which track all the possible interpretations of a given byte sequence for a particular encoding. It reads the input byte by byte and outputs a dictionary with the recommended encoding, a confidence value, and the language. (Language is detected using a language model, but this is apparently not particularly good; there are better options for that job.) The primary entry point is the detect function. Here’s how we can use it:
import chardet
from pathlib import Path # my usual file interface
import urllib.request # another way to get text: from the internet
# file-based
input_file = Path('file.txt') # what's my encoding?
with open(input_file, 'rb') as fh: # get content of file as bytes since we don't know the encoding
    bytestr = fh.read()
result = chardet.detect(bytestr)
print(result)
#> {'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
# web-based
url = '...' # pick a URL
raw = urllib.request.urlopen(url).read()
result = chardet.detect(raw)
print(result)
#> {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
In the code example above, we first read an input file in bytes mode ('rb'); the default mode 'r' reads in text mode, which requires knowing the encoding up front.
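Once detect has made its guess, actually reading the text is just a matter of decoding the bytes we already have. Here is a minimal sketch; note that detect can report None for the encoding when it finds nothing convincing, so the utf-8 fallback below is my own assumption:
encoding = result['encoding'] or 'utf-8' # fall back to utf-8 if detection returned None (my assumption)
text = bytestr.decode(encoding) # decode the bytes we read earlier into a str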
charset_normalizer
The charset_normalizer package has an annoyingly long name, but claims to be faster and more accurate than chardet, doesn’t carry the LGPL license ‘restriction’, and supports more character sets. It even includes a detect function for those who might want to migrate from chardet. The documentation says that it basically performs a brute-force approach, trying each encoding on a chunk of the text and retaining those whose noise falls below a particular threshold.
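Since migration gets a mention, here is a minimal sketch of that drop-in detect function; the file name is a placeholder, and the returned dictionary is shaped much like chardet’s output:
from charset_normalizer import detect # drop-in replacement for chardet.detect
with open('what-am-i.txt', 'rb') as fh: # 'what-am-i.txt' is a placeholder file name
    result = detect(fh.read())
print(result) # a dict with encoding, language, and confidence, much like chardet's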
Besides detect, which works the same as in chardet, charset_normalizer has three basic interfaces: from_bytes, from_path, and from_fp. All of them return a CharsetMatches object, which basically behaves as a list of CharsetMatch objects (each of which exposes encoding information via .encoding). Here’s how each of them is used:
from charset_normalizer import from_path, from_fp, from_bytes
# use `from_bytes` to get from bytes already in Python
some_bytestr = b'\xd8\xa7\xd9\x84\xd8\xac\xd9\x88 \xd9\x85\xd8\xb4\xd9\x85\xd8\xb3 \xd8\xa8\xd8\xa7\xd9\x84\xd8\xae\xd8\xa7\xd8\xb1\xd8\xac'
cm_list = from_bytes(some_bytestr) # result is a CharsetMatches object that behaves like a list of CharsetMatch
# use `from_fp` for an opened filelike object
with open('what-am-i.txt', 'rb') as fh:
    cm = from_fp(fh).best() # calling `.best()` returns the best-matching CharsetMatch from the list
# use `from_path` to point to a file
path = 'what-am-i.txt'
cm = from_path(path).best()
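Because CharsetMatches behaves like a list, we can also peek at all the candidates that survived the noise threshold, not just the best one. A small sketch building on cm_list from above (note that .best() can return None if nothing matched):
for match in cm_list: # each entry is a CharsetMatch candidate
    print(match.encoding) # the candidate encoding's name
best = cm_list.best() # the best-matching candidate, or None if nothing matched
if best is not None:
    print(best.encoding) # the encoding to use downstream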
The cm returned from all these functions (after .best() is called) provides a number of properties, of which encoding is the one I will focus on for my use cases. I have two particular use cases: 1) determine a file’s encoding and read it line by line, and 2) determine a file’s encoding and read the file as a single text block. These are related, so let’s look at how they work.
from pathlib import Path
from charset_normalizer import from_path
file = Path('path/to/something.txt') # file to read; might also be `for file in path.iterdir():`
# case 1: read file line-by-line
charset = from_path(file).best() # get the best character set
for line in str(charset).splitlines(keepends=True): # read line by line, retaining newlines
    mylib.parse(line)
# case 2: determine file encoding and read the entire text
charset = from_path(file).best() # get the best character set
text = str(charset) # this is the text of the file with the best encoding applied
encoding = charset.encoding # this is the selected encoding
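For larger files I might prefer not to materialise the whole decoded text just to iterate over lines. Here is a minimal sketch of an alternative for case 1: detect the encoding once, then let the built-in open() do the decoding while streaming (mylib.parse is the same hypothetical parser as above, and .best() can return None if no encoding matched):
charset = from_path(file).best() # get the best character set
if charset is not None:
    with open(file, encoding=charset.encoding) as fh: # let Python decode while streaming
        for line in fh: # line by line, newlines retained
            mylib.parse(line)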
charset_normalizer also boasts a command line interface. Once the package is installed into your virtual environment, you can activate that virtual environment and run normalizer.
# single file
normalizer /path/to/something.txt
#> long JSON output
normalizer -m /path/to/something.txt
#> utf_8
# multiple files
normalizer /path/to/something.txt /path/to/somewhere.txt
#> long JSON output for each file, separated by a newline
normalizer -m /path/to/something.txt /path/to/somewhere.txt
#> ascii
#> utf_8
Discussion
I ended up using charset_normalizer for my purposes. One slight annoyance is that since utf-8 is a strict superset of ascii, a utf-8 encoded file that happens to contain only ASCII characters may be reported as ascii. This makes sense given the design of the project, whose goal is not so much to determine the original encoding as to retrieve the correct characters from the file/text. A simple rule such as encoding = 'utf8' if cm.encoding == 'ascii' else cm.encoding avoids the issue.
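To make that rule easy to reuse, here is a minimal sketch of a small helper; the utf8 fallback when nothing matches at all is my own assumption rather than anything charset_normalizer prescribes:
from charset_normalizer import from_path

def detect_encoding(path):
    # return the detected encoding, treating ascii as utf8
    cm = from_path(path).best()
    if cm is None: # nothing matched; fall back to utf8 (my assumption)
        return 'utf8'
    return 'utf8' if cm.encoding == 'ascii' else cm.encoding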