I use a lot of regular expressions in my work. They are powerful tools for extracting, replacing, or locating text strings of interest, and much of that power comes from their flexibility: character classes, case insensitivity, and so on. Take a simple use case: let's find all the words (letter-only sequences) in some text:
import re
import pathlib
from collections import Counter

path = pathlib.Path('/some/directory/with/text/files/')
pattern = re.compile(r'[\W\d]+')  # match all non-letters

all_words = Counter()
for file in path.iterdir():
    with open(file, encoding='utf8') as fh:
        text = fh.read()
    words = pattern.split(text)  # split on all non-letters
    all_words.update(words)  # add to counter

all_words.most_common(10)  # or whatever analysis you want to do
Regular expressions do have some potential drawbacks, the most frequent (and quite justifiable) objection being their future readability. I have seen suggestions ranging from using the re.VERBOSE flag to including comments ((?#watch out)). While these may work for you, my approach is to use f-string replacement, extracting out recurring patterns.
Let’s take an example. Suppose we wish to identify descriptions of the concept of opioid abuse in some corpus. We might start with the quite simple (and naive):
OPIOID_ABUSE_PAT = re.compile('opioid abuse', re.I)
But we also need to catch cases like ‘abuse of opioids’, ‘abusing opioids’, etc.:
OPIOID_ABUSE_PAT = re.compile('opioid abuse|abuse of opioids|abusing opioids', re.I)
As we look through our corpus, however, we also notice that ‘abuse’ and ‘opioid’ appear to have various synonyms: ‘misuse’, ‘misuses’, ‘misusing’, ‘narcotics’, ‘opiates’, or even the name of a particular opiate. Sometimes the words are plural, and other times they are not.
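Trying to keep up with all of these in the enumerated style gets unwieldy quickly; an illustrative (and deliberately incomplete) continuation might look like:

OPIOID_ABUSE_PAT = re.compile(
    'opioid abuse|opiate abuse|narcotic abuse'
    '|opioid misuse|opiate misuse|narcotic misuse'
    '|abuse of opioids|abuse of opiates|abuse of narcotics'
    '|abusing opioids|abusing opiates|abusing narcotics'
    '|misusing opioids|misusing opiates|misusing narcotics',
    re.I
)

And that still ignores plurals and any specific drug names.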
A different approach tries to break the regular expression down into coherent parts and then build these together. We’ll combine all the words for ‘opioids’ into a string called opioids, and all the terms for abuse into a string called abuse. We’ll also use some shorthand allowing for up to 2 intervening words: words2. Note that I picked \s as the separator, but in some cases we may prefer \W (i.e., anything that is not a letter, number, or underscore) to allow for intervening punctuation; a sketch of that variant appears after the pattern below.
opioids = r'(?:(?:opioid|opiate|narcotic)s?)'
abuse = r'(?:(?:misus|abus)\w+)'
words2 = r'(?:\w+\s+){,2}'

OPIOID_ABUSE_PAT = re.compile(
    rf'\b(?:'  # \b to ensure we start at a word boundary
    rf'{opioids}\s*{words2}\s*{abuse}'
    rf'|{abuse}\s*{words2}\s*{opioids}'
    rf')\b',
    re.IGNORECASE
)
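As mentioned above, if we want to tolerate intervening punctuation, we can swap the \s separators for \W. A sketch of that variant (words2_punct and OPIOID_ABUSE_PUNCT_PAT are names I’m introducing here just for illustration):

words2_punct = r'(?:\w+\W+){,2}'  # up to 2 intervening words, any non-word separators
OPIOID_ABUSE_PUNCT_PAT = re.compile(
    rf'\b(?:'
    rf'{opioids}\W*{words2_punct}{abuse}'
    rf'|{abuse}\W*{words2_punct}{opioids}'
    rf')\b',
    re.IGNORECASE
)

With this, text like ‘abuse of, say, opioids’ also matches.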
One warning about using f-strings inside a regular expression: a repetition count such as \w{1,2} (i.e., exactly one or two letters/numbers/underscores) will not be interpreted correctly in an f-string, because the braces are read as replacement fields. Instead, double the braces: \w{{1,2}}.
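For example, the doubled braces collapse to single braces once the f-string is evaluated:

short_word = rf'\b\w{{1,2}}\b'
print(short_word)  # prints \b\w{1,2}\b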
We can print the resulting mess of a pattern:
>>> print(OPIOID_ABUSE_PAT.pattern)
\b(?:(?:(?:opioid|opiate|narcotic)s?)\s*(?:\w+\s+){,2}\s*(?:(?:misus|abus)\w+)|(?:(?:misus|abus)\w+)\s*(?:\w+\s+){,2}\s*(?:(?:opioid|opiate|narcotic)s?))\b
That would be quite annoying to maintain, but the version we have is, I’d argue, significantly more readable and modifiable. Suppose we needed to add a specific drug, e.g., ‘codeine’; how difficult would that be? We’d modify the opioids variable, and the new alternative would be populated throughout the expression.
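For example, it could be as small a change as this (a sketch; I’m simply treating ‘codeine’ as one more alternative):

opioids = r'(?:(?:opioid|opiate|narcotic)s?|codeine)'

Remember to rebuild OPIOID_ABUSE_PAT after changing the variable, since the f-string was evaluated when the pattern was first constructed.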
Oh, but we’re forgetting one thing: how do we know it works? And how can we ensure it still works after making changes? Tests. And by ‘tests’, we of course mean pytest. pip install pytest and then use a test suite like this:
import pytest

# import OPIOID_ABUSE_PAT from the module where it is defined

@pytest.mark.parametrize('text', [
    'opioid abuse',
    'abusing narcotics',
    'abuse of several narcotics',
])
def test_opioid_abuse_pat(text):
    m = OPIOID_ABUSE_PAT.search(text)
    assert m is not None
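We might also add cases that should not match (these particular negative examples are mine, chosen only for illustration):

@pytest.mark.parametrize('text', [
    'alcohol abuse',
    'opioid prescription',
])
def test_opioid_abuse_pat_no_match(text):
    assert OPIOID_ABUSE_PAT.search(text) is None

Run the suite (e.g., python -m pytest) after any change to the component strings.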
These tests also serve as useful documentation. Ever wonder what that regular expression of yours was supposed to do? Just jump over to your tests and you’ll know.