I recently had the challenge of determining which flags had been set in a compiled regular expression. In other words: write a function that, given a compiled regular expression (e.g., re.compile('test', re.I | re.M)), determines that the flags were re.I and re.M.
A first attempt might assume that the class re.Pattern has a flags attribute which will provide us the answer. And so it does (well, it has the attribute):
> re.compile('test', re.I | re.M).flags
42
> re.compile('test', re.I).flags
34
> re.compile('test').flags
32
Hmm…so why these particular numbers? Well, we could create a mapping of these numbers to the set of flags that produced them. Unfortunately, there are eight flags, and since each one is either on or off, that leaves up to 2^8 = 256 combinations to account for. There must be a better way.
There is, and the key is in how we specify multiple flags. Notice that we combine them with the bitwise ‘or’ operator |. If we specified the flags as a list, it would make sense for re.Pattern.flags to be a list of flags. But it’s not. Instead, the re module uses a space-efficient approach to recording which flags have been set, and the same method is used more broadly as well, for example with user permissions on a website.
Each of these flags is a member of re.RegexFlag. We can iterate through the individual flags (it behaves like an enum) and see what their binary representation is:
import re
for flag in re.RegexFlag:
    print(f'{flag.name:12}: {flag:010b}')
ASCII : 0100000000
IGNORECASE : 0000000010
LOCALE : 0000000100
UNICODE : 0000100000
MULTILINE : 0000001000
DOTALL : 0000010000
VERBOSE : 0001000000
DEBUG : 0010000000
# also, include the NoFlag
print(f'{re.NOFLAG.name:12}: {re.NOFLAG:010b}')
NOFLAG : 0000000000
Starting with NOFLAG (i.e., all zeroes), each flag flips a different bit. Let’s see what happens when we combine two different flags using the bitwise or operator |:
print(f'{re.IGNORECASE | re.MULTILINE:010b}')
0000001010 # result: includes two flipped bits
0000000010 # ignorecase
0000001000 # multiline
print(f'{re.DEBUG | re.ASCII:010b}')
0110000000 # result: includes two flipped bits
0100000000 # ascii
0010000000 # debug
print(f'{re.DEBUG | re.ASCII | re.VERBOSE:010b}')
0111000000 # result: includes THREE flipped bits
0100000000 # ascii
0010000000 # debug
0001000000 # verbose
Each binary representation is also how a particular integer is stored, which explains the result of re.compile('test').flags from before: 32 is 0000100000 in binary, i.e., the UNICODE bit on its own.
> int(re.DEBUG | re.UNICODE)
160
> int(re.DEBUG | re.ASCII)
384
> int(re.NOFLAG)
0
Compare those values with those of:
# NB: if you pass only re.DEBUG, re.UNICODE is included by default
> re.compile('test', re.DEBUG | re.UNICODE).flags
160
> re.compile('test', re.DEBUG).flags # includes re.UNICODE
160
> re.compile('test', re.DEBUG | re.ASCII).flags
384
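To make the arithmetic concrete, here is a minimal sketch that splits the integers from the very first example (42, 34 and 32) into the powers of two they are made of. The helper name set_bits is hypothetical and not part of the re module; it only illustrates that each integer is a sum of single-bit values.

```python
def set_bits(n):
    """Return the powers of two that sum to the non-negative integer n."""
    bits = []
    bit = 1
    while bit <= n:
        if n & bit:          # is this particular bit set in n?
            bits.append(bit)
        bit <<= 1            # move on to the next bit
    return bits

for value in (42, 34, 32):
    print(value, f'{value:010b}', set_bits(value))
# 42 0000101010 [2, 8, 32]
# 34 0000100010 [2, 32]
# 32 0000100000 [32]
```

Matching these against the table of flags: 2 is IGNORECASE, 8 is MULTILINE and 32 is UNICODE.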
Thus, when these values are ‘or’d together (with |), the result keeps every ‘1’ found in either operand. To solve our original problem of determining which flags were used, we’ll need the bitwise ‘and’ operator &. Consider how these work:
# bitwise or retains all flags
1 | 1 # 1
1 | 0 # 1
0 | 1 # 1
0 | 0 # 0
# bitwise and shows if flag already included
1 & 1 # 1
1 & 0 # 0
0 & 1 # 0
0 & 0 # 0
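The same rules apply position by position when the operands have more than one bit. A small sketch with made-up values (a, b, flags and mask are illustrative names only):

```python
a = 0b1010  # bits 1 and 3 set
b = 0b0110  # bits 1 and 2 set

print(f'{a | b:04b}')  # 1110: a bit is set if it is set in either operand
print(f'{a & b:04b}')  # 0010: a bit is set only if it is set in both

# Using a single-bit value as a mask tests for exactly that bit.
flags = 0b101010        # 42, as in the first example
mask = 0b001000         # 8
print(flags & mask)     # 8: the bit is present
print(flags & 0b000100) # 0: the bit is absent
```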
We can, therefore, extend this to the regex flags by using the bitwise and. If the flag is set in the flags attribute, the result is the matched flag (think of this as just returning 1; i.e., bool(flag) == True). If the flag isn’t set, the result is 0, which is the same as re.NOFLAG (i.e., bool(re.NOFLAG) == False). Here’s an example:
> re.IGNORECASE & re.IGNORECASE
re.IGNORECASE # flag exists
> re.IGNORECASE & re.DEBUG
re.NOFLAG # flag doesn't exist; int(re.NOFLAG) == 0
> (re.IGNORECASE | re.DEBUG) & re.IGNORECASE
re.IGNORECASE # both contain re.IGNORECASE as in the first example
# now, same thing but get the flags from `flags` attribute
> re.compile('test', re.IGNORECASE).flags & re.IGNORECASE
re.IGNORECASE
# now, convert to bool
> bool(re.compile('test', re.IGNORECASE).flags & re.IGNORECASE)
True
> re.compile('test', re.IGNORECASE).flags & re.MULTILINE
re.NOFLAG
> bool(re.compile('test', re.IGNORECASE).flags & re.MULTILINE)
False
We can thus check whether or not a flag is present in a compiled regular expression by using bitwise and plus the flag (e.g., pat.flags & re.I). Now, we only need to iterate through all of the flags and check if they’re present.
import re
def get_re_flags(pat):
    flags = []
    for flag in re.RegexFlag:
        if pat.flags & flag:
            flags.append(flag.name)
    return flags
And now for the tests. It’s important to note that each str pattern must have either re.UNICODE or re.ASCII set, and that re.UNICODE is set by default (in Python 3). Also, you can’t specify both re.ASCII and re.UNICODE, since re.ASCII means ‘only match ASCII characters’ (for classes such as \w, \d and \s) and re.UNICODE means ‘match all Unicode characters’. They are really two settings of the same switch, but I think having two flags allows the user to state the choice more clearly (else, how would you switch to the non-default using bitwise or, which can only turn bits on?). Anyway, to the tests:
def test_get_re_flags():
    # Test no flags set (will include unicode by default)
    pat = re.compile('test')
    assert get_re_flags(pat) == ['UNICODE']
    # Test single flag (will include unicode by default)
    pat = re.compile('test', re.IGNORECASE)
    assert get_re_flags(pat) == ['IGNORECASE', 'UNICODE']
    # Test multiple flags
    pat = re.compile('test', re.IGNORECASE | re.MULTILINE)
    assert sorted(get_re_flags(pat)) == ['IGNORECASE', 'MULTILINE', 'UNICODE']
    # Test with all standard flags
    all_flags = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE | re.ASCII
    pat = re.compile('test', all_flags)
    assert set(get_re_flags(pat)) == {'ASCII', 'DOTALL', 'IGNORECASE', 'MULTILINE', 'VERBOSE'}
# passes: test_get_re_flags PASSED [100%]
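The rule about re.ASCII and re.UNICODE mentioned above can be checked directly. This is a minimal sketch: compiles is a hypothetical helper name, and it relies on re.compile raising ValueError for incompatible flag combinations on a str pattern.

```python
import re

def compiles(pattern, flags=0):
    """Return True if re.compile accepts this pattern/flags combination."""
    try:
        re.compile(pattern, flags)
        return True
    except ValueError:
        # raised for incompatible flags, e.g. ASCII together with UNICODE
        return False

print(compiles('test', re.ASCII))               # True
print(compiles('test', re.UNICODE))             # True
print(compiles('test', re.ASCII | re.UNICODE))  # False

# A bytes pattern is the exception to 'UNICODE by default':
print(re.compile(b'test').flags)                # 0
```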
One final note: we can use these strings to get the original flag by, again, treating re.RegexFlag like an enum:
> re.RegexFlag['I']
re.IGNORECASE
> re.RegexFlag['IGNORECASE']
re.IGNORECASE
> re.compile('test', re.RegexFlag['IGNORECASE']).flags & re.IGNORECASE
re.IGNORECASE
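Going one step further, a list of names (such as the one get_re_flags returns) can be folded back into a single combined flag with the bitwise or. A minimal sketch, where flags_from_names is a hypothetical helper name:

```python
import re
from functools import reduce
from operator import or_

def flags_from_names(names):
    """Combine flag names (e.g. 'IGNORECASE' or 'I') into one RegexFlag value."""
    # Start from the empty flag (0) and 'or' in each named flag in turn.
    return reduce(or_, (re.RegexFlag[name] for name in names), re.RegexFlag(0))

combined = flags_from_names(['IGNORECASE', 'MULTILINE'])
pat = re.compile('test', combined)
print(pat.flags)  # 42, same as re.compile('test', re.I | re.M).flags
```

When rebuilding a pattern from the output of get_re_flags, 'UNICODE' can stay in the list for str patterns, since it is the default anyway.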
Summary
If you weren’t already familiar with bitwise operators, perhaps this exploration of regular expression flags has provided a brief introduction.
Post Scriptum
If you want to understand how (and why) permission systems use this same approach, consider the following examples (or see an explanation here). You’ll need to play with them, but here’s a start:
# define permissions
READ = 0b001
WRITE = 0b010
EXECUTE = 0b100
Give a user permissions:
class User:
    def __init__(self):
        self.permissions = 0

# User can read and write, but not execute
user1 = User()
user1.permissions = READ | WRITE
Then, before any action the user takes, ensure that he’s allowed to:
if user1.permissions & READ:
    print('Read permission granted.')
    # allow user to read something
if user1.permissions & WRITE:
    print('Write permission granted.')
    # allow user to write something
if user1.permissions & EXECUTE:
    print('Execute permission granted.')
    # allow user to execute something
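Granting and revoking follow the same pattern: or turns a bit on, and and with the inverted flag (~) turns it off. The sketch below repeats the constants so that it runs on its own; grant, revoke and has are hypothetical helper names, not part of any library.

```python
READ = 0b001
WRITE = 0b010
EXECUTE = 0b100

def grant(permissions, flag):
    """Turn the flag's bit on (no effect if it is already on)."""
    return permissions | flag

def revoke(permissions, flag):
    """Turn the flag's bit off (no effect if it is already off)."""
    return permissions & ~flag

def has(permissions, flag):
    """Check whether the flag's bit is on."""
    return bool(permissions & flag)

perms = grant(grant(0, READ), WRITE)  # 0b011: read and write
perms = revoke(perms, WRITE)          # 0b001: read only
print(has(perms, READ), has(perms, WRITE))  # True False
```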