How to Determine the Flags of a Compiled Regular Expression?

I recently had the challenge of determining which flags had been set in a compiled regular expression. In other words, write a function that, given a compiled regular expression (e.g., re.compile('test', re.I | re.M)), determines that the flags were re.I and re.M.

A first attempt might assume that the class re.Pattern has a flags attribute which will provide us the answer. And so it does (well, it has the attribute):

> re.compile('test', re.I | re.M).flags
42
> re.compile('test', re.I).flags
34
> re.compile('test').flags
32

Hmm…so why these particular numbers? Well, we could create a mapping of these numbers to the set of flags that produced them. Unfortunately, there are eight flags, which leaves up to 2^8 = 256 combinations to account for. There must be a better way.

There is, and the key is in how we specify multiple flags. Notice that we combine them with the bitwise ‘or’ operator |. If we specified the flags as a list, it would make sense for re.Pattern.flags to return a list of flags. But it doesn’t. Instead, the re module uses a space-efficient approach to recording which flags have been set, and the same method is used more broadly, for example with user permissions on a website.

Each of these flags is a re.RegexFlag member. We can iterate through the individual flags (re.RegexFlag is an enum) and see what their binary representations are:

import re


for flag in re.RegexFlag:
    print(f'{flag.name:12}: {flag:010b}')
    
ASCII       : 0100000000
IGNORECASE  : 0000000010
LOCALE      : 0000000100
UNICODE     : 0000100000
MULTILINE   : 0000001000
DOTALL      : 0000010000
VERBOSE     : 0001000000
DEBUG       : 0010000000

# also, include the NoFlag
print(f'{re.NOFLAG.name:12}: {re.NOFLAG:010b}')
NOFLAG      : 0000000000

Starting with NOFLAG (i.e., all zeroes), each flag flips a different bit. Let’s see what happens when we combine different flags using the bitwise or operator |:

print(f'{re.IGNORECASE | re.MULTILINE:010b}')
0000001010  # result: includes two flipped bits
0000000010  # ignorecase
0000001000  # multiline

print(f'{re.DEBUG | re.ASCII:010b}')
0110000000  # result: includes two flipped bits
0100000000  # ascii
0010000000  # debug

print(f'{re.DEBUG | re.ASCII | re.VERBOSE:010b}')
0111000000  # result: includes THREE flipped bits
0100000000  # ascii
0010000000  # debug
0001000000  # verbose

Each binary representation is also how a particular integer is stored, which explains the result of re.compile('test').flags from before: 32 is 0b100000, the bit for re.UNICODE.

> int(re.DEBUG | re.UNICODE)
160

> int(re.DEBUG | re.ASCII)
384

> int(re.NOFLAG)
0

Compare those values with those of:

# NB: if you pass only re.DEBUG, re.UNICODE is included by default
> re.compile('test', re.DEBUG | re.UNICODE).flags
160

> re.compile('test', re.DEBUG).flags  # includes re.UNICODE
160

> re.compile('test', re.DEBUG | re.ASCII).flags
384
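
The same arithmetic accounts for the 42, 34, and 32 from the first example. As a small sketch (the helper name set_bits is my own, not part of re), we can split any flags integer into the powers of two that add up to it by reading its binary digits from right to left:

```python
import re


def set_bits(value):
    """Return the powers of two that sum to value, smallest first (hypothetical helper)."""
    digits = f'{value:b}'  # e.g. 42 -> '101010'
    return [2 ** i for i, digit in enumerate(reversed(digits)) if digit == '1']


flags = re.compile('test', re.I | re.M).flags
print(flags, set_bits(flags))  # 42 [2, 8, 32]
```

Here 2 is re.IGNORECASE, 8 is re.MULTILINE, and 32 is the re.UNICODE that str patterns get by default.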

Thus, when these values are ‘or’d together (with |), the result keeps every ‘1’ found in either operand. To solve our original problem of determining which flags were used, we’ll need the bitwise ‘and’ operator &. Consider how these work:

# bitwise or retains all flags
1 | 1  # 1
1 | 0  # 1
0 | 1  # 1
0 | 0  # 0

# bitwise and shows if flag already included
1 & 1  # 1
1 & 0  # 0
0 & 1  # 0
0 & 0  # 0
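
The tables above cover a single bit, but | and & work on every bit position of an integer at once. A minimal sketch with plain integers (the variable names are only for illustration):

```python
flags = 0b101010  # 42: three bits are set
mask = 0b000010   # 2: the one bit we want to ask about
other = 0b000100  # 4: a bit that is not set in flags

present = flags & mask   # only the shared bit survives
absent = flags & other   # no shared bits, so nothing survives

print(f'{present:06b}')  # 000010
print(f'{absent:06b}')   # 000000
```

A nonzero result means the bit in the mask is set; zero means it is not.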

We can, therefore, extend this to the regex flags by using the bitwise and. If the flag is already set in the flags attribute, the result is the matched flag (which is truthy; i.e., bool(flag) is True). If the flag isn’t set, the result is 0, which is the same as re.NOFLAG (which is falsy; i.e., bool(re.NOFLAG) is False). Here’s an example:

> re.IGNORECASE & re.IGNORECASE
re.IGNORECASE  # flag exists

> re.IGNORECASE & re.DEBUG
re.NOFLAG  # flag doesn't exist; int(re.NOFLAG) == 0

> (re.IGNORECASE | re.DEBUG) & re.IGNORECASE
re.IGNORECASE  # both contain re.IGNORECASE as in the first example

# now, same thing but get the flags from `flags` attribute
> re.compile('test', re.IGNORECASE).flags & re.IGNORECASE
re.IGNORECASE

# now, convert to bool
> bool(re.compile('test', re.IGNORECASE).flags & re.IGNORECASE)
True

> re.compile('test', re.IGNORECASE).flags & re.MULTILINE
re.NOFLAG

> bool(re.compile('test', re.IGNORECASE).flags & re.MULTILINE)
False

We can thus check whether or not a flag is present in a compiled regular expression by applying bitwise and to its flags attribute and the flag in question (e.g., pat.flags & re.I). Now, we only need to iterate through all of the flags and check if they’re present.

import re


def get_re_flags(pat):
    flags = []
    for flag in re.RegexFlag:
        if pat.flags & flag:
            flags.append(flag.name)
    return flags

Before the tests, it’s important to note that every str pattern has either re.UNICODE or re.ASCII set, and that re.UNICODE is set by default (in Python 3). Also, you can’t specify both re.ASCII and re.UNICODE, since re.ASCII restricts classes such as \w and \d to ASCII characters, while re.UNICODE lets them match the full Unicode range. They are really two settings of the same switch, but I think having two flags allows the user to state the choice more clearly (else, how would you turn off the default using bitwise or?). Anyway, to the tests:

def test_get_re_flags():
    # Test no flags set (will include unicode by default)
    pat = re.compile('test')
    assert get_re_flags(pat) == ['UNICODE']

    # Test single flag (will include unicode by default)
    pat = re.compile('test', re.IGNORECASE)
    assert get_re_flags(pat) == ['IGNORECASE', 'UNICODE']

    # Test multiple flags
    pat = re.compile('test', re.IGNORECASE | re.MULTILINE)
    assert sorted(get_re_flags(pat)) == ['IGNORECASE', 'MULTILINE', 'UNICODE']

    # Test with all standard flags
    all_flags = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE | re.ASCII
    pat = re.compile('test', all_flags)
    assert set(get_re_flags(pat)) == {'ASCII', 'DOTALL', 'IGNORECASE', 'MULTILINE', 'VERBOSE'}

# passes: test_get_re_flags PASSED          [100%]
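
If you only need a yes/no answer for one flag rather than the full list, a possible shortcut is to wrap the integer in re.RegexFlag and use the in operator, which enum flags support. This is a sketch alongside get_re_flags, not a replacement for it:

```python
import re

pat = re.compile('test', re.IGNORECASE | re.MULTILINE)
as_flag = re.RegexFlag(pat.flags)  # wrap the plain int in the flag type

print(re.IGNORECASE in as_flag)  # True
print(re.DOTALL in as_flag)      # False
```

Under the hood this performs the same bitwise and as pat.flags & flag.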

One final note: we can use these strings to get the original flag by, again, treating re.RegexFlag like an enum:

> re.RegexFlag['I']
re.IGNORECASE

> re.RegexFlag['IGNORECASE']
re.IGNORECASE

> re.compile('test', re.RegexFlag['IGNORECASE']).flags & re.IGNORECASE
re.IGNORECASE
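
Going one step further, a list of names such as the one returned by get_re_flags can be folded back into a single combined flag. A minimal sketch, where flags_from_names is a hypothetical helper of my own:

```python
import re
from functools import reduce
from operator import or_


def flags_from_names(names):
    """Combine flag names such as 'IGNORECASE' into one flag value (hypothetical helper)."""
    # Start from the empty flag and 'or' each named flag into it.
    return reduce(or_, (re.RegexFlag[name] for name in names), re.RegexFlag(0))


combined = flags_from_names(['IGNORECASE', 'MULTILINE'])
pat = re.compile('test', combined)
print(pat.flags)  # 42
```

The 42 matches the very first example, since re.UNICODE is added by default for str patterns.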

Summary

If you weren’t already familiar with bitwise operators, perhaps this exploration of regular expression flags has provided a brief introduction.

Post Scriptum

If you want to understand how (and why) permission systems use this same approach, consider the following examples (or see an explanation here). You’ll need to play with them, but here’s a start:

# define permissions
READ = 0b001
WRITE = 0b010
EXECUTE = 0b100

Give a user permissions:

class User:

    def __init__(self):
        self.permissions = 0


# User can read and write, but not execute
user1 = User()
user1.permissions = READ | WRITE

Then, before any action the user takes, ensure that they’re allowed to take it:

if user1.permissions & READ:
    print('Read permission granted.')
    # allow user to read something
if user1.permissions & WRITE:
    print('Write permission granted.')
    # allow user to write something
if user1.permissions & EXECUTE:
    print('Execute permission granted.')
    # allow user to execute something
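
Granting and revoking a permission later on follows the same pattern. The sketch below is one common way to do it with plain integers; it also uses the bitwise ‘not’ operator ~, which has not come up so far:

```python
READ = 0b001
WRITE = 0b010
EXECUTE = 0b100

permissions = READ | WRITE   # 0b011: read and write
permissions |= EXECUTE       # grant execute: 0b111
permissions &= ~WRITE        # revoke write: 0b101

print(f'{permissions:03b}')       # 101
print(bool(permissions & WRITE))  # False
```

Here ~WRITE flips every bit of WRITE, so the and keeps all permissions except that one.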
