Dynamic Regex Queries in Pandas

I encountered an issue recently in which I wanted to dynamically retain only those rows which matched a group of regular expressions, or, in some cases, to be able to exclude rows matching a particular set of regular expressions. This is relatively straightforward should the regular expressions be known in advance.

Let’s begin by setting up a pandas dataframe we can play around with, loading some text from a few Wikipedia articles.

import pandas as pd

df = pd.DataFrame().from_records([
{'docid': 1, 'text': 'Joan was born to a peasant family at Domrémy in northeast France. In 1428, she traveled to Vaucouleurs and requested to be taken to Charles, later testifying that she had received visions from the archangel Michael, Saint Margaret, and Saint Catherine instructing her to support Charles and recover France from English domination.'},
{'docid': 2, 'text': 'Joan organized a company of French volunteers to relieve Compiègne, which had been besieged by the Burgundians. She was captured by Burgundian troops on 23 May and exchanged to the English. She was put on trial by the pro-English bishop, Pierre Cauchon, on a charge of heresy. She was declared guilty and burned at the stake on 30 May 1431, dying at about 19 years of age.'},
{'docid': 3, 'text': 'Just before Joan arrived on the scene in 1429, the English had nearly achieved their goal of an Anglo-French dual monarchy. Henry V\'s brothers, John of Lancaster, 1st Duke of Bedford and Humphrey, Duke of Gloucester had continued the English conquest of France. Nearly all of northern France, Paris, and parts of southwestern France were under Anglo-Burgundian control. The Burgundians controlled Reims, which had served as the traditional site for the coronation of French kings. This was important, as Charles had not yet been crowned, and doing so at Reims would help legitimize his claim to the throne. During this time, there were two prophecies circulating around the French countryside. One promised that a maid from the borderlands of Lorraine would come forth to work miracles, and the other was that France had been lost by a woman, but would be restored by a virgin.'},
{'docid': 4, 'text': 'In 1793, Congress suspended repayments of French loans incurred during the American Revolutionary War. The dispute escalated further due to different interpretations of the 1778 treaties of Alliance and Commerce between the two countries. France, then engaged in the 1792–1797 War of the First Coalition, which included Great Britain, viewed the 1794 Jay Treaty between the United States and Britain as incompatible with those treaties, and retaliated by seizing American ships trading with Britain.'},
{'docid': 5, 'text': 'Diplomatic negotiations failed to resolve these differences, and in October 1796 French privateers began attacking all merchant ships in American waters. The dissolution of Federal military forces following independence left the US unable to mount an effective response and by October 1797, over 316 American ships had been captured. In March 1798, Congress reassembled the United States Navy and in July authorized the use of military force against France'},
{'docid': 6, 'text': 'Kuappi is a restaurant in Iisalmi, Finland. According to the Guinness Book of Records, it is the smallest restaurant in the world. The building has a footprint of 8 square metres, of which 3.6 square metres is indoors.'},
])

Let’s suppose, too, that we are developing a program that the end user needs to be able to select an arbitrary set of parameters (e.g., note type, dates, subject, etc.). The program will fetch documents based on these parameters and filter the documents according to user-specified regular expressions — some to require, and others as exclusionary. There will be subsequent processing to these texts as well (e.g., NLP extraction), so it doesn’t make sense to do this as either a pre- or post-processing step.

The base case may look like developing in a Jupyter Notebook where we know the regular expressions. We can simply include the relevant regular expressions in the query using pandas’ df['column'].str.contains function. We may want to include texts that refer to france and war but not america. We can capture these in the following statements:

import re

df['text'].str.contains(r'(?:france|french)', flags=re.IGNORECASE)
df['text'].str.contains(r'(?:war|troop|volunteer|army)', flags=re.IGNORECASE)
~df['text'].str.contains(r'\b(?:america\w+|united states|u\.?s\.?a)\b', flags=re.IGNORECASE)

We use non-capturing regular expressions (?: ...) to prevent pandas warning us about capture groups, and the ignorecase flag will capture these expression regardless of whether they are capitalized or lowercase.

Adding these to our code, we can include them separately:

df = df[df['text'].str.contains(r'(?:france|french)', flags=re.IGNORECASE)]
print(df.docid.tolist())  # [1, 2, 3, 4, 5]

df = df[df['text'].str.contains(r'(?:war|troop|volunteer|army)', flags=re.IGNORECASE)]
print(df.docid.tolist())  # [2, 4]

df = df[~df['text'].str.contains(r'\b(?:america\w+|united states|u\.?s\.?a)\b', flags=re.IGNORECASE)]
print(df.docid.tolist())  # [2]

…or as a single statement:

df = df[
    df['text'].str.contains(r'(?:france|french)', flags=re.IGNORECASE) &
    df['text'].str.contains(r'(?:war|troop|volunteer|army)', flags=re.IGNORECASE) &
    ~df['text'].str.contains(r'\b(?:america\w+|united states|u\.?s\.?a)\b', flags=re.IGNORECASE)
]
print(df.docid.tolist()). # [2]

To do this dynamically, we should first establish our variables (we’ll assume a list of strings, one for include, one for exclude):

include_regexes = [
    '(?:france|french)',
    '(?:war|troop|volunteer|army)',
]
exclude_regexes = [
    r'(?:america|united states|\bu\.?s\.?a\b)',
]

One option is to iterate through these, repeatedly applying the inclusion/exclusion regular expressions to the corpus.

for rx in include_regexes:
    df = df[df['text'].str.contains(rx, flags=re.IGNORECASE)]
for rx in exclude_regexes:
    df = df[~df['text'].str.contains(rx, flags=re.IGNORECASE)]
df.docid.tolist(). # [2]

This works, and we could probably stop our investigation there…but is there a way to do this in a single step, a single pass through the dataframe? (Or two, one for exclude and another for include?) Since we don’t know exactly how long these lists might be, we will to specify everything dynamically.

pandas provides a function for doing this called df.query, which takes a string as an argument. We can convert these statements to string, and join with the ampersand (adding the all-important not ~ at the front of our exclusions.

# DOESN'T WORK
if include_regexes:
        include_regexes = ' & '.join(f'text.str.contains("{rx}", flags=re.IGNORECASE)' for rx in include_regexes)
        df = df.query(include_regexes)
if exclude_regexes:
    exclude_regexes = '~(' + ' & '.join(
        f'text.str.contains("{rx}", flags=re.IGNORECASE)' for rx in exclude_regexes
    ) + ')'
    df = df.query(exclude_regexes)

# UndefinedVariableError: name 're' is not defined

Unfortunately, this doesn’t quite work, giving an UndefinedVariableError. I’m not sure how exactly to fix this and ensure that the re library is in the appropriate namespace, but there are a couple workarounds.

First, we can look at the value of re.IGNORECASE (int(re.IGNORECASE) gives us 2), and then replace the re.IGNORECASE in the above to 2.

if include_regexes:
        include_regexes = ' & '.join(f'text.str.contains("{rx}", flags=2)' for rx in include_regexes)
        df = df.query(include_regexes)
if exclude_regexes:
    exclude_regexes = '~(' + ' & '.join(
        f'text.str.contains("{rx}", flags=2)' for rx in exclude_regexes
    ) + ')'
    df = df.query(exclude_regexes)
df.docid.tolist()  # [2]

Second, we can replace the flags argument with case=False to force ignoring case.

if include_regexes:
        include_regexes = ' & '.join(f'text.str.contains("{rx}", case=False)' for rx in include_regexes)
        df = df.query(include_regexes)
if exclude_regexes:
    exclude_regexes = '~(' + ' & '.join(
        f'text.str.contains("{rx}", case=False)' for rx in exclude_regexes
    ) + ')'
    df = df.query(exclude_regexes)
df.docid.tolist()  # [2]