Building Language Rules in spaCy

spaCy provides a number of useful methods for exploring and creating patterns once a text or document has been processed. To see this in action, let’s use spaCy to build some rules on the more computational-linguistic side of NLP. For those less interested in language, forgive a brief digression into Polish.

In Polish, if you wanted to compare the size of something with another object (‘that’s so big, it must be the size of three football fields!’), you’d probably use wielkości (the genitive form of wielkość, ‘size’). But wielkości also appears in contexts unrelated to this comparative, such as when it simply refers to the size of something. In addition, it can mean not only ‘size’ but also ‘greatness’. How could we limit the scope to only the comparative uses? In reviewing some cases, we might see that when wielkości is both preceded and followed by a noun, adjective, or pronoun, it is more likely to be a comparative. How could we code up this rule in spaCy?

First, we’ll need a Polish corpus to test this against. A subset of the NKJP (National Corpus of Polish) is available for download on their website: http://www.nkjp.pl/index.php?page=14.

Second, we’ll need to get the corpus into a usable form. I don’t want to spend too much time figuring out the details, so I’ll write an iterator that walks through the unpacked corpus directory and simply skips the places where this might fail.

from pathlib import Path
import xml.etree.ElementTree as etree


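def corpus_iter(d=Path('NKJP-PodkorpusMilionowy-1.2'), word='wielkości'):
    """Yield the text of each corpus paragraph that contains `word`."""
    for folder in d.iterdir():
        if folder.is_file():
            continue
        text_file = folder / 'text.xml'
        if not text_file.exists():
            continue  # skip folders without a text.xml
        root = etree.parse(str(text_file)).getroot()
        for element in root.iterfind('.//{http://www.tei-c.org/ns/1.0}ab'):
            text = element.text
            if text and word in text:  # element.text is None for empty elements
                yield text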
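To try the iterator idea without downloading the corpus, here is a minimal sketch that builds a toy directory in a temporary folder and scans it. The layout (one text.xml per subfolder, paragraphs in TEI ab elements) is assumed from the function above; the names iter_matching_paragraphs and make_toy_corpus, the folder names, and the sample paragraphs are made up for illustration and are not part of the NKJP.

```python
import tempfile
import xml.etree.ElementTree as etree
from pathlib import Path

TEI = 'http://www.tei-c.org/ns/1.0'


def iter_matching_paragraphs(corpus_dir, word):
    """Yield the text of every TEI <ab> element under corpus_dir containing word."""
    for folder in sorted(corpus_dir.iterdir()):
        text_file = folder / 'text.xml'
        if not text_file.is_file():
            continue  # plain files and folders without a text.xml are skipped
        try:
            root = etree.parse(str(text_file)).getroot()
        except etree.ParseError:
            continue  # skip files that are not well-formed XML
        for element in root.iterfind(f'.//{{{TEI}}}ab'):
            text = element.text
            if text and word in text:
                yield text


def make_toy_corpus(base):
    """Write two tiny made-up documents plus one broken file (hypothetical data)."""
    samples = {
        'doc_a': ['kopertę wielkości karty kredytowej', 'zdanie bez szukanego słowa'],
        'doc_b': ['nie zmienili wielkości mózgów'],
    }
    for name, paragraphs in samples.items():
        folder = base / name
        folder.mkdir()
        body = ''.join(f'<ab>{p}</ab>' for p in paragraphs)
        xml = f'<TEI xmlns="{TEI}"><text><body>{body}</body></text></TEI>'
        (folder / 'text.xml').write_text(xml, encoding='utf-8')
    (base / 'broken').mkdir()
    (base / 'broken' / 'text.xml').write_text('<TEI>', encoding='utf-8')


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    make_toy_corpus(base)
    hits = list(iter_matching_paragraphs(base, 'wielkości'))

for hit in hits:
    print(hit)
```

Only the two paragraphs containing wielkości come back; the unrelated paragraph and the malformed file are silently skipped, which is the "skip where it might fail" behaviour described above.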

Third, we’ll need to get a Polish model loaded into spaCy. I’ve selected pl_core_news_lg, which is, in fact, trained on the NKJP (among other sources). It can be installed with python -m spacy download pl_core_news_lg.

I used the corpus_iter function above to isolate a few examples. Here is one positive and one negative example (the first of each category I found while the corpus was still being unpacked…):

1. Positive example (note wielkości appearing between two nouns):

token         lemma       dep          head         pos
W             w           case         r            ADP
1789          1789        amod:flat    r            ADJ
r             rok         obl          znaleziono   X
.             .           punct        r            PUNCT
na            na          case         brzegu       ADP
brzegu        brzeg       obl          znaleziono   NOUN
rzeki         rzeka       nmod:arg     brzegu       NOUN
Lujana        Lujan       nmod         rzeki        PROPN
koło          koło        case         Buenos       ADP
Buenos        Buenos      nmod         rzeki        PROPN
Aires         Aires       flat         Buenos       PROPN
znaleziono    znaleźć     ROOT         znaleziono   VERB
kości         kość        obj          znaleziono   NOUN
tajemniczego  tajemniczy  amod         ssaka        ADJ
ssaka         ssak        nmod         kości        NOUN
wielkości     wielkość    nmod         ssaka        NOUN
słonia        słoń        nmod:arg     wielkości    NOUN
.             .           punct        znaleziono   PUNCT

2. Negative example (note wielkości appearing after a verb):

token         lemma       dep          head         pos
Przecież      przecież    advmod:emph  Pigmeje      PART
afrykańscy    afrykański  amod         Pigmeje      ADJ
Pigmeje       Pigmej      nsubj        zmienili     PROPN
,             ,           punct        zmniejszyli  PUNCT
choć          choć        mark         zmniejszyli  SCONJ
też           też         advmod:emph  zmniejszyli  PART
zmniejszyli   zmniejszyć  advcl        zmienili     VERB
rozmiary      rozmiar     obj          zmniejszyli  NOUN
swojego       swój        det:poss     ciała        DET
ciała         ciało       nmod         rozmiary     NOUN
,             ,           punct        zmniejszyli  PUNCT
nie           nie         advmod:neg   zmienili     PART
zmienili      zmienić     ROOT         zmienili     VERB
wielkości     wielkość    obj          zmienili     NOUN
mózgów        mózg        nmod:arg     wielkości    NOUN
.             .           punct        zmienili     PUNCT
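Listings like the two above can be read straight off a processed Doc. The following is a minimal sketch: token_rows is a hypothetical helper name, and to keep it runnable without downloading the model, the three-token fragment is annotated by hand with values copied from the positive example (the ROOT label on the first token is an artifact of cutting the fragment out of its sentence). With the real model you would pass nlp(text) instead.

```python
import spacy
from spacy.tokens import Doc


def token_rows(doc):
    """Return one (token, lemma, dep, head, pos) tuple per token in doc."""
    return [(t.text, t.lemma_, t.dep_, t.head.text, t.pos_) for t in doc]


# Hand-annotated stand-in for a parsed sentence (assumption: values copied
# from the positive example; normally they come from nlp(text)).
vocab = spacy.blank('pl').vocab
doc = Doc(
    vocab,
    words=['ssaka', 'wielkości', 'słonia'],
    lemmas=['ssak', 'wielkość', 'słoń'],
    pos=['NOUN', 'NOUN', 'NOUN'],
    heads=[0, 0, 1],  # absolute index of each token's head
    deps=['ROOT', 'nmod', 'nmod:arg'],
)

rows = token_rows(doc)
for row in rows:
    print(*row, sep='\t')
```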

Let’s use token-based matching in which we’ll look for the pattern of NOUN + wielkości + NOUN/ADJ/PRON. The available set of token attributes can be found here: https://spacy.io/usage/rule-based-matching/#adding-patterns-attributes.

We’ll create our first rule by using a pattern. A pattern is a list of dicts, where each dict describes one token that should be matched. For this first example, we’ll just look for the pattern NOUN+wielkości+NOUN. This can be represented by the pattern: [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}]. In the following code, we will add this pattern to a Matcher object, and then search our example sentence.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('pl_core_news_lg')
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}]
matcher.add('comparative', [pattern])  # give name 'comparative' to this list of patterns

text = 'W 1789 r. na brzegu rzeki Lujana koło Buenos Aires znaleziono kości tajemniczego ssaka wielkości słonia.'
doc = nlp(text)  # run pipeline
matches = matcher(doc)  # do matching
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # matched span
    sentence = str(span.sent)  # matched sentence
    print(string_id, start, end, span.text, sentence)

We don’t want to just match a NOUN at the end, but also allow a pronoun (PRON) or adjective (ADJ). There are two ways to add these. First, we could create separate patterns (this method is somewhat repetitive):

patterns = [  # list of different patterns
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'ADJ'}],
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'PRON'}],
]
matcher.add('comparative', patterns)

Second, we can use the IN operator, which expresses the same three alternatives in a single, more compact pattern.

patterns = [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
]
matcher.add('comparative', patterns)
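As a quick sanity check that the two formulations behave the same, here is a small sketch comparing them on hand-tagged fragments. The tags are assigned manually rather than predicted by pl_core_news_lg, the third fragment is a made-up phrase, and the names tagged_doc and spans are hypothetical helpers, so this only illustrates the matching logic, not the model's output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank('pl').vocab  # blank pipeline: no model download needed


def tagged_doc(words, pos):
    """Build a Doc with hand-assigned POS tags (stand-in for nlp(text))."""
    return Doc(vocab, words=words, pos=pos)


docs = [
    tagged_doc(['ssaka', 'wielkości', 'słonia'], ['NOUN', 'NOUN', 'NOUN']),
    tagged_doc(['zmienili', 'wielkości', 'mózgów'], ['VERB', 'NOUN', 'NOUN']),
    tagged_doc(['dom', 'Wielkości', 'małego'], ['NOUN', 'NOUN', 'ADJ']),  # made up
]

# Variant 1: one pattern per allowed final tag
separate = Matcher(vocab)
separate.add('comparative', [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'ADJ'}],
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'PRON'}],
])

# Variant 2: a single pattern using the IN operator
combined = Matcher(vocab)
combined.add('comparative', [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
])


def spans(matcher, doc):
    """Return the (start, end) token offsets of every match."""
    return [(start, end) for _, start, end in matcher(doc)]


results = [(spans(separate, d), spans(combined, d)) for d in docs]
for doc, (a, b) in zip(docs, results):
    print([t.text for t in doc], a, b)
```

Both matchers find the first and third fragments and reject the one where wielkości follows a verb. The capitalised Wielkości still matches because LOWER compares against the lowercased token text.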

Now that we have our pattern, let’s put it all together. We’ll load the model, add the patterns, and then iterate through our entire corpus using the corpus_iter function we created before.

nlp = spacy.load('pl_core_news_lg')
matcher = Matcher(nlp.vocab)
patterns = [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
]
matcher.add('comparative', patterns)
for doc in nlp.pipe(corpus_iter()):
    for match_id, start, end in matcher(doc):
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # matched span
        sentence = str(span.sent)  # matched sentence
        print(string_id, start, end, span.text, sentence)

This process yields 11 sentences, of which about half use wielkości in the intended comparative sense, including these gems:

apple pie wielkości Przylądka Kanagawa: ‘apple pie the size of Cape Kanagawa’
rząd wielkości kurna: ‘order of magnitude of a hen’
kopertę wielkości karty kredytowej: ‘envelope the size of a credit card’
planetę wielkości piłki do hurlati: ‘planet the size of a hurlati(?) ball’

Additional work could fine-tune these rules, but having to review a mere 11 cases instead of all 60 appearances of wielkości in the corpus is already a useful reduction. More importantly, we got to see the pattern-matching capabilities of spaCy in action.
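One possible direction for such fine-tuning, sketched under a loud assumption: if you decided that rząd wielkości (‘order of magnitude’) is a fixed expression rather than a size comparison, which is my assumption and not something the results above establish, the first token could be constrained with the NOT_IN operator on its lemma. The fragments below are hand-annotated (the lemmas and tags are assigned manually, not produced by the model), so this only illustrates the pattern syntax.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank('pl').vocab  # blank pipeline: no model download needed

refined = Matcher(vocab)
refined.add('comparative', [[
    # a noun whose lemma is not on the exclusion list (assumed list)
    {'POS': 'NOUN', 'LEMMA': {'NOT_IN': ['rząd']}},
    {'LOWER': 'wielkości'},
    {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}},
]])

# Hand-annotated fragments; the lemmas here are illustrative assumptions.
examples = [
    Doc(vocab, words=['kopertę', 'wielkości', 'karty'],
        lemmas=['koperta', 'wielkość', 'karta'], pos=['NOUN'] * 3),
    Doc(vocab, words=['rząd', 'wielkości', 'kurna'],
        lemmas=['rząd', 'wielkość', 'kurna'], pos=['NOUN'] * 3),
]

kept = [doc[start:end].text for doc in examples for _, start, end in refined(doc)]
print(kept)
```

Whether an exclusion list like this actually improves precision would need to be checked against the reviewed matches.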