spaCy provides a number of useful methods for exploring and creating patterns after a particular text or document has been read. To see this in action, let’s use spaCy to build some rules on the more computational-linguistic side of NLP. So, for those less interested in language, forgive a brief digression into Polish.
In Polish, if you wanted to compare the size of something with another object (‘that’s so big, it must be the size of three football fields!’), you’d probably use wielkości (the genitive form of wielkość, ‘size’). But wielkości can also be used in contexts that have nothing to do with this comparative, such as when it simply refers to the size of something. In addition, it can mean not just ‘size’ but also ‘greatness’. How could we limit the scope to only uses of the comparative? In reviewing some cases, we might see that when wielkości is both preceded by and followed by a noun, adjective, or pronoun, then it is more likely to be a comparative. How could we code up this rule in spaCy?
First, we’ll need a Polish corpus to test this against. A subset of the NKJP (National Corpus of Polish) is available for download on their website: http://www.nkjp.pl/index.php?page=14.
Second, we’ll need to get the corpus into a usable form. I don’t want to spend too much time figuring out the details, so I’ll write an iterator to go through the unpacked corpus directory. I’ll just skip the places where this might fail.
    from pathlib import Path
    import xml.etree.ElementTree as etree


    def corpus_iter(d=Path('NKJP-PodkorpusMilionowy-1.2'), word='wielkości'):
        for folder in d.iterdir():
            if folder.is_file():  # only sub-folders hold documents
                continue
            tree = etree.parse(str(folder / 'text.xml'))
            root = tree.getroot()
            # each TEI <ab> element holds one block of text
            for element in root.iterfind('.//{http://www.tei-c.org/ns/1.0}ab'):
                text = element.text
                # element.text is None for empty elements, so skip those
                if text and word in text:
                    yield text
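To make the "skip the places where this might fail" part concrete, here is a minimal sketch of a more defensive variant. The name `safe_corpus_iter`, the helper `build_toy_corpus`, and the tiny made-up corpus are my own assumptions for illustration; they only mimic the folder layout the iterator above expects (one sub-folder per document, each with a `text.xml`).

```python
import tempfile
import xml.etree.ElementTree as etree
from pathlib import Path

TEI_AB = './/{http://www.tei-c.org/ns/1.0}ab'  # same tag the iterator above searches for


def safe_corpus_iter(d, word='wielkości'):
    """Yield texts containing `word`, skipping anything unreadable (hypothetical helper)."""
    for folder in sorted(Path(d).iterdir()):
        if not folder.is_dir():
            continue
        try:
            root = etree.parse(str(folder / 'text.xml')).getroot()
        except (OSError, etree.ParseError):
            continue  # missing or malformed text.xml: skip this folder
        for element in root.iterfind(TEI_AB):
            text = element.text
            if text and word in text:  # element.text is None for empty elements
                yield text


def build_toy_corpus(root_dir):
    """Write a tiny made-up corpus using the assumed folder layout."""
    tei = ('<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
           '<ab>ssak wielkości słonia</ab><ab>inny akapit</ab><ab/>'
           '</div></body></text></TEI>')
    good = Path(root_dir) / 'doc1'
    good.mkdir()
    (good / 'text.xml').write_text(tei, encoding='utf-8')
    (Path(root_dir) / 'doc2').mkdir()  # folder without text.xml
    broken = Path(root_dir) / 'doc3'
    broken.mkdir()
    (broken / 'text.xml').write_text('<TEI>', encoding='utf-8')  # malformed XML
    (Path(root_dir) / 'README.txt').write_text('not a folder', encoding='utf-8')


with tempfile.TemporaryDirectory() as tmp:
    build_toy_corpus(tmp)
    hits = list(safe_corpus_iter(tmp))

print(hits)
```

Only the one paragraph containing the search word comes back; the folder without `text.xml`, the malformed file, and the stray README are passed over silently. Whether silent skipping is acceptable depends on how much of the corpus you can afford to lose.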
Third, we’ll need to get a Polish model loaded into spaCy. I’ve selected pl_core_news_lg, which is, in fact, trained on the NKJP (among other sources). It can be downloaded from the command line:

    python -m spacy download pl_core_news_lg
I used the corpus_iter function above to isolate a few examples. Here’s one positive and one negative example (the first of each category I found while the corpus was still being unpacked…):
- Positive example (note wielkości appearing between two nouns):
    token         lemma       dep        head        pos
    W             w           case       r           ADP
    1789          1789        amod:flat  r           ADJ
    r             rok         obl        znaleziono  X
    .             .           punct      r           PUNCT
    na            na          case       brzegu      ADP
    brzegu        brzeg       obl        znaleziono  NOUN
    rzeki         rzeka       nmod:arg   brzegu      NOUN
    Lujana        Lujan       nmod       rzeki       PROPN
    koło          koło        case       Buenos      ADP
    Buenos        Buenos      nmod       rzeki       PROPN
    Aires         Aires       flat       Buenos      PROPN
    znaleziono    znaleźć     ROOT       znaleziono  VERB
    kości         kość        obj        znaleziono  NOUN
    tajemniczego  tajemniczy  amod       ssaka       ADJ
    ssaka         ssak        nmod       kości       NOUN
    wielkości     wielkość    nmod       ssaka       NOUN
    słonia        słoń        nmod:arg   wielkości   NOUN
    .             .           punct      znaleziono  PUNCT
- Negative example (note wielkości appearing after a verb):
    token        lemma       dep          head         pos
    Przecież     przecież    advmod:emph  Pigmeje      PART
    afrykańscy   afrykański  amod         Pigmeje      ADJ
    Pigmeje      Pigmej      nsubj        zmienili     PROPN
    ,            ,           punct        zmniejszyli  PUNCT
    choć         choć        mark         zmniejszyli  SCONJ
    też          też         advmod:emph  zmniejszyli  PART
    zmniejszyli  zmniejszyć  advcl        zmienili     VERB
    rozmiary     rozmiar     obj          zmniejszyli  NOUN
    swojego      swój        det:poss     ciała        DET
    ciała        ciało       nmod         rozmiary     NOUN
    ,            ,           punct        zmniejszyli  PUNCT
    nie          nie         advmod:neg   zmienili     PART
    zmienili     zmienić     ROOT         zmienili     VERB
    wielkości    wielkość    obj          zmienili     NOUN
    mózgów       mózg        nmod:arg     wielkości    NOUN
    .            .           punct        zmienili     PUNCT
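Tables like the two above can be read straight off a spaCy `Doc`. The sketch below shows one way to do it. To keep it runnable without downloading the model, it builds a short `Doc` by hand from a fragment of the positive example; the annotations are copied from the table, except that kości is made the root because its real head (znaleziono) lies outside the fragment. The helper names `token_rows` and `format_table` are mine. With the real model you would pass `nlp(text)` instead of the hand-built `doc`.

```python
import spacy
from spacy.tokens import Doc

# A blank Polish pipeline only supplies a vocabulary; no trained model is needed.
vocab = spacy.blank('pl').vocab

# Hand-annotated fragment (assumption: values copied from the table above).
doc = Doc(
    vocab,
    words=['kości', 'tajemniczego', 'ssaka', 'wielkości', 'słonia'],
    lemmas=['kość', 'tajemniczy', 'ssak', 'wielkość', 'słoń'],
    pos=['NOUN', 'ADJ', 'NOUN', 'NOUN', 'NOUN'],
    deps=['ROOT', 'amod', 'nmod', 'nmod', 'nmod:arg'],
    heads=[0, 2, 0, 2, 3],  # absolute index of each token's head
)


def token_rows(doc):
    """One (token, lemma, dep, head, pos) tuple per token."""
    return [(t.text, t.lemma_, t.dep_, t.head.text, t.pos_) for t in doc]


def format_table(rows, header=('token', 'lemma', 'dep', 'head', 'pos')):
    """Left-align the columns so the table is readable as plain text."""
    table = [header, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return '\n'.join(
        '  '.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


rows = token_rows(doc)
print(format_table(rows))
```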
Let’s use token-based matching in which we’ll look for the pattern of NOUN + wielkości + NOUN/ADJ/PRON. The available set of token attributes can be found here: https://spacy.io/usage/rule-based-matching/#adding-patterns-attributes.
We’ll create our first rule by using a pattern. The pattern is a list of dicts, where each dict defines the type of token that should be matched. For this first example, we’ll just look for the pattern NOUN+wielkości+NOUN. This can be represented by the pattern: [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}]. In the following code, we will add this pattern to a Matcher object, and then search our example sentence.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('pl_core_news_lg')
    matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}]
    matcher.add('comparative', [pattern])  # give name 'comparative' to this list of patterns

    text = 'W 1789 r. na brzegu rzeki Lujana koło Buenos Aires znaleziono kości tajemniczego ssaka wielkości słonia.'
    doc = nlp(text)  # run pipeline
    matches = matcher(doc)  # do matching
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # matched span
        sentence = str(span.sent)  # matched sentence
        print(string_id, start, end, span.text, sentence)
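To see that this pattern separates the two examples from earlier, here is a small sketch that runs it on the three tokens around wielkości in each. It is a toy check under stated assumptions, not a replacement for running the real pipeline: the part-of-speech tags are assigned by hand (copied from the tables above) so that nothing needs to be downloaded, and `matched_texts` is a helper name of my own.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank Polish pipeline: gives us a vocab (including lowercasing for LOWER) without a model.
vocab = spacy.blank('pl').vocab

matcher = Matcher(vocab)
matcher.add('comparative', [[{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}]])

# Fragments of the positive and negative examples, with hand-assigned tags.
positive = Doc(vocab, words=['ssaka', 'wielkości', 'słonia'],
               pos=['NOUN', 'NOUN', 'NOUN'])
negative = Doc(vocab, words=['zmienili', 'wielkości', 'mózgów'],
               pos=['VERB', 'NOUN', 'NOUN'])


def matched_texts(doc):
    """Text of every span the matcher finds in `doc`."""
    return [doc[start:end].text for _, start, end in matcher(doc)]


positive_hits = matched_texts(positive)
negative_hits = matched_texts(negative)
print(positive_hits)  # the NOUN + wielkości + NOUN span
print(negative_hits)  # empty: wielkości is preceded by a VERB
```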
We don’t want to just match a NOUN at the end, but also allow a pronoun (PRON) or adjective (ADJ). There are two ways to add these. First, we could create separate patterns (this method is somewhat repetitive):
    patterns = [  # list of different patterns
        [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'ADJ'}],
        [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': 'PRON'}],
    ]
    matcher.add('comparative', patterns)
We can also use the IN operator, which is more concise and probably a bit quicker.
    patterns = [
        [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
    ]
    matcher.add('comparative', patterns)
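As a sanity check that the two formulations are interchangeable, the sketch below builds one matcher from the three separate patterns and one from the single IN pattern, then compares what they find. The sample docs are toy data: the words come from the examples above, but the tags are assigned by hand, and the second sample's final tag is deliberately altered to ADJ purely to exercise that branch.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank('pl').vocab  # no trained model needed for this toy check

# Matcher 1: one pattern per allowed final part of speech.
separate = Matcher(vocab)
separate.add('comparative', [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': pos}]
    for pos in ('NOUN', 'ADJ', 'PRON')
])

# Matcher 2: a single pattern using the IN operator.
combined = Matcher(vocab)
combined.add('comparative', [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
])

samples = [
    (['ssaka', 'wielkości', 'słonia'], ['NOUN', 'NOUN', 'NOUN']),
    (['ssaka', 'wielkości', 'słonia'], ['NOUN', 'NOUN', 'ADJ']),   # tag altered for illustration
    (['zmienili', 'wielkości', 'mózgów'], ['VERB', 'NOUN', 'NOUN']),
]


def spans(matcher, doc):
    """Sorted (start, end) offsets of all matches."""
    return sorted((start, end) for _, start, end in matcher(doc))


results = []
for words, pos in samples:
    doc = Doc(vocab, words=words, pos=pos)
    results.append((spans(separate, doc), spans(combined, doc)))

for separate_hits, combined_hits in results:
    print(separate_hits, combined_hits)
```

Both matchers agree on every sample here, so the choice between them is mostly one of readability.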
Now that we have our pattern, let’s put it all together. We’ll load the model, add the patterns, and then iterate through our entire corpus using the corpus_iter function we created before.
    nlp = spacy.load('pl_core_news_lg')
    matcher = Matcher(nlp.vocab)
    patterns = [
        [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
    ]
    matcher.add('comparative', patterns)

    for doc in nlp.pipe(corpus_iter()):
        for match_id, start, end in matcher(doc):
            string_id = nlp.vocab.strings[match_id]  # Get string representation
            span = doc[start:end]  # matched span
            sentence = str(span.sent)  # matched sentence
            print(string_id, start, end, span.text, sentence)
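If you are on spaCy v3, the matcher can also hand back `Span` objects directly via `as_spans=True`, which removes the manual lookup of the match name. The sketch below is a variant of the inner loop under that assumption; it again uses a hand-tagged fragment rather than the real model, so `span.sent` is left out (the toy doc has no sentence boundaries).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank('pl').vocab  # stand-in for nlp.vocab of the full model

matcher = Matcher(vocab)
matcher.add('comparative', [
    [{'POS': 'NOUN'}, {'LOWER': 'wielkości'}, {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}}],
])

# Hand-tagged fragment of the positive example (tags copied from the table).
doc = Doc(
    vocab,
    words=['kości', 'tajemniczego', 'ssaka', 'wielkości', 'słonia', '.'],
    pos=['NOUN', 'ADJ', 'NOUN', 'NOUN', 'NOUN', 'PUNCT'],
)

# With as_spans=True each match is a Span whose label_ is the pattern name.
hits = [
    (span.label_, span.start, span.end, span.text)
    for span in matcher(doc, as_spans=True)
]
print(hits)
```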
This process yields 11 sentences, about half of which use wielkości in the comparative sense, including these gems:
- apple pie wielkości Przylądka Kanagawa: ‘apple pie the size of Cape Kanagawa’
- rząd wielkości kurna: ‘order of magnitude of a hen’
- kopertę wielkości karty kredytowej: ‘envelope the size of a credit card’
- planetę wielkości piłki do hurlati: ‘planet the size of a hurlati(?) ball’
Additional work could fine-tune these rules, though having to review a mere 11 cases, instead of all 60 appearances of wielkości in the corpus, is already a real saving. More importantly, we got to see the pattern matching capabilities of spaCy in action.
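As one example of what such fine-tuning might look like, here is a sketch that excludes the fixed phrase rząd wielkości (‘order of magnitude’) by constraining the lemma of the first token with NOT_IN. Treating that phrase as a false positive is my assumption, not something established above, and the lemmas and tags in the toy docs are assigned by hand.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = spacy.blank('pl').vocab  # no trained model needed for this toy check

matcher = Matcher(vocab)
matcher.add('comparative', [[
    # Hypothetical refinement: a noun whose lemma is not 'rząd'.
    {'POS': 'NOUN', 'LEMMA': {'NOT_IN': ['rząd']}},
    {'LOWER': 'wielkości'},
    {'POS': {'IN': ['NOUN', 'ADJ', 'PRON']}},
]])

# Hand-annotated fragments of two of the results listed above.
idiom = Doc(
    vocab,
    words=['rząd', 'wielkości', 'kurna'],
    lemmas=['rząd', 'wielkość', 'kurna'],
    pos=['NOUN', 'NOUN', 'NOUN'],
)
comparison = Doc(
    vocab,
    words=['kopertę', 'wielkości', 'karty'],
    lemmas=['koperta', 'wielkość', 'karta'],
    pos=['NOUN', 'NOUN', 'NOUN'],
)


def matched_texts(doc):
    """Text of every span the matcher finds in `doc`."""
    return [doc[start:end].text for _, start, end in matcher(doc)]


idiom_hits = matched_texts(idiom)
comparison_hits = matched_texts(comparison)
print(idiom_hits)
print(comparison_hits)
```

The same idea extends to a longer exclusion list, though each added lemma should be checked against real corpus hits before it goes in.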