I learned much of my natural language processing using Python’s `nltk` library which, coupled with the nltk book (https://www.nltk.org/book/), provides a great introduction to the topic. When I hit industry, however, I never really found a use for it, nor could I motivate myself to learn the intricacies of creating a corpus from my own dataset. Many of the functions could be obtained from other sources (e.g., `scikit-learn`) or hand-coded (e.g., ngrams). I’m certainly not doing the library justice, but it seemed as though `nltk` required an ecosystem that I wasn’t quite committed to.
The spaCy ecosystem, in contrast, required very little investment to get started. First, install the package and download a language pack.
```bash
pip install spacy
python -m spacy download en_core_web_sm
```
- If you’re on Windows and haven’t yet installed the Visual Studio Build Tools, that’s probably a prerequisite.
- `en_core_web_sm` will download an English (`en`) model trained on web (`web`) data, in its small (`sm`) variant, which presumably means less training data or fewer components than the larger models. Find a list of languages/models here: https://spacy.io/models.
- If for whatever reason you can’t download with the `spacy` command (and do have spaCy installed):
  - Manually download the wheel file (e.g., from https://spacy.io/models/fi) and use the ‘Download Link’.
  - Run `pip install <model>`, e.g., `pip install fi_core_news_lg-3.5.0-py3-none-any.whl`.
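Either way, a quick sanity check (a minimal snippet, assuming the small English model from above) confirms the model loads and shows which pipeline components it ships with:

```python
import spacy

# Quick sanity check: load the model and list its pipeline components.
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)  # e.g., ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```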
Now you’re ready to go with spaCy, and whatever you want to do with the language, you can always start with the same pattern:
- Read your target text as a Python `str`.
- Run the spaCy pipeline/model.
- Do something with the results.
spaCy doesn’t care how your data is stored (files, database, CSV, etc.); just build your own custom method for reading in the text using `pathlib`, `sqlalchemy`, `csv`, etc. Then run that extracted text through the spaCy pipeline (we’ll just use the default) before determining how to use the data. Let’s look at some code:
```python
import spacy

text = 'Colorless green ideas are sleeping furiously.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

print(f'{"token":15}\t{"lemma":15}\t{"dep":15}\t{"head":15}\t{"pos":15}\t{"tag":15}')
for token in doc:  # iterate through each found token
    print(f'{str(token):15}\t{token.lemma_:15}\t{token.dep_:15}\t{str(token.head):15}\t{token.pos_:15}\t{token.tag_:15}')
```
In the printed output, you can see a number of the linguistic elements determined for each token in the input text. These include the original token (`token` column), the lemma (i.e., the base form of the word), its role in a dependency parse (`dep` and `head`), the part of speech (`pos`), etc. Each of these can be used in developing an algorithm. For example, to build ngrams, we might choose to omit punctuation tokens (e.g., `dep`/`pos` = ‘punct’) and retain the lemmatized form; a small sketch of that idea follows the output below. (This is, e.g., done in the `spacy-ngram` library.)
```
token           lemma           dep             head            pos             tag
Colorless       Colorless       nmod            ideas           PROPN           NNP
green           green           amod            ideas           ADJ             JJ
ideas           idea            nsubj           sleeping        NOUN            NNS
are             be              aux             sleeping        AUX             VBP
sleeping        sleep           ROOT            sleeping        VERB            VBG
furiously       furiously       advmod          sleeping        ADV             RB
.               .               punct           sleeping        PUNCT           .
```
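As promised, here is a minimal sketch of that ngram idea (not the `spacy-ngram` implementation itself): drop punctuation, keep lemmas, and zip adjacent tokens into bigrams. It reuses the `doc` object from the snippet above.

```python
# Minimal sketch: bigrams over lemmas, skipping punctuation tokens.
def lemma_bigrams(doc):
    lemmas = [token.lemma_.lower() for token in doc if token.pos_ != 'PUNCT']
    return list(zip(lemmas, lemmas[1:]))

print(lemma_bigrams(doc))
# [('colorless', 'green'), ('green', 'idea'), ('idea', 'be'), ('be', 'sleep'), ('sleep', 'furiously')]
```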
So far, the example isn’t written in a generalizable form: the text is a hard-coded string. To apply this to our corpus, we can imagine using some CSV file with the text stored in a ‘text’ column and the other columns containing some sort of metadata. We’ll take the same pattern from above, though we’ll process sentence by sentence (rather than ignoring sentence boundaries).
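For concreteness, here is a small, entirely hypothetical `corpus.csv` generator to run the snippets below against; only the ‘text’ column is actually required, and the ‘id’ and ‘source’ columns are made-up metadata.

```python
import csv

# Build a tiny, made-up corpus.csv; only the 'text' column is required below.
rows = [
    {'id': 'doc1', 'source': 'example', 'text': 'Colorless green ideas are sleeping furiously.'},
    {'id': 'doc2', 'source': 'example', 'text': 'The cat sat on the mat. It purred loudly.'},
]
with open('corpus.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=['id', 'source', 'text'])
    writer.writeheader()
    writer.writerows(rows)
```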
Now we can create the corpus reader. Note that I’m returning more than just the text, as we may need the other metadata to determine the source.
```python
import csv

def read_corpus_from_csv(file):
    with open(file) as fh:
        for row in csv.DictReader(fh):
            text = row.pop('text', '')
            yield text, row  # text: str, row: dict with the remaining metadata
```
And then we modify our code in a few ways to efficiently handle this corpus:
- Provide an output CSV file for the created tokens.
- Use `doc.sents` to iterate through the sentences.
- Use `nlp.pipe` to process the elements of our corpus more efficiently.
```python
import csv
from pathlib import Path

import spacy

nlp = spacy.load('en_core_web_sm')
path = Path('corpus.csv')
outpath = Path('corpus_out.csv')

with open(outpath, 'w', newline='') as fh:
    writer = csv.writer(fh)
    # write header row
    writer.writerow(['token', 'lemma', 'dep', 'head', 'pos', 'tag'])
    for i, (doc, context) in enumerate(nlp.pipe(read_corpus_from_csv(path), as_tuples=True)):
        for sentence_num, sentence in enumerate(doc.sents):
            for token in sentence:
                writer.writerow([str(token), token.lemma_, token.dep_, str(token.head),
                                 token.pos_, token.tag_])
```
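To make each row traceable back to its source, we could also carry a document id and sentence number through to the output. The following is only a sketch: it assumes the corpus CSV has an ‘id’ column (which arrives in the `context` dict) and reuses `nlp`, `path`, and `read_corpus_from_csv` from above.

```python
# Sketch: same loop as above, but each row also records which document and
# sentence the token came from (assumes a hypothetical 'id' column in corpus.csv).
with open('corpus_out_with_ids.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['doc_id', 'sentence_num', 'token', 'lemma', 'dep', 'head', 'pos', 'tag'])
    for i, (doc, context) in enumerate(nlp.pipe(read_corpus_from_csv(path), as_tuples=True)):
        doc_id = context.get('id', i)  # fall back to the document's position in the corpus
        for sentence_num, sentence in enumerate(doc.sents):
            for token in sentence:
                writer.writerow([doc_id, sentence_num, str(token), token.lemma_, token.dep_,
                                 str(token.head), token.pos_, token.tag_])
```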
In a real-world application, we’d probably want to manipulate these variables in some way, or at least include the document id and sentence number (as in the sketch above) so that subsequent processes can tell which bit of text produced which output. That aside, this is all it takes to get started with spaCy, though it is the wardrobe door that leads to a whole new world…