By default, spaCy carries around a powerful battery of pipelines and swings these mighty chainsaws at every passing tree and twig. Sometimes, however, you only want a small pruner to accomplish some smaller task. Can spaCy still work in such a use case? For example, suppose that all I want from spaCy are my documents chunked into sentences.
By default, the spaCy pipeline includes the following components (n.b., this post will focus on English, and details may vary based on the choice of language and model):
tok2vec
: looks up a word vector for each token

tagger
: part-of-speech tagger

parser
: dependency parser

attribute_ruler
: rule-based token attribute assignment (especially for corrections; see docs)

lemmatizer
: look up lemmas (cf. ‘root word’) for words

ner
: do named-entity recognition (stored in Doc.ents)
Each of these components depends on certain predecessors, as summarized in the following schematic.

[schematic: dependencies between the default pipeline components]

The schematic should be considered a guideline, so individual models or selections (e.g., there are different lemmatizers one might select) might have different sets of dependencies. Perhaps one of the simplest ways to figure out what works together is by trial and error (which I'll show below).
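If you'd rather not rely purely on trial and error, spaCy can report what each component requires and assigns. Here's a minimal sketch using nlp.analyze_pipes; with pretty=True it prints a summary table and flags any unmet dependencies in the current pipeline:

import spacy

nlp = spacy.load('en_core_web_sm')
# summarize what each component requires/assigns and warn about unmet dependencies
analysis = nlp.analyze_pipes(pretty=True)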
Disabling default components or enabling non-default ones can usually be done when loading the model itself. Let's look at a couple of examples to see the impact.
import spacy

nlp1 = spacy.load('en_core_web_sm')  # default pipeline
nlp2 = spacy.load('en_core_web_sm', disable=('tagger', 'attribute_ruler', 'lemmatizer', 'ner'))  # just keep `tok2vec` and `parser`
# the following is the same as nlp2, but might be preferred by enablers
nlp3 = spacy.load('en_core_web_sm', enable=('tok2vec', 'parser'))

# Compare pipeline components
print([x[0] for x in nlp1.pipeline])  # ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print([x[0] for x in nlp2.pipeline])  # ['tok2vec', 'parser']

text = 'Hanna is at the store.'
doc1 = nlp1(text)
doc2 = nlp2(text)

# Compare entities (`ner`)
print(doc1.ents)  # ('Hanna',)
print(doc2.ents)  # ()

# Compare part-of-speech (`pos`)
print(doc1[0].pos_)  # 'PROPN'
print(doc2[0].pos_)  # ''
Pipelines can also be altered after the model is loaded using the enable_pipe and disable_pipe methods. enable_pipe will add the named component to the pipeline, while disable_pipe will remove it. (There is also an add_pipe method, which specializes in adding a pipeline component that is not pre-configured.) In the following snippet, I'll start with the parser in the pipeline, remove it, and then add it back.
nlp = spacy.load('en_core_web_sm', enable=('tok2vec', 'parser'))
print([x[0] for x in nlp.pipeline])  # ['tok2vec', 'parser']

nlp.disable_pipe('parser')  # remove `parser`
print([x[0] for x in nlp.pipeline])  # ['tok2vec']

nlp.enable_pipe('parser')  # add `parser` back
print([x[0] for x in nlp.pipeline])  # ['tok2vec', 'parser']
This can be useful for adding your own custom pipeline components; a short sketch follows below. Much of this functionality is also available from configuration files, but we'll stick to behavior after launching Python for this post.
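As a quick, illustrative sketch of add_pipe (the component name stats_logger and its behavior are made up for this example, not part of spaCy):

import spacy
from spacy.language import Language

@Language.component('stats_logger')  # hypothetical component name
def stats_logger(doc):
    # a stateless component receives the Doc, may modify it, and must return it
    print(f'processed {len(doc)} tokens')
    return doc

nlp = spacy.load('en_core_web_sm', enable=('tok2vec', 'parser'))
nlp.add_pipe('stats_logger', last=True)  # append the custom component
print([x[0] for x in nlp.pipeline])  # ['tok2vec', 'parser', 'stats_logger']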
Now that we can customize our model pipeline, we can turn to our original quandary: how to use spaCy just to split sentences? There are actually a few different solutions.
First, let's consider our baseline (nlp1), in which we just follow the default and load all the components. We'll see how to get sentences for a single piece of text, as well as how to do the same across a corpus.
nlp1 = spacy.load('en_core_web_sm')

# for a single text
sentences = list(nlp1(text).sents)

# across a corpus, where `corpus_iter` is a generator yielding strings of text
sentences = []
for doc in nlp1.pipe(corpus_iter()):
    for sentence in doc.sents:
        sentences.append(sentence)
But wait, do we need all the components? We need to identify which component is responsible for sentence segmentation, and then trace through its dependencies. In this case, the guilty component is parser, but parser depends on tok2vec. We can thus keep just the tok2vec and parser components. (While you can technically eliminate tok2vec as well, this has negative effects…)
nlp2 = spacy.load('en_core_web_sm', enable=('tok2vec', 'parser'))

# for a single text
sentences = list(nlp2(text).sents)

# across a corpus, where `corpus_iter` is a generator yielding strings of text
sentences = []
for doc in nlp2.pipe(corpus_iter()):
    for sentence in doc.sents:
        sentences.append(sentence)
nlp2 will have exactly the same results as nlp1, though without having to run several of the components, thereby saving some processing time.
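As a quick sanity check (reusing the text variable from the earlier snippet), you can confirm that both pipelines yield the same sentence boundaries:

sents1 = [s.text for s in nlp1(text).sents]
sents2 = [s.text for s in nlp2(text).sents]
print(sents1 == sents2)  # True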
A third approach is to use the senter component alone. This is effective if you don't need the dependency parse (i.e., the parser component), since senter can run on its own. (The sentencizer component will also work, but senter is more accurate than sentencizer.)
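If a purely rule-based splitter is good enough for your data, a sentencizer-only pipeline can also be built from a blank model. This is just a minimal sketch (the variable name nlp_rules is mine), not the approach used in the rest of this post:

import spacy

nlp_rules = spacy.blank('en')      # tokenizer only, no trained components
nlp_rules.add_pipe('sentencizer')  # rule-based splitting on sentence-final punctuation
doc = nlp_rules('Hanna is at the store. She is buying milk.')
print([s.text for s in doc.sents])  # ['Hanna is at the store.', 'She is buying milk.']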
For the moment, adding senter directly to the enable parameter list won't work (see this issue), so we'll need to use our knowledge of enable_pipe to add it.
nlp3 = spacy.load('en_core_web_sm', enable=('tok2vec',))
nlp3.disable_pipe('tok2vec')  # not actually required, but easier than disabling everything
nlp3.enable_pipe('senter')

# for a single text
sentences = list(nlp3(text).sents)

# across a corpus, where `corpus_iter` is a generator yielding strings of text
sentences = []
for doc in nlp3.pipe(corpus_iter()):
    for sentence in doc.sents:
        sentences.append(sentence)

len(sentences), sentences
This third method will not behave exactly as the previous two did, since it relies on a different sentence-splitting algorithm.
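Since the algorithms differ, it may be worth spot-checking both on a sample of your own text before committing to one; the sample sentence below is just an illustration:

sample = 'Dr. Smith went to Washington. He arrived at 3 p.m. and left at 5.'
print([s.text for s in nlp2(sample).sents])  # parser-based boundaries
print([s.text for s in nlp3(sample).sents])  # senter-based boundaries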
Does the third method actually take less time? It might be worth experimenting on a subset of your own data. For me, using a random set of 5,000 records (on a crappy laptop), timeit gave these results:
# function
def ssplit(nlp):
    for doc in nlp.pipe(corpus_iter()):
        len(list(doc.sents))

ssplit(nlp1)
# 6min 28s ± 3.71 s per loop (mean ± std. dev. of 2 runs, 1 loop each)

ssplit(nlp2)
# 2min 55s ± 5.46 s per loop (mean ± std. dev. of 2 runs, 1 loop each)

ssplit(nlp3)
# 31.4 s ± 1.18 s per loop (mean ± std. dev. of 2 runs, 1 loop each)

ssplit(nlp4)  # nlp3 without disabling `tok2vec`
# 2min 34s ± 2.88 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
So it does appear to speed things up considerably. Note that it is important to use nlp.pipe when processing many texts, since it streams documents through the pipeline in batches rather than running a separate call per text.
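One last hedged sketch: nlp.pipe also accepts batching options such as batch_size and n_process, which may be worth tuning for large corpora (the values below are arbitrary, not recommendations):

sentences = []
for doc in nlp3.pipe(corpus_iter(), batch_size=256, n_process=1):
    sentences.extend(doc.sents)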