Healthcare text can be challenging to work with. The transformations, simplifications, and shortcuts taken to store this data for secondary use (e.g., research) create major problems for its ultimate use. These upstream failures might strip spaces (thereby causing run-together words), remove other formatting characters (e.g., newlines and tabs), and collapse what were once pretty-looking tables into a disordered jumble. The run-together words might, e.g., resemble something like: ‘The 27-year old patient presents today withsob and wheezing.’ Here, with and sob (‘shortness of breath’) have been run together. If we are looking for language like ‘shortness of breath’ in a note, this example is unlikely to get picked up, even if we’ve identified the acronym ‘sob’.
Additionally, the content of the notes themselves presents certain challenges: the use of broad (and often difficult-to-spell) medical terminology, medical abbreviations, jargon, and succinct phrasing makes it difficult to perform any level of spell correction. Is the encountered word misspelled? Or just an abbreviation not yet added to the vocabulary? Here, I want to explore some solutions to this challenge.
Defining the Tasks
First, it’s worth noting that these are, in fact, two independent tasks: spell correction and text segmentation. Spell correction focuses on identifying a misspelled (i.e., out-of-vocabulary) word and proposing what the intended word was. Text segmentation (more often a task in CJK languages), in contrast, seeks to ‘fix’ out-of-vocabulary sequences by identifying the words within them. There is also a third task lurking in these depths: run-together words in which one (or both) of the joined words are misspelled.
Second, both (or, all three) of these tasks are unsolved (i.e., there doesn’t appear to be an optimal solution, as you may notice when autocorrect ‘fixes’ your texts, emails, or documents). For word segmentation, we might ask how ‘areusable’ should be split: a reusable [bag], or [both paper airplanes] are usable? For spell correction, we might consider ‘sobe’: did the author intend the beverage company? Misspell ‘sober’? Or misspell ‘sob’ (having intended to describe crying, symptoms, or insults)?
Third, the solutions to these tasks, quite naturally, require an underlying vocabulary: a set of accepted words which will differ by domain. Further, they might require some understanding of frequency to know which of several possible corrections to prefer. We might use frequencies (and/or context) to determine whether ‘sober’, ‘sob’, or ‘sobe’ is the most likely. Context might help us predict the most likely next word in the sequence ‘He has been ___.’ (Here, a solution like BERT might prove useful, though, again, the underlying vocabulary should be controlled: BERT trained on newspaper articles might prefer words that rarely appear in healthcare text.)
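To make the BERT idea concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the model name and sentence are purely illustrative, and a general-domain model like bert-base-uncased may well prefer non-clinical completions:
from transformers import pipeline
# Masked language model trained on general-domain text (note: likely a poor fit for clinical notes).
unmasker = pipeline('fill-mask', model='bert-base-uncased')
# Show the top three candidates for the blank, with their scores.
for prediction in unmasker('He has been [MASK].')[:3]:
    print(prediction['token_str'], round(prediction['score'], 3))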
A Typical Approach
As alluded to before, an ideal, joint algorithm might follow these steps (a rough sketch in code follows the list):
- Identify an out-of-vocabulary term (or low frequency term).
- Select the most likely replacement based on:
  - Overall word frequency,
  - ‘Shortest path’ (e.g., Viterbi),
  - Context,
  - Other?
- (Optional) Give a user the option to select one of these (e.g., if there is a certain amount of uncertainty/ambiguity).
- It’s interesting how certain services (e.g., MS Word, text messaging apps, etc.) autocorrect certain perceived typos while just providing a red squiggle in other cases (alerting the user to confirm the service’s diagnosis).
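Here is a minimal sketch of the first two steps in the spirit of Norvig’s spelling corrector; the vocabulary and frequencies are invented for illustration, and context and user interaction are left out:
from collections import Counter
# Hypothetical corpus-derived frequencies; in practice, build these from your own notes.
vocab = Counter({'with': 900, 'patient': 500, 'sob': 40, 'wheezing': 25})
def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
def correct(word):
    """Step 1: flag out-of-vocabulary words; step 2: pick the most frequent in-vocabulary candidate."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    return max(candidates, key=vocab.get) if candidates else word
correct('whezing')
# 'wheezing'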
Solutions
Rather than building a solution from scratch, it’s often best to reach for off-the-shelf (and often free) solutions before setting to work on re-inventing the wheel. Let’s explore autocorrect (for spell correction) and wordninja (for word segmentation).
The interface for both of these tools is well-documented and relatively simple:
import wordninja as wn
from autocorrect import Speller
speller = Speller()
speller('What iz nott speled rigt?')
# 'What iz not speed right?'
wn.split('The 27-year old patientpresentstoday withsob.')
# ['The', '27', 'year', 'old', 'patient', 'presents', 'today', 'with', 'sob']
We can already see some of the issues with the spell correction: autocorrect prefers speed (deletion) to spelled (insertion), but right (insertion) to rig (deletion). This is because autocorrect relies on word frequency. Here’s a snippet showing that ‘speed’ is more than 10x as frequent as ‘spelled’:
speller.nlp_data['spelled'], speller.nlp_data['speed']
# (51604, 584152)
We can update these frequencies, and force our preferred solution:
speller.nlp_data['spelled'] = speller.nlp_data['speed'] + 1
speller('What iz not speled rigt?')
# 'What iz not spelled right?'
This means that we could augment the dictionary to include our own preferred frequencies (and vocabulary). The default dictionary does not even contain the word ‘anaphylaxis’:
# By default, there is no 'anaphylaxis':
speller.nlp_data['anaphylaxis'] # no frequency
# KeyError
speller('Patient has anaphilaxis.') # no change
# 'Patient has anaphilaxis.'
# Let's add anaphylaxis to the data:
speller.nlp_data['anaphylaxis'] = 1
speller.nlp_data['anaphylaxis']
# 1
speller('Patient has anaphilaxis.') # now corrected
# 'Patient has anaphylaxis.'
It is also worth noting that since autocorrect relies on unigram frequencies, it does not take context into account. autocorrect could, in theory, be expanded to ask the question: given the prior 3 words (i.e., a trigram), what is the most likely next word (i.e., a 4-gram)? But data sparseness and computational complexity will come into play. One implementation that attempts to handle this is precise_spelling_corrector, but it suffers from computational complexity (not to mention that it focuses on spell correcting only target words, rather than any word in the vocabulary). Thus, if you have an NLP solution that only cares about 100 words, why not just spell correct those?
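As a rough illustration of what that context-aware extension might look like (the 4-gram counts below are invented; a real implementation would need smoothing and far more data):
# Hypothetical 4-gram counts harvested from a corpus.
fourgram_counts = {
    ('he', 'has', 'been', 'sober'): 30,
    ('he', 'has', 'been', 'sobbing'): 5,
}
def rank_by_context(prior_words, candidates):
    """Prefer the candidate correction that most often follows the prior three words."""
    return max(candidates, key=lambda w: fourgram_counts.get(tuple(prior_words) + (w,), 0))
rank_by_context(['he', 'has', 'been'], ['sober', 'sobbing', 'sobe'])
# 'sober'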
We could modify autocorrect to focus only on this ‘precise’ spelling correction of target words by limiting its dictionary:
speller.nlp_data = {
    'wheezing': 1,
    'anaphylaxis': 1,
    'sob': 1,
    'patient': 1,
}
speller('Patent has anaphilaxis. Presents with sb and whezing. Is nott that rigt?')
# 'Patient has anaphylaxis. Presents with sob and wheezing. Is nott that rigt?'
Note that in the above text, nott and rigt no longer get corrected since they are not in the dictionary. Also, with its default dictionary, autocorrect prefers to spell correct whezing to whaling rather than wheezing.
A word of caution as well: if you are supplying your own dictionary based on your existing text corpora, it may be worth requiring a minimum number of appearances before a word is included, to avoid picking up misspellings. In addition, you can fill in potentially missing words by ensuring that an existing medical vocabulary (e.g., MedDRA or the UMLS) is fully contained; if a term is missing, set its value to 1.
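Here is a minimal sketch of that workflow; notes and medical_terms are stand-ins for your own corpus and a vocabulary such as MedDRA or the UMLS:
import re
from collections import Counter
from autocorrect import Speller
# Stand-in corpus and medical vocabulary.
notes = ['Patient presents with wheezing.', 'Patient presents with sob.']
medical_terms = {'anaphylaxis', 'wheezing', 'sob'}
# Count word occurrences across the corpus.
counts = Counter(word for note in notes for word in re.findall(r'[a-z]+', note.lower()))
MIN_COUNT = 2  # require a minimum number of appearances to avoid including misspellings
speller = Speller()
speller.nlp_data = {word: count for word, count in counts.items() if count >= MIN_COUNT}
# Ensure the medical vocabulary is fully contained; give missing terms a frequency of 1.
for term in medical_terms:
    speller.nlp_data.setdefault(term, 1)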
wordninja
wordninja has a great name. It uses the Viterbi algorithm to find the ‘shortest path’ through an out-of-vocabulary sequence using its vocabulary. With this superpower, it can perform text segmentation quite well:
wn.split('The27yearoldpatientpresentstodaywithsob.')
# ['The', '27', 'year', 'old', 'patient', 'presents', 'today', 'with', 'sob']
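To see the idea behind the ‘shortest path’, here is a toy cost-minimization sketch; this is not wordninja’s actual code, and the word list is made up, but the rank-based word cost is the same one wordninja derives from its frequency-ordered vocabulary:
from math import log
# Toy vocabulary, most frequent words first; cost grows with rank.
words = ['the', 'patient', 'presents', 'today', 'with', 'sob']
cost = {w: log((i + 1) * log(len(words))) for i, w in enumerate(words)}
maxlen = max(len(w) for w in words)
def toy_split(text):
    """Return the minimum-cost segmentation; substrings that are not known words get a huge penalty."""
    best = [(0.0, '')]  # best[i] = (total cost, last word) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + cost.get(text[j:i], 9e99), text[j:i])
            for j in range(max(0, i - maxlen), i)
        ]
        best.append(min(candidates))
    # Walk backwards through the table to recover the chosen words.
    out, i = [], len(text)
    while i > 0:
        out.append(best[i][1])
        i -= len(best[i][1])
    return list(reversed(out))
toy_split('patientpresentstodaywithsob')
# ['patient', 'presents', 'today', 'with', 'sob']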
It does have some limitations, however. First, it is unable to resolve ambiguity based on context. To return to our areusable example, let’s create a sentence where the target phrase ‘a reusable’ is the only sensible option:
wn.split('I want areusable bag.')
# ['I', 'want', 'are', 'usable', 'bag']
We can change the underlying ‘language model’ (i.e., a word list built on the fly from a .txt.gz file). This is a bit challenging to demonstrate, so we’ll directly modify the LanguageModel to show how it works.
def set_lm(words):
    """Manually re-run the cost-building steps from LanguageModel.__init__."""
    from math import log
    # The word cost assumes the most frequent words are provided first.
    wn.DEFAULT_LANGUAGE_MODEL._wordcost = dict((k, log((i + 1) * log(len(words)))) for i, k in enumerate(words))
    wn.DEFAULT_LANGUAGE_MODEL._maxword = max(len(x) for x in words)
    return wn
# Create only two words in the language model
set_lm(['a', 'reusable']).split('areusable')
# ['a', 'reusable']
# Words supplied earlier are assumed to be more frequent (and higher probability)
set_lm(['a', 'reusable'] + ['are', 'usable']).split('areusable')
# ['a', 'reusable']
set_lm(['are', 'usable'] + ['a', 'reusable']).split('areusable')
# ['are', 'usable']
To supply your own language model, you can create a text file with words ordered by frequency, from most frequent to least frequent: a an the for ... anaphylaxis. As suggested before, be sure to cut off low-frequency words, as they are likely misspellings. In addition, add other data sources to your list (e.g., MedDRA or the UMLS for medical text).
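For example, wordninja can load a custom model from a gzipped word list; the file name and word list below are illustrative:
import gzip
# A hypothetical frequency-ordered word list with medical terms included.
words = ['a', 'an', 'the', 'for', 'patient', 'presents', 'with', 'sob', 'wheezing', 'anaphylaxis']
with gzip.open('medical_words.txt.gz', 'wt') as fh:
    fh.write('\n'.join(words))
lm = wn.LanguageModel('medical_words.txt.gz')
lm.split('patientpresentswithanaphylaxis')
# ['patient', 'presents', 'with', 'anaphylaxis']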
Other Complications
There are a few additional complications which ought to be mentioned, but I won’t go into too much detail.
First, if applying these to your text out of the box, order is very important. Let’s take the sequence ‘The 27-year old patient presents today withsob and whezing.’ and apply spell correction first, then word segmentation, and then the reverse.
# spell correct, then text segment
' '.join(wn.split(speller('The 27-year old patient presents today withsob and whezing.')))
# 'The 27 year old patient presents today with sob and whaling'
# text segment, then spell correct
speller(' '.join(wn.split('The 27-year old patient presents today withsob and whezing.')))
# 'The 27 year old patient presents today with sob and the king'
These produce two very different corrections for the misspelled ‘whezing’.
Second, as suggested before, misspelled words that have been run together present even greater challenges (though not quite insurmountable ones). This is because autocorrect and wordninja would need to work together. Here’s an example of the nature of the problem:
' '.join(wn.split('Patientpresentstoday'))
# 'Patient presents today'
# Now, let's misspell some things:
' '.join(wn.split('Patietpresntstoda'))
# 'Pati et pre s nts toda'
speller(' '.join(wn.split('Patietpresntstoda')))
# 'Path et pre s nts today'