One of my regular tasks in presentations is to dedicate a couple of slides to introducing word embeddings. Words are, unfortunately, arbitrary in their spelling (and, relatedly, their pronunciation). For example, if we were to forget our knowledge of English and glance at the English words rock, sock, and rook, we might assume that they are…
Category: nlp
Reviewing Regex Matches with Context Window in `polars`
In natural language processing tasks (especially when building regular expression-based tools), it’s important to be able to review text efficiently. When I first started, the default approach was reviewing in an Excel workbook. This involved a few columns of metadata, a giant blurb of text to be reviewed, followed by a column to record the…
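As a rough illustration of the context-window idea, here is a minimal sketch using only Python's standard `re` module. It is not the post's `polars` implementation; the function name `matches_with_context`, the `window` parameter, and the sample note are all hypothetical and chosen just for this example.

```python
import re


def matches_with_context(text, pattern, window=30):
    """Return (left, match, right) tuples for every regex match in `text`.

    `left` and `right` hold up to `window` characters of surrounding text,
    so a reviewer sees each hit in context rather than the whole document.
    """
    results = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        results.append((left, m.group(0), right))
    return results


# Hypothetical sample text, for illustration only.
note = "Patient denies chest pain. Reports mild Chest Pain on exertion last week."
for left, hit, right in matches_with_context(note, r"chest\s+pain", window=15):
    print(f"...{left}[{hit}]{right}...")
```

Each row carries only the match and a few characters on either side, so it can sit next to the metadata columns and a column for the reviewer's decision.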
Fixing Healthcare Text for NLP: Spell Correction and Word Segmentation
Healthcare text can be challenging to work with. The transformations, simplifications, and shortcuts taken to store this data for secondary use (e.g., research) cause major problems for downstream use. These upstream failures might strip spaces (thereby causing run-together words), remove other formatting characters (e.g., newlines and tabs), and combine what were once pretty-looking tables…
Building Language Rules in spaCy
spaCy provides a number of useful methods for exploring and creating patterns after a particular text or document has been read. To see this in action, let’s use spaCy to build some rules in the more computational linguistic side of NLP. So, for those less interested in language, forgive a brief digression into Polish. In…
Using spaCy for Sentence Splitting
By default, spaCy carries around a powerful battery of pipelines and swings these mighty chainsaws at every passing tree and twig. Sometimes, however, you only want a small pruner to accomplish some smaller task. Can spaCy still work in such a use case? For example, suppose that all I want from spaCy are my documents…