One of my regular tasks in presentations is to dedicate a couple of slides to introducing word embeddings. Words are, unfortunately, arbitrary in their spelling (and, relatedly, their pronunciation). For example, if we were to forget our knowledge of English and glance at the English words rock, sock, and rook, we might assume that they are related in meaning, since their spellings are so close. The transformation from rock to sock is a mere r -> s; that from rock to rook is c -> o. Should we append ‘stone’ or ‘pebble’ to the list, we’d expect their meanings to be quite distinct from rock’s: the spellings of stone and pebble have little, if any, relationship to rock. Yet it is stone and pebble that are rock’s near-synonyms, while sock and rook are unrelated. We might chart these spelling-based assumptions on some coordinates where more proximal nodes are more similar in meaning.

This makes words incredibly difficult to work with in computer science. We would like to develop some function that creates a meaningful representation of words, such that rock, pebble, and stone are quite close to one another, resembling more the following chart:

We can’t use the spellings themselves. If you don’t yet know the answer, how would you solve it? The trick is not to focus on just the word itself, but on its wider context: a word is known by the company it keeps. In other words, the insight is that similar words appear in similar contexts. And, fortunately for us, algorithms have been developed to create these representations, which are known as embeddings. Some popular algorithms include Word2Vec and fastText (both accessible via Python in the gensim package), though there are many others.
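To make that concrete, here’s a minimal sketch of training such embeddings with gensim’s Word2Vec. The toy corpus and parameters are purely illustrative; real models are trained on millions of sentences.

```python
# A minimal sketch of training word embeddings with gensim's Word2Vec.
# The toy corpus is purely illustrative; a real model needs millions of
# sentences before 'rock' and 'stone' reliably converge.
from gensim.models import Word2Vec

sentences = [
    ["the", "rock", "rolled", "down", "the", "hill"],
    ["a", "stone", "rolled", "down", "the", "hill"],
    ["she", "skipped", "a", "pebble", "across", "the", "lake"],
    ["he", "threw", "a", "rock", "across", "the", "lake"],
    ["he", "pulled", "a", "sock", "onto", "his", "foot"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

# Words appearing in similar contexts receive similar vectors; on a
# real corpus, rock/stone would score far higher than rock/sock here.
print(model.wv.similarity("rock", "stone"))
print(model.wv.similarity("rock", "sock"))
```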
These embeddings take the form of long, many-dimensional arrays (called vectors). In the example above, we’ve projected hypothetical embeddings onto a 2-dimensional plane with x and y coordinates. Our actual embeddings may be 300-dimensional vectors, and depicting that is left as an exercise to the reader… Such a space is difficult for humans to explore: even by adding color and video (i.e., movement over time), we may edge towards visualising 4 or 5 dimensions, but 300 is far too many for mere mortals.
There are various methods for reducing dimensionality and projecting embeddings into two or three dimensions. While these projections obviously lose quite a bit of information, they give us something we can interact with visually, and enough structure is retained that the 3-dimensional output can be valuably explored.
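As a sketch of what this looks like in code, here is scikit-learn’s PCA reducing 300-dimensional vectors (a random matrix standing in for real embeddings) down to 3 dimensions; t-SNE and UMAP follow the same fit/transform pattern via scikit-learn and the umap-learn package.

```python
# Sketch: reducing 300-dimensional vectors to 3 with PCA (scikit-learn).
# The random matrix is a stand-in for real embeddings.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(10_000, 300))

pca = PCA(n_components=3)
projected = pca.fit_transform(embeddings)

print(projected.shape)  # (10000, 3)
# How much of the original variance the 3 components retain:
print(pca.explained_variance_ratio_.sum())
```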
One relatively straightforward tool is the Embedding Projector, with a demo hosted at https://projector.tensorflow.org. In this projection, 10,000 points (i.e., words) are included in the dataset, each with a dimension of 200 (though reduced to 3 dimensions for the projection). The Embedding Projector comes with three reduction algorithms: principal component analysis (PCA; the default), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). It’s worth jumping over to t-SNE or UMAP, as these do a bit better than PCA at displaying local clusters.
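The hosted demo ships with prepared datasets, but the projector’s Load tab also accepts your own vectors as two tab-separated files: one of vector values and one of labels. A minimal sketch, assuming the gensim model trained above:

```python
# Sketch: exporting a gensim model as the two TSV files the Embedding
# Projector's Load tab accepts. `model` is the Word2Vec model from above.
with open("vectors.tsv", "w") as vec_file, open("metadata.tsv", "w") as meta_file:
    for word in model.wv.index_to_key:
        vec_file.write("\t".join(str(x) for x in model.wv[word]) + "\n")
        meta_file.write(word + "\n")  # single column: no header row
```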
We can explore the projection by performing a search in the top right. The nearest neighbours are found in the original space (i.e., not the projection), but they are displayed within the projection. For example, we can search for ‘document’:

and we can then view the local cluster:

If we zoom into one of these nodes (e.g., ‘software’), we’ll notice its other close neighbours in the projection (though these are not as close in the original 200 dimensions! Recall that not all relationships can be maintained when we reduce the dimensionality to such a degree…). For ‘software’, I can hover over its neighbours and see ‘bug’, ‘functionality’, and ‘libraries’, all terms related to software.
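This neighbour inspection in the original space is essentially what gensim exposes as most_similar. A sketch against the toy model from earlier (its tiny vocabulary has no ‘software’ or ‘document’, so we query ‘rock’ instead):

```python
# Sketch: nearest neighbours in the original, unreduced vector space.
# With a model trained on a real corpus, this surfaces the same cluster
# the projector displays.
for neighbour, score in model.wv.most_similar("rock", topn=5):
    print(f"{neighbour}\t{score:.3f}")
```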

One word that is not at all close to the cluster of ‘document’ synonyms is ‘office’. Clicking it displays its own neighbours. Intuitively, we can understand how ‘document’ is related to ‘office’, but also why ‘office’ isn’t related to many of the other document-related terms (e.g., ‘manuscript’, ‘software’, ‘format’, ‘file’, ‘copy’, ‘application’, ‘pdf’, etc.).

What might the practical use cases be for something like this? Well, a projection like this allows us to actually explore our model, to ensure that it has captured the expected associations and has groupings that make sense. If we explore the clusters, we’ll find groupings of names, of food items, of emotions, etc. There is also an option to color the points based on other metadata (e.g., we could pre-label different groups we’re interested in based on some lexicon and ensure that they are appropriately grouped).
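As a sketch of that last idea: the projector colours points by any extra column in the metadata file, so we can write a label column from a mini-lexicon (invented here for illustration) and check visually that the groups cluster as expected.

```python
# Sketch: a metadata.tsv with a label column the projector can colour by.
# The mini-lexicon is invented purely for illustration.
lexicon = {
    "rock": "object", "stone": "object", "pebble": "object",
    "hill": "place", "lake": "place",
}

with open("metadata.tsv", "w") as meta_file:
    meta_file.write("word\tgroup\n")  # multi-column metadata needs a header
    for word in model.wv.index_to_key:
        meta_file.write(f"{word}\t{lexicon.get(word, 'other')}\n")
```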
If we jump over to the ‘MNIST with images’ dataset (top left dropdown) and run UMAP on that, we can see more clearly the sort of analysis we might be interested in: how the digits group together (as well as which errors or confusions are being made):

Beyond just exploring projections of word embeddings, we can use sentence-level or even document-level embeddings and explore those. One of my favourite visualisations was done on the Programming Historian blog, in which abstracts from philosophy, history, and linguistics were converted to vectors and compared: https://programminghistorian.org/en/lessons/clustering-visualizing-word-embeddings#visualising-the-results. The same visualisation could be done in the TensorFlow Embedding Projector. It shows points at which the different domains seem to overlap, or, perhaps, where an abstract from a particular domain has been misclassified.
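One route to such document-level vectors is gensim’s Doc2Vec; the sketch below uses two invented stand-in abstracts, and the resulting vectors could be exported to TSV and explored in the projector exactly like word vectors. (Sentence-transformer models are another common choice.)

```python
# Sketch: document-level embeddings with gensim's Doc2Vec. The two
# "abstracts" are invented stand-ins for real texts.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

abstracts = [
    ("hist_001", "archival sources on nineteenth century trade routes"),
    ("ling_001", "vowel harmony patterns in agglutinative languages"),
]
corpus = [TaggedDocument(text.split(), [tag]) for tag, text in abstracts]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
print(model.dv["hist_001"].shape)  # (50,)
```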
Over the next few weeks, I’ll explore how to build vectors for both words and documents, how to load these into a local embedding projector, and which other projections allow us to interpret and troubleshoot embedding models.