I am one of the primary managers of my company’s public GitHub site. While we generate a lot of code, most of this work is not designed to be shareable (it is run once or twice for a particular project before we move on to the next one), as the focus is on the research work. For…
Exploring `polars` Masks
When working with arrays and dataframes, a ‘mask’ is a filter that selects a subset of the source array or dataframe. This is often represented as a boolean array or Series like: [True, False, False, True]. When evaluated against a DataFrame, we’ll get the first and fourth rows back since these are both True. Since…
Visualising Word Embeddings: Exploring TensorFlow’s Embedding Projector
One of my regular tasks in presentations is to dedicate a couple of slides to introducing word embeddings. Words are, unfortunately, arbitrary in their spelling (and, relatedly, their pronunciation). For example, if we were to forget our knowledge of English and glance at the English words rock, sock, and rook, we might assume that they are…
Reviewing Regex Matches with Context Window in `polars`
In natural language processing tasks (especially when building regular expression-based tools), it’s important to be able to review text efficiently. When I first started, the default approach was reviewing in an Excel workbook. This involved a few columns of metadata, a giant blurb of text to be reviewed, followed by a column to record the…
Coding with a Chatbot for Dummies
My first attempt to code with a chatbot was several years ago and involved using ChatGPT to do a couple of data transformations using pandas. The dataset was not large. My procedure was something like: Regarding #3, I think I started by trying to type it out myself, the idea of making sure I understood…
Evaluating Generative Chatbots
I was at an epidemiology conference about a month ago – not a typical location for a data scientist, but circumstances found me there. A number of sessions have embraced a certain (albeit nervous) enthusiasm regarding access to decoder-only transformers, often called ‘AI’ or ‘large language models’. There is a certain buzz and excitement —…
`polars`: `replace_strict` vs `replace`
When I was first learning polars, I had an immediate need to replace a certain column with a mapping. This often happens in data science where a variable is stored using a numerical representation rather than a string to save space, simplify filtering, etc. The ‘mapping’ is stored in either documentation or some sort…
The Journey from ‘Getting Started’ to Expert
One of the challenges when picking up a new programming tool or package is moving from the very basic ‘Getting Started’ page to the vast array of API documentation. The middle ground is immense and disorienting. It takes effort and persistence to advance, to actually learn the technology. The cognitive load is heavy and…
Installing a Project from GitHub with `uv`
Here’s a short guide (targeted primarily at myself) on how to get a project from GitHub (or some other git-based repository) onto my machine. This guide assumes you have git and uv installed and added to your path. Also, the secret to understanding uv is seeing the lockfile as foundational rather than the packages currently…
Starting `polars`: A New Paradigm for Learning Technologies
Recently, I’ve been using polars a lot more. This was a library I’d intended to learn a while ago, but it’s very difficult to start using (i.e., learning) a new package for a project, particularly as deadlines begin to loom. My clients don’t really care whether I use polars or pandas, so long as I…