Here’s a short guide (targeted primarily at myself) on how to get a project from GitHub (or some other git-based repository) onto my machine. This guide assumes you have git and uv installed and added to your path. Also, the secret to understanding uv is seeing the lockfile as foundational rather than the packages currently…
Starting `polars`: A New Paradigm for Learning Technologies
I’ve been using polars a lot more recently. This was a library I’d intended to learn a while ago, but it’s very difficult to start using (i.e., learning) a new package for a project, particularly as deadlines begin to loom. My clients don’t really care whether I use polars or pandas, so long as I…
Should `8.475` round to `8.48` or `8.47`?
If you use Python’s built-in round, the answer is easy. In theory, round(8.475, 2) looks at the 5 and therefore rounds to the nearest even digit (i.e., 8 not 7), so the result should be 8.48. EOM. But when using pandas I get 8.48, while polars gives me 8.47. Why the difference? First,…
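One way to check the theory against what the float actually holds is to compare the stored binary value with the written decimal. The sketch below uses only the standard library; the variable names are illustrative and not from the original post, and it makes no claim about how pandas or polars round internally.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Decimal(float) exposes the exact binary value that the literal 8.475 is stored as.
stored = Decimal(8.475)
exact = Decimal("8.475")

# If the stored value sits below the written value, there is no true tie to break.
is_below_tie = stored < exact

# Built-in round on the float, for comparison with the decimal results below.
float_rounded = round(8.475, 2)

# Rounding the exact decimal 8.475 under two common tie-breaking rules.
half_even = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
half_up = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(stored, is_below_tie, float_rounded, half_even, half_up)
```

Printing `stored` shows whether the question is really about ties at all, or about a value that was never exactly 8.475 once it became a float.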
How to Determine the Flags of a Compiled Regular Expression?
I recently had the challenge of determining which flags had been set in a compiled regular expression. In other words, write a function that, given a compiled regular expression (e.g., re.compile('test', re.I | re.M)), determines that the flags were re.I and re.M. A first attempt might assume that the class re.Pattern has a flags attribute…
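A minimal sketch of one possible approach, assuming the integer `flags` value of a compiled pattern can be tested bit by bit against the members of `re.RegexFlag`. The helper name `get_flags` is hypothetical and this is not necessarily the approach the post settles on.

```python
import re


def get_flags(pattern):
    """Return the set of individual re.RegexFlag members set on a compiled pattern."""
    return {
        flag
        for flag in re.RegexFlag
        # Skip any zero-valued member and keep flags whose bit is set.
        if flag.value and pattern.flags & flag.value
    }


flags = get_flags(re.compile("test", re.I | re.M))
print(sorted(flag.name for flag in flags))
```

Note that patterns compiled from `str` carry `re.UNICODE` implicitly, so the result can contain more flags than were passed to `re.compile`.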
Getting Started with `uv`
I started using uv as my default package manager. I’ve only ever used pip, but some echoes of uv had been reverberating in my head, so I gave it a try. And, after a couple months, I’m still using it. I’ve enjoyed the speed of installation and general dependency management, though have endured a few…
The Problem of Citationware: Unkept Promises In Scientific Publications
I have run into a lot of references in papers to software that was used to support various research efforts, particularly in the area of healthcare research. The idea behind this sharing is admirable: supporting reproducible research and clearly providing the methods. In a number of papers, the tool with its cute, acronym-formed name boldly…
Testing Logged Output using `pytest`
Previously, we considered how to test print output using pytest. This is a continuation which considers how to test logged output. For printed output, we used pytest’s capsys fixture to grab the stdout (quite useful) and stderr (unclear utility?). capsys doesn’t provide a mechanism to tap into the logging stream. So, what should we use…
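As a hedged illustration, pytest ships a `caplog` fixture for captured log records, and a test built on it might look like the sketch below. The logger name `demo` and the function `check_value` are hypothetical, and the excerpt above does not confirm that this is the fixture the post goes on to use.

```python
import logging

logger = logging.getLogger("demo")  # hypothetical logger name


def check_value(value):
    """Log a warning when the value is negative and return whether it was valid."""
    if value < 0:
        logger.warning("negative value: %s", value)
        return False
    return True


def test_check_value_logs_warning(caplog):
    # caplog is injected by pytest; at_level ensures WARNING records are captured.
    with caplog.at_level(logging.WARNING, logger="demo"):
        assert check_value(-1) is False
    assert "negative value: -1" in caplog.text
```

Running `pytest` on a file containing this code collects `test_check_value_logs_warning` automatically; no explicit pytest import is needed because the fixture is injected by name.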
Testing Printed Output using `pytest`
While typically testing focuses on evaluating the output of a function (i.e., what it returns) or the state of an object at a particular point in time, there are occasions when it can be important to test the printed or logged output. This might include warnings about module, parameter, or function deprecation; messages about problematic…
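A minimal sketch of testing printed output with pytest’s `capsys` fixture, assuming a hypothetical `warn_deprecated` function that prints a deprecation notice. The function and message are illustrative, not taken from the original post.

```python
def warn_deprecated(name):
    """Print a deprecation notice for the given (hypothetical) function name."""
    print(f"WARNING: {name} is deprecated")


def test_warn_deprecated_prints(capsys):
    # capsys is injected by pytest and captures writes to stdout and stderr.
    warn_deprecated("old_func")
    captured = capsys.readouterr()
    assert captured.out == "WARNING: old_func is deprecated\n"
    assert captured.err == ""
```

`capsys.readouterr()` returns everything captured so far and resets the buffers, so calling it again later in the same test only sees output produced after the first call.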
Optimizing `to_sql` Method in `pandas`
My environment requires a lot of database work in SQL Server to access data. The data I work with (i.e., text) isn’t stored particularly efficiently so I will sometimes need to pull down data, perform some manipulations, re-upload, do some joins, and download again. Sure, there are a number of shiny ‘solutions’ that would make…
Calculating Jaccard Similarity Coefficients in `pandas`
I’m quite accustomed to looking at performance against some gold (or silver) standard. It’s nice to have some ready definition of ‘truth’ and then, when applying some algorithm, we can clearly see if it matched or failed to match. More recently, however, I was attempting to compare the outputs of multiple UMLS-processing NLP systems on…
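For reference, the Jaccard similarity coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below uses plain Python sets rather than pandas; the `jaccard` helper and the concept identifiers are hypothetical.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical concept identifiers produced by two systems for the same text.
system_a = {"C1", "C2", "C3"}
system_b = {"C2", "C3", "C4"}
print(jaccard(system_a, system_b))  # → 0.5
```

A coefficient of 1.0 means the two systems produced exactly the same set, and 0.0 means they share nothing, which makes it usable when no gold standard is available.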