I have run into a lot of references in papers to software that was used to support various research efforts, particularly in the area of healthcare research. The idea behind this sharing is admirable: supporting reproducible research and clearly providing the methods. In a number of papers, the tool with its cute, acronym-formed name boldly…
Author: Foggy Programmer
Testing Logged Output using `pytest`
Previously, we considered how to test printed output using pytest. This continuation considers how to test logged output. For printed output, we used pytest’s capsys fixture to grab stdout (quite useful) and stderr (unclear utility?). capsys doesn’t provide a mechanism to tap into the logging stream. So, what should we use…
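As a rough illustration of the capsys approach this excerpt refers back to, here is a minimal sketch. The function `greet` and the test name are hypothetical, made up for this example rather than taken from the post; the only pytest feature assumed is the built-in `capsys` fixture and its `readouterr()` method.

```python
def greet(name):
    """Hypothetical function under test: writes a greeting to stdout."""
    print(f"Hello, {name}!")


def test_greet_prints_greeting(capsys):
    # pytest injects the built-in `capsys` fixture by argument name.
    greet("world")
    # readouterr() returns what has been written to stdout and stderr so far.
    captured = capsys.readouterr()
    assert captured.out == "Hello, world!\n"
    assert captured.err == ""
```

Saved in a file such as `test_greet.py` (a hypothetical name), this would be collected and run by invoking `pytest` in the same directory.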
Testing Printed Output using `pytest`
While typically testing focuses on evaluating the output of a function (i.e., what it returns) or the state of an object at a particular point in time, there are occasions when it can be important to test the printed or logged output. This might include warnings about module, parameter, or function deprecation; messages about problematic…
Optimizing `to_sql` Method in `pandas`
My environment requires a lot of database work in SQL Server to access data. The data I work with (i.e., text) isn’t stored particularly efficiently so I will sometimes need to pull down data, perform some manipulations, re-upload, do some joins, and download again. Sure, there are a number of shiny ‘solutions’ that would make…
Calculating Jaccard Similarity Coefficients in `pandas`
I’m quite accustomed to looking at performance against some gold (or silver) standard. It’s nice to have some ready definition of ‘truth’ and then, when applying some algorithm, we can clearly see if it matched or failed to match. More recently, however, I was attempting to compare the outputs of multiple UMLS-processing NLP systems on…
Connecting to Teradata with Python
Teradata is a relational database released by Teradata Corporation. I have some data that lives there and occasionally need to access it — how to approach this with Python? Teradata Corporation publishes the teradatasql library to provide a PEP 249-compatible interface to the Teradata database. This Python package is actively maintained internally. Connection Basic usage…
Hyphen/Underscore Acting up on Ergodox EZ Keyboard
I have been using an Ergodox EZ keyboard for over seven years. The thing starts at $300 (and you probably want some tent tilts (I have), and maybe some wrist rests (I don’t use these anymore)), but after seven years of having to occasionally use someone else’s keyboard (or the default laptop keyboard), this was…
`seaborn`: The Basics
seaborn is a Python graphing library which interacts incredibly well with pandas. Yes, pandas does have its own plotting functions accessible from df.plot, which are particularly easy to build and (quite conveniently) don’t require another external library. I’ve found pandas’ plots particularly useful for quick checks and calculations while doing some other aspect of…
Fixing Healthcare Text for NLP: Spell Correction and Word Segmentation
Healthcare text can be challenging to work with. The transformations, simplifications, and shortcuts taken to store this data for secondary use (e.g., research) result in major problems for ultimate use. These upstream failures might strip spaces (thereby causing run-together words), remove other formatting characters (e.g., newlines and tabs), and combine what were once pretty-looking tables…
Logging Function Parameters with `loguru`
Log files can often be useful sources of historical information about how programs run. I have found them sitting next to datasets and used them to get more information on the provenance of the dataset. Perhaps I could add a function that would log all of the parameters that were run? Sure, a configuration file…