I’ve been using polars a lot more recently. This was a library I’d intended to learn a while ago, but it’s very difficult to start using (i.e., learning) a new package for a project, particularly as deadlines begin to loom. My clients don’t really care whether I use polars or pandas, so long as I can provide them with high-quality data in a reasonable amount of time. Learning new technologies is therefore a burden.
Suppose, for example, I decide to work with polars: pip install polars, start a Jupyter notebook, and then rely on the examples on the polars home page (and PyCharm’s autocomplete) to load a csv file:
import polars as pl
df = pl.read_csv('data.csv')
This, of course, resulted in an error (a pleasant way to start exploring a library): ComputeError: Could not parse '0.5' as dtype Int64 at column 13.
Fortunately, the error includes a useful workaround: add infer_schema_length and set it to 10,000. (In brief, polars guessed the wrong datatype for a column, so I’m telling it to look at more rows before deciding.)
df = pl.read_csv('data.csv', infer_schema_length=10_000)
df.head()
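As an aside (this check wasn’t part of my original session), you can see which dtypes polars actually inferred by inspecting the schema, and, in recent polars versions, pin down a troublesome column explicitly with schema_overrides instead of raising infer_schema_length. The column name below is a stand-in:
# which dtype did polars pick for each column?
print(df.schema)
# alternative fix: declare the dtype of the offending column up front
df = pl.read_csv('data.csv', schema_overrides={'column_13': pl.Float64})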
I’m feeling proud of myself — and rather tempted to celebrate with a piece of chocolate and a latte, though I resist the temptation as I explore the dataframe. Hmm…one of the columns is an integer code, but I want the string label that each code stands for. Here’s the mapping:
mapping = {
    1: 'affirmed',
    2: 'probable',
    3: 'possible',
    4: 'unlikely',
    5: 'negated',
}
Well, my brain immediately jumps to pandas, where I would run:
# pandas equivalent
# I typically reach for `apply`
df['status'].apply(lambda x: mapping[x])
# although it is probably better to use `map` here
df['status'].map(mapping)
But I have polars loaded…I know it’d probably be quicker to just use pandas, and it takes some force of will to put in the effort…not to figure out the syntax, but to think of how to search for it. (This is, perhaps, an interesting generational development: my manager would immediately jump to the documentation page; I’d default to search; and the next generation will type it into their chat client?) Perhaps 'polars equivalent of df.apply'? Or 'polars map value with dict'?
This is where the effort of acquiring a new technology is at its most challenging: figuring out how to discover the appropriate methods to actually accomplish the task. Even had I read through a text on polars or reviewed all of its documentation, I’m not sure that I would be able to recall the appropriate functions or approach. Now, however, we can rely on chat-based LLM clients. These chatbots are better at interpreting my meaning or request, and they are able to synthesize search results into a readable extract (i.e., summarisation). If I describe my problem in free text (e.g., 'Polars: replace column with values from dict.'), I will get an extended response with examples.
# Chat Response:
# [...omitted explanation...]
# Replace values according to dictionary
df = df.with_columns(
    pl.col("role").map_dict(role_to_department).alias("department")
)
Unfortunately, this is wrong. It was right for polars version 0.17 and earlier. When I first tried to figure this out, the bot insisted that I was wrong (now, however, it lists both versions). An online search eventually revealed that map_dict should be replaced by replace (for the same datatype) or replace_strict (for switching datatypes, as I’m doing from int -> str).
df.with_columns(
    pl.col('status').replace_strict(mapping).alias('status')
)
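A wrinkle worth noting (my addition, not something the chatbot volunteered): replace_strict raises an error if the column contains a value that is missing from the mapping. If unmapped codes are possible, you can supply a default (the 'unknown' label below is just a placeholder) and make the int -> str cast explicit with return_dtype:
df = df.with_columns(
    pl.col('status')
    .replace_strict(mapping, default='unknown', return_dtype=pl.String)
    .alias('status')
)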
I bring up this example in particular because there will be challenges as the APIs for various technologies grow and evolve. Almost all of the other questions I posed (including a lot of 'polars equivalent for [...] in pandas') were answered more-or-less correctly, with useful examples.
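To give a flavour of the questions that worked well, here is my own reconstruction (with made-up column names, not a transcript): asking for the polars equivalent of a pandas groupby-aggregate reliably produced something like:
# pandas
df_pd.groupby('status')['value'].mean()
# polars equivalent
df.group_by('status').agg(pl.col('value').mean())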
Reflections
This experience using a GPT-based chatbot to start using polars has been instrumental in my replacing pandas with polars on all of my projects (except where some legacy code still uses pandas). I’m confident that if I don’t know how to do something, I’ll be able to ask a chatbot for advice and useful examples.
Why not just have the chatbot write the code for me? Well, there are some straightforward objections:
- I wanted to learn the new technology (not just use it). Having to query a chatbot for everything doesn’t seem particularly efficient, and I’m no longer the one doing the thinking (or the learning).
- I have a mortal fear of having to debug code — especially code that I have not written, and especially code in a technology I don’t know.
- It’s hard to think about data holistically, and about possible issues, errors, etc., when I’m not actively exploring it or able to think about it in a procedural way. What missing data might there be? What trends occur over time that might impact our study? While I could ask the chatbot these questions as well, I’m less confident in my ability to think through my steps if I’m relying on an external system.
- Dependence. It doesn’t seem like a good idea to be dependent on technologies like this for my daily work. I use PyCharm, but I know that I could get my work done without it (albeit a bit more slowly in large projects). I want to have the same independence from chatbots.
These concerns were made evident in a recent internal presentation I attended. The speaker, a masterful practitioner of the SAS language, was reporting on an attempted migration from SAS to Python. SAS has become inordinately expensive, and Python has a much more budget-friendly price tag. The speaker had used a chatbot (ChatGPT? Maybe Copilot?), and started with instructions to translate all of the code. This is equivalent to the recommendation of just having a chatbot write the polars code for me.
Next, the speaker reported running the code in Python. When he came to an error, he’d dump it into his friendly neighbourhood chatbot and copy-paste the response. Eventually, he got the code working.
He didn’t provide a lot of code examples in his presentation (though some of them were slightly concerning), so I followed up with a request for the code. They were planning to use it in production, and I thought that I could help clean it up a little and remove the SAS-ness (i.e., make it a bit more Pythonic by refactoring it into functions, etc.). Well, the code was not good, and just looking at it gave me terrible chills and sweats: try-except blocks nested three layers deep, overly complex sqlalchemy calls to read a database, etc.
I’m also fairly confident that the speaker learned almost nothing about Python. He can run the code, but I believe that he treated everything as one might treat magical spells — carefully copying, terrified of altering anything lest devastating consequences befall… Any change will need to go through the chatbot. And if I asked him to do a simple operation in pandas (the technology used throughout the ten or so rather long Python scripts), I’m not sure he could do it. Also, based on the amount of time it took him, I think he might have done better learning step by step:
- Look at the first bit of SAS code (the smallest ‘function’ that does something), figure out what it does.
- Ask your chatbot, 'assume I have a sas7bdat file/database/csv file/etc.; how do I do X in Python?' Perhaps provide the equivalent SAS code. (A sketch of the sort of answer to expect follows this list.)
- Read the response, and type it in. Make sure you understand everything you’re writing — if not, just ask!
- Run what you just wrote to make sure it works! (If you are actually doing this SAS -> Python conversion, try using a Jupyter Notebook to get a more procedural, SAS-like feel. Next, move the code to an actual *.py file for actual use.)
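For the sas7bdat question above, the sort of answer one would hope to see looks something like the following sketch (the filename is hypothetical; pandas can read SAS datasets directly, so no SAS licence is required):
import pandas as pd

# pandas reads sas7bdat files natively;
# the encoding depends on how the original file was written
df = pd.read_sas('analysis.sas7bdat', format='sas7bdat', encoding='latin-1')
df.head()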
Summary
When learning a new technology, rely heavily on chatbots to summarise methodology and individual ideas, but not to write all of your code for you. Not only will it probably not work, but you’ll end up with terrible code, it will take longer, and you’ll learn nothing.