`polars`: `replace_strict` vs `replace`

When I was first learning polars, I had an immediate need to replace the values in a column using a mapping. This often happens in data science when a variable is stored using a numerical representation rather than a string to save space, simplify filtering, etc. The ‘mapping’ is stored in either documentation or some sort of lookup table, which might contain information like:

id	name	description
1	high school/GED	Highest level of education…
2	associate’s	Obtained associates or equiva…
3	bachelor’s	Obtained bachelor’s or equiva…

And the dataset will then look like:

subject_id	education
1	1
2	3
3	3
4	2
…	…

However, when displaying this information, it can be useful to apply the mapping before generating reports or graphs. This is quite simply done in polars by using with_columns (i.e., a method for efficiently adding or altering multiple columns at the same time) together with replace_strict.

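import polars as pl

# lookup from the numeric code to its label
mapping = {
    1: 'high school',
    2: 'associate',
    3: 'bachelor',
    # ...
}
# overwrite the integer column with its string labels
df = df.with_columns(
    pl.col('education').replace_strict(mapping).alias('education')
)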

Here, we are replacing the polars dataframe (df) with a new dataframe (also df) in which the education column is represented by strings rather than integers. replace_strict allows us to change the type. Since I was just learning polars, I didn’t look too much into why replace_strict worked here while replace in the same context raised an error (namely, I couldn’t change types).

My entire mental model of polars, however, fell apart when I received an error that my mapping was incomplete. Apparently, there were values in my data which were not covered by the mapping. replace wouldn’t do since I wanted to change the type, but I didn’t necessarily know the complete mapping beforehand.

Probably the best approach is therefore to build the mapping from the data using df['education'].unique() and ensure that the mapping contains all these values. Alternatively (and conveniently), replace_strict offers a default parameter to automatically map any unmapped values to some other value. It makes sense that all values require a mapping since we are potentially changing the datatype.

replace, on the other hand, operates under the assumption that only some of the values are being remapped. Suppose, e.g., we merely wanted to merge 1 (high school) and 2 (associate’s). We could use replace with an incomplete mapping, but the datatype must stay the same:

mapping = {
    1: 1,
    2: 1,  # preserve datatype
}
df = df.with_columns(
    pl.col('education').replace(mapping).alias('education')
)

Now, instead, we have simply merged 1 and 2 into 1, so replace is the appropriate choice. All of the other values will remain as they are.

Summary

In brief, use replace_strict when changing datatypes; every value will then need a mapping (or a default must be specified). In contrast, use replace to alter a subset of the values, leaving all values that do not match as they are.