When I was first learning polars, I had an immediate need to replace a certain column with a mapping. This often happens in data science when a variable is stored using a numerical representation rather than a string to save space, simplify filtering, etc. The ‘mapping’ is stored in either documentation or some sort of lookup table, which might contain information like:
id | name | description |
1 | high school/GED | Highest level of education… |
2 | associate’s | Obtained associates or equiva… |
3 | bachelor’s | Obtained bachelor’s or equiva… |
And the dataset will then look like:
subject_id | education |
1 | 1 |
2 | 3 |
3 | 3 |
4 | 2 |
… | … |
However, when displaying this information, it can be useful to apply the mapping before generating reports or graphs. This is easily done in polars by using a with_columns block (i.e., a method for efficiently altering multiple columns at the same time) together with replace_strict.
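    import polars as pl

    # Integer code -> label; the remaining entries are elided here
    mapping = {
        1: 'high school',
        2: 'associate',
        3: 'bachelor',
        # ...
    }
    # Overwrite the integer 'education' column with its string labels
    df = df.with_columns(
        pl.col('education').replace_strict(mapping).alias('education')
    )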
Here, we are replacing the polars dataframe (df) with a new dataframe (also named df) where the education column is no longer represented by integers but by strings. replace_strict allows us to change the type. Since I was just learning polars, I didn’t look too much into why I could use replace_strict while replace in the same context would raise an error (namely, I couldn’t change types).
My entire mental model of polars, however, fell apart when I received an error that my mapping was incomplete. Apparently, there were elements that could appear in my data which were not mapped. replace wouldn’t do since I wanted to change the type, but I didn’t necessarily know the complete mapping beforehand.
Probably the best approach is therefore to collect the values present in the data using df['education'].unique() and ensure that mapping contains all of these elements (or supply some sort of default mapping). Alternatively (and conveniently), replace_strict offers a default parameter to automatically map any unmapped values to some other value. It makes sense that all values require a mapping since we are potentially changing the datatype.
replace, on the other hand, operates under the assumption that only some of the values are being remapped. Suppose, e.g., we merely wanted to merge 1 (high school) and 2 (associate’s). We could use replace with an incomplete mapping, but the datatype must stay the same:
    mapping = {
        1: 1,
        2: 1,  # preserve datatype
    }
    df = df.with_columns(
        pl.col('education').replace(mapping).alias('education')
    )
Now, instead, we have simply merged 1 and 2 into 1, so replace is the appropriate choice. All of the other values will remain as they are.
Summary
In brief, use replace_strict when changing datatypes, but note that every value will need a mapping (or the default parameter must be specified). In contrast, use replace to alter a subset of the values, keeping all that do not match as they are.