When working with arrays and dataframes, a ‘mask’ is a filter that selects a subset of the source array or dataframe. It is often represented as a boolean array or Series such as [True, False, False, True]. Applied to a four-row DataFrame, this mask returns the first and fourth rows, since these are both True; rows 2 and 3 are False, so they are omitted. Masks are helpful because they let us give readable names to the different filters we want to apply to a dataframe. Rather than writing one complex query with multiple conditions, we can apply a series of masks like df.filter(year_between_2008_2020, name_startswith_vowel, ...). Additionally, in polars we can re-use the masks across multiple dataframes so long as they share the columns the masks refer to.
Let’s begin with pandas as a useful reference point. In pandas, we can create a mask like this:
import pandas as pd
# create a toy dataframe
df = pd.DataFrame({
    'id': range(26),
    'year': range(2000, 2026),
})
df.shape[0] # 26
mask = (df['year'] >= 2008) & (df['year'] <= 2020)
res_df = df[mask]
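res_df.shape[0] # 13
The mask we’ve created is a pandas boolean Series of True and False values; it is not a dynamic filter. Thus, if we try to reuse the mask on the result DataFrame (res_df), pandas does not re-evaluate the condition. It aligns the length-26 mask to the 13 rows of res_df by index label (warning that the boolean Series key will be reindexed), and it raises an exception if the mask's index does not cover the DataFrame's index or if the mask is a plain boolean array of the wrong length.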
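What actually happens when the stale mask is reused depends on how it is stored. The sketch below reflects the behaviour I would expect from recent pandas versions (an assumption worth checking against your installed version): a boolean Series is aligned by index label, while a bare NumPy array of the wrong length is rejected.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"id": range(26), "year": range(2000, 2026)})
mask = (df["year"] >= 2008) & (df["year"] <= 2020)
res_df = df[mask]  # 13 rows, still labelled 8..20

# Case 1: the mask as a boolean Series. pandas aligns it to res_df by index
# label instead of re-evaluating the condition (a UserWarning is expected).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    realigned = res_df[mask]
print(realigned.shape[0])
print([str(w.message) for w in caught])

# Case 2: the same mask as a bare NumPy array. There is no index to align on,
# so the 26-vs-13 length mismatch is rejected.
try:
    res_df[mask.to_numpy()]
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```

Either way, the mask is a snapshot of one particular DataFrame rather than a reusable rule.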
Let’s turn to polars. We can do the same thing as above with almost identical syntax (the only change is that the filtering step goes through the filter method):
import polars as pl
# create a toy dataframe
df = pl.DataFrame({
    'id': range(26),
    'year': range(2000, 2026),
})
df.height # 26
mask = (df['year'] >= 2008) & (df['year'] <= 2020)
res_df = df.filter(mask)
res_df.height # 13
This will, however, run into a similar limitation to the pandas mask: it is a fixed-length Series, so we cannot apply it to the output res_df:
import polars as pl
type(mask) # pl.Series
print(mask)
# shape: (26,)
# Series: 'year' [bool]
# [
# false
# false
# false
# false
# false
# …
# false
# false
# false
# false
# false
# ]
res_df.filter(mask) # mask is a length-26 Series, but res_df is length 13
# polars.exceptions.ShapeError: filter's length: 26 differs from that of the series: 13
This severely limits the utility of masks. For example, suppose I had two dataframes from different sources and wanted to apply the same masks to each. Fortunately, polars prefers a different approach to building masks, one that uses expressions rather than Series. This is done with pl.col(column_name). Note that there is still a limitation based on the column_name itself (i.e., the same expression cannot be used on differently named columns). Let’s observe how this works:
# create the mask, but let it be DataFrame agnostic
mask = (pl.col('year') >= 2008) & (pl.col('year') <= 2020)
type(mask)
# <class 'polars.expr.expr.Expr'>
print(mask)
# [([(col("year")) >= (dyn int: 2008)]) & ([(col("year")) <= (dyn int: 2020)])]
This mask is not tied to a particular DataFrame and does not have a pre-defined length. We can thus apply it to both the original DataFrame and the result DataFrame (although, in the latter case, it will have no effect).
res_df = df.filter(mask)
res_df.height # 13
res_df.filter(mask).height # 13 (filtering a second time removes nothing)
Additional Notes
- pandas can allow something similar by using a separate function to generate these expressions, perhaps like this:
def col(name: str):
    """Make a dataframe-independent mask"""
    def _expr(df: pd.DataFrame):
        return df[name]
    return _expr
# usage
mask = col('year')(df) >= 2008
res_df = df[mask]
- For these examples, we’ve created a two-stage filter; there is an easier way in both polars and pandas:
  - polars: pl.col('year').is_between(2008, 2020) (n.b., inclusive)
  - pandas: df['year'].between(2008, 2020)