If you use Python’s built-in round, the answer is easy. In theory, round(8.475, 2) looks at the trailing 5 and rounds to the nearest even digit (i.e., 8, not 7), so the result should be 8.48. EOM. But when I use pandas I get 8.48, while polars gives me 8.47. Why the difference?
First, let’s make sure we understand how Python’s round function works. Why does it round 5 to the nearest even number?
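Before the why, it may help to see the rule in isolation. The sketch below uses ties that are exactly representable as floats (0.5, 1.5, and so on), so floating-point storage does not muddy the picture. The variable names are only for illustration.

```python
# Ties that are exactly representable in binary floating point,
# so the only thing being exercised is the tie-breaking rule itself.
ties = [0.5, 1.5, 2.5, 3.5, 4.5]

# Built-in round with no ndigits returns an int and sends ties to the even neighbour.
rounded = [round(x) for x in ties]
print(rounded)  # [0, 2, 2, 4, 4]
```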
Rounding the 5 to the nearest even number (rather than ‘up’, as I was taught in middle school math) is called banker’s rounding. It solves a specific problem: always rounding 5 up shifts the overall distribution up too. For example, imagine you’re rounding 1000s of more-or-less random numbers. You’d like the distribution of these numbers to stay relatively stable (i.e., you don’t want the mean or median to move much). Assuming a uniform distribution of final digits 0-9, 100 of every 1000 numbers will end in a 5. With middle-school rounding, all those 5s are rounded up, all the 6+ are rounded up, and all the 4- are rounded down. The 6+ and the 4- exactly offset each other, so on their own they leave the mean unchanged. Unfortunately, rounding all 100 of those 5s up adds 500 units (pounds/dollars?) to the total, throwing off the mean. With banker’s rounding, about half of the 5s are rounded up and the other half down, so they offset each other as well.
We can see this in action with the following code:
import random

import numpy as np


def middle_school_round(n):
    # Round to the nearest 10, always sending a trailing 5 up
    remainder = n % 10
    if remainder >= 5:
        return n + (10 - remainder)
    else:
        return n - remainder


# Generate 1,000,000 random integers between 1 and 100 (inclusive)
random_numbers = [random.randint(1, 100) for _ in range(1_000_000)]
middleschool_rounded = [middle_school_round(x) for x in random_numbers]
# Banker's rounding: Python's built-in round (round half to even);
# ndigits=-1 rounds to the nearest 10
bankers_rounded = [round(x, -1) for x in random_numbers]
print(f'Random numbers: {random_numbers[:10]}')
print(f'Middle school rounded: {middleschool_rounded[:10]}')
print(f'Banker rounded: {bankers_rounded[:10]}')
print(f'Mean after middle school: {np.mean(middleschool_rounded)}')
print(f'Mean after banker: {np.mean(bankers_rounded)}')
print(f'Original mean: {np.mean(random_numbers)}')
> Mean after middle school: 50.96469
> Mean after banker: 50.4635
> Original mean: 50.463131
We can see very little divergence from the original mean with banker’s rounding, but a shift of about 0.5 upward with the middle-school 5-goes-up rounding.
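The size of that upward shift can also be worked out exactly instead of sampled. This is a minimal sketch, assuming every last digit is equally likely (here, each integer from 0 to 99 appears once). The helper half_up_to_ten is a hypothetical stand-in for the middle-school rule, not the function used in the simulation.

```python
def half_up_to_ten(n: int) -> int:
    """Round to the nearest 10, sending a trailing 5 up (hypothetical helper)."""
    return (n + 5) // 10 * 10


numbers = range(100)  # every last digit 0-9 appears equally often

# Total amount added to (or removed from) the sum by each rounding rule
half_up_shift = sum(half_up_to_ten(n) - n for n in numbers)
bankers_shift = sum(round(n, -1) - n for n in numbers)

print(half_up_shift / len(numbers))  # 0.5
print(bankers_shift / len(numbers))  # 0.0
```

The 0.5 per number is the same 500 units per 1000 numbers described earlier, and it lines up with the gap between the simulated means.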
Rounding with Floats
In theory, this works the same with floating point numbers (i.e., decimals). In practice, however, there are other potential pitfalls. I ran into one of these recently when building tests for a web app that uses pandas under the hood to do some intricate filtering and then calls mean. The test module I was writing relied on polars, but performed the same calculations to prove that the data was loaded and displayed correctly for different parameter sets.
Everything was working well until I ran into issues when checking the mean. I confirmed that polars and pandas were working with identical datasets and that I called mean on both, and then rounded the results using round(x, 2). Here’s what I’d see:
# pandas mean
8.475
# round(x, 2)
8.48
# polars mean
8.475
# round(x, 2)
8.47
Not a big deal. I should probably have been using a tolerance to confirm that the float values were close, like polars_value - 0.05 <= pandas_value <= polars_value + 0.05 (and pytest ships a helper for exactly this, pytest.approx). But why do the two results differ?
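As an aside, the tolerance check could be sketched with the standard library alone. The two values below are hypothetical stand-ins for the rounded results from each library, and the 0.05 tolerance is simply the one from the inequality in the text.

```python
import math

pandas_value = 8.48  # hypothetical rounded result from the pandas side
polars_value = 8.47  # hypothetical rounded result from the polars side

# Roughly the same check as
# polars_value - 0.05 <= pandas_value <= polars_value + 0.05
values_match = math.isclose(pandas_value, polars_value, abs_tol=0.05)
print(values_match)  # True
```

Inside a pytest test, assert pandas_value == pytest.approx(polars_value, abs=0.05) expresses the same idea.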
Let’s generate a dataset and see if we can replicate the discrepancy. First, with some basic math we can create a small dataset that has a mean of 8.475:
import pandas as pd
import polars as pl
values = [9] * 19 + [8] * 21
pandas_df = pd.DataFrame({'value': values})
polars_df = pl.DataFrame({'value': values})
print(f'Mean: {pandas_df["value"].mean()}')
print(f'Mean: {polars_df["value"].mean()}')
> 8.475
> 8.475
Great, we have our dataset and have confirmed that it has a mean of 8.475 in both packages. Now, let’s round to the second decimal place (round(x, 2)):
print(f'Pandas Round: {round(pandas_df["value"].mean(), 2)}')
print(f'Polars Round: {round(polars_df["value"].mean(), 2)}')
> 8.48
> 8.47
The reason is that, although both means print as 8.475, the value actually stored is 8.47499999999999964 (8.475 has no exact binary floating-point representation), and the two calls to round end up in different rounding code. polars returns a plain Python float, so the built-in round sees that the stored value is just below the halfway point and rounds down to 8.47. pandas returns a NumPy float (numpy.float64), which brings its own rounding routine; that routine appears to treat the value as exactly 8.475 (or slightly above) and rounds up to 8.48 accordingly.
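Here is one way to see how the same stored value can round both ways, using only the standard library. The first strategy is Python’s built-in round on a plain float. The second (scale by 100, round, scale back) is an assumption on my part about what happens on the pandas side, modelled on how NumPy documents its rounding of floats. Treat it as a sketch rather than a trace of the actual code path.

```python
from decimal import Decimal

x = 8.475

# Decimal(float) exposes the exact binary value that is stored for the literal.
stored = Decimal(x)
print(stored < Decimal("8.475"))  # True: the stored value sits just below 8.475

# Strategy 1: built-in round on a plain float works from the stored value.
builtin_result = round(x, 2)

# Strategy 2 (assumed): scale by 100, round to an integer, scale back.
scaled = x * 100  # the multiplication itself rounds, landing on 847.5
scaled_result = round(scaled) / 100

print(builtin_result)  # 8.47
print(scaled_result)   # 8.48
```

Multiplying by 100 lands exactly on 847.5, so the tie-breaking rule kicks in and picks the even neighbour, 848.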
Summary
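The built-in round function performs banker’s rounding: i.e., it rounds <=4 down, >=6 up, and 5 to the nearest even digit. That rule only applies when the stored value is an exact tie, though, and most decimal fractions (8.475 included) are not stored exactly as floats. Libraries like pandas and polars can also route round through different rounding code, and this can cause subtle differences in how numbers are actually rounded.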
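If the goal is the "in theory" answer regardless of which library produced the number, one option is to round the decimal text of the value instead of its binary form. The sketch below uses the standard decimal module. The helper name round_half_even_decimal is hypothetical, and converting through str assumes the shortest printed form of the float (8.475 here) is the number you actually mean.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even_decimal(value: float, places: int = 2) -> Decimal:
    """Banker's rounding applied to the printed decimal form of value (hypothetical helper)."""
    quantum = Decimal(10) ** -places  # Decimal('0.01') for places=2
    # float(...) normalizes library scalar types to a plain Python float first
    return Decimal(str(float(value))).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_even_decimal(8.475))  # 8.48
print(round_half_even_decimal(8.465))  # 8.46
```

The result is a Decimal, not a float, so converting it back with float() is only safe once no further rounding is needed.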
Post scriptum
# implement rounding in your own class
class SillyString:
    def __init__(self, s: str):
        self._s = s

    def __round__(self, ndigits=0):
        # "Round" by dropping the last ndigits characters; 0 keeps the string whole
        end = max(len(self._s) - ndigits, 0)
        return type(self)(self._s[:end])

    def __repr__(self):
        return self._s


s = SillyString('hello!')
print(round(s, 2))
> hell