In this write-up, I want to discuss a more encapsulated solution to the fencepost problem which relies on Python’s context managers. By ‘encapsulated’, I mean ‘hidden from the user’, or ‘handled by the object’ in an object-oriented programming sense.
Before starting, let’s digress briefly into the fencepost problem (at least as I was taught it, or have subsequently misunderstood it). Suppose you wanted to build a fence with 10 sections/spans: how many fenceposts do you need? If you imagine extending an existing fence, each section/span of fencing requires 1 additional fencepost. However, when starting from nothing, you need to start (or end) with an extra fencepost, so you’ll need 11. We can see this with a simple Python program.
    def fence_builder(num_fences):
        for i in range(num_fences):
            print('=', end='')  # fence
            print('|', end='')  # fencepost

    fence_builder(10)
    #> =|=|=|=|=|=|=|=|=|=|
We have a very nice looking fence here, but we’re missing one fencepost. This cannot be handled within the loop. In order to build our fence, we’ll have to add another post before or after the for-loop.
    def fence_builder(num_fences):
        print('|', end='')  # starting fencepost
        for i in range(num_fences):
            print('=', end='')  # fence
            print('|', end='')  # fencepost

    fence_builder(10)
    #> |=|=|=|=|=|=|=|=|=|=|
Success! Now, this function can build fences of an arbitrary length (where num_fences > 0). I think of a fencepost problem as any case in programming where additional processing which feels like it should be done in the for-loop must be done outside of it.
Let’s consider a slightly more practical example. We have a list of words which we’ll need to print out, joined by commas. (Yes, we could just use ', '.join(words)…) We can do this by either separating out the first word (listise, below) or treating the last word specially (listise2).
    def listise(words):
        print(words[0], end='')  # first fencepost
        for word in words[1:]:
            print(f', {word}', end='')  # fence span + fencepost
        print()  # add final newline (another sort of fencepost)

    def listise2(words):
        for word in words[0:-1]:
            print(f'{word}, ', end='')
        print(words[-1])  # final fencepost

    listise(['cheese', 'butter', 'broccoli', 'bread'])
    #> cheese, butter, broccoli, bread
    listise(['cheese'])
    #> cheese
We can make this a little more complex by requiring that the last element be preceded by ‘, and’.
    def listise(words):
        print(words[0], end='')  # first fencepost
        for word in words[1:-1]:
            print(f', {word}', end='')
        if len(words) > 1:
            print(f', and {words[-1]}')  # final fencepost
        else:
            print()  # single word: still end the line

    def listise2(words):
        for word in words[0:-1]:
            print(f'{word}, ', end='')
        if len(words) > 1:
            print('and ', end='')
        print(words[-1])  # final fencepost
These new forms of listise require additional ‘fenceposts’ to support the for-loop’s operation. This often seems to cost some readability, and as the operations inside a ‘fencepost’ become more complex, it gets easier either to forget the final fencepost (i.e., the one outside the for-loop) or to fail to update it when the loop body changes. The repetition annoys me too.
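As an aside, for this particular case str.join can absorb the inner fenceposts, leaving only the last separator as a special case. The following is a minimal sketch rather than a replacement for the versions above: the name listise_join is my own, it returns the string instead of printing it, and it assumes the same ‘, and’ convention (including for two-word lists).

```python
def listise_join(words):
    """Join words with commas, placing ', and ' before the last one."""
    if len(words) < 2:
        # zero or one word: there is nothing to separate
        return ''.join(words)
    # join handles the inner separators; only the last one is special
    return ', '.join(words[:-1]) + ', and ' + words[-1]


print(listise_join(['cheese', 'butter', 'broccoli', 'bread']))
#> cheese, butter, broccoli, and bread
print(listise_join(['cheese']))
#> cheese
```

This only works because the whole list is available up front, which is not the case in the streaming example that follows.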
Taking one final example (which, in a slightly different context, I face quite consistently), let’s suppose we have a streaming dataset of rows of (user_id, date, comment/tweet/etc.). These are ordered by user_id and date, and we need to ‘join’ the comment/tweet/etc. field by unique user/date. We might also stipulate that the entire dataset cannot fit into memory (or that we want to process it as a stream) in order to rule out something like pandas’ groupby.
To do this, we’ll need to keep track of the current user (curr_user), current date (curr_date), and the current comments (curr_comments). Whenever the date or user changes, we’ll output (in our case, ‘yield’) the comments so far, and reset them.
Let’s first take a look at our dataset:
    # dataset.csv <- this is our input dataset
    0,2023-06-10,"Hello"
    0,2023-06-10,","
    0,2023-06-10,"world!"
    0,2023-06-20,"Bonjour"
    1,2023-05-20,"The sun is shining."
    1,2023-05-20,"And now it's not."
We want all of the ‘comments’ (the final string value) to be joined together when they occur by the same user on the same date.
    import csv

    def process_dataset(file):
        curr_date = None      # current date, default to None
        curr_user = None      # current user, default to None
        curr_comments = None  # list of comments currently being processed
        with open(file, newline='') as fh:
            for user_id, date, comment in csv.reader(fh):
                if user_id != curr_user:  # new user
                    if curr_user is not None:  # don't return results if this is the first record
                        yield curr_user, curr_date, ' '.join(curr_comments)
                    curr_user = user_id
                    curr_date = date
                    curr_comments = []
                elif date != curr_date:  # same user, new date
                    yield curr_user, curr_date, ' '.join(curr_comments)
                    curr_date = date
                    curr_comments = []  # reset the comments for the new date
                curr_comments.append(comment)
        # final fencepost
        if curr_user is not None:  # make sure at least one element appeared in the file
            yield curr_user, curr_date, ' '.join(curr_comments)

    # call the function
    for user, date, comments in process_dataset('dataset.csv'):
        print(user, date, comments)  # print results
    #> 0 2023-06-10 Hello , world!
    #> 0 2023-06-20 Bonjour
    #> 1 2023-05-20 The sun is shining. And now it's not.
If you remove the final fencepost, the final line of output will not be printed. Unfortunately, we have to repeat some processing. This is fine (it works), but I recently experimented with a context manager to see how this might look.
For this example to work easily, I’m going to slightly alter the requirements of process_dataset to print out the results (rather than yielding them). This suggests some of the limitations…but pay attention to the size of the process_dataset function itself. For better or worse, we’ve encapsulated some of the redundant logic and made way for reuse. In addition, we’ve hidden the last fencepost in the __exit__ function.
    import csv

    class CM:
        def __init__(self):
            self.curr_date = None
            self.curr_user = None
            self.curr_comments = None

        def print(self):
            if self.curr_user is not None:  # nothing to print before the first record
                print(self.curr_user, self.curr_date, ' '.join(self.curr_comments))

        def process(self, user_id, date, comment):
            if user_id != self.curr_user:  # new user
                self.print()
                self.curr_user = user_id
                self.curr_date = date
                self.curr_comments = []
            elif date != self.curr_date:  # same user, new date
                self.print()
                self.curr_date = date
                self.curr_comments = []  # reset the comments for the new date
            self.curr_comments.append(comment)

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.print()  # the final fencepost

    def process_dataset(file):
        with CM() as cm, open(file, newline='') as fh:
            for user_id, date, comment in csv.reader(fh):
                cm.process(user_id, date, comment)

    process_dataset('dataset.csv')
    #> 0 2023-06-10 Hello , world!
    #> 0 2023-06-20 Bonjour
    #> 1 2023-05-20 The sun is shining. And now it's not.
I’m not entirely sure that this is ‘better’, but it is an interesting alternative.
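One way to soften the print-only limitation might be to hand the context manager a callback, so the caller decides what happens to each finished group. The sketch below is only an illustration of that idea: the names GroupJoiner, emit, flush and process_rows are hypothetical, the rows come from an in-memory copy of the sample data rather than dataset.csv, and (unlike the class above) I’ve assumed the final group should only be flushed when the block exits without an exception.

```python
import csv
import io


class GroupJoiner:
    """Hypothetical variant of CM: passes each finished group to `emit`."""

    def __init__(self, emit):
        self.emit = emit      # callable taking (user_id, date, joined_comments)
        self.key = None       # (user_id, date) of the group being built
        self.comments = []    # comments collected for the current group

    def flush(self):
        if self.key is not None:  # nothing to emit before the first record
            self.emit(*self.key, ' '.join(self.comments))

    def process(self, user_id, date, comment):
        if (user_id, date) != self.key:  # new user or new date
            self.flush()
            self.key = (user_id, date)
            self.comments = []
        self.comments.append(comment)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:  # assumption: only flush the last group on a clean exit
            self.flush()      # the hidden final fencepost
        return False          # never suppress exceptions


def process_rows(rows):
    results = []
    # the callback collects each (user_id, date, comments) tuple into a list
    with GroupJoiner(lambda *row: results.append(row)) as gj:
        for user_id, date, comment in rows:
            gj.process(user_id, date, comment)
    return results


# in-memory copy of the sample dataset
sample = io.StringIO(
    '0,2023-06-10,"Hello"\n'
    '0,2023-06-10,","\n'
    '0,2023-06-10,"world!"\n'
    '0,2023-06-20,"Bonjour"\n'
    '1,2023-05-20,"The sun is shining."\n'
    '1,2023-05-20,"And now it\'s not."\n'
)
for row in process_rows(csv.reader(sample)):
    print(*row)
#> 0 2023-06-10 Hello , world!
#> 0 2023-06-20 Bonjour
#> 1 2023-05-20 The sun is shining. And now it's not.
```

Collecting into a list gives up the streaming property, of course; the callback could just as well write to a file or push onto a queue.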