Context Managers and the Fencepost Problem

In this write-up, I want to discuss a more encapsulated solution to the fencepost problem, one which relies on Python’s context managers. By ‘encapsulated’, I mean ‘hidden from the user’, or ‘handled by the object’ in an object-oriented programming sense.

Before starting, let’s digress briefly into the fencepost problem (at least as I was taught it, or have subsequently misunderstood it). Suppose you want to build a fence with 10 sections/spans: how many fenceposts do you need? Well, if you imagine an existing fence, each section/span of fencing requires 1 additional fencepost. However, when starting from nothing, you need to start (or end) with an extra fencepost, so you’ll need 11. We can see this with a simple Python program.

def fence_builder(num_fences):
    for i in range(num_fences):
        print('=', end='')  # fence
        print('|', end='')  # fencepost


fence_builder(10)

#> =|=|=|=|=|=|=|=|=|=|

We have a very nice looking fence here, but we’re missing one fencepost. This cannot be handled within the loop. In order to build our fence, we’ll have to add another post before or after the for-loop.

def fence_builder(num_fences):
    print('|', end='')  # starting fencepost
    for i in range(num_fences):
        print('=', end='')  # fence
        print('|', end='')  # fencepost


fence_builder(10)

#> |=|=|=|=|=|=|=|=|=|=|

Success! Now, this function can build fences of an arbitrary length (where num_fences > 0). I think of a fencepost problem as any case in programming where additional processing which feels like it should be done in the for-loop must be done outside of it.

Let’s consider a slightly more practical example. We have a list of words which we need to print out, joined by commas. (Yes, we could just use ', '.join(words)…) We can do this either by separating out the first word (listise, below) or by treating the last word specially (listise2).

def listise(words):
    print(words[0], end='')  # first fencepost
    for word in words[1:]:
        print(f', {word}', end='')   # fence span + fencepost
    print()  # add final newline (another sort of fencepost)


def listise2(words):
    for word in words[0:-1]:
        print(f'{word}, ', end='')
    print(words[-1])  # final fencepost


listise(['cheese', 'butter', 'broccoli', 'bread'])
#> cheese, butter, broccoli, bread
listise(['cheese'])
#> cheese
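
For comparison, here is a minimal sketch of the ', '.join(words) shortcut mentioned above. str.join places the separator only between elements, so the fencepost is handled inside the call itself. The name listise_join is hypothetical, and it returns the string rather than printing it.

```python
def listise_join(words):
    # str.join inserts ', ' only between items, never before the first
    # or after the last, so no explicit fencepost is needed here.
    return ', '.join(words)


print(listise_join(['cheese', 'butter', 'broccoli', 'bread']))
#> cheese, butter, broccoli, bread
print(listise_join(['cheese']))
#> cheese
```

This only works because the per-item processing is trivial; the hand-written loops are still the pattern to study for the harder cases that follow.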

We can make this a little more complex by requiring that the last element be preceded by ‘, and’.

def listise(words):
    print(words[0], end='')
    for word in words[1:-1]:
        print(f', {word}', end='')
    if len(words) > 1:
        print(f', and {words[-1]}')
    else:
        print()  # single word: still terminate the line


def listise2(words):
    for word in words[0:-1]:
        print(f'{word}, ', end='')
    if len(words) > 1:
        print('and ', end='')
    print(words[-1])

These new forms of listise require additional ‘fenceposts’ to support the for-loop’s operation. This tends to cost some readability, and as the operations inside a ‘fencepost’ become more complex, it becomes easier either to forget the final fencepost (i.e., the one outside the for-loop) or to change the loop body without updating the fencepost to match. The repetition annoys me too.

Taking one final example (which, in a slightly different context, I face quite regularly), let’s suppose we have a streaming dataset of rows of (user_id, date, comment/tweet/etc.). These are ordered by user_id and date, and we need to ‘join’ the comment/tweet/etc. field by unique user/date. We might also stipulate that the entire dataset cannot fit into memory (or that we want to process it as a stream) in order to rule out pandas’ groupby.

To do this, we’ll need to keep track of the current user (curr_user), current date (curr_date), and the current comments (curr_comments). Whenever the date or user changes, we’ll output (in our case, ‘yield’) the comments so far, and reset them.

Let’s first take a look at our dataset:

# dataset.csv <- this is our input dataset
0,2023-06-10,"Hello"
0,2023-06-10,","
0,2023-06-10,"world!"
0,2023-06-20,"Bonjour"
1,2023-05-20,"The sun is shining."
1,2023-05-20,"And now it's not."

We want all of the ‘comments’ (the final string value) to be joined together when they were written by the same user on the same date.

import csv


def process_dataset(file):
    curr_date = None  # current date, default to None
    curr_user = None  # current user, default to None
    curr_comments = None  # list of comments currently being processed
    with open(file) as fh:
        for user_id, date, comment in csv.reader(fh):
            if user_id != curr_user:  # new user
                if curr_user is not None:  # don't return results if this is the first record
                    yield curr_user, curr_date, ' '.join(curr_comments)
                curr_user = user_id
                curr_date = date
                curr_comments = []
            elif date != curr_date:  # same user, new date
                yield curr_user, curr_date, ' '.join(curr_comments)
                curr_date = date
                curr_comments = []  # reset the comments for the new date
            curr_comments.append(comment)
    # final fencepost
    if curr_user is not None:  # make sure at least one element appeared in the file
        yield curr_user, curr_date, ' '.join(curr_comments)


# call the function
for user, date, comments in process_dataset('dataset.csv'):
    print(user, date, comments)  # print results

#> 0 2023-06-10 Hello , world!
#> 0 2023-06-20 Bonjour
#> 1 2023-05-20 The sun is shining. And now it's not.

If you remove the final fencepost, the final line of output will not be printed. Unfortunately, we have to repeat some processing. This is fine (it works), but I recently experimented with a context manager to see how this might look.
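
As a warm-up, the idea can be sketched on the earlier fence example: the closing fencepost moves out of the caller and into __exit__. This is only an illustration under my own assumptions; the Fence class, its span method, and the use of io.StringIO to collect output are all hypothetical and not part of the original code.

```python
import io


class Fence:
    """Hypothetical helper: writes the closing fencepost in __exit__."""

    def __init__(self, out):
        self.out = out  # any object with a write() method

    def __enter__(self):
        return self

    def span(self):
        # Each span brings its own leading post, so the loop body stays uniform.
        self.out.write('|=')

    def __exit__(self, exc_type, exc_val, exc_tb):
        # The final fencepost lives here instead of after the caller's loop.
        self.out.write('|')
        return False  # do not suppress exceptions


def fence_builder(num_fences):
    out = io.StringIO()
    with Fence(out) as fence:
        for _ in range(num_fences):
            fence.span()
    return out.getvalue()


print(fence_builder(10))
#> |=|=|=|=|=|=|=|=|=|=|
```

Note that __exit__ also runs when the body of the with-block raises, so a more careful version might check exc_type before writing the final post.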

For this example to work easily, I’m going to slightly alter the requirements of process_dataset to print out the results (rather than yielding them). This suggests some of the limitations…but pay attention to the size of the process_dataset function itself. For better or worse, we’ve encapsulated some of the redundant logic and made way for reuse. In addition, we’ve hidden the last fencepost in the __exit__ method.

import csv


class CM:

    def __init__(self):
        self.curr_date = None
        self.curr_user = None
        self.curr_comments = None

    def print(self):
        if self.curr_user is not None:
            print(self.curr_user, self.curr_date, ' '.join(self.curr_comments))

    def process(self, user_id, date, comment):
        if user_id != self.curr_user:  # new user
            self.print()
            self.curr_user = user_id
            self.curr_date = date
            self.curr_comments = []
        elif date != self.curr_date:  # same user, new date
            self.print()
            self.curr_date = date
            self.curr_comments = []  # reset the comments for the new date
        self.curr_comments.append(comment)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.print()  # final fencepost: flush whatever is still pending


def process_dataset(file):
    with CM() as cm, open(file) as fh:
        for user_id, date, comment in csv.reader(fh):
            cm.process(user_id, date, comment)


process_dataset('dataset.csv')
#> 0 2023-06-10 Hello , world!
#> 0 2023-06-20 Bonjour
#> 1 2023-05-20 The sun is shining. And now it's not.

I’m not entirely sure that this is ‘better’, but it is an interesting alternative.
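
For comparison, here is a sketch that keeps the generator interface by swapping in a different technique: the standard library’s itertools.groupby. It consumes its input lazily and closes the last group itself, so it fits the streaming constraint as long as the rows really are sorted by user and date. The join_comments name and the inline rows list are assumptions made for illustration.

```python
import csv
from itertools import groupby
from operator import itemgetter


def join_comments(rows):
    """Yield (user_id, date, joined_comments) from rows sorted by user and date."""
    # groupby starts a new group whenever the (user_id, date) key changes,
    # and it also finishes the last group, so no trailing fencepost is needed.
    for (user_id, date), group in groupby(rows, key=itemgetter(0, 1)):
        yield user_id, date, ' '.join(comment for _, _, comment in group)


def process_dataset(file):
    # Same interface as the generator version above, reading from a CSV file.
    with open(file, newline='') as fh:
        yield from join_comments(csv.reader(fh))


# Inline rows standing in for dataset.csv
rows = [
    ['0', '2023-06-10', 'Hello'],
    ['0', '2023-06-10', ','],
    ['0', '2023-06-10', 'world!'],
    ['0', '2023-06-20', 'Bonjour'],
    ['1', '2023-05-20', 'The sun is shining.'],
    ['1', '2023-05-20', "And now it's not."],
]
for user, date, comments in join_comments(rows):
    print(user, date, comments)
#> 0 2023-06-10 Hello , world!
#> 0 2023-06-20 Bonjour
#> 1 2023-05-20 The sun is shining. And now it's not.
```

The fencepost has not disappeared here; it has moved into groupby, which is much the same trade the context manager makes with __exit__.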