zipfile — Work with ZIP archives

Python is probably not your first thought when it comes to opening zip archives or compressing directories. In fact, if you’re like me, zip means something rather different… For most zip-handling needs, your favourite shell or GUI file manager will do the job. And if you just want Python to emulate that behaviour and pack or unpack a whole archive, the zipfile library is probably not for you: shutil will serve you better with shutil.make_archive and shutil.unpack_archive.
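For comparison, here is a minimal sketch of that shutil round trip. The temporary directory and the file contents are my own stand-ins for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Build a small throwaway directory tree to archive (illustrative names).
workdir = Path(tempfile.mkdtemp())
source = workdir / 'zipme'
(source / 'inner').mkdir(parents=True)
(source / 'outer.txt').write_text('Hello, ')
(source / 'inner' / 'inner.txt').write_text('world!')

# make_archive appends the '.zip' suffix itself and returns the archive path.
archive_path = shutil.make_archive(
    str(workdir / 'zipme'), 'zip', root_dir=workdir, base_dir='zipme'
)

# unpack_archive infers the format from the file extension.
target = workdir / 'unpacked'
shutil.unpack_archive(archive_path, target)

print(sorted(p.relative_to(target).as_posix() for p in target.rglob('*')))
```

Both calls work on the archive as a whole, which is exactly why they are convenient and also why they stop being enough once you want to pick and choose.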

When does the zipfile module come into play? The zipfile module is useful when you want to do more than unpack or create a single zip archive. Suppose you need to work with some sort of transaction data that is stored in a directory, with each day’s transactions in its own zip archive. To do your analysis, you need to open all the zip archives and pull out the important component without wasting space by unzipping everything.

Or, perhaps, you have annotations on some text data where each ‘document’ is stored in a zip file with the original text, added tags, etc.? In these and many other cases, the zipfile module can provide a tool to skillfully extract what you need.

Useful Functions

Let’s start with a simple, nested directory structure to explore.

zipme/  <- directory
        outer.txt  <- file inside `zipme/` directory
        inner/  <- directory inside `zipme/` directory
                inner.txt

All of these directories/files could be placed inside zipme.zip:

zipme.zip/
        zipme/   <- inside `zipme.zip` archive
                outer.txt
                inner/
                        inner.txt

Unpack/Build Archive

Even though I said that shutil is probably the best option for unpacking a zip file, let’s see how we would do it. This will allow us to get acquainted.

from zipfile import ZipFile  # import 


with ZipFile('zipme.zip') as zipr:  # context manager to open
    zipr.extractall()  # extract all elements into current directory

This code will extract the files into the current directory. If we want to specify an output directory, we can supply that as an argument: zipr.extractall(path).
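As a rough sketch of both options, the snippet below extracts everything into a chosen directory and then pulls out just one member with zipr.extract. The temporary directory, the 'everything' and 'just_one' folder names, and the file contents are assumptions made up for the example.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

workdir = Path(tempfile.mkdtemp())
archive = workdir / 'zipme.zip'

# Create a tiny archive to experiment with (contents are illustrative).
with ZipFile(archive, 'w') as zipw:
    zipw.writestr('zipme/outer.txt', 'Hello, ')
    zipw.writestr('zipme/inner/inner.txt', 'world!')

with ZipFile(archive) as zipr:
    # Extract every member below the given directory.
    zipr.extractall(workdir / 'everything')
    # Extract a single member; the return value is the path that was created.
    extracted = zipr.extract('zipme/outer.txt', workdir / 'just_one')

print(extracted)
```

Note that extract keeps the member's path inside the archive, so the file lands in just_one/zipme/outer.txt rather than directly in just_one/.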

We can reverse this process using write(filename, arcname). In this context, filename is the location of the file on the filesystem, and arcname is the name/path to be used within the archive itself (i.e., the archive name).

from pathlib import Path
from zipfile import ZipFile


path = Path('zipme')
with ZipFile('zipme.zip', 'w') as zipr:
    zipr.write(path / 'outer.txt', 'zipme/outer.txt')
    zipr.mkdir('zipme/inner')  # create the 'inner' directory entry (requires Python 3.11+)
    zipr.write(path / 'inner' / 'inner.txt', 'zipme/inner/inner.txt')

After opening the archive in write mode, we copy ‘outer.txt’ from the filesystem into the archive, create an ‘inner’ directory entry, and then copy ‘inner.txt’ from the filesystem to its cozy place within the archive. The archive names start with ‘zipme/’ so that the result mirrors the layout shown earlier. Technically, the zipr.mkdir call is redundant: the path in the following write already implies the directory, which will be created on extraction. You’d only need mkdir (added in Python 3.11) for a directory with nothing to put inside. (Not sure why you’d do that…?)
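Writing each file by hand gets tedious quickly. One possible way to archive a whole directory tree is to walk it with Path.rglob and derive each arcname from the path relative to the parent directory. This is only a sketch: the temporary directory and file contents are invented for the example, and I have opted for ZIP_DEFLATED because ZipFile stores files uncompressed by default.

```python
import tempfile
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile

# Recreate the example tree in a throwaway location (illustrative contents).
workdir = Path(tempfile.mkdtemp())
source = workdir / 'zipme'
(source / 'inner').mkdir(parents=True)
(source / 'outer.txt').write_text('Hello, ')
(source / 'inner' / 'inner.txt').write_text('world!')

archive = workdir / 'zipme.zip'
with ZipFile(archive, 'w', compression=ZIP_DEFLATED) as zipw:
    for item in sorted(source.rglob('*')):
        # Make the arcname relative to the parent so entries start with 'zipme/'.
        zipw.write(item, item.relative_to(workdir).as_posix())

with ZipFile(archive) as zipr:
    names = zipr.namelist()
print(names)
```

Because rglob also yields directories, write adds an explicit entry for 'zipme/inner/' without any call to mkdir.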

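If the archive is password protected, supply the password as bytes, either through the pwd argument of extractall, extract, open, or read, or once for the whole archive with zipr.setpassword(pwd). Note that zipfile can decrypt password-protected archives but cannot create them.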

Print Zip Contents

You can print the contents of the zip archive to stdout using zipr.printdir():

with ZipFile('zipme.zip') as zipr:
    zipr.printdir()

## Output:
# File Name                                             Modified             Size
# zipme/inner/                                   2023-06-16 02:00:00            0
# zipme/inner/inner.txt                          2023-06-16 02:00:00            0
# zipme/outer.txt                                2023-06-16 02:00:00            0

More usefully, we can get the paths within the archive as a list of strings using zipr.namelist():

with ZipFile('zipme.zip') as zipr:
    print(zipr.namelist())  # returns relative paths
    print(zipr.filelist)  # returns files as ZipInfo objects (contain some metadata)

## Output:
# ['zipme/inner/', 'zipme/inner/inner.txt', 'zipme/outer.txt']
# [<ZipInfo filename='zipme/inner/' external_attr=0x10>, <ZipInfo filename='zipme/inner/inner.txt' external_attr=0x20 file_size=0>, <ZipInfo filename='zipme/outer.txt' external_attr=0x20 file_size=0>]

The zipr.filelist attribute can also be used (as can the equivalent zipr.infolist() method), and instead of strings it returns ZipInfo objects. These can be used to access additional metadata about the contained files (see below or the online documentation).
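To give a flavour of that metadata, here is a small sketch that prints a few ZipInfo attributes for each entry. The archive contents are invented for the example, and the repeated text is only there so that compression has something to work with.

```python
import tempfile
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile

# Build a small compressed archive to inspect (illustrative contents).
archive = Path(tempfile.mkdtemp()) / 'zipme.zip'
with ZipFile(archive, 'w', compression=ZIP_DEFLATED) as zipw:
    zipw.writestr('zipme/outer.txt', 'Hello, ' * 100)
    zipw.writestr('zipme/inner/inner.txt', 'world!')

sizes = {}
with ZipFile(archive) as zipr:
    for info in zipr.infolist():
        sizes[info.filename] = info.file_size
        # file_size is the uncompressed size, compress_size the size in the archive,
        # and date_time a (year, month, day, hour, minute, second) tuple.
        print(info.filename, info.file_size, info.compress_size, info.date_time)
```

All of this comes from the archive's directory listing, so none of the files needs to be decompressed to answer questions such as "how big is it?".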

Read an Archived File (in memory)

Now that we can open the archive and list its contents, why not peek into the files themselves? Here, we’ll use zipr.open within a context manager to read the files. Everything comes back as bytes, so we’ll need to decode from UTF-8 in order to get a string representation:

with ZipFile('zipme.zip') as zipr:
    with zipr.open('zipme/outer.txt', 'r') as fh:  # use the full path inside the archive
        print(fh.read().decode('utf8'))

## Output:
# Hello,

Using zipr.filelist or zipr.namelist(), we can collect the text from all files within the archive, since both provide the archive’s contents.

with ZipFile('zipme.zip') as zipr:
    for file in zipr.filelist:  # list of ZipInfo, so we can check if it's a directory
        if file.is_dir():
            continue  # skip directories
        with zipr.open(file, 'r') as fh:
            print(fh.read().decode('utf8'), end='')  # don't print a newline after the file contents

## Output:
# Hello, world!

Going through zipr.filelist, we work with ZipInfo objects. Their metadata lets us, for example:

check whether an entry is a directory or a file (ZipInfo.is_dir())
retrieve the filename to be able to, e.g., check the extension (ZipInfo.filename.endswith('.txt'))
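Putting those two checks together, a minimal sketch might read only the '.txt' members and skip everything else. The extra 'notes.csv' member is a made-up file that is there to be skipped, and wrapping the byte stream in io.TextIOWrapper is simply one way to get strings without calling decode by hand.

```python
import io
import tempfile
from pathlib import Path
from zipfile import ZipFile

# Build a small archive with one non-.txt member (illustrative contents).
archive = Path(tempfile.mkdtemp()) / 'zipme.zip'
with ZipFile(archive, 'w') as zipw:
    zipw.writestr('zipme/outer.txt', 'Hello, ')
    zipw.writestr('zipme/inner/inner.txt', 'world!')
    zipw.writestr('zipme/notes.csv', 'a,b\n1,2\n')  # hypothetical file to be skipped

texts = {}
with ZipFile(archive) as zipr:
    for info in zipr.infolist():
        # Skip directory entries and anything that is not a .txt file.
        if info.is_dir() or not info.filename.endswith('.txt'):
            continue
        # TextIOWrapper decodes the member's bytes as it is read.
        with zipr.open(info) as raw, io.TextIOWrapper(raw, encoding='utf8') as fh:
            texts[info.filename] = fh.read()

print(texts)
```

Nothing is written to disk here; each member is decompressed into memory only when it is opened.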

Parting Thoughts

The zipfile module is not one I commonly use, but when working with large data dumps where zip archives are frequent, or when you need to work with a large archive that you can’t or don’t want to unpack, accessing it via zipfile.ZipFile is quick and convenient.