I’m quite accustomed to looking at performance against some gold (or silver) standard. It’s nice to have some ready definition of ‘truth’ and then, when applying some algorithm, we can clearly see if it matched or failed to match.
More recently, however, I was attempting to compare the outputs of multiple UMLS-processing NLP systems on a large dataset of notes. By default, these pieces of text are divided into multiple lines for storage (so that the database column can have a fixed width). Merging these lines back together can be somewhat complex in certain programming languages (e.g., SAS), so did we really need to do it? The analysis pitted cTAKES against MetaMap and MetaMapLite, using different underlying datasets and configurations. Each of these parameter sets was run against a corpus where the notes were reconstituted, and another where they were not. If the performance was more or less equivalent, we could dispense with the costly process of joining all these lines.
The experiments were run, and the data extracted using the command line tools of mml_utils. We end up with a dataframe for each experiment (i.e., NLP tool with a particular parameter set), where each variable/column represents a CUI, each record/row represents a unique note/document, and each value is the count of that CUI in the note. We can concatenate these dataframes by moving the experiment name into a separate variable/column.
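As a minimal sketch of that step (the toy per-experiment frames below are made up to match the excerpt; in practice they come from the mml_utils extraction):

import pandas as pd

# toy per-experiment dataframes standing in for the real mml_utils output
frames = {
    'ctakes1': pd.DataFrame({'docid': [0, 1], 'C0002792': [1, 0], 'C0013404': [4, 0]}),
    'ctakes2': pd.DataFrame({'docid': [0, 1], 'C0002792': [0, 0], 'C0013404': [3, 0]}),
}

# tag each frame with its experiment name, then stack them vertically
df = pd.concat(
    [frame.assign(experiment=name) for name, frame in frames.items()],
    ignore_index=True,
)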
| docid | C0002792 | C0013404 | … | experiment |
| --- | --- | --- | --- | --- |
| 0 | 1 | 4 | … | ctakes1 |
| 1 | 0 | 0 | … | ctakes1 |
| … | … | … | … | … |
| 0 | 0 | 3 | … | ctakes2 |
| 1 | 0 | 0 | … | ctakes2 |
Jaccard similarity does not care about the counts themselves, but only the presence/absence of a particular attribute. We will take the overlap (i.e., the number of times the same CUI was found in the same note by both experiments) divided by the total number of CUIs found by either experiment (or both). We ignore cases where neither method found the CUI. Thus, in the table above, when computing the Jaccard similarity for docid 1, neither C0002792 nor C0013404 will contribute since neither was found.
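In set notation, for a given note, J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of CUIs each experiment found in that note.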
| | ctakes1 has count = 0 | ctakes1 has count >= 1 |
| --- | --- | --- |
| ctakes2 has count = 0 | Ignore | Denominator |
| ctakes2 has count >= 1 | Denominator | Numerator and Denominator |
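In the notation the code below will use, this is J = m11 / (m11 + m01 + m10), where m11 counts the CUIs found by both experiments, and m10 and m01 count the CUIs found by only the first or only the second experiment, respectively.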
For our little excerpt, the results for docid 0 would be the following:
| docid = 0 | ctakes1 has count = 0 | ctakes1 has count >= 1 |
| --- | --- | --- |
| ctakes2 has count = 0 | 0 | 1 |
| ctakes2 has count >= 1 | 0 | 1 |
Our Jaccard similarity for docid 0 is 1 / (1 + 0 + 1), or 0.5. For docid 1, in our current excerpt, the Jaccard similarity is undefined (i.e., there is nothing to compare).
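As a quick sanity check, here is the same arithmetic in Python, with the sets copied from the excerpt above:

l_cuis = {'C0002792', 'C0013404'}  # CUIs found by ctakes1 in docid 0
r_cuis = {'C0013404'}              # CUIs found by ctakes2 in docid 0

print(len(l_cuis & r_cuis) / len(l_cuis | r_cuis))  # 0.5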
This seemed easy to implement: we’re just calculating the intersection of CUIs divided by their union. Here’s my naive implementation:
import pandas as pd

dfs = []
experiments = list(df['experiment'].unique())
cuis = [x for x in df.columns if x.startswith('C')]
docids = list(df.docid.unique())
for i, lexp in enumerate(experiments[:-1]):
    ldf = df[df.experiment == lexp][['docid'] + cuis].set_index('docid').sort_index()
    for rexp in experiments[i + 1:]:
        name = f'{lexp}-{rexp}'
        rdf = df[df.experiment == rexp][['docid'] + cuis].set_index('docid').sort_index()
        jaccard_coefs = []
        for docid in docids:
            # for each doc, get the cuis found at least once
            # (note: .T transposes the whole frame on every iteration;
            # ldf.loc[docid] would retrieve the same row without it)
            ls = ldf.T[docid]
            l_cuis = set(ls[ls > 0].index)
            rs = rdf.T[docid]
            r_cuis = set(rs[rs > 0].index)
            try:
                jac_coef = len(l_cuis & r_cuis) / len(l_cuis | r_cuis)
            except ZeroDivisionError:
                continue  # neither experiment found any CUI in this doc
            jaccard_coefs.append((docid, jac_coef, name))
        df_ = pd.DataFrame(jaccard_coefs, columns=['docid', 'jaccard', 'name'])
        dfs.append(df_)
jac_df = pd.concat(dfs)
The constant indexing and sorting of dataframes is problematic, and iterating through so many dataframes seems less than ideal. It might be better if all of the notes and experiments were set as columns. Would this be a more efficient approach?
Let’s reformat the dataframe:
df = df.set_index(['experiment', 'docid']).T
| cui \ (experiment, docid) | ctakes1, 0 | ctakes1, 1 | ctakes2, 0 | ctakes2, 1 |
| --- | --- | --- | --- | --- |
| C0002792 | 1 | 0 | 0 | 0 |
| C0013404 | 4 | 0 | 3 | 0 |
| … | … | … | … | … |
And our implementation with this quite wide dataframe:
dfs = []  # TODO: replace with `jaccard_coefs`
for i, lexp in enumerate(experiments[:-1]):
    for rexp in experiments[i + 1:]:
        name = f'{lexp}-{rexp}'
        jaccard_coefs = []  # TODO: this should be moved outside the for loops
        for docid in docids:
            # for each doc, get cuis where at least one appears in doc
            ls = df[lexp][docid]
            l_cuis = set(ls[ls > 0].index)
            rs = df[rexp][docid]
            r_cuis = set(rs[rs > 0].index)
            try:
                jac_coef = len(l_cuis & r_cuis) / len(l_cuis | r_cuis)
            except ZeroDivisionError:
                continue
            jaccard_coefs.append((docid, jac_coef, name))
        df_ = pd.DataFrame(jaccard_coefs, columns=['docid', 'jaccard', 'name'])
        dfs.append(df_)
jac2_df = pd.concat(dfs)  # TODO: just create the dataframe once here
Unfortunately, this takes three times as long (from 15 to 45 minutes).
Another, similar approach compares each column directly rather than building set unions/intersections. We’ll also fix our loop to stop creating a bunch of unnecessary dataframes, and simplify all the counts to 1 (we only care about presence/absence).
df[df > 0] = 1  # binarize: we only care about presence/absence

jaccard_coefs = []
for i, lexp in enumerate(experiments[:-1]):
    for rexp in experiments[i + 1:]:
        name = f'{lexp}-{rexp}'
        for docid in docids:
            # for each doc, count the cuis found by only one experiment or by both
            # (int() so that a zero denominator raises ZeroDivisionError;
            # numpy integer division would silently return nan/inf instead)
            m01 = int((df[lexp][docid].eq(0) & df[rexp][docid].eq(1)).sum())
            m10 = int((df[lexp][docid].eq(1) & df[rexp][docid].eq(0)).sum())
            m11 = int((df[lexp][docid].eq(1) & df[rexp][docid].eq(1)).sum())
            try:
                jac_coef = m11 / (m11 + m01 + m10)
            except ZeroDivisionError:
                continue
            jaccard_coefs.append((docid, jac_coef, name))
jac3_df = pd.DataFrame(jaccard_coefs, columns=['docid', 'jaccard', 'name'])
This took even longer… I stopped it after an hour.
The inefficiency here is likely due to repeatedly selecting individual columns. What if entire columns could be compared against each other in one shot? Instead of the index being just CUIs, it might be more efficient to use a multi-index of CUI and document.
First, we’ll place this into a long table format (via pd.melt):
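The reshaping call itself isn’t shown above, but a sketch of it, assuming we start again from the original wide frame with its experiment and docid columns, might be:

# melt the per-CUI columns into (cui, value) pairs, keeping the
# experiment and docid identifiers as regular columns
df = df.melt(id_vars=['experiment', 'docid'], var_name='cui', value_name='value')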
| experiment | docid | cui | value |
| --- | --- | --- | --- |
| ctakes1 | 0 | C0002792 | 1 |
| ctakes1 | 0 | C0013404 | 1 |
| ctakes1 | 1 | C0002792 | 0 |
| ctakes1 | 1 | C0013404 | 0 |
| ctakes2 | 0 | C0002792 | 0 |
| ctakes2 | 0 | C0013404 | 1 |
| ctakes2 | 1 | C0002792 | 0 |
| ctakes2 | 1 | C0013404 | 0 |
Note that `df[df >= 1] = 1` has been applied. Next, we’ll pivot the table so that each column represents an experiment:
df = df.pivot(index=['docid', 'cui'], columns='experiment', values='value')
| docid | cui | ctakes1 | ctakes2 | … |
| --- | --- | --- | --- | --- |
| 0 | C0002792 | 1 | 0 | … |
| 0 | C0013404 | 1 | 1 | … |
| 1 | C0002792 | 0 | 0 | … |
| 1 | C0013404 | 0 | 0 | … |
| … | … | … | … | … |
Now, we can calculate the Jaccard similarity for the entire experiment:
jaccard_coefs = []
for i, lexp in enumerate(experiments[:-1]):
    for rexp in experiments[i + 1:]:
        name = f'{lexp}-{rexp}'
        # count the cui/doc pairs found by only one experiment or by both
        # (int() again avoids silent nan from numpy integer division)
        m01 = int((df[lexp].eq(0) & df[rexp].eq(1)).sum())
        m10 = int((df[lexp].eq(1) & df[rexp].eq(0)).sum())
        m11 = int((df[lexp].eq(1) & df[rexp].eq(1)).sum())
        try:
            jac_coef = m11 / (m11 + m01 + m10)
        except ZeroDivisionError:
            continue
        jaccard_coefs.append((jac_coef, name))
jac4_df = pd.DataFrame(jaccard_coefs, columns=['jaccard', 'name'])
While this provides the similarity of the entire experiment, it doesn’t allow us to see how the similarity varies across documents (though see the sketch below for a possible per-document variant). With the first three methods, which all produce an identical dataset, we can build a chart to visualise how the notes vary between methods.
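For that per-document variant, one sketch (my own untested assumption, not something I benchmarked) would replace the global sums inside the same double loop with sums grouped by the docid level of the multi-index:

# hypothetical per-document variant: group the boolean comparisons by docid
both = (df[lexp].eq(1) & df[rexp].eq(1)).groupby(level='docid').sum()
either = (df[lexp].eq(1) | df[rexp].eq(1)).groupby(level='docid').sum()
jac_by_doc = (both / either).dropna()  # NaN where neither experiment found any CUI in a doc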
Visualisation with seaborn
We can visualise the distribution of Jaccard similarity using seaborn.
import matplotlib.pyplot as plt
import seaborn as sns

names = list(jac_df.name.unique())
plt.figure(figsize=(8, len(names)))  # scale the height to the number of pairs
ax = sns.violinplot(
    data=jac_df,
    y='name', x='jaccard', scale='width',
)
This will show, for example, the close similarity between two configurations of the same NLP tool (ct = ctakes) versus the comparison against a separate NLP tool (mm = metamap).

[Violin plots of the Jaccard similarity distribution for each experiment pair.]
A heatmap can also be particularly instructive, showing at a glance how the different methods compare. In the following, we aggregate by taking the mean, though the median would also be worth exploring.
# split name into left/right experiment labels
jac_df['left'] = jac_df.name.str.split('-').str[0]
jac_df['right'] = jac_df.name.str.split('-').str[-1]
# average the per-document similarities; this only fills one triangle of the matrix
c_df = jac_df.groupby(['left', 'right'])['jaccard'].mean().reset_index()
# create a second dataframe with the left/right labels switched
c2_df = c_df.copy()
c2_df.columns = ['right', 'left', 'jaccard']
# combine the two triangles into a symmetric matrix
hm_df = pd.concat((c_df, c2_df)).pivot(
    index='left', columns='right', values='jaccard'
).fillna(1)  # fill the diagonal, where an algorithm is compared to itself
sns.heatmap(hm_df)

[Heatmap of mean Jaccard similarity between each pair of experiments.]

ct: ctakes, mm: metamap, mml: metamaplite.