I’m quite accustomed to looking at performance against some gold (or silver) standard. It’s nice to have some ready definition of ‘truth’ and then, when applying some algorithm, we can clearly see if it matched or failed to match. More recently, however, I was attempting to compare the outputs of multiple UMLS-processing NLP systems on…