I was at an epidemiology conference about a month ago – not a typical location for a data scientist, but circumstances found me there. A number of sessions embraced a certain (albeit nervous) enthusiasm about access to decoder-only transformers, usually called ‘AI’ or ‘large language models’. There was a real buzz and excitement — can ChatGPT be used for X? How can we integrate these tools to improve efficiency? I was initially quite interested in listening to these talks to see what brilliant steps were being taken (and which I could imitate).
One session in particular dwelt on incorporating tools like ChatGPT into their workflows. Could abstraction, they asked, be performed with ‘AI’? How well did it perform (especially with respect to chart review)? Or, suppose we develop a particularly intricate pipeline involving (I forget the details) generation of codes/word lists, evaluation of structured data, etc., etc. The presenters proudly reported that ChatGPT had a performance of, say, 91.24% while Claude version whatever only managed 89.86%. There is nothing inherently wrong with asking whether ChatGPT or Claude can be leveraged as a tool within a larger pipeline – comparable to the way one might ask whether RandomForests might be leveraged for some task. The problem arose with the attempt to treat detailed performance metrics as an informative comparison between approaches. One audience member asked whether the presenters had any theories as to why ChatGPT outperformed Claude overall and why Claude may have performed better on one portion. The presenters made several suggestions. I came up and asked a straightforward question: were the results stable across runs? Would we anticipate the same performance across multiple runs?
No, of course not. We’re dealing with stochastic, closed-source algorithms. As a proof of concept, this is interesting — what sort of pipelines could be constructed? Would these actually work? How does this performance compare with a human’s? Is performance comparable across systems? These questions are different from attempting to do error analysis on a closed-source system.
First, let’s consider how we might do this with an open-source decoder-only transformer like Llama. We have complete control over the system: a single, fixed model that will behave stably. We control how the input is passed in and how the output is interpreted. In theory, we can control randomness by setting the seed:
import torch
import random
import numpy as np

seed = 42
random.seed(seed)                 # Python's built-in RNG
np.random.seed(seed)              # NumPy RNG
torch.manual_seed(seed)           # PyTorch CPU RNG
torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs, if CUDA is in use
# there might be some model-specific changes too, like disabling sampling
# (greedy decoding), or a generate() call that accepts a seed directly
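For a model hosted with Hugging Face transformers, those “model-specific changes” might look roughly like the following. This is a minimal sketch, assuming a local causal LM: the checkpoint name and prompt are purely illustrative, and transformers’ set_seed is just a convenience wrapper around the seeding calls above.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(seed)  # seeds random, numpy, and torch in one call (seed defined above)

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Does the following note mention diabetes? Answer yes or no.\nNote: ..."
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False means greedy decoding, which removes sampling randomness entirely
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))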
Now we have control over the random numbers, which, in theory, should give reproducible output. This is not yet enough: with stochastic systems we’ll want to see how they perform across multiple runs. If there is a clear (i.e., automatable) evaluation, we can generate a large number of runs as simulations and plot the distribution of results. If we did this with, e.g., Llama 2 and Llama 3, we might be able to give that conference talk and say whether one performed better than the other.
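A rough sketch of what that simulation loop could look like. Here run_pipeline is a placeholder that simulates a score so the aggregation code actually runs; in practice it would set the seed, run the full pipeline once, and score the output against gold labels.

import numpy as np

rng = np.random.default_rng(0)

def run_pipeline(model_name: str, seed: int) -> float:
    # Placeholder for one full run (prompt, generate, parse, score against
    # gold labels). Here it just simulates an accuracy value.
    return float(rng.normal(loc=0.90, scale=0.02))

n_runs = 100
models = ["llama-2", "llama-3"]  # illustrative names
results = {m: np.array([run_pipeline(m, seed=s) for s in range(n_runs)]) for m in models}

for model, scores in results.items():
    print(f"{model}: mean={scores.mean():.3f}, sd={scores.std():.3f}, "
          f"min={scores.min():.3f}, max={scores.max():.3f}")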
That’s with open-source systems, but what about closed-source systems? While the chat interface always looks the same, do you know whether each query actually gets processed the same way? What sort of control flow is your input/output gated through? Not to mention that there is no control over the random numbers, which makes the results non-reproducible. We would still want to evaluate performance across multiple runs, so we should at least run the query or pipeline through the system multiple times. How many times? How many runs through a closed-source system does it take to see the breadth of its responses?
One approach is to consider the variance of responses (in other words, how difficult/complex is the task?) and continue repeating only while the variance remains high. I’ve seen recommendations (well, practice rather than recommendations) of 3-5 repetitions, but this strikes me as a pragmatic simplification rather than a principled choice: 3-5 repetitions is roughly what remains practicable given the difficulty of evaluating free-text output by hand. I also wonder whether the same input even goes to the same underlying model, and how you can ensure that no additional contextual information (e.g., past prompt/dialogue history) is included.
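As a sketch of that “repeat until the answers stabilize” idea: keep querying the closed-source system and stop once the variance of the automatically computed scores falls below some threshold (or a run budget is exhausted). query_and_score is a hypothetical stand-in wrapping the API call plus the automatic evaluation, not a real client.

import statistics

def repeat_until_stable(query_and_score, min_runs=3, max_runs=20, var_threshold=1e-4):
    # query_and_score() is assumed to run the prompt/pipeline once and return
    # a numeric score.
    scores = []
    for _ in range(max_runs):
        scores.append(query_and_score())
        if len(scores) >= min_runs and statistics.variance(scores) < var_threshold:
            break
    return scores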
A few takeaways:
- Limit use of closed-source generative AI to proof-of-concept or human-in-the-loop applications.
- Prefer open-source generative AI for evaluation. It can be useful to compare its performance against the closed-source proof of concept from the first point.
- Prefer a method for automatically evaluating performance, to allow for large numbers of simulation runs; for example, instructions that constrain the output space (e.g., an explicit yes/no), as in the sketch below.
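As a sketch of that last point, assuming a yes/no chart-review style question: constrain the model to a single-word answer, map the raw text to a label, and score it programmatically (counting anything unparseable as part of the error analysis). The prompt template and helper names here are illustrative.

prompt_template = (
    "Does the following note mention {condition}? "
    "Answer with exactly one word: yes or no.\n\nNote: {note}"
)

def to_label(raw_output: str) -> str:
    # Map free text to a constrained label, flagging anything else.
    text = raw_output.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "unparseable"

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)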