• The third test consisted of measuring alignment with our quality assessment criteria. The test
consisted of 31 questions asking the annotators to grade different prompt-answer pairs, as well
as to rank different answers to the same prompt. To measure alignment, we first collected
responses from different team members; annotators who agreed with our preferences on
more than 26 of the 31 questions passed the test.
• Finally, the last test consisted of a prompt-response assessment in which annotators chose a minimum of
6 out of 18 prompts to write responses for. We manually assessed each response to evaluate production
readiness. Annotators who scored an average of >4 passed the training.
A.6 Dataset Contamination
With the increasing scale of publicly available training data, it has become inevitable that some portion of
evaluation data is seen during training and may provide an undue boost in evaluation performance.
Earlier work (Brown et al. (2020); Wei et al. (2022a); Du et al. (2022)) in measuring such dataset contamination
considered an example from an evaluation set to be “contaminated” if there existed a collision between
a high-order n-gram (generally, n = 13) from the sample and the training data. This was a deliberately
conservative approach in order to produce a “clean” subset of the data with high precision, and it is used in
open-sourced evaluation libraries (e.g., Gao et al. (2021)).
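For concreteness, this earlier criterion reduces to a simple n-gram lookup. A minimal sketch follows, assuming whitespace tokenization and a hypothetical pre-computed set `train_13grams` of all word 13-grams in the training data; real pipelines normalize text and use far more scalable lookup structures:

```python
def is_contaminated(sample_text, train_13grams, n=13):
    """Flag a sample as contaminated if any of its word 13-grams
    collides with a 13-gram seen in the training data."""
    words = sample_text.split()
    return any(
        tuple(words[i:i + n]) in train_13grams
        for i in range(len(words) - n + 1)
    )
```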
This approach, however, was unable to detect precisely what proportion of a given sample is contaminated,
and did not take into account how evaluation datasets are constructed. Furthermore, as noted in Chowdhery
et al. (2022), some datasets (such as BoolQ) contain contexts extracted verbatim from the web, but not the
question and answer continuation. As such, highly contaminated samples from these datasets are unlikely
to gain an unfair advantage. The methodology in Chowdhery et al. (2022) further improves on the earlier
n-gram collision detection by considering a sample to be contaminated if 70% of all 8-grams can be found at
least once in the training data.
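The Chowdhery et al. (2022) criterion replaces the binary collision test with an overlap fraction. A sketch under the same toy assumptions (a hypothetical set `train_8grams` of training 8-grams):

```python
def is_contaminated_70pct(sample_text, train_8grams, n=8, threshold=0.7):
    """Flag a sample as contaminated if at least 70% of its word
    8-grams occur at least once in the training data."""
    words = sample_text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    hits = sum(1 for g in ngrams if g in train_8grams)
    return hits / len(ngrams) >= threshold
```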
The previous methodologies noted above all consider contamination in text space and do not appear to
account for the formatting of prompts used for actual evaluation. In contrast, we match on tokenized
input, being careful to pass fully verbalized evaluation samples to the tokenizer. We also diverge from the
previous methodologies by considering contamination from a bottom-up perspective. We consider a token
to be contaminated if it appears in any token n-gram longer than 10 tokens in both the evaluation sample
and the training set, and define the contamination percentage of a sample to be the percentage of tokens
contaminated. This allows us to view the benchmark performance of our models on a range of contamination
scales, while retaining the ability to test a high-precision clean subset (samples with < 20% contamination)
and a high-precision contaminated subset (samples with > 80% contamination). In order to account for the
vagaries of the precise format of verbalized samples, we allow a small “skipgram budget” of four tokens, so
that matched spans between an evaluation sample and the training data can differ in at most four positions
(we do not allow trailing mismatches, or mismatches in the first 10 tokens).
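Under exact matching (setting aside the skipgram budget), the token-level definition admits a simple sketch. A token lies in a shared n-gram longer than 10 tokens exactly when it is covered by some shared 11-token window, so checking 11-grams suffices. The set `train_11grams` is a hypothetical stand-in for the suffix-array machinery described below:

```python
def contamination_pct(sample_tokens, train_11grams, n=11):
    """Percentage of tokens in a verbalized, tokenized evaluation
    sample that fall inside an 11-token window also present in the
    training data (exact matches only; the 4-token skipgram budget
    from the text is omitted for clarity)."""
    hit = [False] * len(sample_tokens)
    for i in range(len(sample_tokens) - n + 1):
        if tuple(sample_tokens[i:i + n]) in train_11grams:
            for k in range(i, i + n):
                hit[k] = True
    return 100.0 * sum(hit) / max(len(sample_tokens), 1)
```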
We identify such 10(+)-skipgrams with suffix arrays implemented using a variation of the library from Lee
et al. (2022), modified to work on a PySpark cluster (effectively without random access to disk). Given the
embarrassingly parallel nature of the task, we are able to find all such 10-grams (and their full lengths) in
our entire dataset in around seven hours (including time to tokenize), utilizing an estimated 1,500 cores.
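The suffix-array idea can be illustrated in miniature. The toy below is quadratic in memory and ignores the skipgram budget, unlike the streaming PySpark variant described above; it relies on the fact that if an evaluation suffix and a training suffix share a prefix of at least min_len tokens, at least one adjacent cross-source pair in the sorted suffix order shares that prefix as well:

```python
def shared_match_starts(train_tokens, eval_tokens, min_len=10):
    """Toy suffix-array search for token n-grams of length >= min_len
    occurring in both sequences. Returns (eval_offset, match_length)
    pairs found between neighbouring suffixes of different origin."""
    suffixes = [(tuple(train_tokens[i:]), "train", i) for i in range(len(train_tokens))]
    suffixes += [(tuple(eval_tokens[i:]), "eval", i) for i in range(len(eval_tokens))]
    suffixes.sort()
    matches = []
    for (a, src_a, ia), (b, src_b, ib) in zip(suffixes, suffixes[1:]):
        if src_a == src_b:
            continue
        # longest common prefix of the two neighbouring suffixes
        lcp = 0
        for x, y in zip(a, b):
            if x != y:
                break
            lcp += 1
        if lcp >= min_len:
            matches.append((ia if src_a == "eval" else ib, lcp))
    return matches
```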
As there are many confounding factors at play when determining whether dataset contamination has
contributed to evaluation performance (mostly stemming from the fact that the “clean” and “dirty” subsets do
not necessarily estimate the population distribution well), we make the following assumption: in the event
of dataset contamination contributing to evaluation performance, we expect both the “cleanest” examples to
have an overall worse average score than their complement, and the “dirtiest” samples to have an overall better
average score than their complement. If only one of these holds, it is insufficient evidence for contamination.
To this end, we define four (non-disjoint) subset types as follows:
• “Clean” samples, with less than 20% token contamination,
• “Not clean” samples, with greater than (or equal to) 20% token contamination,
• “Not dirty” samples, with less than 80% token contamination,
• “Dirty” samples, with greater than (or equal to) 80% token contamination.
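Given per-sample scores and contamination percentages, the two-sided criterion above can be checked directly. A minimal sketch, assuming a hypothetical list `samples` of (contamination_pct, score) pairs for a benchmark:

```python
def contamination_evidence(samples):
    """Return True only if both directions of the assumption hold:
    the clean subset (<20%) underperforms its complement AND the
    dirty subset (>=80%) outperforms its complement, in mean score."""
    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")
    clean     = mean([s for c, s in samples if c < 20])
    not_clean = mean([s for c, s in samples if c >= 20])
    not_dirty = mean([s for c, s in samples if c < 80])
    dirty     = mean([s for c, s in samples if c >= 80])
    return clean < not_clean and dirty > not_dirty
```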
There is an additional confounding factor that we attempt to address directly. With the given definition of
contamination (as well as other definitions mentioned in the literature), there is a possibility that a sample