The third test consisted of measuring alignment with our quality assessment criteria. It comprised 31 questions asking the annotators to grade different prompt-answer pairs and to rank different answers to the same prompt. To measure alignment, we first collected responses from different team members; annotators who agreed with our preferences on more than 26 of the questions passed the test. Finally, the last test consisted of a prompt-response assessment in which annotators chose a minimum of 6 out of 18 prompts to write responses for. We manually assessed each response to evaluate production readiness. Annotators who scored an average of >4 passed the training.

A.6 Dataset Contamination

With the increasing scale of publicly available training data, it has become inevitable that some portion of evaluation data is seen during training and may provide an undue boost in evaluation performance. Earlier work on measuring such dataset contamination (Brown et al., 2020; Wei et al., 2022a; Du et al., 2022) considered an example from an evaluation set to be “contaminated” if there existed a collision between a high-order n-gram (generally, n = 13) from the sample and the training data. This was a deliberately conservative approach intended to produce a “clean” subset of the data with high precision, and it is used in open-source evaluation libraries (e.g., Gao et al., 2021). This approach, however, was unable to detect precisely what proportion of a given sample is contaminated, and did not take into account how evaluation datasets are constructed. Furthermore, as noted in Chowdhery et al. (2022), some datasets (such as BoolQ) contain contexts extracted verbatim from the web, but not the question and answer continuation. As such, highly contaminated samples from these datasets are unlikely to gain an unfair advantage. The methodology in Chowdhery et al. (2022) further improves on the earlier n-gram collision detection by considering a sample to be contaminated if 70% of all its 8-grams can be found at least once in the training data.

The previous methodologies noted above all consider contamination in text space and do not account for the formatting of prompts used in actual evaluation. In contrast, we match on tokenized input, being careful to pass fully verbalized evaluation samples to the tokenizer. We also diverge from the previous methodologies by considering contamination from a bottom-up perspective. We consider a token to be contaminated if it appears in any token n-gram longer than 10 tokens occurring in both the evaluation sample and the training set, and we define the contamination percentage of a sample to be the percentage of its tokens that are contaminated. This allows us to view the benchmark performance of our models across a range of contamination scales, while retaining the ability to test a high-precision clean subset (samples with < 20% contamination) and a high-precision contaminated subset (samples with > 80% contamination). To account for the vagaries of the precise format of verbalized samples, we allow a small “skipgram budget” of four tokens, so that matched spans between an evaluation sample and the training data can differ in at most four positions (we do not allow trailing mismatches, or mismatches in the first 10 tokens). We identify such 10(+)-skipgrams with suffix arrays implemented using a variation of the library from Lee et al. (2022), modified to work on a PySpark cluster (effectively without random access to disk).
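To make the token-level definition above concrete, the sketch below computes a contamination percentage for a single tokenized evaluation sample. It is a minimal illustration, not the implementation used in this work: the hypothetical `train_ngrams` set of exact 10-token n-grams stands in for the suffix-array index over the tokenized training data, and the four-token skipgram budget is omitted.

```python
from typing import Sequence, Set, Tuple

MIN_NGRAM = 10  # minimum length, in tokens, of a matched span

def contamination_percentage(sample_tokens: Sequence[int],
                             train_ngrams: Set[Tuple[int, ...]]) -> float:
    """Return the percentage of tokens covered by a matched >=10-token n-gram.

    `train_ngrams` is a hypothetical stand-in for the suffix-array index:
    the set of all 10-token n-grams occurring in the tokenized training data.
    """
    if not sample_tokens:
        return 0.0
    contaminated = [False] * len(sample_tokens)
    for start in range(len(sample_tokens) - MIN_NGRAM + 1):
        window = tuple(sample_tokens[start:start + MIN_NGRAM])
        if window in train_ngrams:
            # Mark every token inside the matched window; longer matched
            # spans are covered by their overlapping 10-token windows.
            for i in range(start, start + MIN_NGRAM):
                contaminated[i] = True
    return 100.0 * sum(contaminated) / len(sample_tokens)
```

Under this sketch, a sample whose value falls below 20% would land in the high-precision clean subset, and one above 80% in the high-precision contaminated subset.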
Given the embarrassingly parallel nature of the task, we are able to find all such 10-grams (and their full lengths) in our entire dataset in around seven hours (including time to tokenize), utilizing an estimated 1,500 cores.

As there are many confounding factors at play when determining whether dataset contamination has contributed to evaluation performance (mostly stemming from the fact that the “clean” and “dirty” subsets do not necessarily well-estimate the population distribution), we make the following assumption: in the event of dataset contamination contributing to evaluation performance, we expect both the “cleanest” examples to have an overall worse average score than their complement, and the “dirtiest” samples to have an overall better average score than their complement. It is insufficient evidence for contamination if only one of these is true. To this end, we define four (non-disjoint) subset types as follows (a short sketch of this check appears at the end of this section):

• “Clean” samples, with less than 20% token contamination,
• “Not clean” samples, with greater than (or equal to) 20% token contamination,
• “Not dirty” samples, with less than 80% token contamination,
• “Dirty” samples, with greater than (or equal to) 80% token contamination.

There is an additional confounding factor that we attempt to address directly. With the given definition of contamination (as well as other definitions mentioned in the literature), there is a possibility that a sample
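Returning to the subset definitions and the two-sided criterion above, the following sketch shows how the comparison could be carried out, assuming each sample is an illustrative record holding its contamination percentage (0-100) and its evaluation score; the field names are hypothetical.

```python
from statistics import fmean

def subset_scores(samples: list) -> dict:
    """Average score of each of the four (non-disjoint) subsets.

    Each element of `samples` is assumed to be a dict such as
    {"contamination": 83.0, "score": 1.0} (illustrative field names).
    """
    groups = {
        "clean":     [s["score"] for s in samples if s["contamination"] < 20.0],
        "not_clean": [s["score"] for s in samples if s["contamination"] >= 20.0],
        "not_dirty": [s["score"] for s in samples if s["contamination"] < 80.0],
        "dirty":     [s["score"] for s in samples if s["contamination"] >= 80.0],
    }
    return {name: fmean(scores) if scores else float("nan")
            for name, scores in groups.items()}

def shows_contamination_effect(samples: list) -> bool:
    """Two-sided check: clean samples must underperform their complement AND
    dirty samples must outperform their complement; one alone is insufficient."""
    avg = subset_scores(samples)
    return avg["clean"] < avg["not_clean"] and avg["dirty"] > avg["not_dirty"]
```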