        BoolQ  PIQA  SIQA  HellaSwag  ARC-e  ARC-c  NQ    TQA   MMLU  GSM8K  Human-Eval
MHA     71.0   79.3  48.2  75.1       71.2   43.0   12.4  44.7  28.0  4.9    7.9
MQA     70.6   79.0  47.9  74.5       71.6   41.9   14.5  42.8  26.5  4.8    7.3
GQA     69.4   78.8  48.6  75.4       72.1   42.5   14.0  46.2  26.9  5.3    7.9

Table 18: Attention architecture ablations. We report 0-shot results for all tasks except MMLU (5-shot) and GSM8K (8-shot). For GSM8K and Human-Eval we report maj@1 and pass@1 results, respectively. For NQ and TriviaQA we report EM. For all other tasks we report accuracy.

Figure 24: Multi-query variants enable higher throughput with larger batch sizes and show similar latency on smaller batches. Output length is fixed at 128 tokens. The first data point corresponds to batch size 1, which is then doubled until the model runs out of memory. The MHA variant triggers an out-of-memory error at a batch size of 1024 for a 256-token context and at a batch size of 128 for a 2k context, whereas MQA and GQA run successfully in those settings.

Therefore, based on the ablation results and the ease of scaling inference, for the 34B and 70B Llama 2 models we chose to use GQA instead of MQA.

Figure 24 shows how inference speed changed for the 30B GQA and MQA ablation models compared to the MHA baseline, in an experiment using 8 × 80 GiB A100s with tensor parallelism. In these runs we simply duplicated the KV heads for MQA on all GPUs, so the KV cache size for MQA became equal to that of GQA, and the two variants behaved very similarly (with MQA only having a slightly larger FFN dimension).

A.2.2 Additional Details for Pretrained Models Evaluation

MMLU details. In Table 19, we report details of the MMLU (Hendrycks et al., 2020) evaluation for Llama 2 models and other open-source models.

Standard Benchmarks. In Table 20, we show results on several standard benchmarks.

Code Generation. In Table 21, we compare results of Llama 2 with popular open-source models on the Human-Eval and MBPP code generation benchmarks.

World Knowledge. We evaluate the Llama 2 model together with other open-source models on the NaturalQuestions and TriviaQA benchmarks (Table 22).

Reading Comprehension. In Table 23 we report zero-shot and few-shot results on SQuAD, and zero-shot and one-shot experiments on QuAC. Here Llama 2 performs best on all evaluation settings and against all models except on QuAC 0-shot, where Llama 1 30B performs slightly better.

Exams. In Table 24, we present fine-grained results from the English part of the AGI Eval (Zhong et al., 2023) benchmark. AGI Eval is a collection of standardized exams in different subjects.
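To make the relation between the three ablated attention variants above concrete, the following minimal sketch shows grouped-query attention in which n_kv_heads key/value heads are shared by n_heads query heads: n_kv_heads = n_heads recovers MHA, n_kv_heads = 1 recovers MQA, and intermediate values give GQA. It also sketches why the decoding KV cache shrinks with fewer KV heads. This is an illustrative example only, not the Llama 2 implementation; the toy dimensions and the kv_cache_bytes helper are hypothetical, and causal masking is omitted for brevity.

import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    # x: (batch, seq, dim); wq: (dim, n_heads*head_dim); wk, wv: (dim, n_kv_heads*head_dim).
    # n_kv_heads == n_heads -> MHA; n_kv_heads == 1 -> MQA; otherwise GQA.
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_heads

    q = (x @ wq).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)

    # Every group of n_heads // n_kv_heads query heads attends with the same
    # K/V head; expanding K/V here mirrors the "duplicate the KV heads"
    # scheme used for the tensor-parallel MQA runs described above.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))       # (bsz, n_heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # (bsz, n_heads, seq, seq); no causal mask
    out = F.softmax(scores, dim=-1) @ v                    # (bsz, n_heads, seq, head_dim)
    return out.transpose(1, 2).reshape(bsz, seqlen, dim)

def kv_cache_bytes(batch, seqlen, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # The decoding cache stores one K and one V vector per KV head, token, and
    # layer, so its size scales linearly with n_kv_heads (MQA < GQA < MHA).
    return 2 * batch * seqlen * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Toy usage with hypothetical sizes.
dim, n_heads, n_kv_heads = 512, 8, 2
head_dim = dim // n_heads
x = torch.randn(1, 16, dim)
wq = torch.randn(dim, n_heads * head_dim)
wk = torch.randn(dim, n_kv_heads * head_dim)
wv = torch.randn(dim, n_kv_heads * head_dim)
print(grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads).shape)  # (1, 16, 512)
print(kv_cache_bytes(batch=1, seqlen=2048, n_layers=32, n_kv_heads=n_kv_heads, head_dim=head_dim))

Because the only projection that changes size across the variants is the K/V projection, the per-token compute is nearly identical; the practical difference at inference time is the KV cache footprint, which is what drives the batch-size and throughput behaviour reported in Figure 24.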