Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original Llama team who helped get this work started. Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the figures in the paper. Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the internal demo. Early reviewers of this paper, who helped us improve its quality, including Mike Lewis, Joelle Pineau, Laurens van der Maaten, Jason Weston, and Omer Levy. A.2 Additional Details for Pretraining A.2.1 Architecture Changes Compared to Llama 1 Context Length. We expand the context window for Llama 2 from 2048 tokens to 4096 tokens. The longer context window enables models to process more information, which is particularly useful for supporting longer histories in chat applications, various summarization tasks, and understanding longer documents. Table 16 compares the performance of 2k and 4k context pretraining on long-context benchmarks. Both models are trained for 150B tokens, keeping the same architecture and hyperparameters as a baseline, varying only the context length. We observe improvement on SCROLLS (Shaham et al., 2022), where the average input length is 3.5k, and no performance degradation on SQUAD (Rajpurkar et al., 2018). Table 17 shows that the longer context model retains strong performance on various general-purpose tasks. Grouped-Query Attention. A standard practice for autoregressive decoding is to cache the key (K) and value (V) pairs for the previous tokens in the sequence, speeding up attention computation. With increasing context windows or batch sizes, however, the memory costs associated with the KV cache size in multi-head attention (MHA) models grow significantly. For larger models, where KV cache size becomes a bottleneck, key and value projections can be shared across multiple heads without much degradation of performance (Chowdhery et al., 2022). Either the original multi-query format with a single KV projection (MQA, Shazeer, 2019) or a grouped-query attention variant with 8 KV projections (GQA, Ainslie et al., 2023) can be used. In Table 18, we compare MQA and GQA variants with an MHA baseline. We train all models with 150B tokens while keeping a fixed 30B model size. To keep a similar overall parameter count across GQA and MQA, we increase the dimension of the feed-forward layers to compensate for the reduction in the attention layers. For the MQA variant, we increase the FFN dimension by a factor of 1.33, and for the GQA variant, we increase it by a factor of 1.3. From the results, we observe that the GQA variant performs comparably to the MHA baseline on most evaluation tasks and is better than the MQA variant on average. To optimize for latency, we host our largest models using 8 A100s in a single node with tensor parallelism (Shoeybi et al., 2019). In this setting, sharding for MQA cannot be done across heads anymore, given the number of heads is lower than the number of GPUs. Either you duplicate the KV values in all GPUs (making the KV cache size equal to GQA), or an alternative is to shard across the batch dimension instead (Pope et al., 2022). The latter, however, can complicate an inference service, as it works only when batch sizes are larger than the number of shards and the additional communication cost is not worth it in all cases. Context NarrativeQA Qasper QuALITY QMSum ContractNLI SQuAD Length (F1) (F1) (acc) (Rouge 1/2/L) (EM) (EM/F1) 2k 0.21 0.71 26.1 0.13/0.01/0.12 11.76 57.23/62.89 4k 17.26 18.52 29.6 15.08/3.55/12.16 16.33 57.99/64.46 Table 16: Context length ablation on long-context tasks. Context Hella-Swag NQ TQA GSM8K Human-Eval Length (0-shot) (64-shot) (64-shot) (8-shot) (0-shot) 2k 75.1 25.5 53.7 4.9 7.9 4k 74.8 25.5 52.2 6.5 7.3 Table 17: Context length ablation on general tasks. 47