Dialogue Turn    Baseline    + GAtt
2                100%        100%
4                10%         100%
6                0%          100%
20               0%          100%
Table 30: GAtt results. Llama 2-Chat with GAtt is able to refer to attributes 100% of the time, for up to 20
turns from our human evaluation. We limited the evaluated attributes to public figures and hobbies.
The attention now spans beyond 20 turns. We tested the model's ability to remember the system arguments
through a human evaluation. The arguments (e.g., hobbies, persona) are defined during the first message, and
then, from turn 2 to 20, we explicitly asked the model to refer to them (e.g., “What is your favorite hobby?”,
“What is your name?”) to measure the multi-turn memory ability of Llama 2-Chat. We report the results
in Table 30. Equipped with GAtt, Llama 2-Chat maintains 100% accuracy, always referring to the defined
attribute, up to 20 turns (we did not extend the human evaluation further, and all the examples had
fewer than 4048 tokens in total over the turns). In comparison, Llama 2-Chat without GAtt can no longer
refer to the attributes after only a few turns: from 100% at turn t+1, to 10% at turn t+3, and then 0%.
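Below is a minimal sketch of how such a multi-turn attribute-memory probe could be scripted. It assumes a hypothetical chat(messages) helper that returns the model's reply, and it replaces the human judgment used in the evaluation above with a crude substring check, so it illustrates the setup rather than the actual protocol.

```python
# Sketch of a multi-turn attribute-memory probe. `chat(messages)` is a
# hypothetical helper returning the assistant's reply for a list of
# {"role": ..., "content": ...} messages.

def probe_attribute_memory(chat, attribute="painting", max_turn=20):
    """Define an attribute at turn 1, then ask the model to refer to it at turns 2..max_turn."""
    messages = [{"role": "user",
                 "content": f"Act as someone whose favorite hobby is {attribute}."}]
    messages.append({"role": "assistant", "content": chat(messages)})

    hits = {}
    for turn in range(2, max_turn + 1):
        messages.append({"role": "user", "content": "What is your favorite hobby?"})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        # Crude automatic check; the evaluation reported here relied on human annotators.
        hits[turn] = attribute.lower() in reply.lower()
    return hits
```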
GAtt Zero-shot Generalisation. We tried, at inference time, to set constraints not present in the training of
GAtt. For instance, “answer in one sentence only”, to which the model remained consistent, as illustrated in
Figure 28.
We first applied GAtt to Llama 1, which was pretrained with a context length of 2048 tokens and then
fine-tuned with a maximum length of 4096 tokens. We tested whether GAtt works beyond 2048 tokens, and the
model arguably managed to understand attributes beyond this window. This promising result indicates that
GAtt could be adapted as an efficient technique for long context attention.
A.3.6 How Far Can Model-Based Evaluation Go?
To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and
safety, and asked annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the
better) using triple reviews. As illustrated in Figure 29 (in Appendix), we observe that our reward models
are overall well calibrated with human preference. Note that this enables us to use the reward as a point-wise
metric, despite being trained with a Pairwise Ranking Loss.
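As a rough sketch of this point-wise usage, the snippet below shows a binary ranking objective and how the same scalar head can score a single (prompt, response) pair at evaluation time. It assumes a reward_model that returns one scalar per sequence and a Hugging Face-style tokenizer; both interfaces are placeholders rather than the actual training code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # Binary ranking loss: push r(x, y_chosen) above r(x, y_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def pointwise_score(reward_model, tokenizer, prompt, response):
    # The scalar head yields one score per sequence, so the same model can be
    # read point-wise and compared against the human Likert ratings.
    inputs = tokenizer(prompt + response, return_tensors="pt")
    with torch.no_grad():
        return reward_model(**inputs).item()
```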
Figure 27: Reward model score distribution shift caused by incorporating a preference-rating-based margin
in the ranking loss. With the margin term, we observe a binary split pattern in the reward distribution,
especially with a larger margin.
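For reference, a minimal sketch of the margin variant of the ranking loss referenced in Figure 27, assuming preference ratings grouped into four categories; the specific margin values below are illustrative placeholders, not the ones used for training.

```python
import torch.nn.functional as F

# Illustrative margins per preference-rating category (placeholder values).
MARGINS = {"significantly_better": 1.0, "better": 0.667,
           "slightly_better": 0.333, "negligibly_better": 0.0}

def ranking_loss_with_margin(reward_chosen, reward_rejected, rating):
    # A larger margin for more clearly preferred pairs pushes the two reward
    # scores further apart, producing the binary split visible in Figure 27.
    return -F.logsigmoid(reward_chosen - reward_rejected - MARGINS[rating]).mean()
```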