Dialogue Turn    Baseline    + GAtt
2                100%        100%
4                10%         100%
6                0%          100%
20               0%          100%
Table 30: GAtt results. Llama 2-Chat with GAtt is able to refer to attributes 100% of the time, for up to 20
turns from our human evaluation. We limited the evaluated attributes to public figures and hobbies.
The attention now spans beyond 20 turns. We tested the model's ability to remember the system arguments
through a human evaluation. The arguments (e.g., hobbies, persona) are defined during the first message, and
then, from turn 2 to 20, we explicitly asked the model to refer to them (e.g., “What is your favorite hobby?”,
“What is your name?”) to measure the multi-turn memory ability of Llama 2-Chat. We report the results
in Table 30. Equipped with GAtt, Llama 2-Chat maintains 100% accuracy, always referring to the defined
attribute, up to 20 turns (we did not extend the human evaluation further, and all the examples had
fewer than 4048 tokens in total over the turns). In comparison, Llama 2-Chat without GAtt can no longer
refer to the attributes after only a few turns: from 100% at turn t+1, to 10% at turn t+3, and then 0%.
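Below is a minimal sketch of how such a multi-turn attribute-memory probe could be scripted. It assumes a hypothetical chat(messages) helper that returns the model's reply, and it replaces the human judgment used in the evaluation above with a crude substring check, so it illustrates the setup rather than the actual protocol.

```python
# Sketch of a multi-turn attribute-memory probe. `chat(messages)` is a
# hypothetical helper returning the assistant's reply for a list of
# {"role": ..., "content": ...} messages.

def probe_attribute_memory(chat, attribute="painting", max_turn=20):
    """Define an attribute at turn 1, then ask the model to refer to it at turns 2..max_turn."""
    messages = [{"role": "user",
                 "content": f"Act as someone whose favorite hobby is {attribute}."}]
    messages.append({"role": "assistant", "content": chat(messages)})

    hits = {}
    for turn in range(2, max_turn + 1):
        messages.append({"role": "user", "content": "What is your favorite hobby?"})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        # Crude automatic check; the evaluation reported here relied on human annotators.
        hits[turn] = attribute.lower() in reply.lower()
    return hits
```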
GAtt Zero-shot Generalisation. We tried, at inference time, to set constraints not present in the training of
GAtt. For instance, “answer in one sentence only”, to which the model remained consistent, as illustrated in
Figure 28.
We first applied GAtt to Llama 1, which was pretrained with a context length of 2048 tokens and then
fine-tuned with a maximum length of 4096 tokens. We tested whether GAtt works beyond 2048 tokens, and the
model arguably managed to understand attributes beyond this window. This promising result indicates that
GAtt could be adapted as an efficient technique for long context attention.
A.3.6 How Far Can Model-Based Evaluation Go?
To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and
safety, and asked annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the
better) using triple reviews. As illustrated in Figure 29 (in Appendix), we observe that our reward models
are overall well calibrated with human preference. Note that this enables us to use the reward as a point-wise
metric, despite being trained with a Pairwise Ranking Loss.
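As a rough sketch of this point-wise usage, the snippet below shows a binary ranking objective and how the same scalar head can score a single (prompt, response) pair at evaluation time. It assumes a reward_model that returns one scalar per sequence and a Hugging Face-style tokenizer; both interfaces are placeholders rather than the actual training code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # Binary ranking loss: push r(x, y_chosen) above r(x, y_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def pointwise_score(reward_model, tokenizer, prompt, response):
    # The scalar head yields one score per sequence, so the same model can be
    # read point-wise and compared against the human Likert ratings.
    inputs = tokenizer(prompt + response, return_tensors="pt")
    with torch.no_grad():
        return reward_model(**inputs).item()
```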
Figure 27: Reward model score distribution shift caused by incorporating a preference-rating-based margin
in the ranking loss. With the margin term, we observe a binary split pattern in the reward distribution,
especially with a larger margin.
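For reference, a minimal sketch of the margin variant of the ranking loss referenced in Figure 27, assuming preference ratings grouped into four categories; the specific margin values below are illustrative placeholders, not the ones used for training.

```python
import torch.nn.functional as F

# Illustrative margins per preference-rating category (placeholder values).
MARGINS = {"significantly_better": 1.0, "better": 0.667,
           "slightly_better": 0.333, "negligibly_better": 0.0}

def ranking_loss_with_margin(reward_chosen, reward_rejected, rating):
    # A larger margin for more clearly preferred pairs pushes the two reward
    # scores further apart, producing the binary split visible in Figure 27.
    return -F.logsigmoid(reward_chosen - reward_rejected - MARGINS[rating]).mean()
```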