3.4 Retrieval Augmentation Necessity and Frequency

The retrieval operation in LLM-based generation generally aims to supplement knowledge to enhance generation. Although retrieval-augmented models have emerged as promising, they have been criticized for not being a universal solution [75, 109], as indiscriminately augmenting LLMs with irrelevant passages can override potentially correct knowledge already possessed by the LLMs and result in incorrect responses instead [99]. Thakur et al. [147] contribute a human-annotated dataset to help evaluate the robustness of LLMs against errors in external retrieved knowledge, and observe that LLMs may hallucinate at roughly twice the rate on non-relevant retrieved passages as on relevant ones. Therefore, it is critical for RA-LLMs to accurately recall prior knowledge while selectively incorporating retrieved information only when necessary, which is the path to robust RA-LLMs.

Most existing methods determine the necessity of retrieval based on the preliminary answers of LLMs or their internal reasoning results [102, 117]. For example, Self-RAG [5] introduces special tokens to assess the necessity of retrieval and to control retrieval behavior. Other methods design iterative prompts to decide whether extra information is required during generation and thus whether retrieval or other actions need to be invoked for the LLM [162, 175]. Wang et al. [159] propose the Self-Knowledge guided Retrieval augmentation (SKR) method, which uses LLMs themselves or explicit small trainable models to provide self-knowledge as a reference for adaptively calling the retriever. In traditional RAG, retrieval necessity judgment has also been explored, with intuitive approaches proposed to address it, such as assessing the confidence of the logits produced by the generation model [47, 56, 59]. Such a solution is also applicable to RA-LLMs; for example, FLARE [57] dynamically triggers retrieval when the logits fall below a specific threshold. Tan et al. [146] introduce a more flexible model, SlimPLM, which detects missing knowledge in LLMs with a slim proxy model that generates a "heuristic answer". The heuristic answer is used to assess the necessity of retrieval and, when retrieval is necessary, to facilitate the retrieval process through query rewriting.
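To make the logit-threshold idea behind FLARE-style adaptive retrieval concrete, the sketch below shows one way such a trigger could be wired into a generation loop. It is a minimal illustration only: the helper interfaces (generate_next_sentence, is_finished, retriever.search), the use of the minimum token probability as the confidence signal, and the threshold value are assumptions made here for clarity, not the actual implementation of FLARE [57].

```python
# Minimal sketch of confidence-triggered (adaptive) retrieval in the spirit of
# FLARE [57]. All helper methods and the threshold value are illustrative
# assumptions, not the original implementation.

def generate_with_adaptive_retrieval(llm, retriever, query,
                                      conf_threshold=0.6, max_steps=10):
    """Generate an answer, invoking retrieval only when the model's
    token-level confidence drops below a threshold."""
    context = []   # retrieved passages accumulated so far
    answer = ""    # generated text so far

    for _ in range(max_steps):
        # Tentatively generate the next sentence and record the minimum
        # token probability as a crude confidence signal.
        sentence, min_token_prob = llm.generate_next_sentence(
            query=query, context=context, prefix=answer
        )

        if min_token_prob < conf_threshold:
            # Low confidence: use the tentative sentence as a search query
            # (query rewriting), retrieve supporting passages, and regenerate.
            passages = retriever.search(sentence, top_k=3)
            context.extend(passages)
            sentence, _ = llm.generate_next_sentence(
                query=query, context=context, prefix=answer
            )

        answer += sentence
        if llm.is_finished(answer):
            break
    return answer
```

Under the same loop structure, a fixed retrieval stride (retrieving every n tokens or every token, as discussed next) would simply replace the confidence test with a counter.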
In traditional RAG, which rarely considers retrieval necessity, retrieval frequency (also called retrieval stride) is an important design aspect that determines the degree to which retrieval is used during generation, and thus greatly affects the overall performance of RAG models [117]. Retrieval frequency controls how much the model relies on retrieval results, thereby affecting both the efficiency and the effectiveness of the model. When the necessity of retrieval is not considered, the retrieval frequency is often pre-defined and fixed, with three common settings: one-time, every-n-token, and every-token. One-time retrieval invokes the retrieval function only once and tries to find all desired information in that single operation. It is usually performed at the beginning of the generation process, after which all retrieved documents are provided to the generation model along with the original input, as in REALM [46]. One-time retrieval is better suited to cases where the information needed from external databases is obvious to the LLM [57].

However, for language tasks requiring long-form output, such as open-domain summarization, the dependencies among output tokens become more important to consider during generation. In these cases, documents pre-retrieved in a single pass may not suffice to support the generation of the whole output sequence, which calls for in-generation retrieval operations. To this end, In-Context RALM [117] and RETRO [6] apply every-n-token retrieval during generation for better augmentation. In comparison, kNN-LM [62] adopts an even more frequent strategy, retrieving information for the prediction of every token during generation. Overall, the retrieval frequency affects both the effectiveness and the efficiency of the whole RAG method: more frequent retrieval tends to yield better performance but also increases the computational cost [117], so choosing a retrieval frequency is essentially a trade-off between computational cost and performance.

4 RA-LLMS TRAINING

Based on whether training is required, existing RAG methods can be categorized into two main classes: training-free approaches and training-based approaches. Training-free methods usually leverage the retrieved knowledge directly at inference time by inserting the retrieved text into the prompt, without introducing extra training, which is computationally efficient. However, one potential challenge is that the retriever and generator components are not specifically optimized for downstream tasks, which can easily lead to sub-optimal utilization of the retrieved knowledge. To fully exploit the external knowledge, extensive methods have been proposed to fine-tune the retriever and generator, thereby guiding large language models to effectively adapt to and integrate retrieved information [127, 128, 130, 135, 153, 199].

According to their training strategies, we categorize these training-based approaches into three classes: 1) Independent Training approaches train each component of the RAG procedure independently, 2) Sequential Training methods train one module first and freeze the well-trained component to guide the tuning of the other, and 3) Joint Training approaches train the retriever and generator simultaneously. In the following sections, we comprehensively review the training-free, independent training, sequential training, and joint training methods. The comparison of these different training methods is depicted in Figure 5.

4.1 Training-free

With their huge number of parameters, LLMs have exhibited human-level intelligence and achieved promising prediction performance on various downstream tasks. However, it is extremely challenging to frequently fine-tune them and update the knowledge stored in their parameters [74], given the considerable time and computational resources required. Recently, numerous studies have suggested enhancing LLMs with retrieval mechanisms, enabling them to dynamically acquire new knowledge from external sources without extra training (i.e., training-free) [54, 57, 63], instead of relying solely on the implicit knowledge encoded in the model's parameters. These approaches have shown significant performance improvements on various knowledge-intensive tasks, such as open-domain question answering [74]. According to the different ways in which LLMs utilize retrieved information, we categorize these training-free methods into two categories: 1) Prompt Engineering-based Methods integrate retrieved knowledge into