Conference’17, July 2017, Washington, DC, USA — Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li

[Figure 1: a dialog example contrasting an LLM without RAG (which answers "As of my last update in January 2022, I can't provide which country won ... 2023") against an LLM with RAG, which retrieves "Spain won the Women's World Cup 2023" from an external database and answers the query correctly.]

Figure 1: Retrieval-Augmented Generation (RAG) meets Large Language Models (LLMs). When the user’s query is out-of-scope, e.g., unseen content in the training data or a need for the latest information, LLMs might show inferior generation performance. With the help of RAG, LLMs can leverage additional relevant information from an external database to enhance their text generation capability.

Recent years have witnessed the rapid development of pre-trained foundation models, particularly Large Language Models (LLMs), which have demonstrated impressive performance across various tasks [1, 18], including recommender systems [195], molecule discovery [77], and report generation [27]. Technically, the great success of LLMs can be attributed to advanced architectures with billions of parameters pre-trained on huge corpora from various sources. These technical improvements have given rise to the remarkable emergent capabilities of LLMs [194, 195], particularly in language understanding and generation, in-context learning, and others. For instance, GPT-FAR introduces detailed prompts to teach GPT-4 to perform image tagging, statistical analysis, and text analysis for multi-modal fashion report generation [27]. LLMs also achieve promising performance in recommender systems by understanding users’ preferences towards items [154, 195].
Despite the success, LLMs still suffer from intrinsic limitations [194, 195], such as the lack of domain-specific knowledge, the problem of “hallucination”, and the substantial computational resources required for updating the models. These problems are particularly notable in domain-specific fields like medicine and law. For instance, a recent study has demonstrated that legal hallucinations are pervasive and disturbing, with hallucination rates ranging from 69% to 88% in responses to specific legal queries for state-of-the-art LLMs [21]. Moreover, the challenge of tackling the hallucination problem becomes even harder due to the substantial computational resources required for fine-tuning LLMs with domain-specific or up-to-date data. This, in turn, significantly hinders the widespread adoption of LLMs in various real-world applications. To address these limitations, recent efforts have been made to take advantage of RAG to enhance the capabilities of LLMs in various tasks [6, 53, 62, 135], especially those demanding the latest and reliable knowledge, such as Question Answering (QA), AI4Science, and software engineering. For example, Lozano et al. [92] introduce a scientific QA system based on dynamically retrieving scientific literature. MolReGPT leverages RAG to enhance the in-context learning ability of ChatGPT for molecular discovery [77]. It has also been demonstrated that RAG can effectively reduce hallucinations in conversational tasks [137, 171]. As illustrated in Figure 1, an LLM-based dialog system may fail to answer out-of-scope queries well. With the help of RAG to retrieve relevant knowledge from an external database and integrate it into the generation process, the dialog system succeeds in giving correct answers. Given the remarkable progress in advancing LLMs with RAG, there is an imperative need for a systematic review of recent advances in Retrieval-Augmented Large Language Models (RA-LLMs).
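The retrieve-then-generate loop described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the implementation of any specific RA-LLM system: the toy bag-of-words "embedding", the `retrieve` and `build_prompt` helpers, and the two-document corpus are all hypothetical stand-ins for a real dense retriever, vector store, and LLM call.

```python
# Minimal sketch of the RAG loop: retrieve relevant text from an external
# store, then prepend it to the LLM prompt as context. All names here are
# illustrative, not from any specific RAG framework.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a dense retriever."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by similarity to the query; return the top-k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Augment the user query with retrieved context before calling the LLM."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Spain won the Women's World Cup 2023.",
    "The 2022 FIFA World Cup was held in Qatar.",
]
prompt = build_prompt("Which country won the Women's World Cup 2023?", corpus)
```

Fed this augmented prompt, a frozen LLM can answer the out-of-scope query in Figure 1 without any fine-tuning, which is precisely the appeal of RAG.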
This survey aims to provide a comprehensive overview of RA-LLMs by summarizing representative methods from the aspects of architecture, training strategy, and application area, respectively. More specifically, following a brief introduction to the background knowledge of LLMs in Section 2, we review existing research from several primary perspectives of RA-LLMs in terms of retrieval, generation, and augmentation in Section 3, as well as the necessity and application frequency of retrieval in RAG. Then, we summarize the main training techniques of RA-LLMs in Section 4 and various RA-LLMs applications in Section 5. Finally, in Section 6, we discuss key challenges and potential directions for future exploration. Concurrent to our survey, several related surveys have different focuses on RAG and LLMs. For example, Zhao et al. [193] specifically review multi-modal information-based RAG techniques, and Zhao et al. [192] discuss RAG for AIGC. Gao et al. [41] conduct a relatively comprehensive overview of RAG for LLMs. Our survey differs from these surveys in concentrating on technical perspectives and systematically reviewing models according to the architecture and training paradigm in RA-LLMs, as well as application tasks.

2 BACKGROUND

In this section, we briefly present the background of large language models and prompt learning.

2.1 Large Language Models (LLMs)

Recently, the significant breakthrough of LLMs has revolutionized the field of artificial intelligence [7, 37, 194]. The advanced LLMs are typically pre-trained on extensive data with billion-level parameters and have demonstrated the ability to understand and generate human-like text, leading to advancements in various natural language processing tasks such as text generation and information retrieval [194, 195]. LLMs can be adapted to a variety of downstream tasks by fine-tuning them on specific datasets, allowing them to specialize in particular domains or applications.
In general, most existing LLMs can be broadly divided into three main categories: Encoder-only, Decoder-only, and Encoder-Decoder models. Encoder-only models, such as the BERT (Bidirectional Encoder Representations from Transformers) [25] family of models, process input text by encoding it into a high-dimensional space. The key feature of Encoder-only models is their bi-directional nature, meaning that they can take into account both the left and right context of each token when encoding it. This bi-directionality allows Encoder-only models to better understand the meaning of words in context, which is crucial for tasks like sentiment analysis, reading comprehension, and text classification [25, 169]. In contrast to these models,
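The bi-directionality mentioned above comes down to the attention mask: encoder-only models let every token attend to all positions, while decoder-only models apply a causal mask so each token sees only the tokens before it. The following toy sketch, using random scores rather than a trained model, illustrates just this masking difference; the function names and the 4-token sequence are illustrative assumptions, not BERT's actual implementation.

```python
# Contrast the bi-directional attention of encoder-only models (every token
# sees left AND right context) with the causal mask of decoder-only models.
import numpy as np

def attention_mask(seq_len: int, bidirectional: bool) -> np.ndarray:
    """1 where position i may attend to position j, else 0."""
    if bidirectional:
        return np.ones((seq_len, seq_len))       # full context, as in BERT
    return np.tril(np.ones((seq_len, seq_len)))  # causal: only j <= i

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over each row, with masked-out positions forced to ~zero."""
    scores = np.where(mask == 1, scores, -1e9)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # toy raw attention scores for 4 tokens

bi = masked_softmax(scores, attention_mask(4, bidirectional=True))
causal = masked_softmax(scores, attention_mask(4, bidirectional=False))

# The first token attends to later (right-context) tokens only in the
# bi-directional case; under the causal mask that weight is zero.
print(bi[0, 1:].sum() > 0)       # True
print(causal[0, 1:].sum() == 0)  # True
```

This single masking choice is why encoder-only models excel at understanding tasks such as classification, while decoder-only models are suited to left-to-right generation.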