Memory Sharing for Large Language Model based Agents

Hang Gao, Yongfeng Zhang
Department of Computer Science
Rutgers University
New Brunswick, NJ 08854, USA
{h.gao, yongfeng.zhang}@rutgers.edu
Abstract

In the realm of artificial intelligence, the adaptation of Large Language Model (LLM)-based agents to execute tasks via natural language prompts represents a significant advancement, notably eliminating the need for explicit retraining or fine-tuning for fixed-answer tasks such as common sense questions and yes/no queries. However, the application of In-context Learning to open-ended challenges, such as poetry creation, reveals substantial limitations due to the limited comprehensiveness of the provided examples and the agent’s limited ability to understand the content expressed in the problem, leading to outputs that often diverge significantly from expected results. Addressing this gap, our study introduces the Memory-Sharing (MS) framework for LLM multi-agents, which utilizes a real-time memory storage and retrieval system to enhance the In-context Learning process. Each “memory” within this system captures both the posed query and the corresponding real-time response from an LLM-based agent, and memories are aggregated from a broad spectrum of similar agents to enrich the memory pool shared by all agents. This framework not only aids agents in identifying the most relevant examples for specific tasks but also evaluates the potential utility of their memories for future applications by other agents. Empirical validation across three distinct domains involving specialized functions of agents demonstrates that the MS framework significantly improves the agents’ performance on open-ended questions. Furthermore, we discuss what type of memory pool and what retrieval strategy in MS can better help agents, offering a direction for the future development of MS. The code and data are available at: https://github.com/GHupppp/MemorySharingLLM

1 Introduction

The emergence of Large Language Models (LLMs) has precipitated a transformation in the field of machine learning, primarily evidenced by the innovation of model fine-tuning techniques. However, the advent of in-context learning and prompt engineering heralds a more nuanced evolution, enabling dynamic and intuitive interactions between the model and the user. This is achieved by bypassing the traditional need for parameter updates or explicit model retraining when adapting to novel tasks. Such advancements underscore the potential to substantially broaden the capabilities of LLM-based agents through strategic conditioning on task-specific prompts, wherein diverse strategies are employed depending on the problem type at hand. Initially, in-context learning was introduced to help LLM-based agents achieve commendable performance with minimal examples (Brown et al., 2020), and it was subsequently extended across various domains (Ahmed & Devanbu, 2022; Izacard et al., 2022). Following this, chain-of-thought prompting significantly augmented the proficiency of LLM-based agents in executing arithmetic tasks (Wei et al., 2022). Building upon this foundation, methodologies such as PAL (Gao et al., 2023) and the integration of LLMs with symbolic solvers (He-Yueya et al., 2023) have been developed to further enhance agent capabilities in tackling reasoning tasks. Recent work has also developed agents that can continuously acquire diverse skills and make novel discoveries (Wang et al., 2023a). Concurrently, the introduction of Retrieval-Augmented Generation marked a substantial improvement in knowledge-intensive tasks (Lewis et al., 2020), and subsequently facilitated more effective generation in open-domain queries (Mao et al., 2020). In recent developments, self-learning techniques have been integrated with the retrieval mechanism within in-context learning to refine model performance in text generation tasks, through the retrieval of examples with the most analogous patterns (Rubin et al., 2021; Wang et al., 2023b). These innovations collectively enhance the efficiency of interactions with LLM-based agents across a spectrum of applications.

Figure 1: The Memory Sharing framework. Whenever a new (Prompt, Answer) pair is generated, it will be considered to be added to the memory pool and to train the retriever.

Notwithstanding the proficiency of in-context learning in enhancing model performance for complex tasks through the meticulous construction of prompts that furnish explicit instructions and contextual clarity, its applicability remains diminished in scenarios involving open-ended queries such as poetry generation or lateral thinking puzzles. The capacity of LLM-based agents to navigate and resolve such queries would align these agents more closely with human cognitive processes and bolster their potential to create genuine novelty by adopting a more flexible and holistic approach to problem-solving. However, answers to open-ended queries require comprehensive references and a better understanding of varied knowledge, which current agents lack. This limitation is further exacerbated when the content of the knowledge base, often unidimensional and static, is not regularly updated, thereby compounding the difficulty in addressing open-ended queries with evolving reference requirements. To address these issues, we introduce the Memory Sharing (MS) framework among agents, a novel framework specifically designed to surmount the hurdles associated with ensuring comprehensive example coverage and enhancing the dynamism of the knowledge base within the context of in-context learning.

Within the MS framework, the interaction between an agent’s input and its subsequent output is conceptualized as a “Prompt-Answer” (PA) pair, and these pairs collectively form the agent’s memory pool. This framework introduces a real-time memory storage and retrieval mechanism, designed to augment the agent’s memory pool by assimilating PA pairs from a multifaceted ensemble of agents. During the storage phase, each PA pair is evaluated by a dedicated LLM scorer, which assesses its appropriateness for inclusion within the memory pool, where it may serve as a reference in subsequent interactions. This procedure ensures the memory pool’s capability for dynamic expansion. The retrieval phase is handled by an autonomous learning retrieval system, calibrated to ensure the incorporation of highly pertinent memories into prompts, thereby enhancing the agents’ comprehension of the query’s essence. Fig. 1 shows the MS framework. It is posited that the incorporation of self-generated memories into prompts significantly elevates agents’ understanding of the intended query meaning. Furthermore, the continuous incorporation of new memories not only enriches the memory pool but also continually refines the retriever, augmenting its efficiency in selecting relevant memories. By treating each interaction as a cohesive PA pair, this framework ensures that every query posed and response generated is considered in an integrated manner. Our empirical evidence suggests that this methodology substantially assists LLM-based agents in generating outputs that are more aligned with user expectations.

We evaluate the MS framework in three distinct domains, each involving three agents, and each agent with a specialized task within its domain. Our findings suggest that incremental additions to the memory pool lead to enhancements in the precision and relevance of outputs. This research delineates the MS framework’s capacity to mitigate the inherent constraints associated with in-context learning, thereby underscoring its potential applicability and effectiveness.

In the following, Section 2 reviews related work. A detailed description of the MS framework, including its conceptual underpinnings and operational methodology, is presented in Section 3. Section 4 provides empirical validation of the framework’s enhanced capability to address open-ended queries. The conclusion, presented in Section 5, not only summarizes the findings but also explores prospective avenues for future development of the MS framework, which may further improve LLM-based agents.

2 Related Work

2.1 In-context learning

In-context learning capitalizes on the capacity of Large Language Models to address tasks by incorporating a minimal number of examples into the prompt, occasionally rivaling or even surpassing the effectiveness of previous state-of-the-art fine-tuning methods (Brown et al., 2020). Following this paradigm, there has been a growing emphasis on the meticulous construction of prompts to furnish the model with explicit instructions and contextual clarity, thereby augmenting its proficiency in executing complex tasks (Levine et al., 2021; Zhou et al., 2022; Liu et al., 2023; White et al., 2023). Subsequent research has explored the realm of question answering, revealing that crowd-sourced instructions can notably enhance an agent’s performance in this domain (Mishra et al., 2021). Moreover, in-context learning has been demonstrated to facilitate a measure of creative learning within agents (Swanson et al., 2021). Advancements have also been made through the strategic redesign of inputs, rendering LLMs increasingly adept at navigating logical dilemmas, particularly when dealing with interrelated questions and answers (Wiegreffe et al., 2021; Wu et al., 2022). Elucidating the nexus between examples and tasks has been shown to significantly benefit LLMs (Lampinen et al., 2022). Further progress in this domain is attributed to the introduction of sophisticated methodologies such as chain-of-thought (CoT) prompting (Wei et al., 2022) and its derivatives, like PAL (Gao et al., 2023), which incorporate intermediate reasoning steps to bolster performance on complex reasoning tasks. Nonetheless, in solving open-ended queries, agents still encounter two primary hurdles: the insufficiency of problem descriptions, which undermines agents’ comprehension; and the constraints imposed by the scope of reference material available in external knowledge bases, which, despite their extensive nature, present a limited range of consultable resources. Our Memory Sharing (MS) framework has been empirically validated to bolster agents’ proficiency in these respects, thereby enhancing their performance on open-ended queries.

2.2 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) was proposed to solve knowledge-intensive NLP tasks. It enhances LLMs’ capacity for generating accurate and timely content, with methodologies for source attribution, through direct input integration, attention blending, and output interpolation (Lewis et al., 2020; Ram et al., 2023; Shi et al., 2023). With classical retrieval methods such as BM25 (Luo et al., 2023; Liu et al., 2022) or SBERT (Reimers & Gurevych, 2019), the utterances most similar to the query are selected. This approach is further complemented by dense retrieval: feedback-driven dense retrievers trained through contrastive learning retrieve examples effectively and represent a significant stride towards maximizing the efficacy of in-context learning (Rubin et al., 2021). Building on this, repeating the contrastive learning process during the training of the retriever further improves its performance (Wang et al., 2023b). However, these retrievers are all trained once before being put into use, so they are not continually updated and may fail to retrieve the most relevant examples for new queries. The retriever in the MS framework, by contrast, is trained continuously: it is updated whenever a new memory is added. This keeps the retriever in a process of constant updating and evolution, so that the memories it retrieves gradually become more relevant.

3 The Memory Sharing framework

The Memory Sharing (MS) framework encompasses a suite of functionalities specifically designed for memory storage and retrieval. Section 3.1 describes the meaning and origin of a memory and, most importantly, the memory writing mechanism. This process incorporates a distinctive grading mechanism, enabling MS to select memories that not only bear relevance to the topical focus of the current agent but also possess a universal applicability, thereby facilitating their reference across agents within the same domain. Section 3.2 describes the retrieval mechanism in detail. This operation is instrumental in pinpointing the memories most apt for integration into novel prompts tailored to the immediate query, whilst concurrently leveraging the most recent, high-caliber new memories to refine model training. A toy example, showing memory storage and retrieval, is depicted in Fig. 2.

3.1 Memory Store

Figure 2: An example of how the Agent (Sonnet) cooperates with the MS framework. (1) + (2) The retriever takes the original query from the agent as input, retrieves suitable memories from the memory pool, and concatenates them with the query to form the prompt. (3) The Agent (Sonnet) takes the prompt, produces an answer, and packs the two as a (Prompt, Answer) pair. (4) The Scorer generates a score for the (Prompt, Answer) pair according to the designed rubric; pairs with high scores are added to the Memory Pool and are also sent to train the Retriever. All agents share the same Memory Pool; they can write memories into the pool and retrieve memories from the pool so that they can share memories with each other.

3.1.1 Memory Generation

Each memory is conceptualized as a (Prompt, Answer) pair. When a query is posed to the agents, the retriever determines, based on the strategy employed (e.g., three-shot learning), how many memories to source from the memory pool. These memories, in conjunction with the original query, are integrated to form an enhanced prompt. This composite prompt is then passed to the agent to elicit a corresponding answer. Subsequently, this enhanced prompt and the resultant answer constitute a memory candidate for potential inclusion in the memory pool, diverging from the conventional pairing of merely the original query and its output. This model adheres to a real-time memory integration framework, wherein the most recent prompt and answer of the agent are always considered for addition to the memory pool and for training the retrieval model. The content stored within the memory, generated by LLM-based agents, encompasses both answers produced by the agents themselves and prompts that incorporate information previously furnished by other agents. Consequently, aside from the initial query, the aggregated memory emanating exclusively from agent contributions serves to enhance the current agent’s comprehension of the full prompt. Additionally, the memory’s origins, stemming from agents aligned with a uniform overarching objective yet engaged in distinct specialized tasks, facilitate a multifaceted learning experience. This approach enables the agent to garner insights across various dimensions of open-ended queries. Furthermore, the dynamic expansion of the memory pool ensures a continuous influx of novel information, thereby enriching the agent’s knowledge base. Such a pool proves instrumental in addressing open-ended queries, as it equips the agent with a broadened perspective and a deeper understanding, pivotal for generating well-informed responses.
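As an illustration of the flow described above, the following minimal Python sketch forms a memory candidate from the enhanced prompt and the agent’s answer. The names `Memory`, `build_prompt`, and `generate_candidate`, the prompt layout, and the `retrieve` and `agent` callables are all hypothetical stand-ins introduced here; they are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Memory:
    prompt: str  # enhanced prompt: retrieved memories followed by the query
    answer: str  # the agent's answer to that enhanced prompt


def build_prompt(query: str, retrieved: List[Memory]) -> str:
    """Concatenate retrieved memories as in-context examples, query last."""
    parts = [f"Example prompt: {m.prompt}\nExample answer: {m.answer}"
             for m in retrieved]
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)


def generate_candidate(
    query: str,
    retrieve: Callable[[str, int], List[Memory]],  # hypothetical retriever
    agent: Callable[[str], str],                   # hypothetical LLM agent
    shots: int = 3,
) -> Memory:
    """Return the (Prompt, Answer) candidate for one query.

    Note that the stored prompt is the enhanced prompt, not the bare query.
    """
    retrieved = retrieve(query, shots)
    prompt = build_prompt(query, retrieved)
    return Memory(prompt=prompt, answer=agent(prompt))
```

With `shots=0` the sketch reduces to the zero-shot case, in which the enhanced prompt contains only the query.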

3.1.2 Memory Writing

Each emergent candidate memory is subject to a systematic evaluative procedure via scoring. Prior to this assessment phase, distinct grading rubrics are established for each domain, with the responsibility of grading delegated to the LLM itself. To facilitate the LLM’s comprehension of these rubrics, those rubrics are autonomously generated by the LLM, under the presumption that it will more adeptly grasp concepts of its own devising. Consequently, the establishment of these rubrics precedes the operational deployment of the entire framework to preclude the variability inherent in LLM-generated evaluations, which could lead to discrepancies and potentially compromise the fairness of memory assessments. Thus, all memories produced by the agent within a given domain are evaluated against this uniform set of criteria, ensuring consistency across the evaluation process. Prior to their official implementation, these rubrics undergo a manual review phase. This review not only assesses the relevance of potential memories to the agent’s current focal task but also evaluates their pertinence to other agents within the domain to ascertain their prospective utility. The rationale for not delegating the evaluation of this component to Large Language Models lies in the enhanced precision that manual screening provides, thereby refining the existing rubric to more accurately align with the unique requirements of agents, particularly concerning potential usage environments. Large Language Models, by their nature, may not comprehensively account for these nuanced demands simultaneously. Upon finalizing the appropriate rubric, it is integrated with newly generated memories for evaluation by the LLM-based scorer. Memories that surpass a predefined threshold, also determined by the LLM, are incorporated into the memory pool.
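The writing step can be sketched as a simple gate in Python. This is a minimal sketch under stated assumptions: `write_if_worthy` and `keyword_scorer` are hypothetical names, and the keyword-counting scorer is only a toy stand-in for the LLM-based scorer and its rubric, which the text does not specify in detail.

```python
from typing import Callable, List, Tuple

Memory = Tuple[str, str]  # (prompt, answer)


def write_if_worthy(
    candidate: Memory,
    rubric: str,
    scorer: Callable[[str, Memory], float],  # stand-in for the LLM scorer
    threshold: float,
    pool: List[Memory],
) -> bool:
    """Append the candidate to the pool only if its score exceeds the threshold."""
    score = scorer(rubric, candidate)
    if score > threshold:
        pool.append(candidate)
        return True
    return False


def keyword_scorer(rubric: str, candidate: Memory) -> float:
    """Toy scorer (assumption): counts rubric keywords found in the answer."""
    answer = candidate[1].lower()
    return float(sum(1 for word in rubric.lower().split() if word in answer))
```

Because the rubric and threshold are fixed before deployment, every candidate in a domain passes through the same gate, which mirrors the consistency argument made above.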

3.2 Memory Retrieval

Prior to the operational deployment of the MS, a curated small subset of instances was manually archived within the memory pool. These instances fulfill a dual purpose: firstly, they provide a diversified array of memories upon which each agent may experiment with novel prompts in the face of new queries; secondly, they constitute the preliminary training corpus for our retriever. This foundational training regimen mirrors the methodology by which subsequently archived memories will be assimilated into our model in real time, thereby facilitating the model’s ongoing adaptive learning and optimization.

3.2.1 Memory Train

Whenever a new memory $(X, Y)$ is about to be added to the pool, it is also used to train our retriever, which helps the retriever continuously update itself and adapt to new memories. Based on the newly generated memory $(X, Y)$, the classical method BM25 identifies the top-$n$ most pertinent candidate pairs $\{(x_i, y_i)\}_{i=1}^{n}$ from the diverse and extensive memory pool, denoted as $C$. Each candidate within $C$ undergoes an evaluation process utilizing the comprehensive scoring capabilities of LLMs. The scoring mechanism employed is defined by the following equation:

$$p(x_i, y_i) = P\big(\neg Y \mid (x_i, y_i),\, X\big), \quad i \in \{1, 2, \ldots, n\} \tag{1}$$

This equation gives, conditioned on an input-output pair $(x_i, y_i)$ in $C$, the probability that the response generated for the input $X$ of the new memory contradicts the output $Y$ of the new memory. This grading serves as a preparatory step for the subsequent labeling of each candidate example. It is noteworthy that using $\neg Y$ as the outcome is intended to ensure that the memories the retriever obtains from other agents have reference value without having to be the most relevant to the current question, so that they can help the current agent learn from new examples. This approach diverges from a simplistic reliance on $Y$ as the outcome, which tends to restrict the retrieval process to memories previously stored by the current agent.
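The candidate selection and scoring steps can be sketched in Python as follows. The BM25 function is a small self-contained implementation of the standard Okapi formula with whitespace tokenization; using the prompt $X$ of the new memory as the BM25 query, the parameter values `k1` and `b`, and the `prob_not_y` callable (a stand-in for the LLM call that estimates the probability in Eq. (1)) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Memory = Tuple[str, str]  # (prompt, answer)


def bm25_top_n(query: str, docs: List[str], n: int,
               k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return indices of the n documents with the highest Okapi BM25 score."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized) or 1.0
    df = Counter(term for toks in tokenized for term in set(toks))
    total = len(docs)
    query_terms = set(query.lower().split())  # each query term counted once
    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((total - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:n]]


def score_candidates(
    new_memory: Memory,
    pool: List[Memory],
    n: int,
    prob_not_y: Callable[[Memory, str, str], float],  # hypothetical LLM scorer
) -> List[Tuple[int, float]]:
    """Pick top-n candidates by BM25, then attach p(x_i, y_i) to each."""
    new_x, new_y = new_memory
    top = bm25_top_n(new_x, [prompt for prompt, _ in pool], n)
    return [(i, prob_not_y(pool[i], new_x, new_y)) for i in top]
```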

Within the defined set $C = \{(x_i, y_i)\}_{i=1}^{n}$, each candidate is now assigned a score. We sort the candidates from the lowest to the highest score and select $v$ memories in total to label. The top $\frac{v}{2}$ candidates in $C$ (those with the lowest scores) are identified as the pairs with reference value to $(X, Y)$, and accordingly their labels are set to positive. Conversely, the bottom $\frac{v}{2}$ candidates are deemed to have the least reference value to $(X, Y)$, and their labels are thus designated as negative. These $v$ memories are thereby differentiated and categorized according to their relevance and applicability to the context of the query. This methodical approach facilitates the identification and labeling of the pertinent memories for subsequent utilization. The labeled data are then used to minimize the following function:

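$$\mathrm{loss}(x, y) = -\frac{1}{v} \sum_{i=1}^{v} \left[ y_i \log\left(\frac{1}{1 + e^{-x_i}}\right) + (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-x_i}}\right) \right] \tag{2}$$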

Minimizing this loss enhances the retriever’s predictive accuracy, which is especially critical in handling an imbalanced memory pool. This strategic choice underscores our model’s preparedness to extract meaningful insights from varied memories, advancing our overarching goal of developing a robust and adaptable MS mechanism.
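The labeling rule and the loss can be illustrated with a short Python sketch. It assumes that, in the loss, $x_i$ denotes the retriever’s raw score (logit) for the $i$-th labeled candidate and $y_i \in \{0, 1\}$ its label, so that the loss is the standard binary cross-entropy; this reading, and the function names `label_candidates` and `bce_loss`, are interpretations introduced here rather than details confirmed by the text.

```python
import math
from typing import List, Tuple


def label_candidates(scores: List[float], v: int) -> List[Tuple[int, int]]:
    """Label v candidates: the v/2 lowest scores positive (1), the v/2 highest negative (0).

    Returns (candidate_index, label) pairs.
    """
    if v % 2 != 0 or v > len(scores):
        raise ValueError("v must be even and at most the number of candidates")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = v // 2
    positives = order[:half]
    negatives = order[len(order) - half:]
    return [(i, 1) for i in positives] + [(i, 0) for i in negatives]


def bce_loss(logits: List[float], labels: List[int]) -> float:
    """Binary cross-entropy over sigmoid(logit), averaged over the v examples."""
    total = 0.0
    for x, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-x))  # sigmoid
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(logits)
```

For a single positive example with logit 0, the sigmoid is 0.5 and the loss equals $\log 2$; the loss shrinks as the retriever scores positives higher and negatives lower.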

3.2.2 Prompt Construction from Memory

After encoding the original query and the memory in the pool, we employ cosine similarity to navigate through those memories. The selection criterion for the top-ranking memories is contingent upon the adopted retrieval strategy, which may vary from one-shot to multi-shot learning. Subsequent to the identification and selection of the most relevant memories, these are sequentially concatenated as in-context examples in the prompt, culminating with the integration of the initial query at the sequence’s end. This concatenated sequence thus forms a new prompt, which is subsequently furnished to the agent for processing. This structured procedure not only streamlines the retrieval of relevant information from the memory pool but also facilitates the generation of a contextually enriched prompt, poised to elicit a more informed response from the agent. The integration process employs memories specifically curated to elucidate the query at hand, with each memory being generated by agents operating within the same domain. This approach enables the current agents to process the query through a multifaceted lens, significantly enhancing their understanding. Furthermore, it fosters a scenario wherein agents, inspired by their intrinsic reasoning regarding the open-ended query, achieve a deeper comprehension of its underlying significance. This methodology not only broadens the scope of query interpretation but also enriches the agents’ response quality by leveraging domain-specific insights.
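The retrieval and concatenation steps can be sketched in Python as follows. The bag-of-words `embed` function is a toy stand-in for the trained retriever’s encoder, and the function names and the prompt layout are hypothetical; only the use of cosine similarity, the top-k selection, and the placement of the query at the end follow the description above.

```python
import math
from collections import Counter
from typing import List, Tuple

Memory = Tuple[str, str]  # (prompt, answer)


def embed(text: str) -> Counter:
    """Toy encoder (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: str, pool: List[Memory], k: int) -> List[Memory]:
    """Rank memories by similarity between the query and each stored prompt."""
    q = embed(query)
    ranked = sorted(pool, key=lambda m: cosine(q, embed(m[0])), reverse=True)
    return ranked[:k]


def construct_prompt(query: str, pool: List[Memory], k: int) -> str:
    """Concatenate the k retrieved memories as examples, with the query last."""
    blocks = [f"Prompt: {p}\nAnswer: {a}" for p, a in retrieve_top_k(query, pool, k)]
    blocks.append(query)
    return "\n\n".join(blocks)
```

Setting `k` to 1, 2, or 3 corresponds to the one-shot, two-shot, and three-shot strategies; `k = 0` returns the bare query.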

4 Experiments

Our experiments are based on GPT-3.5-Turbo (Brown et al., 2020). As evaluation metrics, we use BERTScore (Zhang* et al., 2020), ROUGE-2 (Lin, 2004), and ROUGE-L (Lin, 2004) to evaluate how the use of memory improves agent performance with respect to the relevance of meaning and the relevance of structure of the answers.
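For readers unfamiliar with the overlap metrics, the following Python sketch computes simplified ROUGE-N and ROUGE-L scores. It is not the official ROUGE package: it uses whitespace tokenization, no stemming, and a plain F1 combination of precision and recall, all of which are assumptions; the text does not state which ROUGE variant (precision, recall, or F-measure) was reported. BERTScore is omitted because it requires a pretrained model.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    """F1 over clipped n-gram overlap (n=2 gives a ROUGE-2 style score)."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    """F1 over the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    prev = [0] * (len(ref) + 1)
    for tok in cand:
        cur = [0]
        for j, ref_tok in enumerate(ref, start=1):
            if tok == ref_tok:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(cand), len(ref))
```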

| Agent | Zero ROUGE-2 | Zero ROUGE-L | Zero BERT | One ROUGE-2 | One ROUGE-L | One BERT | Two ROUGE-2 | Two ROUGE-L | Two BERT | Three ROUGE-2 | Three ROUGE-L | Three BERT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Limerick | 0.06 | 0.15 | 0.50 | 0.25 | 0.37 | 0.69 | 0.44 | 0.52 | 0.76 | 0.75 | 0.77 | 0.87 |
| Wuyanlvshi | 0 | 0 | 0.66 | 0 | 0 | 0.72 | 0 | 0 | 0.71 | 0 | 0 | 0.72 |
| Sonnet | 0.02 | 0.14 | 0.48 | 0.02 | 0.13 | 0.53 | 0.1 | 0.15 | 0.53 | 0.1 | 0.15 | 0.53 |
| Lateral-think | 0.07 | 0.19 | 0.53 | 0.09 | 0.21 | 0.51 | 0.09 | 0.25 | 0.56 | 0.09 | 0.26 | 0.59 |
| Pun | 0.27 | 0.43 | 0.61 | 0.20 | 0.35 | 0.64 | 0.30 | 0.43 | 0.67 | 0.24 | 0.37 | 0.70 |
| Riddle | 0.71 | 0.80 | 0.86 | 0.32 | 0.48 | 0.64 | 0.44 | 0.56 | 0.70 | 0.62 | 0.75 | 0.88 |
| Fitness | 0.02 | 0.06 | 0.46 | 0.04 | 0.15 | 0.61 | 0.06 | 0.18 | 0.64 | 0.07 | 0.19 | 0.65 |
| Study | 0.008 | 0.04 | 0.44 | 0.01 | 0.15 | 0.65 | 0.01 | 0.17 | 0.60 | 0.02 | 0.14 | 0.63 |
| Travel | 0.03 | 0.06 | 0.45 | 0.02 | 0.12 | 0.55 | 0.14 | 0.28 | 0.71 | 0.12 | 0.18 | 0.71 |

Table 1: Performance across agents utilizing different amounts of memory (zero, one, two, or three retrieved memories) for open-ended query execution. Each domain has its own Domain-pool shared among its three agents. The highest score within each agent of each metric is indicated by boldface.

| Pool | Metric | Limerick | Wuyanlvshi | Sonnet | Lateral-think | Pun | Riddle | Fitness | Study | Travel |
|---|---|---|---|---|---|---|---|---|---|---|
| Domain-pool | ROUGE-2 | 0.75 | 0.00 | 0.10 | 0.09 | 0.24 | 0.62 | 0.07 | 0.02 | 0.12 |
| Domain-pool | ROUGE-L | 0.77 | 0.00 | 0.15 | 0.26 | 0.37 | 0.75 | 0.19 | 0.14 | 0.18 |
| Domain-pool | BERT | 0.87 | 0.72 | 0.53 | 0.59 | 0.70 | 0.88 | 0.65 | 0.63 | 0.71 |
| Single-pool | ROUGE-2 | 0.05 | 0.00 | 0.01 | 0.06 | 0.26 | 0.60 | 0.02 | 0.005 | 0.02 |
| Single-pool | ROUGE-L | 0.12 | 0.00 | 0.10 | 0.19 | 0.43 | 0.71 | 0.11 | 0.07 | 0.10 |
| Single-pool | BERT | 0.60 | 0.68 | 0.49 | 0.54 | 0.70 | 0.80 | 0.62 | 0.63 | 0.58 |

Table 2: Agent performance with Domain-pool vs. Single-pool by utilizing three suitable memories for open-ended queries.

4.1 Experiment Details

We aim to assess the efficacy of the MS framework in processing open-ended queries across three principal domains: Literary Creation, Unconventional Logic Problem-solving, and Plan Generation. Within the Literary Creation domain, we have appointed three specialized agents responsible for generating Wuyanlvshi (a form of classical Chinese poetry), Limericks, and Sonnets, respectively. In the Logic Problem-solving domain, dedicated agents are tasked with addressing Lateral Thinking Puzzles, Riddles, and Puns. Meanwhile, for Plan Generation, we have developed agents to create Study Plans, Travel Plans, and Fitness Plans. For each agent, a consistent, small subset of pre-provided, complete instances was selected and incorporated into the memory pool for the initial phase of retriever training and prompt refinement. Subsequently, for each agent, an identical number of queries was introduced to increase the volume of real-time memory within the pool.

The evaluation of memory impact commenced with the implementation of different retrieval strategies, encompassing zero-shot, one-shot, two-shot, and three-shot learning. Subsequent to this preliminary phase, the investigation was divided into a qualitative analysis and a quantitative analysis. On the qualitative side, the study defined two distinct types of memory pools, Domain-pool and Single-pool. With the Domain-pool, a dedicated memory pool is allocated to each domain and shared by all agents within that domain, aiming at enhancing the integration of domain-specific memories. Conversely, the Single-pool integrates agents from all domains into a unified memory pool, facilitating an overarching analysis of cross-domain memory utilization. On the quantitative side, the experiment was segmented into five discrete phases, with each phase characterized by the addition of a predetermined quantity of new memories to the existing memory pool. At the end of each phase, an evaluation of agent performance was conducted to ascertain improvements or regressions. This two-part approach enabled a thorough exploration of the impact and applicability of memories across diverse domains, thereby fostering a comprehensive understanding of their differential effects.
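The difference between the two pool types can be made concrete with a small Python sketch. The class `MemoryPools` and its methods are hypothetical names introduced for illustration; the agent-to-domain mapping follows the experimental setup described above, with agent names as used in the tables.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Memory = Tuple[str, str]  # (prompt, answer)

# Agent-to-domain mapping taken from the experimental setup.
DOMAIN_OF = {
    "Wuyanlvshi": "Literary Creation",
    "Limerick": "Literary Creation",
    "Sonnet": "Literary Creation",
    "Lateral-think": "Logic Problem-solving",
    "Riddle": "Logic Problem-solving",
    "Pun": "Logic Problem-solving",
    "Study": "Plan Generation",
    "Travel": "Plan Generation",
    "Fitness": "Plan Generation",
}


class MemoryPools:
    """Routes reads and writes to a per-domain pool or to one shared pool."""

    def __init__(self, mode: str) -> None:
        if mode not in ("domain", "single"):
            raise ValueError("mode must be 'domain' or 'single'")
        self.mode = mode
        self._pools: Dict[str, List[Memory]] = defaultdict(list)

    def _key(self, agent: str) -> str:
        # Domain-pool: one pool per domain; Single-pool: one pool for all.
        return DOMAIN_OF[agent] if self.mode == "domain" else "all"

    def write(self, agent: str, memory: Memory) -> None:
        self._pools[self._key(agent)].append(memory)

    def visible_to(self, agent: str) -> List[Memory]:
        return list(self._pools[self._key(agent)])
```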

4.2 Experiment Analysis

The principal outcomes of our experiments are presented in Table 1, which reports the performance of each agent under various learning strategies within the Memory Sharing (MS) framework. Different strategies lead the agent to retrieve different numbers of memories to combine into the prompt. As more memories are used, the performance of most agents improves, which means that, within the same domain, different types of memory help the agent better understand the problem and generate more relevant answers, rather than interfering with the agent’s learning ability. Specifically, within the Literary Creation domain and the Plan Generation domain, all agents exhibited enhanced performance after utilizing the memory from other agents, as evidenced by both ROUGE and BERTScore metrics. This enhancement suggests that the shared memory enables agents to better comprehend the description of the literary work and produce more pertinent responses. However, for Wuyanlvshi, the performance did not change much, which may result from the different language used when storing memories. Unifying the language used may help improve the performance of MS in the future. Moreover, in the Unconventional Logic Problem-solving domain, while the ROUGE scores remained relatively lower than without memory, an improvement in BERTScore indicates a successful assimilation of diverse knowledge from the memory, leading to increasingly nuanced and semantically aligned outputs. In conclusion, the MS framework has facilitated continual performance improvement across agents through memory sharing, underscoring its potential utility.

Figure 3: Evaluating agent performance on open-ended queries using three suitable memories and Domain-pool with periodic updates.

Table 2 compares the scenarios in which all agents use the Domain-pool or the Single-pool under a three-shot learning strategy, since the prior experiments showed that most agents achieve their best performance under three-shot learning. Excluding Agent-Pun, all other agents exhibited diminished performance with the Single-pool. The data show a general decrease in performance relative to the scenario in which agents within the same domain use a distinct, domain-specific memory pool. This outcome implies that agents sharing analogous characteristics derive the greatest advantage from the exclusivity of a domain-specific memory pool, as the incorporation of cross-domain memories may detrimentally impact the learning efficiency of agents.

Furthermore, in the setting where three-shot learning is applied and each agent uses the Domain-pool, Figure 3 shows the variations in performance across individual agents as different proportions of newly generated memories are integrated into the pool. Specifically, for Agent-Travel, there is a consistent enhancement in performance. Conversely, for the majority of agents, an initial improvement in performance was observed as the volume of memories within the pool expanded, followed by a subsequent decline. This pattern suggests that, irrespective of the homogeneity of memory types, an excessive accumulation of memories may ultimately impede the agent’s learning efficacy and output quality. Therefore, determining the optimal capacity of a memory pool for a specific domain emerges as a pertinent question for future research. The fluctuations in performance observed for Agent-Wuyanlvshi are tentatively ascribed to the language difference.

5 Conclusions

We introduce a novel framework, Memory Sharing, which processes real-time memory via memory storage and retrieval. The findings suggest that augmenting the volume of high-quality memory enhances the ability of LLM-based agents to comprehend the nuances of questions and generate more pertinent responses to open-ended queries. The production of each high-caliber memory contributes not only to the augmentation of our memory pool but also to the recurrent training of the Retriever. This systematic process ensures that, as the memory pool keeps expanding dynamically, the Retriever maintains the capability to consistently identify and select the most pertinent memories for use by the agent. Regarding future research directions, it is posited that the efficacy of the Memory Sharing (MS) framework could be enhanced through the development of methodologies for determining the optimal size of the memory pool. Presently, MS has been studied only with all agents running on the same LLM. A comprehensive evaluation of the impact of deploying identical agents on a variety of foundation models (e.g., GPT-4, LLaMA-2, Claude-2) is merited. Such an approach would capitalize on memories acquired from diverse Large Language Models (LLMs). Furthermore, the potential integration of this framework within fine-tuning processes presents another avenue for exploration. These investigations into MS represent a progressive step towards harnessing real-time memory to augment the capabilities of LLM-based agents, offering substantial prospects for future research and practical applications within the domain of artificial intelligence.

References

  • Ahmed & Devanbu (2022) Toufique Ahmed and Premkumar Devanbu. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–5, 2022.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
  • He-Yueya et al. (2023) Joy He-Yueya, Gabriel Poesia, Rose E Wang, and Noah D Goodman. Solving math word problems by combining language models with symbolic solvers. arXiv preprint arXiv:2304.09102, 2023.
  • Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
  • Lampinen et al. (2022) Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329, 2022.
  • Levine et al. (2021) Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, and Amnon Shashua. The inductive bias of in-context learning: Rethinking pretraining example design. arXiv preprint arXiv:2110.04541, 2021.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Lin (2004) Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W04-1013.
  • Liu et al. (2022) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Eneko Agirre, Marianna Apidianaki, and Ivan Vulić (eds.), Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114, Dublin, Ireland and Online, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.deelio-1.10. URL https://aclanthology.org/2022.deelio-1.10.
  • Liu et al. (2023) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • Luo et al. (2023) Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. Dr. ICL: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128, 2023.
  • Mao et al. (2020) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553, 2020.
  • Mishra et al. (2021) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.
  • Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.
  • Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
  • Rubin et al. (2021) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633, 2021.
  • Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023.
  • Swanson et al. (2021) Ben Swanson, Kory Mathewson, Ben Pietrzak, Sherol Chen, and Monica Dinalescu. Story centaur: Large language model few shot learning as a creative writing tool. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 244–256, 2021.
  • Wang et al. (2023a) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
  • Wang et al. (2023b) Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164, 2023b.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • White et al. (2023) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382, 2023.
  • Wiegreffe et al. (2021) Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. Reframing human-AI collaboration for generating free-text explanations. arXiv preprint arXiv:2112.08674, 2021.
  • Wu et al. (2022) Tongshuang Wu, Michael Terry, and Carrie Jun Cai. AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2022.
  • Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
  • Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.

Appendix A Appendix

A.1 Rubrics for scoring

Figure 4: The rubrics shared by all agents when the Single-Pool is used.
Figure 5: The rubrics for agents in Literary Creation.
Figure 6: The rubrics for agents in Unconventional Logic Problem Solving.
Figure 7: The rubrics for agents in Plan Generation.

A.2 Prompt and Answer

Figure 8: An agent receives the query and creates a sonnet using the memory.

A.3 Dataset

Within each domain, examples were collected from the internet and used as the answers to the queries. The Wuyanlvshi were selected from renowned and historically significant poems in Chinese literature, chosen for their fame and widespread recognition. The sonnets in our study come from the “quarto” collection published by Shakespeare in 1609. The questions corresponding to each answer were generated by ChatGPT: the model was given the selected answers and instructed to generate relevant questions. Figure 9 presents an illustrative example of the question generation process based on the provided answers.

Figure 9: Example of generating a question from a given answer.
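The question-generation step described in this subsection can be sketched as a small prompt builder. This is a minimal Python sketch for illustration only: the prompt wording, the function names `build_question_prompt` and `make_memory`, and the `domain` parameter are assumptions and do not reproduce the exact prompt shown in Figure 9. The call to the chat model is deliberately left out.

```python
def build_question_prompt(answer: str, domain: str) -> str:
    """Build a prompt asking a chat model to write a question for a known answer.

    The wording below is hypothetical; it is not the prompt used in the paper.
    """
    answer = answer.strip()
    if not answer:
        raise ValueError("answer must not be empty")
    return (
        f"The following text is an answer from the domain '{domain}'.\n"
        "Write one question for which this text would be a suitable answer.\n\n"
        f"Answer:\n{answer}\n\n"
        "Question:"
    )


def make_memory(question: str, answer: str) -> dict:
    """Pair a generated question with its source answer as one (query, answer) record."""
    return {"query": question.strip(), "answer": answer.strip()}


# Usage: build the prompt for one collected answer; sending it to a model is omitted.
prompt = build_question_prompt("Shall I compare thee to a summer's day?", "sonnet")
print(prompt)
```

Once a model returns a question for the prompt, `make_memory` pairs it with the original answer, giving a (query, answer) record of the kind stored in the memory pool.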