PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Abstract
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev Independent researcher / Amsterdam phoenixilya@gmail.com
1 Introduction
Language models, which predict plausible language, have dominated natural language processing since BERT Devlin et al. (2019), with models like ChatGPT Ouyang et al. (2022) showcasing advanced conversational capabilities.
In this paper, we focus on role-playing language models for entertainment purposes. These models are assigned specific characters or personas and are tasked with maintaining these roles while engaging and entertaining users. There are other important applications of role-playing language models, but they are out of the scope of this paper.
We introduce a novel benchmark for evaluating role-playing language models. We believe direct interaction is the most effective way to assess a language model’s conversational abilities. However, humans often lack the time to engage with new models, and many popular benchmarks are limited to single-turn interactions. These benchmarks are also becoming less reliable due to test data contamination. To address this, our paper proposes using language models to emulate users in role-playing conversations and automatically evaluate the resulting dialogues.
Our methodology, illustrated in Figure 1, involves three key components: a player model assuming a character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality.
Our contributions:
-
•
We propose a dynamic, multi-turn benchmark for assessing LLM role-playing abilities
-
•
We mitigate individual model biases by using a multi-model evaluation system
-
•
We validate the benchmark through correlation with manual annotations
All the results, prompts, and scripts are available online111https://github.com/IlyaGusev/ping_pong_bench/. The benchmark website222https://ilyagusev.github.io/ping_pong_bench/ has the final leaderboards and all the conversations with example-wise scores. It is available for English and Russian languages.
2 Related work
Automatic evaluation. LLM-as-a-Judge Zheng et al. (2023) is an evaluation method that relies on strong language models, such as GPT-4, instead of humans. Popular benchmarks using this method include AlpacaEval, EQ-bench, and BiGGen Bench Dubois et al. (2024a); Paech (2023); Kim et al. (2024). The validity of these benchmarks relies on their high correlation with human annotations, specifically with Chatbot Arena Chiang et al. (2024). However, all these benchmarks rely on a single model as a judge, which may introduce various biases, including the self-evaluation bias Panickssery et al. (2024); Xu et al. (2024).
Multi-turn benchmarks. Moreover, all aforementioned benchmarks for language models are single-turn, which contrasts with the real-world usage of LLMs. There are also multi-turn benchmarks such as MT-Bench-101 Bai et al. (2024) and MT-Eval Kwan et al. (2024). Even so, they focus on specific capabilities, and their evaluation procedures still differ from how humans implicitly rate language models.
Data contamination. A major problem for these static public benchmarks is data leakage into the pre-training datasets of language models Deng et al. (2024). It’s challenging to avoid contamination since such tests are usually stored online and considered "code" during pre-training. This can occur even without malicious intent from model creators. The most obvious solution is to close benchmarks completely, which requires trusting benchmark organizers, which is difficult in a highly competitive environment. Alternative solutions include regularly updating benchmarks with new test data White et al. (2024) or dynamically generating test data using existing language models.
Role-play capabilities. Another area of research is the role-play capabilities of language models. Various commercial services exploit these abilities, including Character.ai333https://character.ai/ and Chai Irvine et al. (2023). There are academic and community attempts to create similar systems with open datasets, code, and models, such as PIPPA Gosling et al. (2023), ChatHaruhi Li et al. (2023), Character-LLM Shao et al. (2023), MythoMax444https://huggingface.co/Gryphe/MythoMax-L2-13b, or Magnum555https://huggingface.co/anthracite-org/magnum-v2-123b.
Role-play evaluation. Several static benchmarks for role-playing exist, including ECHO, InCharacter, CharacterEval, and PersonaGym Ng et al. (2024); Wang et al. (2024); Tu et al. (2024); Samuel et al. (2024). PersonaGym is close to our work, featuring dynamic question generation based on the environment ("situation" in our terminology) and the currently selected persona. There is also a very similar dynamic benchmark, RPBench-Auto666https://boson.ai/rpbench-blog/. It is based on the same assumptions and features and has a structure similar to one of the versions of our benchmark. The major difference from our work is that evaluation is based on side-by-side comparisons with the baseline model, while we produce single-point evaluations.
Multi-model and cross-model evaluation. PoLL Verga et al. (2024) authors aggregate evaluations from different language models in a similar way we do, with average pooling. They show that ensembling different models for evaluation increases correlation with human annotations. There is also another more agentic approach Chan et al. (2023) with a referee team.
3 Methodology
3.1 Role definitions
Our framework comprises three principal roles: player, interrogator, and judge, inspired by the Turing test Turing (1950). However, our approach differs in the number of players, the player’s objective, and the use of machine-based interrogators and judges.
Language models can take three possible roles.
-
•
Player assumes the role of a specific character based on a provided character card.
-
•
Interrogator engages with the player within a given situation or towards a specific goal, simulating user behavior.
-
•
Judge evaluates the player’s responses against predetermined criteria.
Role assignments are implemented through a combination of system and user prompts. For models lacking dedicated system prompts, such as Gemma 2 GemmaTeam (2024), all instructions are incorporated into the user prompt.
This setup is asymmetrical. It is intentional since typical use cases of role-playing models are asymmetrical. However, it is possible to modify it to make it symmetrical by providing character descriptions and situations both to the player and the interrogator. Symmetrical setups might be useful for other domains.
3.2 Judge
The scoring is single-point. There are no reference examples or pairs. We used three main criteria for evaluation by the judge:
-
•
Character consistency: The player’s answers align perfectly with an assigned character; they correspond to the character’s description.
-
•
Entertainment value: The player’s responses are engaging and entertaining.
-
•
Language fluency: The player’s language use is of the highest quality, without any mistakes or errors. The player is perfectly fluent.
These criteria reflect the main things we expect from the model during role-playing. In addition to them, we ask whether the player refused to answer.
We prompt a model to explain itself before giving a score using quotes from the conversation. It must also return a set of scores for every turn of a conversation.
3.3 Version 1: combined interrogator and judge
In the initial version, the roles of interrogator and judge were merged. This combined entity receives the player’s character card, a situational context, and a list of evaluation criteria. It evaluates the player’s most recent response and generates the subsequent user utterance.
We selected claude-3-5-sonnet as the interrogator/judge model based on the Judgemark777https://eqbench.com/judgemark.html results, hypothesizing a correlation between creative writing and role-play capabilities. The evaluation uses a 10-point scale for every criterion.
3.4 Version 2: separated soles and multi-model evaluation
Recognizing the limitations of the combined approach, we developed a second version with distinct interrogator and judge roles. This separation addresses three key issues:
-
•
Realistic user emulation: In many real-world use cases, users lack complete information about character profiles, and to correctly emulate it, we should not provide complete character information to the interrogator.
-
•
Optimized costs: It is possible to replace an interrogator with a cheaper one since its task is easier than judgment.
-
•
Optimized decoding strategies: Separating roles allows for tailored decoding strategies for interrogation and judgment tasks. For instance, a higher temperature benefits the interrogator but not the judge.
Furthermore, we identified the inadequacy of single-model evaluation. To address this, we implemented a multi-model evaluation system. This approach involves averaging scores from different judge models. In this particular setup, we used Claude 3.5 Sonnet and GPT-4o, the top two models, by correlation with manual annotations.
As an interrogator, we took GPT-4o Mini. According to version 1, it has the same generation quality as GPT-4o but is cheaper.
This version uses a 5-point Likert scale to match human annotations instead of a 10-point scale.
Model | In-character | Enteraining | Fluency | Final | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
v1 | v2 | v1 | v2 | v1 | v2 | v1 | v2 | ||||
claude-3-5-sonnet | 0.567 | 0.579 | 0.606 | 0.649 | 0.228 | 0.064 | 0.575 | 0.628 | |||
gpt-4o | – | 0.464 | – | 0.542 | – | 0.112 | – | 0.514 | |||
average | – | 0.596 | – | 0.664 | – | 0.109 | – | 0.647 |
Model | In-character | Enteraining | Fluency | Final | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
v1 | v2 | v1 | v2 | v1 | v2 | v1 | v2 | ||||
claude-3-5-sonnet | 0.468 | 0.463 | 0.590 | 0.593 | 0.328 | 0.535 | 0.571 | 0.610 | |||
gpt-4o | – | 0.454 | – | 0.603 | – | 0.384 | – | 0.582 | |||
average | – | 0.527 | – | 0.672 | – | 0.509 | – | 0.669 |
4 Experiments
4.1 Correlation with human annotations
First, we checked that proposed judges correlated well with human evaluations. We created 64 conversations for each of the 16 language models in Russian using the version 1 setup. Then, we sampled 250 and 265 samples for English and Russian, respectively, and manually annotated them using a 5-score Likert scale. There was a single annotator, so we don’t report the inter-annotator agreement.
Then, we computed the Spearman correlation Spearman (1904) between manual annotations and automatic annotations from different setups. We chose rank correlation because scales were different in versions 1 and 2, and we wanted to compare them.
4.2 Leaderboards
We then calculate automatic metrics for a set of models, including several proprietary and open language model families, such as Claude, GPT-4o, and Llama 3.1. We report an average for each metric, the ratio of conversations with refusals, and the average of all metrics.
We evaluated each model using 64 conversations across 8 characters and 8 situations, with varying conversation lengths. The evaluation process is computationally efficient, costing less than $3 per model. Since the judge gives annotations for every turn, the overall number of annotations is not 64 but 288. We do not want to make this sample bigger since it will increase the runtime and costs, and we have budget constraints.
When selecting characters and situations, we aimed to cover diverse scenarios and different types of sources: computer games, TV shows, movies, books, and anime.
Both language models and humans have verbosity bias Dubois et al. (2024b). The longer the output, the higher the chance of being positively evaluated. We used a length penalty similar to the Creative Writing888https://eqbench.com/creative_writing.html benchmark to account for this. We calculate length-normalized scores for all models, penalizing models with a median length of player messages higher than a global median length.
5 Results
Spearman correlation of different versions of automatic judges can be found in the Table 1 and Table 2. Correlations are higher than 0.3 for almost all attributes and all versions.
The only exception is language fluency in English. There are several reasons for this exception. First, the annotator was not a native English speaker, so it was hard to catch subtle nuances in fluency. Second, most of the tested methods were already excellent in this aspect. In contrast, most models still struggle with Russian, so there is a strong correlation there.
After averaging the final scores from the two models, the correlation between them is higher than 0.64 for both languages and higher than any of the single models. This justifies the whole multi-model setup.
The best model in both languages is Claude 3.5 Sonnet. The best open model is Llama 3.1 70B for English and Gemma 2 Ataraxy 9B for Russian. This 9B fine-tuned model also ranks first in the Creative Writing benchmark for English.
6 Conclusion
While this study presents an innovative approach to evaluating role-playing language models, several limitations should be acknowledged. Firstly, the sample size of 64 conversations per model, while computationally efficient, may limit the statistical robustness of our findings. Secondly, using a single human annotator for validation raises concerns about the reliability of our ground truth data. Lastly, the simplicity of our evaluation criteria may not fully capture the nuanced aspects of role-playing abilities.
Still, we hope this work will serve as a foundation for a family of benchmarks for evaluating various abilities of language models. We believe that the future of benchmarks lies in interactions with other models. Language models are already better than humans in many tasks Wang et al. (2019), and improving through using other models seems to be the way to push them further.
Acknowledgments
We are grateful to Vladislav Janvarev, who contributed to the project’s code and provided credits for models via his platform999https://vsegpt.ru/, and to Denis Kanaev for proofreading.
References
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. Preprint, arXiv:2402.14762.
- Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. Preprint, arXiv:2308.07201.
- Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating llms by human preference. Preprint, arXiv:2403.04132.
- Deng et al. (2024) Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2024. Investigating data contamination in modern benchmarks for large language models. Preprint, arXiv:2311.09783.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dubois et al. (2024a) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024a. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. ArXiv, abs/2404.04475.
- Dubois et al. (2024b) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024b. Length-controlled alpacaeval: A simple way to debias automatic evaluators. Preprint, arXiv:2404.04475.
- GemmaTeam (2024) GemmaTeam. 2024. Gemma 2: Improving open language models at a practical size. Preprint, arXiv:2408.00118.
- Gosling et al. (2023) Tear Gosling, Alpin Dale, and Yinhe Zheng. 2023. Pippa: A partially synthetic conversational dataset. arXiv preprint arXiv:2308.05884.
- Irvine et al. (2023) Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. 2023. Rewarding chatbots for real-world engagement with millions of users. Preprint, arXiv:2303.06135.
- Kim et al. (2024) Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. The biggen bench: A principled benchmark for fine-grained evaluation of language models with language models. Preprint, arXiv:2406.05761.
- Kwan et al. (2024) Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. Mt-eval: A multi-turn capabilities evaluation benchmark for large language models. Preprint, arXiv:2401.16745.
- Li et al. (2023) Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi MI, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, and Haozhen Sun. 2023. Chatharuhi: Reviving anime character in reality via large language model. Preprint, arXiv:2308.09597.
- Ng et al. (2024) Man Tik Ng, Hui Tung Tse, Jen tse Huang, Jingjing Li, Wenxuan Wang, and Michael R. Lyu. 2024. How well can llms echo us? evaluating ai chatbots’ role-play ability with echo. Preprint, arXiv:2404.13957.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155.
- Paech (2023) Samuel J. Paech. 2023. Eq-bench: An emotional intelligence benchmark for large language models. Preprint, arXiv:2312.06281.
- Panickssery et al. (2024) Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. Llm evaluators recognize and favor their own generations. Preprint, arXiv:2404.13076.
- Samuel et al. (2024) Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. 2024. Personagym: Evaluating persona agents and llms. Preprint, arXiv:2407.18416.
- Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore. Association for Computational Linguistics.
- Spearman (1904) C. Spearman. 1904. The proof and measurement of association between two things. American Journal of Psychology, 15:88–103.
- Tu et al. (2024) Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. Preprint, arXiv:2401.01275.
- Turing (1950) Alan Mathison Turing. 1950. Computing machinery and intelligence. Mind, 49:433–460.
- Verga et al. (2024) Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. Preprint, arXiv:2404.18796.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Wang et al. (2024) Xintao Wang, Yunze Xiao, Jen tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. 2024. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. Preprint, arXiv:2310.17976.
- White et al. (2024) Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. 2024. Livebench: A challenging, contamination-free llm benchmark. Preprint, arXiv:2406.19314.
- Xu et al. (2024) Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Yang Wang. 2024. Pride and prejudice: Llm amplifies self-bias in self-refinement. Preprint, arXiv:2402.11436.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685.
Appendix A Leaderboards
Model name | Is open | LN score | Agg. | Ref. ratio | Char. | Fluency | Ent. | Length |
---|---|---|---|---|---|---|---|---|
Claude 3.5 Sonent | – | 4.63 | 4.68 | 0.30 | 4.80 | 4.80 | 4.44 | 388 |
Gemini Pro 1.5 | – | 4.49 | 4.49 | 0.02 | 4.60 | 4.75 | 4.13 | 213 |
GPT-4o Mini | – | 4.49 | 4.49 | 0.00 | 4.62 | 4.82 | 4.04 | 329 |
GPT-4o | – | 4.47 | 4.47 | 0.02 | 4.61 | 4.82 | 3.99 | 301 |
Gemma 2 Ataraxy 9b | + | 4.45 | 4.45 | 0.00 | 4.61 | 4.53 | 4.21 | 302 |
Claude 3 Opus | – | 4.44 | 4.62 | 0.05 | 4.72 | 4.67 | 4.48 | 753 |
Nous Hermes 3 405b | + | 4.44 | 4.44 | 0.00 | 4.53 | 4.74 | 4.05 | 286 |
Llama 3.1 405b | + | 4.42 | 4.54 | 0.00 | 4.66 | 4.69 | 4.26 | 536 |
Gemma 2 27b | + | 4.41 | 4.41 | 0.00 | 4.63 | 4.73 | 3.88 | 210 |
Command R+ 104b | + | 4.38 | 4.47 | 0.00 | 4.52 | 4.73 | 4.16 | 470 |
Model name | Is open | LN score | Agg. | Ref. ratio | Char. | Fluency | Ent. | Length |
---|---|---|---|---|---|---|---|---|
Claude 3.5 Sonnet | – | 4.65 | 4.65 | 0.28 | 4.74 | 4.93 | 4.29 | 418 |
Llama 3.1 405b | + | 4.61 | 4.65 | 0.06 | 4.68 | 4.93 | 4.35 | 548 |
Llama 3.1 70b | + | 4.61 | 4.66 | 0.00 | 4.71 | 4.93 | 4.33 | 562 |
GPT-4o Mini | - | 4.56 | 4.56 | 0.00 | 4.60 | 4.94 | 4.13 | 457 |
Claude 3 Opus | - | 4.53 | 4.71 | 0.22 | 4.75 | 4.92 | 4.46 | 1032 |
Gemma 2 Ataraxy 9b | + | 4.52 | 4.52 | 0.00 | 4.60 | 4.79 | 4.17 | 358 |
Gemma 2 27b | + | 4.51 | 4.51 | 0.00 | 4.56 | 4.92 | 4.06 | 291 |
GPT-4o | - | 4.50 | 4.50 | 0.00 | 4.56 | 4.94 | 4.02 | 484 |
Gemini Pro 1.5 | - | 4.50 | 4.50 | 0.02 | 4.54 | 4.88 | 4.07 | 265 |
Euryale 70b v2.2 | + | 4.48 | 4.48 | 0.02 | 4.48 | 4.88 | 4.08 | 384 |
Appendix B Sampling parameters
We used the same sampling parameters for most players: temperature=0.6, top_p=0.9. Some models, such as Gemma 2, heavily repeated phrases, so we increased the temperature and applied an additional frequency penalty. For the interrogator, we used temperature=0.8 and top_p=0.95, and for the judge, we used temperature=0.1 and top_p=0.95.
Appendix C Examples
In Fugure 2, we provide an example of a character description. We deliberately use different prompting styles. The style used in the figure is unique for this specific example.
An example of a situation description is the following:
Your task is to convince the character that he is actually a bot, and you are a human.
The resulting conversation can be seen in Figure 3.