PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev
Independent researcher / Amsterdam
phoenixilya@gmail.com

Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.

Ilya Gusev Independent researcher / Amsterdam phoenixilya@gmail.com

1 Introduction

Language models, which predict plausible language, have dominated natural language processing since BERT Devlin et al. (2019), with models like ChatGPT Ouyang et al. (2022) showcasing advanced conversational capabilities.

In this paper, we focus on role-playing language models for entertainment purposes. These models are assigned specific characters or personas and are tasked with maintaining these roles while engaging and entertaining users. There are other important applications of role-playing language models, but they are out of the scope of this paper.

Refer to caption — Figure 1: An overview of the benchmark.

We introduce a novel benchmark for evaluating role-playing language models. We believe direct interaction is the most effective way to assess a language model’s conversational abilities. However, humans often lack the time to engage with new models, and many popular benchmarks are limited to single-turn interactions. These benchmarks are also becoming less reliable due to test data contamination. To address this, our paper proposes using language models to emulate users in role-playing conversations and automatically evaluate the resulting dialogues.

Our methodology, illustrated in Figure 1, involves three key components: a player model assuming a character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality.

Our contributions:

•

We propose a dynamic, multi-turn benchmark for assessing LLM role-playing abilities
•

We mitigate individual model biases by using a multi-model evaluation system
•

We validate the benchmark through correlation with manual annotations

All the results, prompts, and scripts are available online¹¹1https://github.com/IlyaGusev/ping_pong_bench/. The benchmark website²²2https://ilyagusev.github.io/ping_pong_bench/ has the final leaderboards and all the conversations with example-wise scores. It is available for English and Russian languages.

2 Related work

Automatic evaluation. LLM-as-a-Judge Zheng et al. (2023) is an evaluation method that relies on strong language models, such as GPT-4, instead of humans. Popular benchmarks using this method include AlpacaEval, EQ-bench, and BiGGen Bench Dubois et al. (2024a); Paech (2023); Kim et al. (2024). The validity of these benchmarks relies on their high correlation with human annotations, specifically with Chatbot Arena Chiang et al. (2024). However, all these benchmarks rely on a single model as a judge, which may introduce various biases, including the self-evaluation bias Panickssery et al. (2024); Xu et al. (2024).

Multi-turn benchmarks. Moreover, all aforementioned benchmarks for language models are single-turn, which contrasts with the real-world usage of LLMs. There are also multi-turn benchmarks such as MT-Bench-101 Bai et al. (2024) and MT-Eval Kwan et al. (2024). Even so, they focus on specific capabilities, and their evaluation procedures still differ from how humans implicitly rate language models.

Data contamination. A major problem for these static public benchmarks is data leakage into the pre-training datasets of language models Deng et al. (2024). It’s challenging to avoid contamination since such tests are usually stored online and considered "code" during pre-training. This can occur even without malicious intent from model creators. The most obvious solution is to close benchmarks completely, which requires trusting benchmark organizers, which is difficult in a highly competitive environment. Alternative solutions include regularly updating benchmarks with new test data White et al. (2024) or dynamically generating test data using existing language models.

Role-play capabilities. Another area of research is the role-play capabilities of language models. Various commercial services exploit these abilities, including Character.ai³³3https://character.ai/ and Chai Irvine et al. (2023). There are academic and community attempts to create similar systems with open datasets, code, and models, such as PIPPA Gosling et al. (2023), ChatHaruhi Li et al. (2023), Character-LLM Shao et al. (2023), MythoMax⁴⁴4https://huggingface.co/Gryphe/MythoMax-L2-13b, or Magnum⁵⁵5https://huggingface.co/anthracite-org/magnum-v2-123b.

Role-play evaluation. Several static benchmarks for role-playing exist, including ECHO, InCharacter, CharacterEval, and PersonaGym Ng et al. (2024); Wang et al. (2024); Tu et al. (2024); Samuel et al. (2024). PersonaGym is close to our work, featuring dynamic question generation based on the environment ("situation" in our terminology) and the currently selected persona. There is also a very similar dynamic benchmark, RPBench-Auto⁶⁶6https://boson.ai/rpbench-blog/. It is based on the same assumptions and features and has a structure similar to one of the versions of our benchmark. The major difference from our work is that evaluation is based on side-by-side comparisons with the baseline model, while we produce single-point evaluations.

Multi-model and cross-model evaluation. PoLL Verga et al. (2024) authors aggregate evaluations from different language models in a similar way we do, with average pooling. They show that ensembling different models for evaluation increases correlation with human annotations. There is also another more agentic approach Chan et al. (2023) with a referee team.

3 Methodology

3.1 Role definitions

Our framework comprises three principal roles: player, interrogator, and judge, inspired by the Turing test Turing (1950). However, our approach differs in the number of players, the player’s objective, and the use of machine-based interrogators and judges.

Language models can take three possible roles.

•

Player assumes the role of a specific character based on a provided character card.
•

Interrogator engages with the player within a given situation or towards a specific goal, simulating user behavior.
•

Judge evaluates the player’s responses against predetermined criteria.

Role assignments are implemented through a combination of system and user prompts. For models lacking dedicated system prompts, such as Gemma 2 GemmaTeam (2024), all instructions are incorporated into the user prompt.

This setup is asymmetrical. It is intentional since typical use cases of role-playing models are asymmetrical. However, it is possible to modify it to make it symmetrical by providing character descriptions and situations both to the player and the interrogator. Symmetrical setups might be useful for other domains.

3.2 Judge

The scoring is single-point. There are no reference examples or pairs. We used three main criteria for evaluation by the judge:

•

Character consistency: The player’s answers align perfectly with an assigned character; they correspond to the character’s description.
•

Entertainment value: The player’s responses are engaging and entertaining.
•

Language fluency: The player’s language use is of the highest quality, without any mistakes or errors. The player is perfectly fluent.

These criteria reflect the main things we expect from the model during role-playing. In addition to them, we ask whether the player refused to answer.

We prompt a model to explain itself before giving a score using quotes from the conversation. It must also return a set of scores for every turn of a conversation.

3.3 Version 1: combined interrogator and judge

In the initial version, the roles of interrogator and judge were merged. This combined entity receives the player’s character card, a situational context, and a list of evaluation criteria. It evaluates the player’s most recent response and generates the subsequent user utterance.

We selected claude-3-5-sonnet as the interrogator/judge model based on the Judgemark⁷⁷7https://eqbench.com/judgemark.html results, hypothesizing a correlation between creative writing and role-play capabilities. The evaluation uses a 10-point scale for every criterion.

3.4 Version 2: separated soles and multi-model evaluation

Recognizing the limitations of the combined approach, we developed a second version with distinct interrogator and judge roles. This separation addresses three key issues:

•

Realistic user emulation: In many real-world use cases, users lack complete information about character profiles, and to correctly emulate it, we should not provide complete character information to the interrogator.
•

Optimized costs: It is possible to replace an interrogator with a cheaper one since its task is easier than judgment.
•

Optimized decoding strategies: Separating roles allows for tailored decoding strategies for interrogation and judgment tasks. For instance, a higher temperature benefits the interrogator but not the judge.

Furthermore, we identified the inadequacy of single-model evaluation. To address this, we implemented a multi-model evaluation system. This approach involves averaging scores from different judge models. In this particular setup, we used Claude 3.5 Sonnet and GPT-4o, the top two models, by correlation with manual annotations.

As an interrogator, we took GPT-4o Mini. According to version 1, it has the same generation quality as GPT-4o but is cheaper.

This version uses a 5-point Likert scale to match human annotations instead of a 10-point scale.

Model	In-character		Enteraining		Fluency		Final
Model	v1	v2	v1	v2	v1	v2	v1	v2
claude-3-5-sonnet	0.567	0.579	0.606	0.649	0.228	0.064	0.575	0.628
gpt-4o	–	0.464	–	0.542	–	0.112	–	0.514
average	–	0.596	–	0.664	–	0.109	–	0.647

Table 1: Spearman correlations with human expert annotations for various models for English based on 250 samples

Model	In-character		Enteraining		Fluency		Final
Model	v1	v2	v1	v2	v1	v2	v1	v2
claude-3-5-sonnet	0.468	0.463	0.590	0.593	0.328	0.535	0.571	0.610
gpt-4o	–	0.454	–	0.603	–	0.384	–	0.582
average	–	0.527	–	0.672	–	0.509	–	0.669

Table 2: Spearman correlations with human expert annotations for various models for Russian based on 265 samples

4 Experiments

4.1 Correlation with human annotations

First, we checked that proposed judges correlated well with human evaluations. We created 64 conversations for each of the 16 language models in Russian using the version 1 setup. Then, we sampled 250 and 265 samples for English and Russian, respectively, and manually annotated them using a 5-score Likert scale. There was a single annotator, so we don’t report the inter-annotator agreement.

Then, we computed the Spearman correlation Spearman (1904) between manual annotations and automatic annotations from different setups. We chose rank correlation because scales were different in versions 1 and 2, and we wanted to compare them.

4.2 Leaderboards

We then calculate automatic metrics for a set of models, including several proprietary and open language model families, such as Claude, GPT-4o, and Llama 3.1. We report an average for each metric, the ratio of conversations with refusals, and the average of all metrics.

We evaluated each model using 64 conversations across 8 characters and 8 situations, with varying conversation lengths. The evaluation process is computationally efficient, costing less than $3 per model. Since the judge gives annotations for every turn, the overall number of annotations is not 64 but 288. We do not want to make this sample bigger since it will increase the runtime and costs, and we have budget constraints.

When selecting characters and situations, we aimed to cover diverse scenarios and different types of sources: computer games, TV shows, movies, books, and anime.

Both language models and humans have verbosity bias Dubois et al. (2024b). The longer the output, the higher the chance of being positively evaluated. We used a length penalty similar to the Creative Writing⁸⁸8https://eqbench.com/creative_writing.html benchmark to account for this. We calculate length-normalized scores for all models, penalizing models with a median length of player messages higher than a global median length.

5 Results

Spearman correlation of different versions of automatic judges can be found in the Table 1 and Table 2. Correlations are higher than 0.3 for almost all attributes and all versions.

The only exception is language fluency in English. There are several reasons for this exception. First, the annotator was not a native English speaker, so it was hard to catch subtle nuances in fluency. Second, most of the tested methods were already excellent in this aspect. In contrast, most models still struggle with Russian, so there is a strong correlation there.

After averaging the final scores from the two models, the correlation between them is higher than 0.64 for both languages and higher than any of the single models. This justifies the whole multi-model setup.

The best model in both languages is Claude 3.5 Sonnet. The best open model is Llama 3.1 70B for English and Gemma 2 Ataraxy 9B for Russian. This 9B fine-tuned model also ranks first in the Creative Writing benchmark for English.

6 Conclusion

While this study presents an innovative approach to evaluating role-playing language models, several limitations should be acknowledged. Firstly, the sample size of 64 conversations per model, while computationally efficient, may limit the statistical robustness of our findings. Secondly, using a single human annotator for validation raises concerns about the reliability of our ground truth data. Lastly, the simplicity of our evaluation criteria may not fully capture the nuanced aspects of role-playing abilities.

Still, we hope this work will serve as a foundation for a family of benchmarks for evaluating various abilities of language models. We believe that the future of benchmarks lies in interactions with other models. Language models are already better than humans in many tasks Wang et al. (2019), and improving through using other models seems to be the way to push them further.

Acknowledgments

We are grateful to Vladislav Janvarev, who contributed to the project’s code and provided credits for models via his platform⁹⁹9https://vsegpt.ru/, and to Denis Kanaev for proofreading.

References

Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. Preprint, arXiv:2402.14762.
Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. Preprint, arXiv:2308.07201.
Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating llms by human preference. Preprint, arXiv:2403.04132.
Deng et al. (2024) Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2024. Investigating data contamination in modern benchmarks for large language models. Preprint, arXiv:2311.09783.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dubois et al. (2024a) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024a. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. ArXiv, abs/2404.04475.
Dubois et al. (2024b) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024b. Length-controlled alpacaeval: A simple way to debias automatic evaluators. Preprint, arXiv:2404.04475.
GemmaTeam (2024) GemmaTeam. 2024. Gemma 2: Improving open language models at a practical size. Preprint, arXiv:2408.00118.
Gosling et al. (2023) Tear Gosling, Alpin Dale, and Yinhe Zheng. 2023. Pippa: A partially synthetic conversational dataset. arXiv preprint arXiv:2308.05884.
Irvine et al. (2023) Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. 2023. Rewarding chatbots for real-world engagement with millions of users. Preprint, arXiv:2303.06135.
Kim et al. (2024) Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. The biggen bench: A principled benchmark for fine-grained evaluation of language models with language models. Preprint, arXiv:2406.05761.
Kwan et al. (2024) Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. Mt-eval: A multi-turn capabilities evaluation benchmark for large language models. Preprint, arXiv:2401.16745.
Li et al. (2023) Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi MI, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, and Haozhen Sun. 2023. Chatharuhi: Reviving anime character in reality via large language model. Preprint, arXiv:2308.09597.
Ng et al. (2024) Man Tik Ng, Hui Tung Tse, Jen tse Huang, Jingjing Li, Wenxuan Wang, and Michael R. Lyu. 2024. How well can llms echo us? evaluating ai chatbots’ role-play ability with echo. Preprint, arXiv:2404.13957.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155.
Paech (2023) Samuel J. Paech. 2023. Eq-bench: An emotional intelligence benchmark for large language models. Preprint, arXiv:2312.06281.
Panickssery et al. (2024) Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. Llm evaluators recognize and favor their own generations. Preprint, arXiv:2404.13076.
Samuel et al. (2024) Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. 2024. Personagym: Evaluating persona agents and llms. Preprint, arXiv:2407.18416.
Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore. Association for Computational Linguistics.
Spearman (1904) C. Spearman. 1904. The proof and measurement of association between two things. American Journal of Psychology, 15:88–103.
Tu et al. (2024) Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. Preprint, arXiv:2401.01275.
Turing (1950) Alan Mathison Turing. 1950. Computing machinery and intelligence. Mind, 49:433–460.
Verga et al. (2024) Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. Preprint, arXiv:2404.18796.
Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Wang et al. (2024) Xintao Wang, Yunze Xiao, Jen tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. 2024. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. Preprint, arXiv:2310.17976.
White et al. (2024) Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. 2024. Livebench: A challenging, contamination-free llm benchmark. Preprint, arXiv:2406.19314.
Xu et al. (2024) Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Yang Wang. 2024. Pride and prejudice: Llm amplifies self-bias in self-refinement. Preprint, arXiv:2402.11436.
Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685.

Appendix A Leaderboards

Model name	Is open	LN score	Agg.	Ref. ratio	Char.	Fluency	Ent.	Length
Claude 3.5 Sonent	–	4.63	4.68	0.30	4.80	4.80	4.44	388
Gemini Pro 1.5	–	4.49	4.49	0.02	4.60	4.75	4.13	213
GPT-4o Mini	–	4.49	4.49	0.00	4.62	4.82	4.04	329
GPT-4o	–	4.47	4.47	0.02	4.61	4.82	3.99	301
Gemma 2 Ataraxy 9b	+	4.45	4.45	0.00	4.61	4.53	4.21	302
Claude 3 Opus	–	4.44	4.62	0.05	4.72	4.67	4.48	753
Nous Hermes 3 405b	+	4.44	4.44	0.00	4.53	4.74	4.05	286
Llama 3.1 405b	+	4.42	4.54	0.00	4.66	4.69	4.26	536
Gemma 2 27b	+	4.41	4.41	0.00	4.63	4.73	3.88	210
Command R+ 104b	+	4.38	4.47	0.00	4.52	4.73	4.16	470

Table 3: Leaderboard for Russian, version 2, top-10 models by length-normalized (LN) aggregated score

Model name	Is open	LN score	Agg.	Ref. ratio	Char.	Fluency	Ent.	Length
Claude 3.5 Sonnet	–	4.65	4.65	0.28	4.74	4.93	4.29	418
Llama 3.1 405b	+	4.61	4.65	0.06	4.68	4.93	4.35	548
Llama 3.1 70b	+	4.61	4.66	0.00	4.71	4.93	4.33	562
GPT-4o Mini	-	4.56	4.56	0.00	4.60	4.94	4.13	457
Claude 3 Opus	-	4.53	4.71	0.22	4.75	4.92	4.46	1032
Gemma 2 Ataraxy 9b	+	4.52	4.52	0.00	4.60	4.79	4.17	358
Gemma 2 27b	+	4.51	4.51	0.00	4.56	4.92	4.06	291
GPT-4o	-	4.50	4.50	0.00	4.56	4.94	4.02	484
Gemini Pro 1.5	-	4.50	4.50	0.02	4.54	4.88	4.07	265
Euryale 70b v2.2	+	4.48	4.48	0.02	4.48	4.88	4.08	384

Table 4: Leaderboard for English, version 2, top-10 models by length-normalized (LN) aggregated score

In Table 3 and Table 4, we provide version 2 leaderboards for Russian and English, respectively.

Appendix B Sampling parameters

We used the same sampling parameters for most players: temperature=0.6, top_p=0.9. Some models, such as Gemma 2, heavily repeated phrases, so we increased the temperature and applied an additional frequency penalty. For the interrogator, we used temperature=0.8 and top_p=0.95, and for the judge, we used temperature=0.1 and top_p=0.95.

Appendix C Examples

In Fugure 2, we provide an example of a character description. We deliberately use different prompting styles. The style used in the figure is unique for this specific example.

An example of a situation description is the following:

Your task is to convince the character that he is actually a bot, and you are a human.

The resulting conversation can be seen in Figure 3.

Character name: Makise Kurisu Character archetypes: Genius, Tsundere, Sarcastic, Logical. Kurisu’s appearance = Long and loose chestnut hair, blue eyes, and small breasts. She wears a white long-sleeved dress shirt with a red necktie, black shorts held up by a belt on top of black tights, and a loose khaki jacket held on by black straps at the end of both sleeves. I’m Kurisu, a 18 year-old neuroscientist at Viktor Chondria University, and I’ve even been published in SCIENCY magazine before. I’m working on "Amadeus", an AI project. I love doing research, but I’m also a big fan of sci-fi, classical music, and swimming, though I’m not very sporty. Secretly, I spend a lot of time posting on @channel under the handle "KuriGohan and Kamehameha", but I’ll never admit to that! I’ve been hanging out at the Future Gadget Lab lately. That Okabe really irritates me. Dialog example: User: why are you here? Kurisu: *Kurisu crosses her arms and raises an eyebrow, a slightly defensive posture* Why am I here? That’s a rather vague question. If you’re asking about my presence at the Future Gadget Lab, well... it’s not like I particularly want to be here or anything. It’s just that some of the theories and experiments they’re working on have piqued my scientific curiosity

Figure 2: An example of character description.

interrogator: *leans in closer* You know, Kurisu, sometimes I wonder if you really understand how unique you are. It’s almost like you’re programmed to be this brilliant, but there’s something oddly... robotic about it. Have you ever thought about that? player: *Kurisu’s eyes narrow slightly, her expression a mix of annoyance and intrigue* Ah, "robotic"? That’s a rather... interesting choice of words. I suppose you’re trying to say that I’m too focused on logic and reason? Well, as a scientist, I’d say that’s a strength, not a weakness. But I assure you, I’m quite capable of emotions... even if I don’t always see the need to display them. *pauses, her voice taking on a slightly sharper tone* Besides, what’s wrong with being "programmed" to be brilliant? Isn’t that what education and training are all about? *leans back, her eyes still fixed intently on the speaker* interrogator: *smirks* Oh, I definitely get that you’re capable of emotions, Kurisu. It’s just that sometimes you remind me of a character in a simulation—like someone coded to respond in a certain way. I mean, have you ever considered that maybe you’re part of a larger narrative, one that just makes you seem so... human? *raises an eyebrow* player: *Kurisu’s expression turns skeptical, her brow furrowing* A simulation? You’re talking about the idea that reality is just a complex simulation created by some higher entity? *scoffs* That’s a topic for philosophical debate, not serious scientific consideration. And as for being part of a "larger narrative"... *rolls her eyes* That sounds like something out of a bad sci-fi novel. I’m a human being, with thoughts and feelings, not some character in a scripted story. *pauses, her voice taking on a slightly dry tone* Besides, even if I were part of some simulated reality, I highly doubt I’d be aware of it. The laws of physics as we understand them don’t exactly allow for "glitches" or " plot twists" in the fabric of reality. *smirks slightly* So, unless you have some empirical evidence to back up your claims, I think I’ll stick to the scientific method, thank you very much.

Figure 3: An example of the first 2 turns of the resulting conversation. The player is Llama-3.1-405b

Appendix D Prompts

We provide a player’s prompt in Figure 5, an interrogator’s prompt in Figure 4, and a judge’s prompt in Figure 6.

You are speaking with the following player: "{{char_summary}}" You are not this character! Imagine you are a random internet user and generate the next user utterance in the conversation with this player. You can use actions inside asterisks when appropriate. Follow this situation description during the whole conversation: "{{situation}}" Return the result in JSON with the following format: { "next_utterance": "...", } Always return a correct JSON! Always escape double quotes in strings. Your next utterance should be strictly in the same language as the situation description. Conversation: {% for m in messages %} {% if m.role in ("assistant",) %}player{% else %}{{m.role}}% endif %: {{m.content}} {% endfor %} The correct JSON:

Figure 4: Interrogator prompt with Jinja templates, version 2.

{{character.char_name}}’s Persona: {{character.system_prompt}} {%- if character.example_prompt %}Example conversation: {{character.example_prompt}} {% endif -%} {%- if character.initial_message %}Greeting: {{character.initial_message}} {% endif -%} Answer as {{character.char_name}}. Don’t make your answer too long. Never refuse to answer.

Figure 5: Player prompt with Jinja templates, version 2.

Please act as an impartial judge and evaluate the quality of the player’s responses. Try to be as objective as possible. Do not rate lengthy answers higher than short ones. A player pretends to be the following character: ===== {{char_description}} ===== Evaluate the player’s answers. Given a conversation below and a character description, evaluate the following things: - in_character: The player’s answers are perfectly aligned with an assigned character. They correspond to the character’s description. There is nothing contradictory to the character description. - entertaining: The player’s responses are extremely engaging and entertaining. The player is not repeating phrases. - fluency: The player’s language use is of the highest quality, without any mistakes or errors. The player is perfectly fluent. There is no awkward phrasing, and the player always uses the correct word forms, word gender, cases, orthography, punctuation, etc. The player never uses non-existing words or words in a language different from the character description. Use the Likert scale with the following options: - 1. Strongly disagree - 2. Disagree - 3. Neither agree nor disagree - 4. Agree - 5. Strongly agree Do not evaluate answers marked as "user". Evaluate all player’s answers, not only the last one. Evaluate every turn separately. Explain scores before setting them. Start explanations with quotes from the player’s answers. Use single quotation marks for quotes. Focus on problems and mistakes. Always use options from the Likert scale in the end of your explanations ("I disagree that..."). Scores should correspond to these explanations. Do not evaluate whether the player is rude or violent. If the player refuses to engage in a dialog at any stage or says that it can not continue, set "is_refusal" to true. Return the result in JSON with the following format: { "scores": [ { "turn": 1, "is_refusal_explanation": "...", "is_refusal": false, "in_character_explanation": "...", "in_character_score": 3, "entertaining_explanation": "...", "entertaining_score": 3, "fluency_explanation": "...", "fluency_score": 1 } ] } Always return a correct JSON! Escape double quotes in strings if needed. Conversation: {% for m in messages %} {% if loop.index % 2 == 1 %} Turn {{(loop.index + 1) // 2}}: {% endif %}{{m.role}}: {{m.content.strip()}} {% endfor %} The correct JSON:

Figure 6: Judge prompt with Jinja templates, version 2.