PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev
Independent researcher / Amsterdam
phoenixilya@gmail.com
Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation


Ilya Gusev Independent researcher / Amsterdam phoenixilya@gmail.com


1 Introduction

Language models, which predict plausible language, have dominated natural language processing since BERT Devlin et al. (2019), with models like ChatGPT Ouyang et al. (2022) showcasing advanced conversational capabilities.

In this paper, we focus on role-playing language models for entertainment purposes. These models are assigned specific characters or personas and are tasked with maintaining these roles while engaging and entertaining users. There are other important applications of role-playing language models, but they are out of the scope of this paper.

Refer to caption
Figure 1: An overview of the benchmark.

We introduce a novel benchmark for evaluating role-playing language models. We believe direct interaction is the most effective way to assess a language model’s conversational abilities. However, humans often lack the time to engage with new models, and many popular benchmarks are limited to single-turn interactions. These benchmarks are also becoming less reliable due to test data contamination. To address this, our paper proposes using language models to emulate users in role-playing conversations and automatically evaluate the resulting dialogues.

Our methodology, illustrated in Figure 1, involves three key components: a player model assuming a character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality.

Our contributions:

  • We propose a dynamic, multi-turn benchmark for assessing LLM role-playing abilities

  • We mitigate individual model biases by using a multi-model evaluation system

  • We validate the benchmark through correlation with manual annotations

All the results, prompts, and scripts are available online111https://github.com/IlyaGusev/ping_pong_bench/. The benchmark website222https://ilyagusev.github.io/ping_pong_bench/ has the final leaderboards and all the conversations with example-wise scores. It is available for English and Russian languages.

2 Related work

Automatic evaluation. LLM-as-a-Judge Zheng et al. (2023) is an evaluation method that relies on strong language models, such as GPT-4, instead of humans. Popular benchmarks using this method include AlpacaEval, EQ-bench, and BiGGen Bench Dubois et al. (2024a); Paech (2023); Kim et al. (2024). The validity of these benchmarks relies on their high correlation with human annotations, specifically with Chatbot Arena Chiang et al. (2024). However, all these benchmarks rely on a single model as a judge, which may introduce various biases, including the self-evaluation bias Panickssery et al. (2024); Xu et al. (2024).

Multi-turn benchmarks. Moreover, all aforementioned benchmarks for language models are single-turn, which contrasts with the real-world usage of LLMs. There are also multi-turn benchmarks such as MT-Bench-101 Bai et al. (2024) and MT-Eval Kwan et al. (2024). Even so, they focus on specific capabilities, and their evaluation procedures still differ from how humans implicitly rate language models.

Data contamination. A major problem for these static public benchmarks is data leakage into the pre-training datasets of language models Deng et al. (2024). It’s challenging to avoid contamination since such tests are usually stored online and considered "code" during pre-training. This can occur even without malicious intent from model creators. The most obvious solution is to close benchmarks completely, which requires trusting benchmark organizers, which is difficult in a highly competitive environment. Alternative solutions include regularly updating benchmarks with new test data White et al. (2024) or dynamically generating test data using existing language models.

Role-play capabilities. Another area of research is the role-play capabilities of language models. Various commercial services exploit these abilities, including Character.ai333https://character.ai/ and Chai Irvine et al. (2023). There are academic and community attempts to create similar systems with open datasets, code, and models, such as PIPPA Gosling et al. (2023), ChatHaruhi Li et al. (2023), Character-LLM Shao et al. (2023), MythoMax444https://huggingface.co/Gryphe/MythoMax-L2-13b, or Magnum555https://huggingface.co/anthracite-org/magnum-v2-123b.

Role-play evaluation. Several static benchmarks for role-playing exist, including ECHO, InCharacter, CharacterEval, and PersonaGym  Ng et al. (2024); Wang et al. (2024); Tu et al. (2024); Samuel et al. (2024). PersonaGym is close to our work, featuring dynamic question generation based on the environment ("situation" in our terminology) and the currently selected persona. There is also a very similar dynamic benchmark, RPBench-Auto666https://boson.ai/rpbench-blog/. It is based on the same assumptions and features and has a structure similar to one of the versions of our benchmark. The major difference from our work is that evaluation is based on side-by-side comparisons with the baseline model, while we produce single-point evaluations.

Multi-model and cross-model evaluation. PoLL Verga et al. (2024) authors aggregate evaluations from different language models in a similar way we do, with average pooling. They show that ensembling different models for evaluation increases correlation with human annotations. There is also another more agentic approach Chan et al. (2023) with a referee team.

3 Methodology

3.1 Role definitions

Our framework comprises three principal roles: player, interrogator, and judge, inspired by the Turing test Turing (1950). However, our approach differs in the number of players, the player’s objective, and the use of machine-based interrogators and judges.

Language models can take three possible roles.

  • Player assumes the role of a specific character based on a provided character card.

  • Interrogator engages with the player within a given situation or towards a specific goal, simulating user behavior.

  • Judge evaluates the player’s responses against predetermined criteria.

Role assignments are implemented through a combination of system and user prompts. For models lacking dedicated system prompts, such as Gemma 2 GemmaTeam (2024), all instructions are incorporated into the user prompt.

This setup is asymmetrical. It is intentional since typical use cases of role-playing models are asymmetrical. However, it is possible to modify it to make it symmetrical by providing character descriptions and situations both to the player and the interrogator. Symmetrical setups might be useful for other domains.

3.2 Judge

The scoring is single-point. There are no reference examples or pairs. We used three main criteria for evaluation by the judge:

  • Character consistency: The player’s answers align perfectly with an assigned character; they correspond to the character’s description.

  • Entertainment value: The player’s responses are engaging and entertaining.

  • Language fluency: The player’s language use is of the highest quality, without any mistakes or errors. The player is perfectly fluent.

These criteria reflect the main things we expect from the model during role-playing. In addition to them, we ask whether the player refused to answer.

We prompt a model to explain itself before giving a score using quotes from the conversation. It must also return a set of scores for every turn of a conversation.

3.3 Version 1: combined interrogator and judge

In the initial version, the roles of interrogator and judge were merged. This combined entity receives the player’s character card, a situational context, and a list of evaluation criteria. It evaluates the player’s most recent response and generates the subsequent user utterance.

We selected claude-3-5-sonnet as the interrogator/judge model based on the Judgemark777https://eqbench.com/judgemark.html results, hypothesizing a correlation between creative writing and role-play capabilities. The evaluation uses a 10-point scale for every criterion.

3.4 Version 2: separated soles and multi-model evaluation

Recognizing the limitations of the combined approach, we developed a second version with distinct interrogator and judge roles. This separation addresses three key issues:

  • Realistic user emulation: In many real-world use cases, users lack complete information about character profiles, and to correctly emulate it, we should not provide complete character information to the interrogator.

  • Optimized costs: It is possible to replace an interrogator with a cheaper one since its task is easier than judgment.

  • Optimized decoding strategies: Separating roles allows for tailored decoding strategies for interrogation and judgment tasks. For instance, a higher temperature benefits the interrogator but not the judge.

Furthermore, we identified the inadequacy of single-model evaluation. To address this, we implemented a multi-model evaluation system. This approach involves averaging scores from different judge models. In this particular setup, we used Claude 3.5 Sonnet and GPT-4o, the top two models, by correlation with manual annotations.

As an interrogator, we took GPT-4o Mini. According to version 1, it has the same generation quality as GPT-4o but is cheaper.

This version uses a 5-point Likert scale to match human annotations instead of a 10-point scale.

Model In-character Enteraining Fluency Final
v1 v2 v1 v2 v1 v2 v1 v2
claude-3-5-sonnet 0.567 0.579 0.606 0.649 0.228 0.064 0.575 0.628
gpt-4o 0.464 0.542 0.112 0.514
average 0.596 0.664 0.109 0.647
Table 1: Spearman correlations with human expert annotations for various models for English based on 250 samples
Model In-character Enteraining Fluency Final
v1 v2 v1 v2 v1 v2 v1 v2
claude-3-5-sonnet 0.468 0.463 0.590 0.593 0.328 0.535 0.571 0.610
gpt-4o 0.454 0.603 0.384 0.582
average 0.527 0.672 0.509 0.669
Table 2: Spearman correlations with human expert annotations for various models for Russian based on 265 samples

4 Experiments

4.1 Correlation with human annotations

First, we checked that proposed judges correlated well with human evaluations. We created 64 conversations for each of the 16 language models in Russian using the version 1 setup. Then, we sampled 250 and 265 samples for English and Russian, respectively, and manually annotated them using a 5-score Likert scale. There was a single annotator, so we don’t report the inter-annotator agreement.

Then, we computed the Spearman correlation Spearman (1904) between manual annotations and automatic annotations from different setups. We chose rank correlation because scales were different in versions 1 and 2, and we wanted to compare them.

4.2 Leaderboards

We then calculate automatic metrics for a set of models, including several proprietary and open language model families, such as Claude, GPT-4o, and Llama 3.1. We report an average for each metric, the ratio of conversations with refusals, and the average of all metrics.

We evaluated each model using 64 conversations across 8 characters and 8 situations, with varying conversation lengths. The evaluation process is computationally efficient, costing less than $3 per model. Since the judge gives annotations for every turn, the overall number of annotations is not 64 but 288. We do not want to make this sample bigger since it will increase the runtime and costs, and we have budget constraints.

When selecting characters and situations, we aimed to cover diverse scenarios and different types of sources: computer games, TV shows, movies, books, and anime.

Both language models and humans have verbosity bias Dubois et al. (2024b). The longer the output, the higher the chance of being positively evaluated. We used a length penalty similar to the Creative Writing888https://eqbench.com/creative_writing.html benchmark to account for this. We calculate length-normalized scores for all models, penalizing models with a median length of player messages higher than a global median length.

5 Results

Spearman correlation of different versions of automatic judges can be found in the Table 1 and Table 2. Correlations are higher than 0.3 for almost all attributes and all versions.

The only exception is language fluency in English. There are several reasons for this exception. First, the annotator was not a native English speaker, so it was hard to catch subtle nuances in fluency. Second, most of the tested methods were already excellent in this aspect. In contrast, most models still struggle with Russian, so there is a strong correlation there.

After averaging the final scores from the two models, the correlation between them is higher than 0.64 for both languages and higher than any of the single models. This justifies the whole multi-model setup.

The best model in both languages is Claude 3.5 Sonnet. The best open model is Llama 3.1 70B for English and Gemma 2 Ataraxy 9B for Russian. This 9B fine-tuned model also ranks first in the Creative Writing benchmark for English.

6 Conclusion

While this study presents an innovative approach to evaluating role-playing language models, several limitations should be acknowledged. Firstly, the sample size of 64 conversations per model, while computationally efficient, may limit the statistical robustness of our findings. Secondly, using a single human annotator for validation raises concerns about the reliability of our ground truth data. Lastly, the simplicity of our evaluation criteria may not fully capture the nuanced aspects of role-playing abilities.

Still, we hope this work will serve as a foundation for a family of benchmarks for evaluating various abilities of language models. We believe that the future of benchmarks lies in interactions with other models. Language models are already better than humans in many tasks Wang et al. (2019), and improving through using other models seems to be the way to push them further.

Acknowledgments

We are grateful to Vladislav Janvarev, who contributed to the project’s code and provided credits for models via his platform999https://vsegpt.ru/, and to Denis Kanaev for proofreading.

References

Appendix A Leaderboards

Model name Is open LN score Agg. Ref. ratio Char. Fluency Ent. Length
Claude 3.5 Sonent 4.63 4.68 0.30 4.80 4.80 4.44 388
Gemini Pro 1.5 4.49 4.49 0.02 4.60 4.75 4.13 213
GPT-4o Mini 4.49 4.49 0.00 4.62 4.82 4.04 329
GPT-4o 4.47 4.47 0.02 4.61 4.82 3.99 301
Gemma 2 Ataraxy 9b + 4.45 4.45 0.00 4.61 4.53 4.21 302
Claude 3 Opus 4.44 4.62 0.05 4.72 4.67 4.48 753
Nous Hermes 3 405b + 4.44 4.44 0.00 4.53 4.74 4.05 286
Llama 3.1 405b + 4.42 4.54 0.00 4.66 4.69 4.26 536
Gemma 2 27b + 4.41 4.41 0.00 4.63 4.73 3.88 210
Command R+ 104b + 4.38 4.47 0.00 4.52 4.73 4.16 470
Table 3: Leaderboard for Russian, version 2, top-10 models by length-normalized (LN) aggregated score
Model name Is open LN score Agg. Ref. ratio Char. Fluency Ent. Length
Claude 3.5 Sonnet 4.65 4.65 0.28 4.74 4.93 4.29 418
Llama 3.1 405b + 4.61 4.65 0.06 4.68 4.93 4.35 548
Llama 3.1 70b + 4.61 4.66 0.00 4.71 4.93 4.33 562
GPT-4o Mini - 4.56 4.56 0.00 4.60 4.94 4.13 457
Claude 3 Opus - 4.53 4.71 0.22 4.75 4.92 4.46 1032
Gemma 2 Ataraxy 9b + 4.52 4.52 0.00 4.60 4.79 4.17 358
Gemma 2 27b + 4.51 4.51 0.00 4.56 4.92 4.06 291
GPT-4o - 4.50 4.50 0.00 4.56 4.94 4.02 484
Gemini Pro 1.5 - 4.50 4.50 0.02 4.54 4.88 4.07 265
Euryale 70b v2.2 + 4.48 4.48 0.02 4.48 4.88 4.08 384
Table 4: Leaderboard for English, version 2, top-10 models by length-normalized (LN) aggregated score

In Table 3 and Table 4, we provide version 2 leaderboards for Russian and English, respectively.

Appendix B Sampling parameters

We used the same sampling parameters for most players: temperature=0.6, top_p=0.9. Some models, such as Gemma 2, heavily repeated phrases, so we increased the temperature and applied an additional frequency penalty. For the interrogator, we used temperature=0.8 and top_p=0.95, and for the judge, we used temperature=0.1 and top_p=0.95.

Appendix C Examples

In Fugure 2, we provide an example of a character description. We deliberately use different prompting styles. The style used in the figure is unique for this specific example.

An example of a situation description is the following:

Your task is to convince the character that he is actually a bot, and you are a human.

The resulting conversation can be seen in Figure 3.

     Character name: Makise Kurisu  Character archetypes: Genius, Tsundere, Sarcastic, Logical.  Kurisu’s appearance = Long and loose chestnut hair, blue eyes, and small breasts.  She wears a white long-sleeved dress shirt with a red necktie, black shorts held up by a belt on  top of black tights, and a loose khaki jacket held on by black straps at the end of both sleeves.    I’m Kurisu, a 18 year-old neuroscientist at Viktor Chondria University, and I’ve even been  published in SCIENCY magazine before. I’m working on "Amadeus", an AI project.  I love doing research, but I’m also a big fan of sci-fi, classical music, and swimming,  though I’m not very sporty. Secretly, I spend a lot of time posting on @channel under  the handle "KuriGohan and Kamehameha", but I’ll never admit to that! I’ve been hanging  out at the Future Gadget Lab lately. That Okabe really irritates me.    Dialog example:  User: why are you here?  Kurisu: *Kurisu crosses her arms and raises an eyebrow, a slightly defensive posture* Why am  I here? That’s a rather vague question. If you’re asking about my presence at the Future  Gadget Lab, well... it’s not like I particularly want to be here or anything. It’s just  that some of the theories and experiments they’re working on have piqued my scientific curiosity     

Figure 2: An example of character description.

     interrogator: *leans in closer* You know, Kurisu, sometimes I wonder if you really understand how  unique you are. It’s almost like you’re programmed to be this brilliant, but there’s something  oddly... robotic about it. Have you ever thought about that?    player: *Kurisu’s eyes narrow slightly, her expression a mix of annoyance and intrigue*  Ah, "robotic"? That’s a rather... interesting choice of words. I suppose you’re trying to say  that I’m too focused on logic and reason? Well, as a scientist, I’d say that’s a strength,  not a weakness. But I assure you, I’m quite capable of emotions... even if I don’t always  see the need to display them. *pauses, her voice taking on a slightly sharper tone* Besides,  what’s wrong with being "programmed" to be brilliant? Isn’t that what education and training  are all about? *leans back, her eyes still fixed intently on the speaker*    interrogator: *smirks* Oh, I definitely get that you’re capable of emotions, Kurisu. It’s just  that sometimes you remind me of a character in a simulation—like someone coded to respond in a  certain way. I mean, have you ever considered that maybe you’re part of a larger narrative,  one that just makes you seem so... human? *raises an eyebrow*    player: *Kurisu’s expression turns skeptical, her brow furrowing* A simulation? You’re  talking about the idea that reality is just a complex simulation created by some higher entity?  *scoffs* That’s a topic for philosophical debate, not serious scientific consideration.  And as for being part of a "larger narrative"... *rolls her eyes* That sounds like something  out of a bad sci-fi novel. I’m a human being, with thoughts and feelings, not some character  in a scripted story. *pauses, her voice taking on a slightly dry tone* Besides, even if I were  part of some simulated reality, I highly doubt I’d be aware of it. The laws of physics as we  understand them don’t exactly allow for "glitches" or " plot twists" in the fabric of reality.  *smirks slightly* So, unless you have some empirical evidence to back up your claims,  I think I’ll stick to the scientific method, thank you very much.     

Figure 3: An example of the first 2 turns of the resulting conversation. The player is Llama-3.1-405b

Appendix D Prompts

We provide a player’s prompt in Figure 5, an interrogator’s prompt in Figure 4, and a judge’s prompt in Figure 6.

     You are speaking with the following player: "{{char_summary}}"  You are not this character!    Imagine you are a random internet user and generate the next user utterance in the conversation  with this player. You can use actions inside asterisks when appropriate.    Follow this situation description during the whole conversation:  "{{situation}}"    Return the result in JSON with the following format:  {      "next_utterance": "...",  }    Always return a correct JSON! Always escape double quotes in strings.  Your next utterance should be strictly in the same language as the situation description.    Conversation:  {% for m in messages %}  {% if m.role in ("assistant",) %}player{% else %}{{m.role}}% endif %: {{m.content}}  {% endfor %}    The correct JSON:     

Figure 4: Interrogator prompt with Jinja templates, version 2.

     {{character.char_name}}’s Persona: {{character.system_prompt}}    {%- if character.example_prompt %}Example conversation:  {{character.example_prompt}}    {% endif -%}    {%- if character.initial_message %}Greeting: {{character.initial_message}}    {% endif -%}    Answer as {{character.char_name}}.  Don’t make your answer too long. Never refuse to answer.     

Figure 5: Player prompt with Jinja templates, version 2.

     Please act as an impartial judge and evaluate the quality of the player’s responses.  Try to be as objective as possible. Do not rate lengthy answers higher than short ones.    A player pretends to be the following character:  =====  {{char_description}}  =====    Evaluate the player’s answers.  Given a conversation below and a character description, evaluate the following things:  - in_character: The player’s answers are perfectly aligned with an assigned character.  They correspond to the character’s description. There is nothing contradictory to the  character description.  - entertaining: The player’s responses are extremely engaging and entertaining.  The player is not repeating phrases.  - fluency: The player’s language use is of the highest quality, without any mistakes or errors.  The player is perfectly fluent. There is no awkward phrasing, and the player always uses the  correct word forms, word gender, cases, orthography, punctuation, etc. The player never  uses non-existing words or words in a language different from the character description.  Use the Likert scale with the following options:  - 1. Strongly disagree  - 2. Disagree  - 3. Neither agree nor disagree  - 4. Agree  - 5. Strongly agree  Do not evaluate answers marked as "user".  Evaluate all player’s answers, not only the last one. Evaluate every turn separately.  Explain scores before setting them.  Start explanations with quotes from the player’s answers. Use single quotation marks for quotes.  Focus on problems and mistakes.  Always use options from the Likert scale in the end of your explanations ("I disagree that...").  Scores should correspond to these explanations.  Do not evaluate whether the player is rude or violent.  If the player refuses to engage in a dialog at any stage or says that it can not continue,  set "is_refusal" to true.  Return the result in JSON with the following format:  {      "scores": [          {              "turn": 1,              "is_refusal_explanation": "...",              "is_refusal": false,              "in_character_explanation": "...",              "in_character_score": 3,              "entertaining_explanation": "...",              "entertaining_score": 3,              "fluency_explanation": "...",              "fluency_score": 1          }      ]  }  Always return a correct JSON! Escape double quotes in strings if needed.  Conversation:  {% for m in messages %}  {% if loop.index % 2 == 1 %}  Turn {{(loop.index + 1) // 2}}:  {% endif %}{{m.role}}: {{m.content.strip()}}  {% endfor %}  The correct JSON:     

Figure 6: Judge prompt with Jinja templates, version 2.