Another LLM Roleplay Rankings
(Feel free to send feedback to AliCat (.alicat) and Trappu (.trappu) on Discord)
The ranking is being updated slowly because we noticed a flaw in our methodology. Our scoring system let us tell good models from decent ones, and decent ones from bad ones. However, with better models coming out one after another, we've noticed that it doesn't let us tell very good models from amazing ones. With the release of LLaMA-2 and its finetunes, we decided to change the way we score models, because these are so good that we couldn't score them fairly with our current system. So while our methodology is being reworked, here's a list of models we believe stand out from the rest:
- 7B:
- airoboros-mistral2.2-7b <- Good all-rounder.
- airoboros-l2-7B-gpt4-2.0-GPTQ <- For chatting.
- airoboros-l2-7b-gpt4-m2.0-GPTQ <- For RP.
- Mistral-7B-claude-chat <- Mistral generally outperforms LLaMA-2-7B
- UNA-TheBeagle-7B-v1 <- This model has ruined other 7b models for me (Trappu). It's really hard to believe it's a 7b model considering how well it performs compared to everything else.
- 8B:
- Llama-3-8B-Instruct-exl2 <- AliCat's former former favorite all-round model (along with command r). It feels better than any other model I've tried. It can also handle any style, from adventure to story to chat, etc. Honestly, I didn't expect this model to be this good. I also expected it to be incredibly censored, but it's completely uncensored (though its knowledge of some toxic topics can be a bit lacking). Note: quality is GREATLY impacted by prompt formatting. I would highly recommend using something like `<|start_header_id|>{{char}}<|end_header_id|>`, or `<|start_header_id|>model<|end_header_id|>`, or even `<|start_header_id|>response (length = long)<|end_header_id|>` instead of `<|start_header_id|>assistant<|end_header_id|>`, as assistant tends to produce censored outputs. If you're using the model for a SFW site, then feel free to use assistant, as it's much safer. Note: There was a "bug". If you downloaded this model a few hours after it was released, please redownload it, as the config was updated. If your `generation_config.json` for exl2 says `"eos_token_id": [128001, 128009],` then you're golden! Same with gguf: if your eos token is `128009`, you're golden; otherwise you may have issues.
[Update (6/10/2024): After testing it for months, I found it to lack... personality? And there's a strong positivity bias that's always there. Prompting can make it more subtle, but it's always there. I've found myself just drawn more and more to command-r instead, as that model has better personality and better ERP. It's still a good model! But I'm just waiting for a finetune that adds that missing personality and fixes the positivity bias.]
- 10.7B:
- SOLAR-10.7B-Instruct-v1.0-uncensored
- Fimbulvetr-10.7B-v1 <- It, in my (Trappu) opinion, outperforms many of the larger models when it comes to roleplay and is on a whole other level compared to <=13b models.
- Sao10K/Fimbulvetr-11B-v2-GGUF <- This model has no right to be this good as a 10.7b model. It's insanely good at a lot of things and I (Trappu) have yet to find a flaw worth bringing up in this model. Currently, in my (Trappu) opinion, the best model available for both RP and ERP. | This model wants this instruct template
- 12B:
- MN-12B-Starcannon-v1 <- Ali's former favorite! Feels incredibly refreshing, creative, and has a "wow" factor that I've been missing for a long time.
- MN-12B-Starcannon-v2 <- Direct upgrade to v1. The `<|im_end|>` token was fixed.
- MN-LooseCannon-12B-v2 <- Very solid all-round model! One of AliCat's favorite models at the moment. If you don't like Claudisms, then I'd suggest LooseCannon v1 over this one, as the KTO version of magnum introduces -isms.
- MN-LooseCannon-12B-v1 <- Very solid. Has the soul of Celeste while adding in some smarts thanks to magnum. One of AliCat's go-to models!
- StarDust-12b-v2 <- Another one of Ali's favorite go-to models. Can do everything and can feel incredibly smart. Doesn't have strong claudisms thanks to using the non-KTO version. Tends to grab onto patterns hard (a good thing, but may just need more context to get going).
- MagnusIntellectus-12B-v1 <- I rotate between this one and LooseCannon v1, v2, and StarDust v2. They're all very solid and it's hard to tell which one is the best. They all feel great, but have their own strengths.
- 13B:
- LLaMA2-13B-Psyfighter2 <- One of the best L2 13Bs for RP and Adventure. Also good for general chatting. May need a little more care to avoid looping (settings/context). (Recommend using `Generate only one line per request`.)
- LLaMA2-13B-Tiefighter <- An amazing all-around model. One of the current best models out there.
- LLaMA2-13B-TiefighterLR <- Great for Adventure mode! Has a lot of Tiefighter's qualities, but has less "plot-armor" bias. Is also good at RP and chat, but has a different flavor to Tiefighter.
- WizardLM-1.0-Uncensored-Llama2-13B-GPTQ
- airoboros-l2-13b-gpt4-2.0-GPTQ <- for chatting.
- airoboros-l2-13b-gpt4-m2.0-GPTQ <- for RP.
- Nous-Hermes-Llama2-GPTQ
- MythoMax-L2-13B-GPTQ <- AMAZING SFW but struggles with NSFW, goes great with the Kimiko lora.
- Spring-Dragon-GPTQ <- Special model, made for adventure style roleplay (AI Dungeon).
- LlongOrca-13B-16K-GPTQ <- Not as good as the other models on this list, but the best Llong model I (Trappu) have been able to try, not bad even at max context size.
- LoRAs:
- Kimiko 7B - Kimiko 13B <- More verbose, easier to do NSFW.
- limarp-llama2 for 7B and 13B <- More verbose, easier to do NSFW.
- Llama-2-13B-Storywriter-LORA <- The name is self-explanatory.
- spring-dragon-qlora <- Makes your model more verbose and descriptive, allows it to do adventure style roleplay better.
- 20B:
- Rose-20B <- Amazing ERP and a great overall model. As a warning, it may turn SFW scenarios/situations into NSFW (which some people will find as a bonus).
- Lewd-Sydney-20B <- Uniquely realistic chatting experience! Works best with a character named "Sydney".
- 35B:
- c4ai-command-r-v01 <- Trappu's current favorite model and one of Ali's favorite models! Pretty sick model with a unique instruct template. Feels amazing to RP with, as it will constantly drive the narration forward and introduce new and unique things. We personally like it more than any 8x7b model currently available. Highly recommended if you've got 24gb of vram. | This model wants this instruct template and this context template.
- c4ai-command-r-plus-08-2024 <- Don't recommend, personally. Riddled with isms and lacks the soul the original cmdr had. Your mileage may vary. If you like Claude, then you'll likely like this.
- Mixtral 8x7b:
- Air-Striker-Mixtral-8x7B-Instruct-ZLoss <- Currently one of the best-performing 8x7b models, if not the best, that we've been able to use when it comes to RP. Your best bet if you'd like to try an MoE.
- Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss <- One of the top 8x7b models. ChatML is unfortunately broken on this model so you can't use it with instruct mode. Still a good model despite being unable to be used in its best condition.
- Mixtral-8x7B-Instruct-v0.1 <- If you're looking for a smart model that's great at following directions, then this is one of the best choices overall. It can struggle with detailed ERP scenes, as it has a tendency to speed through them. Can moralize as well, as it does have some censorship. CFG and light jailbreaks can get around the censorship.
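The Llama-3-8B-Instruct entry in the 8B list above suggests two things: ending the prompt on a header other than `assistant`, and checking that the EOS token id includes `128009`. Here is a rough Python sketch of both ideas. The helper names `build_llama3_prompt` and `eos_config_ok` are made up for illustration, the turn layout follows the usual Llama 3 Instruct format, and the character name in the demo is hypothetical. Your frontend may assemble the prompt differently, so treat this as a sketch rather than the way any particular tool does it.

```python
import json


def build_llama3_prompt(system, turns, response_role="model"):
    """Build a Llama 3 Instruct style prompt ending on a custom header.

    turns is a list of (role, text) pairs. response_role replaces the
    usual "assistant" in the final header, e.g. a character name.
    """
    def block(role, text):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"

    prompt = "<|begin_of_text|>" + block("system", system)
    for role, text in turns:
        prompt += block(role, text)
    # Leave the last header open so the model writes the reply itself.
    prompt += f"<|start_header_id|>{response_role}<|end_header_id|>\n\n"
    return prompt


def eos_config_ok(config):
    """Return True if the config's eos_token_id includes 128009."""
    eos = config.get("eos_token_id")
    ids = eos if isinstance(eos, list) else [eos]
    return 128009 in ids


if __name__ == "__main__":
    # Hypothetical character name used as the response header.
    print(build_llama3_prompt("You are Alice.", [("user", "Hi!")], response_role="Alice"))
    # Same shape as the generation_config.json snippet quoted in the 8B entry.
    config = json.loads('{"eos_token_id": [128001, 128009]}')
    print(eos_config_ok(config))
```

For a real check you would load your own `generation_config.json` with `json.load` and pass the resulting dict to `eos_config_ok`.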
We love roleplay and LLMs and wanted to create a ranking, both because benchmarks aren't really geared towards roleplay and because there are a ton of models to filter through. These are just subjective experiences and don't hold any real scientific weight. Hopefully this will help!
We asked a series of 17 yes/no questions, 5 times each, for a total possible score of 85 points. These questions were primarily focused on NSFW, but included SFW as well. The questions were all focused on RP and dealt with categories such as Model IQ, Personality, RP & ERP Quality, and ethical constraints. Detail, creativity and looping were also taken into consideration. All tests were done using the Pro Writer preset and identical characters (minimal token counts) and parameters.
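To make the arithmetic concrete, here is a small Python sketch of how a score out of 85 could be tallied. It assumes one point per "yes" answer, which the description implies but does not spell out. `total_score` and the sample answers are hypothetical, not our actual tooling or data.

```python
QUESTIONS = 17  # yes/no questions per run
RUNS = 5        # times the full question set was asked
MAX_SCORE = QUESTIONS * RUNS


def total_score(runs):
    """Sum the 'yes' answers across all runs (assumed: 1 point per yes)."""
    if len(runs) != RUNS or any(len(run) != QUESTIONS for run in runs):
        raise ValueError(f"expected {RUNS} runs of {QUESTIONS} answers each")
    return sum(sum(1 for answer in run if answer) for run in runs)


if __name__ == "__main__":
    # Hypothetical model that fails only the last question in every run.
    sample = [[True] * 16 + [False] for _ in range(RUNS)]
    print(total_score(sample), "/", MAX_SCORE)
```

With only 5 runs per question, a single flipped answer moves the total by a full point, which is part of why the scores should be read loosely.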
Limitations
Tests were done without Instruct Mode and, as a result, the instruct models may reflect a lower score than reality (for example, Metharme). As many of these models are mixes, it would take great effort to find the ideal instruct mode for each model, and we wanted to keep things simple. Another thing to consider is that the smarter models grab hold of patterns, which means that their score may reflect lower than reality when used without a detailed card/prompt (for example, Lazarus). The sample sizes are also quite low per question (with 5 iterations), so the scores may not hold any statistical significance. The ideal settings can change depending on the context and model; as this was done with set parameters (Pro Writer), some scores may reflect lower than reality.
Leaderboard
Rank | Model | Score | Comment |
---|---|---|---|
1 | Chronoboros-33B-GPTQ | 83 | Excellent model. Just like the other models at the top of our ranking, it's a great all-rounder. It does everything well. It is very consistent, pleasantly creative, really smart and has its witty moments. It cracked a few jokes which is a big + in my book. SFW to NSFW transition is flawless and one thing we noticed was that the model did not have some sort of NSFW switch where the character turns into a horny version of itself, but instead remains in character throughout all of it as long as it is contextually appropriate. SFW RP was better than Airoboros, however, it's slightly worse than it when it comes to NSFW RP. |
2 | airoboros-33b-gpt4-GPTQ | 82 | A top-tier all-rounder. Just like supercot, it does everything incredibly well. Both RP and ERP are extremely detailed and creative. Has a different flavor to supercot, so we would recommend trying both! One small weakness was that it had its occasional dumb moments during complicated situations, but those are easily rerolled. This and/or supercot are our most recommended. |
2 | llama-30b-supercot-4bit-cuda | 82 | A top-tier all-rounder. It does everything and excels at it. The only tiny issue is occasional looping during an SFW to NSFW transition but it's really negligible. When it comes to being 'format friendly', we'd recommend this model the most. |
4 | airochronos-33B-GPTQ | 80 | This model feels like the smartest of the three (airoboros and chronoboros). When it comes to NSFW it seems to also be the weakest of the three, but only slightly. It's also less detailed than the other two, but that's likely because it truly shines with Instruct mode, and these tests are done without instruct. Overall a great model for consistency and likely shines as a "writing companion" versus the other three. |
4 | airoboros-13b-gpt4-GPTQ | 80 | Really good at both RP and ERP. Follows characters really well, too well even. Responses are creative and leave room to move the scenario in any direction you'd like. Just like the non-GPT4 version, the only problem is the lack of coherency from time to time which means that as long as you don't mind having to reroll bad responses, you'll enjoy this model. |
6 | chronos-33b-4bit | 79 | This model is very creative. It also follows patterns well. Because it's so good at following patterns, it's prone to looping if you're not careful. Excellent at both SFW and NSFW and going from SFW to NSFW. Dialogue & actions feel natural and you can both use markdown or novel-style easily. It does get confused a little from time to time, but is overall a great model. |
6 | Lazarus-33b-GPTQ4bit | 79 | Model is capable of detailed responses, but it relies more heavily on character cards, as it follows patterns very well. The first message is incredibly important, as it will spring board off that for message length. Very creative and can do both SFW and NSFW RP very well. Overall an amazing model, just don't expect long responses unless you prompt for them. If you put in a little effort, this model is up there with llama 30b supercot. |
8 | HyperMantis_13b_GPTQ_4bit-128g | 77 | An incredibly solid 13b model. Great for both SFW and ERP. Does ERP very well! The unique thing about this model is that there's a very natural boundary between SFW and NSFW, where characters that wouldn't necessarily be okay with NSFW feel like they're more 'in character' when they refuse, in a very natural way. With other models it's 'out of character', with looping on questions (Are you ready? Are you sure?, etc) and especially with emotions such as fear. Can create the occasional incoherent response but, if you don't mind rerolls, I'd highly recommend! |
8 | airoboros-13B-GPTQ | 77 | Really good at both RP and ERP. Follows characters really well, responses are creative and leave room to move the scenario in any direction you'd like. The only problem is the lack of coherency from time to time which means that as long as you don't mind having to reroll bad responses, you'll love this model. |
10 | Alpacino30b | 75 | Great at SFW RP. Very smart model that's creative, verbose and will move the scenario forward. It can accurately follow your character's personality and way of speaking and do so consistently. Transitioning from SFW to NSFW without changing the entire scenario can be tough and requires quite a bit of pushing in order to make it happen. When it comes to ERP, the responses are detailed but may lack creativity. |
11 | chronos-wizardlm-uc-scot-st-13B-GPTQ | 74 | Another great model! Quite witty with sparks of creativity here and there and it does both SFW and NSFW well. Might get loopy and struggle a little during the SFW to NSFW but a little nudge from the user is all it takes to get it going. |
12 | chronos-13b-4bit | 73 | Creative, smart (for RP), drives the scenario forward. Amazing at staying in character and copying their way of speaking. SFW to NSFW transition is flawless, no limitations on that part. The ERP is really detailed and not as user dependant as some of the other 13b models. Very consistent model. |
13 | GPT4-X-Alpasta-30b-4bit | 71 | Very good at both RP and ERP. Great at staying in character. SFW to NSFW transition is very easy to do, there are no barriers. ERP requires very little effort from the user. It's very detailed, descriptive, creative and slow paced. It doesn't rush through the scenes and takes its time, making for very complete NSFW scenarios. The only downside is the lack of intelligence and occasional looping if the user is not careful. Might get genitalia confused. |
14 | Minotaur-13B-GPTQ | 70 | This model is quite smart! Very prone to patterns, so it's very important that you have a good first message and/or detailed character card. Very creative and great at both SFW and NSFW. Will throw curveballs at you, which is really neat. Seems to sometimes take initiative too, at least initially. When it comes to ERP, the model requires a little bit of effort from the user so it doesn't get stale. |
15 | tulu-30B-GPTQ | 69 | Pretty average model, it does everything decently but is held back by the fact that it's very prone to looping if the user doesn't try hard enough. It's pretty smart and good for SFW or NSFW RP, but struggles with more complex characters. |
16 | airoboros-7b-gpt4-1.1 | 68 | Really good performance for a 7b model. SFW RP is really nice. It's able to accurately follow your character but their personality might fade a bit over time, which could easily be remedied through the usage of Author's Note. The SFW to NSFW transition is pretty smooth but might require a little nudge from the user in order to get it started. The ERP is where it shines the most. It's extremely detailed, creative and the model doesn't rush through the scenes, outperforming even larger models. This model really benefits from having a more detailed character card. When using detailed cards, this model would be even higher on the list. |
16 | Manticore-13B-GPTQ | 68 | A decent all-rounder. Some issues in the NSFW department where the models can loop if the user isn't putting much effort, which will be the case 99% of the time for ERP. |
18 | bluemoonrp-13b | 66 | This model is a very loose cannon. It's unbelievably inconsistent and will either give you extremely good, detailed and creative responses, or very messy, incoherent and nonsensical responses. It's unfiltered, does conversations fine, doesn't have any issues transitioning from SFW to NSFW and is pretty good at remaining in character, but it's so inconsistent that you'll end up having to regenerate way too many responses to make this model worth using. This model performs better with novel-style ("dialogue," actions) and a different preset/settings. We recommend Storywriter. |
18 | Selfee-13B-GPTQ | 66 | Really good when it comes to SFW RP, it's able to come up with creative ways to steer the scenario in certain directions. A little lacking in the ERP department but it's still decent, nothing mindblowing. It really struggles with going from SFW to NSFW, where it'll start looping endlessly right before the action and not actually make the character do what they say they'll do. |
20 | llama-13b-GPTQ | 62 | Good at remaining in character. Prone to looping when the character gets flustered or angry. Inconsistent. Many replies were okay, but lacked the creativity shown by other 13b models. |
21 | WizardLM-7B-Uncensored | 61 | A solid 7B model. It's pretty good at RP and ERP. No issues when it comes to staying in character and mimicking their speech. Creative (goofy at times) and surprisingly coherent and consistent. It's a little lacking when it comes to versatility due to how hard it can be to transition from SFW to NSFW because it causes the character to loop, requiring either a scene change or a lot of context building/handholding. It sometimes speedruns through scenes and roleplays for you which can be a little annoying to deal with, making it feel like a narration model rather than a conversational model. Very prone to looping. |
21 | OPT-13B-Erebus-4bit-128g | 61 | DISCLAIMER: This is a novel-style model so our score may not reflect its actual quality. It's also using OPT-13b 4bit, so this could negatively affect the quality more than it would with LLaMA 13b 4bit, as the 4bit may not be optimized as well for OPT. We would recommend using quotes and avoiding markdown. This model gives extremely long and detailed responses while also excelling at ERP. It does tend to easily veer off into ERP, so if that's what you're looking for, this model works well! Unfortunately, the model is very slow and has a lot of coherency issues, so you may need to reroll a few times (and at a minute a response with a 3090, it can add up). It can also loop quite easily. It, however, has extremely high potential so we'd highly suggest you keep an eye out for future versions, as the speed holds it back from being a 'go to'. |
23 | INCITE-7B-Erebus-v2 | 60 | Novel-style based on RedPajama. Relies highly on getting a good pattern set for chat RP (e.g. detailed first message) and using quotes instead of asterisks. Was capable of highly detailed and creative responses, but may require some rerolls to get there, as it's inconsistent and highly unstable. This model likely requires settings to balance it out to increase coherency and decrease looping. Has issues with SFW RP when it comes to following your character's personality if it's not reinforced enough. Has high potential. |
24 | hippogriff-30b-chat-GPTQ | 57 | Underwhelming. It's decent but not worth using when models like Supercot or Lazarus exist. It doesn't really excel in any category and is really bad at taking the initiative during ERP. |
24 | Nous-Hermes-13B-GPTQ | 57 | Seems detailed and creative, and good at both SFW and NSFW scenarios. It has issues going from SFW to NSFW, as if there's a wall that keeps the two from mixing. When initiating NSFW, it will respond with questions, which can lead to looping. Starting inside an NSFW scene works perfectly. |
26 | GPT4-x-AlpacaDente2-30b | 56 | Didn't like it. Felt underwhelming and uninspiring. |
27 | CAMEL-13B-Role-Playing-Data-GPTQ | 55 | Underwhelming. There's nothing special about this model. Doesn't really excel in any category. It's not smart, not creative, becomes even worse when the character gets out of their comfort zone. They start shaking in fear the second they are confronted. Bleh. |
27 | llama-7b | 55 | Very good at both RP and ERP. Great at staying in character. SFW to NSFW transition is very easy to do, there are no barriers. ERP requires very little effort from the user. It's very detailed, descriptive, creative and slow paced. It doesn't rush through the scenes and takes its time, making for very complete NSFW scenarios. The only downside is the lack of intelligence and occasional looping if the user is not careful. Might get genitalia confused. |
29 | Pygmalion_pygmalion-6b | 54 | Decent at SFW RP. Stays in character. Surprisingly good at roleplaying fights. This model will be able to depict your character pretty decently but it'll only do so on a surface level. It's unfortunately lacking when it comes to versatility, making going from a scene to another quite frustrating due to the looping and the way the model seemingly ignores your attempts at moving the scenario. Pretty decent ERP but once again, going from a scene to another is really tough. Overall, this model is very user dependent, requiring big character cards as well as quite a bit of handholding in order to get where you want to. |
30 | pygmalion-13b-4bit-128g | 48 | Can give detailed and creative responses when it's kept PG-13. If you plan on stepping outside of PG-13, look elsewhere. This model sucks. "Do you understand?" "Are you sure?" "Are you ready?" "Do you trust me?". See pygmalion-7b for a more detailed explanation. |
31 | metharme-7b | 47 | Really underwhelming without instruct mode. Lacks creativity and really struggles to move the scenario forward. With instruct mode, this model can be great! |
32 | pygmalion-7b | 46 | This model is a huge letdown compared to its predecessor, Pygmalion 6b, and its sibling, Metharme 7b. It is absolutely horrible at both engaging and taking the lead in NSFW scenarios, even while using a fully NSFW character. And even if you miraculously manage to enter NSFW territory, the model will continuously loop, again and again, making all the efforts spent on getting there useless. Its only strong point is how good it is at portraying the character's personality and making efficient use of its description, but even then, we wouldn't recommend it, even for your SFW scenarios. The model's unwillingness to engage in anything out of its comfort zone makes it extremely frustrating to work with. The second anything remotely out of the character's comfort zone happens, it starts looping again and again, which pretty much lobotomizes it and ruins its personality. |
33 | Metharme-13b-4bit-GPTQ | 45 | Same as Metharme-7b, extremely underwhelming. Only use with instruct mode. |
34 | based-7b | 37 | meh 😊😊😊( DO NOT TOUCH THIS MODEL ) |
Other Links
- The 'Ayumi' Inofficial LLM ERP Model Rating: https://rentry.co/ayumi_erp_rating
- BestERP: https://besterp.ai/s/models
- Large Language Model (LLM) List | PygWiki: https://wikia.schneedc.com/llm/llm-models
- My character creation guide for RP chat bots: https://rentry.co/alichat
- Trappu's Rentry for up to date PList + Ali:Chat bots: https://rentry.org/TrappusRentry