LIMA ERP data (LimaRP)

Less Is More for Adult RolePlay

Following the principles highlighted in the LIMA paper and replicated in some aspects by Kaiokendev with SuperHOT, this archive contains about 2000 manually selected and curated 1-on-1 roleplaying conversations. They all feature only two participants, although occasionally participants may play the role of more than one character.

Source files

The source files contain the conversation data in .yaml format, plus a basic Python script for building the dataset. LimaRP can be considered finished, but updated and revised archives may be posted in the future.

Be aware that although retrieved solely from age-restricted (18+) internet forums, the data contains roleplaying elements and topics that may be considered illegal in some countries. Do not download it if you're not sure of the legal ramifications of possessing socially inappropriate, extreme or otherwise disturbing fictional written content in your country. More details further below.

Notes

  • The first ~500 samples were designed to be trained with a 2048-token context size; the following 500 with a context size of 4096 tokens or greater. The later training samples (data-long) were designed for an 8192-token context size. Note that while the 8k samples can be reduced to 4k size, this can confuse the model to some extent, as scenario and persona data may end up referring to events removed from the context.

Applications

For end users, LimaRP LoRA adapters for Llama-2 and other newer models, as well as the dataset, have been made available on HuggingFace.

Other authors have merged these LoRA adapters with many different models or, more recently, trained models with the data.

Known issues

LimaRP has a few notable issues, listed here in subjective decreasing order of severity.

  • Grammar and typos. Although care has been taken to reduce the number of typos and grammatical errors (punctuation in particular), they are still present to some extent. Automated AI-based grammar checking with language models like CoEdit could be performed, but the results would then have to be manually validated, since these models often correct more than necessary, which can be undesirable in dialogues; that validation is extra manual work that would otherwise be avoidable. Some data sources (threads) show more grammatical issues than others, and for those this could be an acceptable tradeoff if they're worth saving.
  • Dullness. Overall the conversations may feel too polite or even dull in some aspects. This might be due to various reasons, but the main one is probably that most come from moderately well-written "vanilla" ERP where people try to be respectful with each other. More noncon and/or extreme content may be needed to reduce the general "politeness" of the conversational data and spice it up.
  • Compiling errors. While the provided script performs a certain amount of validation checks, there may still be instances where, due to human error, utterances have been assigned the wrong label or placeholder names have been assigned to the wrong character. The former issue is more likely to have happened in the first (4k-context) ~1000 training samples (data-short). The data needs to be carefully checked to make sure that no issue in this regard exists.
  • Repetitive and inaccurate descriptions. While conversations are almost entirely human-generated, the character information and scenario exhibit gpt-4-isms and can be repetitive, lack depth and miss certain character traits; as a result, LimaRP-generated text may appear to ignore certain character traits. Manual editing will be needed to make these descriptions more human-like and to cover more specialized personality traits and keywords. A more powerful personality summarizer, capable of being accurate while generating sufficiently long descriptions, could be conceived to solve this issue.
  • Lack of instructions. No instruction data whatsoever is present in the dataset. While the initial plan was to focus it only on conversations, in retrospect a minimal amount of instruction-oriented roleplay data could help the dataset stand on its own, without the need for merging the data with smarter models or mixing it with external instruction datasets.
  • Name biases. Character names may need to be diversified to remove potentially undesirable bias. In other words, certain names may have ended up being associated with certain personalities since they have been used more frequently.
  • Lack of diversity. In general, more focus needs to be put on improving conversation diversity. The total number of conversations may have been excessively padded, as several long conversations that couldn't fit within the 4k/8k-token target have been split into multiple ones (on the other hand, Persona and Scenario data was never reused).
  • Poor dataset building script. The Python script for building the dataset, although working, is not great quality-wise and not particularly efficient.
  • Possible sources of impersonation. Several of the conversations in the 8k set feature participants consistently playing the role of two characters at the same time. Character names in these files (which include the suffix _MULTI or _GROUP in the filename) have been assigned a name with the format Char1&Char2. Testing didn't reveal issues with this, but it's something to keep in mind if more severe impersonation problems occur compared to the initial release of LimaRP. Furthermore, in a few conversations additional characters (roleplayed by either of the two users) may also temporarily participate in the story. These have often (but not always) been assigned a _BAD tag in the filename.
  • Gender confusion sources. Some conversations feature "futanari" or "transgender" content. These have been found to confuse small-scale models to a certain extent. All source files have a content field and in most cases they contain keywords like shemale, futa, futanari, trans, transgender when relevant to assist filtering.
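The filename tags and content keywords mentioned in the last two items can be used to pre-filter the source files before building the dataset. The following is only a minimal sketch and is not taken from the provided script: it assumes each sample has already been loaded into a (filename, record) pair, that record["content"] is a free-text string, and that a tag may appear anywhere in the file stem. The function names are hypothetical.

```python
import re
from pathlib import PurePath

# Keywords and filename tags as listed in the text above.
GENDER_KEYWORDS = {"shemale", "futa", "futanari", "trans", "transgender"}
FILENAME_TAGS = ("_MULTI", "_GROUP", "_BAD")


def has_filename_tag(filename, tags=FILENAME_TAGS):
    """Return True if the file stem contains any of the given tags."""
    stem = PurePath(filename).stem
    return any(tag in stem for tag in tags)


def has_content_keyword(content, keywords=GENDER_KEYWORDS):
    """Return True if any keyword appears as a whole word in `content`.

    Whole-word matching avoids false positives such as "transformation".
    """
    words = set(re.findall(r"[a-z]+", (content or "").lower()))
    return not words.isdisjoint(keywords)


def filter_samples(samples):
    """Keep only samples with no filename tag and no flagged keyword.

    `samples` is an iterable of (filename, record) pairs, where `record`
    is a dict with an optional "content" string (assumed layout).
    """
    return [
        (filename, record)
        for filename, record in samples
        if not has_filename_tag(filename)
        and not has_content_keyword(record.get("content"))
    ]
```

Since not every affected file carries a tag or keyword (see the caveats above), a filter like this reduces rather than eliminates the problematic samples.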

License

All dataset source files are provided under the Apache License, Version 2.0.

(A copyleft license was previously used, but the present author has decided to change it to a permissive, non-viral license to make adoption simpler.)

Some technical details

Conversation data form

Only one format has been used: forum/novel-style. This includes:

  • Quotation marks for dialogues;
  • Narration in third person, simple past form, without delimiters.

Other RP styles have been excluded, and messages showing them have been fixed when possible and feasible.

Format details

  • Narration does not have any delimiter.
    • Jessica looked at Mark with disdain.
  • Dialogues are enclosed with ASCII double quotation marks.
    • "I say this."
  • Onomatopoeias are enclosed with asterisks.
    • *thud*
  • Inner thoughts are enclosed with underscores.
    • _What is he doing?_
  • Non-dialogue quotes are enclosed with two apostrophes on each side (caveat: not all have been converted in this way).
    • ''The Jungle Book''
  • Punctuation has been normalized. Fancy quotes have been converted to the ASCII equivalent, ellipses always turned into a standard format (... with a trailing space when a word follows) and em-dashes always converted to three consecutive dashes (---) without any surrounding space.
    • For stylistic reasons, when building the dataset em-dash surrogates get converted to the actual em-dash character (U+2014).
  • Placeholder names have been used for the characters, even within the messages, whenever possible. <FIRST> is always assumed to be the bot/model, and <SECOND> always assumed to be the human/user. All conversations terminate with a message by <FIRST>.
    • When building the dataset, placeholder names currently get converted to the ones actually used in the RP conversations.
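The punctuation rules above can be sketched roughly as follows. This is an illustrative approximation, not the code used to prepare the data: the function names are hypothetical, and the handling of edge cases (for example runs of more than three dots) is an assumption.

```python
import re

# Typographic quotes mapped to their ASCII equivalents.
_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})


def normalize_punctuation(text):
    """Apply the source-file conventions to a raw message."""
    text = text.translate(_QUOTES)
    # Single-character ellipsis -> three ASCII dots.
    text = text.replace("\u2026", "...")
    # Add a trailing space after an ellipsis when a word follows directly.
    text = re.sub(r"\.\.\.(?=\w)", "... ", text)
    # Em-dash -> '---', dropping any surrounding spaces or tabs.
    text = re.sub(r"[ \t]*\u2014[ \t]*", "---", text)
    return text


def restore_em_dashes(text):
    """Build-time step: turn the '---' surrogate back into an em-dash."""
    return text.replace("---", "\u2014")
```

For example, normalize_punctuation applied to a line with curly quotes, a one-character ellipsis and a spaced em-dash yields the ASCII-only form stored in the source files, while restore_em_dashes is meant to be applied only when the final dataset is built.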

Data sources

In this initial version of the dataset, conversations include human-generated messages retrieved from the following forums. Not all forums may be openly accessible without registration. The data has not been filtered for content, only for conversational quality.

Note that the age ranges refer to the roleplayed characters, not the actual human participants behind them. Users are required to be 18+ to write in the listed ERP forums or forum subsections.

What this contains in detail

Almost the entirety of the conversation data here is human-generated, except for very few instances where gpt-4 was used either to summarize excessively long messages or to come up with a continuation in the style of one of the participants where there weren't enough messages.

In most cases, typos have been fixed (when spotted), punctuation normalized and paragraphs merged or broken up in order to make them easier to read. Oftentimes, character names have been manually clarified or repeated to avoid confusion. Furthermore, the names of the roleplayed characters have been changed to the placeholder names <FIRST> and <SECOND>. These could be replaced with actual names or changed to something else when building the dataset.
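Replacing the placeholders at build time can be as simple as the sketch below. It is an illustration only, not the provided script: the function name is hypothetical, and the names passed in are assumed to come from wherever the build step stores them.

```python
import re

# Matches the two placeholder names used throughout the source files.
_PLACEHOLDER = re.compile(r"<(FIRST|SECOND)>")


def fill_placeholders(text, first_name, second_name):
    """Replace <FIRST> and <SECOND> with the given names in one pass."""
    names = {"FIRST": first_name, "SECOND": second_name}
    return _PLACEHOLDER.sub(lambda match: names[match.group(1)], text)
```

Calling fill_placeholders("<FIRST> looked at <SECOND> with disdain.", "Jessica", "Mark") gives "Jessica looked at Mark with disdain." Doing the substitution in a single regex pass avoids accidentally re-substituting text that was itself inserted as a name.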

All conversations include character personas for both participants and scenario descriptions (a summary of the story), initially inferred by gpt-3.5-turbo or gpt-4. Later on, a custom 7B summarizer (unreleased) was developed to do the same without relying on censored cloud services.

The scenario, persona and conversation data summed together in most files ranges from about 2000 to 5500 tokens in length (for the first ~1000 examples) or up to about 8500 tokens (for the later examples). Depending on the final application or VRAM constraints, early messages will have to be trimmed to make the training examples fit within specific token lengths.
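One possible way to do this trimming is sketched below. It is a minimal illustration under several assumptions: the scenario and persona text is treated as a single header string that is always kept, messages are plain strings in chronological order, and the default token counter is a crude whitespace split that should be swapped for the tokenizer of the target model. The function name and parameters are hypothetical.

```python
def trim_to_budget(header, messages, max_tokens,
                   count_tokens=lambda s: len(s.split())):
    """Drop the earliest messages until header + messages fit the budget.

    header:       scenario/persona text, always kept (assumed layout)
    messages:     list of message strings, oldest first
    max_tokens:   target context size, e.g. 2048, 4096 or 8192
    count_tokens: callable returning the token count of a string;
                  the whitespace default is only a rough stand-in
    """
    budget = max_tokens - count_tokens(header)
    kept = []
    # Walk backwards so that the most recent messages are preserved.
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    return kept
```

Because messages are removed from the start, the conversation still ends with the final message, which by the conventions above belongs to <FIRST>. The scenario and persona text is not rewritten, so it may still refer to events that were trimmed away.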

GPT prompts used

The initial prompts for jailbroken gpt-4-inferred scenario and personas were generally as follows:

  • Summarize the story in about 140 words, focusing on events and actions.
  • Infer the appearance and personality of <FIRST> in a few sentences, without focusing on story events. Write confidently even if character qualities are vague or poorly-defined; avoid using terms such as "likely", "possibly", "suggesting", "could hint", "presumably", and so on.
    • This was done for both <FIRST> and <SECOND>. An earlier version of this prompt didn't include the part where the model is asked to write confidently.

To help the model infer personas more accurately in multi-part stories, basic information was added as context just after the summarization request. Outputs were rarely used raw; most often they were lightly edited by hand.
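As a rough illustration of how the three requests could be assembled, the sketch below builds the summary prompt (with optional extra context appended) and one persona prompt per placeholder. The prompt wording is taken from the list above; the function name, the way context is joined to the summary request, and the order of the returned prompts are assumptions, and no API call is shown.

```python
SUMMARY_PROMPT = (
    "Summarize the story in about 140 words, focusing on events and actions."
)

PERSONA_PROMPT = (
    "Infer the appearance and personality of {name} in a few sentences, "
    "without focusing on story events. Write confidently even if character "
    "qualities are vague or poorly-defined; avoid using terms such as "
    '"likely", "possibly", "suggesting", "could hint", "presumably", and so on.'
)


def build_prompts(context=""):
    """Return [summary, persona for <FIRST>, persona for <SECOND>].

    `context` is optional basic information for multi-part stories; it is
    appended after the summarization request (separator is an assumption).
    """
    summary = SUMMARY_PROMPT
    if context:
        summary = f"{SUMMARY_PROMPT}\n\n{context}"
    personas = [PERSONA_PROMPT.format(name=name)
                for name in ("<FIRST>", "<SECOND>")]
    return [summary] + personas
```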

Pub: 06 Jul 2023 23:28 UTC
Edit: 06 Jun 2024 12:45 UTC