ERP/RP and erotica raw data collection

RP/ERP forum scrapes

Listed below are archives containing RP (roleplaying) and ERP (erotic roleplaying) messages scraped from openly accessible forums, in CSV format. For non-story forums, average message length usually gives a good indication of general quality (i.e. the effort put into writing); on good-quality RP forums it is often well over 1 kB per message. Besides miscellaneous notes about data quality, each entry lists the message count, mean length, standard deviation and message length percentiles.

Keep in mind that most messages in the CSV files are in raw HTML and are not ready for use for LLM training or fine-tuning. Further processing will be needed for that.

Be careful

Apart from rather basic selection criteria (e.g. forum thread length), messages are completely unfiltered, so cleaning to remove PII and, where applicable, so-called CSAM will be necessary.

Site name Forum Section Quality Theme Link Message length statistics and notes
Bluemoon Roleplaying 🟢 Fan-based Roleplays (1x1) Good ~ Excellent Fandom RP+ERP .zip [Count:293778 Mean:944 Std:1366 25%:226 50%:540 75%:1177 90%:2173 95%:3131]
Bluemoon Roleplaying 🟢 General Original Roleplays (1x1) Very good RP+ERP .zip [Count:451593 Mean:997 Std:1493 25%:289 50%:594 75%:1201 90%:2200 95%:3146] somewhat variable quality/message length. Many empty threads.
Bluemoon Roleplaying 🟢 Group Roleplays Very Good RP+ERP .zip [Count:37458 Mean:1174 Std:2223 25%:309 50%:640 75%:1413 90%:2664 95%:3784]
Bluemoon Roleplaying 🟢 Taboo Original Roleplays (1x1) Good Fetish ERP .7z [Count:534299 Mean:1029 Std:1340 25%:265 50%:611 75%:1306 90%:2388 95%:3350] Several long threads contain mostly short messages.
Creative Freedom RPG 🟢 Alpha, Alpha 1x1, Beta, Beta 1x1, Omega, Omega 1x1 Very Good RP .7z A1: [Count:4635 Mean:2968 Std:3780 25%:1362 50%:2040 75%:3183 90%:5369 95%:7604] AG:[Count:4442 Mean:3574 Std:3263 25%:1697 50%:2628 75%:4223 90%:6852 95%:9376] B1:[Count:25158 Mean:1338 Std:2233 25%:609 50%:927 75%:1472 90%:2330 95%:3150] BG:[Count:13845 Mean:1868 Std:1880 25%:826 50%:1348 75%:2247 90%:3710 95%:4976] O1:[Count:31612 Mean:646 Std:598 25%:281 50%:515 75%:831 90%:1221 95%:1569] OG:[Count:13281 Mean:772 Std:1098 25%:234 50%:451 75%:885 90%:1694 95%:2451] Moderately good quality. Some messages may have encoding issues
Eka's Portal 🟢 Vore Roleplay Poor Vore RP+ERP .zip [Count:430574 Mean:318 Std:914 25%:73 50%:143 75%:286 90%:593 95%:1022] Threads mostly on-topic, but messages are often one-liners.
Elliquiy 🟢 Non-Adult Roleplays Good ~ Excellent RP .zip [Count:52496 Mean:653 Std:1113 25%:124 50%:322 75%:781 90%:1554 95%:2295] OOC threads present. Non-OOC threads very good.
Giant in the Playground 🟢 Ongoing Games (In-Character) Good Tabletop-like RP part1.7z part2.7z part3.7z part4.7z part5.7z part6.7z #1:[Count:159149 Mean:752 Std:955 25%:258 50%:520 75%:943 90%:1549 95%:2123] #2:[Count:333893 Mean:756 Std:1248 25%:226 50%:467 75%:901 90%:1571 95%:2269] #3:[Count:320330 Mean:818 Std:1451 25%:213 50%:468 75%:924 90%:1728 95%:2839] #04:[Count:384830 Mean:898 Std:1409 25%:238 50%:531 75%:1029 90%:1938 95%:3042] #5:[Count:464139 Mean:957 Std:1784 25%:292 50%:600 75%:1104 90%:1943 95%:2796] #6:[Count:378482 Mean:1026 Std:1647 25%:309 50%:638 75%:1178 90%:2091 95%:3027] #7:[Count:502020 Mean:1109 Std:2019 25%:331 50%:706 75%:1290 90%:2300 95%:3299] #8:[Count:488631 Mean:1105 Std:1808 25%:297 50%:675 75%:1291 90%:2339 95%:3410] #9:[Count:533131 Mean:1348 Std:2511 25%:367 50%:774 75%:1507 90%:2792 95%:4131] 3.5M messages in 9 files into 6 compressed archives. Generally consistently good prose, although the messages do not have a very long length on average. The theme is non-adult RP with table-top RPG rules or gaming elements which may need some care when parsing into a format suitable for training. Note 1: early posts from years 2005-2007 may have some formatting issues, e.g. showing & quot; (without the space) in the messages, or with raw BB-like [ooc] tags. Note 2: Some OOC threads and GM threads also present, denoted by related tags in the thread title. Note 3: important information (OOC, dice rolls, story info) can be under spoiler tags, don't discard them.
Literotica Forum 🟢 Online Roleplaying Poor RP+ERP .zip [Count:227973 Mean:936 Std:1469 25%:199 50%:557 75%:1121 90%:2083 95%:3076] Many Off-topic, OOC threads, short-reply threads. Occasionally good threads.
Literotica Forum 🟢 Sexual Roleplaying Poor ERP part1.7z part2.7z part3.7z part4.7z part5.7z #01:[Count:498048 Mean:963 Std:1130 25%:312 50%:648 75%:1221 90%:2067 95%:2838] #02:[Count:191958 Mean:814 Std:1030 25%:244 50%:522 75%:1036 90%:1784 95%:2456] #03:[Count:212865 Mean:729 Std:988 25%:198 50%:426 75%:890 90%:1632 95%:2382] #04:[Count:198527 Mean:760 Std:988 25%:239 50%:471 75%:921 90%:1647 95%:2327] #05:[Count:180607 Mean:802 Std:1039 25%:219 50%:514 75%:989 90%:1757 95%:2514] #06:[Count:158697 Mean:976 Std:1270 25%:285 50%:636 75%:1185 90%:2092 95%:3030] #07:[Count:146079 Mean:1080 Std:1261 25%:351 50%:744 75%:1354 90%:2305 95%:3197] #08:[Count:142542 Mean:1093 Std:1327 25%:395 50%:743 75%:1327 90%:2264 95%:3178] #09:[Count:173609 Mean:994 Std:1243 25%:303 50%:611 75%:1197 90%:2213 95%:3156] #10:[Count:182408 Mean:973 Std:1240 25%:301 50%:627 75%:1180 90%:2093 95%:2992] #11:[Count:207904 Mean:1074 Std:1396 25%:335 50%:674 75%:1296 90%:2364 95%:3335] #12:[Count:282249 Mean:1202 Std:1561 25%:327 50%:746 75%:1527 90%:2728 95%:3783] 2.5M messages in 12 parts into 5 compressed files. Many Off-topic, OOC threads, short-reply threads. Occasionally good threads. Message HTML needs some cleaning.
Menewsha 🟢 All 1x1 sections, All Group sections, OOC Poor ~ Mediocre General RP 1x1.7z group.7z OOC.7z 1x1 #1: [Count:191547 Mean:509 Std:688 25%:163 50%:308 75%:576 90%:1086 95%:1649] 1x1 #2: [Count:151791 Mean:512 Std:740 25%:136 50%:287 75%:620 90%:1190 95%:1697] 1x1 #3: [Count:172102 Mean:568 Std:954 25%:141 50%:258 75%:663 90%:1331 95%:1979] Group: [Count:304200 Mean:634 Std:1674 25%:130 50%:316 75%:707 90%:1404 95%:2079] OOC: [Count:171760 Mean:273 Std:1354 25%:56 50%:115 75%:228 90%:452 95%:761] 990K messages in total; relatively short messages in general. Threads from the OOC section are included in case some way can be found to link in-character threads to them. Because threads from multiple forum subsections share the same archive, the CSVs include an additional field giving the subforum location. Some messages may have encoding issues.
Menewsha 🟢 RP Archive Poor ~ Mediocre General RP .7z #1:[Count:230796 Mean:489 Std:672 25%:161 50%:319 75%:602 90%:1043 95%:1442] #2:[Count:200151 Mean:501 Std:835 25%:158 50%:323 75%:599 90%:1054 95%:1476] #3:[Count:205105 Mean:483 Std:817 25%:163 50%:309 75%:556 90%:989 95%:1421] #4:[Count:205770 Mean:608 Std:1099 25%:170 50%:388 75%:741 90%:1292 95%:1809] About 840K messages in total from a 2008-2011 archive of general-themed RP by young users. Prose might not always be good and RP style not consistent, often with html tags used just to make posts prettier. Message length generally short. Some messages may have encoding issues
Nation States 🟢 Portal to the Multiverse Mediocre Nations, Politics and Misc RP part1.7z part2.7z part3.7z #1:[Count:295729 Mean:1048 Std:1291 25%:406 50%: 714 75%:1233 90%:2130 95%:3047] #2:[Count:299668 Mean:1528 Std:1789 25%:572 50%:1020 75%:1850 90%:3201 95%:4469] #3:[Count:286762 Mean:1950 Std:2235 25%:739 50%:1335 75%:2381 90%:4060 95%:5619] #4:[Count:256867 Mean:2942 Std:4319 25%:859 50%:1665 75%:3340 90%:6348 95%:9504] Only the threads explicitly marked as "IC" (In-Character, RP) in the title were scraped, for about 1.1M messages in total into 3 compressed archives; still, not all messages might be RP-related. Noise and blockquotes need to be filtered. Message length excellent and even improved over time, but the general posting style might be difficult to adapt to the typical chatbot format.
Roleplay Adventures 🟢 All In-character and "Hall of Fame" subforums Mediocre (Variable) General RP + Soft ERP .7z #1 [Count:73660 Mean:973 Std:2666 25%:131 50%:401 75%:1057 90%:2220 95%:3457] #2 [Count:73551 Mean:1203 Std:2294 25%:306 50%:670 75%:1482 90%:2643 95%:3647] #3 [Count:90614 Mean:662 Std:2218 25%:110 50%:208 75%:447 90%:1443 95%:2707] 236K messages in total. A large portion of the messages is short, but a small subset is very long. Some OOC threads may be present. A handful of messages has encoding issues
Roleplay Gateway 🟢 Fanfics & Other Fiction-Based Roleplay Mediocre Fanfiction RP .zip [Count:141810 Mean:840 Std:1353 25%:241 50%:507 75%:989 90%:1848 95%:2683]
Roleplay Gateway 🟢 Fantasy Roleplay Mediocre Fantasy RP .zip [Count:265450 Mean:907 Std:1384 25%:230 50%:529 75%:1077 90%:2035 95%:2986]
Roleplay Gateway 🟢 Realistic Roleplay Mediocre General RP .zip [Count:204882 Mean:830 Std:1087 25%:263 50%:501 75%:989 90%:1840 95%:2645]
Roleplay Gateway 🟢 School-Based Roleplay Mediocre School life RP .zip [Count:41368 Mean:590 Std:730 25%:209 50%:419 75%:723 90%:1232 95%:1687] some good threads, but otherwise inconsistent RP style.
Roleplayer Guild 🟢 All roleplaying forums Excellent General RP+ERP part1.7z part2.7z part3.7z part4.7z part5.7z part6.7z part7.7z part8.7z This dataset is different compared to the others in that it includes within the same .csv files in-character (IC, i.e. actual roleplay), out-of-character (OOC) and Character Sheet messages for a total of about 3 million messages. As OOC and Sheets share the same base url/name with the IC threads, they can be reliably associated with them, if needed. Thread tags and an additional field identifying if the messages are part of IC, OOC or sheets are included. Possibly one of the best all-around RP datasets. Special usage notes: 1: @-mentions in the IC threads could be removed. 2: A markdown file with an extended explanation of thread tags is provided.
RP Nation 🟢 Group Poor ~ Good RP part1.7z part2.7z LOST part4.7z part5.7z #1:[Count:497833 Mean:649 Std:1454 25%:160 50%:344 75%:717 90%:1418 95%:2156] #2:[Count:383466 Mean:861 Std:1733 25%:188 50%:457 75%:977 90%:1950 95%:2978] #4:[Count:309836 Mean:2023 Std:2631 25%:524 50%:1230 75%:2582 90%:4719 95%:6467] #5:[Count:483754 Mean:1940 Std:3356 25%:424 50%:880 75%:2223 90%:4858 95%:7043] Part 3 missing due to problems while scraping; variable message quality and length
RP Nation 🟢 One on One (1x1) Poor ~ Good RP part1.7z part2.7z #1:[Count:574127 Mean:596 Std:1194 25%:101 50%:243 75%:599 90%:1409 95%:2374] #2:[Count:594005 Mean:1334 Std:2787 25%:284 50%:728 75%:1320 90%:3087 95%:4801] Variable quality that seemingly improved over time.
RPG Net 🟢 Roleplay-By-Post Play Forum Good Tabletop-like RP part1.7z.001 part2.7z.002 #1:[Count:140054 Mean:1274 Std:1605 25%:322 50%:854 75%:1548 90%:2797 95%:3996] #2:[Count:143877 Mean:1326 Std:1552 25%:346 50%:945 75%:1681 90%:2848 95%:3992] #3:[Count:147795 Mean:1306 Std:1699 25%:306 50%:865 75%:1607 90%:2876 95%:4101] #4:[Count:140932 Mean:1235 Std:1534 25%:308 50%:853 75%:1514 90%:2705 95%:3865] #5:[Count:144716 Mean:1167 Std:1409 25%:312 50%:885 75%:1453 90%:2454 95%:3382] #6:[Count:134337 Mean:1151 Std:1367 25%:282 50%:806 75%:1455 90%:2563 95%:3564] #7:[Count:145362 Mean:1547 Std:2344 25%:327 50%:922 75%:1764 90%:3405 95%:5169] #8:[Count:135931 Mean:1243 Std:1500 25%:315 50%:831 75%:1567 90%:2762 95%:3912] Only the in-character (RP) threads were scraped. In total, 1.1M messages in 8 .csv files compressed into 1 7-zip archive. General quality a bit variable, with OOC, "noise" and tabletop-like RPG dice roll data and so on that will have to be cleaned.
SpaceBattles 🟢 Roleplaying Good? Group fantasy / scifi / fandom RP (OOC threads) .7z.001 .7z.002 #0 [Count:207873 Mean: 865 Std:1362 25%:140 50%:762 75%:1093 90%:1755 95%:2456] #1 [Count:210103 Mean:1032 Std:1652 25%:301 50%:887 75%:1154 90%:1828 95%:2493] #2 [Count:211394 Mean:1096 Std:1959 25%:707 50%:926 75%:1209 90%:1907 95%:2556] #3 [Count:212016 Mean:1222 Std:2436 25%:760 50%:933 75%:1240 90%:1988 95%:2789] #4 [Count:212590 Mean:1215 Std:3062 25%:770 50%:926 75%:1202 90%:1905 95%:2639] #5 [Count:211059 Mean:1338 Std:3912 25%:767 50%:945 75%:1287 90%:2124 95%:3094] #6 [Count:209990 Mean:1488 Std:3702 25%:774 50%:947 75%:1310 90%:2313 95%:3767] #7 [Count:202896 Mean:1747 Std:4669 25%:747 50%:975 75%:1446 90%:2900 95%:5465] About 1.5M messages. Most messages are from OOC threads. Thread tags present for newer threads. Thread Labels may denote if they are OOC or IC threads.
SpaceBattles 🟢 Roleplaying IC Good Group fantasy / scifi / fandom RP part1.7z.001 part2.7z.002 part3.7z.003 part4.7z.004 #0 [Count:348316 Mean:1952 Std:4125 25%:283 50%:1031 75%:1932 90%:4222 95%:7019] #1 [Count:353124 Mean:1365 Std:2391 25%:294 50%: 948 75%:1454 90%:2657 95%:4056] #2 [Count:351046 Mean:1273 Std:2732 25%:225 50%: 901 75%:1313 90%:2335 95%:3622] #3 [Count:354673 Mean:1311 Std:3165 25%:241 50%: 917 75%:1354 90%:2427 95%:3724] #4 [Count:353331 Mean:1542 Std:2424 25%:792 50%:1053 75%:1670 90%:2896 95%:4306] #5 [Count:355101 Mean:1826 Std:3499 25%:841 50%:1106 75%:1881 90%:3418 95%:5263] #6 [Count:346398 Mean:2665 Std:7769 25%:884 50%:1316 75%:2483 90%:4935 95%:7869] #7 [Count:354917 Mean:3017 Std:8534 25%:982 50%:1419 75%:2643 90%:5086 95%:8439] About 2.8M messages, for the most part in-character / IC. Thread tags present for newer threads. Thread Labels may denote if they are OOC or IC threads. The group nature of the threads can make it difficult for them to be used for 1-on-1 chats.
Wolf RPG 🟢 Archives Very Good Wolf (Animal) RP .7z [Count:423174 Mean:1177 Std:669 25%:759 50%:1038 75%:1421 90%:1925 95%:2352] Messages not overly long, but consistently very good quality. Note 1: OOCs in this forum are most often placed under a special tag. These tags have been isolated and removed from the message body, then placed in a special message_ooc field in the CSV file. Note 2: content warnings (violence, swearing) have been removed from the messages. Note 3: Threads shorter than 1 reply have been filtered.

Usage notes

All message archive files are in .csv format. They can be easily processed in Python using the pandas data processing library as follows:

import pandas
# filename is the path to one of the extracted .csv files
data = pandas.read_csv(filename)

After doing so, most of the time the data will have 5 or more fields as in the following example:

data.loc[211]
Out[211]: 
thread_title                              new idea, but need some help
thread_href          https://forum.literotica.com/threads/new-idea-...
message_timestamp                               Nov 3, 2000 at 3:00 PM
message_username                                   husband's nightmare
message              I want to do a role play based o the movie Dre...
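
The length statistics quoted in the forum table (count, mean, standard deviation, percentiles) can be approximated with pandas. The following is only a sketch: it assumes lengths are measured in characters of the raw HTML message body, which may not be exactly how the figures in the notes were computed, and `length_stats` is a hypothetical helper name.

```python
import pandas as pd


def length_stats(messages: pd.Series) -> pd.Series:
    """Summarize message lengths, counted in characters of the raw body."""
    # Treat missing messages as empty strings (an assumption of this sketch).
    lengths = messages.fillna("").astype(str).str.len()
    # Same percentiles as those reported in the forum notes.
    return lengths.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95])


# Tiny in-memory stand-in for pandas.read_csv(filename)["message"].
sample = pd.Series(["<p>Hi</p>", "<p>A longer reply.</p>", None])
stats = length_stats(sample)
print(stats[["count", "mean", "std", "25%", "50%", "75%", "90%", "95%"]])
```

On a real archive, the call would be `length_stats(data["message"])`.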

Some details

  • Messages are ordered chronologically, first by thread and then by message within each thread. In other words, messages from the same thread are consecutive in the .csv files.
  • message contains the message body in HTML format. To be fully usable for fine-tuning LLMs it will likely have to be rendered as text; the Python library BeautifulSoup could be used for this.
  • message_timestamp has not been normalized and thus may have different formats depending on the forum.
  • Sometimes, thread labels or tags have been added too as separate fields.
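
As a rough illustration of the two points above about HTML bodies and non-normalized timestamps, here is a minimal sketch. The notes suggest BeautifulSoup; this example swaps in the standard library's `html.parser` so that it has no dependencies. The helper names and the list of timestamp formats are assumptions; only the first format is taken from the Literotica example record, and month-name parsing assumes an English locale.

```python
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, inserting a newline at block-level tags."""

    BLOCK_TAGS = {"p", "br", "div", "blockquote", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Render an HTML message body as plain text, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical candidate formats; each forum may need its own entry.
CANDIDATE_FORMATS = ("%b %d, %Y at %I:%M %p", "%Y-%m-%d %H:%M:%S")


def parse_timestamp(raw: str):
    """Try each known format in turn; return None if none of them match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


print(html_to_text("<p>Hello <b>world</b> &amp; co.</p><br>Bye"))
print(parse_timestamp("Nov 3, 2000 at 3:00 PM"))
```

Returning None for unknown formats keeps unparsed timestamps visible instead of guessing at them.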

Suggested strategy for parsing and filtering the files

  • Forums can have a lot of off-topic or non-RP threads. Non-RP threads usually include "OOC" in the thread title and could be filtered out.
    • For Elliquiy, it is suggested to filter out threads whose titles contain the strings OOC or O.O.C. (out of character, i.e. non-RP/ERP talk), or Character Sheet, Character Profile, Character Thread, Character List or Character Roster (these all contain character descriptions). Note, however, that IC or in-character means that the thread is on-topic, i.e. it contains proper role-play.
    • Note that sometimes Character and OOC can be in the same thread.
  • For privacy reasons it might be best to anonymize usernames and scramble URLs originating from the forum where the messages have been posted.
  • Typically, OOC (out-of-character) messages within on-topic threads are enclosed in double parentheses (( )) or in single parentheses with an OOC marker (OOC: ); more rarely, square brackets are used as well: [ ] or [OOC: ]. However, plain single parentheses ( ) without any other indication are often used for OOC too.
  • Threads tend to have a more or less consistent posting style. It is suggested to filter complete threads away if the posts contain too many one-liners.
  • Often a thread contains only the opening post, perhaps because it failed to attract attention. In principle these threads could be filtered out, but depending on the forum they are often high-effort posts.
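
The suggestions above could be combined into a single pandas pass, roughly as sketched below. This is an illustration under several assumptions: the `message` column has already been rendered from HTML to plain text, the title markers are the Elliquiy ones and may need extending for other forums, the 200-character median threshold is an arbitrary placeholder, and `clean_threads` is a hypothetical helper. Bare single parentheses are deliberately left alone because they are ambiguous.

```python
import hashlib
import re

import pandas as pd

# Title markers from the Elliquiy suggestion; extend per forum as needed.
NON_RP_TITLE = r"\bOOC\b|O\.O\.C\.|Character (?:Sheet|Profile|Thread|List|Roster)"
# Explicitly marked OOC asides only: ((...)), (OOC: ...), [OOC: ...].
OOC_ASIDE = r"\(\(.*?\)\)|\(OOC:.*?\)|\[OOC:.*?\]"


def _pseudonym(value) -> str:
    """Replace a username or URL with a short, stable hash."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def clean_threads(df: pd.DataFrame, min_median_len: int = 200) -> pd.DataFrame:
    # 1. Drop threads whose title marks them as OOC or character sheets.
    is_non_rp = df["thread_title"].str.contains(NON_RP_TITLE, case=False, na=False)
    df = df[~is_non_rp].copy()

    # 2. Strip explicitly marked OOC asides from the remaining messages.
    df["message"] = (
        df["message"]
        .fillna("")
        .str.replace(OOC_ASIDE, "", regex=True, flags=re.IGNORECASE | re.DOTALL)
        .str.strip()
    )

    # 3. Drop whole threads dominated by one-liners (low median length).
    lengths = df["message"].str.len()
    median_len = lengths.groupby(df["thread_href"]).transform("median")
    df = df[median_len >= min_median_len].copy()

    # 4. Pseudonymize usernames and scramble thread URLs.
    df["message_username"] = df["message_username"].map(_pseudonym)
    df["thread_href"] = df["thread_href"].map(_pseudonym)
    return df


demo = pd.DataFrame({
    "thread_title": ["Storm (IC)", "Storm (IC)", "Storm OOC", "Chat", "Chat"],
    "thread_href": ["t/1", "t/1", "t/2", "t/3", "t/3"],
    "message_username": ["alice", "bob", "alice", "carol", "dave"],
    "message": ["She ran. ((brb))", "He followed her into the rain.",
                "Plans?", "hi", "yo"],
})
cleaned = clean_threads(demo, min_median_len=5)
print(cleaned[["thread_href", "message_username", "message"]])
```

Hashing keeps messages by the same user or from the same thread linkable without exposing the original names or URLs. Single-post threads are not treated specially here, since the notes point out that they are often high-effort.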

Stories website scrapes

These archives contain short stories written by a single author instead of roleplay between two or more persons.

Site name Quality Theme Link Notes
Impregnorium 🟢 Good Impregnation stories .7z About 1360 stories. Uses alt.sex.stories story codes; glossary provided in the archive for the tags.
Literotica 🟢 Good (Variable) General 7z.001 7z.002 7z.003 7z.004 7z.005 7z.006 7z.007 7z.008 7z.009 7z.010 7z.011 7z.012 7z.013 7z.014 300K+ stories as of 2021-12, 8GB CSV file with tags and other metadata 7zipped to a 2.6GB archive. Processed from https://archive.org/details/literotica-2021.12
Lush Stories 🟢 Good+ General .7z.001 .7z.002 About 70K stories (1 GB archive) with tags and general categories, a large fraction also having a brief summary. The stories are in HTML that needs some cleaning. Note that while a good portion of the author's notes are present as a dedicated field, in many cases they are in the story text together with trailers ("Continues in Chapter X", "THE END", etc.).
Sexstories 🟢 Good (Variable) General part1.7z part2.7z 60K stories, 1.25GB. Duplicate stories may be present. Some are also short "sex jokes", so it is suggested to filter them by length. Tags, ratings, views included in the csv files.
The Erotic Mind-Control Story Archive 🟢 Excellent Erotic mind control stories (general) part1.7z.001 part2.7z.002 36000+ stories (including multi-chapter parts). All stories include a short summary and categories based on a modified alt.sex.stories system, described in detail here. The stories are in very consistently formatted HTML, and separating unwanted portions like trailers, milestones, chapter titles and so on should be easy.
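
For archives where duplicates and very short entries are mentioned (e.g. Sexstories), a filter along the following lines could be a starting point. It is a sketch only: the column name `story` and the 2000-character cutoff are assumptions rather than values taken from the archives, and duplicates are detected by exact match after collapsing whitespace and case, so near-duplicates will slip through.

```python
import pandas as pd


def filter_stories(df: pd.DataFrame, text_col: str = "story",
                   min_chars: int = 2000) -> pd.DataFrame:
    """Drop exact duplicates (ignoring whitespace and case) and short texts."""
    text = df[text_col].fillna("")
    # Normalize whitespace and case so trivially different copies compare equal.
    normalized = text.str.split().str.join(" ").str.lower()
    df = df[~normalized.duplicated(keep="first")]
    # Filter by length to remove short "sex jokes" and similar entries.
    return df[df[text_col].fillna("").str.len() >= min_chars]


demo = pd.DataFrame({"story": [
    "Once upon a time there was a fox.",
    "once  upon a TIME there was a fox.",
    "A joke.",
]})
kept = filter_stories(demo, min_chars=10)
print(kept)
```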
Pub: 13 Apr 2023 15:11 UTC
Edit: 05 Nov 2023 02:44 UTC