Three lolis

Cute, Sexy and Moe archives

Overview

This page contains mainly raw forum messages (in HTML) and erotic story collections in CSV format that can be properly parsed with the python library pandas. Excel has been reported not to work correctly with them.

Please note that the raw forum/story data is not ready to use as it is for fine-tuning large language models. It should be cleaned and reprocessed in some way first; in the case of ERP/RP forums this means for example dealing with (not necessarily removing) HTML tags and/or BB code.

ERP/RP Forums

It is advised to carefully remove personal information (mainly usernames) before using this data.

Site name Forum Sections Quality Theme Link Notes
AllTheFallen Roleplaying Good Loli/shota ERP atf-forum-rp.zip 80864 messages, mean length: 975, median length: 510. Scraped on 2023-03-24.
Lolicit RP 1 on 1, RP general Good / Very Good Loli/shota ERP lolicit.zip #1 37880 messages, mean length: 588, median length: 461; #2 45493 messages, mean length: 619, median length: 323. Scraped on 2023-03-25. The 1 on 1 section has on average a higher quality than the RP general section, with some gems (mainly the longer RP threads). The website died around mid-2023, but a spiritual successor currently exists (no RP section yet there). Bonus content: all RP threads from Lolicit as single HTML pages.

Stories

Site name Quality Theme Link Notes
AllTheFallen Poor Loli/shota general atf-forum-stories.7z Since it's been scraped from a forum, the .CSV contains user comments as well. Low quality on average.
Archive of Our Own (subset) Mediocre (Variable) "Underage" ao3ua.7z.001 ao3ua.7z.002 ao3ua.7z.003 ao3ua.7z.004 ao3ua.7z.005 ao3ua.7z.006 ao3ua.7z.007 ao3ua.7z.008 ao3ua.7z.009 ao3ua.7z.010 ao3ua.7z.011 ao3ua.7z.012 ao3ua.7z.013 ao3ua.7z.014 ao3ua.7z.015 ao3ua.7z.016 ao3ua.7z.017 ao3ua.7z.018 A subset of ~800K (14.7GB) fanfictions (including multi-chapter parts) from AO3 tagged with the "Underage" content warning; they are not all smut, but those with a Rating of "Mature" or "Explicit" most likely are. A large amount of metadata has been provided in this archive. Cleaning will be needed to remove unnecessary HTML tags and other noise from the fanfictions. Each chapter has its own record in the CSV file. It is suggested to further filter the archive by "Kudos" (Upvotes) and/or chapter length. Rename the downloaded files to the same base name. 18 parts in total.
Chris Haley Erotic Stories Very Good Loli/shota general chaley.zip Small collection (360+) of loli/shota stories by the same author. Uses alt.sex.stories story codes. Short summaries present, but story notes and trailers may occasionally also be present.
Juicy Secrets Poor** Lesbian and Incest Lolita stories juicy.zip The stories are very good, but consistently cleaning up the text from unnecessary tags, notes, links and information seems very difficult without extensive manual work. Thus, this .CSV file is just a raw scrape of the stories in html from the website, without much processing involved.
Kristen's Archives - The Book Shelf Directories Mediocre~Good (Variable) General + Taboo + Lolisho kristen.7z 4150+ selected stories of various genres parsed from a 2017 scrape of asstr.org. Uses alt.sex.stories story codes. Basic cleaning has been performed but the stories may still contain author notes, disclaimers, copyright remarks, short summaries and trailers, so they might be difficult to use properly. About 25% of the stories are tagged as teen or loli/shota and below.
Leslita Very Good Lesbian Lolita stories leslita.7z 4368 stories. Stories are tagged according to the alt.sex.stories newsgroup tagging system. Check out Source1 and Source2 for a full explanation. Possible character encoding issues may be present. Em-dashes added during the parsing step.
Lolita Bondage (archived) Very good Loli/shota bondage lolita-bondage.7z The website died in 2008 and has been retrieved via archive.org. About 3300 stories. Story quality not always excellent, but all major components like head/foot notes, disclaimers, etc have been separated from the story body, making this archive easier to use than others. Story tags (alt.sex.stories story codes), summaries, votes and quality tags provided, but are not always present. Some of the stories are actually 'poems' or similar compositions, generally indicated in the tags. Notes: 1) Sometimes, "section titles" may be present in the story body as p.h2 tags. 2) Text encoding errors may be present although a good attempt was made to fix them. 3) Em-dashes have been fixed in the parsing process. 4) Some non-English stories also present.
Loliwood Studios Excellent Loli/shota general ls.7z About 7130 stories (multi-chapter story parts included). Parsed from a 2017 scrape of asstr.org. The CSV file uses expanded alt.sex.stories story codes where ages may sometimes be explicitly indicated. Story summaries, author information also provided. Content disclaimers and copyrights are most of the time separated from the story body, but story trailers ("Continues to Chapter X", etc) may still be present. Note that the website appears to have been excluded from archive.org.
Piper's Domain Excellent Loli/shota mind control stories piper.7z 665 stories (multi-chapter story parts included). Parsed from a 2017 scrape of asstr.org. The CSV file uses alt.sex.stories story codes; an html file with the ones used in this archive is provided. Story summaries also present.

Datasets

  • Ashhwriter (Version 2023-10-22)
    • Partially cleaned unsupervised finetuning dataset in .jsonl format, with just stories (with some metadata) broken into 8k tokens chunks (Mistral-7B tokenizer); about 315MB size uncompressed. Contains: Leslita, Lolita Bondage, Loliwood Studios, Piper's Domain
  • LimaRP (Version 2023-10-19)
    • Manually curated dataset made with forum data from Lolicit and ATF from this Rentry and a larger one with general forum RP. LimaRP information on another Rentry and on Huggingface.
    • Not ready-to-use; it needs to be "built" with the included script (preferably modified to suit one's needs).
  • ShoriRP (version 2024-05-25)
    • Incomplete Loli-focused, manually curated dataset of roughly half-a-LimaRP size, containing many different things, mainly roleplay from ATF and Lolicit, as well as semi-synthetic conversations. Different direction and scope than LimaRP; it's not a direct replacement. There is a partial description of what it's intended to do and how it's been organized on Huggingface, but a full description of the contents has not been written yet.
    • It's recommended to disregard the data under the low (i.e. low-quality) category unless it can be finetuned first (curriculum training).
    • Not ready-to-use; it needs to be "built" with the included script (preferably modified to suit one's needs).

Unparsed data

Edit
Pub: 27 Apr 2023 11:11 UTC
Edit: 10 Jul 2024 20:43 UTC
Views: 13065