Cute, Sexy and Moe archives
Overview
This page contains mainly raw forum messages (in HTML) and erotic story collections in CSV format that can be properly parsed with the python library pandas
. Excel has been reported not to work correctly with them.
Please note that the raw forum/story data is not ready to use as it is for fine-tuning large language models. It should be cleaned and reprocessed in some way first; in the case of ERP/RP forums this means for example dealing with (not necessarily removing) HTML tags and/or BB code.
ERP/RP Forums
It is advised to carefully remove personal information (mainly usernames) before using this data.
Site name | Forum Sections | Quality | Theme | Link | Notes |
---|---|---|---|---|---|
AllTheFallen | Roleplaying | Good | Loli/shota ERP | atf-forum-rp.zip | 80864 messages, mean length: 975, median length: 510. Scraped on 2023-03-24. |
Lolicit | RP 1 on 1, RP general | Good / Very Good | Loli/shota ERP | lolicit.zip | #1 37880 messages, mean length: 588, median length: 461; #2 45493 messages, mean length: 619, median length: 323. Scraped on 2023-03-25. The 1 on 1 section has on average a higher quality than the RP general section, with some gems (mainly the longer RP threads). The website died around mid-2023, but a spiritual successor currently exists (no RP section yet there). Bonus content: all RP threads from Lolicit as single HTML pages. |
Stories
Site name | Quality | Theme | Link | Notes |
---|---|---|---|---|
AllTheFallen | Poor | Loli/shota general | atf-forum-stories.7z | Since it's been scraped from a forum, the .CSV contains user comments as well. Low quality on average. |
Archive of Our Own (subset) | Mediocre (Variable) | "Underage" | ao3ua.7z.001 ao3ua.7z.002 ao3ua.7z.003 ao3ua.7z.004 ao3ua.7z.005 ao3ua.7z.006 ao3ua.7z.007 ao3ua.7z.008 ao3ua.7z.009 ao3ua.7z.010 ao3ua.7z.011 ao3ua.7z.012 ao3ua.7z.013 ao3ua.7z.014 ao3ua.7z.015 ao3ua.7z.016 ao3ua.7z.017 ao3ua.7z.018 | A subset of ~800K (14.7GB) fanfictions (including multi-chapter parts) from AO3 tagged with the "Underage" content warning; they are not all smut, but those with a Rating of "Mature" or "Explicit" most likely are. A large amount of metadata has been provided in this archive. Cleaning will be needed to remove unnecessary HTML tags and other noise from the fanfictions. Each chapter has its own record in the CSV file. It is suggested to further filter the archive by "Kudos" (Upvotes) and/or chapter length. Rename the downloaded files to the same base name. 18 parts in total. |
Chris Haley Erotic Stories | Very Good | Loli/shota general | chaley.zip | Small collection (360+) of loli/shota stories by the same author. Uses alt.sex.stories story codes. Short summaries present, but story notes and trailers may occasionally also be present. |
Juicy Secrets | Poor** | Lesbian and Incest Lolita stories | juicy.zip | The stories are very good, but consistently cleaning up the text from unnecessary tags, notes, links and information seems very difficult without extensive manual work. Thus, this .CSV file is just a raw scrape of the stories in html from the website, without much processing involved. |
Kristen's Archives - The Book Shelf Directories | Mediocre~Good (Variable) | General + Taboo + Lolisho | kristen.7z | 4150+ selected stories of various genres parsed from a 2017 scrape of asstr.org. Uses alt.sex.stories story codes. Basic cleaning has been performed but the stories may still contain author notes, disclaimers, copyright remarks, short summaries and trailers, so they might be difficult to use properly. About 25% of the stories are tagged as teen or loli/shota and below. |
Leslita | Very Good | Lesbian Lolita stories | leslita.7z | 4368 stories. Stories are tagged according to the alt.sex.stories newsgroup tagging system. Check out Source1 and Source2 for a full explanation. Possible character encoding issues may be present. Em-dashes added during the parsing step. |
Lolita Bondage (archived) | Very good | Loli/shota bondage | lolita-bondage.7z | The website died in 2008 and has been retrieved via archive.org. About 3300 stories. Story quality not always excellent, but all major components like head/foot notes, disclaimers, etc have been separated from the story body, making this archive easier to use than others. Story tags (alt.sex.stories story codes), summaries, votes and quality tags provided, but are not always present. Some of the stories are actually 'poems' or similar compositions, generally indicated in the tags. Notes: 1) Sometimes, "section titles" may be present in the story body as p.h2 tags. 2) Text encoding errors may be present although a good attempt was made to fix them. 3) Em-dashes have been fixed in the parsing process. 4) Some non-English stories also present. |
Loliwood Studios | Excellent | Loli/shota general | ls.7z | About 7130 stories (multi-chapter story parts included). Parsed from a 2017 scrape of asstr.org. The CSV file uses expanded alt.sex.stories story codes where ages may sometimes be explicitly indicated. Story summaries, author information also provided. Content disclaimers and copyrights are most of the time separated from the story body, but story trailers ("Continues to Chapter X", etc) may still be present. Note that the website appears to have been excluded from archive.org. |
Piper's Domain | Excellent | Loli/shota mind control stories | piper.7z | 665 stories (multi-chapter story parts included). Parsed from a 2017 scrape of asstr.org. The CSV file uses alt.sex.stories story codes; an html file with the ones used in this archive is provided. Story summaries also present. |
Datasets
- Ashhwriter (Version 2023-10-22)
- Partially cleaned unsupervised finetuning dataset in
.jsonl
format, with just stories (with some metadata) broken into 8k tokens chunks (Mistral-7B tokenizer); about 315MB size uncompressed. Contains: Leslita, Lolita Bondage, Loliwood Studios, Piper's Domain
- Partially cleaned unsupervised finetuning dataset in
- LimaRP (Version 2023-10-19)
- Manually curated dataset made with forum data from Lolicit and ATF from this Rentry and a larger one with general forum RP. LimaRP information on another Rentry and on Huggingface.
- Not ready-to-use; it needs to be "built" with the included script (preferably modified to suit one's needs).
- ShoriRP (version 2024-05-25)
- Incomplete Loli-focused, manually curated dataset of roughly half-a-LimaRP size, containing many different things, mainly roleplay from ATF and Lolicit, as well as semi-synthetic conversations. Different direction and scope than LimaRP; it's not a direct replacement. There is a partial description of what it's intended to do and how it's been organized on Huggingface, but a full description of the contents has not been written yet.
- It's recommended to disregard the data under the
low
(i.e. low-quality) category unless it can be finetuned first (curriculum training). - Not ready-to-use; it needs to be "built" with the included script (preferably modified to suit one's needs).
Unparsed data
- Crestfall story data (version 2024-02-14)
- Handpicked and cleaned loli (7.7MB) - shota (4.8MB) stories. Not a dataset, but could be the base for one. Some empty files (0 bytes) present.
- Histoires Taboues
- 20000+ stories in French from HT from 1999 to 2024, from archives in https://infosht.wordpress.com/about/etat-du-site-h-t/. It appears that the website has been specifically excluded from archive.org and that it has not been properly archived from the asslr.org domain it was previously temporarily hosted on.
- [S] Entire Archive of ASSTR.ORG FTP Site, Up to May 2017. (archived)
- Reddit discussion of a 2017 scrape of asstr.org. ASSTR was known to contain a large amount of erotica featuring underage characters.
- It's recommended to use the 2022 magnet link shown in the thread (the data is the same as the other).
Other links
- ERP/RP and erotica raw data collection
- A large collection of erotica and conversations scraped between March and April 2023 from some of the major RP/ERP forums and other sources.
- LIMA ERP data (LimaRP)
- The Rentry where the LimaRP dataset was originally posted on mid-2023.