Dampf's list of good datasets for LLM fine-tuning

| Dataset Link | Description | Paper Link |
| --- | --- | --- |
| WizardLM evol instruct V2 | The newest version of the highly acclaimed WizardLM evol instruct dataset. It builds on its 70k predecessor, which led to impressive models like ehartford/WizardLM-13B-Uncensored. The dataset, merged and uncensored by Eric Hartford, offers even more data for fine-tuning language models. | Paper by WizardLM |
| Airoboros 3.1 | An excellent, partly synthetic GPT-4 dataset created by Jon Durbin. Its diverse content, spanning math, coding and scripting tasks, agent- and function-style prompts, general tasks, riddles, roleplay, and more, aims to enhance the world knowledge and coherence of language models across a wide range of tasks. | - |
| PygmalionAI PIPPA | The latest release by the respected PygmalionAI team, known for their good roleplay and instruct models with both NSFW and SFW capabilities. The dataset is collected from sources such as CharacterAI and Claude, providing a unique blend of human-written and AI responses; many other roleplay datasets, in contrast, are entirely synthetic. In my personal tests with roleplay (rp) datasets, roleplay data enhances world understanding and logic for instruct tasks as well, because correlations between objects, size differences, and the world itself are described in more detail, which contributes to improved model performance. PIPPA specifically is sufficient to teach the model more casual writing styles, emotions, and character cards, but since most of the responses are quite short, I strongly recommend combining it with high-quality novel-style rp datasets. Most of these are not available to download on HF, so they are not included in this list. | A Partially Synthetic Conversational Dataset |
| Evol Codealpaca v1 | A coding dataset by theblackcat102, intended to replicate WizardLM's successful WizardCoder. As with roleplay data, good coding datasets are crucial not only for refining coding skills but also for boosting the general capabilities of language models. | Paper by WizardLM |
| Starcoderdata | A popular coding dataset, contributing to excellent performance in coding tasks. Fine-tuned models' performance details can be found in this link. | - |
| OpenOrca and Dolphin | Excellent and innovative datasets from respected LLM community members like Eric Hartford, WingLian, Teknium, NanoBit, etc. These datasets use GPT-4 as a parent model, enhancing model capabilities by explaining complex instructions with a step-by-step thought process. This approach improves reasoning and comprehension, as demonstrated by OpenOrca-Preview2-13B. | Paper by Microsoft |
| TinyStories | A popular and heartwarming dataset in simple language, perfect for training small models to generate coherent language. While it is primarily used for training smaller models, I think fine-tuning larger models on a smaller version of this dataset has the potential to enhance their ELI5-task capabilities and produce more direct answers. | Paper |
| LosslessMegaCodeTrainingV3_Mini | A vast, uncensored coding dataset. It significantly enhances model coding, mathematical, and reasoning abilities, making it a prime choice for uncensored models. | - |
| OpenAssistant Guanaco and Guanaco Unfiltered | A high-quality instruct dataset by Tim Dettmers, plus an uncensored version by Fredithefish. It reuses the name of the first GuanacoDataset. OpenAssistant Guanaco and its unfiltered version are subsets of the acclaimed OASST1 dataset by OpenAssistant, which introduced a vast conversation tree with annotated paths, elevating instruct capabilities and reasoning for complex tasks. | Paper |
| GPTeacher | A flexible dataset consisting entirely of synthetic instruct and roleplay data generated by GPT-4. It offers higher-quality output than synthetic datasets generated by GPT-3.5, contributing to improved model performance. Note that this dataset may include small traces of alignment and censorship. | - |
| Open Platypus | A fresh release by the garage-bAInd community. As you will notice, it already includes some datasets found in this list. It features a variety of logic tasks, enhancing model capabilities. Stable-Platypus2-13B, currently one of the best-performing models in its class on the HF Leaderboard, demonstrates the dataset's quality. Please note that there might be alignment and slight traces of censorship, as it has not been uncensored. You might want to pick parts of the dataset to fit your needs. | - |
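
Several entries above mention traces of alignment or censorship, and the Open Platypus entry suggests picking only the parts of a dataset that fit your needs. A minimal sketch of that kind of filtering follows. It assumes each record is a plain dict with hypothetical `output` and `source` fields, and the marker phrases are illustrative only; the real datasets use their own schemas, and a usable phrase list has to be tuned by hand.

```python
# Illustrative marker phrases only (an assumption, not taken from any dataset).
REFUSAL_MARKERS = (
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but i can't",
)


def looks_aligned(text):
    """Return True if the text contains any marker phrase (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def select_records(records, keep_sources=None):
    """Keep records from the wanted sources whose output has no marker phrase.

    `records` is a list of dicts with hypothetical "output" and "source" keys.
    `keep_sources` is an optional set of source names; None keeps every source.
    """
    kept = []
    for record in records:
        if keep_sources is not None and record.get("source") not in keep_sources:
            continue  # not one of the parts we want
        if looks_aligned(record.get("output", "")):
            continue  # drop refusal-style responses
        kept.append(record)
    return kept
```

Plain substring matching like this is crude: it can drop legitimate answers that merely quote such a phrase, so it is worth reading a sample of the removed records before training.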

Datasets Added by the Community:

| Dataset Link | Description | Paper Link |
| --- | --- | --- |
| COT Submix Original | A diverse dataset featuring a wide range of riddles and logic tasks, enhancing logical reasoning capabilities of models. | - |
| Lima | - | Paper |
| Stack Exchange Instruction | A dataset containing numerous questions and answers from Stack Overflow, boosting model coding capabilities. | - |
| Open Instruct Uncensored | An uncensored version of the open-instruct dataset, including human-written instructions and thought prompts. | Paper |
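
The PIPPA entry recommends combining short-response roleplay data with longer-form datasets, and in practice several datasets from these lists are often blended into one training set. A minimal sketch of weighted mixing follows, assuming each dataset has already been loaded into a list of records. The dataset names, weights, and the `mix_datasets` helper are hypothetical, and sampling with replacement is a simplification chosen for brevity.

```python
import random


def mix_datasets(datasets, weights, total, seed=0):
    """Draw `total` records from several datasets according to `weights`.

    `datasets` maps a name to a non-empty list of records; `weights` maps the
    same names to relative sampling weights. Records are drawn with
    replacement, so small datasets may repeat.
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = list(datasets)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=total)
    return [rng.choice(datasets[name]) for name in picks]


# Hypothetical example: mostly long-form data with a smaller share of short chats.
mixed = mix_datasets(
    {"short_chat": ["s1", "s2"], "long_form": ["l1", "l2", "l3"]},
    {"short_chat": 1, "long_form": 3},
    total=8,
)
print(len(mixed))
```

For a real run you would more likely sample without replacement and deduplicate, especially since some collections here (Open Platypus, for example) already contain other datasets from the list.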

That is an overview of the best datasets I'm aware of. A huge thank you to all the dedicated people working on high-quality datasets.

The list will be updated from time to time and I'm open to suggestions!

Pub: 31 Jul 2023 10:25 UTC
Edit: 24 Oct 2023 09:32 UTC