How To RVC
[ Guide by @diablofx, other guides linked with credit ] [ contributors all welcome! ]
I will primarily be focusing on creating a dataset and local training in this guide, given the abundance of tutorials on other subjects.
I'll specifically focus on Windows (training on Mac isn't it). While aiming for conciseness, this guide is not for absolute beginners. Limited coverage of paid tools is intentional, as most readers probably lack access. If you do have access, you probably wouldn't be reading this guide.
This guide will still work on all versions of RVC
I will still mention everything else, so feel free to still follow the guide
~ Table of Contents ~
- PREPARATIONS 📜
- CREATING A DATASET 🔉
- Where to get your audio? 🤔
- How to prepare your dataset
- Length of the Dataset 📏
- Isolating Vocals
- Using UVR
- Best Settings for Isolation
- Noise Gating
- Normalising
- Truncating
- De-Esser
- TRAINING 💪
- Using Tensorboard
- How to read the graphs
- How to spot overtraining
- How to train properly
- Sample-Rate & other settings
- Continue Training a model
- INFERENCE - How to make AI Cover 🗣️
- PUBLISHING A MODEL 📤
- REAL-TIME VOICE CHANGER 🎤
- TTS/Text-to-Speech 🛠️
- Cloud-Alternatives ☁️
📜 PREPARATIONS 📜
Before doing ANYTHING, you need to get an RVC installation.
This guide covers training locally first; go here if you are not planning to train locally.
(I will link each equivalent guide in each section)
🖥️ Installing Mainline-RVC locally 🖥️
(Update: Removed the mention of the Mangio-RVC Fork)
Before proceeding, ensure that you have Python 3.10 installed, and be sure to select "Add Python to Environment Variables (PATH)" during the installation to avoid issues!
(Also make sure your GPU is good enough: 2nd-gen RTX or above. Otherwise, don't bother with local training and use Colab or Ilaria RVC instead.)
I personally use Mainline-RVC and recommend that you install it too. Just go to the releases page I linked, download the package corresponding to your GPU and extract it.
- Download and extract the zip file to your preferred location.
- Create a new folder (recommended), ensuring no spaces in folder names to prevent potential issues later.
- Launch the official Web UI Interface by running the go-web.bat file (or the corresponding custom one if you have one installed)

OPTIONAL: Easy-GUI & RavenUI
You can also install the Easy-GUI for a simplified custom interface, although I don't recommend it due to reduced features. If you decide to install it, extract it into your newly created folder and run the batch file. After installation, launch the 'run_easiergui.bat' file as you would for the regular interface.
Same with RavenUI (I do recommend this one).
If you ever want to make your own custom UI, this RENTRY might help you in knowing what lines to change
📈 Installing Tensorboard 📈
Install Tensorboard, a crucial but often overlooked tool for maximizing a model's potential and for catching unintended issues that could otherwise completely ruin a model. Drop the linked file into your RVC folder and install it.
Use that same .cmd whenever you wish to access Tensorboard. I explain how to use it properly here.
Now that you have RVC installed, let's talk about how to make a good dataset that you can train.
🔉 CREATING A DATASET 🔉
Datasets are just audio files that have been processed to remove the things you do not want in them. But first, you need to download some audio samples of the character/sound you want...
🤔Where to get your audio? 🤔
Always prioritize high-quality audio rips in stereo format with a higher sample rate, preferably sourced from official platforms like Spotify, Qobuz, etc.
Stereo = Audio with 2 channels (left and right ear)
Opt for a lossless format such as .flac over mp3 or YouTube rips for superior audio quality — though other formats may suffice, it's advisable to avoid them if possible (please i beg you). I won't delve into other specific methods here, for obvious reasons.
Get .FLAC using Cobalt or free-mp3-download
If you're feeling lazy (though I'd advise against it), try this
But, personally, I would recommend obtaining the best quality possible in a lossless format from the source itself
If your audio is in a format other than .wav or .flac, I suggest converting it to .wav (especially if it's in .mp3).
▶️ YouTube ▶️
I would strongly advise against ripping audio from YouTube, as YouTube audio is compressed and therefore loses quality in the process. However, if you find it necessary, consider using yt-dlp. If that is too confusing for you, you can also use Stacher (a frontend GUI for yt-dlp, same thing).
Follow the straightforward installation guide on the install page. After installation, open the command prompt in the folder path where yt-dlp is located, and use this command to download in WAV format:

yt-dlp "LINK" -f ba --extract-audio --audio-format wav
If you prefer a simpler approach, you can use y2down, which typically saves files in 16Bit 48kHz WAV format.
Choose FLAC or WAV during export.
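If you have a whole list of links, typing the yt-dlp command by hand gets old fast. Here is a minimal sketch that wraps the exact same flags as the command above in Python. The function names are made up for this example, and it assumes yt-dlp is installed and reachable from your PATH.

```python
import subprocess


def build_ytdlp_command(url: str) -> list[str]:
    """Build the same yt-dlp call as the command in this guide, as an argument list."""
    return ["yt-dlp", url, "-f", "ba", "--extract-audio", "--audio-format", "wav"]


def download_all(urls: list[str]) -> None:
    """Run yt-dlp once per URL (assumes yt-dlp is on PATH)."""
    for url in urls:
        # check=True stops the loop with an error if one download fails
        subprocess.run(build_ytdlp_command(url), check=True)
```

Call download_all(["LINK1", "LINK2"]) from the folder where you want the WAV files to end up.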
🎥 Audio from Movies/TV/Anime etc. 🎥
If you want to extract audio from movies, series, etc., I recommend using MKVToolnix.
Just select only the audio you want and save it as a .wav file.
🗣️ Making a model of yourself 🗣️
If you're creating a model of yourself, use a good microphone in a room with little echo and background noise. Record yourself singing, counting, reading, or whatever is relevant to cover a variety of syllables, pitches, and emotions. Cover both low and high pitches, and ensure inclusion of all vowels (a, e, i, o, u).
HOW TO PREPARE YOUR DATASET
📏 Length of the Dataset 📏
I would recommend a minimum of 15-25 minutes, but ideally, aim for around 30-45 minutes.
Having less than that will work too, it just won't handle a variety of ranges well as it wasn't trained for that. Avoid going overboard, as exceeding this duration won't necessarily yield better results and only wastes your time/energy.
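To check how long your dataset actually is without adding things up by hand, something like the following sketch can help. It only uses Python's built-in wave module, so it assumes plain PCM .wav files sitting directly in one folder (no FLAC, no 32-bit float WAV). The function name is just an example.

```python
import wave
from pathlib import Path


def total_minutes(folder: str) -> float:
    """Sum the duration of every .wav file directly inside `folder`, in minutes."""
    total_seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 60.0
```

If total_minutes("my_dataset") comes back well below 15, expect the limitations described above.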
Isolating Vocals
If you are lucky, you may have already sourced some studio stems with ideally no vocal processing applied, allowing you to skip this step and go straight to here. If not, keep reading.
Whatever you do, though, YOU NEED TO REMOVE REVERB/ECHO/NOISE/HARMONIES for a good dataset, so make sure that's the case.
WARNING: Isolating reduces quality, so avoid doing unnecessary isolations.
Still, try to leave as little processing as possible; make it sound like a stock mic recording.
I will go over 3 methods:
- UVR (Ultimate Vocal Remover) - likely your best free option.
This guide will primarily center around UVR, as it's widely used and effective. Personally, I prefer using it sparingly, but it remains a solid choice that gets the job done.
- MVSEP - a free alternative but can take a while to use (long queue). The free tier only outputs 320kbps .mp3 and there's a 10-minute limit, so I would advise against using it. MDX B (vocals) is a decent option, but since you're reading a guide on training locally, stick to a local option like UVR, you weirdo.
- iZotope RX 10 (paid) - Since this is a paid option, I won't delve into details here. For the sake of this guide, I'll focus on UVR, which suits most use cases. The primary process is essentially the same for all tools, so follow along.
If you have a Low-End-Rig or are interested in trying out MVSEP, follow this guide for more details!
Like I said, I will mainly cover UVR here, but the main process remains the same.
Honorable mention: Adobe Audition
AVOID USING: vocalremover.org / x-minus.pro. While these may suffice for minor tasks, they sacrifice potential quality and are highly limited.
Whichever one you decide to use, the end goal is the same…
Using UVR
If there are any models in the guide that are not in your list:
- Navigate to settings (the wrench icon to the left of "Start Processing")
- Access the download center to get whatever models you need, keep reading for more info
Set the export format to .wav or .flac.
Choose based on space considerations or personal preference. .wav is recommended for optimal quality, but I use .flac.
Then select the audio you want to use as input and find a folder you want to put the output audio in and select that as the output (duh).
BEST SETTINGS FOR ISOLATION
Removing Vocals/Instrumental:
- use Kim Vocal 1 or 2 (MDX-Net) -> Kim Vocal 2 is harsher and might add noise but it's more precise lately so use whatever works for you
- MDX23C-InstVoc HQ is probably the best one (some might say it's better than Kim), but Kim Vocal is enough most of the time. It also requires a more powerful GPU
- Voc FT (MDX-Net) is good too, try out what works (good for Low-End Rigs)
- Honorable mention: Hq3-Inst
AFTER SEPARATING, ALWAYS USE THE GENERATED OUTPUT AS THE NEW INPUT IF YOU'RE TRYING TO DO MORE THAN SEPARATING (duh)
Removing Reverb / Echo:
- use De-echo (VR Architecture) or Reverb HQ (MDX-Net)
- De-echo is very aggressive but I like using it, so try that. Avoid using Reverb HQ if possible
Removing Noise:
- UVR-DeNoise (VR)
Removing Harmonies:
- 5-HP Karaoke (VR) -> 6 is more “aggressive”, but can be used to separate singers
- (if you are interested in that, consider checking out this)
- UVR-MDX-NET Karaoke 2 (MDX-NET) -> 1 is not worth it most of the time
IF ANYTHING SOUNDS BAD AFTER ISOLATING AND TRYING EVERYTHING
(e.g., inability to remove harmonies or poor quality), exclude it from the dataset and delete it.
Should be self-explanatory, but don't use a lot just for the sake of it.
Quality > Quantity.
Don't try to meet my recommended dataset length no matter what, but be aware that less than that will result in a less polished model that will struggle in certain moments since it doesn't have lots of training data.
THE BEST WAY TO PREPARE YOUR VOCALS:
- separate vocals and instrumental
- denoise the result
- de-reverb the result
- if you need to separate vocals or remove harmonies, use karaoke 2
- if you didn't denoise before, do it now
I highly recommend using UVR Denoise or a similar tool in whatever you use (RX, Audition, etc.). It helps more than it hurts.
- you should also Noise Gate, Truncate and Normalize (in that order)
Noise Gating
This will mute the sounds that fall below the threshold we set (a dynamics effect, basically the opposite end of what a compressor does).
You can Noise Gate in Audacity directly, but I recommend a different approach.
If you do use Audacity, use these Settings:

But I would use Renegate (free). Just install as usual, and you can access it under the Effects tab. Then apply my preset! (adjust accordingly)
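If you are wondering what "below the threshold" means in numbers, here is a deliberately crude sketch. It assumes samples are floats between -1.0 and 1.0 and simply zeroes every sample quieter than the threshold. Real gates like Renegate work on a smoothed level with attack and release times, so treat this as an illustration of the idea only, not as what any plugin actually does.

```python
def db_to_amplitude(db: float) -> float:
    """Convert a level in dB (relative to full scale) to a linear amplitude."""
    return 10.0 ** (db / 20.0)


def noise_gate(samples: list[float], threshold_db: float) -> list[float]:
    """Hard gate: keep samples at or above the threshold, silence the rest."""
    limit = db_to_amplitude(threshold_db)
    return [s if abs(s) >= limit else 0.0 for s in samples]
```

For example, a threshold of -40 dB corresponds to a linear amplitude of 0.01, so anything quieter than 1% of full scale gets muted.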
Truncating (automatically removing silence)
Having silence in your dataset is a no-no. You can manually cut out silence or simply avoid including it when selecting your samples. However, I strongly recommend truncating under all circumstances; it's essential. Noise Gating is unnecessary imo, but this is a must.
To do this, select everything (ctrl+a), go to Effect > Special > Truncate Silence.
There are two controls that determine which audio will be treated as "silence":
- Threshold (dB): For audio to be treated as silence, it must be below this threshold level. If insufficient silences are being reduced, increase the threshold to a higher (less negative) number. Choose a value between -48 and -54.5dB for our purposes. Opt for a higher value if you want to add more "breathing" to your vocals.
- Duration: The minimum duration for audio to be treated as silence. The audio must stay below the entered "Threshold" for at least this duration to be considered silence. If too few silences are being reduced, decrease this "Duration."
Set “Duration” and “Truncate to” to 0.0004 seconds.
And if not selected already, go for the Action “Truncate Detected Silence” and select “Truncate tracks independently”
ALWAYS PREVIEW AND LISTEN TO THE OUTPUT BEFORE USING TRUNCATE
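To make the Threshold / Duration / "Truncate to" controls less abstract, here is a rough sketch of the same idea in plain Python. It is not Audacity's actual implementation; it assumes mono samples as floats between -1.0 and 1.0, and the function and parameter names are invented for this example.

```python
def truncate_silence(samples, sample_rate, threshold_db=-48.0,
                     min_silence_s=0.0004, truncate_to_s=0.0004):
    """Shorten every quiet run that lasts at least `min_silence_s` to `truncate_to_s`."""
    limit = 10.0 ** (threshold_db / 20.0)          # dB threshold as linear amplitude
    min_len = max(1, round(min_silence_s * sample_rate))
    keep_len = max(0, round(truncate_to_s * sample_rate))

    out = []
    run = []  # current run of below-threshold samples
    for s in samples:
        if abs(s) < limit:
            run.append(s)
            continue
        # A loud sample ends the quiet run: shorten it only if it was long enough.
        out.extend(run[:keep_len] if len(run) >= min_len else run)
        run = []
        out.append(s)
    # Handle a quiet run at the very end of the audio.
    out.extend(run[:keep_len] if len(run) >= min_len else run)
    return out
```

Raising threshold_db (for example from -54.5 to -48) treats more of the audio as silence, which matches the description of the Threshold control above.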
Normalising
When you are done, normalize your audio (Effect > Volume and Compression) to -4dB (consider trying -2dB, but -4dB is recommended). Ensure all three boxes are ticked. This step also eliminates segments of the audio that aren't suitable for the dataset. Export only the mono channels to save space.
Export in either .wav or .flac with 24-bit/level 8.
If you choose the right settings, you can always resume training from where you left off, so don't worry.
ANYTHING NOT MENTIONED IN HERE IS NOT IMPORTANT SO JUST LEAVE IT AT DEFAULT!
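For the curious, normalising to -4dB boils down to one multiplication. The sketch below assumes peak normalisation on float samples between -1.0 and 1.0 and ignores the other tick boxes in the Audacity dialog; the function name is only an example.

```python
def normalize_peak(samples: list[float], target_db: float = -4.0) -> list[float]:
    """Scale the audio so its loudest sample sits at `target_db` dB below full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = 10.0 ** (target_db / 20.0) / peak
    return [s * gain for s in samples]
```

With the default of -4dB the loudest sample ends up at roughly 0.63 of full scale, which leaves some headroom.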
De-esser
Now that you have normalized, truncated and run your noise suppression again (if there's still background noise), you can run your de-esser if you think the "s" sounds are too harsh.
Training
Once your dataset is prepared, we can finally start training 🎉 But…there are still some things you should know about.
USING TENSORBOARD
Before doing anything, confirm that you are on the 'scalars' tab, have 'ignore outliers in chart scaling' enabled, and set the smoothing to 0.987 or the maximum level. If you haven't trained for long, consider turning off smoothing initially and enabling it later (or keep it on consistently). Any value of 0.7 or above is effective; adjust as needed. Additionally, set “Horizontal Axis” to “Step.”
QOL-Tip: Enable “Reload Data” in settings so you don't have to refresh each time.
Press the 3rd Option to fit the domain to data/fit data to the graph, whenever you are training. 
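If you want a feel for what the smoothing slider does to a noisy graph, the sketch below shows an exponential moving average with start-up correction. As far as I know this is the general idea behind Tensorboard's smoothing, but treat it as an approximation and not as Tensorboard's actual code. The function name is made up.

```python
def smooth(values: list[float], weight: float = 0.987) -> list[float]:
    """Exponential moving average; `weight` must be below 1 (0 = no smoothing)."""
    smoothed = []
    last = 0.0
    for i, v in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * v
        # Divide by (1 - weight^i) so the first points are not dragged toward 0.
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed
```

A weight of 0.987 means each new point only contributes 1.3% to the curve, which is why short training runs look almost flat with smoothing turned all the way up.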
HOW TO READ THE GRAPHS
You can pretty much ignore all the graphs, except for some in the “loss” tab.
I won't bore you with the details, but pay close attention to:
- loss/g/total is the one you should look at the most imo, loss/d/total too.
- loss/g/mel is crucial as well. If the graph is close to 0, it indicates very accurate pitch
Basically, the graphs should consistently trend downward, including 'loss/g/mel', to ensure a good model. Occasionally checking 'loss/g/kl' may be beneficial, but tbh I never do that.
Now to the important part…
HOW TO SPOT OVERTRAINING

If the graph, at its lowest point, suddenly starts rising significantly, it indicates overtraining. Examine the graph for the minimum value to pinpoint when overtraining begins.
Usually that's enough to tell, but...
Another method is to observe the behavior of loss/d/total and loss/g/total: if one is increasing while the other is decreasing, it's also a sign of overtraining. Both graphs should move in the same direction consistently, always.
! ALWAYS VERIFY IF KL AND MEL ARE DECREASING TOO (indicating it's not overtraining).
Whenever you see overtraining happen, promptly click "stop training" in the GUI and test the current epoch state. Verify the epoch state in Tensorboard by hovering over the graph or check the train.log file in the right /logs folder for the corresponding timestamp. If you realize overtraining after completing training, you can revert to a previous epoch state if you followed my recommended settings (i.e., save small .pth file is enabled).
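The "find the lowest point, then see if it rises" check can also be written down as a few lines of code. This is only a sketch of the rule of thumb above, not a replacement for actually looking at the graphs. It assumes you have copied the (ideally smoothed) loss values and their steps out of Tensorboard yourself, and rise_ratio is a made-up tolerance you would need to tune.

```python
def find_overtraining(steps: list[int], losses: list[float],
                      rise_ratio: float = 0.05) -> tuple[int, bool]:
    """Return (step with the lowest loss, whether the loss has risen noticeably since)."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    best_loss = losses[best_index]
    latest_loss = losses[-1]
    # Overtraining suspected: the minimum is in the past and the curve climbed back up.
    overtrained = (best_index < len(losses) - 1
                   and latest_loss > best_loss * (1.0 + rise_ratio))
    return steps[best_index], overtrained
```

If it reports overtraining, the returned step is roughly where you would look for the saved epoch state to test.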
HOW TO TRAIN PROPERLY
Sample-Rate
Basically, a higher sample rate means a broader range of frequencies get captured.
You typically can't go wrong with 32k, most models tend to end up as 32k. 40k and 48k are quite rare, but you need to verify in a spectrogram anyways, so do that.
<18k = 32k model, <20k = 40k model, <24k = 48k model
Doesn't 48k handle higher frequencies better?
While that might be true, it occasionally introduces more noise in the output. And you rarely reach it anyways, so just stick with the safe and reliable choice: 32k. But please, always check a spectrogram before doing anything.

If you are not sure what to use: check where your dataset's frequency ends with this, or just drag your audio into "Ilaria Audio Analyzer" if you don't want to download Spek.

Double the value where the graphs end. In this example, the audio would be 44kHz.
for more info or as a cheat sheet, check a frequency response table or read more about it here, you nerd
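The rule of thumb from this section (<18k = 32k model, <20k = 40k model, <24k = 48k model) fits in a tiny helper. This sketch assumes you have already read the cutoff frequency off the spectrogram yourself; the function name is just for illustration, and anything at or above 20 kHz simply falls through to the highest option.

```python
def pick_target_sample_rate(cutoff_hz: float) -> str:
    """Map the frequency where the spectrogram ends to a target sample rate."""
    if cutoff_hz < 18_000:
        return "32k"
    if cutoff_hz < 20_000:
        return "40k"
    return "48k"
```

For example, a dataset whose spectrogram ends around 16 kHz maps to a 32k model.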
Now that that's all done, you can start training
- Navigate to your RVC folder and open the GUI using go-web.bat.
- Go to the Train tab at the top:

- Choose a model name under “Experiment name” (first option) - avoid spaces in the name.

- Choose the target sample rate according to your spectrogram check (see the Sample-Rate section above).
- Enable pitch guidance.
- RMVPE (gpu or + variant) is just the best option to use
- Select v2 for the model architecture version.
- Set CPU processes to 2. You can increase this for faster processing but may face crashes (BSOD).
- Copy the path to your dataset folder and paste it into the “Path to training folder” box:

- Once you've copied your dataset path, click “Process Data” and wait until it's done (the console should say something like “end preprocess”).
- Select your pitch extraction method (mangio-crepe or rmvpe), and set your hop length to 128 or 64 if asked (just do 128)
- Press “feature extraction” and wait for “all-feature-done” to appear.
- Set your saving frequency to 10 (it will save a model every 10 epoch states).
- Set your batch size depending on your GPU VRAM size (google it). If it's 6GB VRAM, for example, select 6.
- Enable “save only the latest .ckpt” if not enabled already.
- Disable, if not disabled already:

- Enable “Save a small final model to the 'weights' folder at each save point.” Each save can be used as a model on its own, useful in case of overtraining. Epoch states will be saved under the “weights” folder.
- Set your epoch amount. I always go for 300. You can always pick a different epoch state later, once it has finished training to 300. I do this to prevent unnecessary training.
- Aim for a slightly undertrained model; don't train too little. Always go for the epoch state just before overtraining begins (check with Tensorboard).
- Once everything is set up (DON'T MESS WITH THE OPTIONS NOT MENTIONED HERE) press “train feature index”. You can also do this at the end, in case you forget this step, but just do it now 🗿
- Once it's added, press “train model” to start training. The command box will notify you when it's done.
CONTINUE TRAINING A MODEL
IMPORTANT:
USE THE EXACT SAME NAME AND ALL THE SAME SETTINGS
DO NOT REPROCESS OR REDO FEATURE EXTRACT!!!!
DON'T DO ANYTHING BESIDES ENTERING THE SAME VALUES.
THE ONLY THING YOU PRESS IS "TRAIN MODEL" AT THE END WITH YOUR EPOCH COUNT SET!!!!
ALSO MAKE SURE YOU ONLY KEEP THE TWO LATEST .PTH FILES in the /logs folder of that model (G_69420 and D_69420 for example). This doesn't mean deleting everything in the folder lol; just ensure there are only 2 .pth files.
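If your logs folder is a mess, a small script can tell you which .pth files are NOT the latest pair. This is a cautious sketch: it assumes the checkpoints follow the G_<number>.pth / D_<number>.pth naming from the example above, and it only lists the extra files instead of deleting anything. The function name is invented for this example.

```python
import re
from pathlib import Path

# Assumed naming scheme, based on the G_69420 / D_69420 example in this guide.
CHECKPOINT = re.compile(r"^([GD])_(\d+)\.pth$")


def extra_checkpoints(model_log_dir: str) -> list[str]:
    """List G_*/D_* .pth files that are not the newest of their kind."""
    newest: dict[str, tuple[int, str]] = {}
    found: list[str] = []
    for path in Path(model_log_dir).iterdir():
        match = CHECKPOINT.match(path.name)
        if not match:
            continue  # ignore train.log, index files, and everything else
        kind, step = match.group(1), int(match.group(2))
        found.append(path.name)
        if kind not in newest or step > newest[kind][0]:
            newest[kind] = (step, path.name)
    keep = {name for _, name in newest.values()}
    return sorted(name for name in found if name not in keep)
```

Review the returned names yourself before moving or deleting anything.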
Once you are done training and have checked with Tensorboard for any signs of overtraining, go to your weights folder to view the model.
Congrats!
If you ever plan on uploading your models to AI Hub or share them in any way, see this section of the guide for more information
Inference - How to make an AI cover

You're now ready to make a cover, aka an inference!
All the usual steps [here] still apply. If you're downloading a song directly, make sure you isolate the vocals properly first and then edit them to have no processing for the best results—no reverb, no nothing.
Options:

The lower you set this, the more it will capture the original volume range of the song. A value of 1 will be equally loud throughout the whole conversion, while 0 will mimic the volume range of the original as closely as possible. Use 0.25 or 0.2.
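To picture what that slider does, here is a simplified single-number sketch. It assumes the setting blends the loudness of the original vocal with the loudness of the converted vocal over short frames, which is my understanding of the option and not something taken from the RVC source. The function name and arguments are hypothetical.

```python
def envelope_gain(original_rms: float, converted_rms: float, rate: float) -> float:
    """Gain to apply to one short frame of converted audio (converted_rms must be > 0)."""
    # rate = 1 -> gain 1.0 (converted audio keeps its own loudness)
    # rate = 0 -> gain original/converted (converted audio follows the original loudness)
    return (original_rms / converted_rms) ** (1.0 - rate)
```

With the recommended 0.25, the result leans heavily toward the volume range of the original song.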
WHAT PITCH EXTRACTION METHOD TO USE
Use rmvpe or crepe/mangio-crepe if you have it; ignore the rest, they are not good.
RMVPE is the best option overall (generally). If you see Mangio-crepe, for example, that might be “smoother” (covers singing well), while rmvpe is “clearer.” So use whatever you want; but I'll stick to rmvpe.
Transpose / Pitch setting: (-12 / 12)
If a male singer is supposed to sing a song sung by a female, use higher transpose settings and listen to the results (adjust accordingly).
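The transpose value is counted in semitones, so -12 / 12 means one octave down / up. In standard equal temperament each semitone multiplies the frequency by the twelfth root of 2, which the short sketch below works out (the function name is just an example):

```python
def transpose_ratio(semitones: int) -> float:
    """Frequency multiplier for a pitch shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)
```

So +12 doubles the pitch, -12 halves it, and something like +5 raises it by roughly a third.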
Search feature ratio:
Controls the value of the .index file (basically adjusts the “accent”). Leave it at the default (0.75), but feel free to experiment by going higher or lower.
🤔 Where to get models? 🤔
If you don't want to train a model yourself and just want to find existing ones, the best way to find them is either on the AI Hub Server in the 🎧┋voice-models channel or on weights.gg.
If the model doesn't exist yet, you can request one in the 🎫┋request-model channel, either free or paid (shameless plug)
Once you have found a model, download it and extract the zip. Put the files in their respective folders if you are using it locally:
- .pth: (RVC Root)/weights
- .index: (RVC Root)/logs
If you are using a Cloud-based solution, just take the link itself.
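For local use, sorting the downloaded files into the two folders listed above can be automated. This is a minimal sketch: it assumes the download is a regular .zip, uses the weights and logs folder names from this guide, and ignores every file that is not a .pth or .index. The function name is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def install_model(zip_path: str, rvc_root: str) -> list[str]:
    """Copy .pth files to <root>/weights and .index files to <root>/logs."""
    root = Path(rvc_root)
    targets = {".pth": root / "weights", ".index": root / "logs"}
    placed = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            suffix = Path(info.filename).suffix.lower()
            if info.is_dir() or suffix not in targets:
                continue  # skip folders, readmes, and other extras
            dest_dir = targets[suffix]
            dest_dir.mkdir(parents=True, exist_ok=True)
            # Drop any folder structure from inside the zip, keep only the file name.
            dest = dest_dir / Path(info.filename).name
            with archive.open(info) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
            placed.append(str(dest.relative_to(root)))
    return sorted(placed)
```

The return value lists where each file ended up, relative to your RVC folder.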
📤 PUBLISHING A MODEL 📤
If you want to share a model, grab your model with the best epoch state, and also its ADDED .index file (not trained index) under the corresponding folder in logs.
Zip these two together to share them with whoever you want.
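Zipping the two files can also be scripted. The sketch below assumes the ADDED index file has a name that starts with "added" (that is how I have seen it named, but check your own logs folder) and refuses anything else, so you don't ship the trained index by accident. The function name is made up.

```python
import zipfile
from pathlib import Path


def package_model(pth_file: str, index_file: str, out_zip: str) -> list[str]:
    """Zip one .pth and its added .index together; returns the names inside the zip."""
    index_name = Path(index_file).name
    # Assumption: the added index is recognisable by its "added" file name prefix.
    if not index_name.startswith("added"):
        raise ValueError("expected the ADDED .index file, not the trained one")
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(pth_file, arcname=Path(pth_file).name)
        archive.write(index_file, arcname=index_name)
    with zipfile.ZipFile(out_zip) as archive:
        return archive.namelist()
```

The resulting zip contains just the two files at the top level, with no nested folders.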
If you want to post models on AI Hub (or learn how to generally share a model in more detail), follow one of these guides:
- English Guide - FDG
- German Guide - diablofx
- Spanish Guide - Julia (ailen2091)
- Hindi Guide - Enes
- Turkish Guide - Enes
🎤 USING A REAL-TIME VOICE CHANGER 🎤
This guide by Raven has everything you need. No notes.
There is also this guide by Antasma, for using it on Colab, but just follow the first guide.
🛠️ TTS/Text-to-Speech 🛠️
Honestly, the best way to do that is to just use a Cloud solution, like Ilaria TTS
Bonus points for working on mobile too!
Just make sure you have a HuggingFace account if you haven't already, and duplicate the space before using any HuggingFace Space.
☁️ Cloud-Alternatives ☁️
Use these if you don't have the capacity to work locally.
🗣️ Inferring 🗣️
Colab
AICoverGen NO UI
HuggingFace
AICoverGen
Ilaria RVC
Guide: https://rentry.org/ilarvc_inf_guide
You can also go to the AI Hub Server and use the wordband bot or AstraLabs if you want.
Wordband guide: https://rentry.co/WordBand
💪 Training 💪
Colab
RVC Disconnected
Guide: https://docs.google.com/document/u/0/d/1XuxQYiqEhYrdYeCZRRLrmV_ciMKo0bV-jTCGHu_-5Cc/
🎤 Voice-Changer 🎤
Refer to this section of the guide.
🔉 Creating a Dataset 🔉
Use either MDX-NET or MVSEP, as mentioned in the non-cloud section of this guide. Useful for Low-End-Rigs:
Follow this guide for more info, but the main processes stay the same throughout.