How To RVC

[ Guide by @diablofx, other guides linked with credit ] [ contributors all welcome! ]

This guide focuses on creating a dataset and training locally, since there are already plenty of tutorials on the other subjects.
I'll specifically focus on Windows (training on a Mac isn't it). I'm aiming for conciseness, so this guide is not for absolute beginners. Coverage of paid tools is intentionally limited, as most readers probably lack access. If you do have access, you probably wouldn't be reading this guide.
This guide should still work on all versions of RVC.

I will still mention everything else, so feel free to follow the guide anyway.

~ Table of Contents ~

PREPARATIONS 📜
- Installing Mainline-RVC locally 🖥️
- Installing Tensorboard 📈
CREATING A DATASET 🔉
- Where to get your audio? 🤔
- How to prepare your dataset
  - Length of the Dataset 📏
  - Isolating Vocals
- Using UVR
  - Best Settings for Isolation
  - Noise Gating
  - Normalising
  - Truncating
  - De-Esser
TRAINING 💪
- Using Tensorboard
  - How to read the graphs
- How to spot overtraining
- How to train properly
  - Sample-Rate & other settings
  - Continue Training a model
INFERENCE - How to make AI Cover 🗣️
- Where to get models? 🤔
PUBLISHING A MODEL 📤
REAL-TIME VOICE CHANGER 🎤
TTS/Text-to-Speech 🛠️
Cloud-Alternatives ☁️

📜 PREPARATIONS 📜

Before doing ANYTHING, you need to get an RVC installation.

This guide covers the local training process first. Go here if you are not planning to train locally.
(I will link each equivalent guide in each section)

🖥️ Installing Mainline-RVC locally 🖥️

(Update: Removed the mention of the Mangio-RVC Fork)

Before proceeding, ensure that you have Python 3.10 installed, and be sure to select "Add Python to Environment Variables (PATH)" during the installation to avoid issues!

Also make sure your GPU is good enough: 2nd gen RTX or above. Otherwise, don't bother with a local install and use Colab or Ilaria RVC instead.

I personally use Mainline-RVC and I recommend you install that too. Just go to the releases page I linked, download the package corresponding to your GPU and extract it.

Download and extract the zip file to your preferred location.
Create a new folder (recommended), ensuring no spaces in folder names to prevent potential issues later.
Launch the official Web UI Interface by running the go-web.bat file (or the corresponding custom one if you have one installed)
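If you want to double-check the two requirements above (Python 3.10 and no spaces in folder names) before launching anything, a small sketch like this can help. The folder path is a made-up example, and the function names are mine, not part of RVC.

```python
import sys
from pathlib import Path


def python_is_3_10(version_info=sys.version_info):
    """True if the given interpreter version is Python 3.10.x."""
    return tuple(version_info[:2]) == (3, 10)


def path_has_spaces(path):
    """True if any folder name in the path contains a space."""
    return any(" " in part for part in Path(path).parts)


# Hypothetical install location; replace it with your own folder.
rvc_folder = r"C:\AI\RVC"
print("Running Python 3.10:", python_is_3_10())
print("Spaces in RVC path:", path_has_spaces(rvc_folder))
```

Run it with the same Python that is on your PATH, since that is the one the guide asks you to install.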


OPTIONAL: Easy-GUI & RavenUI

You can also install the Easy-GUI for a simplified custom interface, although I don't recommend it due to reduced features. If you decide to install it, extract it into your newly created folder and run the batch file. After installation, launch the 'run_easiergui.bat' file as you would for the regular interface.
Same with RavenUI (I do recommend this one).

If you ever want to make your own custom UI, this RENTRY might help you figure out which lines to change.

📈 Installing Tensorboard 📈

Install Tensorboard, a crucial but often overlooked tool for maximizing a model's potential and for avoiding unintended issues that could even completely ruin a model. Drop the linked file into your RVC folder and install it.

Use that same .cmd whenever you wish to access Tensorboard. I explain how to use it properly here.

Now that you have RVC installed, let's talk about how to make a good dataset that you can train.

🔉 CREATING A DATASET 🔉

Datasets are just audio files that have been processed to remove the things you do not want in them. But first, you need to download some audio samples of the character/sound you want...

🤔Where to get your audio? 🤔

Always prioritize high-quality audio rips in stereo format with a higher sample rate, preferably sourced from official platforms like Spotify, Qobuz, etc.

Stereo = Audio with 2 channels (left and right ear)

Opt for a lossless format such as .flac over .mp3 or YouTube rips for superior audio quality. Other formats may suffice, but it's advisable to avoid them if possible (please, I beg you). I won't delve into other specific methods here, for obvious reasons.

Get .FLAC using Cobalt or free-mp3-download
If you're feeling lazy (though I'd advise against it), try this
But, personally, I would recommend obtaining the best quality possible in a lossless format from the source itself

If your audio is in a format other than .wav or .flac, I suggest converting it to .wav (especially if it's in .mp3).

▶️ YouTube ▶️

I would strongly advise against ripping audio from YouTube, as YouTube compresses its audio and rips therefore lose quality. However, if you find it necessary, consider using yt-dlp. If that is too confusing for you, you can also use Stacher (a frontend GUI for yt-dlp, same thing).

Follow the straightforward installation guide on the install page. After installation, open the command prompt in the folder path where yt-dlp is located, and use this command to download in WAV format:


yt-dlp "LINK" -f ba --extract-audio --audio-format wav
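If you have a list of links and don't want to retype the command above each time, here is a minimal sketch that wraps it in Python. It assumes yt-dlp is installed and reachable from the folder you run it in; the function names are my own, and the flags are exactly the ones from the command above.

```python
import subprocess


def build_ytdlp_command(url):
    """Build the same yt-dlp call as the command above, as an argument list."""
    return ["yt-dlp", url, "-f", "ba", "--extract-audio", "--audio-format", "wav"]


def download_wav(url, folder="."):
    """Run yt-dlp inside `folder`; raises CalledProcessError if yt-dlp fails."""
    subprocess.run(build_ytdlp_command(url), cwd=folder, check=True)


# Usage (not executed here): download_wav("LINK", folder="dataset_raw")
```

Passing the arguments as a list avoids quoting problems with links that contain characters like "&".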

If you prefer a simpler approach, you can use y2down, which typically saves files in 16Bit 48kHz WAV format.
Choose FLAC or WAV during export.

🎥 Audio from Movies/TV/Anime etc. 🎥

If you want to extract audio from movies, series, etc., I recommend using MKVToolnix.
Just select only the audio you want and save it as a .wav file.

🗣️ Making a model of yourself 🗣️

If you're creating a model of yourself, use a good microphone in a room with little echo and background noise. Record yourself singing, counting, reading, or whatever is relevant to cover a variety of syllables, pitches, and emotions. Cover both low and high pitches, and ensure inclusion of all vowels (a, e, i, o, u).

HOW TO PREPARE YOUR DATASET

📏 Length of the Dataset 📏

I would recommend a minimum of 15-25 minutes, but ideally, aim for around 30-45 minutes.
Less than that will work too, but the model won't handle a wide variety of ranges well, as it wasn't trained on them. Avoid going overboard: exceeding this duration won't necessarily yield better results and only wastes your time/energy.
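To see how your dataset compares to the lengths recommended above, you can add up the duration of your files. This is a rough sketch using only Python's built-in wave module, so it only reads uncompressed PCM .wav files (convert .flac first); the function names and the verdict wording are mine.

```python
import wave
from pathlib import Path


def wav_minutes(folder):
    """Total playing time, in minutes, of all .wav files directly inside `folder`."""
    total_seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            total_seconds += wav_file.getnframes() / wav_file.getframerate()
    return total_seconds / 60.0


def length_verdict(minutes):
    """Compare a duration with the 15-25 minute minimum and 30-45 minute ideal."""
    if minutes < 15:
        return "below the recommended minimum"
    if minutes < 30:
        return "meets the minimum, below the ideal range"
    if minutes <= 45:
        return "within the ideal range"
    return "longer than needed"
```

Measure the files after truncating silence, since silence does not count as training material.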

Isolating Vocals

If you are lucky, you may have already sourced some studio stems with ideally no vocal processing applied, allowing you to skip this step and go straight to here. If not, keep reading.

Whatever you do, though: YOU NEED TO REMOVE REVERB/ECHO/NOISE/HARMONIES for a good dataset, so make sure that's the case.

WARNING: Isolating reduces quality, so avoid doing unnecessary isolations.

But still, try to leave as little processing as possible, make it sound like a stock mic recording.

I will go over 3 methods:

UVR (Ultimate Vocal Remover) - likely your best free option.
This guide will primarily center around UVR, as it's widely used and effective. Personally, I prefer using it sparingly, but it remains a solid choice that gets the job done.
MVSEP - a free alternative but can take a while to use (long queue). The free tier only outputs 320kbps .mp3 and there's a 10-minute limit, so I would advise against using it. MDX B (vocals) is a decent option, but since you're reading a guide on training locally, stick to a local option like UVR, you weirdo.
iZotope RX 10 (paid) - Since this is a paid option, I won't delve into details here. For the sake of this guide, I'll focus on UVR, which suits most use cases. The primary process is essentially the same for all tools, so follow along.

If you have a Low-End-Rig or are interested in trying out MVSEP, follow this guide for more details!

Like I said, I will mainly cover UVR here, but the main process remains the same.

Honorable mention: Adobe Audition

AVOID USING: vocalremover.org / x-minus.pro. While these may suffice for minor tasks, they sacrifice potential quality and are highly limited.

Whichever one you decide to use, the end goal is the same…

Using UVR

If any models mentioned in this guide are not in your list:

Navigate to settings (the wrench icon to the left of "Start Processing")
Access the download center to get whatever modes you need, keep reading for more info

Set the export format to .wav or .flac.

Choose based on space considerations or personal preference. .wav is recommended for optimal quality, but I use .flac.

Then select the audio you want to use as input and find a folder you want to put the output audio in and select that as the output (duh).

BEST SETTINGS FOR ISOLATION

Removing Vocals/Instrumental:

Use Kim Vocal 1 or 2 (MDX-Net) -> Kim Vocal 2 is harsher and might add noise, but it's more precise, so use whatever works for you
MDX23C-InstVoc HQ is probably the best one (some might say it's better than Kim), but Kim Vocal is enough most of the time. It also requires a more powerful GPU
Voc FT (MDX-Net) is good too, try out what works (good for Low-End Rigs)
Honorable mention: Hq3-Inst

AFTER SEPARATING, ALWAYS USE THE GENERATED OUTPUT AS THE NEW INPUT IF YOU'RE TRYING TO DO MORE THAN SEPARATING (duh)

Removing Reverb / Echo:

use De-echo (VR Architecture) or Reverb HQ (MDX-Net)
De-echo is very aggressive, but I like using it, so try that. Avoid using Reverb HQ if possible.

Removing Noise:

UVR-DeNoise (VR)

Removing Harmonies:

5-HP Karaoke (VR) -> 6 is more “aggressive”, but can be used to separate singers
(if you are interested in that, consider checking out this)
UVR-MDX-NET Karaoke 2 (MDX-NET) -> 1 is not worth it most of the time

IF ANYTHING SOUNDS BAD AFTER ISOLATING AND TRYING EVERYTHING
(e.g., inability to remove harmonies or poor quality), exclude it from the dataset and delete it.

Should be self-explanatory, but don't use a lot just for the sake of it.
Quality > Quantity.
Don't try to meet my recommended dataset length no matter what, but be aware that less than that will result in a less polished model that will struggle in certain moments since it doesn't have lots of training data.

THE BEST WAY TO PREPARE YOUR VOCALS:

separate vocals and instrumental
denoise the result
de-reverb the result
if you need to separate vocals or remove harmonies, use karaoke 2
if you didn't denoise before, do it now

I would highly recommend using UVR Denoise or a similar tool in whatever you use (RX, Audition, etc.). It helps more than it hurts.

you should also Noise Gate, Truncate and Normalize (in that order)

Noise Gating

This will remove the sounds below the threshold we set (a threshold-based effect like a compressor, except that it acts on the quiet parts).
You can Noise Gate in Audacity directly, but I recommend a different approach.
If you do use Audacity, use these Settings:

But I would use Renegate (free). Just install as usual, and you can access it under the Effects tab. Then apply my preset! (adjust accordingly)
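To make the idea of a gate concrete, here is a deliberately simplified sketch: it silences every sample whose level is below a threshold. Real gates such as the ones in Audacity or Renegate follow the signal level over time with attack and release settings, so this is only an illustration. The -40 dB value in the usage note is a made-up example, not a recommended setting.

```python
def db_to_amplitude(db):
    """Convert a level in dBFS to a linear amplitude, where 1.0 is full scale."""
    return 10 ** (db / 20)


def hard_gate(samples, threshold_db):
    """Zero every sample whose absolute level is below the threshold."""
    limit = db_to_amplitude(threshold_db)
    return [s if abs(s) >= limit else 0.0 for s in samples]
```

For example, hard_gate(samples, -40) keeps everything at or above an amplitude of 0.01 and mutes the rest.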

Truncating (automatically removing silence)

Having silence in your dataset is a no-no. You can manually cut out silence or simply avoid including it when selecting your samples. However, I strongly recommend truncating under all circumstances; it's essential. Noise gating is unnecessary imo, but this is a must.

To do this, select everything (ctrl+a), go to Effect > Special > Truncate Silence.

There are two controls that determine which audio will be treated as "silence":

Threshold (dB): For audio to be treated as silence, it must be below this threshold level. If insufficient silences are being reduced, increase the threshold to a higher (less negative) number. Choose a value between -48 and -54.5dB for our purposes. Opt for a higher value if you want to add more "breathing" to your vocals.
Duration: The minimum duration for audio to be treated as silence. The audio must stay below the entered "Threshold" for at least this duration to be considered silence. If too few silences are being reduced, decrease this "Duration."
Set “Duration” and “Truncate to” to 0.0004 seconds.

And if not selected already, go for the Action “Truncate Detected Silence” and select “Truncate tracks independently”

ALWAYS PREVIEW AND LISTEN TO THE OUTPUT BEFORE USING TRUNCATE
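If you are curious what the Threshold, Duration and "Truncate to" controls do together, this sketch shows the general idea on a plain list of samples. It is a per-sample simplification written for this guide, not Audacity's actual implementation; the default values are taken from the settings above.

```python
def shorten(run, min_len, keep_len):
    """Cut a quiet run down to keep_len samples if it is long enough to count as silence."""
    if len(run) >= min_len:
        return run[:keep_len]
    return run


def truncate_silence(samples, sample_rate, threshold_db=-48.0,
                     min_duration=0.0004, truncate_to=0.0004):
    """Shorten every run of quiet samples that lasts at least min_duration seconds."""
    limit = 10 ** (threshold_db / 20)             # dB threshold as linear amplitude
    min_len = max(1, round(min_duration * sample_rate))
    keep_len = max(0, round(truncate_to * sample_rate))
    output, run = [], []
    for sample in samples:
        if abs(sample) < limit:
            run.append(sample)                    # still inside a quiet run
            continue
        output.extend(shorten(run, min_len, keep_len))
        run = []
        output.append(sample)
    output.extend(shorten(run, min_len, keep_len))  # handle a quiet tail
    return output
```

A less negative threshold treats more of the audio as silence, which is why previewing the result matters.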

Normalising

When you are done, normalize your audio (Effect > Volume and Compression) to -4dB (consider trying -2dB, but -4dB is recommended). Ensure all three boxes are ticked. This step also eliminates segments of the audio that aren't suitable for the dataset. Export only the mono channels to save space.
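For reference, peak normalization boils down to one multiplication: find the loudest sample and scale everything so that it lands on the target level. A minimal sketch, assuming samples are floats where 1.0 is full scale (the function name is mine):

```python
def normalize_peak(samples, target_db=-4.0):
    """Scale samples so the loudest one sits at target_db (dBFS)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)                  # pure silence: nothing to scale
    gain = 10 ** (target_db / 20) / peak      # -4 dB is roughly 0.63 of full scale
    return [s * gain for s in samples]
```

Since every sample gets the same gain, the relative dynamics of the recording stay untouched.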

Export in either .wav or .flac with 24-bit/level 8.

If you choose the right settings, you can always resume training from where you left off, so don't worry.

ANYTHING NOT MENTIONED IN HERE IS NOT IMPORTANT SO JUST LEAVE IT AT DEFAULT!

De-esser

Now that you have normalized, truncated and run your noise suppression again (if there's still background noise), you can run your de-esser if you think the s sounds are too harsh.

Training

Once your dataset is prepared, we can finally start training 🎉 But… there are still some things you should know about.

USING TENSORBOARD

Before doing anything, confirm that you are on the 'scalars' tab, have 'ignore outliers in chart scaling' enabled, and set the smoothing to 0.987 or the maximum level. If you haven't trained for long, consider turning off smoothing initially and enabling it later (or keep it on consistently). Any value of 0.7 or above is effective; adjust as needed. Additionally, set “Horizontal Axis” to “Step.”

QOL-Tip: Enable “Reload Data” in settings so you don't have to refresh each time.

Press the 3rd option to fit the domain to the data (fit the data to the graph) whenever you are training.
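If you wonder what the smoothing slider actually does to the curve, it is an exponential moving average: each point is blended with the smoothed value before it. The sketch below is my understanding of that idea, including a correction for the start of the curve; treat the exact formula as an assumption rather than Tensorboard's verified code.

```python
def smooth(values, weight=0.987):
    """Exponential moving average with start-up bias correction (weight must be below 1)."""
    smoothed, last = [], 0.0
    for count, value in enumerate(values, start=1):
        last = last * weight + (1 - weight) * value
        # Divide out the bias caused by starting the running average at 0.
        smoothed.append(last / (1 - weight ** count))
    return smoothed
```

With a weight of 0.987 each new point only contributes 1.3% to the line, which is why short runs look flat until you lower the smoothing.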

HOW TO READ THE GRAPHS

You can pretty much ignore all the graphs, except for some in the “loss” tab.
I won't bore you with the details, but pay close attention to:

loss/g/total is the one you should look at the most imo, along with loss/d/total.
loss/g/mel is crucial as well. If the graph is close to 0, it indicates very accurate pitch.

Basically, the graphs should consistently trend downward, including 'loss/g/mel', to ensure a good model. Occasionally checking 'loss/g/kl' may be beneficial, but tbh I never do that.

Now to the important part…

HOW TO SPOT OVERTRAINING


If the graph, at its lowest point, suddenly starts rising significantly, it indicates overtraining. Examine the graph for the minimum value to pinpoint when overtraining begins.

Usually that's enough to tell, but...

Another method is to observe the behavior of loss/d/total and loss/g/total: if one is increasing while the other is decreasing, it's also a sign of overtraining. Both graphs should always move in the same direction.

! ALWAYS VERIFY IF KL AND MEL ARE DECREASING TOO (indicating it's not overtraining).

Whenever you see overtraining happen, promptly click "stop training" in the GUI and test the current epoch state. Verify the epoch state in Tensorboard by hovering over the graph or check the train.log file in the right /logs folder for the corresponding timestamp. If you realize overtraining after completing training, you can revert to a previous epoch state if you followed my recommended settings (i.e., save small .pth file is enabled).
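The "find the lowest point, then go back to a saved epoch" logic can be sketched in a few lines. This assumes you have a list of (ideally smoothed) positive loss values, one per epoch, and that a model is saved at a fixed epoch interval. The function names and the 5% tolerance are illustrative choices of mine, not anything built into RVC.

```python
def lowest_point(losses):
    """Index of the minimum loss value (the first one if there are ties)."""
    return min(range(len(losses)), key=losses.__getitem__)


def rises_after_minimum(losses, tolerance=0.05):
    """True if the latest value sits more than `tolerance` (relative) above the minimum."""
    best = losses[lowest_point(losses)]
    return losses[-1] > best * (1 + tolerance)


def last_save_at_or_before(epoch, save_every=10):
    """Latest saved epoch that is not past `epoch`; 0 means nothing was saved yet."""
    return (epoch // save_every) * save_every
```

If the lowest point was at epoch 137 and you save every 10 epochs, last_save_at_or_before(137) points you to the epoch 130 state. Still listen to the result; the graph is a hint, not a verdict.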

HOW TO TRAIN PROPERLY

Sample-Rate

Basically, a higher sample rate means a broader range of frequencies get captured.

You typically can't go wrong with 32k, most models tend to end up as 32k. 40k and 48k are quite rare, but you need to verify in a spectrogram anyways, so do that.

<18k = 32k model, <20k = 40k model, <24k = 48k model

Doesn't 48k handle higher frequencies better?

While that might be true, it occasionally introduces more noise in the output. And you rarely reach it anyways, so just stick with the safe and reliable choice: 32k. But please, always check a spectrogram before doing anything.


If you are not sure what to use: check where your dataset's frequencies end with this, or just drag your audio into "Ilaria Audio Analyzer" if you don't want to download Spek.
Double the value where the graph ends. For example, if the spectrogram ends at 22kHz, the audio is 44kHz.
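The rule of thumb from this section (<18k = 32k model, <20k = 40k model, <24k = 48k model) and the "double the value" step are easy to write down as code. A small sketch, where the cutoff is the frequency in Hz at which your spectrogram stops showing content; the function names are mine.

```python
def pick_target_sample_rate(cutoff_hz):
    """Map the highest frequency with content to the 32k/40k/48k rule of thumb."""
    if cutoff_hz < 18_000:
        return "32k"
    if cutoff_hz < 20_000:
        return "40k"
    return "48k"


def audio_sample_rate(cutoff_hz):
    """Sample rate implied by a spectrogram that ends at cutoff_hz (twice the cutoff)."""
    return 2 * cutoff_hz
```

Check several files from the dataset, not just one, since a single lossy source can have a much lower cutoff than the rest.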

for more info or as a cheat sheet, check a frequency response table or read more about it here, you nerd

Now that that's all done, you can start training

Navigate to your RVC folder and open the GUI using go-web.bat.
Go to the Train tab at the top:


Choose a model name under “Experiment name” (first option) - avoid spaces in the name.
Choose the target sample rate according to your spectrogram check (usually 32k, see the Sample-Rate section).
Enable pitch guidance.
RMVPE (gpu or + variant) is just the best option to use
Select v2 for the model architecture version.
Set CPU processes to 2. You can increase this for faster processing but may face crashes (BSOD).
Copy the path to your dataset folder and paste it into the “Path to training folder” box.
Once you've pasted your dataset path, click “Process Data” and wait until it's done (the console should say something like “end preprocess”).
Select your pitch extraction method (mangio-crepe or rmvpe), and set your hop length to 128 or 64 if asked (just do 128)
Press “feature extraction” and wait for “all-feature-done” to appear.
Set your saving frequency to 10 (it will save a model every 10 epoch states).
Set your batch size depending on your GPU VRAM size (google it). If it's 6GB VRAM, for example, select 6.
Enable “save only the latest .ckpt” if not enabled already.
Disable, if not disabled already:

link

Enable “Save a small final model to the 'weights' folder at each save point.” Each save can be used as a model on its own, useful in case of overtraining. Epoch states will be saved under the “weights” folder.
Set your epoch amount. I always go for 300. You can always pick an earlier epoch state once it has finished training to 300. I do this to prevent unnecessary training.
Aim for a slightly undertrained model; don't train too little. Always go for the epoch state just before overtraining begins (check with Tensorboard).
Once everything is set up (DON'T MESS WITH THE OPTIONS NOT MENTIONED HERE) press “train feature index”. You can also do this at the end, in case you forget this step, but just do it now 🗿
Once it's added, press “train model” to start training. The command box will notify you when it's done.

CONTINUE TRAINING A MODEL

IMPORTANT:

USE THE EXACT SAME NAME AND ALL THE SAME SETTINGS

DO NOT REPROCESS OR REDO FEATURE EXTRACT!!!!

DON'T DO ANYTHING BESIDES ENTERING THE SAME VALUES.

THE ONLY THING YOU PRESS IS "TRAIN MODEL" AT THE END WITH YOUR EPOCH COUNT SET!!!!

ALSO MAKE SURE YOU ONLY KEEP THE TWO LATEST .PTH FILES in the /logs folder of that model (G_69420 and D_69420, for example). This doesn't mean deleting everything in the folder lol; just ensure there are only 2 .pth files.
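To find out which checkpoint files are the old ones without deleting anything by accident, you can list them first. This sketch assumes the files are named G_<number>.pth and D_<number>.pth, as in the example above, and it only reports candidates; the function name is mine.

```python
import re
from pathlib import Path

CHECKPOINT = re.compile(r"^([GD])_(\d+)\.pth$")


def stale_checkpoints(model_log_folder):
    """Return every G_*.pth / D_*.pth except the highest-numbered one of each kind."""
    newest, found = {}, []
    for path in Path(model_log_folder).iterdir():
        match = CHECKPOINT.match(path.name)
        if not match:
            continue                              # leave logs, indexes, etc. alone
        kind, step = match.group(1), int(match.group(2))
        found.append((kind, step, path))
        newest[kind] = max(step, newest.get(kind, -1))
    return sorted(path for kind, step, path in found if step < newest[kind])
```

Print the returned list, check it by eye, and then remove those files yourself.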

Once you are done training and have checked with Tensorboard for any signs of overtraining, go to your weights folder to view the model.
Congrats!

If you ever plan on uploading your models to AI Hub or share them in any way, see this section of the guide for more information

Inference - How to make an AI cover


You're now ready to make a cover, aka an inference!

All the usual steps [here] still apply. If you're downloading a song directly, make sure you isolate the vocals properly first and then edit them to have no processing for the best results—no reverb, no nothing.

Options:

Volume envelope: The lower you set this, the more it will capture the original volume range of the song. A value of 1 will be equally loud throughout the whole conversion, while 0 will mimic the volume range of the original as closely as possible. Use 0.25 or 0.2.

WHAT PITCH EXTRACTION METHOD TO USE

Use rmvpe or crepe/mangio-crepe if you have it, and ignore the rest; they're not good.

RMVPE is generally the best option overall. Mangio-crepe, if you have it, might sound “smoother” (it covers singing well), while rmvpe is “clearer.” So use whatever you want, but I'll stick to rmvpe.

Transpose / Pitch setting: (-12 / 12)

If a male model is supposed to sing a song originally sung by a female, use a lower (negative) transpose setting; for a female model on a song sung by a male, go higher. Listen to the results and adjust accordingly.
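The transpose value is counted in semitones, so the -12 / 12 range corresponds to one octave down or up. In standard equal temperament, each semitone multiplies the frequency by the twelfth root of 2, which this tiny sketch shows (the function name is mine):

```python
def transpose_ratio(semitones):
    """Frequency multiplier for a pitch shift of `semitones` (equal temperament)."""
    return 2 ** (semitones / 12)
```

So 12 doubles every frequency, -12 halves it, and something like -5 or 7 is worth trying when a full octave overshoots.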

Search feature ratio:

Controls how strongly the .index file is applied (basically adjusts the “accent”). Leave it at the default (0.75), but feel free to experiment by going higher or lower.

🤔 Where to get models? 🤔

If you don't want to train a model yourself and just want to find existing ones, the best place to look is either the AI Hub Server in the 🎧┋voice-models channel or weights.gg.

If the model doesn't exist yet, you can request one in the 🎫┋request-model channel, either free or paid (shameless plug).

Once you have found a model, download it and extract the zip. Put the files in their respective folders if you are using it locally:

.pth: (RVC Root)/weights
.index: (RVC Root)/logs
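If you download a lot of models, sorting the files by hand gets old quickly. Here is a minimal sketch of the mapping above (.pth into weights, .index into logs). Both folder arguments are placeholders for your own paths, and the function name is mine; files are copied, so the extracted originals stay where they are.

```python
import shutil
from pathlib import Path

DESTINATIONS = {".pth": "weights", ".index": "logs"}


def place_model_files(extracted_folder, rvc_root):
    """Copy .pth files to <rvc_root>/weights and .index files to <rvc_root>/logs."""
    copied = []
    for path in sorted(Path(extracted_folder).rglob("*")):
        subfolder = DESTINATIONS.get(path.suffix.lower())
        if subfolder is None or not path.is_file():
            continue                              # skip readmes, images, folders
        target_dir = Path(rvc_root) / subfolder
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(path, target_dir)))
    return copied
```

The returned list tells you exactly what ended up where, which is handy when a zip contains more than one model.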

If you are using a Cloud-based solution, just take the link itself.

📤 PUBLISHING A MODEL 📤

If you want to share a model, grab your model with the best epoch state, and also its ADDED .index file (not trained index) under the corresponding folder in logs.

Zip these two together to share them with whoever you want.
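Zipping the two files can also be scripted with Python's built-in zipfile module. A minimal sketch, where all three paths are whatever you pass in (the function name is mine):

```python
import zipfile
from pathlib import Path


def zip_model(pth_file, index_file, zip_path):
    """Write the model weights and its added .index file into one zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for source in (pth_file, index_file):
            # Store only the file name so the zip has no nested folders.
            archive.write(source, arcname=Path(source).name)
    return Path(zip_path)
```

Keeping both files at the top level of the zip makes it easier for others to drop them into the right folders.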
If you want to post models on AI Hub (or learn how to generally share a model in more detail), follow one of these guides:

English Guide - FDG
German Guide - diablofx
Spanish Guide - Julia (ailen2091)
Hindi Guide - Enes
Turkish Guide - Enes

🎤 USING A REAL-TIME VOICE CHANGER 🎤

This guide by Raven has everything you need. No notes.

There is also this guide by Antasma, for using it on Colab, but just follow the first guide.

🛠️ TTS/Text-to-Speech 🛠️

Honestly, the best way to do that is to just use a Cloud solution, like Ilaria TTS.
Bonus points for working on mobile too!
Just make sure you have a HuggingFace account, and duplicate the space before using any HuggingFace Space.

☁️ Cloud-Alternatives ☁️

Use these if you don't have the hardware to work locally.

🗣️ Inferring 🗣️

Colab

AICoverGen NO UI

Guide: https://docs.google.com/document/d/e/2PACX-1vThk7Qo7yCWNVbxOmahl2R8_Jgi6TFuMBUIi-PWre_HIN0lFTq-dr37Rh5iJlGgYb_vFapXMHt2W8Kp/pub

HuggingFace

AICoverGen

Ilaria RVC

Guide: https://rentry.org/ilarvc_inf_guide

You can also go to the AI Hub Server and use the wordband bot or AstraLabs if you want.
Wordband guide: https://rentry.co/WordBand

💪 Training 💪

Colab

RVC Disconnected

Guide: https://docs.google.com/document/u/0/d/1XuxQYiqEhYrdYeCZRRLrmV_ciMKo0bV-jTCGHu_-5Cc/

🎤 Voice-Changer 🎤

Refer to this section of the guide.

🔉 Creating a Dataset 🔉

Use either MDX-NET or MVSEP, as mentioned in the non-cloud section of this guide. Useful for Low-End-Rigs:
Follow this guide for more info, but the main processes stay the same throughout.

This Guide should pretty much cover everything you need!

Thanks for reading, any questions?
