Creating Datasets for RVC using iZotope RX
In this guide i'll be explaining how to use paid software to clean audio for training models.
iZotope RX is known as the go-to software for denoising audio, and it's the one used by "every" good model maker.
i'll mostly be recommending free VSTs/plugins, plus a few paid plugins which aren't needed but can make the process easier.
This guide also has a step that uses Audacity, a free program you'll need for audio labeling, which works better than Truncate Silence.
from what i know this has only recently gotten support in RVC, so make sure you aren't on an old version.
also, uhm, sorry: i used Google Drive as a hopefully "permanent" image host, but apparently the images only load if you're logged into your Google account. so yea uwu
someone also made a "copy" with the files hosted on imgur this time, though you might still get errors there
so ye, here's the link: RVC-dataset-RX-imgur
imgur may also be banned in your country (like it is for me in Iran)
Google Drive is also a pain to host images on: i have to do goofy Inspect menu stuff to get the direct link, which apparently can also change after a while (either that or it's just a me issue)
as a last-ditch effort, if nothing works, just go to the Google Drive folder itself and check the files there: GoogleDrive folder with rentry images
I'm using Mainline RVC (ver. 1006 at the time of writing)
RVC already provides built versions, so go to the releases tab and download one of these two
something like this for example:
For Nvidia GPU users:
https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/RVC1006Nvidia.7z
For AMD/Intel GPU users:
https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/RVC1006AMD_Intel.7z
Getting a dataset
My preferred way of getting a dataset is using Cobalt
make sure to go to settings, change to Best and select audio
- Cobalt (Website)
i mainly use Cobalt because YouTube is banned in my country
but if you can download locally, use these:
- YT-DLP (command line, no GUI)
- Stacher (GUI)
both work fine, but if you don't want the hassle of learning commands, Stacher would be a good choice
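if you do go the YT-DLP route, here's a rough sketch of what the command can look like, wrapped in Python so it's easy to reuse. the output folder name and the placeholder URL are made up for the example, and converting to WAV needs ffmpeg installed. also keep in mind that converting to WAV doesn't add back quality the source stream never had.

```python
def build_ytdlp_command(url: str, out_dir: str = "raw_audio") -> list[str]:
    """Build a yt-dlp command that grabs the best audio stream and saves it as WAV.

    `out_dir` is a hypothetical folder name, change it to whatever you like.
    """
    return [
        "yt-dlp",
        "-f", "bestaudio",                      # pick the best audio-only stream
        "-x",                                   # extract audio (requires ffmpeg)
        "--audio-format", "wav",                # convert the extracted audio to WAV
        "-o", f"{out_dir}/%(title)s.%(ext)s",   # output filename template
        url,
    ]


# Placeholder URL, replace VIDEO_ID with the real one.
command = build_ytdlp_command("https://www.youtube.com/watch?v=VIDEO_ID")
print(" ".join(command))
```

copy the printed line into a terminal, or pass the list to `subprocess.run(command, check=True)` if you'd rather run it from Python.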
my preferred file format is WAV at 24 or 32 bit, but the files are large, so if you want smaller files, FLAC at compression level 8 would be your second choice. the other options aren't good.
avoid using MP3 for your datasets.
If you end up using WAV, make sure to use either 32 bit float or 24 bit.
in the end 32 bit float and 24 bit are practically the same, so it doesn't matter much which one you pick.
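to get a feel for how large those WAV files get, here's a small back-of-the-envelope calculation. it only counts the raw sample data (no file header), and the 10 minute mono 48kHz example is just an assumption for illustration.

```python
def wav_size_mb(seconds: float, sample_rate: int, bit_depth: int, channels: int = 1) -> float:
    """Approximate size of uncompressed WAV audio data in megabytes (1 MB = 1,000,000 bytes)."""
    bytes_per_sample = bit_depth // 8
    total_bytes = seconds * sample_rate * bytes_per_sample * channels
    return total_bytes / 1_000_000


ten_minutes = 10 * 60
print(wav_size_mb(ten_minutes, 48_000, 24))  # about 86.4 MB
print(wav_size_mb(ten_minutes, 48_000, 32))  # about 115.2 MB
```

so 32 bit costs you a third more disk space than 24 bit. FLAC sizes depend on the audio itself, so there's no simple formula for those.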
Loading the audio and changing some settings
Open the WAV or FLAC file. FLAC takes time to decode, so use WAV if you can, so that you don't have to wait a minute for your dataset to get decoded.
now that we have the file in RX
make sure to turn this to only show the Spectrogram, since you don't need the waveform for now
now after that it should look like this
this is using Mel scaling. if you right click on the frequency numbers (20k and such), you can change the scaling.
Mel is the best scaling in our case since it shows vocals better than Linear scaling would.
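if you're curious why Mel shows vocals better, here's a tiny sketch using the common textbook Hz-to-mel formula. i'm assuming this formula only for illustration, RX may use a different variant internally.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Common textbook Hz-to-mel conversion (assumed here, not taken from RX)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


top_hz = 20_000   # top of the spectrogram in this example
voice_hz = 4_000  # rough upper edge of the range where most vocal detail sits

linear_share = voice_hz / top_hz
mel_share = hz_to_mel(voice_hz) / hz_to_mel(top_hz)

print(f"0-4k takes up {linear_share:.0%} of a linear axis")
print(f"0-4k takes up {mel_share:.0%} of a mel axis")
```

on a linear axis the 0-4k range only gets a fifth of the height, while on a mel axis it gets more than half, so the part where the voice lives is stretched out and easier to see.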
the brighter a color on the spectrogram, the louder it is
the spectrogram only goes up to half the sample rate: if the top of it is at 20k, the file is 40kHz. basically the top of the spectrogram x2 is the actual sample rate
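the x2 rule in code form, in case the arithmetic is easier to read that way (the function names are just mine):

```python
def top_of_spectrogram_hz(sample_rate_hz: float) -> float:
    """Highest frequency a file can contain: half of its sample rate."""
    return sample_rate_hz / 2


def sample_rate_from_top_hz(top_hz: float) -> float:
    """The reverse: double the top of the spectrogram to get the sample rate."""
    return top_hz * 2


print(sample_rate_from_top_hz(20_000))  # 40000
print(top_of_spectrogram_hz(32_000))    # 16000.0
print(top_of_spectrogram_hz(48_000))    # 24000.0
```

so a 32kHz file shows up to 16k on the spectrogram, and a 48kHz file shows up to 24k.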
trying to explain a spectrogram
in my opinion a spectrogram can't really be taught, you have to mess around with it yourself and you'll basically learn it on your own
but take this spectrum as an example: if you rotate it on its side, you get the spectrogram at that moment
the top of the Spectrogram would be the right-most part of the spectrum.
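here's a minimal sketch of that idea: a spectrogram is just a row of spectra, one per short slice of audio, each one stood up on its side. it uses a slow naive DFT in plain Python so it runs without extra libraries. the test tone, frame size and hop are arbitrary choices for the example, and real tools like RX use windowing and much faster transforms.

```python
import cmath
import math


def spectrum(frame):
    """Magnitude spectrum of one frame via a naive DFT (bins 0 .. N/2)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc))
    return magnitudes


def spectrogram(samples, frame_size=64, hop=32):
    """One spectrum per frame: each spectrum becomes one column of the picture."""
    starts = range(0, len(samples) - frame_size + 1, hop)
    return [spectrum(samples[s:s + frame_size]) for s in starts]


sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

columns = spectrogram(samples)
first = columns[0]
peak_bin = max(range(len(first)), key=lambda k: first[k])
peak_hz = peak_bin * sample_rate / 64  # each bin is sample_rate / frame_size wide

print(len(columns), "columns,", len(first), "bins each")
print("loudest bin sits at", peak_hz, "Hz")
```

the last bin in each column is the highest frequency, which is why the right-most end of the spectrum ends up at the top of the spectrogram.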
examples
for example, this is noise:
and this is breathing:
and this is speech:
as for the rest, i'm too lazy to try and explain it all
Modules / VSTs used and their settings
i use a couple of the modules in RX
first i use these
1. Adaptive Phase Rotation
2. Normalize to -3 dB
3. De-Ess
4. De-Crackle
5. De-Click
6. Spectral Denoise
7. Deconstruct
8. Plugins
   - Auburn Renegate: https://www.auburnsounds.com/products/Renegate.html
     (basically a noise gate but better) the free version will be more than enough
   - KiloHearts Dynamics: https://kilohearts.com/products/dynamics
     used for Downward Expansion
   - Bertom Denoiser Pro: https://bertomaudio.com/
     for some reason it doesn't work in RX for me, so i use it in Audacity instead
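two items in the list above are simple enough to sketch in code: "Normalize to -3 dB" and "Downward Expansion". this is only the basic math behind them, not how RX or the plugins implement it, and the threshold and ratio values are made-up examples rather than recommended settings.

```python
def db_to_linear(db: float) -> float:
    """Convert a level in dB to a linear amplitude factor."""
    return 10 ** (db / 20)


def normalize_peak(samples, target_db=-3.0):
    """Scale the audio so its loudest sample sits at target_db (relative to full scale)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence, nothing to scale
    gain = db_to_linear(target_db) / peak
    return [s * gain for s in samples]


def expander_gain_db(level_db, threshold_db=-45.0, ratio=2.0):
    """Static gain curve of a downward expander.

    Above the threshold nothing changes. Below it, every dB under the
    threshold is pushed down further, so quiet noise gets even quieter.
    threshold_db and ratio are hypothetical example values.
    """
    if level_db >= threshold_db:
        return 0.0
    return (level_db - threshold_db) * (ratio - 1.0)


normalized = normalize_peak([0.1, -0.5, 0.25])
print(max(abs(s) for s in normalized))  # roughly 0.708, which is -3 dB
print(expander_gain_db(-30.0))          # 0.0 (speech level, untouched)
print(expander_gain_db(-55.0))          # -10.0 (noise floor, pushed down)
```

a noise gate is the extreme version of the same curve: instead of turning quiet parts down by a ratio, it turns them down (almost) completely.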
Denoising the dataset
this audio file is at 48kHz even though there's only about 24kHz worth of actual data in it, most likely because it was recorded on a phone. the process is the same either way, it'll just come out noisier even with the 32k pretrain, since RVC has to guess the rest of the frequencies.
for now i'll just resample to 32kHz (so the spectrogram tops out at 16k) and denoise that
this is before me touching the audio file, only after resampling it to 32kHz
now we select the noise
now in this case one pass of Spectral Denoise wasn't enough
this is the noise profile for the first pass
second time running Spectral Denoise
and this is the noise profile for the second pass
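RX doesn't publish how Spectral Denoise works internally, so i can't show the real thing. but the general idea of "learn a noise profile, then remove it" can be sketched with textbook spectral subtraction, which is what the code below does. it works on magnitude frames (lists of per-bin loudness), and the `reduction` and `floor` values are made-up examples.

```python
def noise_profile(noise_frames):
    """Average loudness per frequency bin over frames that contain only noise."""
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / len(noise_frames) for k in range(n_bins)]


def subtract_noise(frame, profile, reduction=1.0, floor=0.05):
    """Textbook spectral subtraction on one magnitude frame.

    Each bin is lowered by the learned noise level. The floor keeps a small
    fraction of the original so bins never go negative or fully dead.
    """
    cleaned = []
    for magnitude, noise in zip(frame, profile):
        lowered = magnitude - reduction * noise
        cleaned.append(max(lowered, floor * magnitude))
    return cleaned


# Two noise-only frames (the part you select in RX before learning the profile).
profile = noise_profile([[0.2, 0.2, 0.2], [0.4, 0.2, 0.0]])

# One frame with a loud voice component in bin 0 and only noise in bins 1 and 2.
print(subtract_noise([1.0, 0.2, 0.1], profile))
```

the loud bin barely changes while the noise-only bins drop to almost nothing. running a second pass just means learning a new profile from what's left and subtracting again.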
from here on the rest is manual denoising, which i can't really teach
so just use RX long enough to learn what noise looks like, and then manually clean those spots
now that we've also run our plugins like Auburn Renegate and kHs Dynamics, along with Bertom Denoiser Pro
we save our file as 32 bit float
24 bit would be just as good, but i like overkill uwu
now we open Audacity
first convert your dataset to mono, since RVC works on mono and not stereo
Truncate Silence is the old technique; we've got a better way, and it's called audio labeling
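in case you wonder what a stereo to mono conversion does to the audio: one common way is to average the left and right channels sample by sample. this is a sketch of that idea, i'm not claiming it's exactly what Audacity does under the hood.

```python
def stereo_to_mono(left, right):
    """Average the left and right channels into a single mono channel."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    return [(l + r) / 2 for l, r in zip(left, right)]


print(stereo_to_mono([1.0, 0.0], [0.0, 0.5]))  # [0.5, 0.25]
```

averaging (instead of just adding) keeps the result from clipping when both channels are loud at the same time.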
follow these steps
open the menu for labeling
you will now have it like this
now we go to export our audio
the output will be like this
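to show what labeling does conceptually, here's a rough sketch that finds the stretches of sound between silences and writes them out as tab-separated lines (start, end, name), which is the layout Audacity uses for label track text files. the threshold, the minimum silence length and the `clip_001` naming are assumptions for the example, not Audacity's actual detection settings.

```python
def find_sound_regions(samples, sample_rate, threshold=0.02, min_silence_s=0.5):
    """Return (start_s, end_s) pairs for stretches that are louder than threshold.

    A region only ends once the audio has stayed quiet for min_silence_s,
    so short pauses inside a sentence don't split it.
    """
    min_gap = int(min_silence_s * sample_rate)
    regions = []
    start = None
    last_loud = None
    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            regions.append((start / sample_rate, (last_loud + 1) / sample_rate))
            start = None
    if start is not None:  # audio ended while a region was still open
        regions.append((start / sample_rate, (last_loud + 1) / sample_rate))
    return regions


def to_label_lines(regions):
    """Format regions as tab-separated label lines: start, end, label name."""
    return [
        f"{start:.6f}\t{end:.6f}\tclip_{n:03d}"
        for n, (start, end) in enumerate(regions, start=1)
    ]


# Tiny fake signal at 10 samples per second: silence, sound, silence, sound, silence.
fake_audio = [0.0] * 5 + [0.5] * 10 + [0.0] * 10 + [0.5] * 5 + [0.0] * 3
for line in to_label_lines(find_sound_regions(fake_audio, sample_rate=10)):
    print(line)
```

each label then becomes one exported file, which is why this keeps the spoken parts whole instead of squashing the silences out of one long file like Truncate Silence does.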
now go into the RVC folder and place all these files in the datasets folder
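if you export often and get tired of dragging files around, a small script can do the copying. both folder paths below are hypothetical, so point them at your own Audacity export folder and your own RVC install.

```python
import shutil
from pathlib import Path


def copy_clips(export_dir, dataset_dir):
    """Copy every exported .wav clip into the dataset folder and return their names."""
    destination = Path(dataset_dir)
    destination.mkdir(parents=True, exist_ok=True)  # create the folder if it's missing
    copied = []
    for clip in sorted(Path(export_dir).glob("*.wav")):
        shutil.copy2(clip, destination / clip.name)
        copied.append(clip.name)
    return copied


# Hypothetical paths, adjust them before running:
# copy_clips("audacity_export", "RVC/datasets/my_voice")
```

the originals stay where they are, so you can delete the export folder yourself once you've checked the copies.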