Creating Datasets for RVC using iZotope RX

In this guide I will be explaining how to use paid software to clean audio for training models.
iZotope RX is known as the go-to software for denoising audio, and it's the one used by "every" good model maker.
I'll be recommending mostly free VSTs/plugins, plus some paid plugins which aren't needed but can make the process easier.
This guide also has a step which uses Audacity, a free program which you'll also need for audio labeling, a better method than Truncate Silence.
From what I know, this has only recently gotten support in RVC, so make sure you aren't on an old version.

I'm using Mainline RVC (at the time of writing, ver. 1006).
RVC already provides prebuilt versions, so go to the releases tab and download one of these two.
Something like this, for example:

For Nvidia GPU users:
https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/RVC1006Nvidia.7z

For AMD/Intel GPU users:
https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/RVC1006AMD_Intel.7z

Getting a dataset

My preferred way of getting a dataset is using Cobalt.
Make sure to go to settings, change to Best, and select audio.

  • Cobalt (Website)
    I mainly use Cobalt because YouTube is banned in my country

But if you can download locally, use these:

- yt-dlp: command line use (no GUI)
- Stacher (GUI)
Both work fine, but if you don't want the hassle of learning commands, Stacher would be a good choice.
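
If you do go the command line route, here is a minimal sketch of what a yt-dlp call for audio could look like, built as an argument list in Python. The flags (`-f bestaudio`, `-x`, `--audio-format`, `-o`) are real yt-dlp options, but the function name, the placeholder URL, and the choice of converting straight to WAV are my assumptions, not something the tools require. Audio extraction in yt-dlp needs ffmpeg installed.

```python
import shlex


def build_ytdlp_command(url, audio_format="wav", output_template="%(title)s.%(ext)s"):
    """Build a yt-dlp argument list that grabs the best audio stream
    and converts it to the requested format (hypothetical helper)."""
    return [
        "yt-dlp",
        "-f", "bestaudio",               # pick the best available audio-only stream
        "-x",                            # extract audio (requires ffmpeg)
        "--audio-format", audio_format,  # e.g. "wav" or "flac"
        "-o", output_template,           # output file name template
        url,
    ]


# Placeholder URL, replace it with the video you actually want.
cmd = build_ytdlp_command("https://example.com/watch?v=PLACEHOLDER")
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.run(cmd, check=True)` if yt-dlp is on your PATH.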

My preferred file format is WAV at 24 or 32 bit, but it has large file sizes, so if you want smaller files, FLAC at compression level 8 would be your second choice. The other options aren't good.
Avoid using MP3 for your datasets.
If you end up using WAV, make sure to use either 32-bit float or 24-bit.
In the end, 32-bit float and 24-bit are practically the same, so the choice doesn't matter much.
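
To get a feel for the "large file sizes" part, here is a rough sketch of how uncompressed WAV size scales with bit depth. It ignores the small file header, and the 48kHz mono, 10 minute example is just an assumed scenario.

```python
def wav_size_bytes(duration_s, sample_rate=48000, bit_depth=24, channels=1):
    """Approximate size of the raw PCM data in a WAV file (header ignored)."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * channels * bytes_per_sample)


ten_minutes = 10 * 60  # seconds
size_24 = wav_size_bytes(ten_minutes, bit_depth=24)
size_32 = wav_size_bytes(ten_minutes, bit_depth=32)

print(f"24-bit: {size_24 / 1_000_000:.1f} MB")
print(f"32-bit: {size_32 / 1_000_000:.1f} MB")
```

32-bit always comes out a third larger than 24-bit for the same audio, which is the only practical cost of the "overkill" option.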

Loading the audio and changing some settings

Open the WAV or FLAC file. FLAC takes time to decode, so use WAV if you can, so that you don't have to wait a minute for your dataset to get decoded.

Now that we have the file in RX,
make sure to turn this to only show the spectrogram, since you don't need the waveform for now.

Opacity

After that it should look like this:
Spectrogram
This is using Mel scaling. If you right-click on the frequency scale (the list of numbers such as 20k), you can change the scaling.
Mel is the best scaling in our case, since it shows vocals better than Linear scaling would.
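
To show why Mel gives vocals more room, here is a small sketch using a commonly used Hz-to-mel formula. I don't know which exact mel variant RX uses, so treat the formula as an assumption, and the 100 Hz to 4 kHz "voice band" is just an illustrative range I picked.

```python
import math


def hz_to_mel(freq_hz):
    """Common Hz-to-mel conversion (assumed, RX may use a different variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def axis_share(low_hz, high_hz, top_hz=20000.0, mel=True):
    """Fraction of the vertical axis taken up by the band low_hz..high_hz."""
    if mel:
        return (hz_to_mel(high_hz) - hz_to_mel(low_hz)) / hz_to_mel(top_hz)
    return (high_hz - low_hz) / top_hz


# Illustrative voice band on a spectrogram that tops out at 20k.
voice_mel = axis_share(100.0, 4000.0, mel=True)
voice_linear = axis_share(100.0, 4000.0, mel=False)

print(f"Mel scaling:    {voice_mel:.0%} of the axis")
print(f"Linear scaling: {voice_linear:.0%} of the axis")
```

Under these assumptions the same band takes up a much larger part of the display with Mel scaling than with Linear, so details in the voice are easier to see.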

the brighter a color on the spectrogram, the louder it is

If the spectrogram shows content up to 20k, the file needs a sample rate of at least 40kHz. The top of the spectrogram is always half the sample rate, so the spectrogram x2 gives you the actual sample rate.
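
The relation between the top of the spectrogram and the sample rate is just a factor of two. A tiny sketch, with function names that are mine:

```python
def spectrogram_top_hz(sample_rate_hz):
    """Highest frequency a file can hold: half its sample rate."""
    return sample_rate_hz / 2


def min_sample_rate_hz(highest_freq_hz):
    """Lowest sample rate that can still hold the given frequency."""
    return highest_freq_hz * 2


for rate in (32000, 40000, 48000):
    print(f"{rate} Hz sample rate -> spectrogram tops out at {spectrogram_top_hz(rate):.0f} Hz")
```

So a 32kHz file shows up as 16k on the spectrogram, and a 48kHz file as 24k.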

trying to explain a spectrogram

In my opinion, reading a spectrogram can't really be taught; you have to mess around with it and learn for yourself.
But take this spectrum as an example: if you rotate it on its side, you get one vertical slice of the spectrogram.
Spectrum
The top of the spectrogram corresponds to the right-most part of the spectrum.

examples

for example, this is noise:
Noise
and this is breathing:
Breathing
and this is speech:
Speech
As for the rest, I'd say I'm too lazy to try and somehow explain it all.

Modules / VSTs used and their settings

I use a couple of the tools in RX.
First I use these:

  1. Adaptive Phase Rotation
    Phase

    1. Normalize to -3 dB
      Normalize
    2. De-Ess
      De-Ess
    3. De-Crackle
      De-Crackle
    4. De-Click
      De-Click
    5. Spectral Denoise
      refer to the section on denoising the dataset if you don't know how to use it
      Spectral-Denoise
    6. Deconstruct
      Deconstruct
    7. Plugins
      1. Auburn Renegate
        https://www.auburnsounds.com/products/Renegate.html
        (basically a noise gate but better) the free version will be more than enough
        Renegate
      2. Kilohearts Dynamics
        https://kilohearts.com/products/dynamics
        Downward Expansion
        kHs-Dynamics
      3. Bertom Denoiser Pro
        https://bertomaudio.com/
        for some reason it doesn't work in RX for me, so I use it in Audacity instead
        Bertom
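
For the "Normalize to -3 dB" step in the list above, here is a rough sketch of what peak normalization does to the samples. It assumes a plain peak-based normalize on floating point samples in the -1.0 to 1.0 range; it is not RX's actual implementation, and the helper names are mine.

```python
def db_to_linear(level_db):
    """Convert a level in dBFS to a linear amplitude (0 dB -> 1.0)."""
    return 10 ** (level_db / 20)


def peak_normalize(samples, target_db=-3.0):
    """Scale samples so the loudest one lands exactly on target_db."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence, nothing to scale
    gain = db_to_linear(target_db) / peak
    return [s * gain for s in samples]


quiet_clip = [0.0, 0.1, -0.25, 0.2, -0.05]  # made-up sample values
normalized = peak_normalize(quiet_clip, target_db=-3.0)
print(f"new peak: {max(abs(s) for s in normalized):.4f}")
```

Every sample gets multiplied by the same gain, so the loudest point ends up at -3 dB and everything else keeps its relative level.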

Denoising the dataset

This audio is at 48kHz even though there is only about 24kHz of actual data, most likely because it was recorded on a phone. The process is the same either way; the result will just be noisier even with the 32k pretrain, since RVC has to guess the rest of the frequencies.
For now I'll just resample it to 32kHz, aka 16k on the spectrogram.

This is the audio file before I touched it, only after resampling it to 32kHz:
Original
Now we select the noise (Shift + click after selecting an area), press Learn, then deselect the audio and click Render.
Noise-Selected
In this case, one pass of Spectral Denoise wasn't enough:
one-pass
The noise profile for the first pass was:
Noise-Profile

This is the second pass of Spectral Denoise, before adding the Renegate noise gate and Kilohearts Dynamics:
Second-Pass
And the noise profile for the second pass:
second-pass-profile
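
Since downward expansion (the Kilohearts Dynamics setting listed earlier) is less familiar than a plain noise gate, here is a minimal sketch of the idea: anything below a threshold gets pushed down further, anything above is left alone. The threshold and ratio values are made-up examples, not settings from the plugin, and the real plugin also handles attack, release and knee, which this ignores.

```python
def expander_gain_db(level_db, threshold_db=-40.0, ratio=2.0):
    """Gain in dB applied by a simple downward expander.

    Above the threshold nothing changes. Below it, every dB under the
    threshold becomes `ratio` dB under the threshold at the output.
    """
    if level_db >= threshold_db:
        return 0.0
    return (level_db - threshold_db) * (ratio - 1.0)


for level in (-20.0, -40.0, -50.0, -60.0):
    gain = expander_gain_db(level)
    print(f"input {level:6.1f} dB -> gain {gain:6.1f} dB -> output {level + gain:6.1f} dB")
```

A hard noise gate is roughly the same thing with a very high ratio, which is why an expander tends to sound smoother on the quiet tails of words.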

For thud sounds or small noises that RX 10, Bertom, or UVR denoise can't remove, use the lasso selection tool:
we can carve out the small bright areas and press Delete to silence them.
silence

From here on, the rest is manual denoising, which I can't really teach.
Just use RX long enough to learn what noise looks like, and then manually clean those spots.

Now that we've also run our plugins (Auburn Renegate and kHs Dynamics, along with Bertom Denoiser Pro),
we save our file as 32-bit float.
24-bit will be just as good, but I like overkill uwu
Now we open Audacity.
First, convert your dataset to mono, since RVC works on mono and not stereo.
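
If you're curious what the mono conversion does under the hood, here is a small sketch that averages the left and right channels. It assumes interleaved floating point samples (L, R, L, R, ...) and a plain average; Audacity does this for you through its menus, so this is only to illustrate the idea.

```python
def stereo_to_mono(interleaved):
    """Average left/right pairs of an interleaved stereo sample list."""
    if len(interleaved) % 2 != 0:
        raise ValueError("stereo data must contain an even number of samples")
    return [
        (interleaved[i] + interleaved[i + 1]) / 2
        for i in range(0, len(interleaved), 2)
    ]


stereo = [0.5, 0.25, -1.0, 0.0, 0.25, 0.25]  # made-up L, R, L, R, L, R values
mono = stereo_to_mono(stereo)
print(mono)
```

Averaging keeps the result from clipping, since the mono sample can never be louder than the louder of the two channels.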
Truncate Silence is the old technique; we've got a better way, and it's called audio labeling.
Follow these steps:
Open the menu for labeling:
label
Labels
You will now have something like this:
this
Now we go to export our audio:
Export
Export-Settings
The output will look like this:
output.
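
Each label is really just a start time, an end time and a name, and the export cuts one file per label. As a rough illustration, here is a sketch that reads labels in the tab-separated text layout Audacity uses when you export a label track (start, end, name per line) and turns them into sample ranges. The label names, times and the 32kHz sample rate are made-up example values, and the guide itself does all of this through Audacity's export dialog rather than a script.

```python
def parse_audacity_labels(text):
    """Parse 'start<TAB>end<TAB>name' lines into (start_s, end_s, name) tuples."""
    segments = []
    for line in text.splitlines():
        # Skip blank lines and the extra backslash lines that carry
        # frequency ranges for spectral selections.
        if not line.strip() or line.startswith("\\"):
            continue
        parts = line.split("\t")
        start_s, end_s = float(parts[0]), float(parts[1])
        name = parts[2] if len(parts) > 2 else ""
        segments.append((start_s, end_s, name))
    return segments


def to_sample_ranges(segments, sample_rate):
    """Convert label times in seconds to (first_sample, last_sample, name)."""
    return [
        (round(start_s * sample_rate), round(end_s * sample_rate), name)
        for start_s, end_s, name in segments
    ]


example_labels = "0.000000\t4.500000\tclip_01\n6.250000\t10.000000\tclip_02\n"
segments = parse_audacity_labels(example_labels)
ranges = to_sample_ranges(segments, sample_rate=32000)
for first, last, name in ranges:
    print(f"{name}: samples {first} to {last}")
```

The gap between the two example labels (4.5 s to 6.25 s) is simply left out of the dataset, which is why labeling gives you more control than Truncate Silence.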
Now go into the RVC folder and place all these files in the datasets folder.


Pub: 06 Mar 2024 07:52 UTC

Edit: 15 Mar 2024 15:22 UTC
