THIS WAS ALL TESTED ON Mangio-RVC-Fork AND ITS F-RVC-exp CONTINUATION FORK

(From what I know, each fork logs differently for TensorBoard.)

Things you may need for this guide:

  • Audacity
  • iZotope RX Audio Editor
  • Adobe Audition
  • Spek

TL;DR is at the bottom of the page

Starting with TensorBoard

Mangio logs a point on the graph every epoch.
[image: TensorBoard]
For this showcase, the smoothing has been lowered to 0 to easily differentiate between each step, but the ideal smoothing is around 0.7-0.95; I usually go for 0.987 since it's a common ground for good smoothing.

As for the settings, I recommend the following:
[image: TensorBoard settings]
Basically, enable Toggle Y-Axis Log Scale, and use Fit Domain to Data when you reload the data. The Reload button is up on the right, next to the gear icon; if you want the graph to reload automatically instead, click the gear icon, choose the Reload Data option, and set the Reload Time Period (seconds). You can ignore the Pagination Limit.
[image: settings gear]
You can check what the reload period should be by looking at the console.
[image: console]
In this example we average 1 epoch every 0.5 seconds, so a much shorter reload period would make sense, but since TensorBoard doesn't allow a reload period under 30 seconds, you have to use 30 seconds.
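If you'd rather pull the points out of the event files yourself instead of reading them in the UI, here is a minimal sketch using TensorBoard's EventAccumulator, plus an exponential moving average that roughly matches the UI's smoothing slider. The log directory and tag are placeholders; point them at wherever your fork actually writes its logs.

```python
# Minimal sketch: read a scalar series from TensorBoard event files and
# apply exponential smoothing roughly like the TensorBoard UI slider.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(log_dir, tag):
    ea = EventAccumulator(log_dir)  # placeholder path: your fork's log dir
    ea.Reload()                     # parse the event files on disk
    events = ea.Scalars(tag)        # e.g. "loss/g/total"
    steps = [e.step for e in events]
    values = [e.value for e in events]
    return steps, values

def smooth(values, weight=0.987):
    # Exponential moving average, similar to the smoothing slider.
    smoothed, last = [], values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

steps, values = load_scalar("logs/my_experiment", "loss/g/total")
print(list(zip(steps, smooth(values)))[-5:])  # last few smoothed points
```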

loss/g/mel, loss/g/kl, loss/g/total, loss/d/total

These are the important ones for making a good model:
Mel and KL are basically the stats of how good a model is (you can ignore KL, it's not very important).
Mel is basically how accurate the pitch is: the closer to 0, the more accurate it is.
But basically, as long as KL and Mel go down, it's a good™ model.
KL stands for Kullback–Leibler divergence.

G stands for Generator and D stands for Discriminator; in simple words:

  • Generative models can generate new stuff.
  • Discriminative models discriminate between different kinds of stuff.
    A generative model could generate new stuff like photos of animals that look like real animals, while a discriminative model could tell a dog from a cat.

When training begins, the generator generates obviously fake data, and the discriminator quickly learns to tell that it's fake:
[image: bad GAN]
As training progresses, the generator gets closer to generating output that can fool the discriminator:
[image: OK GAN]
The discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.
[image: OK GAN]
For more detail go to: https://developers.google.com/machine-learning/gan
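To make the tug of war concrete, here is a toy GAN training loop in PyTorch. This is a generic sketch on made-up 1-D data, not RVC's actual architecture; it just shows where a g-loss and a d-loss come from.

```python
# Toy GAN loop to illustrate the G-vs-D dynamic (not RVC's architecture).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 3.0   # "real" data: N(3, 0.5)
    fake = G(torch.randn(64, 16))           # generator output

    # D learns to score real samples high and fakes low.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # G learns to make D score its fakes as real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    if step % 200 == 0:
        print(f"step {step}: loss_d={loss_d.item():.3f} loss_g={loss_g.item():.3f}")
```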

GAN

[image: GAN diagram]
The Discriminator and the Generator are basically fighting in a race.
[image: loss graph 1] [image: loss graph 2]
These are illustrative graphs, not actual graphs from a trained model (it would take a long time and be impractical to wait on the chance for these graphs to occur). In the graphs above, both G and D are training. This means:
The Discriminator is getting better at knowing whether something is fake or real, and the Generator is also learning to make better output to fool the Discriminator.
If d/loss is going down but g/loss is going up, the model is OVERTRAINING.
The same goes the other way: if G is going down but D is going up, it's OVERTRAINING.
(OVERFITTING is the correct name for overtraining.)
But basically, as long as G and D mirror each other, it's good. Also: when G and D stop going down, it doesn't mean training is done. Check KL and Mel as well; if those are still going down, it's not done training.
In other words, an overfit model only gets good at replicating its dataset: same pitch and other variations.
But for new, UNSEEN data (which is what will be used for inference), it will do trash, because the model was taught to replicate the dataset rather than manipulate it for new scenarios.
Putting it simply: a normal model would be, say, 60% reproducing the original and 40% manipulation, or 50% original and 50% manipulation.

(Not actual numbers; these are purely examples.)

Here the graph goes up a bit at 4k-8k steps, but this is fine since it stabilizes again afterwards and keeps learning. No matter what, you should always use the checkpoint at the lowest point of the graph, with exceptions for some situations like a Mode Collapse.

MODE COLLAPSE

Usually, you want your GAN to produce a wide variety of outputs. You want, for example, a different face for every random input to your face generator.
However, if a generator produces an especially plausible output, the generator may learn to produce only that output. In fact, the generator is always trying to find the one output that seems most plausible to the discriminator.
If the generator starts producing the same output (or a small set of outputs) over and over again, the discriminator's best strategy is to learn to always reject that output. But if the next generation of discriminator gets stuck in a local minimum and doesn't find the best strategy, then it's too easy for the next generator iteration to find the most plausible output for the current discriminator.
Each iteration of the generator over-optimizes for a particular discriminator, and the discriminator never manages to learn its way out of the trap. As a result, the generators rotate through a small set of output types. This form of GAN failure is called mode collapse.
Basically: after a collapse, G goes back to being weak (hence the lower the graph, the better the model), yet D has been buffed, so it's now too strong for G. That makes the model a lost cause, so you will have to use a checkpoint from before the collapse, or train again with different hyperparameters (epochs, save frequency, batch size…).
[image: collapse in TensorBoard] [image: collapse drawn]
In this example we have a Mode Collapse: basically, the graph crashes down in a short amount of time. This is usually the worst thing that can happen to a GAN model; when it happens, you either have to make a new model or use a checkpoint from before the collapse.
Basically, it's where the graph goes abnormally down, by more than 80-90% for just one checkpoint (ckpt), and then gets back on track.
For a more extensive explanation of Mode Collapse and comparisons go to: https://pub.towardsai.net/gan-mode-collapse-explanation-fa5f9124ee73
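As a rough way to automate both rules (pick the lowest point, and flag drops of more than 80% in a single checkpoint), here is a sketch that reuses load_scalar() and smooth() from the TensorBoard snippet earlier. The 0.8 threshold is just the heuristic above, not a standard constant.

```python
# Sketch: lowest point + collapse-like drops (>80% in one checkpoint),
# reusing load_scalar() and smooth() from the TensorBoard snippet above.
def analyze(steps, values, drop_threshold=0.8):
    collapses = [
        step
        for prev, v, step in zip(values, values[1:], steps[1:])
        if prev > 0 and (prev - v) / prev > drop_threshold
    ]
    best = min(range(len(values)), key=lambda i: values[i])
    return steps[best], collapses

steps, values = load_scalar("logs/my_experiment", "loss/g/total")
best_step, collapses = analyze(steps, values)  # raw values; smooth() first if noisy
print(f"lowest point: step {best_step}; collapse-like drops at steps {collapses}")
```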

THINGS THAT CAN RUIN A MODEL

Training the model for too long, the wrong batch size, too little data.
You can address overfitting with a simpler model or a larger dataset; however, if the model is too simple, it won't be able to fit the underlying function. This is called underfitting/undertraining.

SAMPLE RATE

How to choose between 32k, 40k, and 48k: on average, 48k handles higher frequencies better but might introduce noise or ringing.
But anyway, the way you decide which pretrain to use:
download Spek, drop your dataset in, and see where the frequency content ends.
For example, for my dataset of a 44.1kHz FLAC:
[image: Spek spectrogram]
To confirm that it's truly 44.1kHz, we can check: in this screenshot the audio's frequency content ends at 22kHz; double that and you have your sample rate.
So this audio is 44.1kHz.
Or for an Audacity recording of YouTube audio:
[image: Spek spectrogram of the YouTube recording]
It ends at 20kHz, times 2 = 40kHz.
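If you want to confirm the sample rate programmatically instead of eyeballing Spek, here is a minimal sketch with the soundfile library (the path is a placeholder). Note the file header can lie if the audio was upsampled, which is exactly why the Spek check matters.

```python
# Read the sample rate from the file header, plus the Nyquist rule from
# the Spek check: real content only reaches half the sample rate.
import soundfile as sf

info = sf.info("dataset/sample.flac")   # placeholder path
print(info.samplerate)                  # e.g. 44100 for a 44.1kHz file

visible_cutoff_hz = 20_000              # where the spectrogram ends
print("implied sample rate:", visible_cutoff_hz * 2)   # 40000
```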

BATCH SIZE

Batch Size is the number of examples that a machine learning model processes at once. Think of it like this: imagine you're trying to learn how to bake a cake. You could make one cake at a time, see how it turns out, and then adjust your recipe before trying again; this would be like having a Batch Size of one. Alternatively, you could make several cakes at once, see how they all turn out, and then adjust your recipe before trying again; this would be like having a larger Batch Size. The same concept applies to machine learning models: they can process one example at a time, or several examples at once, before updating their internal parameters.
For example: dataset size = 100 samples, Batch Size = 4, 100 / 4 = 25. Batch Size is the number of samples processed at once during training: with a dataset of 100 samples and a batch size of 4, the dataset is divided into 25 batches of 4 samples each. The network's parameters are updated after processing each batch, so the network is updated 25 times during one epoch (one pass through the entire dataset).
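The same arithmetic as a quick sketch:

```python
# Updates per epoch = dataset size / batch size (rounded up).
import math

dataset_size = 100   # samples
batch_size = 4
steps_per_epoch = math.ceil(dataset_size / batch_size)
print(steps_per_epoch)  # 25 parameter updates per epoch
```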
Due to the nature of GANs, each model is unique, so some people make multiple models from the same dataset and pick the best one.
VRAM also limits the maximum Batch Size, but it's not a quality factor.
The smaller the batch size, the more randomness (aka noise) the network gets; in other words, more randomization and AI guessing.
Sometimes you go smaller when your dataset is limited (not enough characteristics, data, phonemes, etc.) and hope for the AI to figure it out. Sometimes you have to decrease it to improve the model in general: a model that sees too many examples at once learns too fast, and learning too fast = memorizing (overfitting). In simple words, it then has an easier time recreating the samples 1:1 than manipulating them correctly for new scenarios (for example during inference), so smaller batches can be more fit for inference.
NOTE: there is no such thing as a perfect batch size for every case. Each voice is unique, each dataset is unique; one might work well with 4, another might like 19.
In general, you'd want to use powers of 2, so: 4, 8, 16, 32, 64, etc.,
but it is not a must; one might try 5, 9, 11, etc.
Generally, you'd want to start from 4 or 8, expand the range both ways, and see which one does best on the graphs.

UVR AND RECOMMENDED SETTINGS

Models to use:

Voc FT – MDX-Net
DeNoise – VR Architecture
Karaoke 2 – MDX-Net (if needed to separate Main and Background vocals)
DeReverb – MDX-Net/VR Architecture
INST_HQ-3 – MDX-Net (not needed)
Voc FT is for separating Instruments and Vocals (as of now, Voc FT is both the most popular and the best in terms of quality from my testing).
All of these models leave a roughly -80dB (A-weighted) noise floor, mostly at 17kHz-18kHz, but in general everything past 15kHz is noise.
Also, from my testing, the instrument stems tend to be filled with this noise at higher levels: the same frequencies as before, just louder.
To remove it, use UVR DeNoise, or iZotope RX (which I don't have much knowledge of using for denoising).
[image: noise] [image: noise silenced]
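If you want to measure that hiss rather than just look at it in a spectrogram, here is a rough sketch with numpy and soundfile that reports what fraction of a stem's spectral energy sits above 15kHz (the path is a placeholder):

```python
# Rough check: fraction of spectral energy above 15kHz in a stem.
import numpy as np
import soundfile as sf

audio, sr = sf.read("stems/vocals.flac")     # placeholder path
if audio.ndim > 1:
    audio = audio.mean(axis=1)               # fold to mono

spectrum = np.abs(np.fft.rfft(audio)) ** 2   # power spectrum
freqs = np.fft.rfftfreq(len(audio), d=1 / sr)
ratio = spectrum[freqs > 15_000].sum() / spectrum.sum()
print(f"{ratio:.2%} of the energy is above 15kHz")
```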

The Pipeline of UVR:

Voc FT: Instruments and Vocals
DeNoise both
Some recommend using INST HQ 3 (I don't; I'll explain why further on)
DeReverb the DeNoised Vocals
DeNoise the DeReverbed Vocals
Karaoke 2 if needed to separate Main and Background Vocals
DeNoise if Karaoke 2 got used
But if you don't feel like DeNoising every stem, just do it for the last stem (I don't recommend this, but it's not much of a loss).
Some recommend Kim Vocal 2, but from my testing there's not much difference and it just adds noise.
For more control over the stems, I recommend v4 HTDemucs-FT (FT meaning Fine-Tuned), which can separate Drums, Bass, and Other; v4 HTDemucs_6s can separate Vocals, Other, Bass, Drums, Guitar, and Piano.
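Demucs can also be run outside UVR. Here is a minimal sketch using the demucs package's documented Python entry point (the file name is a placeholder; stems land in ./separated by default):

```python
# Run Demucs from Python instead of a GUI.
# htdemucs_ft = fine-tuned 4-stem model; htdemucs_6s = 6-stem model.
import demucs.separate

demucs.separate.main(["-n", "htdemucs_ft", "song.flac"])  # drums/bass/other/vocals
demucs.separate.main(["-n", "htdemucs_6s", "song.flac"])  # + guitar/piano
```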

If you can't use UVR locally on your own PC, either use a Google Colab UVR or MVSEP (but UVR is better).

From my testing of INST HQ 3, it just adds noise past the 15kHz range and otherwise sounds the same.
[image: INST HQ 3 noise]
As for the file format in UVR, I recommend FLAC or WAV. FLAC in the end is basically just a lower file size compared to the humongous space a WAV file takes up.

AUTO-SYNC (Accurate Graphs)

The newer versions of RVC have Auto-Sync, which is inaccurate and will eventually add up and "break" the graph. Currently there is no way to sync manually unless you switch to a much older version (which isn't recommended), so just live with the fact that, for now, your graphs are inaccurate.

OTHER THINGS I RECOMMEND FOR YOU TO READ

https://docs.google.com/document/d/1HmkG9cmL8SLX7-vJcPT1-1KgUQtCrwXB8CicYmG4LW8/edit [Audio Isolation on Low-End Rigs]
https://docs.google.com/document/d/1wTJ_wutDqEtsA99BJOXDDGax25pPIDE84O5E2Rio5Qk/edit [General Guide On Gathering Audio] (I highly recommend reading this, since it's full of information)

These guides have been made by Litsa The Dancer, Faze Masta and SCRFilms

I have my own things to add on top of this. First of all: what does Truncate Silence do, and why is it a must-use?

RVC trains via pitch extraction, so it basically works on the pitch of your dataset. When you have silence in your dataset, the GAN thinks it's getting better at learning, but it's wrong.
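The usual tool for this is Audacity's Truncate Silence (settings below), but as a programmatic alternative you can drop the silent spans with librosa; the top_db threshold here is a guess you'd tune per dataset:

```python
# Programmatic alternative to Audacity's Truncate Silence:
# keep only non-silent regions, dropping silence the GAN would mislearn from.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("dataset/clip.wav", sr=None)    # keep the original rate
intervals = librosa.effects.split(y, top_db=40)      # non-silent spans
trimmed = np.concatenate([y[start:end] for start, end in intervals])
sf.write("dataset/clip_trimmed.wav", trimmed, sr)
```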

TRUNCATE SETTINGS

Provided by Codename;0

[image: Truncate Silence settings]

Phase Cancellation

This video explains how to do Noise Reduction/Phase Cancellation.
Adobe Audition is recommended for this task
https://www.youtube.com/watch?v=wfNAN7ZZJnE
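The video covers doing this in Audition, but the core idea is simple enough to sketch in numpy: invert the instrumental and add it to the full mix, and everything identical in both cancels, leaving the vocals. This only works if both files are sample-aligned renders of the same master; the paths are placeholders.

```python
# Phase cancellation sketch: mix + inverted instrumental ~= vocals,
# assuming both files are sample-aligned renders of the same master.
import numpy as np
import soundfile as sf

mix, sr = sf.read("song_full.flac")            # placeholder paths
inst, sr2 = sf.read("song_instrumental.flac")
assert sr == sr2, "sample rates must match"

n = min(len(mix), len(inst))
vocals = mix[:n] - inst[:n]                    # subtracting = adding inverted phase
sf.write("vocals_cancelled.flac", vocals, sr)
```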

Another thing to do is Dialogue Contour. You need iZotope RX for this:

Using Dialogue Contour, you can reshape the intonation of dialogue to rescue or improve a performance in post-production. Dialogue Contour features Pitch Correction processing that is tailored to speech and designed to adjust the inflection of words within a phrase of dialogue that may not match or flow correctly with the rest of the dialogue in the clip.
[image: Dialogue Contour]

Converting to Mono and Normalizing

Since RVC already works on mono channels, convert your files to mono, which also saves space.
Then normalize the file to -4dB.
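A minimal sketch of both steps with soundfile and numpy. I'm assuming peak normalization to -4 dBFS (Audacity's Normalize default); whether the guide means peak or loudness normalization is an assumption on my part.

```python
# Fold to mono and peak-normalize to -4 dBFS (assumed to be peak normalization).
import numpy as np
import soundfile as sf

y, sr = sf.read("dataset/clip.wav")    # placeholder path
if y.ndim > 1:
    y = y.mean(axis=1)                 # RVC trains on mono anyway

target = 10 ** (-4 / 20)               # -4 dBFS as linear amplitude (~0.631)
y = y * (target / np.max(np.abs(y)))   # scale the peak to the target
sf.write("dataset/clip_mono.wav", y, sr)
```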

WHAT PITCH EXTRACTION METHOD TO USE

According to the community and research papers on RMVPE, it's the best Extraction Method. https://arxiv.org/abs/2306.15412
RMVPE is the new standard since it is practically Universal™
This article on arXiv explains why RMVPE is great, including graphs.
[image: RMVPE graphs]
Other than this, it's all math and computer science stuff. Might as well be a sheeple, follow what other people do, and use RMVPE.

TL;DR

if G and D are going down, that's good
if they are mirroring each other, that's good
if they stop going down, check Mel and KL; if those are still going down, good
basically, if any of the Mel, KL, G, D graphs are going down, that's good
use RMVPE (by my recommendation and the research)
Truncate silence
Dialogue Contour(manual but good if the dataset has repeated voice lines)
Phase Cancellation
The Pipeline of UVR:

Voc FT: Instruments and Vocals
DeNoise both
De Reverb the DeNoised Vocals
DeNoise the DeReverbed Vocals
Karaoke 2 if needed to separate Main and Background Vocals
DeNoise if Karaoke 2 got used
Batch Size as a power of 2, so: 4, 8, 16, 32, 64, etc.

Special Thanks to @ame1997, @nofelt and @.codename0. for providing help with this guide

I was learning as I wrote this guide; thanks to Codename;0 for most of the help and for providing most of the topics discussed here.

Time of writing: 9/8/2023 3:22 PM
