Export Generator part of VITS model from Coquia-AI - text-to-speech

I have trained my own text to speech model using Coqui-AI TTS framework . Size of VITS model that I trained is bigger than official released model (900 MB vs 340 MB). I read in this discussion that official released VITS model only contains generator part. My question is how I can save generator part of my trained model to have a small model?
Also RAM usage of my model is too high (from 4 GB for small sentence and more than 20 GB for longer sentences), while official models use less memory.

Related

Small batch size using Object Detection API

I've developed a custom model using tf object detection api for human keypoint estimation.
Architecture is MobilenetV3 + FPN + Centernet. In the model zoo I saw there is an example using MobilenetV2 as feature extractor instead, and the pipeline.config there seems to be using batch size 512. I'm training on an Nvidia A100 80GB GPU, and it can only fit a batch size of 32. I've tried with only powers of 2 batch sizes because it makes adapting the training steps number easy.
This would suggest that I might need 16 such GPUs to train the model with the suggested 512 batch size. Are needed resources for training such a model expected to be this high?

When should I stop the object detection model training while mAP are not stable?

I am re-training the SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and eval towards 100 images from that dataset.
The problem is that after about 24 hours of training, the totalloss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have actually tried to train for about 48 hours, and the TotoalLoss just cannot go below 2.0, something ranging from 2.5~3.0. And during that time, mAP is even lower..
So here is my question, given my situation (I really don't need any "high-precision" model, as you can see, I pick 900 images for training and would like to simply do a PoC model training/predication and that's it), when should I stop the training and obtain a reasonably performed model?
indeed for detection you need to finetune the network, since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one specifically for an SSD Model, uses mxnet but you can use the same with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general your error can be attributed to many factors, the learning rate you are using, the characteristics of the images themselves (are they normalized?) If the ssd network you are using was trained with normalized data and you don't normalize to retrain then you'll get stuck while learning. Also what learning rate are they using?
From the model zoo I can see that for SSD there are models trained on COCO
And models trained on Open Images:
If for example you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration, also make sure the input sizes are the same that the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the rest of the things I mentioned, more details on your code would help to see where the problem lies.

Large input image limitations for VGG19 transfer learning

I'm using the Tensorflow (using the Keras API) in Python 3.0. I'm using the VGG19 pre-trained network to perform style transfer on an Nvidia RTX 2070.
The largest input image that I have is 4500x4500 pixels (I have removed the fully-connected layers in the VGG19 to allow for a fully-convolutional network that handles arbitrary image sizes.) If it helps, my batch size is just 1 image at a time currently.
1.) Is there an option for parallelizing the evaluation of the model on the image input given that I am not training the model, but just passing data through the pre-trained model?
2.) Is there any increase in capacity for handling larger images in going from 1 GPU to 2 GPUs? Is there a way for the memory to be shared across the GPUs?
I'm unsure if larger images make my GPU compute-bound or memory-bound. I'm speculating that it's a compute issue, which is what started my search for parallel CNN evaluation discussions. I've seen some papers on tiling methods that seem to allow for larger images

Tensorflow model file size varies significantly

I am using a tensorflow framework and I have noticed that there are major variances in the size of the tensorflow model files.
For example the framework provides 2 models:
one of pretrained model to be used with fine tuning for example
and one which contains an untrained version.
They both have a size of 172.539 kb
When I apply fine tuning in my model with some minor changes in the graph (there is a module in framework for that) and save my model the size remains essentially the same: 178.525 kb.
First, I am bit surprised that my fine-tuned model is somewhat bigger since I change just the last layer from 21 to 14 classes so I would expect a somewhat smaller model file size but since the difference is so little I didn't pay attention.
But when I trained the same model using the same model file (the pretrained one I mean) and saved the model in disk the file size is quite different: 340.097 kb. By the term train I mean I allow the network to modify all parameter not just the parameters of the last layer.
The model that is being implemented is a variation of resnet for semantic image segmentation (if can someone deduct the expected model file size from the model itself).
So, my questions are why I have such a variance in the model file sizes and how come my saved fine-tuned model is larger than the original model? Is there a way to include/exclude parameters in the model to be saved?
P.S.1 Some information that might be handy:
I am using tensorflow v2 model saving while I think the framework files use v1. I am not sure how to identify this besides the fact that the former produces 3 files.
The framework is called tensorflow-deeplab-resnet and can be found here and the models are here.
P.S.2
I am not sure stack overflow it 's the right place for this question either.
That is because, when training models and saving them, Tensorflow will also save the gradients of your ops.
So allowing training on the last layer will increase the size of your saved model a little. And allowing training on the whole model will essentially double the size of the save file because each op will have its gradients saved.

SSD mobilenet trained model with custom data only recognize images in short distances

I've trained a model with a custom dataset (Garfield images) with Tensorflow Object Detection API (ssd_mobilenet_v1 model) and referring it in the android sample application available on Tensorflow repository. The application can only detected the images in distances less or equal 20cm approximately.
Do you have any clue about I can improve the model to perform recognitions in longer distances (about 30cm or more) ?
I don't know with this limitation is related with input size I'm using (tested with images with 300x300 and 68x68) or any custom data augmentation is needed to improve that.
SSD models are known to have worse performance on small objects. Have you tried using one of our FasterRCNN models to see if the result is acceptable?