Why does my spaCy training produce 0.00 scores when I change the learn_rate schedule on the transformer optimizer - spacy

Hello, I am using the generic GPU transformer config provided by spaCy, which uses the warmup_linear.v1 schedule and the roberta-base model. With this config I can train models without problems. However, when I change the learning rate schedule, for example to a linear schedule, the score stays at 0.00 for the entire training run. Why does training seemingly not work with other schedules?
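For reference, the learning-rate block in the generated GPU transformer config looks roughly like this (these are the quickstart defaults as far as I know; the values in my file may differ slightly), and the only thing I change is the @schedules entry and its parameters:

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005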
Thanks!

Related

tf.distribute.MirroredStrategy - suggestion for improving test mean_iou for segmentation network using distributed training

I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network: DeepLab_v3_plus with a ResNet101 backbone, the Adam optimizer, and categorical cross-entropy loss. I first built the code for a single GPU and achieved a test accuracy (mean_iou) of 54% after training for 96 epochs. I then added tf.distribute.MirroredStrategy (one machine) to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training the test mean_iou is just 27%, and with 4 GPUs and 24 epochs it is around 12% on the same dataset.
This is how I modified the code to go from single-GPU to multi-GPU training:
Following the TensorFlow guide on distributed training, I created a MirroredStrategy and did the model creation, model compilation, and dataset_generator creation inside the strategy scope. As I understand it, model.fit() then takes care of synchronizing gradients and distributing the data across the GPUs. The code runs without any errors, and training time is reduced compared to a single GPU for the same number of images, but the test mean_iou keeps getting worse as I add more GPUs.
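In outline, the setup looks like this (build_deeplab_v3_plus() and make_dataset() are just placeholders for my actual model and input-pipeline code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

PER_REPLICA_BATCH = 8  # example value
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    model = build_deeplab_v3_plus()  # placeholder for the real model-building code
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.CategoricalCrossentropy())

# The dataset is batched with the global batch size;
# model.fit() then splits each batch across the replicas.
train_ds = make_dataset(batch_size=GLOBAL_BATCH_SIZE)  # placeholder
model.fit(train_ds, epochs=96)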
Replaced BatchNormalization with SyncBatchNormalization, but no improvement.
Used learning-rate warmup with linear scaling of the learning rate by the number of GPUs, but no improvement.
For the cross-entropy loss, tried both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE.
loss = ce(y_true, y_pred)
# reshape the per-pixel loss for each sample (BxHxWxC -> BxN)
# normalize by the number of non-zero elements, sum per sample, then take the mean across samples
With the .AUTO/.NONE options I am not scaling the loss by global_batch_size, on the understanding that TF takes care of it and that I am already normalizing per GPU, but neither option helped (see the sketch after this list).
Changed the data_generator to a tf.data.Dataset object. This helped with training time, but the test mean_iou became even worse.
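For reference, the pattern the TF distributed-training guide recommends when you compute the loss yourself (e.g. in a custom training step) is Reduction.NONE plus an explicit division by the global batch size; this is only a sketch of that idea, not my exact code:

import tensorflow as tf

GLOBAL_BATCH_SIZE = 32  # example value; should match the batch size fed to the strategy

# Per-example loss, no implicit reduction.
ce = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)

def replica_loss(y_true, y_pred):
    # the per-pixel loss has shape (batch, H, W); average over the spatial dims first
    per_example_loss = tf.reduce_mean(ce(y_true, y_pred), axis=[1, 2])
    # divide by the *global* batch size so that summing across replicas
    # reproduces the single-GPU loss scale
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)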
I would appreciate any lead or suggestion for improving test mean_iou in distributed training.
Let me know if you need any additional details.
Thank you

How can I train with my own dataset with darkflow?

I'm a beginner with some programming experience. I'm trying to train darkflow with my own dataset. I'm following these instructions.
https://github.com/thtrieu/darkflow
So far I have done the following steps.
installed darkflow and the relevant modules
created test images and made annotations (Pascal VOC).
https://ibb.co/y4HmtGz
https://ibb.co/GkxLshK
If I have understood correctly, darkflow training requires Pascal VOC annotations?
My problem is that I don't know how to start the training. How can I start the training process, and how can I test whether the neural net is working? Am I supposed to get weights as a result of training?
You can choose to use pre-trained weights from here. Download cfg and weights.
Assuming you have darkflow installed, you can train your network like this:
flow --model cfg/<your-config-filename>.cfg --load bin/<filename>.weights --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
If you want to train your network from scratch without using any pre-trained weights, you can do this:
flow --model cfg/<your-config-filename>.cfg --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
After training starts, model checkpoints are saved inside the ckpt directory. You can load the latest checkpoint and test it on sample images.
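For example (assuming the default ckpt location; --load -1 loads the most recent checkpoint and --imgdir points at a folder of test images, here a placeholder name):
flow --model cfg/<your-config-filename>.cfg --load -1 --imgdir sample_img/ --gpu 1.0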

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Part 1 to 6 of his series.
Now in his video, he mentions stopping the training when the loss value reaches ~1 or below on average, and says that it should take about 10000-ish steps.
In my case, I am at 7500 steps right now and the loss keeps fluctuating between 0.6 and 1.3.
A lot of people complained in the comment section about false positives with this series, but I think this happened because of unnecessarily prolonged training (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the most optimal weights, just fairly good ones, while avoiding false detections and overfitting. I am also watching the 'Total Loss' section of TensorBoard; it fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know in general: which factors does 'when to stop training' depend on? Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using transfer learning, I chose the ssd_mobilenet_v1 model.
TensorFlow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, you compute the loss on both the training and validation sets.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set. For example, the training set is 70% of the data and the validation and test sets are 15% each.
Also, please note that 300 images does not seem like enough data; you should increase it.
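If you were training an ordinary Keras model, this "stop when the validation loss starts rising" rule is exactly what the EarlyStopping callback implements; the Object Detection API does not use fit(), so take this only as a generic sketch of the idea:

import tensorflow as tf

# Stop once the validation loss has not improved for 3 consecutive epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

# model.fit(x_train, y_train, epochs=100,
#           validation_data=(x_val, y_val),
#           callbacks=[early_stopping])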
For your other question:
The loss is the sum of your errors and thus depends on the problem and your data. A loss of 1 does not mean much on its own; never rely on it to decide when to stop training.

when to stop training object detection tensorflow

I am training a Faster R-CNN model on a fruit dataset using a pretrained model provided in the Google API (faster_rcnn_inception_resnet_v2_atrous_coco).
I made a few changes to the default configuration (number of classes: 12, fine_tune_checkpoint: the path to the pretrained checkpoint, and from_detection_checkpoint: true). The total number of annotated images I have is around 12000.
After training for 9000 steps, the accuracy I got is below 1 percent, though I was expecting at least 50% (during evaluation nothing gets detected, as accuracy is almost 0). The loss fluctuates between 0 and 4.
How many steps should I train it for? I read an article that says to run around 800k steps, but is that the number of steps when training from scratch?
The FC layers of the model are changed because of the different number of classes, but shouldn't the classes that are already present in the pre-trained model, like 'apple', be unaffected?
Any help would be much appreciated!
You shouldn't look at your training loss to determine when to stop. Instead, you should run your model through the evaluator periodically, and stop training when the evaluation mAP stops improving.
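With the TF1 Object Detection API this usually means running the evaluation script against the training directory and watching the mAP curves in TensorBoard (paths below are placeholders; in newer releases of the repo the script lives under legacy/):
python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=path/to/your_pipeline.config \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir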

Keras training/testing results vary greatly after multiple runs

I am using Keras with the TensorFlow backend. The dataset I am working with is sequence data with a continuous Y value between 0 and 1. The dataset is split into a training set of size 1900 and a test set of size 400. I am using a VGG19 architecture that I implemented from scratch in Keras, and I train for 30 epochs.
My question is: if I run this architecture multiple times, I get very different results; the RMSE can be anywhere between 0.15 and 0.5. Is this normal for this type of data? Is it because I am not running enough epochs? The loss seems to stabilize around 0.024 at the end of each run. Any ideas?