TensorFlow image classification colab sheet from training material: newbie questions - google-colaboratory

Apologies if my questions are relatively simple, but I have been approaching the TensorFlow bit recently with the aim to learn new skills.
In the example, but there are several things I can't get:
in the explore data section, the size of the datasets return as 60/10k respectively for train and test.
where the size of the train/test size declared?
packages like SkLearn allows this to be specified in percentage when invoking the split methods.
in the training model part, when the 5 epochs are trained, the 1875 number appear below.
- what is that?
- I was expecting the training to run over the 60k items, but even by multiplying 1875 by 5 the number doesn't reach the 10k.

Dataset is loaded using tensorflow datasets API
The source itself has the split of 60K (Train) and 10K (Test)
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An Epoch is a complete run with all the training samples. The training is done in batches. In the example you refer to, a batch size of 32 is used. So to complete one epoch, 1875 batches (60000 / 32) are run.
Hope this helps.

Related

YOLOv4 loss too high

I am using YOLOv4-tiny for a custom dataset of 26 classes that I collected from Open Images Dataset. The dataset is almost balanced(850 images per class but different number of bounding boxes). When I used YOLOv4-tiny to train on just 3 classes the loss was near 0.5, it was fairly accurate. But for 26 classes as soon as the loss goes below 2 the model starts to overfit. The prediction are also very inaccurate.
I have tried to change the parameters like the learning rate, the momentum and the size but whatever I do the models becomes worse then before. Using regular YOLOv4 model rather then YOLO-tiny does not help either. How can I bring the loss further down?
Have you tried training with mAP? You can take a subset of your training set and make it the validation set. This can be done in the same way you made your training and test set. Then, you can run darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map. This will keep track of the loss in your validation set. When the error in the validation say goes up, this is the time to stop training and prevent overfitting (this is called: early stopping).
You need to run the training for (classes*2000)iterations. However, for the best scores, you need to train your model for at least 6000 iterations (also known as max_batches). Also please remember if you are using a b&w image, change the channels=3 to channels=1. You can stop your training once the avg loss becomes something like this: 0.XXXX.
Here's my mAP graph for 6000 iterations that ran for 6.2 hours:
avg loss with 6000 max_batches.
Moreover, you can follow this FAQ documentation here by Stéphane Charette.

Using ssd_inception_v2 to train on different resolution

The dataset contains images of different sizes.
The pretrained weights are trained on 300x300 resolution.
I am training on widerface dataset where objects are as small as 15x15.
Q1. I want to train with 800x800 resolution do i need to resize all the images manually or this will be done by Tensorflow automatically ?
I am using the following command to train:
python3 /opt/github/models/research/object_detection/legacy/train.py --logtostderr --train_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q2. I also tried training it using the model_main.py but after 1000 iterations it is evaluating the dataset with each iteration.
I am using the following command to train:
python3 /opt/github/models/research/object_detection/model_main.py --num_train_steps=200000 --logtostderr --model_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q3. Also if you can suggest any model i should use for real time face detection apart from mobilenet and inception, please suggest.
Thanks.
Q1. No you do not need to resize manually. See this detailed answer.
Q2. By 1000 iterations you meant steps right? (An iteration counts as a complete cycle of the dataset.) Usually the model performed evaluation after a certain amount of time, e.g. 10 minutes. So in every 10 minutes, the checkpoints are saved and an evaluation of the model on evaluation set is performed.
Q3. SSD models with mobilenet is one of the fast detectors, apart from that you can try YOLO models for real time detection

Time taken to train Resnet on CIFAR-10

I was writing a neural net to train Resnet on CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering - what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50000 image training set?
Also how long is this expected to take to train(assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.

same train and eval dataset but get a different result

Versions:
TensorFlow: 1.8.0
TensorBoard: 1.8.0
What i did:
I'm training a model with imbalanced dataset with tf.estimator.DNNClassifier. When i did two times of training process both start from a totally new beginning(AKA, no checkpoint for each training) with the same data. I got two results which are very different from each other as shown in the following pictures.
1st-train
2nd-train
A few points to comment:
There is not difference between the two training process (no code or data changes), they both start from a new beginning.
The training dataset size is about 100M.
Both training results are from 6 epochs. (And each result cost $25 on google ml-engine.)
From the two pictures we can tell:
The 1st training learns nothing for 6 epochs.
The 2nd training learns (it got a AUC over 0.6).
Although the difference of AUC values between two trainings is only 0.1 (0.6 - 0.5), but it has big different in the meaning (a-random-guess versus a-non-random-guess).
Problems:
Why is this happen: same training data but get a totally different result?

Keras training/testing results vary greatly after multiple runs

I am using Keras with TensorFlow backend. The dataset I am working with is sequence data with a Y value that is continuous between 0 and 1. The dataset is split into training with size 1900 and a testing with size 400. I am using the VGG19 architecture that I created from scratch in Keras. I am using an epoch of 30.
My question is, if I run this architecture multiple times, I get very different results. My results can be between 0.15 and 0.5 RMSE. Is this normal for this type of data? Is it because I am not running enough epochs? The loss from the network seems to stabilize around 0.024 at the end of the run. Any ideas?