Why do they compare performance over validation and test sets in the TensorFlow time series tutorial?

Everything is in the title. I was reproducing the TensorFlow tutorial about time series here: https://www.tensorflow.org/tutorials/structured_data/time_series?hl=en#performance_3
I reproduced the same graph and obtained the following:
[plot of validation and test performance]
Why do they plot the validation and test error instead of the train and test error?
Also, we can observe a significant difference between the validation and test errors. How should this be interpreted? Is it overfitting?
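For reference, the tutorial records both metrics explicitly, roughly like this (a sketch based on the linked tutorial; model and the window object come from the tutorial's own code):

val_performance = {}
performance = {}
# evaluate the trained model on the validation and test windows
val_performance['Dense'] = model.evaluate(window.val)
performance['Dense'] = model.evaluate(window.test, verbose=0)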
Thank you in advance,
Best regards

Related

spaCy train, dev, and test data

Does spaCy use the dev data to tune hyperparameters? Or is the dev data entirely outside the training process, and therefore equivalent to test data?
Following the standard explained here, validation data and test data are different. Could someone clarify which is the case for spaCy under that standard? Thanks a lot.
The spaCy core library does not do any hyperparameter tuning. For spacy train, the dev data is used for the evaluation displayed during training, to select the best model, and for early stopping (the early-stopping setting is called patience).
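For reference, those settings live in the [training] block of the config; a minimal sketch (values are illustrative; field names follow spaCy v3):

[training]
# evaluate on the dev data every eval_frequency steps
eval_frequency = 200
# stop early if the dev score has not improved for this many steps
patience = 1600
max_steps = 20000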

YOLOv5 object detection training

Please, I need your help concerning my YOLOv5 training process for object detection!
I am trying to train my YOLOv5 object detection model to detect a small object (a scratch). For labelling my images I used Roboflow, where I applied some of the data augmentation and pre-processing that Roboflow offers as a service. When I finished the pre-processing and data augmentation steps, Roboflow gave me a choice of output formats; in my case it is YOLOv5 PyTorch, and Roboflow does everything for me, splitting the data into training, validation, and test sets. Hence, everything was set up as it should be for my data preparation, and at the end I got the folder with data.yaml and the images with their labels. In data.yaml I put the paths of my training and validation sets, as I saw in the GitHub tutorial for YOLOv5. I followed the steps very carefully, though.
The problem is that when the training starts I get NaN in the obj and box columns, as you can see in the picture below, and I don't know why. Can someone relate to that, or give me any clue to find the solution, please? It's my first project in computer vision.
This is what I get when the training process starts.
This is the last error message when the training finishes.
I think the problem may come from here, but I don't know how to fix it. I used the YOLOv5 team's code as it is in the tutorial.
The training continues without any error, but the mAP and precision remain 0 for the whole process!
PS: here is the link to the tutorial I followed: https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
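For reference, a Roboflow-generated data.yaml for a single class looks roughly like this (the paths and class name are illustrative):

train: ../train/images
val: ../valid/images
nc: 1
names: ['scratch']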
This is what I would do to troubleshoot it:
- Run your code on Colab, because that environment is proven to work well.
- Confirm that your labels look good and are set up correctly. Can you check that the classes look right? In one of the screenshots it looks like you have no labels.
Running my code in Colab worked successfully and the results were good. I think the problem was in my personal laptop environment, maybe the version of PyTorch I was using ('1.10.0+cu113'), or something else! If you have any advice on setting up my environment properly for YOLOv5, I would be happy to hear it. Many thanks again to @alexheat.
I'm using YOLOv5 for my custom dataset too. This problem might be due to directory misplacement.
Using a different version of PyTorch should not be a problem, but you can try the versions mentioned in 'requirements.txt'.
It's better if you run
cd yolov5
pip3 install -r requirements.txt
Let me know if this helps.
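To double-check which build you are actually running, a quick sketch:

import torch
# print the installed version and whether a CUDA device is actually usable;
# a CUDA wheel without a working GPU setup is worth ruling out
print(torch.__version__)
print(torch.cuda.is_available())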

How to drop elements in dataset that can cause an error while training a TensorFlow Lite model

I am trying to train a simple image classification model using TensorFlow Lite. I am following this documentation to write my code. As specified in the documentation, in order to train my model I have written model = image_classifier.create(train_data, model_spec=model_spec.get('mobilenet_v2'), validation_data=validation_data). After training for a few seconds, however, I get an InvalidArgumentError.
I believe the error is due to something in my dataset, but it is too difficult to eliminate all the sources of the error manually because the dataset consists of thousands of images. After some research, I found a potential solution: I could use tf.data.experimental.ignore_errors, which would "produce a dataset that contains the same elements as the input, but silently drop any elements that caused an error." From the documentation (here), however, I couldn't figure out how to integrate this transformation with my code.
If I place the line dataset = dataset.apply(tf.data.experimental.ignore_errors()) before training the model, the system doesn't know which elements to drop. If I place it after, the system never reaches the line because an error arises during training. Moreover, the system gives the error message AttributeError: 'ImageClassifierDataLoader' object has no attribute 'apply'. I would appreciate it if someone could tell me how to integrate tf.data.experimental.ignore_errors() with my model, or possible alternatives to the issue I am facing.
Hi, if you are following the documentation exactly, then tf.data.experimental.ignore_errors won't work for you, because you are not loading your data using tf.data. You are most probably using from tflite_model_maker.image_classifier import DataLoader.
Note: please share the complete code snippet so we can help you solve the issue.
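For comparison, ignore_errors applies when you build the input pipeline with tf.data yourself, roughly like this (a sketch; the file pattern and image size are illustrative):

import tensorflow as tf

files = tf.data.Dataset.list_files('images/*/*.jpg')

def load(path):
    # decoding raises InvalidArgumentError on corrupt files
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, [224, 224])

ds = files.map(load)
# silently drop any element whose decoding raised an error
ds = ds.apply(tf.data.experimental.ignore_errors())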

TensorFlow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and calling sess.run(). However, on profiling these runs, I notice that the first run always takes much longer than the subsequent runs.
I found an answer here as to why that happens, but there was no solution on how to fix it.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run() for the 1st, 2nd, 3rd, and 4th images looks something like this (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
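For reference, the timing loop looks roughly like this (a sketch; detection_graph and test_images are assumed to be loaded already, and the tensor names follow the object detection API's frozen graphs):

import time
import tensorflow as tf

with tf.Session(graph=detection_graph) as sess:
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    for i, image in enumerate(test_images):
        start = time.time()
        sess.run(boxes, feed_dict={image_tensor: image[None, ...]})
        # the first iteration includes one-time graph optimization and memory allocation
        print(i + 1, time.time() - start)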
Some background on what I have tried:
I tried the TFRecords approach that TensorFlow provides as a sample here. I hoped it would work better because it doesn't use a feed_dict, but since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to disk, but always got some error regarding the encoding of the image.
I tried using TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since during inference I need to provide input for the "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TL;DR: I'm looking to reduce the inference run time for the first image, for deployment purposes.
Even though I have seen that the first inference takes longer, the difference shown there (84 s vs. 1.5 s) seems a bit unbelievable. Are you counting the time to load the model inside this metric? Could that account for the large time difference? Is the topology complex enough that this difference could be justified?
My suggestions:
Try OpenVINO: see if the topology you are working on is supported in OpenVINO. OpenVINO is known to speed up inference workloads by a substantial amount due to its ability to optimize network operations. Also, the time taken to load an OpenVINO model is comparatively lower in most cases.
Regarding the TFRecords approach, could you please share the exact error and the stage at which you got it?
Regarding TensorFlow datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 and https://www.tensorflow.org/guide/datasets. Regarding the "image tensor" key in the graph, your original inference pipeline should give you some clues.

TensorFlow - Any input gives me the same output

I am facing a very strange problem where I am building an RNN model using TensorFlow and then storing all the model variables using tf.train.Saver after I finish training.
During testing, I just build the inference part again and restore the variables to the graph. The restoration part does not give any error.
But when I start testing on the evaluation set, I always get the same output from the inference call, i.e. for all test inputs I get the same output.
I printed the output during training, and I do see that the output is different for different training samples and that the cost is also decreasing.
But when I test, it always gives me the same output no matter what the input is.
Can someone help me understand why this could be happening? I want to post a minimal example, but as I am not getting any error, I am not sure what I should post. I will be happy to share more information if it helps.
One difference between the inference graph during training and testing is the number of time steps in the RNN. During training I use n steps (n = 20 or more) per batch before updating gradients, while for testing I use just one step, as I only want to predict for that input.
Thanks
I have been able to resolve this issue. It seemed to be happening because one of my input features was very dominant in its original scale, due to which, after some operations, all values were converging to a single number.
Scaling that feature resolved the issue.
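For anyone hitting the same issue, a minimal sketch of the fix (using scikit-learn; X_train and X_test are assumed, and the scaler is fit on the training data only):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics at test time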
Thanks
Can you create a small reproducible case and post this as a bug to https://github.com/tensorflow/tensorflow/issues? That will help this question get attention from the right people.