Tensorflow: Softmax cross entropy with logits becomes inf - tensorflow

I am working on the Tensorflow for poets tutorial. Most of the time, training fails with an error Nan in summary histogram.
I run the following command on the original data to retrain:
python -m scripts.retrain
--bottleneck_dir=tf_files/bottlenecks
--model_dir=tf_files/models/
--summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}"
--output_graph=tf_files/retrained_graph.pb
--output_labels=tf_files/retrained_labels.txt
--image_dir=/ml/data/images
This error occurred in other mentions as well. I followed the instructions there using tfdg which gave me a bit more insight (see below). However, I am still stuck because I do not know why this happens and what I can do to fix it without much experience in TF and neural networks. This is especially confusing because it happens with 100% tutorial code & data.
Here is the output from tfdg. The first time the error appears:
And the node in detail:
To look at the retrain script you can find Google's original code here. It was not modified in my case. Sorry for not including it (too many characters).
Hyper parameters & result
For additional information: trainings works with ridiculously small values for learning rate (e.g. using 0,000001). However this does not lead to good results. No matter how many epochs I train, performance stays on a low level (probably being stuck in local minima during optimisation).

I had searched about compatibility as I was doing in 2.7 also, but it said 3.5 is the best version now with all latest tensorflow support. So I created virtual environment with python 3.5. I think that's why the stability issue.

Are you sure tf_files folder is being created?
I faced some issue on command line python. I switched to spyder and changed the variable data of input as required in retrain.py and it runs smoothly. I know, it's not a solution but a turnaround.

Related

Freeze Saved_Model.pb created from converted Keras H5 model

I am currently trying to train a custom model for use in Unity (Barracuda) for object detection and I am struggling near what I believe to be the last part of the pipeline. Following various tutorials and git-repos I have done the following...
Using Darknet, I have trained a custom-model using the Tiny-Yolov2 model. (model tested successfully on a webcam python script)
I have taken the final weights from that training and converted them
to a Keras (h5) file. (model tested successfully on a webcam python
script)
From Keras, I then use tf.save_model to turn it into a
save_model.pd.
From save_model.pd I then convert it using tf2onnx.convert to change
it to an onnx file.
Supposedly from there it can then work in one of a few Unity sample
projects...
...however, this project fails to read in the Unity Sample projects I've tried to use. From various posts it seems that I may need to use a 'frozen' save_model.pd before converting it to ONNX. However all the guides and python functions that seem to be used for freezing save_models require a lot more arguments than I have awareness of or data for after going through so many systems. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py - for example, after converting into Keras, I only get left with a h5 file, with no knowledge of what an input_graph_def, or output_node_names might refer to.
Additionally, for whatever reason, I cannot find any TF version (1 or 2) that can successfully run this python script using 'from tensorflow.python.checkpoint import checkpoint_management' it genuinely seems like it not longer exists.
I am not sure why I am going through all of these conversions and steps but every attempt to find a cleaner process between training and unity seemed to lead only to dead ends.
Any help or guidance on this topic would be sincerely appreciated, thank you.

Yolov5 object detection training

Please i need you help concerning my yolov5 training process for object detection!
I try to train my object detection model yolov5 for detecting small object ( scratch). For labelling my images i used roboflow, where i applied some data augmentation and some pre-processing that roboflow offers as a services. when i finish the pre-processing step and the data augmentation roboflow gives the choice for different output format, in my case it is yolov5 pytorch, and roboflow does everything for me splitting the data into training validation and test. Hence, Everything was set up as it should be for my data preparation and i got at the end the folder with data.yaml and the images with its labels, in data.yaml i put the path of my training and validation sets as i saw in the GitHub tutorial for yolov5. I followed the steps very carefully tought.
The problem is when the training start i get nan in the obj and box column as you can see in the picture bellow, that i don't know the reason why, can someone relate to that or give me any clue to find the solution please, it's my first project in computer vision.
This is what i get when the training process starts
This the last message error when the training finish
I think the problem comes maybe from here but i don't know how to fix it, i used the code of yolov5 team as it's in the tuto
The training continue without any problem but the map and precision remains 0 all the process !!
Ps : Here is the link of tuto i followed : https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
This is what I would do to troubleshoot it. - Run your code on collab because the environment is proven to work well - Confirm that your labels look good and are setup correctly. Can you checked to ensure the classes look right? In one of the screenshots it looks like you have no labels
Running my code in colab worked successfully and the resulats were good. I think that the problem was in my personnel laptop environment maybe the version of pytorch i was using '1.10.0+cu113', or something else ! If you have any advices to set up my environnement for yolov5 properly i would be happy to take from you guys. many Thanks again to #alexheat
I'm using Yolov5 for my custom dataset too. This problem might be due to the directory misplacement.
And using different version of Pytorch will not be a problem. Anyway you can try using the version they mentioned in 'requirements.txt'
It's better if you run
cd yolov5
pip3 install -r requirements.txt
Let me know if this helps.

Tensorflow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and doing sess.run(). However, on profiling these runs, I realize the first run always has a higher time as compared to the subsequent runs.
I found an answer here, as to why that happens, but there was no solution on how to fix.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run(), for 1,2,3, and the 4th image looks something like (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
Just a background on what all I have tried:
I tried using the TFRecords approach TensorFlow gave as a sample here. I hoped it would work better because it doesn't use a feed_dict. But since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to the disk, but always got some error regarding the encoding of the image.
I tried using the TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since the during inference I need to provide input for "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TLDR: Looking to reduce the run time of inference for the first image - for deployment purposes.
Even though I have seen that the first inference takes longer, the difference (84 Vs 1.5) that is shown there seems to be a bit unbelievable. Are you counting the time to load model also, inside this time metric? Can this could be the difference for the large time difference? Is the topology that complex, so that this time difference could be justified?
My suggestions:
Try Openvino : See if the topology you are working on, is supported in Openvino. OpenVino is known to speed up the inference workloads by a substantial amount due to it's capability to optimize network operations. Also, the time taken to load openvino model, is comparitively lower in most of the cases.
Regarding the TFRecords Approach, could you please share the exact error and at what stage you got the error?
Regarding Tensorflow datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 & https://www.tensorflow.org/guide/datasets. Regarding the "image tensor" key in the graph, I hope, your original inference pipeline should give you some clues.

Training segmentation model, 4 GPUs are working, 1 fills and getting: "CUDA error: out of memory"

I'm trying to build a segmentation model,and I keep getting
"CUDA error: out of memory",after ivestigating, I realized that all the 4 GPUs are working but one of them is filling.
Some technical details:
My Model:
the model is written in pytorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs with 12GRAM (Titan V) each.
I'm trying to understand why one of my GPUs is filling up, and what am I doing wrong.
Evidence:
as can be seen from the screenshot below, all the GPUs are working, but one of them just keep filling until he gets his limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that working with the GPUs, if I missed something, and you need anything else, please share.
I'm feeling like I'm missing something so basic, that will affect not only this model but also the models in the future, I'll be more than happy for some help.
Thanks a lot!
Without knowing much of the details I can say the following
nvidia-smi is not the most reliable and up-to-date measurement mechanism
the PyTorch GPU allocator does not help either - it will cache blocks of memory artificially blowing up used resources (though it is not an issue here)
I believe there is still a "master" GPU which is the one data is loaded to directly (and then broadcast to other GPUs in DataParallel)
I don't know enough about PyTorch to reliably answer, but you can definitely check if a single GPU setup works with batch size divided by 4. And perhaps if you can load the model + the batch at one (without processing it).

Tensorboard projector will compute PCA endlessly

I have just over 100k word embeddings which I created using gensim, originally each containing 200 dimensions. I've been trying to visualize them within tensorboard's projector but I have only failed so far.
My problem is that tensorboard seems to freeze while computing PCA. At first, I left the page open for 16 hours, imagining that it was just too much to be calculated, but nothing happened. At this point, I started to try and test different scenarios just in case all I needed was more time and I was trying to rush things. The following is a list of my testing so far, all of which failed at the same spot, computing PCA:
I plotted only 10 points of 200 dimensions;
I retrained my gensim model so that I could reduce its dimensionality to 100;
Then I reduced it to 10;
Then to 2;
Then I tried plotting only 2 points, i.e. 2 two dimensional points;
I am using Tensorflow 1.11;
You can find my last saved tensor flow session here, would you mind trying it out?
I am still a beginner, therefore I used a couple tutorial to get me started; I used Sud Harsan work so far.
Any help is much appreciated. Thanks.
Updates:
A) I've found someone else dealing with the same problem; I tried the solution provided, but it didn't change anything.
B) I thought it could have something to do with my installation, therefore I tried uninstalling tensorflow and installing it back; no luck. I then proceeded to create a new environment dedicated to tensorflow and that also didn't work.
C) Assuming there was something wrong with my code, I ran tensorflow's basic embedding tutorial to check if I could open its projector's results. And guess what?! I still can't go past "Calculating PCA"
Now, I did visit the online projector example and that loads perfectly.
Again, Any help would be more than appreciated. Thanks!
I have the same problem with word2vec_basic.py
My environment: win10, conda, python 3.6.7, tensorflow 1.11, tensorboard 1.11
That may not your fault because I roll back tensorflow & tensorboard from 1.11 to 1.7
And guess what?! The projector appears just a few seconds!
reference
Update 10/11
tensorboard & tensorflow 1.12 are available in conda today, I take a try and this problem seems to be fixed.
As mentioned by Bluedrops, updating tensorboard and tensorflow seems to fix the problem.
I created a new environment with conda and installed the newest versions of Tensorflow, Tensorboard and their dependencies and that seems to fix the issue.