TensorFlow object detection API not displaying global steps - tensorflow

I am new here. I recently started working with object detection and decided to use the TensorFlow object detection API. But when I start training the model, it does not display the global step as it should, although it is still training in the background.
Details:
I am training on a server and accessing it via OpenSSH from Windows. I trained on a custom dataset that I built by collecting and labeling pictures, and I trained using model_main.py. Until a couple of months ago the API was a little different, and only recently was it changed to the latest version. For instance, it previously used train.py for training instead of model_main.py. All the online tutorials I can find use train.py, so it might be a problem with the latest commit, but I can't find anyone else facing this problem.
Thanks in advance!

Add tf.logging.set_verbosity(tf.logging.INFO) after the import section of the model_main.py script. It will display a summary after every 100 steps.
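For instance, a minimal sketch of the placement (the rest of model_main.py stays unchanged):
# model_main.py, right after the existing imports
import tensorflow as tf

# INFO-level logging makes the Estimator print loss and global_step
# every 100 steps by default.
tf.logging.set_verbosity(tf.logging.INFO)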

As Thommy257 suggested, adding tf.logging.set_verbosity(tf.logging.INFO) after the import section of model_main.py prints the summary after every 100 steps by default.
Further, to specify the frequency of the summary, change
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
to
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, log_step_count_steps=k)
so that it prints after every k steps.
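For example, with a concrete value (k=50 here is just an illustration):
# print loss and global_step every 50 steps instead of the default 100
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                log_step_count_steps=50)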

Regarding the recent change to model_main, the previous version is available in the "legacy" folder. I use train.py and eval.py from this legacy folder with the same functionality as before.

Related

Freeze Saved_Model.pb created from converted Keras H5 model

I am currently trying to train a custom model for use in Unity (Barracuda) for object detection, and I am struggling near what I believe to be the last part of the pipeline. Following various tutorials and Git repos, I have done the following...
Using Darknet, I have trained a custom model based on Tiny-YOLOv2. (The model tested successfully in a webcam Python script.)
I have taken the final weights from that training and converted them to a Keras (.h5) file. (The model tested successfully in a webcam Python script.)
From Keras, I then use tf.saved_model to turn it into a saved_model.pb.
From saved_model.pb I then convert it using tf2onnx.convert into an ONNX file.
Supposedly, from there it can then work in one of a few Unity sample projects...
...however, the model fails to load in the Unity sample projects I've tried to use. From various posts it seems that I may need to use a 'frozen' saved_model.pb before converting it to ONNX. However, all the guides and Python functions that seem to be used for freezing saved models require far more arguments than I have knowledge of, or data for, after going through so many systems. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py - for example, after converting to Keras I am only left with an .h5 file, with no knowledge of what an input_graph_def or output_node_names might refer to.
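One route that might sidestep freezing entirely is converting the Keras model straight to ONNX (a sketch, assuming a recent tf2onnx version that provides the from_keras helper; the file names here are illustrative):
import tensorflow as tf
import tf2onnx

# Load the converted Keras model and convert it directly to ONNX,
# skipping the saved_model / freeze_graph steps.
model = tf.keras.models.load_model("tiny_yolov2.h5")
spec = (tf.TensorSpec(model.inputs[0].shape, tf.float32, name="input"),)
tf2onnx.convert.from_keras(model, input_signature=spec, opset=13,
                           output_path="tiny_yolov2.onnx")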
Additionally, for whatever reason, I cannot find any TF version (1 or 2) that can successfully run this Python script using 'from tensorflow.python.checkpoint import checkpoint_management'; it genuinely seems like it no longer exists.
I am not sure why I am going through all of these conversions and steps, but every attempt to find a cleaner process between training and Unity seemed to lead only to dead ends.
Any help or guidance on this topic would be sincerely appreciated, thank you.

Mask R-CNN not working properly on its own examples

I spent some time already to figure out how to get Mask R-CNN working properly.
I cloned the original Matterport implementation and a fork of it which has been modified to use TF 2.
The Matterport implementation seems to be somewhat outdated with respect to its dependencies, and I could not make it work. I saw that some people could make it work using different versions of the required libraries or some code changes here and there... so I thought I'd continue with the TF2-compatible version. A code change is needed there as well to make it work with the examples provided with Mask R-CNN. I hope that this is sufficient and that I did not miss something else.
E.g., I ran train_shapes.ipynb in the samples folder. The generated shapes are trained on top of pretrained COCO weights. So far so good.
The notebook generates a sample image with shapes and processes it. This is the result:
What could be the reason that so many shapes are detected that are not in the source image?
I was having the same issue. It is because model.detect does not work for TensorFlow 2.6 and above. When I downgraded to TensorFlow 2.4, everything worked. Check out this thread: https://github.com/matterport/Mask_RCNN/issues/2670

Yolov5 object detection training

Please, I need your help concerning my YOLOv5 training process for object detection!
I am trying to train my YOLOv5 object detection model to detect small objects (scratches). For labeling my images I used Roboflow, where I applied some data augmentation and some pre-processing that Roboflow offers as a service. When I finished the pre-processing and data augmentation steps, Roboflow offered a choice of different output formats; in my case it is YOLOv5 PyTorch, and Roboflow does everything for me, splitting the data into training, validation, and test sets. Hence, everything was set up as it should be for my data preparation, and at the end I got the folder with data.yaml and the images with their labels. In data.yaml I put the paths of my training and validation sets, as shown in the GitHub tutorial for YOLOv5. I followed the steps very carefully, though.
The problem is that when the training starts, I get nan in the obj and box columns, as you can see in the picture below. I don't know why; can someone relate to that or give me any clue to find the solution, please? It's my first project in computer vision.
This is what I get when the training process starts.
This is the last error message when the training finishes.
I think the problem maybe comes from here, but I don't know how to fix it; I used the YOLOv5 team's code as it is in the tutorial.
The training continues without any problem, but the mAP and precision remain 0 the whole time!!
PS: Here is the link to the tutorial I followed: https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
This is what I would do to troubleshoot it:
- Run your code on Colab, because that environment is proven to work well.
- Confirm that your labels look good and are set up correctly. Can you check that the classes look right? In one of the screenshots it looks like you have no labels; see the sketch below for a quick check.
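A minimal sketch for sanity-checking YOLO-format label files (assuming the standard layout where each line is "class x_center y_center width height" with coordinates normalized to [0, 1]; the labels/train path is illustrative):
import glob

for path in glob.glob("labels/train/*.txt"):
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            fields = line.split()
            if len(fields) != 5:
                print(f"{path}:{line_no}: expected 5 fields, got {len(fields)}")
                continue
            cls = fields[0]
            if not cls.isdigit():
                print(f"{path}:{line_no}: class id '{cls}' is not an integer")
            try:
                coords = [float(v) for v in fields[1:]]
            except ValueError:
                print(f"{path}:{line_no}: non-numeric coordinate")
                continue
            if any(not 0 <= v <= 1 for v in coords):
                print(f"{path}:{line_no}: coordinates not normalized: {coords}")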
Running my code in Colab worked successfully and the results were good. I think the problem was in my personal laptop environment, maybe the version of PyTorch I was using ('1.10.0+cu113'), or something else! If you have any advice on how to set up my environment properly for YOLOv5, I would be happy to take it from you guys. Many thanks again to @alexheat.
I'm using YOLOv5 for my custom dataset too. This problem might be due to directory misplacement.
Using a different version of PyTorch should not be a problem. Anyway, you can try using the versions mentioned in requirements.txt.
It's better if you run:
cd yolov5
pip3 install -r requirements.txt
Let me know if this helps.

Tensorboard projector will compute PCA endlessly

I have just over 100k word embeddings which I created using gensim, each originally containing 200 dimensions. I've been trying to visualize them in TensorBoard's projector, but I have only failed so far.
My problem is that TensorBoard seems to freeze while computing PCA. At first, I left the page open for 16 hours, imagining that it was just too much to be calculated, but nothing happened. At that point, I started to test different scenarios, just in case all I needed was more time and I was trying to rush things. The following is a list of my tests so far, all of which failed at the same spot, computing PCA:
I plotted only 10 points of 200 dimensions;
I retrained my gensim model so that I could reduce its dimensionality to 100;
Then I reduced it to 10;
Then to 2;
Then I tried plotting only 2 points, i.e. two two-dimensional points;
I am using TensorFlow 1.11.
You can find my last saved TensorFlow session here; would you mind trying it out?
I am still a beginner, so I used a couple of tutorials to get me started; I have used Sud Harsan's work so far.
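In case it helps to compare, the setup I pieced together from those tutorials looks roughly like this (a sketch of the standard TF 1.x projector flow; file names and paths are illustrative):
import os
import tensorflow as tf
from gensim.models import KeyedVectors
from tensorflow.contrib.tensorboard.plugins import projector

kv = KeyedVectors.load("embeddings.kv")       # illustrative path
log_dir = "projector_logs"
embeddings = tf.Variable(kv.vectors, name="word_embeddings")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embeddings.name
    embedding.metadata_path = "metadata.tsv"  # one word per line, same order as kv
    projector.visualize_embeddings(tf.summary.FileWriter(log_dir), config)
    tf.train.Saver().save(sess, os.path.join(log_dir, "model.ckpt"))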
Any help is much appreciated. Thanks.
Updates:
A) I've found someone else dealing with the same problem; I tried the solution provided, but it didn't change anything.
B) I thought it could have something to do with my installation, therefore I tried uninstalling tensorflow and installing it back; no luck. I then proceeded to create a new environment dedicated to tensorflow and that also didn't work.
C) Assuming there was something wrong with my code, I ran TensorFlow's basic embedding tutorial to check if I could open its projector results. And guess what?! I still can't get past "Calculating PCA".
Now, I did visit the online projector example and that loads perfectly.
Again, Any help would be more than appreciated. Thanks!
I have the same problem with word2vec_basic.py.
My environment: Win10, conda, Python 3.6.7, TensorFlow 1.11, TensorBoard 1.11.
It may not be your fault, because I rolled back tensorflow & tensorboard from 1.11 to 1.7.
And guess what?! The projector appeared after just a few seconds!
reference
Update 10/11:
TensorBoard & TensorFlow 1.12 are available in conda today. I gave it a try and this problem seems to be fixed.
As mentioned by Bluedrops, updating tensorboard and tensorflow seems to fix the problem.
I created a new environment with conda and installed the newest versions of Tensorflow, Tensorboard and their dependencies and that seems to fix the issue.

Tensorflow Object Detection API w/ TPU Training - Display more granular Tensorboard plots

I've been following this tutorial on the TensorFlow Object Detection API, and I've successfully trained my own object detection model using Google's Cloud TPUs.
However, the problem is that on Tensorboard, the plots I'm seeing only have 2 data points each (so it just plots a straight line), like this:
...whereas I want to see more "granular" plots like these below, which are much more detailed:
The tutorial I've been following acknowledges that this issue is caused by the fact that TPU training requires very few steps to train:
Note that these graphs only have 2 points plotted since the model trains quickly in very few steps (if you've used TensorBoard before you may be used to seeing more of a curve here)
I tried adding save_checkpoints_steps=50 in the file model_tpu_main.py (see code fragment below), and when I re-ran training, I was able to get a more granular plot, with 1 data point every 300 steps or so.
config = tf.contrib.tpu.RunConfig(
    # I added this line below:
    save_checkpoints_steps=50,
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_shards))
However, my training job is actually saving a checkpoint every 100 steps, rather than every 300 steps. Looking at the logs, my evaluation job is running every 300 steps. Is there a way I can make my evaluation job run every 100 steps (whenever there's a new checkpoint) so that I can get more granular plots on Tensorboard?
Code which addresses this issue is explained by a technical lead for the Google Cloud Platform in a Medium blog post. Alternatively, go directly to the GitHub code.
The train_and_evaluate function of 81 lines defines a TPUEstimator, a train_input_fn and an eval_input_fn. It then iterates over the training steps and calls estimator.train and estimator.evaluate in each iteration. The metrics can be defined in the model_fn, which is called image_classifier. Note that it currently has no effect to add tf.summary calls in the model function, since the TPU does not support them:
"TensorBoard summaries are a great way see inside your model. A minimal set of basic summaries are automatically recorded by the TPUEstimator, to event files in the model_dir. Custom summaries, however, are currently unsupported when training on a Cloud TPU. So while the TPUEstimator will still run locally with summaries, it will fail if used on a TPU." (source)
If summaries are important it might be more convenient to switch to training on GPU.
Personally I think writing this code is quite a hassle for something which should be handled by the API. Please update this answer if better solutions exist! I'm looking forward to it.
Set save_summary_steps in RunConfig to 100, so you get the statistics you want.
Also set iterations_per_loop to 100 so that the training doesn't run more steps per loop than that.
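Applied to the RunConfig shown above, that would look roughly like this (a sketch; only the summary, checkpoint, and loop arguments change):
config = tf.contrib.tpu.RunConfig(
    save_checkpoints_steps=100,   # checkpoint every 100 steps
    save_summary_steps=100,       # write summaries every 100 steps
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,  # return control to the host every 100 steps
        num_shards=FLAGS.num_shards))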
p.s. I hope you realize that checkpointing is very slow. You are probably raising the cost of your job just for the sake of a pretty graph :)
You can try adding throttle_secs=100 to the EvalSpec constructor here. The default is 600 seconds.
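For example (a sketch, assuming the usual tf.estimator.train_and_evaluate setup; eval_input_fn stands for whatever your pipeline already uses):
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,   # your existing eval input function
    steps=None,               # evaluate on the full eval set
    throttle_secs=100)        # minimum seconds between evaluations (default 600)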