Can't save YOLOv4 model because of array shape mismatch - tensorflow

I am able to run transfer learning on YOLOv4 and my custom dataset with the following command (which runs successfully and can identify test images I present to the model):
!./darknet detector train /content/darknet/build/darknet/x64/data/obj.data /content/darknet/build/darknet/x64/cfg/yolov4_train.cfg /content/darknet/build/darknet/x64/yolov4.conv.137 -dont_show
I am using the save_model.py tool from this github site:
!git clone https://github.com/hunglc007/tensorflow-yolov4-tflite
When I enter the following command to save the model it fails:
!python3 save_model.py --weights /content/darknet/build/darknet/x64/backup/yolov4_train_final.weights --output ./checkpoints/yolov4-224 --input_size 224
The failure is a mismatch between the weights saved in training and the expected array shape in the core/utility module utils.py (line 63):
Traceback (most recent call last):
File "save_model.py", line 58, in <module>
app.run(main)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "save_model.py", line 54, in main
save_tf()
File "save_model.py", line 49, in save_tf
utils.load_weights(model, FLAGS.weights, FLAGS.model, FLAGS.tiny)
File "/content/tensorflow-yolov4-tflite/core/utils.py", line 65, in load_weights
conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
ValueError: cannot reshape array of size 4554552 into shape (1024,512,3,3)
I added a debug print, and it looks like it's getting all the way to the last layer before choking. In other words, the previous layers all get through this line of code in utils.py with a match between the saved weights and the array shape. I think this is somehow related to the fact that I'm using image sizes of 224,224,3 instead of 416,416,3, but I did specify that in the input_size. For completeness, here are the last couple of debug prints before the Traceback above:
layer (out_dim, in_dim, height, width) 107 512 1024 1 1
layer (out_dim, in_dim, height, width) 108 1024 512 3 3
If anyone has any ideas, that would be great!
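The size mismatch itself can be checked with a little arithmetic. The sketch below starts from the numbers in the traceback and debug prints only. Its second half is a guess at the cause and rests on assumptions the question does not confirm: that the converter builds the three detection-head convolutions with 3 * (5 + 80) = 255 filters (the COCO class list), that those heads have 256, 512 and 1024 input channels, and that the two smaller heads are read before layer 108. The names built_filters and head_in are made up for illustration.

```python
# Numbers taken from the traceback and debug prints above.
expected = 1024 * 512 * 3 * 3   # elements layer 108 wants to read
available = 4554552             # elements actually left in the weights file
shortfall = expected - available
print(expected, shortfall)

# ASSUMPTION (not confirmed by the question): the converter built its
# detection heads for 80 classes, i.e. 3 * (5 + 80) = 255 filters each,
# while the trained file stores a different number of filters per head.
built_filters = 3 * (5 + 80)
head_in = (256, 512, 1024)      # assumed input channels of the three heads

# Each head stores in_channels * filters weights plus one bias per filter.
# The two heads read before layer 108 would consume too much of the file,
# and the last head (still unread) is part of what remains at layer 108:
#   available = expected + (1024 + 1) * f - (256 + 1 + 512 + 1) * (built - f)
over_read = (head_in[0] + 1) + (head_in[1] + 1)
per_filter = over_read + (head_in[2] + 1)
trained_filters, remainder = divmod(over_read * built_filters - shortfall, per_filter)
trained_classes = trained_filters // 3 - 5
print(trained_filters, remainder, trained_classes)
```

Under those assumptions the remainder is zero and the result is a single class, which would point at the class list used by the converter rather than at --input_size. Treat this as a hypothesis to check against the class count in your own obj.data, not as a confirmed diagnosis.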

Related

Camera Digit Prediction stopped working after moving to python 3.7, anyone know why?

The problem started when I moved my code from an interpreter based on Python 3.9 and tensorflow to one with Python 3.7 and tensorflow-directml (so I could use my AMD GPU). Training worked fine once I copied the code over, but when running the model I now get an error complaining about the shape of the input array to my neural network. The error does not occur with the original interpreter but does with the second one, even though the code is identical.
(The shapes of the digit array are the same for both versions (1, 28, 28) - binary image)
def cam_predict_digits(cam):
    dig = np.zeros((1, 28, 28))
    dig[0, :, :] = np.array(cam)
    digit = np.array(dig)
    print("predict input shape: " + str(digit.shape))
    # Make prediction
    prediction = model.predict(digit)
    print(prediction)
    print(f'Detected is probably: {np.argmax(prediction)}')
Traceback (most recent call last):
File "C:/Z_Uni/Individual_Project/Python_Projects/NeuralNet_GPU/Conv_NN_GPU_Model.py", line 123, in <module>
cam_predict_digits(Processed_Frame)
File "C:/Z_Uni/Individual_Project/Python_Projects/NeuralNet_GPU/Conv_NN_GPU_Model.py", line 74, in cam_predict_digits
prediction = model.predict(digit)
File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 908, in predict
use_multiprocessing=use_multiprocessing)
File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 716, in predict
x, check_steps=True, steps_name='steps', steps=steps)
File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 2471, in _standardize_user_data
exception_prefix='input')
File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet_GPU\source\lib\site-packages\tensorflow_core\python\keras\engine\training_utils.py", line 563, in standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (1, 28, 28)
Process finished with exit code 1
Could anyone explain why this is happening and what I can do to fix it? Thanks
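The error message says the first layer (conv2d_input) expects a 4-D batch, which for a Conv2D layer normally means (batch, height, width, channels). Below is a minimal sketch of reshaping the frame, assuming the model was built for single-channel 28x28 input. The random array stands in for the processed camera frame, and to_model_input is a made-up helper name.

```python
import numpy as np

def to_model_input(cam):
    """Turn one 28x28 frame into a (1, 28, 28, 1) batch."""
    frame = np.asarray(cam, dtype=np.float32)
    if frame.shape != (28, 28):
        raise ValueError(f"expected a 28x28 frame, got {frame.shape}")
    # Add the batch axis in front and the channel axis at the end.
    return frame[np.newaxis, :, :, np.newaxis]

# Stand-in for the processed camera frame (binary image).
cam = (np.random.rand(28, 28) > 0.5).astype(np.float32)
digit = to_model_input(cam)
print("predict input shape:", digit.shape)
# prediction = model.predict(digit)  # `model` is the trained network from the question
```

Why the older setup accepted a 3-D array is not clear from the question. Passing an explicit channel axis should be safe in both environments if the assumption about the input shape holds.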

How to convert YOLOv4-CSP darknet weight to Tensorflow format?

How to convert YOLOv4-CSP darknet weights to Tensorflow (tf) format?
I have tried using this repo but it didn't work.
I had this error message:
Traceback (most recent call last):
File "save_model.py", line 58, in <module>
app.run(main)
File "C:\Python37\lib\site-packages\absl\app.py", line 303, in run
_run_main(main, args)
File "C:\Python37\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "save_model.py", line 54, in main
save_tf()
File "save_model.py", line 49, in save_tf
utils.load_weights(model, FLAGS.weights, FLAGS.model, FLAGS.tiny)
File "D:\swap\20210319\tensorflow-yolov4-tflite\core\utils.py", line 63, in load_weights
conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
ValueError: cannot reshape array of size 3791890 into shape (1024,512,3,3)
The repository that you are using doesn't support conversion of Scaled YOLOv4 or YOLOv4-CSP yet. It's still a feature request according to this issue.
There's luckily a workaround. I found this repository that does the same thing, only difference being it converts the model to .h5 (keras format) before converting into tensorflow format. This also supports yolov4-csp.
I made a Google Colab notebook that does the conversion, which can be found here.

Training with multiple GPUs and ModelCheckpoint leads to exception

I'm training a 1D CNN with two GPUs (2xK80) with Keras (TensorFlow as backend).
The issue I'm having
My guess is that I'm trying to save the model weights from one GPU while the other GPU is still in the middle of training (or something like that), so I believe I'm looking for a way to halt the fit process when an epoch is done, save the weights, and then go on to the next epoch.
The exception I received
File "/root/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py", line 2622, in load_weights
load_weights_from_hdf5_group(f, self.layers)
File "/root/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py", line 3103, in load_weights_from_hdf5_group
layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/root/miniconda3/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 60, in __getitem__
attr = h5a.open(self._id, self._e(name))
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5a.pyx", line 77, in h5py.h5a.open
KeyError: "Can't open attribute (can't locate attribute: 'layer_names')"
root#algoGpu:/home/gpu_user/SourceCode/voc#
The question is
How can I train a model on multiple GPUs and at the same time use ModelCheckpoint to save best epoch's weights?

tensorflow: ValueError: GraphDef cannot be larger than 2GB

This is the error I got:
Traceback (most recent call last):
File "fully_connected_feed.py", line 387, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "fully_connected_feed.py", line 289, in main
run_training()
File "fully_connected_feed.py", line 256, in run_training
saver.save(sess, checkpoint_file, global_step=step)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1386, in save
self.export_meta_graph(meta_graph_filename)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1414, in export_meta_graph
graph_def=ops.get_default_graph().as_graph_def(add_shapes=True),
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2257, in as_graph_def
result, _ = self._as_graph_def(from_version, add_shapes)
File "/home/-/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2220, in _as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I believe it is from the result of this code
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden1")[0]
weights = tf.scatter_nd_update(weights,indices, updates)
weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden2")[0]
weights = tf.scatter_nd_update(weights,indices, updates)
I am not sure why my model is getting so big in size (15k steps and 240MB). Any thoughts? thanks!
It's hard to say what is happening without seeing the code, but in general TensorFlow model sizes will not increase with number of steps - they should be fixed.
If the model size is increasing with number of steps, it suggests that the computation graph is being added to on every step. For example, something like:
import tensorflow as tf
with tf.Session() as sess:
    for i in xrange(1000):
        sess.run(tf.add(1, 2))
        # or perhaps sess.run(tf.scatter_nd_update(...)) in your case
will create 3000 nodes in the graph (one for add, one for '1', and one for '2' on every iteration). Instead, you want to define your computational graph once and run it repeatedly with something like:
import tensorflow as tf
x = tf.add(1, 2)
# or perhaps x = tf.scatter_nd_update(...) in your case
with tf.Session() as sess:
    for i in xrange(1000):
        sess.run(x)
Which will have a fixed graph of 3 nodes for all the 1000 (and any more) iterations. Hope that helps.
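The node counts can be illustrated without TensorFlow. The sketch below uses a made-up ToyGraph class (not a TensorFlow API) that records a node each time an op or constant is created, mimicking how graph-mode calls such as tf.add append to the default graph.

```python
class ToyGraph:
    """Hypothetical stand-in for a graph: it only records node names."""

    def __init__(self):
        self.nodes = []

    def constant(self, value):
        self.nodes.append(f"Const_{len(self.nodes)}")
        return value

    def add(self, a, b):
        # Like tf.add(1, 2): each Python literal becomes its own constant node.
        a, b = self.constant(a), self.constant(b)
        self.nodes.append(f"Add_{len(self.nodes)}")
        return a + b


# Building the op inside the loop: three new nodes per iteration.
growing = ToyGraph()
for _ in range(1000):
    growing.add(1, 2)

# Building the op once: the graph stays at three nodes, however often
# the already-built op is evaluated afterwards.
fixed = ToyGraph()
fixed.add(1, 2)

print(len(growing.nodes), len(fixed.nodes))
```

The same reasoning would apply to the tf.scatter_nd_update calls in the question if they sit inside the training loop, since each call would add new nodes that are then serialized with every checkpoint.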

Near empty frozen graph after using freeze_graph from Tensorflow

I am currently trying to strip the training operations from my GraphDef so that I can run it on Android. However, to do so, I need to first freeze the graph using Tensorflow's freeze_graph.py script.
However, I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 331: invalid start byte when attempting to run the bash script:
#!/bin/bash
bazel-bin/tensorflow/python/tools/freeze_graph \
--input_graph=/Users/leslie/Downloads/trained_model.pb \
--input_checkpoint=/Users/leslie/Downloads/Y6_1478303913_Leslie \
--output_graph=/tmp/frozen_graph.pb --output_node_names=Y_GroundTruth
Could this be a problem in the way I created my graph and checkpoint? I created the input_graph via tf.train.write_graph(sess.graph_def, location, 'trained_model.pb', as_text=False) and the checkpoint is created via saver.save(sess, chkpointpath). Answers from StackOverflow say that the python script has non-ascii characters and that I should just simply strip them from the python script but I do not think that is such a great idea.
Full traceback:
Traceback (most recent call last):
File "/Users/leslie/tensorflow-master/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 135, in <module>
tf.app.run()
File "/Users/leslie/tensorflow-master/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/Users/leslie/tensorflow-master/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 132, in main
FLAGS.output_graph, FLAGS.clear_devices, FLAGS.initializer_nodes)
File "/Users/leslie/tensorflow-master/bazel-bin/tensorflow/python/tools/freeze_graph.runfiles/org_tensorflow/tensorflow/python/tools/freeze_graph.py", line 98, in freeze_graph
text_format.Merge(f.read().decode("utf-8"), input_graph_def)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 331: invalid start byte
I also generated my protobuf file with as_text = True and the error above did not show up. However, I only got the following output.
Converted 0 variables to const ops.
1 ops in the final graph.
Complete contents of "frozen_graph.pb"
6
Y_GroundTruth��Placeholder*�
�dtype��0�*�
�shape��:
Snippet of PB-file generation code:
#Start all code before training code
# Tensor placeholders and variables
...
# Network weights and biases
...
# Network layer definitions
...
# Definition of cost function
...
# Create optimizer
...
# Session operations
...
#END all code before training code
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, model_save_path)
    sess.run(tf.initialize_all_variables())
    tf.train.write_graph(sess.graph_def, outputlocation, 'trained_model.pb', as_text=False)
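One possible reading of the "1 ops in the final graph" output: freezing keeps only the nodes that the requested output nodes depend on, and Y_GroundTruth appears in the dump above as a Placeholder, which has no inputs. The toy sketch below illustrates that kind of pruning in plain Python. It is not TensorFlow's implementation, and every node name other than Y_GroundTruth (X, W, b, Y_pred, cost) is hypothetical.

```python
def reachable(graph, outputs):
    """Return the set of nodes that the given output nodes depend on."""
    keep, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(graph[node])   # follow the node's inputs
    return keep

# node -> list of input nodes; all names except Y_GroundTruth are made up.
graph = {
    "X": [],
    "Y_GroundTruth": [],            # placeholder: no inputs
    "W": [],
    "b": [],
    "Y_pred": ["X", "W", "b"],
    "cost": ["Y_pred", "Y_GroundTruth"],
}

print(sorted(reachable(graph, ["Y_GroundTruth"])))
print(sorted(reachable(graph, ["Y_pred"])))
```

If that is what is happening here, passing the name of the prediction op rather than the label placeholder to --output_node_names would keep the variables in the frozen graph.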