Orthogonal initializer is being called on a matrix with more elements: Slowness may result - tensorflow

tensorflowjs version: 0.11.7
Keras version: 2.0.4
I am trying to run Keras converted model on browser. For conversion I have used tensorflowjs converter, the conversion went fine. However, at load time,
Orthogonal initializer is being called on a matrix with more than 2000
(2560000) elements: Slowness may result.
This message pops up and the memory blowup happens and lastly the browser stops working.
Following is my keras model architecture:
As we run the tensorflowjs code to load the aforementioned model, following warning pops up:
Is there a way to speed up the Orthogonal initialization process, so the model gets loaded in less time and with less effort?
Any help will be appreciated.

Related

OpenVINO - Toolkit with YoloV4

I am currently working with the YoloV3-tiny.
Repository: https://github.com/AlexeyAB/darknet
To import the network into C++ project I use OpenVINO-Toolkit. In more detail I use the following procedure to convert the network:
Converting YOLO* Models to the Intermediate Representation (IR)
This procedure carries out a conversion and an optimization to proceed with the inference.
Now, I would like to try the YoloV4 because it seems to be more effective for the purpose of the project. The problem is that OpenVINO Toolkit does not yet support this version and does not report the .json (file needed for optimization) file relative to version 4 but only up to version 3.
What has changed in terms of structure between version 3 and version 4 of the Yolo?
Can I hopefully hope that the conversion of the YoloV3-tiny (or YoloV3) is the same as the YoloV4?
Is the YoloV4 much slower than the YoloV3-tiny using only the CPU for inference?
When will the YoloV4-tiny be available?
Does anyone have information about it?
The difference between YoloV4 and YoloV3 is the backbone. YoloV4 has CSPDarknet53, whilst YoloV3 has Darknet53 backbone.
See https://arxiv.org/pdf/2004.10934.pdf.
Also, YoloV4 is not supported officially by OpenVINO. However, you can still test and validate YoloV4 on your end with some workaround. There is one way for now to run YoloV4 through OpenCV which will build network using nGraph API and then pass to Inference Engine. See https://github.com/opencv/opencv/pull/17185.
The key problem is the Mish activation function - there is no optimized implementation yet, which is why we have to implement it by definition with tanh and exponential functions. Unfortunately, one-to-one topology comparison shows significant performance degradation. The performance results are also available in the github link above.
https://github.com/TNTWEN/OpenVINO-YOLOV4
This is my project based on v3's converter (darknet -> tensorflow ->IR)and i have finished the adaptation of OpenVINO Yolov4,v4-relu,v4-tiny.
You could have a try. And you can use V4's IRmodel and run on v3's c++ demo directly

Tensorflow inference run time high on first data point, decreases on subsequent data points

I am running inference using one of the models from TensorFlow's object detection module. I'm looping over my test images in the same session and doing sess.run(). However, on profiling these runs, I realize the first run always has a higher time as compared to the subsequent runs.
I found an answer here, as to why that happens, but there was no solution on how to fix.
I'm deploying the object detection inference pipeline on an Intel i7 CPU. The time for one session.run(), for 1,2,3, and the 4th image looks something like (in seconds):
1. 84.7132628
2. 1.495621681
3. 1.505012751
4. 1.501652718
Just a background on what all I have tried:
I tried using the TFRecords approach TensorFlow gave as a sample here. I hoped it would work better because it doesn't use a feed_dict. But since more I/O operations are involved, I'm not sure it'll be ideal. I tried making it work without writing to the disk, but always got some error regarding the encoding of the image.
I tried using the TensorFlow datasets to feed the data, but I wasn't sure how to provide the input, since the during inference I need to provide input for "image tensor" key in the graph. Any ideas on how to use this to provide input to a frozen graph?
Any help will be greatly appreciated!
TLDR: Looking to reduce the run time of inference for the first image - for deployment purposes.
Even though I have seen that the first inference takes longer, the difference (84 Vs 1.5) that is shown there seems to be a bit unbelievable. Are you counting the time to load model also, inside this time metric? Can this could be the difference for the large time difference? Is the topology that complex, so that this time difference could be justified?
My suggestions:
Try Openvino : See if the topology you are working on, is supported in Openvino. OpenVino is known to speed up the inference workloads by a substantial amount due to it's capability to optimize network operations. Also, the time taken to load openvino model, is comparitively lower in most of the cases.
Regarding the TFRecords Approach, could you please share the exact error and at what stage you got the error?
Regarding Tensorflow datasets, could you please check out https://github.com/tensorflow/tensorflow/issues/23523 & https://www.tensorflow.org/guide/datasets. Regarding the "image tensor" key in the graph, I hope, your original inference pipeline should give you some clues.

Training segmentation model, 4 GPUs are working, 1 fills and getting: "CUDA error: out of memory"

I'm trying to build a segmentation model,and I keep getting
"CUDA error: out of memory",after ivestigating, I realized that all the 4 GPUs are working but one of them is filling.
Some technical details:
My Model:
the model is written in pytorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs with 12GRAM (Titan V) each.
I'm trying to understand why one of my GPUs is filling up, and what am I doing wrong.
Evidence:
as can be seen from the screenshot below, all the GPUs are working, but one of them just keep filling until he gets his limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that working with the GPUs, if I missed something, and you need anything else, please share.
I'm feeling like I'm missing something so basic, that will affect not only this model but also the models in the future, I'll be more than happy for some help.
Thanks a lot!
Without knowing much of the details I can say the following
nvidia-smi is not the most reliable and up-to-date measurement mechanism
the PyTorch GPU allocator does not help either - it will cache blocks of memory artificially blowing up used resources (though it is not an issue here)
I believe there is still a "master" GPU which is the one data is loaded to directly (and then broadcast to other GPUs in DataParallel)
I don't know enough about PyTorch to reliably answer, but you can definitely check if a single GPU setup works with batch size divided by 4. And perhaps if you can load the model + the batch at one (without processing it).

How to use Tensorflow model comparison with tflite_diff_example_test

I have trained a model for detection, which is doing great when embedded in tensorflow sample app.
After freezing with export_tflite_ssd_graph and conversion to tflite using toco the results do perform rather bad and have a huge "variety".
Reading this answer on a similar problem with loss of accuracy I wanted to try tflite_diff_example_test on a tensorflow docker machine.
As the documentation is not that evolved right now, I build the tool referencing this SO Post
using:
bazel build tensorflow/contrib/lite/testing/tflite_diff_example_test.cc which ran smooth.
After figuring out all my needed input parameters I tried the testscript with following commands:
~/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh tensorflow/contrib/lite/testing/tflite_diff_example_test '--tensorflow_model=/tensorflow/shared/exported/tflite_graph.pb' '--tflite_model=/tensorflow/shared/exported/detect.tflite' '--input_layer=a,b,c,d' '--input_layer_type=float,float,float,float' '--input_layer_shape=1,3,4,3:1,3,4,3:1,3,4,3:1,3,4,3' '--output_layer=x,y'
and
bazel-bin/tensorflow/contrib/lite/testing/tflite_diff_example_test --tensorflow_model="/tensorflow/shared/exported/tflite_graph.pb" --tflite_model="/tensorflow/shared/exported/detect.tflite" --input_layer=a,b,c,d --input_layer_type=float,float,float,float --input_layer_shape=1,3,4,3:1,3,4,3:1,3,4,3:1,3,4,3 --output_layer=x,y
Both ways are failing. Errors:
way:
tflite_diff_example_test.cc:line 1: /bazel: Is a directory
tflite_diff_example_test.cc: line 3: syntax error near unexpected token '('
tflite_diff_example_test.cc: line 3: 'Licensed under the Apache License, Version 2.0 (the "License");'
/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh: line 184: /tensorflow/: Is a directory
/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh: line 276: /tensorflow/: Is a directory
way:
2018-09-10 09:34:27.650473: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Failed to create session. Op type not registered 'TFLite_Detection_PostProcess' in binary running on d36de5b65187. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.)tf.contrib.resamplershould be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
I would really appreciate any help, that enables me to compare the output of two graphs using tensorflows given tests.
The second way you mentioned is the correct way to use tflite_diff. However, the object detection model containing the TFLite_Detection_PostProcess op cannot be run via tflite_diff.
tflite_diff runs the provided TensorFlow (.pb) model in the TensorFlow runtime and runs the provided TensorFlow Lite (.tflite) model in the TensorFlow Lite runtime. In order to run the .pb model in the TensorFlow runtime, all of the operations must be implemented in TensorFlow.
However, in the model you provided, the TFLite_Detection_PostProcess op is not implemented in TensorFlow runtime - it is only available in the TensorFlow Lite runtime. Therefore, TensorFlow cannot resolve the op. Therefore, you unfortunately cannot use the tflite_diff tool with this model.

keras + scikit-learn wrapper, appears to hang when GridSearchCV with n_jobs >1

UPDATE: I have to re-write this question as after some investigation I realise that this is a different problem.
Context: running keras in a gridsearch setting using the kerasclassifier wrapper with scikit learn. Sys: Ubuntu 16.04, libraries: anaconda distribution 5.1, keras 2.0.9, scikitlearn 0.19.1, tensorflow 1.3.0 or theano 0.9.0, using CPUs only.
Code:
I simply used the code here for testing: https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/, the second example 'Grid Search Deep Learning Model Parameters'. Pay attention to line 35, which reads:
grid = GridSearchCV(estimator=model, param_grid=param_grid)
Symptoms: When grid search uses more than 1 jobs (means cpus?), e.g.,, setting 'n_jobs' on the above line A to '2', line below:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
will cause the code to hang indefinitely, either with tensorflow or theano, and there is no cpu usage (see attached screenshot, where 5 python processes were created but none is using cpu).
By debugging, it appears to be the following line with 'sklearn.model_selection._search' that causes problems:
line 648: for parameters, (train, test) in product(candidate_params,
cv.split(X, y, groups)))
, on which the program hangs and cannot continue.
I would really appreciate some insights as to what this means and why this could happen.
Thanks in advance
Are you using a GPU? If so, you can't have multiple threads running each variation of the params because they won't be able to share the GPU.
Here's a full example on how to use keras, sklearn wrappers in a Pipeline with GridsearchCV: Pipeline with a Keras Model
If you really want to have multiple jobs in the GridSearchCV, you can try to limit the GPU fraction used by each job (e.g. if each job only allocates 0.5 of the available GPU memory, you can run 2 jobs simultaneously)
See these issues:
Limit the resource usage for tensorflow backend
GPU memory fraction does not work in keras 2.0.9 but it works in 2.0.8
I dealt with this problem too and it really slowed me down not being able to run what is essentially trivially-parallelizable code. The issue is indeed with the tensorflow session. If a session in created in the parent process before GridSearchCV.fit(), it will hang!
The solution for me was to keep all session/graph creation code restricted to the KerasClassifer class and the model creation function i passed to it.
Also what Felipe said about the memory is true, you will want to restrict the memory usage of TF in either the model creation function or a subclass of KerasClassifier.
Related info:
Session hang issue with python multiprocessing
Keras + Tensorflow and Multiprocessing in Python
TL;DR Answer: You can't because your Keras model can't be serialized, and serialization is needed for parallelizing in Python with joblib.
This problem is much detailed here: https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib
The solution to parallelize your code is to make your Keras estimator serializable. This can be done using savers as described at the link above.
If you're lucky enough to be using TensorFlow v2's prebuilt Keras module, the following practical code sample will reveal to be useful to you as you'd practically just need to take the code and modify it with yours:
https://github.com/guillaume-chevalier/seq2seq-signal-prediction
In this example, all the saving and loading code is all pre-written for you using Neuraxle-TensorFlow, and this makes it parallelizeable if you use Neuraxle's AutoML methods (e.g.: Neuraxle's grid search and Neuraxle's own parallelism things).