Distributed Tensorflow: check failed: size>=0 - tensorflow

I'm using keras 2.0.6. The version of tensorflow is 1.3.0.
My code runs with the Theano backend, but fails with the TensorFlow backend:
F tensorflow/core/framework/tensor_shape.cc:241] Check failed: size >= 0 (-14428307456 vs. 0)
I was wondering if anyone can think of any possible reason that might cause this.
Thank you!
----UPDATE-----
I tested exactly the same code on my PC with tensorflow. It runs perfectly.
However, it throws this error when I run it on a supercomputer.
Although this error looks like an overflow, it makes no sense that it wouldn't overflow on my PC but would overflow on a supercomputer.
I suspect it comes from a bug in TensorFlow's distributed computation.

I came across the same bug, but TensorFlow ran fine after I shrank the batch size.
I think the reason is the GPU running out of memory.

I ran into this error too; in my case, it came from using different TF versions, and it is now solved.
The model was trained in TF 1.15 but frozen in TF 1.13. When I froze it in TF 1.15, everything was fine.
I suggest checking the TF version used with your model.

Related

Keras getting frozen when using regularizer in CNN model

I had a custom CNN implementation in Keras running with the TensorFlow backend. To improve generalizability I was working on adding regularization to the CNN model. The model works fine without any activity/kernel regularization. The moment I add activity/kernel regularization, the model freezes partway through; training typically stops in the middle of an epoch, between batches/iterations (e.g., at batch 67/172). The issue is very repeatable and reproducible on my system, and I was able to localize it to the implementation of regularization. It was strange to see this behavior and I could not find similar issues reported by others. I am not sure if I need to provide any additional information; if someone can guide me on what is lacking, I would be more than happy to provide it, and guidance on the issue would be greatly appreciated.
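I added the regularization in the standard Keras way, roughly like this (a minimal sketch, not my actual model; the layer sizes and regularization factors are just placeholders):

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(256, 64, 1),
                  kernel_regularizer=regularizers.l2(1e-4),     # penalty on the layer weights
                  activity_regularizer=regularizers.l1(1e-5)),  # penalty on the layer output
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')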
The following is some helpful information about the libraries/dependencies:
Keras 2.4.3
Tensorflow 2.3.1
GPU: NVIDIA 1070 TI (8GB)
cudart64_101.dll was successfully opened
The code was written in Spyder running on Python 3.8
Input: batch size 32, input shape (32, 256, 64, 1)
Using model.fit function to train the model
100,277 parameters, 99,523 trainable
Actually, I think this issue was fixed after I updated the NVIDIA software to the latest version (11.1) and added the most recent paths to PATH.

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am testing out a few models currently on the MNIST dataset (pretty basic stuff). I wanted to know, exactly how much my model is consuming memory-wise, during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried the config.gpu_options.allow_growth = True option, but I am still not able to see the exact memory python.exe is consuming, due to some issues with nvidia-smi. I know that I could run a separate pass of training and inference, but this is too cumbersome. It would be very easy if I could just find the right API to do the job.
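For illustration, polling nvidia-smi from a Keras callback looks roughly like this (a rough sketch; it assumes nvidia-smi is on the PATH and reports whole-GPU usage, not just what python.exe is consuming):

import subprocess
from tensorflow.keras.callbacks import Callback

def gpu_memory_used_mib():
    # whole-GPU usage as reported by nvidia-smi, in MiB (one value per GPU)
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return [int(v) for v in out.decode().strip().splitlines()]

class GpuMemoryLogger(Callback):
    def on_epoch_end(self, epoch, logs=None):
        print('epoch {}: GPU memory used (MiB): {}'.format(epoch, gpu_memory_used_mib()))

# model.fit(x_train, y_train, callbacks=[GpuMemoryLogger()])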
TensorFlow being such a well-known and widely used library, I am hoping there is a better and faster way to get these numbers.
Finally, once again my question is:
How do I get the exact memory usage for a Keras model during training and inference?
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
Thanks!

TF Keras NAN Loss when using multiple GPUs

System:
Ubuntu 18.04 LTS
(2) NVIDIA GTX 1080Ti GPUs 11GB
Driver Version: 440.33.01
CUDA Version: 10.0
I am currently using Tensorflow 2.0 (Python) and the tf.keras library to train a CNN.
However, I am encountering an issue when I try to train my model by calling model.fit(). After I begin training, the loss is normal for 1-2 steps of the first epoch, but after that it suddenly becomes NaN. If I try to stop the kernel that is running the training script, the whole computer freezes.
This issue only happens when using multiple GPUs. The code I'm using works perfectly fine on a single GPU. I have wrapped all of my code inside the scope of a tf.distribute.MirroredStrategy using with strategy.scope():. I am feeding my network with data from a tf.data.Dataset (though this error occurs regardless of the data I'm using to train).
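For context, the setup has roughly the following shape (a minimal sketch with a placeholder model and dataset, not my actual code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(64, 64, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# dataset is a tf.data.Dataset yielding (images, labels) batches
# model.fit(dataset, epochs=10)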
I then ran some tests:
1) I tried to replace the data in my dataset with random numbers from a distribution, but the loss still went to NaN.
2) I also tried feeding the numpy arrays directly to .fit(), but that didn't solve the issue.
3) I tried using different optimizers (Adam, RMSprop, SGD), batch sizes (4, 8, 16, 32), and learning rates, none of which helped to solve this problem.
4) I swapped out my network for a simple Multi-layer Perceptron, but the error persisted.
This doesn't appear to be an OOM issue, since the data is relatively small and running watch -n0.1 nvidia-smi reveals that memory usage never exceeds 30% on either of my GPUs. There doesn't seem to be any warning or error in the console output that might hint at the issue either.
Any help is appreciated

Tensorboard projector will compute PCA endlessly

I have just over 100k word embeddings which I created using gensim, each originally containing 200 dimensions. I've been trying to visualize them in TensorBoard's projector, but so far I have only failed.
My problem is that tensorboard seems to freeze while computing PCA. At first, I left the page open for 16 hours, imagining that it was just too much to be calculated, but nothing happened. At this point, I started to try and test different scenarios just in case all I needed was more time and I was trying to rush things. The following is a list of my testing so far, all of which failed at the same spot, computing PCA:
I plotted only 10 points of 200 dimensions;
I retrained my gensim model so that I could reduce its dimensionality to 100;
Then I reduced it to 10;
Then to 2;
Then I tried plotting only 2 points, i.e. 2 two dimensional points;
I am using Tensorflow 1.11;
You can find my last saved TensorFlow session here; would you mind trying it out?
I am still a beginner, so I used a couple of tutorials to get me started (roughly the setup sketched below); I have used Sud Harsan's work so far.
Any help is much appreciated. Thanks.
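For reference, the kind of setup those tutorials walk through looks roughly like this (a minimal sketch for TF/TensorBoard 1.x; paths, names, and the embedding matrix are placeholders, not my actual code):

import os
import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

log_dir = 'logs/projector'
os.makedirs(log_dir, exist_ok=True)
vectors = np.random.rand(100000, 200).astype(np.float32)  # stand-in for the gensim vectors

embedding_var = tf.Variable(vectors, name='word_embeddings')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver([embedding_var]).save(sess, log_dir + '/model.ckpt')

    config = projector.ProjectorConfig()
    emb = config.embeddings.add()
    emb.tensor_name = embedding_var.name
    emb.metadata_path = 'metadata.tsv'  # one word per line, same order as the vectors
    projector.visualize_embeddings(tf.summary.FileWriter(log_dir), config)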
Updates:
A) I've found someone else dealing with the same problem; I tried the solution provided, but it didn't change anything.
B) I thought it could have something to do with my installation, therefore I tried uninstalling tensorflow and installing it back; no luck. I then proceeded to create a new environment dedicated to tensorflow and that also didn't work.
C) Assuming there was something wrong with my code, I ran tensorflow's basic embedding tutorial to check if I could open its projector's results. And guess what?! I still can't go past "Calculating PCA"
Now, I did visit the online projector example and that loads perfectly.
Again, Any help would be more than appreciated. Thanks!
I have the same problem with word2vec_basic.py
My environment: win10, conda, python 3.6.7, tensorflow 1.11, tensorboard 1.11
That may not be your fault, because I rolled back tensorflow & tensorboard from 1.11 to 1.7,
and guess what?! The projector appeared in just a few seconds!
reference
Update 10/11
tensorboard & tensorflow 1.12 are available in conda today; I gave them a try and this problem seems to be fixed.
As mentioned by Bluedrops, updating tensorboard and tensorflow seems to fix the problem.
I created a new environment with conda and installed the newest versions of Tensorflow, Tensorboard and their dependencies and that seems to fix the issue.

keras + scikit-learn wrapper appears to hang when running GridSearchCV with n_jobs > 1

UPDATE: I have to rewrite this question, as after some investigation I realised that this is a different problem.
Context: running Keras in a grid-search setting using the KerasClassifier wrapper with scikit-learn. System: Ubuntu 16.04; libraries: Anaconda distribution 5.1, Keras 2.0.9, scikit-learn 0.19.1, TensorFlow 1.3.0 or Theano 0.9.0, using CPUs only.
Code:
I simply used the code here for testing: https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/, the second example 'Grid Search Deep Learning Model Parameters'. Pay attention to line 35, which reads:
grid = GridSearchCV(estimator=model, param_grid=param_grid)
Symptoms: when grid search uses more than 1 job (meaning CPUs?), e.g. setting 'n_jobs' on the line above to '2', as in the line below:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
the code hangs indefinitely, with either tensorflow or theano, and there is no CPU usage (see attached screenshot, where 5 python processes were created but none is using CPU).
By debugging, it appears to be the following line (line 648) in 'sklearn.model_selection._search' that causes the problem:
for parameters, (train, test) in product(candidate_params,
                                         cv.split(X, y, groups)))
The program hangs on this line and cannot continue.
I would really appreciate some insights as to what this means and why this could happen.
Thanks in advance
Are you using a GPU? If so, you can't have multiple threads running each variation of the params because they won't be able to share the GPU.
Here's a full example of how to use the Keras sklearn wrappers in a Pipeline with GridSearchCV: Pipeline with a Keras Model
If you really want to have multiple jobs in the GridSearchCV, you can try to limit the GPU fraction used by each job (e.g. if each job only allocates 0.5 of the available GPU memory, you can run 2 jobs simultaneously)
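A sketch of how that per-job memory cap is usually set with the TF 1.x / Keras 2.0.x versions from the question (the 0.5 fraction is just an example value):

import tensorflow as tf
from keras import backend as K

def limit_gpu_memory(fraction=0.5):
    # cap the GPU memory this process may allocate (TF 1.x style)
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
    K.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))

# call this at the top of the model-creation function passed to KerasClassifier,
# so each grid-search job configures its own session
limit_gpu_memory(0.5)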
See these issues:
Limit the resource usage for tensorflow backend
GPU memory fraction does not work in keras 2.0.9 but it works in 2.0.8
I dealt with this problem too, and it really slowed me down not being able to run what is essentially trivially parallelizable code. The issue is indeed with the TensorFlow session. If a session is created in the parent process before GridSearchCV.fit(), it will hang!
The solution for me was to keep all session/graph creation code restricted to the KerasClassifier class and the model-creation function I passed to it.
Also, what Felipe said about the memory is true: you will want to restrict TF's memory usage in either the model-creation function or a subclass of KerasClassifier.
Related info:
Session hang issue with python multiprocessing
Keras + Tensorflow and Multiprocessing in Python
TL;DR Answer: You can't because your Keras model can't be serialized, and serialization is needed for parallelizing in Python with joblib.
This problem is much detailed here: https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib
The solution to parallelize your code is to make your Keras estimator serializable. This can be done using savers as described at the link above.
If you're lucky enough to be using TensorFlow v2's prebuilt Keras module, the following practical code sample will prove useful to you, as you'd practically just need to take the code and adapt it to yours:
https://github.com/guillaume-chevalier/seq2seq-signal-prediction
In this example, all the saving and loading code is pre-written for you using Neuraxle-TensorFlow, and this makes it parallelizable if you use Neuraxle's AutoML methods (e.g. Neuraxle's grid search and Neuraxle's own parallelism features).