TensorFlow behaves differently between GPU and CPU

I modified the TensorFlow beginner tutorial example by adding a hidden layer. The recognition rate when running on the GPU is ~95%, but in CPU-only mode I get only ~40%. Does anybody know why the same Python code behaves so differently? Thanks.


GPU support for TensorFlow & PyTorch

Okay, so I've worked on a bunch of Deep Learning projects and internships now and I've never had to do heavy training. But lately I've been thinking of doing some Transfer Learning for which I'll need to run my code on a GPU. Now I have a system with Windows 10 and a dedicated NVIDIA GeForce 940M GPU. I've been doing a lot of research online, but I'm still a bit confused. I haven't installed the NVIDIA Cuda Toolkit or cuDNN or tensorflow-gpu on my system yet. I currently use tensorflow and pytorch to train my DL models. Here are my queries -
1-) When I define a tensor in tf or pytorch, it is a CPU tensor by default, so all the training I've been doing so far has been on the CPU. If I make sure to install the correct versions of CUDA, cuDNN and tensorflow-gpu (the latter specifically for TensorFlow), can I then run my models on my GPU using tf-gpu and pytorch, and that's it? (I'm aware of torch.cuda.is_available() in pytorch to check that pytorch can access my GPU, and of the device_lib module in tf to check whether my GPU is visible to TensorFlow. I'm also aware that tf doesn't support all NVIDIA GPUs.)
2-) Why does tf have a separate module for GPU support? PyTorch doesn't seem to have one; all you need to do is move your tensor with cpu() or cuda() to switch between the two.
3-) Why install cuDNN? I know it is a higher-level library built on CUDA to support training deep neural nets on the GPU. But do tf-gpu and torch actually use it in the backend while training on the GPU?
4-) After tf == 1.15, did they combine CPU and GPU support into one package?
First of all, the 940M is unfortunately a fairly weak GPU for training. I suggest using Google Colab for faster training, although the 940M should of course still be faster than the CPU. Here are my answers to your four questions.
1-) Yes, if you install the requirements correctly, you can run on the GPU. You can also place your data on the GPU manually; see the TensorFlow documentation for how. In PyTorch, you have to specify the device you want to use: define device = torch.device("cuda" if args.cuda else "cpu"), then always call .to(device) on your models and data. They will then run on the GPU when it is available.
2-) PyTorch also needs an extra installation (module) for GPU support. However, with recent updates, both TF and PyTorch make it easy to write GPU-compatible code.
3-) Both TensorFlow and PyTorch use cuDNN. You can use them without cuDNN, but as far as I know that hurts performance. I'm not sure about this, though.
4-) No, they are still different packages: tensorflow-gpu==1.15 and tensorflow==1.15. What they did with TF2 was make TensorFlow more like Keras, so it is simpler than 1.15 and earlier.
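The device pattern from answer 1-) can be sketched as follows. This is a minimal illustration, not the answerer's code: it assumes PyTorch is installed, replaces the args.cuda flag with torch.cuda.is_available(), and uses a made-up two-layer model and random batch.

```python
import torch
import torch.nn as nn

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical model, sized for flattened 28x28 images and 10 classes.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)  # moves the parameters to the chosen device

# Hypothetical input batch; it must live on the same device as the model.
batch = torch.randn(32, 784).to(device)

with torch.no_grad():
    logits = model(batch)

print(logits.shape, logits.device)
```

The same script runs unchanged on a CPU-only machine, which is the main reason to route everything through a single device variable.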
The rest was already answered. Regarding 3): cuDNN optimizes layers and similar operations at the hardware level, and those implementations are pure black magic. It is incredibly hard to write CUDA code that properly utilizes your GPU (how to load data onto the GPU, how to actually perform the operations using matrices, etc.).

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am testing out a few models currently on the MNIST dataset (pretty basic stuff). I wanted to know, exactly how much my model is consuming memory-wise, during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried the config.gpu_options.allow_growth = True option, but I still can't get the exact memory python.exe is consuming because of some issues with nvidia-smi. I know I could run separate training and inference passes, but that is too cumbersome. It would be very easy if I could just find the right API for the job.
Tensorflow being such a well known and well-used library, I am hoping to find a better and faster way to get to these numbers.
Finally, once again, my question is:
How can I get the exact memory usage of a Keras model during training and inference?
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
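One way to read per-process GPU memory from nvidia-smi in a script is sketched below. It assumes nvidia-smi is on the PATH; the function names are made up for this example. Note that this reports what the driver has allocated to each process, not what the model strictly needs, so it is only meaningful with allow_growth enabled. On Windows, the per-process figure may come back as N/A, which could be the nvidia-smi issue mentioned above; the parser maps such values to None.

```python
import csv
import io
import subprocess

# nvidia-smi query for compute processes, as plain CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse the CSV text into a list of dicts; memory is in MiB or None."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (f.strip() for f in fields)
        rows.append({
            "pid": int(pid),
            "name": name,
            # Non-numeric values such as "[N/A]" become None.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def query_compute_apps():
    """Run nvidia-smi and return the parsed per-process memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling query_compute_apps() from a second terminal while training runs, and again during inference, gives two numbers to compare without a separate profiling pass.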

keras + scikit-learn wrapper, appears to hang when GridSearchCV with n_jobs >1

UPDATE: I have to re-write this question as after some investigation I realise that this is a different problem.
Context: running keras in a gridsearch setting using the kerasclassifier wrapper with scikit learn. Sys: Ubuntu 16.04, libraries: anaconda distribution 5.1, keras 2.0.9, scikitlearn 0.19.1, tensorflow 1.3.0 or theano 0.9.0, using CPUs only.
I simply used the code here for testing: https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/, the second example 'Grid Search Deep Learning Model Parameters'. Pay attention to line 35, which reads:
grid = GridSearchCV(estimator=model, param_grid=param_grid)
Symptoms: when grid search uses more than one job (meaning CPUs?), e.g. setting 'n_jobs' on the line above to '2', the line below:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
will cause the code to hang indefinitely, either with tensorflow or theano, and there is no cpu usage (see attached screenshot, where 5 python processes were created but none is using cpu).
By debugging, it appears to be the following line within 'sklearn.model_selection._search' that causes problems:
line 648: for parameters, (train, test) in product(candidate_params,
cv.split(X, y, groups)))
, on which the program hangs and cannot continue.
I would really appreciate some insights as to what this means and why this could happen.
Thanks in advance
Are you using a GPU? If so, you can't have multiple threads running each variation of the params because they won't be able to share the GPU.
Here's a full example on how to use keras, sklearn wrappers in a Pipeline with GridsearchCV: Pipeline with a Keras Model
If you really want to have multiple jobs in the GridSearchCV, you can try to limit the GPU fraction used by each job (e.g. if each job only allocates 0.5 of the available GPU memory, you can run 2 jobs simultaneously)
See these issues:
Limit the resource usage for tensorflow backend
GPU memory fraction does not work in keras 2.0.9 but it works in 2.0.8
I dealt with this problem too, and not being able to run what is essentially trivially parallelizable code really slowed me down. The issue is indeed with the TensorFlow session. If a session is created in the parent process before GridSearchCV.fit(), it will hang!
The solution for me was to keep all session/graph creation code restricted to the KerasClassifier class and the model creation function I passed to it.
Also, what Felipe said about the memory is true: you will want to restrict the memory usage of TF in either the model creation function or a subclass of KerasClassifier.
Related info:
Session hang issue with python multiprocessing
Keras + Tensorflow and Multiprocessing in Python
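The advice above, keeping session creation and the memory limit inside the model creation function, could look roughly like this. It is a sketch under assumptions: it uses the TensorFlow 1.x API (tf.ConfigProto, tf.Session) with standalone Keras, matching the versions in the question; the layer sizes are illustrative; and 0.5 is just the fraction suggested earlier for two jobs.

```python
def create_model():
    """Build a small Keras model; everything TensorFlow-related happens here.

    The imports are deliberately local, so the parent process never creates
    a TensorFlow session before GridSearchCV starts its workers.
    """
    import tensorflow as tf
    from keras import backend as K
    from keras.layers import Dense
    from keras.models import Sequential

    # TF 1.x style session config: let this job use at most half the GPU memory.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.5
    K.set_session(tf.Session(config=config))

    # Illustrative architecture: 8 input features, one sigmoid output.
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```

The function would then be handed to the wrapper as KerasClassifier(build_fn=create_model), with no other TensorFlow or Keras calls made at module level in the script.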
TL;DR Answer: You can't because your Keras model can't be serialized, and serialization is needed for parallelizing in Python with joblib.
This problem is described in more detail here: https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib
The solution to parallelize your code is to make your Keras estimator serializable. This can be done using savers as described at the link above.
If you're lucky enough to be using TensorFlow v2's prebuilt Keras module, the following practical code sample should be useful, as you would essentially just need to take the code and adapt it to yours:
In this example, all the saving and loading code is pre-written for you using Neuraxle-TensorFlow, and this makes it parallelizable if you use Neuraxle's AutoML methods (e.g. Neuraxle's grid search and Neuraxle's own parallelism features).
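The serialization point made in this answer can be illustrated without Keras or Neuraxle. The sketch below is only an analogy using the standard library: both classes are hypothetical stand-ins, and a threading.Lock plays the role of a live session or graph that pickle cannot handle. It is not the saver mechanism from the linked page.

```python
import pickle
import threading


class NaiveEstimator:
    """Holds an unpicklable handle, so it cannot be sent to worker processes."""

    def __init__(self, units):
        self.units = units
        self._handle = threading.Lock()  # stand-in for a session/graph


class SerializableEstimator(NaiveEstimator):
    """Same object, but drops the handle when pickled and rebuilds it after."""

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_handle"]  # keep only plain hyperparameters
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._handle = threading.Lock()  # recreate the handle in the worker


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError):
        return False


print(can_pickle(NaiveEstimator(8)), can_pickle(SerializableEstimator(8)))
```

The design choice is the same one the answer describes: serialize only what is needed to rebuild the estimator, and recreate the heavy runtime state on the other side.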

Does tensorflow automatically detect GPU or do I have to specify it manually?

I have code written in TensorFlow that I run on CPUs, and it runs fine.
I am moving to a new machine that has GPUs. I ran the code on the new machine, but the training speed did not improve as expected (it takes almost the same time).
I understood that TensorFlow automatically detects GPUs and runs the operations on them (https://www.quora.com/How-do-I-automatically-put-all-my-computation-in-a-GPU-in-TensorFlow) & (https://www.tensorflow.org/tutorials/using_gpu).
Do I have to change the code to make it run the operations on GPUs manually (for now I have a single GPU)? And what would be gained by doing that manually?
If the GPU version of TensorFlow is installed and if you don't assign all your tensors to CPU, some of them should be assigned to GPU.
To find out which devices (CPU, GPU) are available to TensorFlow, you can use this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # lists every CPU and GPU device TensorFlow can see
Regarding performance, it's quite a broad subject and it really depends on your model, your data and so on. Here are a few broad remarks on TensorFlow performance.
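For the question about manual placement, a minimal sketch is given below. It assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available (the device_lib call above is the older route), and the matrix size is arbitrary. It is meant to show what explicit placement looks like, not to suggest it is required.

```python
import tensorflow as tf

# Log the device each operation is placed on (must be set before running ops).
tf.debugging.set_log_device_placement(True)

# Ask TensorFlow which GPUs it can see; an empty list means CPU only.
gpus = tf.config.list_physical_devices("GPU")
device_name = "/GPU:0" if gpus else "/CPU:0"

# Explicit placement: operations created in this block run on device_name.
with tf.device(device_name):
    a = tf.random.uniform((256, 256))
    c = tf.matmul(a, a)

print(device_name, c.shape)
```

If the log shows operations landing on the CPU even though a GPU is installed, the usual suspects are a CPU-only TensorFlow build or a CUDA/cuDNN version mismatch rather than missing tf.device blocks.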

Does Gensim library support GPU acceleration?

The Word2vec and Doc2vec methods provided by Gensim have a distributed version that uses BLAS, ATLAS, etc. to speed things up (details here). However, does it support a GPU mode? Is it possible to get a GPU working with Gensim?
Thank you for your question. Using GPU is on the Gensim roadmap. Will appreciate any input that you have about it.
There is a version of word2vec running on Keras by @niitsuma called word2veckeras.
The code that runs on the latest Keras version is in this fork and branch: https://github.com/SimonPavlik/word2vec-keras-in-gensim/tree/keras106
@SimonPavlik has run performance tests on this code. He found that a single GPU is slower than multiple CPUs for word2vec.