I've just started learning TensorFlow and have run into an issue that is making me doubt my understanding of how it should work. I want a rough idea of how much performance I should get from basic arithmetic operations on a GPU. I create a one-dimensional tensor of 100 million elements and then chain 1000 add operations on this tensor. My expectation was that the TensorFlow runtime would link these operations into a single CUDA kernel executed on the GPU; however, when I run it, each operation seems to be issued to the GPU separately. It takes around 5 seconds to complete on my GTX 1080 Ti, which works out to around 20 GFLOPS. While it is running, python.exe uses a full CPU core and NVIDIA Nsight shows many kernels being submitted. In comparison, when I try the same thing with Alea.GPU I get around 3 TFLOPS and a single CUDA kernel issued.
Am I misunderstanding how basic operations should work on a GPU? Is the only way to get good GPU efficiency to manually group operations into more complex custom operations, or to use the higher-level ML functions?
Thank you.
    import tensorflow as tf
    import time

    TENSOR_SIZE = 100000000
    TF_REP = 1000

    def testSpeed(x):
        tf.InteractiveSession()
        z = tf.zeros(TENSOR_SIZE)
        for i in range(0, TF_REP):
            z = tf.add(z, x)
        return tf.reduce_sum(z).eval()

    x = tf.range(0.0, TENSOR_SIZE)
    t0 = time.perf_counter()
    testSpeed(x)
    t1 = time.perf_counter()
    print("Time taken " + str(t1 - t0) + "s gflops= " + str(TENSOR_SIZE * TF_REP / 1000000000.0 / (t1 - t0)))
Firstly, you should separate your code into two stages: a graph-building stage, which defines the various tensors (I suggest collecting that in a function called build_graph()), and then a stage where you create your session and run data through it. Right now you are applying procedural programming techniques to a declarative library.
Next is the issue of moving data onto and off of the GPU. Whenever you run tf.reduce_sum(z).eval(), the result is copied from the GPU back to the CPU.
Lastly, you are creating multiple sessions with tf.InteractiveSession(); you should only ever create one session. Go back to the first issue to resolve this. A best practice is to never create TensorFlow ops after the session has been created. TensorFlow will let you, but as a best practice don't, and you shouldn't need to if you structure things correctly. If you feel you need to, post a question asking why you can't do XYZ without defining it before the session was created, and someone will almost certainly offer a correction to the workflow.
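A minimal sketch of that two-stage structure applied to the benchmark above (build_graph is just a suggested name; this alone will not fuse the adds into one kernel, but it keeps graph construction out of the session/run phase and avoids repeated session creation):

    import time
    import tensorflow as tf

    TENSOR_SIZE = 100000000
    TF_REP = 1000

    def build_graph():
        # Build every op once, before any session exists.
        x = tf.range(0.0, TENSOR_SIZE)
        z = tf.zeros(TENSOR_SIZE)
        for _ in range(TF_REP):
            z = tf.add(z, x)
        return tf.reduce_sum(z)

    total = build_graph()

    with tf.Session() as sess:        # a single session for all runs
        t0 = time.perf_counter()
        sess.run(total)               # one device-to-host copy, at the very end
        t1 = time.perf_counter()
        print("Time taken", t1 - t0, "s")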
I profiled a model that I am running, and the vast majority of the time in each step (295 of 320 ms) is taken up by "device-to-device" operations (see the profiler screenshot below). I assume this means that loading data from my CPU onto my GPU and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using TensorFlow's tf.data.Dataset API and doing all the recommended things, such as prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE.
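For reference, the kind of input pipeline being described looks roughly like the sketch below; the file pattern, record format and parse_fn are placeholders, not the actual code from this question:

    import tensorflow as tf

    def make_dataset(file_pattern, parse_fn, batch_size):
        # Hypothetical pipeline: TFRecord files plus a user-supplied parse_fn.
        ds = tf.data.Dataset.list_files(file_pattern)
        ds = ds.interleave(tf.data.TFRecordDataset,
                           num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.batch(batch_size)
        ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
        return ds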
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Screenshot: TensorBoard profiling overview]
Not a proper answer, but it's something: by using TensorFlow's mixed precision training I was able to reduce the "device-to-device" time to roughly 145 ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped either. I assume mixed-precision training means fewer bytes are being moved around, so maybe that helps.
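For anyone who wants to try the same thing, mixed precision can be enabled globally via the Keras mixed-precision API; the call below is for TF 2.4 and later, while older releases used tf.keras.mixed_precision.experimental.set_policy instead:

    import tensorflow as tf

    # Compute in float16 where safe, keep variables in float32.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")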
While GPUs speed up math, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy, it is quite slow. But the second time it is fast.
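A quick way to see that first-call overhead is to time the same operation a few times in a row; this is only a sketch (the array size is arbitrary, and the exact gap between the first and later calls depends on the GPU and cupy version):

    import time
    import cupy as cp

    a = cp.arange(1000000, dtype=cp.float32)
    b = cp.arange(1000000, dtype=cp.float32)

    for i in range(3):
        cp.cuda.Device().synchronize()
        t0 = time.perf_counter()
        c = a * b                        # first call may compile/cache the kernel
        cp.cuda.Device().synchronize()   # wait for the GPU so the timing is honest
        print("call", i, ":", time.perf_counter() - t0, "s")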
I've realized I don't understand how the kernel, i.e. the GPU code, gets out to the GPU to run. Operationally I want to understand this better, so that I know when something I do will accidentally create a slow step due to some kernel transfer. So I need some rules of thumb to understand the concept.
For example, if I multiply two cupy arrays that are already resident on the GPU, I might write C = A * B.
At some point the cupy overload of the * operator has to be turned into GPU code, and it automagically also gets wrapped in the loops that break the work down into blocks and threads. So presumably this code is some kernel that gets shipped out to the GPU. I'm guessing that the next time I call C*D the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point, I imagine, the GPU needs to clear out old code, so * or other operations not in use at that moment might get flushed from memory, and later, when the call to A*B happens again, there will be a time penalty to recompile it on the GPU.
Or so I imagine. If I'm right, how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data has already been transferred into arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit thread-management kernels as one does in CUDA C++, and instead just use the standard numba @njit, @vectorize and @stencil decorators. Likewise, in cupy I want to work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this, but they just refer to the overhead of kernels, not when that overhead gets paid or how one controls it, so I'm confused.
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function @cupy.fuse(), which makes it clearer than the @numba.jit documentation where the kernel launch costs are paid. I have not found the connection to Contexts yet, as recommended by @talonmies.
see https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this
    import cupy

    c = cupy.arange(4)

    @cupy.fuse()
    def foo(x):
        return x + x + x + x + x
Calling foo(c) will be roughly three times slower with @cupy.fuse() commented out, because each "+" then involves a kernel launch and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid only once. For arrays of fewer than about a million elements on a typical 2018 GPU, the add() itself is so fast that the launch and free dominate the time.
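A rough way to measure that difference yourself; this is only a sketch, and the array size, repetition count and resulting ratio are guesses that will vary with hardware and cupy version:

    import time
    import cupy

    x = cupy.arange(100000, dtype=cupy.float32)

    def plain(y):
        return y + y + y + y + y

    @cupy.fuse()
    def fused(y):
        return y + y + y + y + y

    for name, fn in (("plain", plain), ("fused", fused)):
        fn(x)                                  # warm up: compile/cache kernels
        cupy.cuda.Device().synchronize()
        t0 = time.perf_counter()
        for _ in range(1000):
            fn(x)
        cupy.cuda.Device().synchronize()
        print(name, time.perf_counter() - t0, "s")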
I wish I could find some documentation on @fuse. For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
I'm still however largely in the dark about when the costs are getting paid in numba.
I am pretty new to TensorFlow. I would like to track the IO time and bandwidth (preferably the percentage of training time spent on checkpointing IO) for the checkpointing performed by the internal mechanism of the high-level tf.train.MonitoredTrainingSession, which is set up by adding a tf.train.CheckpointSaverHook when initializing the tf.train.MonitoredTrainingSession.
I am thinking about using a tf.train.CheckpointSaverListener (i.e. its before_save and after_save methods) to log the time and track the IO. But I have a question: will this logging technique give me a proper percentage calculation (i.e. time taken for checkpointing IO / time taken for training * 100%)?
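Roughly what I have in mind is sketched below; the hook parameters and the way the listener accumulates time are my own illustration, not tested code:

    import time
    import tensorflow as tf

    class CheckpointTimingListener(tf.train.CheckpointSaverListener):
        """Accumulates wall-clock time spent inside checkpoint saves."""
        def __init__(self):
            self.total_save_time = 0.0
            self._t0 = None

        def before_save(self, session, global_step_value):
            self._t0 = time.perf_counter()

        def after_save(self, session, global_step_value):
            self.total_save_time += time.perf_counter() - self._t0

    listener = CheckpointTimingListener()
    saver_hook = tf.train.CheckpointSaverHook(
        checkpoint_dir="/tmp/ckpt",   # placeholder path
        save_steps=1000,              # placeholder frequency
        listeners=[listener])
    # Pass saver_hook to tf.train.MonitoredTrainingSession via chief_only_hooks,
    # then compare listener.total_save_time against the overall training time.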
I suspect that this checkpointing is done asynchronously, on a thread separate from training. I have been looking into the TensorFlow code to find this out, but I thought asking this question here could accelerate my exploration.
I am open to suggestions for any other alternative technique (e.g. using TensorBoard, IO profiling tools, etc.).
I believe it will.
The checkpointing isn't done asynchronously. You'd want the checkpoint to contain a consistent snapshot of the variables/parameters and thus do not want to checkpoint asynchronously with other operations that may update the parameter values.
The CheckpointSaverHook explicitly uses the Session to execute the operation that saves the checkpoint (source code) and waits for it to complete (It's basically invoking tf.train.Saver.save).
So, the CheckpointSaverListener you thought of should work out fine - modulo the time taken by any other CheckpointSaverListeners in your program.
Hope that helps.
Recently I looked into reinforcement learning, and there is one question bugging me that I could not find an answer to: how is training done efficiently using GPUs? To my understanding, constant interaction with an environment is required, which seems like a huge bottleneck, since this task is often non-mathematical / non-parallelizable. Yet, for example, AlphaGo uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough for it to be worth the effort of running them on GPU (even if it means you're quite regularly "switching" between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy different from the one you are currently learning), an experience replay buffer is generally used. You can therefore grab a batch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
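As a concrete sketch of that pattern (the buffer size, batch size and transition layout below are placeholders, not taken from any particular paper or library):

    import random
    from collections import deque

    import numpy as np

    replay_buffer = deque(maxlen=1000000)     # filled on the CPU while acting

    def store(state, action, reward, next_state, done):
        replay_buffer.append((state, action, reward, next_state, done))

    def sample_batch(batch_size=256):
        batch = random.sample(replay_buffer, batch_size)
        # Stack into contiguous arrays so one host-to-device copy moves the whole
        # batch to the GPU, where the SGD update (e.g. a DQN/DDPG step) runs.
        return tuple(np.stack(column) for column in zip(*batch))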
One instance of CPU-GPU hybrid approach for RL is this - https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
I am trying to train a DNN model using TensorFlow. My script has two variables: one dense feature and one sparse feature. Each minibatch pulls the full dense feature and pulls the specified sparse features using embedding_lookup_sparse, and the feed-forward pass can only begin once the sparse feature is ready. I run my script with 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There is no dependency between the different parts of the trainable variables. I am not sure if there is a bug or some limitation, such as TensorFlow only being able to queue 40 fan-out requests. Any idea how to debug this? Thanks in advance.
[Screenshot: TensorFlow timeline profiling]
It sounds like you might have exactly 2 variables, one stored on PS0 and the other on PS1. The other 18 parameter servers are not doing anything. Take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them on separate parameter servers.
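For example, a sketch of sharding a large embedding variable with a fixed-size partitioner; the shape and the number of shards here are made up for illustration:

    import tensorflow as tf

    # Shard one big embedding matrix across the parameter servers instead of
    # storing the whole variable on a single PS.
    embeddings = tf.get_variable(
        "embeddings",
        shape=[10000000, 64],
        partitioner=tf.fixed_size_partitioner(num_shards=20))

    # tf.nn.embedding_lookup_sparse accepts the resulting PartitionedVariable
    # directly, so the lookup code does not need to change.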
This is a kind of hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it on chrome://trace).
The steps you have to perform are:
- download the TensorFlow source and check out the correct branch (r0.12 for example)
- modify the only place inside executor.cc that calls the SetTimelineLabel method: instead of recording only non-transferable nodes, you want to record Send/Recv nodes as well
- be careful to call SetTimelineLabel only once, inside NodeDone, as it sets the text string of a node, which will be parsed later by a python script
- build TensorFlow from the modified source
- modify the model code (for example, inception_distributed_train.py) to use the Timeline and graph metadata correctly (a sketch of this step follows below)
Then you can run the training and retrieve a JSON file for each iteration! :)
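For the Timeline/metadata step, the usual pattern for dumping a Chrome trace per sess.run looks roughly like this; sess, train_op and step are assumed to come from the surrounding training loop, and the filename is a placeholder:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    # Inside the training loop: trace one step and dump it as a Chrome trace.
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    tl = timeline.Timeline(run_metadata.step_stats)
    with open("timeline_step_%d.json" % step, "w") as f:
        f.write(tl.generate_chrome_trace_format())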
Some suggestions that were too big for a comment:
You can't see the data transfer in the timeline because tracing of Send/Recv is currently turned off; some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, and it shows second-level timestamps (see here about higher granularity).
So with vlog perhaps you can use messages generated by this line to see the times at which Send ops are generated.