Limit GPU memory allocation in skflow - tensorflow

I am training convnets with Tensorflow and skflow, on an EC2 instance I share with other people. For all of us to be able to work at the same time, I'd like to limit the fraction of available GPU memory which is allocated.
This question shows how to do it with Tensorflow, but since I'm using skflow I never use a tf.Session() directly.
Is it possible to do the same thing through skflow?

At the moment, you can only control the number of cores (num_cores) used by estimators, by passing this parameter to the estimator.
One can add gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) to tf.ConfigProto, as suggested in the question you linked, to achieve what you need.
Feel free to submit a PR to make the changes there, as well as adding this additional parameter to all estimators. Otherwise, I'll make the changes some time this week.
Edit:
I have made the changes to allow those options. Please check "Building A Model Using Different GPU Configurations" example in examples folder. Let me know if there's any particular need or other options you want to add. Pull requests are always welcomed!
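In code, the options mentioned above would look something like this. This is only a sketch: the 0.333 fraction is just the value suggested in the answer, and it uses the tf.compat.v1 namespace so it also runs on current TensorFlow (on TF 1.x these classes are simply tf.GPUOptions and tf.ConfigProto):

```python
import tensorflow as tf

tf1 = tf.compat.v1  # on TF 1.x, these live directly under tf.*

# Cap this process at roughly one third of the GPU's memory,
# so several users can share the same card.
gpu_options = tf1.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf1.ConfigProto(gpu_options=gpu_options)

# The config would then be handed to the session, e.g.:
# sess = tf1.Session(config=config)
```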

Related

Is session really needed in tensorflow2?

I'm confused about why the session is kept in TF2, since according to the official docs eager mode has so many benefits. Also, I'm sometimes not sure whether to use a session or not, and I keep making bugs in TF programming, sometimes adding a session just to try my luck.
Tensorflow 2 does not require a session.
Every v1.Session.run call should be replaced by a Python function:
The feed_dict and v1.placeholders become function arguments.
The fetches become the function's return value.
During conversion, eager execution allows easy debugging with standard Python tools like pdb.
For more details, take a look at the TensorFlow migration guide.
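The mapping above can be sketched like this (assuming TF 2.x; add_one is a made-up example function, not anything from the API):

```python
import tensorflow as tf

# TF 1.x style, for comparison:
#   result = sess.run(fetches=y, feed_dict={x: [1.0, 2.0]})
#
# TF 2.x style: a plain Python function replaces Session.run.
# feed_dict/placeholders become arguments, fetches become return values.
@tf.function
def add_one(x):
    return x + 1.0

result = add_one(tf.constant([1.0, 2.0]))  # eager call, no session needed
```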

How to disable summary for Tensorflow Estimator?

I'm using the Tensorflow-GPU 1.8 API on Windows 10. For many projects I use tf.Estimator, which really works great. It takes care of a bunch of steps, including writing summaries for Tensorboard. But right now the 'events.out.tfevents' file is getting way too big and I am running into "out of space" errors. For that reason I want to disable the summary writing, or at least reduce the amount of summaries written.
Going along with that mission, I found out about the RunConfig you can pass over at construction of the tf.Estimator. Apparently the parameter 'save_summary_steps' (which by default is 200) controls the way summaries are written out. Unfortunately, changing this parameter seems to have no effect at all. It won't disable the summary (using a None value) or reduce the file size of 'events.out.tfevents' (choosing higher values, e.g. 3000).
I hope you guys can help me out here. Any help is appreciated.
Cheers,
Tobs.
I've observed the following behavior. It doesn't make sense to me, so I hope we get a better answer:
When the input_fn gets data from tf.data.TFRecordDataset, the number of steps between saved events is the minimum of save_summary_steps and the number of training examples divided by the batch size. That means summaries are saved at least once per epoch.
When the input_fn gets data from tf.TextLineReader, it follows save_summary_steps as you'd expect, and I can give it a large value for infrequent updates.
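For reference, the RunConfig approach the question describes looks roughly like this (a sketch against the tf.estimator API; whether None fully disables summaries appears to be version-dependent, as the question notes):

```python
import tensorflow as tf

# Raise the interval between summary saves; the question reports that
# None (meant to disable summaries) had no effect in TF 1.8.
run_config = tf.estimator.RunConfig(save_summary_steps=3000)

# The config is passed at estimator construction, e.g. (my_model_fn is
# a hypothetical model function, not part of the API):
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```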

Does TensorFlow support to save the initial hyper-parameter configuration automatically?

We need to run the designed networks many times for better performance, and it would be useful to record the experiments we have run. Maybe it would be good if the tensorflow execution engine could record these hyper-parameter configurations automatically. For example, I currently record them by setting a different name for the log directory, such as:
log_lr_0.001_theta_0.1_alpha_0.1
log_lr_0.01_theta_0.01_alpha_0.02
....
Are there any automatic ways to help with this? In addition, it would be better if, when we start a new tensorflow training instance, a new port were allocated and a new tensorboard started to show its learning state.
No, tensorflow doesn't support recording the initial hyper-parameter configuration automatically.
I've faced the same issue as you, and I'm using a tool called Sacred, which I hope you'll find useful.
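If you just want the directory-name convention from the question without an extra dependency, a small helper can build the names for you (hparam_log_dir is a hypothetical helper, not part of TensorFlow):

```python
import os

def hparam_log_dir(base_dir, **hparams):
    # Encode the hyper-parameters into the directory name, sorted by key
    # so the same configuration always maps to the same directory.
    parts = [f"{key}_{value}" for key, value in sorted(hparams.items())]
    return os.path.join(base_dir, "log_" + "_".join(parts))

path = hparam_log_dir("runs", lr=0.001, theta=0.1, alpha=0.1)
# yields something like runs/log_alpha_0.1_lr_0.001_theta_0.1
```

This directory can then be passed as the log dir for TensorBoard's summary writer.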

Tensorflow not linking operations into single CUDA kernel

I've just started learning how to use Tensorflow and have run into an issue that's making me doubt my understanding of how it should work. I want to get a rough idea of how much performance I should be getting using basic arithmetical operations on a GPU. I create a one-dimensional tensor of 100 million elements and then chain 1000 add operations on this tensor. My expectation is that the Tensorflow run-time would be able to link these operations into a single CUDA kernel executed on the GPU; however, when I run it, it seems that each operation is being issued to the GPU separately. It takes around 5 seconds to complete on my GTX 1080 Ti, which gives around 20 Gflops. While running, python.exe uses up a full CPU core and Nvidia Nsight shows many kernels being submitted. In comparison, when I try to see what I get with Alea.GPU, I get around 3 Tflops and a single CUDA kernel issued.
Am I misunderstanding how basic operations should work on a GPU? Is the only way to get good GPU efficiency to manually group operations into more complex custom operations, or to use the higher-level ML functions?
Thank you.
import tensorflow as tf
import time

TENSOR_SIZE = 100000000
TF_REP = 1000

def testSpeed(x):
    tf.InteractiveSession()
    z = tf.zeros(TENSOR_SIZE)
    for i in range(0, TF_REP):
        z = tf.add(z, x)
    return tf.reduce_sum(z).eval()

x = tf.range(0.0, TENSOR_SIZE)
t0 = time.perf_counter()
testSpeed(x)
t1 = time.perf_counter()
print("Time taken " + str(t1 - t0) + "s gflops= " + str(TENSOR_SIZE * TF_REP / 1000000000.0 / (t1 - t0)))
Firstly, you should separate your code into two stages: a build-graph stage, which defines the various tensors (I suggest collecting them in a function called build_graph()), and a run stage, in which you create your session and run data through it. You are trying to apply procedural programming techniques to a declarative, graph-based library.
Next is the issue of swapping data onto and off of the GPU. When you run tf.reduce_sum(z).eval(), you are copying the result from the GPU back to the CPU every time.
Lastly, you are creating many sessions with tf.InteractiveSession(); you should only create one session. Go back to the first issue to resolve this. A best practice is to never create tensorflow OPs after the session has been created. Tensorflow will allow you to, but as a best practice, don't; you shouldn't need to if you code things correctly. If you feel like you need to, post a question asking why you can't do XYZ without defining it before the session was created, and someone will almost certainly offer a correction to the workflow.
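Putting those three points together, a restructured sketch of the benchmark might look like the following. The sizes are shrunk for illustration, and it is written against the tf.compat.v1 namespace so it also runs on current TensorFlow (on TF 1.x the same calls live directly under tf.*):

```python
import tensorflow as tf
import time

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # use graph mode, as in TF 1.x

TENSOR_SIZE = 1000  # shrunk from 100000000 for illustration
TF_REP = 10         # shrunk from 1000

def build_graph():
    # Stage 1: define every op before any session exists.
    x = tf1.range(0.0, TENSOR_SIZE)
    z = tf1.zeros(TENSOR_SIZE)
    for _ in range(TF_REP):
        z = tf1.add(z, x)
    return tf1.reduce_sum(z)

total = build_graph()

# Stage 2: one session, one run, one copy back from the device.
with tf1.Session() as sess:
    t0 = time.perf_counter()
    result = sess.run(total)
    t1 = time.perf_counter()
```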

tensorflow one of 20 parameter server is very slow

I am trying to train a DNN model using tensorflow. My script has two variables: one is a dense feature and one is a sparse feature. Each minibatch pulls the full dense feature and pulls the specified sparse features using embedding_lookup_sparse; the feedforward pass can only begin after the sparse feature is ready. I run my script using 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the tensorflow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There is no dependency between the different parts of the trainable variables. I am not sure if there is a bug or a limitation issue, such as tensorflow only being able to queue 40 fan-out requests. Any idea how to debug it? Thanks in advance.
tensorflow timeline profiling
It sounds like you might have exactly 2 variables, one is stored at PS0 and the other at PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them at separate parameter servers.
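A sketch of what that partitioning might look like. The shard count matches the 20 parameter servers from the question, but the variable name and shape are made-up placeholders; it uses the tf.compat.v1 namespace (on TF 1.x these calls live directly under tf.*):

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # partitioned variables need graph mode

# Shard one large variable into 20 pieces so each parameter server can
# hold one shard, instead of one PS holding the whole variable.
partitioner = tf1.fixed_size_partitioner(num_shards=20)
with tf1.variable_scope("embedding", partitioner=partitioner):
    sparse_weights = tf1.get_variable(
        "sparse_feature_weights",
        shape=[1000000, 64],  # hypothetical vocabulary size x dimension
        dtype=tf.float32)
```

The lookup side (embedding_lookup_sparse) accepts such a partitioned variable, so the per-minibatch pulls are then spread across the shards.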
This is a kind of hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it on chrome://trace).
The steps you have to perform are:
download the TensorFlow source and check out the correct branch (r0.12 for example)
modify the only place that calls the SetTimelineLabel method inside executor.cc
instead of only recording non-transferable nodes, you want to record Send/Recv nodes too
be careful to call SetTimelineLabel only once inside NodeDone, as it sets the text string of a node, which will be parsed later by a python script
build TensorFlow from the modified source
modify the model code (for example, inception_distributed_train.py) to use the Timeline and graph metadata correctly
Then you can run the training and retrieve JSON file once for each iteration! :)
Some suggestions that were too big for a comment:
You can't see data transfers in the timeline because tracing of Send/Recv is currently turned off; some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly build that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, which shows second-level timestamps (see here about higher granularity).
So with vlog perhaps you can use messages generated by this line to see the times at which Send ops are generated.
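For example, the vlog workflow might look like this (a sketch; inception_distributed_train.py is the script mentioned in the steps above, and the grep pattern is just a guess at the relevant log lines):

```shell
# Enable level-1 verbose logging; TensorFlow writes it to stderr.
export TF_CPP_MIN_VLOG_LEVEL=1
python inception_distributed_train.py 2> vlog.txt

# Afterwards, search the captured log for Send-op messages:
grep "Send" vlog.txt
```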