Does TensorFlow support to save the initial hyper-parameter configuration automatically? - tensorflow

We need to run the designed networks many times for better performance and it would be better to record our experiments we have run. Maybe it could be good to provide to record these hyper-parameter configuration automatically by the tensorflow execution engine. For example, I record by set different directory name for the log directory as:
log_lr_0.001_theta_0.1_alpha_0.1
log_lr_0.01_theta_0.01_alpha_0.02
....
Are there any automatic ways to help this? In addition, it would be better that when we start a new tensorflow training instance, a new port will be allocated and a new tensorboard is started and shows its learning state.

No, tensorflow doesn't support initial hyper parameter configuration automatically.
I've faced the same issue as you, and I'm using a tool called Sacred, I hope you'd find that useful.

Related

Is session really needed in tensorflow2?

I'm so confused why it remains keeping the session in tf2 as said by the official eager mode has so many beneficial. Also sometimes I'm not sure whether to use session or not, and keep making bugs in tf programming, sometimes adding session just trying luck.
Tensorflow 2 does not require session.
Every v1.Session.run call should be replaced by a Python function.
The feed_dict and v1.placeholders become function arguments.
The fetches become the function's return value.
During conversion eager execution allows easy debugging with standard Python tools like pdb.
For more details take a look at Tensorflow migration guideline.

What Tensorflow API to use for Seq2Seq

This year Google produced 5 different packages for seq2seq:
seq2seq (claimed to be general purpose but
inactive)
nmt (active but supposed to be just
about NMT probably)
legacy_seq2seq
(clearly legacy)
contrib/seq2seq
(not complete probably)
tensor2tensor (similar purpose, also
active development)
Which package is actually worth to use for the implementation? It seems they are all different approaches but none of them stable enough.
I've had too a headache about some issue, which framework to choose? I want to implement OCR using Encoder-Decoder with attention. I've been trying to implement it using legacy_seq2seq (it was main library that time), but it was hard to understand all that process, for sure it should not be used any more.
https://github.com/google/seq2seq: for me it looks like trying to making a command line training script with not writing own code. If you want to learn Translation model, this should work but in other case it may not (like for my OCR), because there is not enough of documentation and too little number of users
https://github.com/tensorflow/tensor2tensor: this is very similar to above implementation but it is maintained and you can add more of own code for ex. reading own dataset. The basic usage is again Translation. But it also enable such task like Image Caption, which is nice. So if you want to try ready to use library and your problem is txt->txt or image->txt then you could try this. It should also work for OCR. I'm just not sure it there is enough documentation for each case (like using CNN at feature extractor)
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/seq2seq: apart from above, this is just pure library, which can be useful when you want to create a seq2seq by yourself using TF. It have a function to add Attention, Sequence Loss etc. In my case I chose that option as then I have much more freedom of choosing the each step of framework. I can choose CNN architecture, RNN cell type, Bi or Uni RNN, type of decoder etc. But then you will need to spend some time to get familiar with all the idea behind it.
https://github.com/tensorflow/nmt : another translation framework, based on tf.contrib.seq2seq library
From my perspective you have two option:
If you want to check the idea very fast and be sure that you are using very efficient code, use tensor2tensor library. It should help you to get early results or even very good final model.
If you want to make a research, not being sure how exactly the pipeline should look like or want to learn about idea of seq2seq, use library from tf.contrib.seq2seq.

tensorflow one of 20 parameter server is very slow

I am trying to train DNN model using tensorflow, my script have two variables, one is dense feature and one is sparse feature, each minibatch will pull full dense feature and pull specified sparse feature using embedding_lookup_sparse, feedforward could only begin after sparse feature is ready. I run my script using 20 parameter servers and increasing worker count did not scale out. So I profiled my job using tensorflow timeline and found one of 20 parameter server is very slow compared to the other 19. there is not dependency between different part of all the trainable variables. I am not sure if there is any bug or any limitation issues like tensorflow can only queue 40 fan out requests, any idea to debug it? Thanks in advance.
tensorflow timeline profiling
It sounds like you might have exactly 2 variables, one is stored at PS0 and the other at PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them at separate parameter servers.
This is kind of a hack way to log Send/Recv timings from Timeline object for each iteration, but it works pretty well in terms of analyzing JSON dumped data (compared to visualize it on chrome://trace).
The steps you have to perform are:
download TensorFlow source and checkout a correct branch (r0.12 for example)
modify the only place that calls SetTimelineLabel method inside executor.cc
instead of only recording non-transferable nodes, you want to record Send/Recv nodes also.
be careful to call SetTimelineLabel once inside NodeDone as it would set the text string of a node, which will be parsed later from a python script
build TensorFlow from modified source
modify model codes (for example, inception_distributed_train.py) with correct way of using Timeline and graph meta-data
Then you can run the training and retrieve JSON file once for each iteration! :)
Some suggestions that were too big for a comment:
You can't see data transfer in timeline that's because the tracing of Send/Recv is currently turned off, some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (nightly which is 5 days old or newer) you can turn on verbose logging by doing export TF_CPP_MIN_VLOG_LEVEL=1 and it shows second level timestamps (see here about higher granularity).
So with vlog perhaps you can use messages generated by this line to see the times at which Send ops are generated.

Sharing tensorflow model between processes

I have several processes running on a server, one of which is training a model in tensorflow. Periodically, I want the trainer to send the current model to the other processes. The way I do this now is with the usual Saver class that can save to and restore from disk.
However, I think this form of IPC is rather inefficient, and it is potentially causing filesystem lockups on the server. If there were a way to serialize the variables into some blob I could send that over a zmq broadcast pipe, I but I haven't found this in the docs.
Alternatively, distributed tensorflow is probably up to the task, but I don't think I need something so complicated.
You could preshare the architecture then use tf.get_collection(tf.GraphKeys.VARIABLES) at every number of steps of your liking, and run it to get the values, then you can use variable.assign at the other end to load up the values in the appropriate variables.

Limit GPU memory allocation in skflow

I am training convnets with Tensorflow and skflow, on an EC2 instance I share with other people. For all of us to be able to work at the same time, I'd like to limit the fraction of available GPU memory which is allocated.
This question does it with Tensorflow, but since I'm using sklfow I'm never using a tf.Session().
Is it possible to do the same thing through skflow ?
At this moment, you can only control the number of cores (num_cores) to be used in estimators by passing this parameter to estimator.
One can add gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) to tf.ConfigProto as suggested by this question you linked to achieve what you need.
Feel free to submit a PR to make changes here as well as adding this additional parameters to all estimators. Otherwise, I'll make the changes some time this week.
Edit:
I have made the changes to allow those options. Please check "Building A Model Using Different GPU Configurations" example in examples folder. Let me know if there's any particular need or other options you want to add. Pull requests are always welcomed!