How to write summaries for multiple runs in TensorFlow

If you look at the Tensorboard dashboard for the cifar10 demo, it shows data for multiple runs. I am having trouble finding a good example showing how to set the graph up to output data in this fashion. I am currently doing something similar to this, but it seems to be combining data from runs and whenever a new run starts I see the warning on the console:
WARNING:root:Found more than one graph event per run.Overwritting the graph with the newest event

The solution turned out to be simple (and probably a bit obvious), but I'll answer anyway. The writer is instantiated like this:
writer = tf.train.SummaryWriter(FLAGS.log_dir, sess.graph_def)
The events for the current run are written to the specified directory. Instead of having a fixed value for the logdir parameter, just set a variable that gets updated for each run and use that as the name of a sub-directory inside the log directory:
writer = tf.train.SummaryWriter('%s/%s' % (FLAGS.log_dir, run_var), sess.graph_def)
Then just specify the root log_dir location when starting tensorboard via the --logdir parameter.
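For example, a minimal sketch of that pattern (the timestamp-based run name and the log directory path are just illustrative choices, using the old tf.train.SummaryWriter API as above):
import time
import tensorflow as tf

log_dir = '/tmp/my_logs'                      # hypothetical root log directory
run_var = time.strftime('run_%Y%m%d_%H%M%S')  # per-run identifier, here a timestamp

with tf.Session() as sess:
    # ... build the graph and summaries here ...
    # Each run writes its events into its own subdirectory of log_dir.
    writer = tf.train.SummaryWriter('%s/%s' % (log_dir, run_var), sess.graph_def)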

As mentioned in the documentation, you can specify multiple log directories when running tensorboard. Alternatively, you can create multiple run subfolders in the log directory to visualize the different runs on the same plots.

Related

Amazon SageMaker notebook rl_deepracer_coach_robomaker - Write log CSV on S3 after simulation

I created my first notebook instance on Amazon SageMaker.
Next, I opened the Jupyter notebook and used the SageMaker example rl_deepracer_coach_robomaker.ipynb from the Reinforcement Learning section. The question is addressed mainly to those who are familiar with this notebook.
There you can launch a training process and a RoboMaker simulation application to start the learning process for an autonomous car.
When a simulation job is launched, one can access the log file, which is displayed by default in the CloudWatch console. Some of the information that appears in the log file can be modified in the script deepracer_env.py in the /src/robomaker/environments subdirectory.
I would like to "bypass" the CloudWatch console, saving the log file information such as episode, total reward, number of steps, coordinates of the car, steering and throttle, etc. in a dataframe or CSV file to be written somewhere on S3 at the end of the simulation.
Something similar has been done in the main notebook rl_deepracer_coach_robomaker.ipynb to plot the metrics for a training job, namely the training reward per episode. There one can see that
csv_file_name = "worker_0.simple_rl_graph.main_level.main_level.agent_0.csv"
is read from S3, but I simply cannot find where this CSV is generated so that I can mimic the process.
You can create a csv file in the /opt/ml/output/intermediate/ folder, and the file will be saved in the following directory:
s3://<s3_bucket>/<s3_prefix>/output/intermediate/<csv_file_name>
However, it is not clear to me where exactly you will create such a file. The DeepRacer notebook uses two machines, one for training (the SageMaker instance) and one for simulations (the RoboMaker instance). The above method will only work on the SageMaker instance, but much of what you would like to log (such as the total reward in an episode) actually lives on the RoboMaker instance. For RoboMaker instances, the intermediate folder feature doesn't exist, and you'll have to save the file to S3 yourself using the boto library. Here is an example of doing that: https://qiita.com/hengsokvisal/items/329924dd9e3f65dd48e7
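A minimal sketch of that approach with boto3 (the bucket name, prefix and file path below are hypothetical placeholders):
import boto3

s3_bucket = 'my-deepracer-bucket'       # hypothetical bucket name
s3_prefix = 'deepracer-logs'            # hypothetical prefix
local_csv = '/tmp/episode_metrics.csv'  # CSV written locally on the RoboMaker instance

# Upload the locally written CSV to S3 at the end of the simulation.
s3 = boto3.client('s3')
s3.upload_file(local_csv, s3_bucket, '%s/episode_metrics.csv' % s3_prefix)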
There is a way to download the CloudWatch logs to a file. This way you can just save the logs and parse them yourself. Assuming you are executing from a notebook cell:
import json

STREAM_NAME = <your stream name as given by RoboMaker CloudWatch logs>
task = !aws logs create-export-task --task-name "copy_deepracer_logs" --log-group-name "/aws/robomaker/SimulationJobs" --log-stream-name-prefix $STREAM_NAME --destination "<s3_bucket>" --destination-prefix "<s3_prefix>" --from <unix timestamp in milliseconds> --to <unix timestamp in milliseconds>
task_id = json.loads(''.join(task))['taskId']
The export is an asynchronous call, so give it a few minutes to complete. If you can print the task_id, the export task was created successfully.
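If you want to check explicitly whether the export has finished, the task status can be polled from the same notebook (a sketch; describe-export-tasks is the standard CloudWatch Logs call for this):
status = !aws logs describe-export-tasks --task-id $task_id
status_code = json.loads(''.join(status))['exportTasks'][0]['status']['code']
print(status_code)  # e.g. PENDING, RUNNING or COMPLETED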

Filename of graph .meta in a TensorFlow checkpoint directory

How do I get the filename of the latest checkpoint's graph?
In the examples for tf.train.import_meta_graph I typically see the filename hard-coded to be something like checkpoint_dir + 'model.ckpt-1234.meta'.
After importing, .restore can load the latest training parameters as in:
saver.restore(session, tf.train.latest_checkpoint(my_dir))
However, what is a reliable way to get the graph's filename? In my case,
tf.train.import_meta_graph(tf.train.latest_checkpoint(my_dir) + '.meta')
should do the job, but I don't think it's reliable, as checkpoints don't necessarily save the metagraph every time, correct?
I can write a routine that looks through the checkpoint dir and walks back until I find a .meta. But is there a better/built-in way, such as a tf.train.latest_metagraph(my_dir)?
I just found an article where someone implemented the routine I mentioned in my question. This isn't an "answer", but is what I'll use assuming there isn't a built-in solution.
This is from the very nice article at seaandsalor written by João Felipe Santos.
I don't know the usage rules, so I didn't want to directly link to his code. Follow that link if interested and then go to the gist mentioned at the bottom.
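For reference, a minimal sketch of such a routine (not the linked code, just an illustration of the idea: pick the .meta file with the highest global step in the checkpoint directory):
import glob
import os
import re

def latest_metagraph(checkpoint_dir):
    # Collect all .meta files and pick the one with the highest global step
    # encoded in its filename, e.g. model.ckpt-1234.meta -> 1234.
    meta_files = glob.glob(os.path.join(checkpoint_dir, '*.meta'))
    if not meta_files:
        return None
    def step(path):
        match = re.search(r'-(\d+)\.meta$', path)
        return int(match.group(1)) if match else -1
    return max(meta_files, key=step)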

How to create an op like conv_ops in tensorflow?

What I'm trying to do
I'm new to C++ and Bazel, and I want to make some changes to the convolution operation in TensorFlow, so I decided that my first step is to create an op just like it.
What I have done
I copied conv_ops.cc from //tensorflow/core/kernels and changed the name of the op registered in my new_conv_ops.cc. I also changed the names of some functions in the file to avoid duplication. Here is my BUILD file.
As you can see, I copied the deps attribute of conv_ops from //tensorflow/core/kernels/BUILD. Then I used "bazel build -c opt //tensorflow/core/user_ops:new_conv_ops.so" to build the new op.
What my problem is
Then I got this error.
I tried deleting bounds_check and got the same error for the next dependency. Then I realized there is some problem with including header files from //tensorflow/core/kernels in //tensorflow/core/user_ops. So how can I properly create a new op exactly like conv_ops?
Adding a custom operation to TensorFlow is covered in the tutorial here. You can also look at actual code examples.
To address your specific problem, note that the tf_custom_op_library macro adds most of the necessary dependencies to your target. You can simply write the following:
tf_custom_op_library(
    name = "new_conv_ops.so",
    srcs = ["new_conv_ops.cc"],
)
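Once the target builds, the resulting library can be loaded from Python with tf.load_op_library (a sketch; the exact path under bazel-bin depends on your build output, and the generated op name depends on how you registered it):
import tensorflow as tf

# Load the compiled custom-op library produced by the tf_custom_op_library target.
new_conv_module = tf.load_op_library(
    'bazel-bin/tensorflow/core/user_ops/new_conv_ops.so')
# The registered op is then exposed as a Python function on the module,
# e.g. new_conv_module.new_conv2d(...) if the op was registered as "NewConv2D".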

Tensorflow: checkpoints simple load

I have a checkpoint file:
checkpoint-20001 checkpoint-20001.meta
How do I extract the variables from these files, without having to load the previous model, start a session, etc.?
I want to do something like
cp = load(checkpoint-20001)
cp.var_a
It's not documented, but you can inspect the contents of a checkpoint from Python using the class tf.train.NewCheckpointReader.
Here's a test case that uses it, so you can see how the class works.
https://github.com/tensorflow/tensorflow/blob/861644c0bcae5d56f7b3f439696eefa6df8580ec/tensorflow/python/training/saver_test.py#L1203
Since it isn't a documented class, its API may change in the future.
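A minimal sketch of that usage, with the checkpoint prefix from the question and the hypothetical variable name 'var_a':
import tensorflow as tf

# Open the checkpoint by its prefix (no .meta suffix needed).
reader = tf.train.NewCheckpointReader('checkpoint-20001')

# List the variables stored in the checkpoint and their shapes.
print(reader.get_variable_to_shape_map())

# Read a single variable as a numpy array ('var_a' is the hypothetical name from the question).
var_a = reader.get_tensor('var_a')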

Loop over file names in sub job (Kettle job)

The task is to get file names from the folder and then loop the same task (job) over all the files one by one.
I created a simple job with a transformation (get file names) and then a job with the flag "Execute for each row" (for now it just logs the name of the file).
Did it the same way it is described here: http://ramathoughts.blogspot.ch/2010/08/processing-group-of-files-with-kettle.html
However, the path of the received files is not passed to the sub-job (the logging doesn't display the variable's value). The sub-job is, however, executed as many times as there are files in the input folder. So it looks like the data is passed to some extent, but for some reason it is not available as a variable.
Image with log details; as seen, the variable is displayed as ${path} instead of the value of the path:
http://i.imgur.com/pK1iHtl.png?1
The sample code is below as an archive with the jobs and transformation, plus sample input files. Any help is appreciated, as I may be missing something simple here: https://www.hightail.com/download/bXBhL0dNcklCMTVsQXNUQw
The issue is that the second job (i.e. j_log_file_names.kjb) is unable to detect the parameter path. Just try defining the parameter in this job, as in the image below:
This will make sure that the parameter coming from the previous step is correctly fetched into the job. The rest of your job looks absolutely fine.
Hope this helps :)