My TensorBoard log files grow huge because, it seems, every image summary ever generated is stored, even though TensorBoard only seems to let me look at the most recent image. And I only need the most recent image anyway.
Is there a way to let TensorBoard know that I only need the latest image? I looked at the SummaryWriter API docs but there is no obvious flag.
Hi, I work on TensorBoard. To the best of my knowledge, the logs are append-only. However, when TensorBoard loads them into memory, it uses reservoir sampling so they don't consume all your memory. In the future, we may implement a system that does reservoir sampling during the writing phase, or possibly a tool for compressing logs so they only contain what TensorBoard needs.
While the TensorBoard image dashboard only shows the most recent image at the moment, we'd be hesitant to write tools to remove the previous ones, since we may extend the dashboard to show more than the most recent sample.
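For anyone unfamiliar with the term: reservoir sampling keeps a fixed-size, uniformly random sample of a stream whose length is not known in advance. The sketch below is the textbook "Algorithm R", not TensorBoard's actual implementation; the function name, the `capacity` parameter, and the stream of 10,000 items are all illustrative assumptions.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Keep a uniform random sample of at most `capacity` items from `stream`."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < capacity:
            # Fill the reservoir until it is full.
            reservoir.append(item)
        else:
            # Keep the new item with probability capacity / (index + 1),
            # overwriting a uniformly chosen existing entry.
            slot = rng.randint(0, index)
            if slot < capacity:
                reservoir[slot] = item
    return reservoir


# 10,000 hypothetical summaries pass through, but only 10 are held in memory.
kept = reservoir_sample(range(10000), capacity=10, rng=random.Random(0))
```

Memory use depends only on `capacity`, which is why this bounds what is held in memory but does nothing about the size of the append-only log on disk.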
Related
I've used with tf.name_scope(...) in my program to separate out a few different parts of the graph.
This works nicely in the TensorBoard graph view, where I can view these functions as subgraphs. However, what I'd like to do is profile how much time (in total) was spent in each of these functions.
I've looked at the TraceViewer tool (https://www.tensorflow.org/guide/profiler#trace_viewer). I can zoom right in and see individual operations; each one has a long name that includes the name_scopes I have defined. I can see the "Wall duration" for each in the details pane.
But I can't see a way to get a summary of total CPU time spent per name_scope or any kind of summary of aggregate times (this is the sort of thing I'd expect to get from a more typical profiling tool in Python, time spent per function). Is there any way to get such a summary?
(I should say that my program is a TF-compiled state space model, not a Keras model. I've exported my trace "manually" with tf.summary.trace_export. So I can't see some of the profiling outputs that are shown on the TensorBoard docs pages. I can only see the Profile (Trace Viewer) and Graph pages in TensorBoard.)
Many thanks in advance for any advice.
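One possible workaround, sketched under assumptions: the Trace Viewer reads Chrome trace-event JSON, in which complete events carry "ph": "X", a name, and a duration "dur" in microseconds. If the exported trace can be obtained as JSON of that shape, the durations can be summed per leading name-scope component. The `total_time_per_scope` helper, the sample events, and the assumption that each event name is a "/"-separated path starting with the name scope are all illustrative; check them against a real trace before relying on this.

```python
from collections import defaultdict


def total_time_per_scope(events, depth=1):
    """Sum `dur` (microseconds) of complete events per leading name scope."""
    totals = defaultdict(float)
    for event in events:
        # Only complete events ("ph" == "X") carry a duration.
        if event.get("ph") != "X" or "dur" not in event:
            continue
        parts = event.get("name", "").split("/")
        scope = "/".join(parts[:depth])
        totals[scope] += event["dur"]
    return dict(totals)


# Hypothetical events; a real trace would typically be loaded with
# json.load(...) and read from its "traceEvents" list.
sample_events = [
    {"ph": "X", "name": "predict/MatMul", "dur": 120.0},
    {"ph": "X", "name": "predict/Add", "dur": 30.0},
    {"ph": "X", "name": "update/Cholesky", "dur": 400.0},
    {"ph": "M", "name": "process_name"},
]
totals = total_time_per_scope(sample_events)
```

Note that this sums wall durations, so ops that ran in parallel on different threads are counted separately and the totals can exceed elapsed time.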
I'm using the TensorFlow-GPU 1.8 API on Windows 10. For many projects I use tf.Estimator, which works really well. It takes care of a bunch of steps, including writing summaries for TensorBoard. But right now the 'events.out.tfevents' file is getting way too big and I am running into "out of space" errors. For that reason I want to disable summary writing, or at least reduce the number of summaries written.
While looking into this I found the RunConfig you can pass when constructing a tf.Estimator. Apparently the parameter 'save_summary_steps' (100 by default) controls how often summaries are written out. Unfortunately, changing this parameter seems to have no effect at all. Neither setting it to None (to disable summaries) nor choosing a higher value such as 3000 (to write fewer of them) changes the file size of 'events.out.tfevents'.
I hope you guys can help me out here. Any help is appreciated.
Cheers,
Tobs.
I've observed the following behavior. It doesn't make sense to me so I hope we get a better answer:
When the input_fn gets data from tf.data.TFRecordDataset, the number of steps between saving events is the minimum of save_summary_steps and (number of training examples divided by batch size). That means summaries are saved at least once per epoch.
When the input_fn gets data from tf.TextLineReader, it follows save_summary_steps as you'd expect and I can give it a large value for infrequent updates.
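To make the observed rule concrete, here is the arithmetic as a tiny helper. This only restates the behavior I reported above for TFRecordDataset; it is not taken from the Estimator source. The function name is made up, and using floor division for "examples divided by batch size" is my assumption.

```python
def observed_summary_interval(save_summary_steps, num_examples, batch_size):
    """Steps between summaries, per the TFRecordDataset behavior reported above."""
    # Assumption: a partial final batch does not count as a full step.
    steps_per_epoch = num_examples // batch_size
    return min(save_summary_steps, steps_per_epoch)


# Hypothetical run: 50,000 examples, batch size 100, save_summary_steps=3000.
interval = observed_summary_interval(
    save_summary_steps=3000, num_examples=50000, batch_size=100
)
```

With these numbers an epoch is 500 steps, so raising save_summary_steps above 500 would not reduce how often events are written.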
I am trying to train a DNN model using TensorFlow. My script has two variables, one a dense feature and one a sparse feature. Each minibatch pulls the full dense feature and pulls the specified sparse features using embedding_lookup_sparse; the feedforward pass can only begin after the sparse feature is ready. I run my script with 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There are no dependencies between the different parts of the trainable variables. I am not sure whether there is a bug or some limitation, such as TensorFlow only being able to queue 40 fan-out requests. Any ideas on how to debug it? Thanks in advance.
[Image: TensorFlow timeline profiling]
It sounds like you might have exactly 2 variables, one is stored at PS0 and the other at PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them at separate parameter servers.
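To illustrate the idea (this is plain Python, not TensorFlow code): partitioning splits a large variable along one axis into near-equal chunks, and each chunk can then live on a different parameter server. In TF 1.x a partitioner such as tf.fixed_size_partitioner can, as far as I know, be passed when creating the variable; the helper names, the "ps%d" labels, and the table size below are hypothetical, and the exact split TensorFlow produces may differ.

```python
def shard_sizes(num_rows, num_shards):
    """Split `num_rows` into `num_shards` near-equal contiguous chunks."""
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards take one additional row each.
    return [base + 1 if i < extra else base for i in range(num_shards)]


def assign_to_parameter_servers(num_rows, num_ps):
    """Map each hypothetical parameter server name to its (start, end) row range."""
    assignment = {}
    start = 0
    for i, size in enumerate(shard_sizes(num_rows, num_ps)):
        assignment["ps%d" % i] = (start, start + size)
        start += size
    return assignment


# A hypothetical 1,000,003-row embedding table spread over 20 parameter servers.
layout = assign_to_parameter_servers(1000003, 20)
```

With a layout like this, sparse lookups are spread over all 20 servers instead of hitting the single server that holds the whole variable.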
This is kind of a hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it in chrome://tracing).
The steps you have to perform are:
download the TensorFlow source and check out the correct branch (r0.12, for example)
modify the only place that calls the SetTimelineLabel method inside executor.cc
instead of only recording non-transferable nodes, you want to record Send/Recv nodes as well
be careful to call SetTimelineLabel once inside NodeDone, as it sets the text string of a node, which will be parsed later by a Python script
build TensorFlow from the modified source
modify the model code (for example, inception_distributed_train.py) to use Timeline and the graph metadata correctly
Then you can run the training and retrieve JSON file once for each iteration! :)
Some suggestions that were too big for a comment:
You can't see data transfer in the timeline because tracing of Send/Recv is currently turned off; there is some discussion here: https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly build that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, and it shows second-level timestamps (see here about higher granularity).
So with vlog perhaps you can use messages generated by this line to see the times at which Send ops are generated.
I am training convnets with Tensorflow and skflow, on an EC2 instance I share with other people. For all of us to be able to work at the same time, I'd like to limit the fraction of available GPU memory which is allocated.
This question does it with TensorFlow, but since I'm using skflow I never use a tf.Session().
Is it possible to do the same thing through skflow?
At the moment, you can only control the number of cores (num_cores) used in estimators, by passing this parameter to the estimator.
One can add gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) to tf.ConfigProto, as suggested by the question you linked, to achieve what you need.
Feel free to submit a PR to make changes here, as well as adding these additional parameters to all estimators. Otherwise, I'll make the changes some time this week.
Edit:
I have made the changes to allow those options. Please check the "Building A Model Using Different GPU Configurations" example in the examples folder. Let me know if there's any particular need or other options you want to add. Pull requests are always welcome!
I have been using PyMC in an analysis of some high energy physics data. It has worked to perfection, the analysis is complete, and we are working on the paper.
I have a small problem, however. I ran the sampler with the RAM database backend. The traces have been sitting around in memory in an IPython kernel process for a couple of months now. The problem is that the workstation support staff want to perform a kernel upgrade and reboot that workstation. This will cause me to lose the traces. I would like to keep these traces (as opposed to just generating new ones), since they are what I've made all the plots with. I'd also like to include a portion of the traces (only the parameters of interest) as supplemental material with the publication.
Is it possible to take an existing chain in a pymc.MCMC object created with the RAM backend, change to a different backend, and write out the traces in the chain?
The trace values are stored as NumPy arrays, so you can use numpy.savetxt to send the values of each parameter to a file. (This is what the text backend does under the hood.)
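A minimal sketch of that, assuming you first collect the arrays into a dict keyed by parameter name (with PyMC 2 I believe something like mcmc.trace("mass")[:] returns the array, but check this against your version). The `save_traces` helper, the parameter names, and the numbers are made up for illustration.

```python
import os
import tempfile

import numpy as np


def save_traces(traces, out_dir):
    """Write each parameter's trace to `<out_dir>/<name>.txt` with numpy.savetxt."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for name, values in traces.items():
        path = os.path.join(out_dir, name + ".txt")
        # savetxt handles 1-D and 2-D arrays; higher-rank traces need reshaping.
        np.savetxt(path, np.asarray(values))
        paths[name] = path
    return paths


# Stand-in arrays; in the real session these would come from the sampler.
example_traces = {
    "mass": np.array([125.1, 125.3, 124.9]),
    "width": np.array([0.004, 0.0041, 0.0039]),
}
out_dir = tempfile.mkdtemp()
saved = save_traces(example_traces, out_dir)

# Reading a file back confirms the values survive the round trip.
restored = np.loadtxt(saved["mass"])
```

Passing only the parameters of interest in the dict gives you the subset you want for the supplemental material.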
While saving your current traces is a good idea, I'd suggest taking the time to make your analysis repeatable before publishing.