Not enough disk space when loading dataset with TFDS - tensorflow

I was implementing a DCGAN application based on the lsun-bedroom dataset and planned to use tfds, since lsun is in its catalog. Because the full dataset contains 42.7 GB of images, I only wanted to load a portion (10%) of the data, so I used the following code as described in the manual. Unfortunately, I still got the error below saying there is not enough disk space. Is there a solution with tfds, or should I use another API to load the data?
tfds.load('lsun/bedroom', split='train[:10%]')
Not enough disk space. Needed: 42.77 GiB (download: 42.77 GiB, generated: Unknown size)
I was testing on Google Colab.

TFDS downloads datasets from the original authors' websites. As datasets are often published as monolithic archives (e.g. lsun.zip), it is unfortunately impossible for TFDS to download/install only part of a dataset.
The split argument only filters the dataset after it has been fully generated. Note: you can see the download size of each dataset in the catalog: https://www.tensorflow.org/datasets/catalog/overview
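For illustration, here is a minimal sketch (not an official workaround) of checking a dataset's sizes through the builder before triggering the download; the attribute names come from the tfds DatasetInfo API, and the slice is only an example:

import tensorflow_datasets as tfds

builder = tfds.builder('lsun/bedroom')
print('download size:', builder.info.download_size)   # size of the full archive to fetch
print('generated size:', builder.info.dataset_size)   # may be unknown before generation

# Note: this still downloads and generates the *entire* dataset; the slice
# below is applied only when the examples are read back.
builder.download_and_prepare()
ds = builder.as_dataset(split='train[:10%]')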

To me, there seems to be some kind of issue or, at least, a misunderstanding about the 'split' argument of tfds.load().
'split' seems to be intended to load a given portion of the dataset once the whole dataset has been downloaded.
I got the same error message when downloading the dataset called "librispeech": whatever value I pass for 'split', the whole dataset is downloaded, and it is too big for my disk.
I did manage to download the much smaller "mnist" dataset, but I found that both the train and test splits were downloaded even when setting 'split' to 'test'.

Related

Is there a way to keep a Tensorflow record file in memory?

Here is the situation: I am working with a large TensorFlow record file. It's about 50 GB. However, the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file you would think that it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class, and it seems like the whole TFRecord system is designed specifically not to do that; I don't see any way to force it to keep the records in memory. And since it reloads them every epoch, I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...
You need the tf.data.Dataset.cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't pass it any arguments:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you do some postprocessing of these records, e.g. with dataset.map(...), it can be even more beneficial to put the cache() operation at the end of the input pipeline, as in the sketch below.
More information can be found in the "Map and Cache" section of the Input Pipeline Performance Guide.
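For example, a rough sketch of that ordering, where parse_example stands in for whatever (hypothetical) decoding function you apply to the records:

import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_example)  # expensive per-record work (parse_example is assumed)
dataset = dataset.cache()             # cache the already-parsed records in memory
dataset = dataset.shuffle(10000).batch(32).repeat()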

What is the concept of CNTKTextFormatDeserializer and why use?

I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the doc I found didn't explain the big picture (what it is and why to use it); it just went into its syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include:
randomization: SGD generalizes better when the data are presented to it in random order. The reader can randomize the data for you, with shuffling happening on the fly.
distributed training: for distributed training the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
memory budget issues: the reader does not load the whole training file into memory.
language-agnostic I/O: the reader provides a cross-platform way to read data (if you want to always be in Python you might not care about this, but others do).
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
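As a rough illustration of using the text format deserializer from Python (the file name, stream names 'x'/'y', and shapes here are assumptions, not taken from the question):

from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

# Map the fields in the CTF file to named input streams.
deserializer = CTFDeserializer('train.ctf', StreamDefs(
    features=StreamDef(field='x', shape=784),
    labels=StreamDef(field='y', shape=10)))

# The reader shuffles on the fly and streams from disk instead of loading
# the whole 2.7 GB file into memory.
reader = MinibatchSource(deserializer, randomize=True)
minibatch = reader.next_minibatch(64)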

tensorflow: one of 20 parameter servers is very slow

I am trying to train a DNN model using TensorFlow. My script has two variables, one holding a dense feature and one holding a sparse feature; each minibatch pulls the full dense feature and the specified sparse rows using embedding_lookup_sparse, and the feedforward pass can only begin after the sparse feature is ready. I run my script with 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job with the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19, even though there is no dependency between different parts of the trainable variables. I am not sure whether this is a bug or some limitation (e.g. can TensorFlow only queue 40 fan-out requests?). Any idea how to debug it? Thanks in advance.
(screenshot: TensorFlow timeline profiling)
It sounds like you might have exactly 2 variables, one stored on PS0 and the other on PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them on separate parameter servers.
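Here is a hedged sketch of what that suggestion could look like with TF1-era APIs; the variable name, shapes, and the sparse placeholder are illustrative only:

import tensorflow as tf

with tf.device(tf.train.replica_device_setter(ps_tasks=20)):
    # Split one large embedding variable into 20 shards, one per PS task.
    embeddings = tf.get_variable(
        'embeddings',
        shape=[10000000, 64],
        partitioner=tf.fixed_size_partitioner(num_shards=20))

sp_ids = tf.sparse_placeholder(tf.int64)  # sparse feature ids fed per minibatch (assumed)
# The lookup now fans out across all shards instead of hitting a single PS.
output = tf.nn.embedding_lookup_sparse(embeddings, sp_ids, None, combiner='sum')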
This is kind of a hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it in chrome://trace).
The steps you have to perform are:
download the TensorFlow source and check out the correct branch (r0.12 for example)
modify the only place that calls the SetTimelineLabel method inside executor.cc
instead of recording only non-transferable nodes, you also want to record Send/Recv nodes.
be careful to call SetTimelineLabel only once inside NodeDone, as it sets the text string of a node, which will be parsed later by a Python script
build TensorFlow from the modified source
modify the model code (for example, inception_distributed_train.py) to use Timeline and graph metadata in the correct way, as in the sketch below
Then you can run the training and retrieve a JSON file for each iteration! :)
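A minimal sketch of the Timeline usage mentioned in the last step (train_op and num_steps stand in for your model's training op and loop length):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        sess.run(train_op, options=run_options, run_metadata=run_metadata)
        # Dump one Chrome-trace JSON file per iteration for offline analysis.
        trace = timeline.Timeline(step_stats=run_metadata.step_stats)
        with open('timeline_%d.json' % step, 'w') as f:
            f.write(trace.generate_chrome_trace_format())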
Some suggestions that were too big for a comment:
You can't see data transfers in the timeline because the tracing of Send/Recv is currently turned off; there is some discussion here -- https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, which shows second-level timestamps (see here about higher granularity).
So with vlog you can perhaps use the messages generated by this line to see the times at which Send ops are issued.

Remove image outputs from TensorBoard

My TensorBoard log files grow huge because, it seems, every image summary ever generated is stored. Yet in TensorBoard I can apparently only look at the most recent image, and the most recent image is all I need anyway.
Is there a way to let TensorBoard know that I only need the latest image? I looked at the SummaryWriter API docs but there is no obvious flag.
Hi, I work on TensorBoard. To the best of my knowledge, the logs are append-only. However, when TensorBoard loads them into memory, it uses reservoir sampling so they don't consume all your memory. In the future we may implement a system that applies reservoir sampling during the writing phase, or possibly a tool for compressing logs so they only contain what TensorBoard needs.
While the TensorBoard image dashboard only shows the most recent image at the moment, we'd be hesitant to write tools that remove the previous ones, since we may extend the dashboard to show more than the most recent sample.

Applying XGBOOST with large data set

I have a large dataset, approximately 5.3 GB, and I have stored the data using the bigmemory package in R. Please let me know how to apply XGBoost to this kind of data.
There is currently no support for this in xgboost. You could file an issue on the GitHub repo with respect to the R package.
Otherwise, you could try having it read your data from a file: the docs say you can point it to a local data file. I'm not sure about format restrictions or how it will be handled in RAM, but it's something to explore.