Tensorflow Datasets - what's the point?

I'm hoping someone can guide me in the right direction.
I am trying to feed input variables (features) and labels to tf.estimator.DNNClassifier, and it keeps recommending that I use TensorFlow Datasets instead of reading the data from a pandas DataFrame (using tf.estimator.inputs.pandas_input_fn()).
The issue is, I need to read my CSV file into a DataFrame first so I can make a lot of transformations before feeding the data into the DNN. As I understand it from this blog post, the TensorFlow dataset wants to read the data from a CSV file, for reasons that make sense.
So will I then have to write the transformed data to another CSV just so I can re-import it into a TensorFlow dataset? That doesn't make any sense. Is there a good guide that I can read? I'm frustrated.
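For what it's worth, a tf.data pipeline doesn't have to start from a CSV file: you can do all the transformations in pandas and then build a dataset from the in-memory DataFrame with tf.data.Dataset.from_tensor_slices. A minimal sketch (the file name, label column, and DNNClassifier settings below are placeholders, not taken from the question):

import pandas as pd
import tensorflow as tf

df = pd.read_csv("data.csv")      # read once, then transform with pandas as usual
# ... pandas transformations here ...
labels = df.pop("label")          # assumes the label column is called "label"

def input_fn():
    # Build the dataset directly from the transformed, in-memory DataFrame
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    return ds.shuffle(1000).batch(32).repeat()

feature_cols = [tf.feature_column.numeric_column(c) for c in df.columns]
classifier = tf.estimator.DNNClassifier(feature_columns=feature_cols,
                                        hidden_units=[32, 16],
                                        n_classes=2)
classifier.train(input_fn=input_fn, steps=1000)

So there is no need to write the transformed data back to a CSV just to get it into a dataset.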

Related

CSV dataset does not fit in memory with tensorflow

With TensorFlow 2.3, my input/output data comes from a CSV file (floating-point numbers). Unfortunately, my dataset does not fit in memory. It is probably possible to avoid loading the whole dataset into memory by, for example, splitting it.
I have already done a lot of research, including StackOverflow questions, but the answers are not always clear or refer to previous versions of TensorFlow.
how to fit tensorflow dataset
Dataset does not fit in memory
https://www.tensorflow.org/tutorials/load_data/csv
https://medium.com/@mrgarg.rajat/training-on-large-datasets-that-dont-fit-in-memory-in-keras-60a974785d71
If someone could provide a simple starter example ...
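One way to stream a CSV that does not fit in memory is tf.data.experimental.make_csv_dataset, which reads and batches records lazily from disk. A rough sketch, assuming a file train.csv with a label column named "target" (both names are placeholders):

import tensorflow as tf

# Batches are read lazily from disk; the whole file is never loaded into memory
dataset = tf.data.experimental.make_csv_dataset(
    "train.csv",              # placeholder path
    batch_size=64,
    label_name="target",      # placeholder label column name
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=10000)

dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Each element is a (features_dict, labels) pair and can be fed to model.fit directly:
# model.fit(dataset, epochs=10)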

Huge size of TF records file to store on Google Cloud

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here the COCO dataset is downloaded and its features are first extracted using the InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Within the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights, as in the existing code.
Now, since TPU needs data to be stored on Google Cloud storage, I created a tf records file using tf.Example with the help of this link.
Now, I tried to create this file in several ways so that it would hold the data the TPU can read through TFRecordDataset.
At first I added the image data and the image path directly to the file and uploaded it to a GCP bucket, but while reading the data back I realized that the raw image data was not useful: it does not contain the shape/size information that will be needed, and I had not resized the images to the required dimensions before storage. That file was about 2.5 GB, which was okay.
Then I thought I would keep only the image paths in the cloud, so I created another TFRecords file containing just the paths. But that did not seem like an optimized pipeline either, since the TPU would have to open each image individually, resize it to 299x299 and then feed it to the model; it would be better to have the image data available through the .map() function of TFRecordDataset. So I tried again, this time following this link, and stored the R, G and B channels along with the image path inside the TFRecords file.
However, now the TFRecords file is abnormally large, around 40-45 GB, and I ultimately stopped the execution because memory was filling up on the Google Colab TPU.
The original COCO dataset is not that large, roughly 13 GB, and the TFRecords file is being created from only the first 30,000 records, so 40 GB looks like a weird number.
May I know what the problem is with this way of storing features? Is there a better way to store image data in a TFRecords file and then extract it through TFRecordDataset?
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression, they represent data as protobufs so it can be optimally loaded into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!
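The 40-45 GB is most likely caused by storing the decoded R/G/B arrays, which are uncompressed; storing resized, JPEG-encoded bytes keeps the records small and still lets the .map() function decode them on the fly. A sketch of that idea (the feature keys, file name, and image_paths list are assumptions, and 299x299 matches the InceptionV3 input mentioned above):

import tensorflow as tf

def serialize_image(path):
    # Resize once up front and store compressed JPEG bytes instead of raw RGB arrays
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, [299, 299])
    jpeg = tf.io.encode_jpeg(tf.cast(img, tf.uint8))
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg.numpy()])),
        "path": tf.train.Feature(bytes_list=tf.train.BytesList(value=[path.encode()])),
    })).SerializeToString()

with tf.io.TFRecordWriter("coco_subset.tfrecord") as writer:   # placeholder file name
    for path in image_paths:                                   # your 30,000 image paths
        writer.write(serialize_image(path))

def parse(record):
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
    })
    img = tf.io.decode_jpeg(feats["image"], channels=3)
    return tf.cast(img, tf.float32)

ds = tf.data.TFRecordDataset("coco_subset.tfrecord").map(parse).batch(64)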

Is there a way to directly analyze and manipulate tensorboard "events.*" log file?

Hi everyone. TensorBoard is a wonderful tool for visualizing the learning process, but it has one inconvenience for me. Sometimes I want to remove part of a learning curve (for example, erase all scalars after the X-th step). However, all scalars are put together in a single "events.*" file, and to the best of my knowledge TensorBoard only provides a high-level API for adding data, not for removing it. Does anyone have any ideas about this? Thanks!
You can view the contents of a checkpoint file using print_tensors_in_checkpoint_file(),
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file
print_tensors_in_checkpoint_file(file_name="<your_path_here>/model.ckpt-<whatever_step>", tensor_name='', all_tensors=True)
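If the goal is specifically to drop scalars after a given step, one possible approach (a sketch only; the log directory names and cutoff step are placeholders, and it assumes TF 2.x summary writing) is to read the events file with TensorBoard's EventAccumulator and re-log only the points you want to keep into a fresh log directory:

import tensorflow as tf
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

MAX_STEP = 10000                      # keep only scalars up to this step (placeholder)
acc = EventAccumulator("old_logdir")  # directory containing the events.* file
acc.Reload()

writer = tf.summary.create_file_writer("filtered_logdir")
with writer.as_default():
    for tag in acc.Tags()["scalars"]:
        for event in acc.Scalars(tag):
            if event.step <= MAX_STEP:
                tf.summary.scalar(tag, event.value, step=event.step)
writer.flush()

Then point TensorBoard at filtered_logdir instead of the original directory.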

preprocess data sets for Tensorflow highlevel estimators

I'm coming from a Scikit Learn background.
I'm having difficulty understanding how to preprocess data sets for Tensorflow.
I'm trying to implement svm with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
example_id_column='example_id',
feature_columns=[real_feature_column, sparse_feature_column],
l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
but I'm not sure how to obtain the feature_columns.
I think the most effective way is to use TFRecord files. There's a comprehensive tutorial available that's still mostly relevant, too. This also has the advantage of letting you define a lot more of your pipeline as part of the graph, do concurrent reads from the source files, and avoid having to fit your dataset in memory. It's definitely worth the effort.
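For the simple two-numpy-array case in the question, here is a rough TF 1.x sketch (tf.contrib.learn.SVM is a binary classifier, so this assumes the iris data has been filtered down to two classes; the names 'x' and 'example_id' are arbitrary):

import numpy as np
import tensorflow as tf

# features: float array of shape (n_samples, 4); labels: 0/1 array of shape (n_samples,)
def input_fn():
    return {
        'example_id': tf.constant([str(i) for i in range(len(labels))]),
        'x': tf.constant(features, dtype=tf.float32),
    }, tf.constant(labels, dtype=tf.int32)

x_column = tf.contrib.layers.real_valued_column('x', dimension=4)

estimator = tf.contrib.learn.SVM(
    example_id_column='example_id',
    feature_columns=[x_column],
    l2_regularization=10.0)

estimator.fit(input_fn=input_fn, steps=100)

The example_id_column is just the name of a key in the features dict that holds a unique string id per row, not the literal '0,1,2' string.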

mnist and cifar10 examples with TFRecord train/test file

I am a new user of Tensorflow. I would like to use it for training a dataset of 2M images. I did this experiment in caffe using lmdb file format.
After reading TensorFlow-related posts, I realized that TFRecord is the most suitable file format for this. Therefore, I am looking for complete CNN examples that use TFRecord data. I noticed that the image-related tutorials (mnist and cifar10 in link1 and link2) are provided with a different binary file format where the entire dataset is loaded at once. I would like to know whether these tutorials (mnist and cifar10) are available using TFRecord data (for both CPU and GPU).
I assume that you want to both write and read TFRecord files. I think what is done in reading_data.py should help you convert MNIST data into TFRecords.
For reading it back, this script does the trick: fully_connected_reader.py
This could be done similarly with cifar10.
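For reference, a minimal modern sketch of the same idea, writing image/label pairs to a TFRecord file and streaming them back with tf.data.TFRecordDataset (the file name and the images/labels arrays are placeholders):

import tensorflow as tf

# images: uint8 array of shape (N, 28, 28); labels: integer array of shape (N,)
def write_tfrecord(images, labels, filename="mnist_train.tfrecord"):
    with tf.io.TFRecordWriter(filename) as writer:
        for img, label in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

def parse(record):
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(tf.io.decode_raw(feats["image"], tf.uint8), [28, 28, 1])
    return tf.cast(image, tf.float32) / 255.0, feats["label"]

dataset = tf.data.TFRecordDataset("mnist_train.tfrecord").map(parse).shuffle(10000).batch(128)

Because the records are streamed from disk, the same pattern scales to the 2M-image dataset without loading everything at once.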