Convert textual documents to tf.data in TensorFlow for reading sequentially - tensorflow

I have a textual corpus of 50 documents, each approximately 80 lines long.
I want to feed this corpus as input to TensorFlow, but I want each document to form one batch as it is read, similar to how TFRecord is used for images. In other words, I want to use tf.data to batch each document in my corpus and read the documents sequentially.
How can I solve this?

You can create a TextLineDataset that will contain the lines of your documents:
dataset = tf.data.TextLineDataset(['doc1.txt', 'doc2.txt', ...])
After you create the dataset, you can split the strings into batches using the batch method and other methods of the Dataset class.
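If each document should end up as its own batch, one option (a minimal sketch, assuming TF 2.x eager execution and placeholder file names) is to build a per-file TextLineDataset and batch its lines before flattening, so every batch stays inside a single document:

import tensorflow as tf

# Hypothetical file names; each file is one document of roughly 80 lines.
filenames = ['doc1.txt', 'doc2.txt']

# Batch each file's lines separately, then flatten, so one batch never mixes
# lines from two documents (80 is just the stated approximate document length).
dataset = tf.data.Dataset.from_tensor_slices(filenames).flat_map(
    lambda path: tf.data.TextLineDataset(path).batch(80))

for document_batch in dataset:
    # document_batch is a 1-D string tensor holding the lines of one document.
    print(document_batch.shape)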

Related

How to load large datasets of numpy arrays in order to train a CNN model in TensorFlow 2.1.0

I'm training a convolutional neural network (CNN) model for a binary classification task in TensorFlow 2.1.0.
The feature of each instance is a 4-dimensional numpy array with shape (50, 50, 50, 2), where each element is a float32.
The label of each instance is 1 or 0.
My largest training dataset can contain up to ~100 million instances.
To train the model efficiently, is it best to serialize my training data and store it in a set of files in TFRecord format, and then load them with tf.data.TFRecordDataset() and parse them with tf.data.map()?
If so, could you show me an example of how to serialize the feature-label pairs and store them in TFRecord files, and then how to load and parse them?
I did not find an appropriate example on the TensorFlow website.
Or is there a better way to store and load such huge datasets? Thanks very much.
There are many ways to efficiently build a data pipeline without TFRecord; click this link, it was very useful.
To extract images from a directory efficiently, click this link.
Hope this helped you.
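Since the question explicitly asks for a serialization example, here is a minimal sketch (assuming TF 2.x; the shard file name is a placeholder, and the feature shape follows the question) of writing feature-label pairs to a TFRecord file and reading them back with tf.data:

import numpy as np
import tensorflow as tf

FEATURE_SHAPE = (50, 50, 50, 2)  # shape stated in the question

def serialize_example(feature, label):
    # Pack one (feature, label) pair into a tf.train.Example proto.
    example = tf.train.Example(features=tf.train.Features(feature={
        'feature': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[feature.astype(np.float32).tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(
            value=[int(label)])),
    }))
    return example.SerializeToString()

# Write a few dummy instances to one shard; with ~100 million instances you
# would spread the data over many such shards.
with tf.io.TFRecordWriter('train-00000.tfrecord') as writer:
    for _ in range(10):
        feature = np.random.rand(*FEATURE_SHAPE).astype(np.float32)
        writer.write(serialize_example(feature, np.random.randint(2)))

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, {
        'feature': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    feature = tf.reshape(
        tf.io.decode_raw(parsed['feature'], tf.float32), FEATURE_SHAPE)
    return feature, parsed['label']

dataset = (tf.data.TFRecordDataset(['train-00000.tfrecord'])
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(4)
           .prefetch(1))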

How to use TensorFlow to predict large CSV files by chunks and glue results together

I've trained a prediction model with TensorFlow, and there's a large test.csv file that's too big to fit into memory. Is it possible to feed it one smaller chunk at a time and then concatenate the results within one session?
Using tf.estimator.Estimator for your model and calling its predict method with a numpy_input_fn will give you all the pieces to build what you want.
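A minimal sketch of that idea (assuming an already-trained Estimator, pandas for chunked CSV reading, and a hypothetical feature name 'x'; numpy_input_fn lives under tf.compat.v1 in TF 2.x):

import numpy as np
import pandas as pd
import tensorflow as tf

def predict_in_chunks(estimator, csv_path, chunk_size=10000):
    all_predictions = []
    # Read the CSV lazily, chunk_size rows at a time, so it never has to fit
    # in memory all at once.
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
            x={'x': chunk.values.astype(np.float32)},  # 'x' is a placeholder feature name
            shuffle=False)
        all_predictions.extend(estimator.predict(input_fn=input_fn))
    return all_predictions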

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF with the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which is encoding your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a lookup table to map the text to integers (a minimal sketch of such a table is shown after the example links below). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert to integers. The output of tensorflow_transform is preprocessed data in tf.example format. Then training can start from the tf.example files. This is again option 1 but with tf.example files. If you want to run prediction on raw text data, this option allows you to make an exported graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated as it introduces two additional ideas: tf.example files and beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
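For option 2, a minimal sketch of the in-graph mapping step (assuming TF 2.x and a toy vocabulary standing in for one built offline with pandas):

import tensorflow as tf

vocab = ['the', 'cat', 'sat']  # stand-in for a vocab built outside of TF

# Static lookup table: word -> integer id; unknown words map to -1.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(vocab),
        values=tf.constant(list(range(len(vocab))), dtype=tf.int64)),
    default_value=-1)

words = tf.constant(['the', 'dog', 'sat'])
ids = table.lookup(words)  # -> [0, -1, 2]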

CNN using TensorFlow for own image set - what should be the TFRecord format

I have an image dataset where each image is 600 x 400, and I have converted each of the images to TFRecord format. But I am unable to figure out how to use this data. I have looked at the ImageNet dataset and found only one single binary file (when extracted from here).
Is it that an image dataset will have only one TFRecord, or will each individual image have its own TFRecord file?
TensorFlow doesn't look for a single TFRecord file, so feel free to point your "data directory" and "train directory" to the location that holds your set of TFRecord files.
Also, keep in mind that the files should be in their respective directories based on their names, e.g. TRAIN-*.tfrecord files in the "train directory".
The answer could be more specific if you mentioned which model you are targeting to run on these TFRecord files.
Hope it helps.
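As a minimal sketch of pointing a pipeline at a whole set of shards (assuming TF 2.x; train_dir and the TRAIN-*.tfrecord naming are placeholders following the convention above):

import tensorflow as tf

# Match every training shard in the directory rather than one monolithic file.
train_files = tf.data.Dataset.list_files('train_dir/TRAIN-*.tfrecord')

# Interleave reads across shards; the parsing/decoding map step is omitted
# because it depends on how the image Examples were written.
dataset = train_files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)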

TensorFlow: Convolutional Neural Network with non-image input

I am interested in using TensorFlow to train my data for binary classification based on a CNN.
Now I wonder how to set the filter size and the number of output nodes in the convolution process.
I have read many tutorials and examples. However, most of them use image data, and I cannot compare them with my data, which is customer data, not pixels.
Could you advise me on this issue?
If your data varies in time or space, then you can use a CNN. I am currently working with an EEG data set, which varies in time. You can also refer to this paper:
http://www.nlpr.ia.ac.cn/english/irds/People/lwang/M-MCG_EN/Publications/2015/YD2015ACPR.pdf
where the input data (which is not an image) is presented as an image to the CNN.
You have to reshape the data to be 4-D. In this example, I have only 4 columns:
# reshape each row of 4 features into a 2 x 2 x 1 "image" (assumes numpy imported as np)
x_train = np.reshape(x_train, (x_train.shape[0], 2, 2, 1))
x_test = np.reshape(x_test, (x_test.shape[0], 2, 2, 1))
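Tying that reshape back to the original question about filter sizes and output nodes, here is a minimal sketch of a tiny binary-classification CNN on such 2 x 2 x 1 inputs (assuming TF 2.x Keras; the filter count and layer sizes are just illustrative choices, not recommendations):

import tensorflow as tf

model = tf.keras.Sequential([
    # 8 filters with a 2x2 kernel; with inputs this small the kernel cannot
    # be larger than the input itself.
    tf.keras.layers.Conv2D(8, kernel_size=2, activation='relu',
                           input_shape=(2, 2, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation='relu'),
    # A single output node with a sigmoid is enough for binary classification.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))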
This is a good example of using non-image data:
https://github.com/fengjiqiang/LSTM-Wind-Speed-Forecasting
You just need to change the following:
prediction_cols
feature_cols
features
and dataload
This is a tutorial for text: Here!
You might use one of the following classes:
class Dataset: Represents a potentially large set of elements.
class FixedLengthRecordDataset: A Dataset of fixed-length records from one or more binary files.
class Iterator: Represents the state of iterating through a Dataset.
class TFRecordDataset: A Dataset comprising records from one or more TFRecord files.
class TextLineDataset: A Dataset comprising lines from one or more text files.
Tutorial
official documentation