How to use TensorFlow to predict large CSV files in chunks and glue the results together

Now that I've trained a prediction model with TensorFlow, and there's a test.csv file too large to fit into memory, is it possible to feed it one smaller chunk at a time and then concatenate the results within one session?

Using tf.estimator.Estimator for your model and calling its predict method with a numpy_input_fn will give you all the pieces you need to build this.
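For illustration, a minimal sketch of that chunk-and-concatenate loop, assuming a trained TF 1.x Estimator whose features are keyed by "x" (the key, the chunk size, and the all-numeric cast are placeholders; in newer TF versions numpy_input_fn lives under tf.compat.v1.estimator.inputs):

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    # estimator = tf.estimator.Estimator(...)  # your trained model (assumed)

    predictions = []
    for chunk in pd.read_csv("test.csv", chunksize=10000):  # chunk size is arbitrary
        input_fn = tf.estimator.inputs.numpy_input_fn(
            x={"x": chunk.values.astype(np.float32)},  # feature key "x" and the
                                                       # all-numeric cast are assumptions
            num_epochs=1,
            shuffle=False,  # keep row order so results can be glued back together
        )
        # predict() returns a generator; draining it per chunk keeps memory bounded
        predictions.extend(estimator.predict(input_fn))

Because shuffle=False and each chunk is drained in order, predictions ends up with one entry per CSV row, in the original row order.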

Related

Can we build an LSTM classification model using PySpark?

I am familiar with building an LSTM model for a time-series classification problem using TensorFlow on a small dataset. However, I want to go further and create a new LSTM model on a much bigger dataset (around 30 GB of text files), and I understand that PySpark is an effective way to handle big-data processing.
My question is: can we create an LSTM model using PySpark? Does anyone have a guideline or an example use case for the same purpose?

With TensorFlow/Keras, is there a way to split up input data for training so I don't run out of RAM?

I am using TensorFlow with Keras to process audio data through Conv1D layers; however, I am running out of RAM (only 8 GB available). I have an input .wav and a target .wav for the network to learn from, and each file is 40 MB (about 4 minutes of audio).
In this model, one sample of audio is predicted from the previous 200 samples. To accomplish this, I am taking (for example) the 8000000 input samples and "unfolding" them into (8000000, 200, 1), where each audio sample becomes an array of the previous 200 samples. Then I call model.fit(unfolded_input_samples, target_samples).
The problem is that I quickly run out of RAM when unfolding the input data. Is there a way around creating this massive array while still telling TensorFlow how to use the previous 200 samples for each data point? Can I break the unfolded input array into chunks and pass each to fit() without starting a new epoch? Or is there an easier way to accomplish this with TensorFlow?
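One way to sidestep the unfolding entirely is to build the windows lazily with tf.data; a minimal sketch, assuming TF >= 2.3 and 1-D float arrays for both .wav files (the random data below is just a stand-in for the decoded audio):

    import numpy as np
    import tensorflow as tf

    WINDOW = 200

    # Stand-ins for the decoded input/target .wav data (assumed 1-D float arrays).
    input_samples = np.random.randn(8000000).astype(np.float32)
    target_samples = np.random.randn(8000000).astype(np.float32)

    # Each window of 200 input samples predicts the target sample that follows it,
    # so targets start at index WINDOW. Windows are built lazily, batch by batch,
    # instead of materializing one (8000000, 200, 1) array in RAM.
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=input_samples[:-1],         # drop the last sample, which has no target
        targets=target_samples[WINDOW:],
        sequence_length=WINDOW,
        batch_size=64,
    )

    # model.fit(ds, epochs=...)  # batches have shape (64, 200); add an
    #                            # expand_dims/Reshape to get (200, 1) for Conv1D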

Does it make sense to use a TensorFlow Dataset over a Keras DataGenerator?

I am training a model using tf.keras, and I have many small .npy files, each containing a single observation, in a folder on local disk. I have built a DataGenerator(keras.utils.Sequence) class and it works correctly, although I get a warning:
'tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.'
I have found out that I can simply create something like this:
    ds = tf.data.Dataset.from_generator(
        DataGenerator, args=[...],
        output_types=(tf.float16, tf.uint8),
        output_shapes=([None, 256, 256, 3], [None, 256, 256, 1]),
    )
so that my Keras DataGenerator would work as a single-file reader and the TF Dataset as the interface that creates batches. My question is: does this make any sense? Would it be safer? Would it read the next batch while the previous batch is still training, when using a simple model.fit?
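For what it's worth, a minimal sketch of this wrapping pattern plus the piece that actually buys you overlapped reading, assuming the DataGenerator and its (elided) args as described above:

    import tensorflow as tf

    # DataGenerator is the keras.utils.Sequence described above; from_generator
    # treats it as a callable returning an iterable of (image, mask) batches.
    ds = tf.data.Dataset.from_generator(
        DataGenerator, args=[...],      # args elided in the original question
        output_types=(tf.float16, tf.uint8),
        output_shapes=([None, 256, 256, 3], [None, 256, 256, 1]),
    )

    # prefetch() is what gives you pipelining: tf.data prepares the next batch
    # while model.fit() is still training on the current one.
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)

    # model.fit(ds, epochs=...)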

How to load large datasets of numpy arrays in order to train a CNN model in TensorFlow 2.1.0

I'm training a convolutional neural network (CNN) model for a binary classification task in TensorFlow 2.1.0.
The feature of each instance is a 4-dimensional numpy array with shape (50, 50, 50, 2), where each element is a float32.
The label of each instance is 1 or 0.
My largest training dataset can contain up to ~100 million instances.
To train the model efficiently, is it best to serialize my training data and store it in a set of files in TFRecord format, then load them with tf.data.TFRecordDataset() and parse them with tf.data.Dataset.map()?
If so, could you show me an example of how to serialize the feature-label pairs and store them in TFRecord files, and then how to load and parse them?
I did not find an appropriate example on the TensorFlow website.
Or is there a better way to store and load huge datasets? Thanks very much.
There are many ways to build an efficient data pipeline without TFRecord; click this link, it was very useful.
To extract images from a directory efficiently, click this link.
Hope this helped you.
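Since the question asks for one, here is a minimal sketch of the TFRecord round trip, assuming TF 2.1 (hence tf.data.experimental.AUTOTUNE) and a hypothetical numpy_pairs iterator of (feature, label) numpy pairs as the source:

    import numpy as np
    import tensorflow as tf

    def serialize_example(feature, label):
        # feature: float32 array of shape (50, 50, 50, 2); label: 0 or 1
        example = tf.train.Example(features=tf.train.Features(feature={
            "feature": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[feature.tobytes()])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        return example.SerializeToString()

    # numpy_pairs is a hypothetical iterator of (feature, label) pairs;
    # at ~100M instances you would write many shards, not one file.
    with tf.io.TFRecordWriter("train-00000.tfrecord") as writer:
        for feature, label in numpy_pairs:
            writer.write(serialize_example(feature, label))

    def parse_example(record):
        parsed = tf.io.parse_single_example(record, {
            "feature": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
        feature = tf.reshape(
            tf.io.decode_raw(parsed["feature"], tf.float32), (50, 50, 50, 2))
        return feature, parsed["label"]

    AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in newer TF
    ds = (tf.data.TFRecordDataset(["train-00000.tfrecord"])
          .map(parse_example, num_parallel_calls=AUTOTUNE)
          .batch(32)
          .prefetch(AUTOTUNE))

    # model.fit(ds, epochs=...)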

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunksize parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have three options:
1) You cannot reuse pandas preprocessing pipelines in TF directly. However, you could start TF from the output of your pandas preprocessing: build a vocab, convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which encodes your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, make a lookup table to map the text to integers (see the sketch at the end of this answer). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert it to integers. The output of tensorflow_transform is preprocessed data in tf.Example format, so training can start from the tf.Example files. This is again option 1, but with tf.Example files. If you want to run prediction on raw text data, this option lets you export a graph with the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated, as it introduces two additional ideas: tf.Example files and Beam pipelines.
For examples of tensorflow_transform, see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft and https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
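As promised in option 2, a minimal sketch of the word-to-integer lookup table, using the current tf.lookup API rather than the TF 1.x contrib equivalent; the vocab file name is hypothetical and is assumed to hold one word per line, as written by the pandas step:

    import tensorflow as tf

    # vocab.txt (hypothetical name): one word per line, from the pandas vocab build.
    initializer = tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
    )
    # Words not in the file hash into one of num_oov_buckets extra ids.
    table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=1)

    words = tf.constant(["the", "quick", "some_unseen_word"])
    ids = table.lookup(words)  # int64 ids, ready to feed into an embedding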