Merging TensorFlow datasets with different structures into a batched merged dataset

I have two TensorFlow datasets. One has 19 million records and the other has 24,000. I want to create a 19 million × 24,000 input feature matrix. Can I achieve this using TensorFlow dataset operations?
Currently I am trying to merge the datasets using pandas, but I get a memory error in my AWS SageMaker pipeline because the merge needs 2.74 TiB. So I decided to do batch inference after converting my two data frames, users_df and items_df, into TensorFlow datasets.
Is there any way to do this pandas merge operation with TensorFlow datasets?
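A lazy Cartesian product can be expressed directly in tf.data, so the 19M × 24K pairs are streamed in batches rather than materialized at once. Below is a minimal sketch, assuming users_ds and items_ds are the tf.data.Dataset objects built from users_df and items_df (the feature shapes, batch size, and model call are placeholders):

```python
import tensorflow as tf

# Stand-ins for the datasets built from users_df and items_df.
users_ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((100, 8)))  # user features
items_ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((50, 4)))   # item features

# Lazy cross join: pair every user with every item.
# flat_map produces pairs on demand, so the full product is never held in memory.
cross_ds = users_ds.flat_map(
    lambda user: items_ds.map(lambda item: (user, item)))

batched_ds = cross_ds.batch(1024).prefetch(tf.data.experimental.AUTOTUNE)

# for user_batch, item_batch in batched_ds:
#     predictions = model.predict_on_batch([user_batch, item_batch])
```

Because the pairs are generated lazily, memory use depends only on the batch size, not on the size of the full 19M × 24K product.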

Related

Can we build an LSTM classification model using PySpark?

I am familiar with building an LSTM model for a time series classification problem using TensorFlow with a small dataset. However, I now want to go further and create a new LSTM model with a much bigger dataset (around 30 GB of text files). I understand that PySpark is an effective way to handle big data processing.
My question is: can we create an LSTM model using PySpark, and does anyone have a guideline or an example use case for the same purpose?

How to train a large dataset on TensorFlow 2.x

I have a large dataset with about 2M rows and 6,000 columns. The input numpy arrays (X, y) can hold the training data fine, but when it gets to model.fit() I get a GPU out-of-memory error. I am using TensorFlow 2.2. According to its documentation, model.fit_generator has been deprecated and model.fit is preferred.
Can someone outline the steps for training large datasets with TensorFlow v2.2?
The best solution is to use tf.data.Dataset, which lets you easily batch your data with the .batch() method. There are plenty of tutorials available; you may want to use from_tensor_slices() to work directly with numpy arrays.
Below are two excellent documentation pages to suit your needs:
https://www.tensorflow.org/tutorials/load_data/numpy
https://www.tensorflow.org/guide/data
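A minimal sketch of that approach, assuming X and y are the existing numpy arrays (the array sizes, batch size, and model are placeholders):

```python
import numpy as np
import tensorflow as tf

# Small stand-ins for the real arrays (about 2M x 6,000 in the question).
X = np.random.rand(10_000, 6_000).astype(np.float32)
y = np.random.randint(0, 2, size=(10_000,)).astype(np.int32)

# Wrapping the arrays in a tf.data.Dataset means model.fit only moves
# one batch at a time to the GPU instead of the whole array.
train_ds = (tf.data.Dataset.from_tensor_slices((X, y))
            .shuffle(buffer_size=10_000)
            .batch(256)
            .prefetch(tf.data.experimental.AUTOTUNE))

# model.fit(train_ds, epochs=10)
```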

Image data augmentation in TF 2.0 -- rotation

I am training a TensorFlow model with multiple images as input and a segmentation mask as output. I want to perform random rotation augmentation in my dataset pipeline.
I have a list of parallel image file names (input and output files aligned) which I convert into a tf.data.Dataset object using tf.data.Dataset.from_generator, and then use Dataset.map to load the images with tf.io.read_file and tf.image.decode_png.
How can I perform random rotations on the input and output images? I tried using the random_transform function of the ImageDataGenerator class, but it expects numpy data as input and does not work on tensors (and since TensorFlow does not support eager execution inside the data pipeline, I cannot convert the tensors to numpy either). I suppose I could use tf.numpy_function, but I expect there should be a simpler solution for this simple problem.
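For reference, a minimal sketch of the tf.numpy_function route mentioned above, assuming the dataset yields (image, mask) pairs after decoding and using scipy for the actual rotation (the angle range and tensor shapes are assumptions):

```python
import numpy as np
import tensorflow as tf
from scipy import ndimage

def random_rotate(image, mask):
    """Rotate image and mask by the same random angle (runs as plain numpy)."""
    angle = np.random.uniform(-30.0, 30.0)  # degrees; range is an assumption
    image = ndimage.rotate(image, angle, reshape=False, order=1)  # bilinear
    mask = ndimage.rotate(mask, angle, reshape=False, order=0)    # nearest, keeps label values intact
    return image.astype(np.float32), mask.astype(np.float32)

def tf_random_rotate(image, mask):
    image, mask = tf.numpy_function(
        random_rotate, [image, mask], [tf.float32, tf.float32])
    # numpy_function loses static shape info; restore it (shapes assumed here).
    image.set_shape([None, None, 3])
    mask.set_shape([None, None, 1])
    return image, mask

# dataset = dataset.map(tf_random_rotate,
#                       num_parallel_calls=tf.data.experimental.AUTOTUNE)
```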

Tensorflow dataset pipeline for large HDF5 file

Let's say I have 10 million training samples stored in an HDF5 file and I want to generate batches of size 1000 with a TensorFlow dataset. However, the 10 million samples are too big to be loaded into memory.
What I want to do is load the numpy data from the HDF5 file into memory 1 million samples at a time, and then iterate over them to generate my batches of size 1000. When the 1 million samples are finished, I want to load the next 1 million from the HDF5 file and continue. I would like to manage this with a single Dataset in TensorFlow.
However, I don't see how to do this with the Dataset API.
How can I iterate on two levels like this (1st level = big chunks of 1 million, 2nd level = small batches of 1000)?
Thanks
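One way to get the two-level iteration with a single Dataset is to let a Python generator yield the big chunks and have tf.data split them into small batches. A minimal sketch, assuming an HDF5 dataset called "features" with 128 columns (the file name, dataset name, and feature width are assumptions):

```python
import h5py
import tensorflow as tf

CHUNK_SIZE = 1_000_000   # 1st level: samples loaded into memory at a time
BATCH_SIZE = 1_000       # 2nd level: training batch size

def chunk_generator():
    """Yield large in-memory chunks from the HDF5 file, one at a time."""
    with h5py.File("data.h5", "r") as f:      # file name is an assumption
        dset = f["features"]                  # dataset name is an assumption
        for start in range(0, dset.shape[0], CHUNK_SIZE):
            yield dset[start:start + CHUNK_SIZE]  # numpy array of up to CHUNK_SIZE rows

# On TF versions before 2.4, use output_types=/output_shapes= instead of output_signature.
ds = tf.data.Dataset.from_generator(
    chunk_generator,
    output_signature=tf.TensorSpec(shape=(None, 128), dtype=tf.float32))

# Flatten each 1M-sample chunk into individual samples, then rebatch at 1000.
ds = ds.unbatch().batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
```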

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but with pandas I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF with the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which is encoding your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a lookup table to map the text to integers (a minimal sketch is shown after the links below). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert to integers. The output of tensorflow_transform is preprocessed data in tf.Example format. Then training can start from the tf.Example files. This is again option 1 but with tf.Example files. If you want to run prediction on raw text data, this option allows you to make an exported graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated as it introduces two additional ideas: tf.Example files and Beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
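As an illustration of option 2, here is a minimal sketch using TF's lookup ops, assuming the vocab built in pandas was written to a file named vocab.txt with one token per line (the file name and OOV bucket count are assumptions):

```python
import tensorflow as tf

# vocab.txt: one token per line, produced outside TF (e.g. with pandas).
initializer = tf.lookup.TextFileInitializer(
    "vocab.txt",
    key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=1)

def words_to_ids(text):
    # Split a raw text string into tokens and map each token to its vocab index.
    tokens = tf.strings.split(text)
    return table.lookup(tokens)

# ds = tf.data.Dataset.from_tensor_slices(["some raw text", "more raw text"])
# ds = ds.map(words_to_ids)
```

Here the vocab file itself is the only thing shared between the pandas preprocessing and the TF input pipeline.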