What is the best format (parquet/pickle/csv/txt/...) to store a scipy.sparse.csr.csr_matrix in an S3 bucket?

I used a scipy.sparse.csr.csr_matrix of shape (281333, 2550842) to train an XGBoost model. Now I am in the process of deploying this model using a SageMaker pipeline, but I am facing a challenge storing this matrix in an S3 bucket. Appreciate your help. Thanks.
I stored the rows of the matrix into a CSV, but pandas.read_csv gives me back strings like:
"[<1x2550842 sparse matrix of type '<class 'numpy.int64'>'\n\twith 220 stored elements in Compressed Sparse Row format>]"

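One common approach, sketched below under the assumption that boto3 is available, is to save the matrix in scipy's native sparse .npz format with scipy.sparse.save_npz and upload the file to S3; the bucket name, object key, and X_train variable are placeholders.

import io

import boto3
import scipy.sparse

# serialize the CSR matrix in scipy's compressed .npz format, in memory
buffer = io.BytesIO()
scipy.sparse.save_npz(buffer, X_train)  # X_train is the csr_matrix
buffer.seek(0)

# upload to S3 (bucket and key are placeholders)
s3 = boto3.client("s3")
s3.upload_fileobj(buffer, "my-bucket", "train/features.npz")

# later: download and load it back
obj = io.BytesIO()
s3.download_fileobj("my-bucket", "train/features.npz", obj)
obj.seek(0)
X_train = scipy.sparse.load_npz(obj)
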
Related

how to load large datasets of numpy arrays in order to train a CNN model in tensorflow2.1.0

I'm training a convolutional neural network (CNN) model for a binary classification task in TensorFlow 2.1.0.
The feature of each instance is a 4-dimensional numpy array with shape (50, 50, 50, 2), in which each element is a float32.
The label of each instance is 1 or 0.
My largest training dataset can contain up to ~100 million instances.
To efficiently train the model, is it best to serialize my training data and store it in a set of files in TFRecord format, and then load them with tf.data.TFRecordDataset() and parse them with tf.data.map()?
If so, could you show me an example of how to serialize the feature-label pairs and store them in TFRecord files, and then how to load and parse them?
I did not find an appropriate example on the TensorFlow website.
Or is there any better way to store and load the huge datasets? Thanks very much.
There are many ways to efficiently build a data pipeline without TFRecord; click this link, it was very useful.
To extract images from a directory efficiently, click this link.
Hope this helps.
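For reference, here is a minimal sketch of the TFRecord route asked about in the question, assuming TF 2.x; the dummy arrays, file name, and feature keys are placeholders.

import numpy as np
import tensorflow as tf

# dummy data standing in for the real instances (shapes taken from the question)
features = np.random.rand(10, 50, 50, 50, 2).astype(np.float32)
labels = np.random.randint(0, 2, size=10)

def serialize_example(feature, label):
    # feature: float32 array of shape (50, 50, 50, 2); label: 0 or 1
    example = tf.train.Example(features=tf.train.Features(feature={
        "feature": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[feature.tobytes()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }))
    return example.SerializeToString()

# write the records (file name is a placeholder)
with tf.io.TFRecordWriter("train-000.tfrecord") as writer:
    for feature, label in zip(features, labels):
        writer.write(serialize_example(feature, label))

# load and parse them back with tf.data
def parse_example(record):
    parsed = tf.io.parse_single_example(record, {
        "feature": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    x = tf.reshape(tf.io.decode_raw(parsed["feature"], tf.float32), (50, 50, 50, 2))
    return x, parsed["label"]

ds = tf.data.TFRecordDataset(["train-000.tfrecord"]).map(parse_example).batch(32)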

how to reduce input size for mask-RCNN trained model while running prediction on google cloud platform

I am trying to use Google AI Platform prediction to perform object recognition using Mask RCNN. After spending close to two weeks, I was able to:
find out how to train on Google Cloud
convert the model from h5 to the SavedModel format required by the AI platform
create AI Platform Models and deploy the trained models there.
Now that I am trying to perform prediction, it says that my input size exceeds 1.5 MB, which is the maximum allowed input size. When I checked, the code that converts the image (of size 65 KB) to the format required for prediction produces an input file of 57 MB.
I have no idea how a 64 KB image file can turn into a 57 MB JSON file when converted. I wanted to know how I can reduce this; I am not sure if I am doing something wrong.
I have tried local prediction using gcloud local predict, and I am able to get a response with the 57 MB file, so the file itself is correct.
I tried setting the maximum dimension of the image to 400x400, which reduced the file size from 57 MB to around 7 MB, but that is still very high. I cannot keep reducing it, as that leads to loss of information.
As per the online prediction documentation:
Binary data cannot be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it.
You need to have your input_image tensor called input_image_bytes and you will send it data like so:
{'input_image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}
If you need help correcting your model's inputs, see def _encoded_image_string_tensor_input_placeholder() in exporter.py, called from export_inference_graph.py.
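For illustration, a minimal sketch of building such a request body from a local JPEG, assuming the exported model expects an input_image_bytes tensor as described above (the file name is a placeholder):

import base64
import json

# read the raw JPEG bytes instead of decoding them into a pixel array
with open("image.jpg", "rb") as f:
    jpeg_data = f.read()

instance = {"input_image_bytes": {"b64": base64.b64encode(jpeg_data).decode()}}

# request body for AI Platform online prediction
request_body = json.dumps({"instances": [instance]})

Sending the encoded JPEG bytes rather than a decoded float array is what keeps the payload close to the original file size.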

Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2 TB of image data on Google Cloud Storage. I saved the image data as separate TFRecords and tried to use the TensorFlow Data API following this example
https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems like Keras's model.fit(...) doesn't support validation for TFRecord datasets, based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!
If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud storage, no need to download the data first:
# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
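A minimal sketch of that, assuming the validation records live in their own files (the 'gs://' path is a placeholder, as above):

# build the validation dataset the same way as the training one
ds_val = tf.data.TFRecordDataset('gs://')  # path to validation TFRecords on GCS
ds_val = ds_val.batch(32)

model.fit(ds_train, validation_data=ds_val)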
Final note: you'll need to specify the steps_per_epoch argument. A hack I use to find the total number of examples across all TFRecord files is to simply iterate over the files and count:
import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.

    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the file and counting.
    See https://stackoverflow.com/questions/40472139

    Args:
        record_list (list): list of GCS paths to TFRecords files
    """
    counter = 0
    for f in record_list:
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter
You can then use this to compute steps_per_epoch:
n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])
steps_per_epoch = n_train // batch_size

How to use TensorFlow to predict large csv files by chunks and glue results together

Now that I've trained a prediction model with TensorFlow, and there's a large test.csv file that's too big to fit into memory, is it possible to feed it one smaller chunk at a time and then concatenate the results within one session?
Using tf.estimator.Estimator for your model and calling the predict method using the numpy_input_fn will give you all the pieces to build what you want.
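A minimal sketch of that idea, assuming the TF 1.x estimator API; the already-trained estimator, the feature key "x", the chunk size, and an all-numeric test.csv are assumptions for illustration.

import numpy as np
import pandas as pd
import tensorflow as tf

all_predictions = []

# read test.csv in chunks that fit in memory (chunk size is an arbitrary choice)
for chunk in pd.read_csv("test.csv", chunksize=100_000):
    input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": chunk.values.astype(np.float32)},  # feature key "x" is a placeholder
        shuffle=False,
        num_epochs=1)
    # estimator is the already-trained tf.estimator.Estimator
    all_predictions.extend(estimator.predict(input_fn=input_fn))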

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF with the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which is encoding your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a lookup table to map the text to integers (a minimal sketch of this is shown after this answer). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert to integers. The output of tensorflow_transform is preprocessed data in tf.example format. Then training can start from the tf.example files. This is again option 1 but with tf.example files. If you want to run prediction on raw text data, this option allows you to make an exported graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated as it introduces two additional ideas: tf.example files and beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
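To make option 2 concrete, here is a minimal sketch using the TF 1.x contrib lookup API; the vocabulary is shown inline for illustration, but in practice it would come from the pandas step.

import tensorflow as tf

# vocabulary built outside of TF (e.g. with pandas); shown inline here for illustration
vocab = ["the", "quick", "brown", "fox"]
table = tf.contrib.lookup.index_table_from_tensor(tf.constant(vocab), num_oov_buckets=1)

# tokens read inside the TF input pipeline
words = tf.constant([["the", "quick", "red", "fox"]])
word_ids = table.lookup(words)  # unknown words fall into the OOV bucket

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(word_ids))  # [[0 1 4 3]]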