how to work with a large training set when dealing with auto-encoders on google colaboratory? - numpy

I am training an auto-encoder (Keras) on Google Colab. However, I have 25000 input images and 25000 output images. I tried to:
1- copy the large file from Google Drive to Colab each time (takes 5-6 hours).
2- convert the set to a numpy array, but when normalizing the images the size gets a lot bigger (from 7 GB to 24 GB, for example) and then I cannot fit it into RAM.
3- I cannot zip and unzip my data.
So please, if anyone knows how to convert the images into a numpy array (and normalize them) without ending up with a huge file (24 GB), let me know.

What I usually do:
Zip all the images and upload the .zip file to your Google Drive.
Unzip it in your Colab:
from zipfile import ZipFile

with ZipFile('data.zip', 'r') as zip:
    zip.extractall()
All your images are now unzipped and stored on the Colab disk, so you have much faster access to them.
Use generators in Keras such as flow_from_directory, or create your own generator (a sketch follows below).
Use your generator when you fit your model:
model.fit(train_generator, steps_per_epoch=ntrain // batch_size,
          epochs=epochs, validation_data=val_generator,
          validation_steps=nval // batch_size)
where ntrain and nval are the numbers of images in your training and validation datasets.
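For reference, here is a minimal sketch of such generators built with ImageDataGenerator.flow_from_directory; the directory names, target size, and class_mode='input' (which yields (image, image) pairs, as an auto-encoder expects) are assumptions, not part of the original answer:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescaling in the generator replaces the explicit numpy normalization step,
# so the full normalized array never has to exist in memory.
datagen = ImageDataGenerator(rescale=1.0 / 255)

# 'train' and 'val' are assumed directory layouts on the Colab disk.
# class_mode='input' makes each batch an (images, images) pair for the auto-encoder.
train_generator = datagen.flow_from_directory(
    'train', target_size=(128, 128), batch_size=32, class_mode='input')
val_generator = datagen.flow_from_directory(
    'val', target_size=(128, 128), batch_size=32, class_mode='input')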

Related

Train on Colab TPU without data from GCP, for data that can all be loaded into memory

The official TPU documentation says that training files must be on GCP:
https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
But I have a smaller dataset (though training would still take a very long time, since it is based on sampling/permutations) which can all be loaded into memory (1-2 GB). I am wondering if I can somehow just transfer the data objects to the TPU directly, so it can use them for training.
If it makes a difference, I am using Keras to do my TPU training.
What I looked at so far:
It seems that you can load certain data onto individual TPU cores:
self.workers = ['/job:worker/replica:0/task:0/device:TPU:' + str(i) for i in range(num_tpu_cores)]
with tf.device(self.workers[0]):
    vecs = vectors[i]
However, I am not sure if this would translate into coordinated training among all the TPU cores.
You can read files with Python:
with open(image_path, "rb") as local_file:
    img = local_file.read()
1-2 GB may be too big for the TPU. If you run out of memory, split your data into smaller portions.
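If the whole dataset really fits in host memory, one option (not from the original answer, just a sketch assuming TensorFlow 2.x and a Colab TPU, and subject to the caveat above that 1-2 GB may already be too much) is to wrap the in-memory arrays in a tf.data.Dataset and train under tf.distribute.TPUStrategy, so the data is fed to the TPU batch by batch without going through GCS:
import tensorflow as tf

# Connect to the TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# x_train / y_train are assumed to be numpy arrays already loaded in memory (the 1-2 GB).
ds_train = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(10000)
            .batch(128, drop_remainder=True))  # TPUs prefer fixed batch shapes

with strategy.scope():
    model = build_keras_model()  # hypothetical helper returning a compiled Keras model

model.fit(ds_train, epochs=10)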

how to reduce input size for mask-RCNN trained model while running prediction on google cloud platform

I am trying to use Google AI Platform prediction to perform object recognition using Mask RCNN. After spending close to two weeks, I was able to:
find out how to train on Google Cloud
convert the model from h5 to the SavedModel format required by the AI platform
create AI Platform Models and deploy the trained models there.
Now that I am trying to perform prediction, it says that my input size exceeds 1.5 MB, which is the maximum size the input can be. When I checked, the code that converts the image (of size 65 KB) to the format required for prediction turns the input file into 57 MB.
I have no idea how a 64 KB image file can become a 57 MB JSON file when encoded, and I want to know how I can reduce this. Not sure if I am doing something wrong.
I have tried to perform local prediction using gcloud local predict, and I am able to get the response with the 57 MB file. So that means the file itself is correct.
I tried to set the max dimension of the image to 400x400, and that reduced the file size from 57 MB to around 7 MB, which is still very high. I cannot keep reducing it, as that leads to loss of information.
As per the online prediction documentation
Binary data cannot be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it.
You need to have your input_image tensor called input_image_bytes and you will send it data like so:
{'input_image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}
If you need help correcting your model's inputs, see def _encoded_image_string_tensor_input_placeholder() in exporter.py, called from export_inference_graph.py.
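For illustration, a rough sketch of building such a request file from a JPEG on disk (the file names are placeholders); sending the compressed bytes base64-encoded keeps the payload close to the original image size instead of expanding it into a huge float array:
import base64
import json

# Read the compressed JPEG bytes instead of decoding them into a float array.
with open('image.jpg', 'rb') as f:   # placeholder file name
    jpeg_data = f.read()

instance = {'input_image_bytes': {'b64': base64.b64encode(jpeg_data).decode()}}

# One JSON object per line, the format AI Platform online prediction expects for instance files.
with open('instances.json', 'w') as f:
    f.write(json.dumps(instance) + '\n')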

Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2 TB of image data on gcloud storage. I saved the image data as separate TFRecords and tried to use the TensorFlow Data API, following this example:
https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems that Keras' model.fit(...) doesn't support validation for TFRecord datasets, based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!
If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud storage, no need to download the data first:
# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
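A rough sketch of what that can look like; the feature names, image layout, and GCS paths below are placeholders, the parsing uses TF 1.x-style APIs to match the rest of the answer, and n_train / n_val come from the counting helper further down:
import tensorflow as tf

def parse_example(serialized):
    # Assumed TFRecord layout: a JPEG-encoded image plus an integer label.
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, features['label']

ds_train = (tf.data.TFRecordDataset(['gs://my-bucket/train-00.tfrecord'])  # placeholder paths
            .map(parse_example)
            .shuffle(1000)
            .batch(32))
ds_val = (tf.data.TFRecordDataset(['gs://my-bucket/val-00.tfrecord'])      # placeholder paths
          .map(parse_example)
          .batch(32))

model.fit(ds_train, validation_data=ds_val,
          steps_per_epoch=n_train // 32,    # see the counting helper below
          validation_steps=n_val // 32)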
Final note: you'll need to specify the steps_per_epoch argument. A hack I use to get the total number of examples across all TFRecord files is to simply iterate over the files and count:
import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.

    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the files and counting.
    See https://stackoverflow.com/questions/40472139

    Args:
        record_list (list): list of GCS paths to TFRecords files
    """
    counter = 0
    for f in record_list:
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter
Which you can use to compute steps_per_epoch:
n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])
steps_per_epoch = n_train // batch_size

How to use TensorFlow to predict large csv files by chunks and glue results together

Now that I've trained a prediction model with TensorFlow, and there's a large test.csv file that's too big to fit into memory, is it possible to feed it in smaller chunks and then concatenate the results within one session?
Using tf.estimator.Estimator for your model and calling the predict method using the numpy_input_fn will give you all the pieces to build what you want.
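A rough sketch of that chunked approach, assuming the TF 1.x tf.estimator.inputs.numpy_input_fn; the chunk size, feature name, and prediction key are placeholders that depend on your model:
import numpy as np
import pandas as pd
import tensorflow as tf

all_predictions = []

# Read test.csv in chunks that fit in memory (chunk size is an assumption).
for chunk in pd.read_csv('test.csv', chunksize=10000):
    input_fn = tf.estimator.inputs.numpy_input_fn(
        x={'x': chunk.values.astype(np.float32)},  # feature name 'x' is a placeholder
        num_epochs=1,
        shuffle=False)
    # `estimator` is your already-trained tf.estimator.Estimator;
    # the 'predictions' key depends on the model head you use.
    preds = [p['predictions'] for p in estimator.predict(input_fn=input_fn)]
    all_predictions.extend(preds)

all_predictions = np.concatenate(all_predictions)  # glue the chunk results together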

reduce size of pretrained deep learning model for feature generation

I am using a pretrained model in Keras to generate features for a set of images:
model = InceptionV3(weights='imagenet', include_top=False)
train_data = model.predict(data).reshape(data.shape[0],-1)
However, I have a lot of images, and the ImageNet model outputs 131072 features (columns) for each image.
With 200k images I would get an array of shape (200000, 131072), which is too large to fit into memory.
More importantly, I need to save this array to disk, and it would take 100 GB of space when saved as .npy or .h5py.
I could circumvent the memory problem by feeding only batches of, say, 1000 images and saving them to disk, but that does not solve the disk space problem.
How can I make the model smaller without losing too much information?
update
as the answer suggested, I included the next layer in the model as well:
base_model = InceptionV3(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
this reduced the output to (200000, 2048)
update 2:
another interesting solution may be the bcolz package, which reduces the size of numpy arrays: https://github.com/Blosc/bcolz
I see at least two solutions to your problem:
Apply model = AveragePooling2D((8, 8), strides=(8, 8))(model), where model is the output tensor of the InceptionV3 network you loaded (without top). As this is the next step in the InceptionV3 architecture, one may reasonably assume that these features still hold plenty of discriminative clues.
Apply some kind of dimensionality reduction (e.g. PCA) on a sample of the data and then reduce the dimensionality of all the data to get a reasonable file size (a sketch follows below).
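For the second option, here is a minimal sketch using scikit-learn's IncrementalPCA, which can be fit batch by batch so the full (200000, 131072) matrix never has to sit in memory; the batching helper, batch size, and number of components are assumptions:
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=256)  # number of kept components is an assumption

# First pass: fit the PCA on batches of extracted features.
for batch in batches_of_images(data, batch_size=1000):   # hypothetical batching helper
    feats = model.predict(batch).reshape(batch.shape[0], -1)
    ipca.partial_fit(feats)

# Second pass: transform and collect the reduced features batch by batch.
reduced = []
for batch in batches_of_images(data, batch_size=1000):
    feats = model.predict(batch).reshape(batch.shape[0], -1)
    reduced.append(ipca.transform(feats))

reduced = np.vstack(reduced)  # shape (n_images, 256) instead of (n_images, 131072)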