Tensorflow Dataset API: input pipeline with parquet files

I am trying to design an input pipeline with Dataset API. I am working with parquet files. What is a good way to add them to my pipeline?

We have released Petastorm, an open-source library that allows you to use Apache Parquet files directly via the TensorFlow Dataset API.
Here is a small example:
import tensorflow as tf
from petastorm.reader import Reader                      # module paths as of early Petastorm releases
from petastorm.tf_utils import make_petastorm_dataset

with Reader('hdfs://.../some/hdfs/path') as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)

Related

How to train tensorflow on sagemaker in script mode when the data resides in multiple files on s3?

I have a .npy file for each one of the training instances. All of these files are available on S3 in a train_data folder. I want to train a TensorFlow model on these training instances. To do that, I wish to spin up a separate AWS training instance for each training job, which can access the files from S3 and train the model on them. What changes to the training script are required for doing this?
I have the following config in the training script:
parser.add_argument('--gpu-count', type=int, default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--train_channel', type=str, default=os.environ['SM_CHANNELS'])
I have created the training estimator in a Jupyter instance as:
tf_estimator = TensorFlow(entry_point = 'my_model.py',
                          role = role,
                          train_instance_count = 1,
                          train_instance_type = 'local_gpu',
                          framework_version = '1.15.2',
                          py_version = 'py3',
                          hyperparameters = {'epochs': 1})
I am calling the fit function of the estimator as:
tf_estimator.fit({'train_channel':'s3://sagemaker-ml/train_data/'})
where train_data folder on S3 contains the .npy files of training instances.
But when I call the fit function, I get an error:
FileNotFoundError: [Errno 2] No such file or directory: '["train_channel"]/train_data_12.npy'
Not sure what I am missing here, as I can see the file mentioned above on S3.
SM_CHANNELS returns a list of channel names. What you're looking for is SM_CHANNEL_TRAIN_CHANNEL ("SM_CHANNEL" + your channel name), which provides the filesystem location for the channel:
parser.add_argument('--train_channel', type=str, default=os.environ['SM_CHANNEL_TRAIN_CHANNEL'])
docs: https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name
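For illustration, here is a minimal sketch of how the training script could then read the .npy files from that channel directory (the loading loop and variable names are assumptions, not from the original post):
import argparse
import glob
import os

import numpy as np

parser = argparse.ArgumentParser()
# SM_CHANNEL_TRAIN_CHANNEL points at the local directory where SageMaker
# has copied the contents of s3://sagemaker-ml/train_data/
parser.add_argument('--train_channel', type=str,
                    default=os.environ['SM_CHANNEL_TRAIN_CHANNEL'])
args, _ = parser.parse_known_args()

# Load every .npy training instance from the channel directory
train_files = sorted(glob.glob(os.path.join(args.train_channel, '*.npy')))
train_data = [np.load(f) for f in train_files]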

Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2TB of image data on gcloud storage. I saved the image data as separate TFRecords and tried to use the TensorFlow Data API following this example:
https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems like Keras' model.fit(...) doesn't support validation for TFRecord datasets, based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!
If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud Storage, with no need to download the data first:
# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)

model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
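For example (the GCS path below is a placeholder, not from the original answer):
# Validation dataset, built the same way as the training one
ds_val = tf.data.TFRecordDataset('gs://')  # path to validation TFRecords on GCS
ds_val = ds_val.batch(32)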
Final note: you'll need to specify the steps_per_epoch argument. A hack I use to get the total number of examples across all TFRecord files is to simply iterate over the files and count:
import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.

    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the file and counting.
    See https://stackoverflow.com/questions/40472139

    Args:
        record_list (list): list of GCS paths to TFRecords files
    """
    counter = 0
    for f in record_list:
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter
Which you can use to compute steps_per_epoch:
n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])
steps_per_epoch = n_train // batch_size
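Putting both pieces together, here is a hedged sketch of the final call (the validation path is a placeholder, and in the tf.keras of that era you typically also need validation_steps when validation_data is a dataset):
n_val = n_records(['gs://path-to-tfrecords/val_record'])  # hypothetical validation file
model.fit(ds_train,
          steps_per_epoch=steps_per_epoch,
          validation_data=ds_val,
          validation_steps=n_val // batch_size)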

How to augment data in tensorflow tfrecords?

I am storing my data using TFRecords and I read them as tensors using the Dataset API, and then I use the Estimator API to perform training. Now, I want to do online data augmentation on each item in the dataset, but after trying for a while I cannot find a way to do it. I want random flipping, random rotation, and other manipulations.
I am following the instructions given in this tutorial with a custom estimator (my CNN), and I am not sure where the data augmentation step occurs.
Using TFRecords doesn't prevent you from doing data augmentation.
Following the tutorial you linked in your comment, here is roughly what happens:
You create the dataset from the TFRecord files and parse each record to get an image and a label:
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
You can now apply a new preprocessing function to do some data augmentation during training:
# Only do it when we are training
if train:
    dataset = dataset.map(train_preprocess)
The train_preprocess function can be something like this:
def train_preprocess(image, label):
    flip_image = tf.image.random_flip_left_right(image)
    # Other transformations...
    return flip_image, label
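Since the question also asks for random rotation, here is a hedged sketch of an extended train_preprocess (this uses tf.image.rot90 for right-angle rotations; arbitrary angles would need something like tf.contrib.image.rotate in TF 1.x):
def train_preprocess(image, label):
    # Random horizontal flip
    image = tf.image.random_flip_left_right(image)
    # Random rotation by 0, 90, 180 or 270 degrees
    k = tf.random_uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)
    return image, label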

Equivalent of tf.SparseFeature in tf.data

The neural network I am currently working on accepts a sparse tensor as input. I am reading my data from a TFRecord as follows:
_, examples = tf.TFRecordReader(options=options).read_up_to(
    filename_queue, num_records=batch_size)
features = tf.parse_example(examples, features={
    'input_feat': tf.SparseFeature(index_key='input_feat_idx',
                                   value_key='input_feat_values',
                                   dtype=tf.int64,
                                   size=SIZE_FEATURE)})
It works like a charm, but I was looking at the tf.data API, which looks more convenient for a lot of tasks, and I am not sure how to read tf.SparseTensor objects the way I do with tf.TFRecordReader and tf.parse_example(). Any idea?
TensorFlow 1.5 will add native support for tf.SparseTensor in the core transformations. (This is currently available if you pip install tf-nightly, or build from source on the master branch of TensorFlow.) This means that you can write your pipeline as the following:
# Create a dataset of string records from the input files.
dataset = tf.data.TFRecordDataset(filenames)

# Convert each string record into a `tf.SparseTensor` representing a single example.
dataset = dataset.map(lambda record: tf.parse_single_example(
    record, features={'input_feat': tf.SparseFeature(index_key='input_feat_idx',
                                                     value_key='input_feat_values',
                                                     dtype=tf.int64,
                                                     size=SIZE_FEATURE)}))

# Stack together up to `batch_size` consecutive elements into a `tf.SparseTensor`
# representing a batch of examples.
dataset = dataset.batch(batch_size)

# Create an iterator to access the elements of `dataset` sequentially.
iterator = dataset.make_one_shot_iterator()

# `next_element` is a dict whose 'input_feat' entry is a `tf.SparseTensor`.
next_element = iterator.get_next()
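To sanity-check the pipeline, you can pull one batch in a session; the sparse feature comes back as a tf.SparseTensorValue (a quick sketch, assuming the graph above):
with tf.Session() as sess:
    batch = sess.run(next_element)
    # batch['input_feat'] is a tf.SparseTensorValue with indices, values and dense_shape
    print(batch['input_feat'].dense_shape)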

Use tflearn.DNN with google cloud ml-engine

Is there a good way to deploy a model built using the tflearn.DNN class to Google Cloud ML Engine? It seems like SavedModel requires input and output tensors to be defined in the prediction signature definition, but I am unsure how to get those from tflearn.DNN.
I figured this out later, at least for my specific case. The snippet below lets you export your DNN as a SavedModel, which can then be deployed to Google Cloud ML Engine.
The snippet takes the following arguments:
filename is the export directory
input_tensor is the input_data layer given to tflearn.DNN
output_tensor is the entire network passed to tflearn.DNN
session is an attribute of the object returned by tflearn.DNN
import os
import pickle

import tensorflow as tf

builder = tf.saved_model.builder.SavedModelBuilder(filename)
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'in': input_tensor}, outputs={'out': output_tensor})
builder.add_meta_graph_and_variables(session,
                                     [tf.saved_model.tag_constants.SERVING],
                                     signature_def_map={'serving_default': signature})
builder.save()

# Optional: ship extra metadata alongside the SavedModel in assets.extra
serving_vars = {
    'name': self.name
}
assets = filename + '/assets.extra'
os.makedirs(assets)
with open(assets + '/serve.pkl', 'wb') as f:
    pickle.dump(serving_vars, f, pickle.HIGHEST_PROTOCOL)
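For reference, here is a hedged sketch of where those arguments could come from in a typical tflearn model (the layer sizes and export directory are purely illustrative):
import tflearn

# Build the network as usual
input_tensor = tflearn.input_data(shape=[None, 784])
net = tflearn.fully_connected(input_tensor, 64, activation='relu')
net = tflearn.fully_connected(net, 10, activation='softmax')
output_tensor = tflearn.regression(net)   # the entire network passed to tflearn.DNN

model = tflearn.DNN(output_tensor)
# ... model.fit(...) ...

# The export snippet above would then be called with:
#   filename      = 'export/1'       (illustrative export directory)
#   input_tensor  = input_tensor     (the input_data layer)
#   output_tensor = output_tensor    (the network passed to tflearn.DNN)
#   session       = model.session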