Having issues reading an S3 bucket when transitioning a TensorFlow model from a local machine to AWS SageMaker

When testing on a local machine in Python I would normally use the following to read a training set with sub-directories of all the classes and files/class:
train_path = r"C:\temp\coins\PCGS - Gold\train"
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0', '1', '2', '3', ...], batch_size=32)
Found 4100 images belonging to 22 classes.
but on AWS SageMaker's Jupyter notebook I am now pulling the files from an S3 bucket. I tried the following:
bucket = "coinpath"
train_path = 's3://{}/{}/train'.format(bucket, "v1") #note that the directory structure is coinpath/v1/train where coinpath is the bucket
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100),
                                                         classes=['0', '1', '2', '3', ...], batch_size=32)
but I get: Found 0 images belonging to 22 classes.
Looking for some guidance on the right way to pull training data from S3.

From "Ideal way to read data in bucket stored batches of data for Keras ML training in Google Cloud Platform?": "ImageDataGenerator.flow_from_directory() currently does not allow you to stream data directly from a GCS bucket."
I had to download the images from S3 first. This is best for latency reasons as well.
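A minimal sketch of that workaround, assuming the coinpath bucket and v1/train prefix from the question, boto3 credentials already configured on the notebook instance, and an illustrative local destination directory:
import os
import boto3
from tensorflow.keras.preprocessing.image import ImageDataGenerator

s3 = boto3.resource("s3")
bucket = s3.Bucket("coinpath")      # bucket name from the question
local_train = "/tmp/coins/train"    # illustrative local destination

# Mirror s3://coinpath/v1/train/<class>/<image> locally, preserving the
# class sub-directories that flow_from_directory expects.
for obj in bucket.objects.filter(Prefix="v1/train/"):
    if obj.key.endswith("/"):
        continue  # skip "directory" placeholder keys
    dest = os.path.join(local_train, os.path.relpath(obj.key, "v1/train"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    bucket.download_file(obj.key, dest)

train_batches = ImageDataGenerator().flow_from_directory(
    local_train, target_size=(100, 100), batch_size=32)
With the class sub-directories intact, flow_from_directory can infer the 22 classes without the explicit classes list.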

Related

Stream private data to Google Colab TPUs from GCS

So I'm trying to make a photo classifier with 150 classes and run it on Google Colab TPUs. I understood that I need a TFDS dataset with try_gcs = True for that, which means I need to put the dataset on Google Cloud Storage. So I converted a generator to a tf.data dataset and stored it locally using
my_tf_ds = tf.data.Dataset.from_generator(
    datafeeder.allGenerator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150), dtype=tf.float32)))
tf.data.experimental.save(my_tf_ds, filename)
Then I sent it to my bucket on GCS.
But when I try to load it from my bucket with
import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons", data_dir="gs://dataset-7000")
It doesn't work and instead lists the available registered datasets, such as:
- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc
that are not on my GCS bucket.
When loading it myself from local:
tfds_from_file = tf.data.experimental.load(filename, element_spec=(
    tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(150), dtype=tf.float32)))
it works, the dataset is fine.
So I don't understand why I can't read it from GCS. Can we read private datasets on GCS, or only the predefined ones? I also granted the Storage Legacy Bucket Reader role on my bucket to the public.
I think the data_dir argument to tfds.load is where the module stores things locally on your device, and try_gcs only controls whether the data is streamed or not. So data_dir cannot be used to point the module at your GCS bucket.
Here are some ideas you could try:
You could try these steps to add your dataset to TFDS and then you should be able to load it using tfds.load
You could get a dataset in the right format using tf.data.experimental.save (which I think you've already done), save it to GCS, and then load it using tf.data.experimental.load, which you said is already working for you locally; a sketch of this follows the list. You could follow these steps to install gcsfuse and use that to download your dataset to Colab from GCS.
You could try TFRecord to load your dataset. Here is a codelab with an explanation, and here is a Colab example that's linked from the codelab.
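A minimal sketch of the save/load idea, assuming the gs://dataset-7000 bucket from the question, a Colab runtime authenticated against GCP, and the my_tf_ds dataset built above (TensorFlow's file system layer can read and write gs:// paths directly; the object path is illustrative):
import tensorflow as tf

element_spec = (
    tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(150,), dtype=tf.float32),
)

# Save the dataset straight into the bucket (or copy a local save there with gsutil).
gcs_path = "gs://dataset-7000/pokemons"
tf.data.experimental.save(my_tf_ds, gcs_path)

# Later, e.g. on the Colab TPU runtime, load it back from the same path.
restored = tf.data.experimental.load(gcs_path, element_spec=element_spec)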

What do you use to access CSV data on S3 and other object storage providers as a PyTorch Dataset?

My dataset is stored as a collection of CSV files in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. I'd like to train a PyTorch model on this data, but the built-in Dataset classes do not provide native support for object storage services like S3, Google Cloud Storage (GCS), Azure Blob storage, and so on. I checked the PyTorch documentation on the available Dataset classes here https://pytorch.org/docs/stable/data.html# and it comes up short on public cloud object storage support.
It looks like I have to create my own custom Dataset according to these instructions: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class, but the effort seems overwhelming: I would need to figure out how to download data from object storage to a local node, parse the CSV files into PyTorch tensors, and then deal with the possibility of running out of disk space, since my dataset is hundreds of gigabytes.
Since PyTorch models are trained using gradient descent and I only need to store just a small batch of data (less than 1GB) in memory at once, is there a custom dataset implementation that can help?
Check out ObjectStorageDataset, which supports object storage services like S3 and GCS: osds.readthedocs.io/en/latest/gcs.html
You can run
pip install osds
to install it and then point it at your S3 bucket to instantiate the PyTorch Dataset and DataLoader using something like
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

ds = ObjectStorageDataset(f"gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv",
                          storage_options={'anon': False},
                          batch_size=32768,
                          worker=4,
                          eager_load_batches=False)
dl = DataLoader(ds, batch_size=None)
where you use your S3 location path instead of gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv. So your glob for S3 would be something like s3://<bucket name>/<object path>/*.csv depending on the bucket and the bucket directory where you store your CSV objects for the dataset.
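For instance, a hedged sketch of the S3 variant, with a hypothetical bucket and prefix standing in for your own (the keyword arguments mirror the GCS example above):
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

# s3:// glob instead of the gcs:// example; bucket and prefix are illustrative.
ds = ObjectStorageDataset("s3://my-bucket/csv-data/*.csv",
                          storage_options={'anon': False},  # use real AWS credentials
                          batch_size=32768,
                          worker=4,
                          eager_load_batches=False)
dl = DataLoader(ds, batch_size=None)  # batching is handled by the dataset itself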

How to make prediction with sagemaker on pandas dataframe

I am using SageMaker to train and deploy my machine learning model. Prediction will be executed by a Lambda function as a scheduled job (every hour). The process is as follows:
pull new data from S3 since last prediction
preprocess, aggregate and create prediction data set
call sagemaker endpoint and make prediction
either save result to s3 or insert to database table
Based on my findings, the input will typically come either from the Lambda payload:
data = json.loads(json.dumps(event))
payload = data['data']
print(payload)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
or be read from an S3 file:
bucket_name = 'pred_data'  # substitute your S3 bucket name
obj = client.get_object(Bucket=bucket_name, Key='foo.csv')
body = obj['Body'].read().decode('utf-8')
lines = body.splitlines()
reader = csv.reader(lines)   # row-wise access, if needed
file = io.StringIO(body)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT,
                                   ContentType='*/*',
                                   Body=file.getvalue())
output = response['Body'].read().decode('utf-8')
Since I will be pulling raw data from S3 and preprocessing it, a pandas DataFrame will be generated. Is it possible to feed this directly as the input of invoke_endpoint? I could upload the aggregated dataset to another S3 bucket, but does it have to go through the decoding, csv.reader, StringIO and all that, just like the example I found, or is there an easier way to do it? Is the decode step really necessary to get the output?
You can send whatever payload you want when you call InvokeEndpoint and in whatever format. You can control the contract on either side (assuming your model supports it). If you are using a model that you didn't create, look to see if it supports pre/post processing which would allow you to define the contract yourself.
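For instance, here is a minimal sketch of sending a pandas DataFrame as a CSV payload, assuming your serving container accepts text/csv (the endpoint name is illustrative):
import boto3
import pandas as pd

runtime = boto3.client('sagemaker-runtime')

def predict(df: pd.DataFrame, endpoint_name: str) -> str:
    # Serialize the DataFrame to CSV in memory; no intermediate S3 upload needed.
    payload = df.to_csv(header=False, index=False)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/csv',
                                       Body=payload)
    # The response body is a StreamingBody, so the read/decode step is still needed.
    return response['Body'].read().decode('utf-8')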
In addition to this, one thing we often see customers do is to do processing within the model instead of before calling SageMaker's InvokeEndpoint. A common use case is to accept the S3 path of the object you need to do predictions on when you call InvokeEndpoint. Then the model would be responsible for downloading the S3 item and transforming it and then running the inference on that data.
For the InvokeEndpoint response the model can do the same in reverse: upload the result to S3 and just send the S3 key back as the response. This might not be what you are looking to do, but it's just an additional example of the flexibility you have when using SageMaker.
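A rough sketch of that pattern, assuming a framework serving container that allows a custom input_fn; the endpoint name, bucket, and key are all hypothetical:
import io

import boto3
import pandas as pd

# Client side: send only the S3 location of the data to score.
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName='my-endpoint',
                                   ContentType='text/plain',
                                   Body='s3://my-bucket/input/foo.csv')

# Model side: a custom input_fn that downloads and parses the referenced object.
def input_fn(request_body, content_type):
    if isinstance(request_body, bytes):
        request_body = request_body.decode('utf-8')
    bucket, key = request_body.replace('s3://', '').split('/', 1)
    obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj['Body'].read()))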

Using your own evaluation and training set in cloud-ml-engine sample

In the flowers tutorial by Google here: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial
we used the following command for preprocessing the data:
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
--output_path "${GCS_PATH}/preproc/train" \
--cloud
I understand we could replace the CSV file with our own list and hence train with a different set of images; however, creating a CSV file for over 100 types of images will be cumbersome. Is there a way to overcome this?
The train_set.csv is a list of file paths in Google Cloud Storage and the prediction label.
This is a part of the file:
gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/284497199_93a01f48f6.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/3554992110_81d8c9b0bd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/4065883015_4bb6010cb7_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/7420699022_60fa574524_m.jpg,roses
gs://cloud-ml-data/img/flower_photos/dandelion/4558536575_d43a611bd4_n.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/7568630428_8cf0fc16ff_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/tulips/7064813645_f7f48fb527.jpg,tulips
gs://cloud-ml-data/img/flower_photos/sunflowers/4933229095_f7e4218b28.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/21518663809_3d69f5b995_n.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/dandelion/15782158700_3b9bf7d33e_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/tulips/8713398906_28e59a225a_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/tulips/6770436217_281da51e49_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/dandelion/8754822932_948afc7cef.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/22873310415_3a5674ec10_m.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/5967283168_90dd4daf28_n.jpg,sunflowers
So you will have to collect a set of images for your own train set, upload them to GCS, and classify them. Then you just have to retrieve the list of paths (which can easily be achieved with the gsutil ls command) and concatenate each path with its classification label, as in the sketch below.
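A minimal sketch of generating such a CSV, assuming the images are already uploaded under gs://<your-bucket>/img/<label>/<file> so each label can be read from the parent directory (bucket and paths are illustrative):
import csv
import subprocess

bucket_prefix = "gs://your-bucket/img"   # one sub-directory per label

# gsutil expands the wildcard itself, so no shell is needed.
result = subprocess.run(["gsutil", "ls", f"{bucket_prefix}/*/*.jpg"],
                        capture_output=True, text=True, check=True)
paths = result.stdout.split()

with open("train_set.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path in paths:
        label = path.rstrip("/").split("/")[-2]   # parent directory name is the label
        writer.writerow([path, label])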

Running distributed Tensorflow on Google Cloud ML engine ClusterSpec

I am trying to run a large distributed TensorFlow model on Google Cloud's ML Engine and am having trouble understanding what should go into tf.train.ClusterSpec.
When you run a job on Google Cloud you can select the scale tier from BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU or CUSTOM, each giving you access to a different type of cluster. However, I can't find the names/addresses of the machines in these clusters.
Please take a look at the documentation and sample here. You should set ClusterSpec using the environment variable TF_CONFIG; e.g.
tf_config = os.environ.get('TF_CONFIG')

# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)

tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
...
cluster_spec = tf.train.ClusterSpec(cluster)
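For context, here is a hedged sketch of what the TF_CONFIG environment variable typically contains on ML Engine and how its cluster section maps onto ClusterSpec; the host addresses are illustrative, since the service fills them in for you:
import json
import os

import tensorflow as tf

# Example of the JSON the service places in TF_CONFIG (addresses are illustrative).
example_tf_config = {
    "cluster": {
        "master": ["master-0:2222"],
        "ps": ["ps-0:2222", "ps-1:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

tf_config = json.loads(os.environ.get("TF_CONFIG", json.dumps(example_tf_config)))
cluster_spec = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

# Each replica starts a server for its own task and uses the spec to find the others
# (tf.train.Server in TF 1.x; tf.distribute.Server in TF 2.x).
server = tf.train.Server(cluster_spec,
                         job_name=task["type"],
                         task_index=task["index"])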