What do you use to access CSV data on S3 and other object storage providers as a PyTorch Dataset? - amazon-s3

My dataset is stored as a collection of CSV files in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. I'd like to train a PyTorch model on this data, but the built-in Dataset classes do not provide native support for object storage services such as S3, Google Cloud Storage (GCS), or Azure Blob Storage. I checked the PyTorch documentation on the available Dataset classes here: https://pytorch.org/docs/stable/data.html# and it comes up short on public cloud object storage support.
It looks like I have to create my own custom Dataset following these instructions: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class but the effort seems overwhelming: I need to figure out how to download the data from object storage to a local node, parse the CSV files into PyTorch tensors, and then deal with the possibility of running out of disk space, since my dataset is hundreds of gigabytes.
Since PyTorch models are trained using gradient descent and I only need to keep a small batch of data (less than 1 GB) in memory at a time, is there a custom Dataset implementation that can help?

Check out ObjectStorageDataset, which supports object storage services like S3 and GCS: osds.readthedocs.io/en/latest/gcs.html
You can run
pip install osds
to install it and then point it at your S3 bucket to instantiate the PyTorch Dataset and DataLoader using something like
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader
ds = ObjectStorageDataset(f"gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv",
storage_options = {'anon' : False },
batch_size = 32768,
worker = 4,
eager_load_batches = False)
dl = DataLoader(ds, batch_size=None)
where you use your S3 location path instead of gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv. Your glob for S3 would be something like s3://<bucket name>/<object path>/*.csv, depending on the bucket and the prefix under which you store the CSV objects for your dataset.
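For example, a minimal sketch of what this might look like against S3, assuming a subset of the keyword arguments shown in the quoted example and a hypothetical bucket name and prefix:
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

# Hypothetical S3 location; substitute your own bucket name and prefix.
ds = ObjectStorageDataset("s3://my-bucket/csv-data/*.csv",
                          storage_options={'anon': False},   # use your AWS credentials
                          batch_size=32768,
                          eager_load_batches=False)          # don't pull the whole dataset up front
dl = DataLoader(ds, batch_size=None)  # the dataset already yields batches

# Iterate over mini-batches during training.
for batch in dl:
    ...  # forward/backward pass goes here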

Related

reading hdf5 file from s3 to sagemaker, is the whole file transferred?

I'm reading a file from my S3 bucket in a notebook in SageMaker Studio (same account) using the following code:
import s3fs
import h5py

dataset_path_in_h5 = "/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3url, 'rb'), 'r')  # s3url is the s3:// URL of the HDF5 file
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually happens behind the scenes: is the whole h5 file being transferred? That seems unlikely, since the code executes quite quickly while the whole file is 20 GB. Or is only the dataset at dataset_path_in_h5 transferred?
I suppose that if the whole file were transferred on each call, it could cost me a lot.
When you open the file, a file object is created. It has a tiny memory footprint. The dataset values aren't read into memory until you access them.
You are returning data as a NumPy array. That loads the entire dataset into memory. (NOTE: the .get() method you are using is deprecated. Current syntax is provided in the example.)
As an alternative to returning an array, you can create a dataset object (which also has a small memory footprint). When you do, the data is read into memory only as you need it. Dataset objects behave like NumPy arrays. (Whether to use a dataset object or a NumPy array depends on downstream usage: frequently you don't need an array, but sometimes one is required.) Also, if chunked I/O was enabled when the dataset was created, datasets are read in chunks.
Differences are shown below. Note that I used Python's file context manager to open the file. It avoids problems if the file isn't closed properly (you forget, or the program exits prematurely).
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url,'rb'), 'r') as h5_file:
# your way to get a numpy array -- .get() is depreciated:
data = h5_file.get(dataset_path_in_h5)
# this is the preferred syntax to return an array:
data_arr = h5_file[dataset_path_in_h5][()]
# this returns a h5py dataset object:
data_ds = h5_file[dataset_path_in_h5] # deleted [()]
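To illustrate the lazy behaviour of the dataset object, here is a short sketch (assuming the same file and dataset path as above) that reads only the metadata and a slice of the values into memory:
import s3fs
import h5py

s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    data_ds = h5_file[dataset_path_in_h5]   # no data values read yet
    print(data_ds.shape, data_ds.dtype)     # metadata only
    first_rows = data_ds[:1000]             # reads just this slice into a NumPy array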

Trouble reading Blob Storage File into Azure ML Notebook

I have an Excel file uploaded to my ML workspace.
I can access the file as an Azure FileDataset object. However, I don't know how to get it into a pandas DataFrame, since 'FileDataset' object has no attribute 'to_dataframe'.
Azure ML notebooks seem to make a point of avoiding pandas for some reason.
Does anyone know how to get blob files into pandas DataFrames from within Azure ML notebooks?
To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded into a pandas DataFrame.
Here are the steps to follow for this procedure:
Download the data from Azure Blob storage with the following Python code sample using the Blob service. Replace the variables in the following code with your specific values:
from azure.storage.blob import BlobServiceClient
import pandas as pd
import time

STORAGEACCOUNTURL = "<storage_account_url>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"

# download from blob
t1 = time.time()
blob_service_client_instance = BlobServiceClient(
    account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(
    CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)
t2 = time.time()
print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
Read the data into a pandas DataFrame from the downloaded file.
# LOCALFILENAME is the local file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
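Since the file in the question is an Excel workbook rather than a CSV, a hedged alternative is to use pandas' Excel reader on the downloaded file instead (this assumes the openpyxl engine is installed in the notebook environment):
# For an Excel (.xlsx) file, read the downloaded file with pandas' Excel reader;
# assumes the openpyxl engine is available in the environment.
dataframe_blobdata = pd.read_excel(LOCALFILENAME, engine="openpyxl")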
For more details you can follow this link

Stream private data to google collab TPUs from GCS

So I'm trying to make a photo classifier with 150 classes, and I'm trying to run it on Google Colab TPUs. I understood that I need a tfds with try_gcs = True for that, and that I need to put the dataset on Google Cloud Storage. So I converted a generator to a tf.data dataset and stored it locally using
my_tf_ds = tf.data.Dataset.from_generator(
    datafeeder.allGenerator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150,), dtype=tf.float32)))
tf.data.experimental.save(my_tf_ds, filename)
Then I sent it to my bucket on GCS.
But when I try to load it from my bucket with
import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons",data_dir = "gs://dataset-7000")
It doesn't work and instead lists the available registered datasets, like:
- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc
that are not on my GCS bucket.
When loading it myself locally:
tfds_from_file = tf.data.experimental.load(filename, element_spec=(
    tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(150,), dtype=tf.float32)))
it works, the dataset is fine.
So I don't understand why I can't read it from GCS. Can we read private datasets from GCS, or only the already-defined TFDS datasets? I also granted the Storage Legacy Bucket Reader role on my bucket to the public.
I think the data_dir argument to tfds.load is where the module stores things locally on your device, and try_gcs controls whether to stream the data or not. So data_dir cannot be used to point the module at your GCS bucket.
Here are some ideas you could try:
You could try these steps to add your dataset to TFDS, and then you should be able to load it using tfds.load.
You could get a dataset in the right format using tf.data.experimental.save (which I think you've already done), save it to GCS, and then load it using tf.data.experimental.load, which you said is already working for you locally (see the sketch after this list). You could follow these steps to install gcsfuse and use that to download your dataset to Colab from GCS.
You could try TFRecord to load your dataset. Here is a codelab with an explanation, and here is a Colab example that's linked in the codelab.
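On the second idea: TensorFlow's file system layer can generally read gs:// paths directly, so one thing worth trying is pointing tf.data.experimental.load straight at the bucket. A sketch, assuming the element_spec from the question, a hypothetical gs://dataset-7000/pokemons path where the saved dataset lives, and that the Colab/TPU runtime has read access to the bucket:
import tensorflow as tf

# Hypothetical GCS path to the directory written by tf.data.experimental.save
gcs_path = "gs://dataset-7000/pokemons"

ds = tf.data.experimental.load(
    gcs_path,
    element_spec=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150,), dtype=tf.float32)))

# Typical input pipeline before handing the dataset to the TPU training loop.
ds = ds.batch(128).prefetch(tf.data.AUTOTUNE)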

Having issues reading S3 bucket when transitioning a tensorflow model from local machine to AWS SageMaker

When testing on a local machine in Python I would normally use the following to read a training set with sub-directories of all the classes and files/class:
train_path = r"C:\temp\coins\PCGS - Gold\train"
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0','1',2','3' etc...], batch_size=32)
Found 4100 images belonging to 22 classes.
but on AWS SageMaker's Jupyter notebook I am now pulling the files from an S3 bucket. I tried the following:
bucket = "coinpath"
train_path = 's3://{}/{}/train'.format(bucket, "v1")  # note that the directory structure is coinpath/v1/train, where coinpath is the bucket
train_batches = ImageDataGenerator().flow_from_directory(
    train_path, target_size=(100, 100),
    classes=['0', '1', '2', '3'],  # etc. for all 22 classes
    batch_size=32)
but I get: Found 0 images belonging to 22 classes.
Looking for some guidance on the right way to pull training data from S3.
From Ideal way to read data in bucket stored batches of data for Keras ML training in Google Cloud Platform?: "ImageDataGenerator.flow_from_directory() currently does not allow you to stream data directly from a GCS bucket." The same limitation applies to S3.
I had to download the images from S3 first. This is best for latency reasons as well.
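For reference, a minimal sketch of that download step using boto3, assuming the coinpath bucket and v1/train prefix from the question and a local ./train directory; afterwards flow_from_directory can point at the local copy:
import os
import boto3
from tensorflow.keras.preprocessing.image import ImageDataGenerator

s3 = boto3.client('s3')
bucket = "coinpath"
prefix = "v1/train/"
local_root = "./train"

# Copy every object under the prefix to a matching local path.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip "directory" placeholder keys
        local_path = os.path.join(local_root, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)

# The generator now sees the usual class-per-subdirectory layout on local disk.
train_batches = ImageDataGenerator().flow_from_directory(
    local_root, target_size=(100, 100), batch_size=32)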

How to make prediction with sagemaker on pandas dataframe

I am using SageMaker to train and deploy my machine learning model. As for prediction, it will be executed by a Lambda function as a scheduled job (every hour). The process is as follows:
pull new data from S3 since last prediction
preprocess, aggregate and create prediction data set
call sagemaker endpoint and make prediction
either save result to s3 or insert to database table
Based on my findings, typically the input will either come from the Lambda payload
data = json.loads(json.dumps(event))
payload = data['data']
print(payload)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
or be read from an S3 file:
my_bucket = 'pred_data'  # substitute this for your s3 bucket name
obj = client.get_object(Bucket=my_bucket, Key='foo.csv')
body = obj['Body'].read().decode('utf-8')
lines = body.splitlines()
reader = csv.reader(lines)
file = io.StringIO(body)
response = runtime.invoke_endpoint(EndpointName=ENDPOINT,
                                   ContentType='*/*',
                                   Body=file.getvalue())
output = response['Body'].read().decode('utf-8')
Since I will be pulling raw data from S3 and preprocessing it, a pandas DataFrame will be generated. Is it possible to feed this directly as the input to invoke_endpoint? I could upload the aggregated dataset to another S3 bucket, but does it have to go through the decoding, csv.reader, StringIO and all that, just like the examples I found, or is there an easier way to do it? Is the decode step really necessary to get the output?
You can send whatever payload you want when you call InvokeEndpoint and in whatever format. You can control the contract on either side (assuming your model supports it). If you are using a model that you didn't create, look to see if it supports pre/post processing which would allow you to define the contract yourself.
In addition to this, one thing we often see customers do is perform the processing within the model itself, rather than before calling SageMaker's InvokeEndpoint. A common use case is to accept the S3 path of the object you need to run predictions on when you call InvokeEndpoint. The model is then responsible for downloading the S3 object, transforming it, and running the inference on that data.
Depending on the InvokeEndpoint response, it can do the same and the model can upload it to S3 and just send the S3 key back as a response. This might not be what you are looking to do but it's just an additional example of the flexibility you have when using SageMaker.
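To answer the DataFrame part of the question directly: a hedged sketch (assuming your endpoint accepts text/csv input; the endpoint name here is hypothetical) is to serialize the DataFrame to a CSV string in memory and pass that as the Body, with no temporary file, csv.reader, or StringIO needed. The decode step is only there to turn the response's byte stream into a string:
import boto3
import pandas as pd

runtime = boto3.client('runtime.sagemaker')

# df is the preprocessed/aggregated pandas DataFrame
df = pd.DataFrame([[0.5, 1.2, 3.4], [0.1, 2.2, 0.9]])

# Serialize the frame to CSV in memory (no header/index, which most
# SageMaker built-in algorithms expect for text/csv input).
csv_payload = df.to_csv(header=False, index=False)

response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',   # hypothetical endpoint name
    ContentType='text/csv',
    Body=csv_payload)

# The response Body is a streaming object of bytes; decode it to get the prediction as text.
output = response['Body'].read().decode('utf-8')
print(output)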