Stream private data to Google Colab TPUs from GCS - tensorflow

So I'm trying to make a photo classifier with 150 classes and run it on Google Colab TPUs. As I understand it, I need a TFDS dataset with try_gcs = True for that, and for that I need to put the dataset on Google Cloud Storage. So I converted a generator to a tf.data.Dataset and stored it locally using
my_tf_ds = tf.data.Dataset.from_generator(
    datafeeder.allGenerator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150,), dtype=tf.float32)))
tf.data.experimental.save(my_tf_ds, filename)
Then I sent it to my bucket on GCS.
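For reference, the upload was just a recursive copy of the saved directory with gsutil (the local directory name below is only a placeholder; the bucket is the one from the tfds.load call further down):
gsutil -m cp -r ./pokemons gs://dataset-7000/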
But when I try to load it from my bucket with
import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons",data_dir = "gs://dataset-7000")
It doesn't work; instead it lists the available registered datasets, like:
- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc
that are not on my GCS bucket.
When I load it myself from the local copy:
tfds_from_file = tf.data.experimental.load(filename, element_spec=(
    tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(150,), dtype=tf.float32)))
it works, the dataset is fine.
So I don't understand why I can't read it from GCS. Can we read private datasets on GCS, or only the already defined TFDS datasets? I also gave the Storage Legacy Bucket Reader role on my bucket to the public.

I think the data_dir argument to tfds.load is where the module stores things locally on your device, and try_gcs controls whether to stream the data from the public TFDS bucket or not. So data_dir cannot be used to point the module at your own GCS bucket.
Here are some ideas you could try:
- You could follow these steps to add your dataset to TFDS, and then you should be able to load it using tfds.load.
- You could save a dataset in the right format using tf.data.experimental.save (which I think you've already done), upload it to GCS, and then load it with tf.data.experimental.load, which you said is already working for you locally; see the sketch after this list. You could also follow these steps to install gcsfuse and use it to download your dataset to Colab from GCS.
- You could use TFRecord files to load your dataset. Here is a codelab with an explanation, and here is a Colab example that's linked in the codelab.
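For the second option, here is a minimal sketch of loading the saved dataset straight from the bucket. It assumes the directory written by tf.data.experimental.save was uploaded to gs://dataset-7000/pokemons (a guess based on your paths) and that the Colab runtime is authenticated to read the bucket:

import tensorflow as tf

element_spec = (
    tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(150,), dtype=tf.float32))

# tf.data's filesystem layer understands gs:// paths, so the saved dataset can
# be loaded directly from GCS instead of from a local file.
ds_from_gcs = tf.data.experimental.load("gs://dataset-7000/pokemons",
                                        element_spec=element_spec)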

Related

What do you use to access CSV data on S3 and other object storage providers as a PyTorch Dataset?

My dataset is stored as a collection of CSV files in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. I'd like to train a PyTorch model on this data, but the built-in Dataset classes do not provide native support for object storage services like S3, Google Cloud Storage (GCS), Azure Blob Storage, and so on. I checked the PyTorch documentation on the available Dataset classes here https://pytorch.org/docs/stable/data.html# and it comes up short when it comes to public cloud object storage support.
It looks like I have to create my own custom Dataset according to the following instructions: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class but the effort seems overwhelming: I would need to figure out how to download data from the object storage to a local node, parse the CSV files into PyTorch tensors, and then deal with the possibility of running out of disk space, since my dataset is hundreds of GBs.
Since PyTorch models are trained using gradient descent and I only need to store a small batch of data (less than 1 GB) in memory at once, is there a custom dataset implementation that can help?
Check out ObjectStorageDataset, which has support for object storage services like S3 and GCS: osds.readthedocs.io/en/latest/gcs.html
You can run
pip install osds
to install it and then point it at your S3 bucket to instantiate the PyTorch Dataset and DataLoader using something like
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader

ds = ObjectStorageDataset("gs://cloud-training-demos/taxifare/large/taxi-train*.csv",
                          storage_options={'anon': False},
                          batch_size=32768,
                          worker=4,
                          eager_load_batches=False)
dl = DataLoader(ds, batch_size=None)
where you use your S3 location path instead of gs://cloud-training-demos/taxifare/large/taxi-train*.csv. So your glob for S3 would be something like s3://<bucket name>/<object path>/*.csv, depending on the bucket and the bucket directory where you store the CSV objects for your dataset.
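Once instantiated, iterating the DataLoader streams one batch at a time, so only roughly one batch needs to fit in memory. A minimal usage sketch (assuming, per my reading of the osds docs, that each iteration yields a single tensor of parsed CSV rows; the exact shape depends on your CSV schema):

for batch in dl:
    print(batch.shape)  # something like (32768, number_of_csv_columns)
    break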

Having issues reading S3 bucket when transitioning a tensorflow model from local machine to AWS SageMaker

When testing on a local machine in Python I would normally use the following to read a training set with sub-directories of all the classes and files/class:
train_path = r"C:\temp\coins\PCGS - Gold\train"
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0','1','2','3', etc...], batch_size=32)
Found 4100 images belonging to 22 classes.
but on AWS SageMaker's Jupyter notebook I am now pulling the files from an S3 bucket. I tried the following:
bucket = "coinpath"
train_path = 's3://{}/{}/train'.format(bucket, "v1") #note that the directory structure is coinpath/v1/train where coinpath is the bucket
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0','1','2','3', etc...], batch_size=32)
but I get: Found 0 images belonging to 22 classes.
Looking for some guidance on the right way to pull training data from S3.
From Ideal way to read data in bucket stored batches of data for Keras ML training in Google Cloud Platform?: "ImageDataGenerator.flow_from_directory() currently does not allow you to stream data directly from a GCS bucket."
I had to download the images from S3 first, which is best for latency reasons as well. A rough sketch of that approach is below.
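As an illustration, here is a hedged sketch of downloading the training images from S3 to local disk with boto3 and then pointing ImageDataGenerator at the local copy. The bucket and prefix are taken from the question, the local path is a placeholder, and the class subdirectories are assumed to sit under coinpath/v1/train/:

import os
import boto3
from tensorflow.keras.preprocessing.image import ImageDataGenerator

bucket = "coinpath"
prefix = "v1/train/"            # assumed layout: coinpath/v1/train/<class>/<image>
local_dir = "/tmp/coins/train"  # placeholder local destination

# Copy every object under the prefix to a matching local path.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "directory" placeholder keys
        target = os.path.join(local_dir, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, key, target)

# The usual directory-based flow now works on the local copy; class names are
# inferred from the subdirectory names.
train_batches = ImageDataGenerator().flow_from_directory(
    local_dir, target_size=(100, 100), batch_size=32)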

If you use Tensorflow Dataset do you have to upload your data?

I have been looking at TensorFlow Datasets (TFDS) and it seems really useful. The only issue I can see is that if you want to use it, you have to upload your dataset publicly to TFDS. Is that correct?
Is there any way of using TFDS on a private server, only for internal use?
You can easily create new local datasets, as described in the documentation. Here's an excerpt:
class MyDataset(tfds.core.GeneratorBasedBuilder):
  """DatasetBuilder for my_dataset dataset."""

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release.',
  }

  def _info(self) -> tfds.core.DatasetInfo:
    """Dataset metadata (homepage, citation,...)."""
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Image(shape=(256, 256, 3)),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
    )

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Download the data and define splits."""
    extracted_path = dl_manager.download_and_extract('http://data.org/data.zip')
    # dl_manager returns pathlib-like objects with `path.read_text()`,
    # `path.iterdir()`,...
    return {
        'train': self._generate_examples(path=extracted_path / 'train_images'),
        'test': self._generate_examples(path=extracted_path / 'test_images'),
    }

  def _generate_examples(self, path) -> Iterator[Tuple[Key, Example]]:
    """Generator of examples for each split."""
    for img_path in path.glob('*.jpeg'):
      # Yields (key, example)
      yield img_path.name, {
          'image': img_path,
          'label': 'yes' if img_path.name.startswith('yes_') else 'no',
      }
If you want your dataset to appear in TFDS, you have to put your data in your own GCS bucket or on your drive. If your data can only be downloaded after some registration or with a password, it can still be implemented in TFDS, but it will not be downloaded and extracted automatically by the tfds data pipelines; users have to follow the manual download instructions, as given here.
There is also a public TFDS GCS bucket to which you can upload your data, but first you have to contact the TFDS team (they have to check whether it is appropriate to host your data on the TFDS GCS bucket). Putting data on the TFDS GCS bucket helps a lot: it avoids issues with unreliable download servers, and users can load your data directly with the tfds.load API.
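For purely internal use you don't need either of those. Once a builder like MyDataset above is defined and importable, a sketch of building it into a private location and loading it back (the paths below are placeholders) looks like this:

import tensorflow_datasets as tfds

# Build the dataset into a private directory (a local path or your own GCS bucket).
builder = MyDataset(data_dir='/path/to/private/tfds_data')  # e.g. 'gs://my-bucket/tfds'
builder.download_and_prepare()

# Later, load it by name from the same data_dir; no public upload is involved.
ds = tfds.load('my_dataset', split='train', data_dir='/path/to/private/tfds_data')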

Accessing already downloaded dataset with tensorflow_datasets API

I am trying to work with the quite recently published tensorflow_datasets API to train a Keras model on the Open Images Dataset. The dataset is about 570 GB in size. I downloaded the data with the following code:
import tensorflow_datasets as tfds
import tensorflow as tf
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")
After the download was complete, the connection to my Jupyter notebook was somehow interrupted, but the extraction seemed to have finished as well; at least all downloaded files had a counterpart in the "extracted" folder. However, I am not able to access the downloaded data now:
tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)
This only gives the following error:
AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
When I call the function download_and_prepare() it only downloads the whole dataset again.
Am I missing something here?
Edit:
After the download the folder under "extracted" has 18 .tar.gz files.
This is with tensorflow-datasets 1.0.1 and tensorflow 2.0.
The folder hierarchy should be like this:
/notebooks/open_images_dataset/extracted/open_images_v4/0.1.0
All the datasets have a version. Then the data can be loaded like this:
ds = tfds.load('open_images_v4', data_dir='/notebooks/open_images_dataset/extracted', download=False)
I didn't have open_images_v4 data. I put cifar10 data into a folder named open_images_v4 to check what folder structure tensorflow_datasets was expecting.
The solution to this was to also use the "data_dir" parameter when initializing the dataset:
builder = tfds.image.OpenImagesV4(data_dir="/raid/openimages/dataset")
builder.download_and_prepare(download_dir="/raid/openimages/dataset")
This way the dataset is downloaded and extracted in the same directory. Before, it was (without me noticing) extracting to the default directory, which is under /home/.../. That's what caused the error: there wasn't enough space left under my home directory.
After the extraction, the folder structure is exactly as Manoj-Mohan described.
The above solution didn't work for me. This did:
builder = tfds.builder(name='folder_name', data_dir=data_dir)
builder.download_and_prepare(download_dir="/home/...")
ds = builder.as_dataset()

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but I can't get them to just output the rows to a np.array.
Is it possible?
(Why not just read a file with open? I'm training on GCP and want to get the file from Cloud Storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API, which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
    file_contents = f.read()
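For the label_keys use case from the question, a minimal sketch (assuming one label per line in the file; the path is a placeholder):

import numpy as np
import tensorflow as tf

# Read the whole file from GCS and split it into one label per line.
with tf.gfile.GFile("gs://my-bucket/label_keys.txt") as f:
    label_keys = np.array(f.read().splitlines())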