If you use Tensorflow Dataset do you have to upload your data? - tensorflow2.0

I have been looking at Tensorflow Dataset (TFDS) and seems really useful. The only issue I can see is that if you want to use it, you have to upload your dataset publicly to TFDS. Is that correct?
Is there anyway of using TFDS on a private server? Only for internal use?

You can easily create new local datasets, as described in the documentation. Here's an excerpt:
class MyDataset(tfds.core.GeneratorBasedBuilder):
"""DatasetBuilder for my_dataset dataset."""
VERSION = tfds.core.Version('1.0.0')
RELEASE_NOTES = {
'1.0.0': 'Initial release.',
}
def _info(self) -> tfds.core.DatasetInfo:
"""Dataset metadata (homepage, citation,...)."""
return tfds.core.DatasetInfo(
builder=self,
features=tfds.features.FeaturesDict({
'image': tfds.features.Image(shape=(256, 256, 3)),
'label': tfds.features.ClassLabel(names=['no', 'yes']),
}),
)
def _split_generators(self, dl_manager: tfds.download.DownloadManager):
"""Download the data and define splits."""
extracted_path = dl_manager.download_and_extract('http://data.org/data.zip')
# dl_manager returns pathlib-like objects with `path.read_text()`,
# `path.iterdir()`,...
return {
'train': self._generate_examples(path=extracted_path / 'train_images'),
'test': self._generate_examples(path=extracted_path / 'test_images'),
}
def _generate_examples(self, path) -> Iterator[Tuple[Key, Example]]:
"""Generator of examples for each split."""
for img_path in path.glob('*.jpeg'):
# Yields (key, example)
yield img_path.name, {
'image': img_path,
'label': 'yes' if img_path.name.startswith('yes_') else 'no',
}

If you want to see your datasets in TFDS you have to put your data in your GCS bucket or on your drive, but if you want that your data can be downloaded using some registration or password then your data also be implemented in TFDS but it was not directly downloaded and extracted by tfds data pipelines, user have to follow manually downloading instructions as given here.
Also there is a public GCS bucket of tfds in which you can also upload your data but before it you have to contact with members of TFDS (as they have to check if it's good to upload your data in tfds gcs), by putting data on TFDS GCS helps a lot it avoid some issues of bad servers as well as user can directly load your data using tfds.load API.

Related

when trying to load external tfrecord with TFDS, given tf.train.Example, how to get tfds.features?

What I need help with / What I was wondering
Hi, I am trying to load external tfrecord files with TFDS. I have read the official doc here, and find I need to define the feature structure using tfds.features. However, since the tfrecords files are alreay generated, I do not have control the generation pipeline. I do, however, know the tf.train.Example structre used in TFRecordWriter during generation, shown as follows.
from tensorflow.python.training.training import BytesList, Example, Feature, Features, Int64List
dict(Example=Features({
'image': Feature(bytes_list=BytesList(value=[img_str])), # img_str is jpg encoded image raw bytes
'caption': Feature(bytes_list=BytesList(value=[caption])), # caption is a string
'height': Feature(bytes_list=Int64List(value=[caption])),
'width': Feature(bytes_list=Int64List(value=[caption])),
})
The doc only describes how to translate tfds.features into the human readable structure of the tf.train.Example. But nowhere does it mention how to translate a tf.train.Example into tfds.features, which is needed to automatically add the proper metadata fileswith tfds.folder_dataset.write_metadata.
I wonder how to translate the above tf.train.Example into tfds.features? Thanks a lot!
BTW, while I understand that it is possible to directly read the data as it is in TFRecord with tf.data.TFRecordDataset and then use map(decode_fn) for decoding as suggested here, it seems to me this approach lacks necessary metadata like num_shards or shard_lengths. In this case, I am not sure if it is still ok to use common operations like cache/repeat/shuffle/map/batch on that tf.data.TFRecordDataset. So I think it is better to stick to the tfds approach.
What I've tried so far
I have searched the official doc for quite some time but cannot find the answer. There is a Scalar class in tfds.features, which I assume could be used to decode Int64List. But How can I decode the BytesList?
Environment information
tensorflow-datasets version: 4.8.2
tensorflow version: 2.11.0
After some searching, I find the simplest solution is
features = tfds.features.FeaturesDict({
'image': tfds.features.Image(), # << Ideally best if add the `shape=(h, w, c)` info
'caption': tfds.features.Text(),
'height': tf.int32,
'width': tf.int32,
})
Then I can load the data either with tfds.folder_dataset.write_metadata or directly with:
ds = tf.data.TFRecordDataset()
ds = ds.map(features.deserialize_example)
batch, shuffle ,... will work as expected in both cases.
The TFDS metadata would be helpful if fine-grained split control is needed.

Problem with connecting google Colab with google Cloud TPUs

I have this code which based on t5 notebook (https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb)
FINETUNE_STEPS = 3000##param {type: "integer"}
model.finetune(
mixture_or_task_name="text_diacritization_short",
pretrained_model_dir=PRETRAINED_DIR,
finetune_steps=FINETUNE_STEPS
)
my code was working fine in 8 Augustus then something happened resulting of this error.
these two lines appeared when my model worked so i don't think they are the problem.
INFO:root:system_path_file_exists:gs://my_bucket/my_file/models/small/operative_config.gin
ERROR:root:Path not found: gs://my_bucket/my_file/models/small/operative_config.gin
Rest of the error.
From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/training_util.py:399: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:absl:Using an uncached FunctionDataset for training is not recommended since it often results in insufficient shuffling on restarts, resulting in overfitting. It is highly recommended that you cache this task before training with it or use a data source that supports lower-level shuffling (e.g., FileDataSource).
SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias The initialzer will guess the input and output dimensions based on dimension order.
Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias The initialzer will guess the input and output dimensions based on dimension order.
From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py:1161: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
From /usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:758: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
I changed the google cloud account and the Colab notebook to completely new gmail account, I think the problem is that something got updated in google Colab regarding connecting to Google Cloud TPUs.
Also, I can connect to my bucket normally using this code.
BASE_DIR = "gs://my_bucket/my_file" ##param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
raise ValueError("You must enter a BASE_DIR.")
DATA_DIR = os.path.join(BASE_DIR, "data")
FINETUNE_MODELS_DIR = os.path.join(BASE_DIR, "models")
ON_CLOUD = True
if ON_CLOUD:
print("Setting up GCS access...")
import tensorflow_gcs_config
from google.colab import auth
# Set credentials for GCS reading/writing from Colab and TPU.
TPU_TOPOLOGY = "v2-8"
try:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
TPU_ADDRESS = tpu.get_master()
print('Running on TPU:', TPU_ADDRESS)
except ValueError:
raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
auth.authenticate_user()
tf.enable_eager_execution()
tf.config.experimental_connect_to_host(TPU_ADDRESS)
tensorflow_gcs_config.configure_gcs_from_colab_auth()
tf.disable_v2_behavior()
# Improve logging.
from contextlib import contextmanager
import logging as py_logging
if ON_CLOUD:
tf.get_logger().propagate = False
py_logging.root.setLevel('INFO')
#contextmanager
def tf_verbosity_level(level):
og_level = tf.logging.get_verbosity()
tf.logging.set_verbosity(level)
yield
tf.logging.set_verbosity(og_level)
it would be great if someone can help me I have been looking in the issue for a week and found nothing, is there any changes to how Google Colab works that I am not aware of.
Thanks in advance.

Stream private data to google collab TPUs from GCS

So I'm trying to make a photo classifier with 150 classes. I'm trying to run it on google colab TPUs, I understood I need a tfds with try_gcs = True for it & for that I need to put a dataset on google colab cloud. So I converted a generator to a tfds, stored it locally using
my_tf_ds = tf.data.Dataset.from_generator(datafeeder.allGenerator,
output_signature=(
tf.TensorSpec(shape=(64,64,3), dtype=tf.float32),
tf.TensorSpec(shape=(150), dtype=tf.float32)))
tf.data.experimental.save(my_tf_ds,filename)
Then I sent it to my bucket on GCS.
But when I try to load it from my bucket with
import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons",data_dir = "gs://dataset-7000")
It doesn't work and gives available datasets like :
- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc
that are not on my GCS bucket.
When loading it myself from local:
tfds_from_file = tf.data.experimental.load(filename, element_spec= (
tf.TensorSpec(shape=(64,64,3), dtype=tf.float32),
tf.TensorSpec(shape=(150), dtype=tf.float32)))
it works, the dataset is fine.
So I don't understand why I can't read it on gcs, can we read private ds on GCS? Or only already defined datasets. I also gave the role Storage Legacy Bucket Reader on my Bucket to the public.
I think the data_dir argument to tfds.load is where the module will store things locally on your device and the try_gcs is whether to stream the data or not. So the data_dir cannot be used to point the module to your GCS bucket.
Here are some ideas you could try:
You could try these steps to add your dataset to TFDS and then you should be able to load it using tfds.load
You could get a dataset in the right format using tf.data.experimental.save (which I think you've already done) and save it to GCS and then load it using tf.data.experimental.load, which you said is working for you locally. You could follow these steps to install gcsfuse and use that to download your dataset to Colab from GCS.
You could try TFRecord to load your dataset. Here is a codelab with explanation and then here is a Colab example that's linked in the codelab

How do I obtain URL to download the csv files of Tensorflow datasets?

In TensorFlow examples, I can see URLs to download the csv format of the dataset.
For example,
Iris- https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv
Titanic- https://storage.googleapis.com/tf-datasets/titanic/train.csv
However, I can't find the URL for every dataset in TensorFlow that are listed over her. (https://www.tensorflow.org/datasets/catalog/overview).
you don't need the URLs. Tensorflow datasets are already ready to use. check out the tutorial here tfds guide
For titanic, it is available here titanic structured dataset
Hope this would help :)
TensorFlow Datasets is having a collection of ready-to-use datasets.
loaded from tfds - "Dataset downloaded and prepared to /root/tensorflow_datasets/iris/2.0.0. Subsequent calls will reuse this data. "- really covinient... but if you'd better take dataset from url (see here - pipelines are convinient):
# https://www.tensorflow.org/guide/data#consuming_csv_data
import tensorflow as tf
import pandas as pd
# test_file = tf.keras.utils.get_file("temperature.csv", "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv")
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df = pd.read_csv(titanic_file)
df.head()
# make dataset from pandas:
myDataset = tf.data.Dataset.from_tensor_slices(dict(df))
for feature_batch in myDataset.take(1):
for key, value in feature_batch.items():
print(" {!r:20s}: {}".format(key, value))
titanic_lines = tf.data.TextLineDataset(titanic_file)
for line in titanic_lines.take(10):
print(line.numpy())
here are different Datasets & Flows also

How do I add a directory of .wav files to the Kedro data catalogue?

This is my first time trying to use the Kedro package.
I have a list of .wav files in an s3 bucket, and I'm keen to know how I can have them available within the Kedro data catalog.
Any thoughts?
I don't believe there's currently a dataset format that handles .wav files. You'll need to build a custom dataset that uses something like Wave - not as much work as it sounds!
This will enable you to do something like this in your catalog:
dataset:
type: my_custom_path.WaveDataSet
filepath: path/to/individual/wav_file.wav # this can be a s3://url
and you can then access your WAV data natively within your Kedro pipeline. You can do this for each .wav file you have.
If you wanted to be able to access a whole folders worth of wav files, you might want to explore the notion of a "wrapper" dataset like the PartitionedDataSet whose usage guide can be found in the documentation.
This worked:
import pandas as pd
from pathlib import Path, PurePosixPath
from kedro.io import AbstractDataSet
class WavFile(AbstractDataSet):
'''Used to load a .wav file'''
def __init__(self, filepath):
self._filepath = PurePosixPath(filepath)
def _load(self) -> pd.DataFrame:
df = pd.DataFrame({'file': [self._filepath],
'data': [load_wav(self._filepath)]})
return df
def _save(self, df: pd.DataFrame) -> None:
df.to_csv(str(self._filepath))
def _exists(self) -> bool:
return Path(self._filepath.as_posix()).exists()
def _describe(self):
return dict(filepath=self._filepath)
class WavFiles(PartitionedDataSet):
'''Replaces the PartitionedDataSet.load() method to return a DataFrame.'''
def load(self)->pd.DataFrame:
'''Returns dataframe'''
dict_of_data = super().load()
df = pd.concat(
[delayed() for delayed in dict_of_data.values()]
)
return df
my_partitioned_dataset = WavFiles(
path="path/to/folder/of/wav/files/",
dataset=WavFile,
)
my_partitioned_dataset.load()