Suppose I have a TensorFlow Dataset and I am using TF 2.0.
I can iterate over each element with a for loop:
import tensorflow_datasets as tfds
ds = tfds.load('coco', data_dir='D:\\DataSet\\COCO')
test_data = ds["test"]
for rec in test_data:
    print(rec['image'])
Is it possible to access the Nth element directly?
Something like rec = test_data[N]?
It is not possible at the moment. If you have control over how the dataset is loaded, the closest I have come up with is to create a generator that supports random access and build a dataset from it using tf.data.Dataset.from_generator().
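For example, a minimal sketch of that idea, using an in-memory NumPy array as a stand-in for the real data source (all names here are illustrative):

import numpy as np
import tensorflow as tf

# Hypothetical in-memory data; in practice this could be any indexable source.
images = np.random.rand(100, 32, 32, 3).astype(np.float32)

class RandomAccessGenerator:
    def __len__(self):
        return len(images)

    def __getitem__(self, idx):
        # Direct access to the Nth element.
        return images[idx]

    def __call__(self):
        # Sequential access, used by tf.data.Dataset.from_generator().
        for idx in range(len(self)):
            yield self[idx]

gen = RandomAccessGenerator()
ds = tf.data.Dataset.from_generator(
    gen, output_types=tf.float32, output_shapes=(32, 32, 3)
)

# Nth element, taken directly from the generator rather than through the dataset:
rec = gen[42]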
I am new to GPflow and I am trying to figure out how to write a custom loss function to optimise the model. For my purposes, I need to manipulate the predicted output of the GP through different data treatments, and it is the output I get after these treatments that I would like to optimise the GP model against. For that purpose I would like to use the root mean square error as the loss function.
Workflow:
Input -> GP model -> GP_output -> Data treatment -> Predicted_output -> RMSE(Predicted_output, Observations)
I hope this makes sense.
Normally, models are optimised by doing something like this:
import gpflow as gf
import numpy as np

# GPflow expects 2D (N, D) arrays for both inputs and observations
X = np.linspace(0, 100, num=100).reshape(-1, 1)
n = np.random.normal(scale=8, size=X.shape)
y_obs = 10 * np.sin(X) + n

model = gf.models.GPR(
    data=(X, y_obs),
    kernel=gf.kernels.SquaredExponential(),
)

# optimizer_config: a dict of solver options passed through to scipy.optimize.minimize
gf.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=optimizer_config
)
I have figured out how to do a workaround using the scipy minimize function to optimise with RMSE, but I would like to stay within the GPflow framework, where I can just pass model.trainable_variables as an argument and have a general function that also works if I have multiple input/output dimensions.
def objective_func(params):
    model.kernel.lengthscales.assign(params[0])
    model.kernel.variance.assign(params[1])
    model.likelihood.variance.assign(params[2])
    GP_output = model.predict_y(X)[0]
    GP_output = GP_output.numpy()
    Predicted_output = data_treatment_func(GP_output)
    return np.sqrt(np.square(np.subtract(Predicted_output, y_obs)).mean())

from scipy.optimize import minimize

res = minimize(objective_func, x0=(1.0, 1.0, 1.0))
I found the answer myself.
If you write your objective_func using TensorFlow operations instead of NumPy (e.g. tf.math.sqrt, tf.reduce_mean), you can simply pass it to gf.optimizers.Scipy().minimize(...) in place of model.training_loss. Note that data_treatment_func must also be implemented with TensorFlow operations so that gradients can propagate through it:
import tensorflow as tf

def objective_func():
    GP_output = model.predict_y(X)[0]
    Predicted_output = data_treatment_func(GP_output)
    return tf.sqrt(tf.reduce_mean(tf.square(Predicted_output - y_obs)))

gf.optimizers.Scipy().minimize(
    objective_func, model.trainable_variables, options=optimizer_config
)
In the TensorFlow examples, I can see URLs to download the CSV version of a dataset.
For example,
Iris- https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv
Titanic- https://storage.googleapis.com/tf-datasets/titanic/train.csv
However, I can't find the URL for every dataset that is listed in the TensorFlow Datasets catalog (https://www.tensorflow.org/datasets/catalog/overview).
You don't need the URLs. TensorFlow datasets are already ready to use; check out the tutorial here: tfds guide.
For Titanic, it is available here: titanic structured dataset.
Hope this helps :)
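For example, a minimal sketch using the iris dataset from the catalog (the dataset name is just an illustration; any catalog name works the same way):

import tensorflow_datasets as tfds

# Downloads and prepares the dataset on the first call; later calls reuse the cache.
ds = tfds.load('iris', split='train')

# Each element is a dict of features; the keys depend on the dataset.
for example in ds.take(1):
    print({key: value.numpy() for key, value in example.items()})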
TensorFlow Datasets provides a collection of ready-to-use datasets.
When loaded from tfds you get a message like "Dataset downloaded and prepared to /root/tensorflow_datasets/iris/2.0.0. Subsequent calls will reuse this data.", which is really convenient. But if you would rather load the dataset from a URL yourself (see here; pipelines are convenient):
# https://www.tensorflow.org/guide/data#consuming_csv_data
import tensorflow as tf
import pandas as pd

# test_file = tf.keras.utils.get_file("temperature.csv", "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv")
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df = pd.read_csv(titanic_file)
df.head()

# make a dataset from the pandas DataFrame:
myDataset = tf.data.Dataset.from_tensor_slices(dict(df))
for feature_batch in myDataset.take(1):
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))

# or read the raw CSV lines directly:
titanic_lines = tf.data.TextLineDataset(titanic_file)
for line in titanic_lines.take(10):
    print(line.numpy())
There are different datasets and example pipelines here as well.
My dataset here is composed of records with the structure [...x, y], and I want to convert it to
[...x], categorical([y])
This is what I tried:
def map_sequence(sequence):
    return sequence[:-1], keras.utils.to_categorical(sequence[-1])

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)
but I am getting an error because sequence does not actually hold any data when the mapping function is executed; it is a symbolic tensor, not an array.
How does one use to_categorical and map() together?
Replacing keras.utils.to_categorical with tf.one_hot should do the trick.
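For example, a minimal sketch (input_sequences and num_classes are illustrative placeholders; set num_classes to the number of label categories in your data):

import tensorflow as tf

# Illustrative data: each row is [...x, y] with an integer label in the last column.
input_sequences = tf.constant([[1, 2, 3, 0],
                               [4, 5, 6, 2]])
num_classes = 3  # assumption: number of label categories

def map_sequence(sequence):
    # tf.one_hot works on the symbolic tensors that dataset.map() passes in,
    # unlike keras.utils.to_categorical, which needs concrete NumPy values.
    label = tf.cast(sequence[-1], tf.int32)
    return sequence[:-1], tf.one_hot(label, num_classes)

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)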
Question: Is there a way to append to an existing TFRecord file?
Note: The .tfrecord file is created by my own script (not a .tfrecord I found on the web), so I have full control over its contents.
It is not possible to append to an existing records file as such, or at least not through the functions that TensorFlow provides. Record files are written at the C++ level by a PyRecordWriter, which calls NewWritableFile when it is created, deleting any existing file with the given name. However, it is possible to create a new records file with the contents of another one followed by the new records.
For TensorFlow 1.x, you could do it like this:
import tensorflow as tf

def append_records_v1(in_file, new_records, out_file):
    with tf.io.TFRecordWriter(out_file) as writer:
        with tf.Graph().as_default(), tf.Session():
            ds = tf.data.TFRecordDataset([in_file])
            rec = ds.make_one_shot_iterator().get_next()
            while True:
                try:
                    writer.write(rec.eval())
                except tf.errors.OutOfRangeError:
                    break
        for new_rec in new_records:
            writer.write(new_rec)
In TensorFlow 2.x (eager execution), you could do it like this:
import tensorflow as tf

def append_records_v2(in_file, new_records, out_file):
    with tf.io.TFRecordWriter(out_file) as writer:
        ds = tf.data.TFRecordDataset([in_file])
        for rec in ds:
            writer.write(rec.numpy())
        for new_rec in new_records:
            writer.write(new_rec)
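For example, usage could look like this (file names and record contents are illustrative; the new records are raw serialized bytes, e.g. serialized tf.train.Example protos):

new_records = [b"first new record", b"second new record"]
append_records_v2("existing.tfrecord", new_records, "combined.tfrecord")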
I am trying to create a 78 TB HDF5 dataset by filling it in a 2D block-partitioned manner. This is very slow when the block I'm writing spans rows that have never been written to, because HDF5 goes in, allocates the disk space, and fills in the missing entries with zeros.
Instead, I would like h5py to allocate the disk space for my dataset as soon as it's created, and never fill it. This is possible with the C API according to Table 16 in the HDF5 Dataset documentation, but how can I do this with h5py, preferably with the high-level interface?
I believe you want to set the fill time to "never" with the H5Pset_fill_time() routine, but I don't know the h5py way to do that.
As Quincey suggested, you can use the low-level h5py API to create the dataset with the FILL_TIME_NEVER property and then wrap it in a high-level Dataset object:
# Create the "rows" dataset with the low-level API so it can be forced to skip
# zero-filling, then wrap it in a high-level Dataset object.
# fout is an open h5py.File; numRows, numCols, rowchunk and colchunk come from the caller.
spaceid = h5py.h5s.create_simple((numRows, numCols))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
plist.set_chunk((rowchunk, colchunk))
datasetid = h5py.h5d.create(fout.id, b"rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
rows = h5py.Dataset(datasetid)
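The returned rows object then behaves like any other high-level dataset, so a block write could look like this (block contents illustrative):

import numpy as np

# Write one chunk-aligned block; with FILL_TIME_NEVER the surrounding
# unwritten regions are not zero-filled first.
block = np.random.rand(rowchunk, colchunk)
rows[0:rowchunk, 0:colchunk] = block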
Try specifying a chunk shape that matches your write pattern. For example if you are writing in blocks of 1024x1024, it would look like this:
import h5py
import numpy as np
f = h5py.File('mybigdset.h5', 'w')
dset = f.create_dataset('dset', (78*1024*1024, 1024*1024), dtype='f4', chunks=(1024,1024))
arr = np.random.rand(1024,1024)
dset[0:1024, 0:1024] = arr
f.close()
Thankfully, this didn't use 78 TB of disk; the file size was just 4 MB.