Tensorflow Dataset API how to order list_files? - tensorflow

I am using the Dataset API list_files in order to get a list of files in a source directory and target directory, something like:
source_path = '/tmp/data/source/*.ext1'
target_path = '/tmp/data/target/*.ext2'
source_dataset = tf.data.Dataset.list_files(source_path)
target_dataset = tf.data.Dataset.list_files(target_path)
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
The source and target directories contain the same sequential filenames with different extensions (e.g., source 0001.ext1 <-> target 0001.ext2).
But since list_files does not return files in any particular order, the zipped dataset contains mismatches between source and target.
How can I solve this within the new Dataset API?

The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
source_dataset = tf.data.Dataset.list_files(source_path, shuffle=False)
or
val = 5
source_dataset = tf.data.Dataset.list_files(source_path, seed=val)
target_dataset = tf.data.Dataset.list_files(target_path, seed=val)
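For illustration, here is a minimal sketch of the shuffle=False variant applied to both directories, assuming TensorFlow 2.x with eager execution. The temporary directory and the 0001..0003 file names are made up for the demo and are not from the question.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

# Hypothetical layout: matching stems, different extensions.
root = Path(tempfile.mkdtemp())
for sub, ext in (("source", "ext1"), ("target", "ext2")):
    (root / sub).mkdir()
    for i in range(1, 4):
        (root / sub / f"{i:04d}.{ext}").touch()

# shuffle=False on BOTH datasets, so both listings are deterministic.
source_dataset = tf.data.Dataset.list_files(str(root / "source" / "*.ext1"), shuffle=False)
target_dataset = tf.data.Dataset.list_files(str(root / "target" / "*.ext2"), shuffle=False)
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))

# Collect the file stems of each zipped pair to check that they line up.
pairs = [
    (Path(s.numpy().decode()).stem, Path(t.numpy().decode()).stem)
    for s, t in dataset
]
print(pairs)
```

If shuffling is still wanted for training, one option is to shuffle the zipped dataset afterwards, so that pairs stay together.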

I had the same issue and I solved it by sorting the file paths first.
My files are named like in OP's case:
input image -> corresponding output
data/mband/01.tif -> data/gt_mband/01.tif
data/mband/02.tif -> data/gt_mband/02.tif
The code looks like this:
from pathlib import Path
import tensorflow as tf
DATA_PATH = Path("data")
# Sort the PATHS
img_paths = sorted(map(str, (DATA_PATH / 'mband').glob('*.tif')))
mask_paths = sorted(map(str, (DATA_PATH / 'gt_mband').glob('*.tif')))
# These are tensors of PATHS
# Paths are strings, so order will be preserved
img_paths = tf.data.Dataset.from_tensor_slices(img_paths)
mask_paths = tf.data.Dataset.from_tensor_slices(mask_paths)
# Load the actual images
def parse_image(image_path: 'some_tensor'):
    # Load the image somehow...
    return image_as_tensor
imgs = img_paths.map(parse_image)
# parse_mask is assumed to be defined analogously to parse_image
masks = mask_paths.map(parse_mask)
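Sorting alone silently mispairs files if one directory is missing an item. As a hedged sketch of a safer variant, the helper below (paired_paths is a hypothetical name, not part of the answer) matches files by stem and fails loudly on any unmatched file. The temporary directory exists only to exercise it.

```python
import tempfile
from pathlib import Path


def paired_paths(img_dir, mask_dir, pattern="*.tif"):
    """Return two equally ordered lists of paths, matched by file stem."""
    imgs = {p.stem: p for p in Path(img_dir).glob(pattern)}
    masks = {p.stem: p for p in Path(mask_dir).glob(pattern)}
    unmatched = sorted(set(imgs) ^ set(masks))
    if unmatched:
        raise ValueError(f"files without a counterpart: {unmatched}")
    stems = sorted(imgs)
    return [str(imgs[s]) for s in stems], [str(masks[s]) for s in stems]


# Toy directory layout (hypothetical names) to exercise the helper.
root = Path(tempfile.mkdtemp())
for sub in ("mband", "gt_mband"):
    (root / sub).mkdir()
    for name in ("02.tif", "01.tif", "10.tif"):
        (root / sub / name).touch()

img_paths, mask_paths = paired_paths(root / "mband", root / "gt_mband")
stems = [Path(p).stem for p in img_paths]
print(stems)
```

The two returned lists can then be passed to tf.data.Dataset.from_tensor_slices exactly as in the answer.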

Related

tfrecordswriter does not write

I am trying to create a tf.data.Dataset from a generator I wrote, and following this great answer: Split .tfrecords file into many .tfrecords files
Generator Code
def get_examples_generator(num_variants, vcf_reader):
    def generator():
        counter = 0
        for vcf_read in vcf_reader:
            is_vcf_ok = ...  # checking whether this "vcf" example is ok
            if is_vcf_ok and counter < num_variants:
                counter += 1
                # features extraction ...
                # we create an example
                example = make_example(img=img, label=label)  # returns a SerializedExample
                yield example
    return generator
TFRecordsWriter Usage Code
def write_sharded_tfrecords(filename, path, vcf_reader,
                            num_variants,
                            shard_len):
    assert Path(path).exists(), "path does not exist"
    generator = get_examples_generator(num_variants=num_variants,
                                       vcf_reader=vcf_reader,
                                       cfdna_bam_reader=cfdna_bam_reader)
    dataset = tf.data.Dataset.from_generator(generator,
                                             output_types=tf.string,
                                             output_shapes=())
    num_shards = int(np.ceil(num_variants/shard_len))
    formatter = lambda batch_idx: f'{path}/{filename}-{batch_idx:05d}-of-' \
                                  f'{num_shards:05d}.tfrecord'
    # inspired by https://stackoverflow.com/questions/54519309/split-tfrecords-file-into-many-tfrecords-files
    for i in range(num_shards):
        shard_path = formatter(i)
        writer = tf.data.experimental.TFRecordWriter(shard_path)
        shard = dataset.shard(num_shards, index=i)
        writer.write(shard)
This is supposed to be a straightforward use of the TFRecord writer. However, it does not write any files at all. Does anyone understand why this doesn't work?
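In my functions, I create the writer with tf.io.TFRecordWriter, which takes the output path and writes one serialized record per write() call (rather than a whole dataset). Try changing your writer and see if it works:
writer = tf.io.TFRecordWriter(shard_path)
...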
As a further reference, this answer helped me:
https://stackoverflow.com/a/60283571
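As a rough sketch of what that could look like, assuming TensorFlow 2.x: the helper below distributes serialized records round-robin over the shard files, which mirrors what dataset.shard(num_shards, index=i) selects. The names write_sharded and toy_examples are hypothetical, and the toy generator stands in for the VCF generator from the question.

```python
import tempfile
from pathlib import Path

import tensorflow as tf


def write_sharded(serialized_examples, out_dir, filename, num_shards):
    """Write an iterable of serialized records round-robin into num_shards files."""
    paths = [
        str(Path(out_dir) / f"{filename}-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(p) for p in paths]
    try:
        for idx, record in enumerate(serialized_examples):
            writers[idx % num_shards].write(record)  # one record per call
    finally:
        for writer in writers:
            writer.close()  # flush and close every shard file
    return paths


def toy_examples(n):
    """Stand-in generator yielding serialized tf.train.Example protos."""
    for i in range(n):
        feature = {"label": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        yield example.SerializeToString()


out_dir = tempfile.mkdtemp()
shard_paths = write_sharded(toy_examples(10), out_dir, "toy", num_shards=3)

# Read the shards back to count how many records each one received.
counts = [sum(1 for _ in tf.data.TFRecordDataset(p)) for p in shard_paths]
print(counts)
```

This makes a single pass over the generator, whereas the loop in the question re-runs the generator once per shard.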

How to map a dataset of filenames to a dataset of file contents

For example, I have a TensorFlow dataset where each element is a tf.string tensor representing the filename of an image file. Now I want to map this filename dataset to a dataset of image content tensors.
I wrote code like this, but it doesn't work because the map function doesn't execute eagerly. (It raises an error saying the Tensor type has no attribute named numpy.)
def parseline(line):
    filename = line.numpy()
    image = some_library.open_image(filename).to_numpy()
    return image
dataset = dataset.map(parseline)
Basically, it can be done the following way:
import os
import tensorflow as tf
path = 'path_to_images'
files = [os.path.join(path, i) for i in os.listdir(path)]  # plain list of filenames; from_tensor_slices turns it into string tensors
def parse_image(filename):
    file = tf.io.read_file(filename)  # this works only with the filename as a tensor
    image = tf.image.decode_image(file)
    return image
dataset = tf.data.Dataset.from_tensor_slices(files)
dataset = dataset.map(parse_image).batch(1)
If you're in eager mode, just iterate over the dataset:
for i in dataset:
    print(i)
If not, you'll need an iterator:
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    print(sess.run(next_element))
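Here is a self-contained version of the same idea that can be run end to end, assuming TensorFlow 2.x. The tiny PNG files are generated on the fly purely for the demo; they are not part of the answer.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

# Create three small PNG files so the example has something to read.
tmp = Path(tempfile.mkdtemp())
files = []
for i in range(3):
    pixels = tf.fill([4, 5, 3], tf.cast(i, tf.uint8))  # 4x5 RGB image filled with i
    path = str(tmp / f"{i}.png")
    tf.io.write_file(path, tf.io.encode_png(pixels))
    files.append(path)


def parse_image(filename):
    # Only TensorFlow ops here: map() traces this function into a graph,
    # so filename is a symbolic tensor without a .numpy() method.
    data = tf.io.read_file(filename)
    return tf.io.decode_png(data, channels=3)


dataset = tf.data.Dataset.from_tensor_slices(files).map(parse_image)

shapes = [tuple(img.shape) for img in dataset]
first_values = [int(img[0, 0, 0]) for img in dataset]
print(shapes, first_values)
```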

How to read parameters of layers of .tflite model in python

I am trying to read a .tflite model and pull out all the parameters of its layers.
My steps:
I generated the flatbuffers model representation by running (please build flatc first):
flatc --python tensorflow/tensorflow/lite/schema/schema.fbs
The result is a tflite/ folder that contains layer description files (*.py) and some utility files.
I successfully loaded the model:
# In case of ImportError: set PYTHONPATH to point to the folder where tflite/ is
from tflite.Model import Model
def read_tflite_model(file):
    buf = open(file, "rb").read()
    buf = bytearray(buf)
    model = Model.GetRootAsModel(buf, 0)
    return model
I partly pulled the model and node parameters out, but got stuck iterating over the nodes:
Model part:
def print_model_info(model):
    version = model.Version()
    print("Model version:", version)
    description = model.Description().decode('utf-8')
    print("Description:", description)
    subgraph_len = model.SubgraphsLength()
    print("Subgraph length:", subgraph_len)
Nodes part:
def print_nodes_info(model):
    # what does this 0 mean? should it always be zero?
    subgraph = model.Subgraphs(0)
    operators_len = subgraph.OperatorsLength()
    print('Operators length:', operators_len)
    from collections import deque
    nodes = deque(subgraph.InputsAsNumpy())
    STEP_N = 0
    MAX_STEPS = operators_len
    print("Nodes info:")
    while len(nodes) != 0 and STEP_N <= MAX_STEPS:
        print("MAX_STEPS={} STEP_N={}".format(MAX_STEPS, STEP_N))
        print("-" * 60)
        node_id = nodes.pop()
        print("Node id:", node_id)
        tensor = subgraph.Tensors(node_id)
        print("Node name:", tensor.Name().decode('utf-8'))
        print("Node shape:", tensor.ShapeAsNumpy())
        # which type is it? what does it mean?
        type_of_tensor = tensor.Type()
        print("Tensor type:", type_of_tensor)
        quantization = tensor.Quantization()
        min = quantization.MinAsNumpy()
        max = quantization.MaxAsNumpy()
        scale = quantization.ScaleAsNumpy()
        zero_point = quantization.ZeroPointAsNumpy()
        print("Quantization: ({}, {}), s={}, z={}".format(min, max, scale, zero_point))
        # I do not understand it again. what is j, that I set to 0 here?
        operator = subgraph.Operators(0)
        for i in operator.OutputsAsNumpy():
            nodes.appendleft(i)
        STEP_N += 1
    print("-"*60)
Please point me to documentation or an example of using this API.
My problems are:
I cannot find documentation for this API.
Iterating over Tensor objects does not seem possible, as they have no Inputs or Outputs methods. Also, I do not understand what j means in subgraph.Operators(j=0). Because of that, my loop goes through two nodes: the input (once) and the next one over and over again.
Iterating over Operator objects is certainly possible:
Here we iterate over all of them, but I cannot work out how to map Operators to Tensors.
def print_in_out_info_of_all_operators(model):
    # what does this 0 mean? should it always be zero?
    subgraph = model.Subgraphs(0)
    for i in range(subgraph.OperatorsLength()):
        operator = subgraph.Operators(i)
        print('Outputs', operator.OutputsAsNumpy())
        print('Inputs', operator.InputsAsNumpy())
I do not understand how to pull parameters out of an Operator object. The BuiltinOptions method gives me a Table object, and I do not know what to map it to.
subgraph = model.Subgraphs(0)
What does this 0 mean? Should it always be zero? Obviously not, but what is it? The ID of the subgraph? If so, I'm happy. If not, please try to explain it.

preprocess images with tf.data.experimental.make_csv_dataset or with read_csv option

I am adding this summary of my issue to make it easier to understand:
I want to do exactly what is done in the following tensorflow example:
https://www.tensorflow.org/guide/datasets
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label
# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])
# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
The only differences are that I read the data from a CSV that has many more features, and then I call the map method:
dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
                                                batch_size=2,
                                                header=True,
                                                label_name = 'label').map(_parse_function)
What does my _parse_function need to look like? How do I access the image path feature, replace it with the decoded image, and return that numeric image matrix without changing any of the other features?
thanks,
eilalan
==================Here are my code tries:==================
My code reads a CSV with feature columns and a label. One of the features is an image path; the others are strings.
The image path needs to be processed into a numeric image matrix.
I have tried doing so with the following options. In both cases tf.read_file fails with the input dimension error.
My question is how to pass one image at a time into the map method.
def read_image_png_option_1(image_path, depth=3, scale=False):
    """Reads the image from image_path (tf.string tensor) [jpg image].
    Cast the result to float32 and if scale=True scale it in [-1,1]
    using scale_image. Otherwise the values are in [0,1]
    Return:
    the decoded jpeg image, casted to float32
    """
    image = tf.image.convert_image_dtype(
        tf.image.decode_png(tf.read_file(image_path), channels=depth),
        dtype=tf.float32)
    if scale:
        image = scale_image(image)
    return image
def read_image_png_option_2(features, depth=3, scale=False):
    """Reads the image from image_path (tf.string tensor) [jpg image].
    Cast the result to float32 and if scale=True scale it in [-1,1]
    using scale_image. Otherwise the values are in [0,1]
    Return:
    the decoded jpeg image, casted to float32
    """
    image = tf.image.convert_image_dtype(
        tf.image.decode_png(tf.read_file(features['image']), channels=depth),
        dtype=tf.float32)
    if scale:
        image = scale_image(image)
    features['image'] = image
    return features
def make_input_fn(fileName,batch_size=8, perform_shuffle=True):
    """An input function for training """
    def _input_fn():
        def decode_csv(line):
            print('line is ',line)
            filename_col,label_col,gender_col,ethinicity = tf.decode_csv(line,
                [[""]]*amount_of_columns_csv,
                field_delim=",",
                na_value='NA',
                select_cols=None)
            image_col = read_image_png_option_1(filename_col)
            d = dict(zip(['image','label','gender','ethinicity'], [image_col,label_col,gender_col,ethinicity])), label
            return d
        ## OPTION 1:
        # filenames could be more than one
        # dataset = tf.data.TextLineDataset(filenames=fileName).skip(1).batch(batch_size).map(decode_csv)
        ## OPTION 2:
        dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
            batch_size=2,
            header=True,
            label_name = 'label').map(read_image_png_option_2)
        #select_columns=[0,1]) #[tf.string,tf.string,tf.string,tf.string])
        if perform_shuffle:
            dataset = dataset.shuffle(buffer_size=256)
        return dataset
    return _input_fn()
train_input_fn = lambda: make_input_fn(CSV_PATH_TRAIN)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=50)
eval_input_fn = lambda: make_input_fn(CSV_PATH_VAL)
eval_spec = tf.estimator.EvalSpec(eval_input_fn)
feature_columns = [tf.feature_column.numeric_column("image",shape=(224,224)), # here I need a Python method to transform
    tf.feature_column.categorical_column_with_vocabulary_list("gender", ["ww","ee"]),
    tf.feature_column.categorical_column_with_vocabulary_list("ethinicity",["xx","yy"])]
estimator = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[1024, 512, 256],warm_start_from=ws)
tf.estimator.train_and_evaluate(estimator, train_spec=train_spec, eval_spec=eval_spec)
Error for option 2:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [2].
Error for option 1:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [?].
Any help is appreciated.
Thanks
First you need to read the CSV file into dataset.
Then for each row in your CSV you can call your parse function.
def getInput(fileList):
    # returns a dataset containing list of filenames
    files = tf.data.Dataset.from_tensor_slices(fileList)
    # Returns a dataset containing the rows taken from all the files in the file list.
    # dataset is filled dynamically and not all entries are read at once
    dataset = files.interleave(tf.data.TextLineDataset)
    # call parse function for each row
    # returned dataset will contain list of whatever the parse function is returning for the row
    # we want the image path to be converted to decoded image in parse function
    dataset = dataset.map(_parse_function, num_parallel_calls=8)
    # return an iterator for the dataset which will be used to get elements.
    return dataset.make_one_shot_iterator().get_next()
The parse function will be passed only one parameter that will be a single row from the CSV file. You need to decode the CSV and do further processing on each value.
Let's say you have 3 columns in your CSV each being a string.
def _parse_function(value):
    columns_default = [[""], [""], [""]]
    # this will be a list of column tensors for the row
    columns = tf.decode_csv(value, record_defaults=columns_default,
                            field_delim=',')
    col_names = ["label", "imagepath", "c3"]
    features = dict(zip(col_names, columns))
    for f, tensor in features.items():
        # process imagepath to decoded image
        if f == "imagepath":
            image_string = tf.read_file(tensor)
            image_decoded = tf.image.decode_jpeg(image_string)
            image_resized = tf.image.resize_images(image_decoded, [28, 28])
            features[f] = image_resized
    labels = tf.equal(features.pop('label'), "1")
    labels = tf.expand_dims(labels, 0)
    return features, labels
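The per-row approach above can be tried end to end with the sketch below. It assumes TensorFlow 2.x names (tf.io.decode_csv, tf.io.read_file, tf.image.resize), and it generates a throwaway CSV and two blank JPEGs just for the demo. The column names follow the answer, and parse_row is a hypothetical name.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

# Build a tiny CSV with three string columns and two blank JPEG files.
tmp = Path(tempfile.mkdtemp())
rows = ["label,imagepath,c3"]
for i in range(2):
    image_file = str(tmp / f"img{i}.jpg")
    tf.io.write_file(image_file, tf.io.encode_jpeg(tf.zeros([8, 8, 3], tf.uint8)))
    rows.append(f"{i},{image_file},foo")
csv_file = tmp / "train.csv"
csv_file.write_text("\n".join(rows) + "\n")


def parse_row(line):
    # line is a scalar string tensor holding ONE CSV row, so image_path
    # below is rank 0, which is what read_file requires.
    label, image_path, c3 = tf.io.decode_csv(line, record_defaults=[[""], [""], [""]])
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, [28, 28])
    features = {"imagepath": image, "c3": c3}
    return features, tf.equal(label, "1")


# skip(1) drops the header line; map() then sees one row at a time.
dataset = tf.data.TextLineDataset(str(csv_file)).skip(1).map(parse_row)
examples = list(dataset)
print(examples[0][0]["imagepath"].shape, examples[0][1].numpy())
```

Batching, if needed, would go after the map call, so that read_file never sees more than one path at a time.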
Edit:
Explanation for comment:
A Dataset object simply contains a sequence of elements. The elements can be tensors, tuples of tensors, etc. A tensor can hold anything: a single feature, a single record, or a batch of records. The Dataset API also provides handy methods to manipulate those elements.
If you are using the dataset with another API such as Estimator, that API expects the dataset elements to be in a specific format, which is what the input function needs to return, e.g.:
https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
I have edited my code block above to describe what dataset object at each step will contain.
From what I understand, you have an image path as one of the fields in your CSV and you want to convert that path into an actual decoded image, which you will use as one of the features.
Since the image is going to be just one of the features, you should not try to create a dataset from the image files alone. The Dataset object will include all your features at once.
So doing this would be incorrect:
files = tf.data.Dataset.from_tensor_slices(ds['imagepath'])
dataset = files.interleave(tf.data.TextLineDataset)
If you are using make_csv_dataset() to read your CSV, it converts the rows into batched records: when label_name is set, each element of the returned dataset is a (features, labels) pair, where features is a dictionary mapping column names to tensors that each hold batch_size values.
So each element in the returned dataset contains all your features for a whole batch of rows.
Here your image path will be one of those features. Now you want to transform that image path into a decoded image.
I suppose you can do it by applying a parse function to the elements of the dataset using map(), but it is slightly tricky because each path tensor holds a whole batch of paths (rank 1), while tf.read_file expects a single scalar path; that mismatch is what the rank error in the question reports.
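One possible way around the batching, sketched here under the assumption of TensorFlow 2.x, is to unbatch the output of make_csv_dataset, map a per-row function over it, and batch again. The CSV contents, column names, and the load_image helper are invented for the demo.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

# Throwaway CSV with an image path column, one extra feature, and a label.
tmp = Path(tempfile.mkdtemp())
rows = ["image,gender,label"]
for i in range(4):
    image_file = str(tmp / f"img{i}.png")
    tf.io.write_file(image_file, tf.io.encode_png(tf.zeros([6, 6, 3], tf.uint8)))
    rows.append(f"{image_file},ww,{i % 2}")
csv_path = tmp / "train.csv"
csv_path.write_text("\n".join(rows) + "\n")


def load_image(features, label):
    # After unbatch(), features["image"] is a scalar path, so read_file works.
    features = dict(features)  # copy, leaving the other columns untouched
    image = tf.io.decode_png(tf.io.read_file(features["image"]), channels=3)
    features["image"] = tf.image.convert_image_dtype(image, tf.float32)
    return features, label


dataset = (
    tf.data.experimental.make_csv_dataset(
        str(csv_path), batch_size=2, label_name="label",
        num_epochs=1, shuffle=False)
    .unbatch()          # back to one CSV row per element
    .map(load_image)    # replace the path with the decoded image
    .batch(2)           # re-batch once every image is loaded
)

features, labels = next(iter(dataset))
print(features["image"].shape, labels.numpy())
```

Re-batching like this requires all images to have the same shape; otherwise a resize step inside load_image would be needed first.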

Do not understand the classes part and reshape from reading a h5 dataset file

Hello, can somebody explain step by step what's happening in the following code?
Especially the classes part and the reshape? Thanks.
def load_data():
    train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels
    test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels
    classes = np.array(test_dataset["list_classes"][:]) # the list of classes
    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
Most of the lines just load datasets from the h5 file. The np.array(...) wrapper isn't needed. test_dataset[name][:] is sufficient to load an array.
test_set_y_orig = test_dataset["test_set_y"][:]
test_dataset is the opened file. test_dataset["test_set_y"] is a dataset in that file. The [:] loads the dataset into a numpy array. Look up the h5py docs for more details on loading a dataset.
I deduce from
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
that the array, as loaded, is 1d with shape (n,), and this reshape just adds an initial dimension, making it (1,n). I would have coded it as
train_set_y_orig = train_set_y_orig[None,:]
but the result is the same.
There's nothing special about the classes array (though it might well be an array of strings).
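The reshape can be checked with plain NumPy, without the h5 files. In this sketch the label values and class names are invented stand-ins (the real contents of list_classes are not shown in the question), and the class names are assumed to be byte strings.

```python
import numpy as np

# Stand-in for train_set_y as loaded from the file: 1d, shape (n,)
y = np.array([1, 0, 0, 1, 1])

y_row = y.reshape((1, y.shape[0]))  # shape (1, n), as in load_data
y_alt = y[None, :]                  # equivalent: add a leading axis

# Hypothetical classes array; a label is simply an index into it.
classes = np.array([b"negative", b"positive"])
first_label = classes[y_row[0, 0]].decode("utf-8")

print(y.shape, y_row.shape, np.array_equal(y_row, y_alt), first_label)
```

So classes only serves as a lookup table for turning a numeric label back into a readable name.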