Live refreshing filenames list - tensorflow

What is the proper way to refresh the file list while keeping the TensorFlow session open?
I am aware of the word once in the tf.train.match_filenames_once() function name but could not find an alternative. Here is a snippet showing what I am trying to do:
import tensorflow as tf
pattern = '/tmp/test_file_*'
filenames = tf.train.match_filenames_once(pattern)
# Create some files
for i in range(3):
    with open(f'/tmp/test_file_old_{i}', 'w'):
        pass
# Open session and initialize vars
init_op = tf.local_variables_initializer()
sess = tf.Session()
sess.run(init_op)
# Read filenames list before adding new files
filenames_result_1 = sess.run(filenames)
print('First filenames list:')
print(filenames_result_1)
# Create some new files
for i in range(3):
    with open(f'/tmp/test_file_new_{i}', 'w'):
        pass
# Read filenames list after adding new files
filenames_result_2 = sess.run(filenames)
print('Second filenames list:')
print(filenames_result_2)
# Trying to reinitialize the contents
sess.run(init_op)
filenames_result_3 = sess.run(filenames)
print('Third filenames list:')
print(filenames_result_3)
Output:
First filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
Second filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
Third filenames list:
[b'/tmp/test_file_old_1' b'/tmp/test_file_old_0' b'/tmp/test_file_old_2']
I would have liked to obtain the up-to-date list of files instead.
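One possible workaround, not taken from the original post, is to do the globbing in plain Python every time the list is needed and then pass the result into the graph (for example through a placeholder). The sketch below shows only the re-globbing part. `list_matching_files` and `touch` are hypothetical helper names, and a scratch directory stands in for `/tmp`.

```python
import glob
import os
import tempfile


def list_matching_files(pattern):
    """Return the files matching `pattern` at the moment of the call."""
    # glob.glob() queries the filesystem on every call, so nothing is cached.
    return sorted(glob.glob(pattern))


def touch(path):
    """Create an empty file at `path`."""
    with open(path, 'w'):
        pass


# Use a scratch directory so the example does not depend on /tmp contents.
workdir = tempfile.mkdtemp()
pattern = os.path.join(workdir, 'test_file_*')

for i in range(3):
    touch(os.path.join(workdir, f'test_file_old_{i}'))
first = list_matching_files(pattern)

for i in range(3):
    touch(os.path.join(workdir, f'test_file_new_{i}'))
second = list_matching_files(pattern)

print(len(first), len(second))
```

Because the list is rebuilt on each call, the second call also sees the files created after the first one.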

Related

TensorFlow Federated - Loading and preprocessing data on a remote client

Part of the simulation program that I am working on allows clients to load local data from their device without the server being able to access that data.
Following the idea from this post, I have the following code configured to assign the client a path to load the data from. Although the data is in svmlight format, loading it line-by-line can still allow it to be preprocessed afterwards.
client_paths = {
    'client_0': '<path_here>',
    'client_1': '<path_here>',
}
def create_tf_dataset_for_client_fn(id):
    path = client_paths.get(id)
    data = tf.data.TextLineDataset(path)
    # Return the dataset so that ClientData can hand it to the caller
    return data
path_source = tff.simulation.datasets.ClientData.from_clients_and_fn(client_paths.keys(), create_tf_dataset_for_client_fn)
The code above allows a path to be loaded at runtime on the remote client's side with the following line of code.
data = path_source.create_tf_dataset_for_client('client_0')
Here, the data variable can be iterated through and can be used to display the contents on the client on the remote device when calling tf.print(). But, I need to preprocess this data into an appropriate format before continuing. I am presently attempting to convert this from a string Tensor in svmlight format into a SparseTensor of the appropriate format.
The issue is that, although the preprocessing method works in a standalone scenario (i.e. when defined as a function and tested on a manually defined Tensor of the same format), it fails when the code is executed during the client update @tf.function in the tff algorithm. Below is the error raised when executing the notebook cell containing a @tff.tf_computation function, which calls a @tf.function that does the preprocessing and retrieves the data.
ValueError: Shape must be rank 1 but is rank 0 for '{{node Reshape_2}} = Reshape[T=DT_INT64, Tshape=DT_INT32](StringToNumber_1, Reshape_2/shape)' with input shapes: [?,?], [].
Since the issue occurs when executing the client's @tff.tf_computation update function, which calls the @tf.function with the preprocessing code, I am wondering how I can allow the function to perform the preprocessing on the data without errors. I assume that if I can get the functions to run properly when defined, they will also work when called remotely.
Any ideas on how to address this issue? Thank you for your help!
For reference, the preprocessing function uses tf computations to manipulate the data. Although not optimal yet, below is the code presently being used. It is inspired by this link on string_split examples. I have also tried moving the code directly into the client's @tf.function after loading the TextLineDataset, but this fails as well.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, -1)
    feat_ids = tf.expand_dims(feat_ids, 1)
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, -1)
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
Update (Fix)
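Following the advice in Jakub's comment, the issue was fixed by wrapping the shape and axis arguments of the reshape and expand_dims calls in [] where needed (for example, tf.reshape(feat_ids, [-1]) instead of tf.reshape(feat_ids, -1)), so that the shape is passed as a rank-1 value, as the error message demands. The code now runs within tff without errors.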
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, [-1])
    feat_ids = tf.expand_dims(feat_ids, [1])
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, [-1])
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
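When debugging a graph-mode parser like this, it can help to have a plain Python reference that parses the same line format, so the label, indices and values produced by the TensorFlow version can be compared against known-good numbers. The following is a minimal sketch under the assumption that each line looks like `label idx:val idx:val ...`. `parse_libsvm_line` is a hypothetical helper and is not part of the original post.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values)."""
    parts = line.split()
    label = int(parts[0])
    indices = []
    values = []
    for pair in parts[1:]:
        # Each remaining column is a feature index and a value joined by ':'
        idx, val = pair.split(':')
        indices.append(int(idx))
        values.append(float(val))
    return label, indices, values


# A made-up example line in svmlight/libsvm format.
label, indices, values = parse_libsvm_line('1 3:0.5 7:1.25 10:2')
print(label, indices, values)
```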

tf.keras.preprocessing.image.ImageDataGenerator does not load all the images present in Google Colab

I am training a model to classify images into 10 different labels. To load data I'm using ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_dir = '/content/drive/MyDrive/Colab Notebooks/EuroSAT/Train/'
train_datagen = ImageDataGenerator(rescale=1./255,
                                   horizontal_flip=True, vertical_flip=True)
train_generator = train_datagen.flow_from_directory(train_dir, batch_size=16,
                                                    class_mode='categorical', target_size=(64, 64),
                                                    subset='training', shuffle=False)
But there are almost 3000 images in each category while ImageDataGenerator loads only 5443 images in total.
Found 5827 images belonging to 10 classes.
What can I do to work around this?
It may be that some of your images are in unsupported formats or are corrupted. This often happens if, for example, you download images via Google or Bing. Since I do this often, I wrote the function below, which checks a directory whose images are held in sub directories (the class directories, if you are using ImageDataGenerator().flow_from_directory). It checks that the files are valid image files and have one of the extensions in a user-defined list of proper extensions. The code is a bit lengthy because it does a lot of checking on inputs etc. Note that if it detects a file with the extension jfif, it renames it to jpg since they are the same format. The parameter convert_ext can be set to convert all the images to a new image format based on the extension specified, for example 'bmp'. If left as None, the images retain their original format.
import os
import shutil
import cv2
def check_file_extension (source_dir, good_ext_list, delete=False, convert_ext=None):
    # source_dir is the directory containing the class sub directories that hold the images
    # good_ext_list is a list of strings you specify as good extensions for the ImageDataGenerator
    # this list should be ['jpg', 'jpeg', 'bmp', 'png', 'tiff']
    # delete is a boolean, if set to True image files that have invalid extensions or are not valid
    # image files will be deleted.
    # the function returns a list. If delete=False this is a list of all files that have invalid
    # extensions or are not valid image files
    # if convert_ext is set to other than None, it should be a string indicating the new image format
    # the files will be converted to, for example "jpg"
    processed_count=0 # will be total number of files found
    good_count=0 # will be total number of valid image files found
    bad_file_list=[] # will be a list of all files processed that had invalid extensions
    removed_count=0 # will be the number of files deleted if delete is set to true
    class_list=os.listdir(source_dir)
    if len(class_list)==0:
        print('directory ', source_dir, ' is empty *** Program Terminating')
        return None
    print('{0:^20s}{1}{2:^17s}{1}{3:^14s}{1}{4:^15s}'.format('Class Directory',' ', 'Files Processed', 'Files Verified', 'Files Removed'))
    for klass in class_list:
        class_path=os.path.join(source_dir, klass)
        if os.path.isdir(class_path)==False:# check if this is a directory if it is not print a warning
            print ('*** Warning *** there are files in ', source_dir, ' it should only contain sub directories' )
        else:
            class_file_count=0 # will be number of files found in the class directory
            class_good_count=0 # will be the number of good files found in the class directory
            class_removed_count =0
            f_list=os.listdir(class_path) # get a list of files in the class directory
            for f in f_list:
                f_path=os.path.join(class_path,f)
                if os.path.isfile(f_path)==False: # check if it is a file if it is a directory print a warning
                    print ('*** Warning *** there is a directory in ', class_path, ' there should only be files there')
                else:
                    class_file_count +=1 #increment class file counter
                    index=f.rfind('.')
                    fname=f[:index]
                    fext=f[index+1:].lower()
                    if fext not in good_ext_list and fext !='jfif':
                        if delete:
                            os.remove(f_path)
                            class_removed_count +=1 # increment removed file counter
                        else:
                            bad_file_list.append(f_path) # don't delete but put the path in list of files with bad extensions
                    else:
                        if fext =='jfif': # if ext= jfif change it to jpg
                            fnew_path=os.path.join(class_path, fname + '.' + 'jpg')
                            shutil.copy(f_path,fnew_path )
                            os.remove(f_path)
                        else:
                            try:
                                img=cv2.imread(f_path) # returns None if the file is not a readable image
                                shape=img.shape # raises an exception when img is None
                                # skip conversion when the file already has the target extension,
                                # otherwise the freshly written file would be removed again
                                if convert_ext is not None and fext != convert_ext:
                                    fnew_path=os.path.join(class_path, fname + '.' + convert_ext)
                                    cv2.imwrite(fnew_path,img)
                                    os.remove (f_path)
                                class_good_count +=1
                            except Exception:
                                if delete:
                                    os.remove(f_path)
                                    class_removed_count +=1
                                else:
                                    bad_file_list.append(f_path)
            print('{0:^20s}{1}{2:^17s}{1}{3:^14s}{1}{4:^15s}'.format(klass,' ', str(class_file_count),str(class_good_count), str(class_removed_count)) )
            processed_count=processed_count + class_file_count
            good_count=good_count + class_good_count
            removed_count=removed_count+ class_removed_count
    print('processed ', processed_count, ' files ', good_count, 'files were verified ', removed_count, ' files were removed')
    return bad_file_list
Below is an example of use
source_dir=r'c:\temp\people\storage'
good_ext_list=['jpg', 'jpeg', 'bmp', 'tiff', 'png']
new_ext='bmp'
bad_file_list=check_file_extension (source_dir, good_ext_list, delete=False,convert_ext=new_ext )
print (bad_file_list)
Below is the typical output
Class Directory Files Processed Files Verified Files Removed
savory 20 20 0
unsavory 21 20 0
processed 41 files 40 files were verified 0 files were removed
['c:\\temp\\people\\storage\\unsavory\\040.xyz']
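Before deleting or converting anything, a quicker first check is to count how many files of each extension sit in every class directory and compare the totals with the "Found N images" message. The sketch below is an illustration only. `count_extensions` is a hypothetical helper, and the directory and file names are made up for the demonstration.

```python
import collections
import os
import tempfile


def count_extensions(source_dir):
    """Map each class sub directory to a Counter of lower-cased extensions."""
    counts = {}
    for klass in sorted(os.listdir(source_dir)):
        class_path = os.path.join(source_dir, klass)
        if not os.path.isdir(class_path):
            continue  # skip stray files at the top level
        counter = collections.Counter()
        for f in os.listdir(class_path):
            if os.path.isfile(os.path.join(class_path, f)):
                ext = os.path.splitext(f)[1].lower().lstrip('.')
                counter[ext] += 1
        counts[klass] = counter
    return counts


# Tiny demonstration on a scratch directory with made-up file names.
demo_dir = tempfile.mkdtemp()
demo_files = {'savory': ['a.jpg', 'b.JPG'], 'unsavory': ['c.png', 'd.xyz']}
for klass, names in demo_files.items():
    os.makedirs(os.path.join(demo_dir, klass))
    for name in names:
        open(os.path.join(demo_dir, klass, name), 'w').close()

demo_counts = count_extensions(demo_dir)
for klass, counter in demo_counts.items():
    print(klass, dict(counter))
```

An unexpected extension with a large count is a likely explanation for images that the generator does not pick up.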

using tfrecord but getting file too large

I am trying to create a TFRecord file from a folder of numpy arrays. The folder contains about 2000 numpy files of 50 MB each.
def convert(image_paths, out_path):
    # Args:
    # image_paths List of file-paths for the numpy arrays.
    # out_path File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images-1)
            # Load the array from the numpy file.
            img = np.load(path)
            # Convert the array to raw bytes.
            img_bytes = img.tobytes()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = \
                {
                    'image': wrap_bytes(img_bytes)
                }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
I think it converts about 200 files, and then I get this:
Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
File "tf_record.py", line 54, in convert
writer.write(serialized)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large
Any suggestions to fix this would be helpful, Thanks in advance.
I'm not sure what the size limits for TFRecord files are, but the more common approach, assuming you have enough disk space, is to split the dataset across several TFRecord files, e.g. store every 20 numpy files in a separate TFRecord file.
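The splitting itself needs no TensorFlow. A minimal sketch, assuming groups of 20 files as suggested above, is shown below. `shard_paths` and `shard_filename` are hypothetical helpers, and the `train-00000-of-00100.tfrecord` naming pattern is only one common convention, not something the question prescribes.

```python
def shard_paths(paths, shard_size):
    """Split `paths` into consecutive groups of at most `shard_size` items."""
    return [paths[i:i + shard_size] for i in range(0, len(paths), shard_size)]


def shard_filename(base, index, total):
    """Build a name such as 'train-00003-of-00100.tfrecord'."""
    return f'{base}-{index:05d}-of-{total:05d}.tfrecord'


# Placeholder paths standing in for the ~2000 numpy files from the question.
image_paths = [f'array_{i}.npy' for i in range(2000)]
shards = shard_paths(image_paths, shard_size=20)
shard_names = [shard_filename('train', i, len(shards)) for i in range(len(shards))]

print(len(shards), shard_names[0], shard_names[-1])
# Each (group, name) pair could then be passed to the question's
# convert(image_paths=group, out_path=name).
```

With 50 MB per array, each shard then holds roughly 1 GB of raw data, and the shard size can be lowered if that is still too large for the target filesystem.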

Getting the filename an example came from in tf.parse_example

I am writing a Data input pipeline in tensorflow that uses a bunch of tfrecord files with different Examples (types).
I am using code like:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)
However, I want my parse function to be different for file1.tfrecord than for file2.tfrecord. How do I achieve this? Is there some way of knowing in parse_example which file a particular example came from?
You can use a Dataset.flat_map() transformation to include the filename with each record as follows:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
filenames = tf.data.Dataset.from_tensor_slices(filenames)
# `Dataset.flat_map()` creates a nested dataset from each element in `filenames`.
#
# For each file in filenames, zip together the filename (repeated infinitely) with
# the records read from that file.
dataset = filenames.flat_map(
    lambda fn: tf.data.Dataset.zip((tf.data.Dataset.from_tensors(fn).repeat(None),
                                    tf.data.TFRecordDataset(fn))))
# The _parse_function can now be modified to take both the filename and the record.
dataset = dataset.map(lambda fn, record: _parse_function(fn, record))
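To see what the flat_map and zip combination produces, here is a plain Python analogy that pairs every record with the name of the file it came from. It is only an illustration of the resulting element structure, not TensorFlow code. `read_records` is a hypothetical stand-in for `tf.data.TFRecordDataset(fn)`, and the record contents are made up.

```python
import itertools


def read_records(filename):
    """Stand-in for tf.data.TFRecordDataset(filename): yields fake records."""
    fake_contents = {
        'file1.tfrecord': [b'a1', b'a2'],
        'file2.tfrecord': [b'b1'],
    }
    return iter(fake_contents[filename])


def records_with_filenames(filenames):
    """Yield (filename, record) pairs, one file after another."""
    for fn in filenames:
        # zip() stops when the records run out, so repeating the filename
        # forever is safe, just like .repeat(None) in the dataset version.
        yield from zip(itertools.repeat(fn), read_records(fn))


pairs = list(records_with_filenames(['file1.tfrecord', 'file2.tfrecord']))
print(pairs)
```

A parse function that receives both elements of each pair can then branch on the filename.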

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN Tutorial on the TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorials uses a queue to fetch the records from the file list provided. I tried to create my own binary file of this kind by reshaping my images into 1-D arrays and prepending a label value to each. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
Each row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from the file using tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as in cifar10_input.py to read the binary file and put the records into the record object.
Now when I read from the files, the labels and the image values are different. I think the reason is that pickle also dumps extra information (braces and brackets) into the binary file, which changes the fixed-length record size.
The above example passes the filenames to a queue to fetch the files, and then uses the queue to read a single record from the file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
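The "no other metadata" property is easy to check on a small array: the file size equals the number of rows times the record length, and reading the bytes back requires the caller to supply the dtype and the record length. The sketch below uses three made-up five-byte records and a temporary file rather than the real data.

```python
import os
import tempfile

import numpy as np

# Three made-up records: one label byte followed by four "pixel" bytes.
records = np.array([[1, 12, 34, 24, 53],
                    [2, 112, 43, 24, 52],
                    [0, 7, 8, 9, 10]], dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), 'images.bin')
records.tofile(path)

# No header or framing: the file size is exactly rows * record length.
file_size = os.path.getsize(path)

# Reading it back needs the dtype and record length supplied by the caller.
restored = np.fromfile(path, dtype=np.uint8).reshape(-1, records.shape[1])

print(file_size, np.array_equal(records, restored))
```

This is the property a fixed-length record reader relies on: every record starts at a multiple of the record length.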
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()
    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.
    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)
    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)
    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])
    return result
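The slicing, reshaping and transposing in read_my_data() can be tried out in NumPy on a tiny made-up record. This is a sketch with assumed dimensions (2 x 3 pixels, depth 2), not the real 22501-byte records. Note that the depth-major reshape is only correct if the images were flattened in [depth, height, width] order. If they were flattened directly from [height, width, depth] arrays, a single reshape to that shape is what is needed and the transpose should be dropped.

```python
import numpy as np

height, width, depth = 2, 3, 2  # assumed toy dimensions
label_bytes = 1
image_bytes = height * width * depth

# A made-up record: label 7 followed by pixel bytes 0..11.
record = np.concatenate([[7], np.arange(image_bytes)]).astype(np.uint8)

# The first byte is the label, the rest is the depth-major image.
label = int(record[0])
depth_major = record[label_bytes:label_bytes + image_bytes].reshape(depth, height, width)

# Move the depth axis last: [depth, height, width] -> [height, width, depth].
image = depth_major.transpose(1, 2, 0)

print(label, image.shape)
```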
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)
    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
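# The capacity must be large enough to hold every row, otherwise enqueue_many blocks.
q = tf.FIFOQueue(capacity=images_and_labels_array.shape[0],
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])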
...then call sess.run(enqueue_op) to populate the queue.
Another, more efficient, approach would be to feed records to the queue one at a time, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
# enqueue_many treats the first dimension as the batch of elements to enqueue.
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead
of the filenames to some reader and it can fetch records one by one
from that array instead of the files.
tf.py_func, that wraps a python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)