using tfrecord but getting file too large - tensorflow

I am trying to create a TFRecord file from a folder of numpy arrays; the folder contains about 2000 numpy files of 50 MB each.
def convert(image_paths, out_path):
    # Args:
    #   image_paths  List of file-paths for the input numpy files.
    #   out_path     File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images - 1)
            # Load the numpy array from disk.
            img = np.load(path)
            # Convert the array to raw bytes.
            img_bytes = img.tostring()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
I think it converts about 200 files and then I get this:
Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
File "tf_record.py", line 54, in convert
writer.write(serialized)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large
Any suggestions to fix this would be helpful. Thanks in advance.

I'm not sure what the limits on TFRecord files are, but the more common approach, assuming you have enough disk space, is to shard your dataset over several TFRecord files, e.g. store every 20 numpy files in a separate TFRecord file.
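A rough sketch of that sharding idea (my own illustration, untested on your data; it reuses the convert() function from your question and a made-up output-name pattern):

# Write every `files_per_shard` numpy files into their own TFRecord file,
# reusing convert() from the question above.
files_per_shard = 20
for shard_id, start in enumerate(range(0, len(image_paths), files_per_shard)):
    shard_paths = image_paths[start:start + files_per_shard]
    convert(shard_paths, "train-{:04d}.tfrecord".format(shard_id))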

Related

How to decode a .csv .gzip file containing tweets?

I'm trying to do a Twitter sentiment analysis and my dataset is a couple of .csv.gzip files.
This is what I did to combine them all into one dataframe.
(I'm using Google Colab, in case that has anything to do with the error, the filenames, or something else.)
apr_files = [file[9:] for file in csv_collection if re.search(r"04+", file)]
apr_files
Output:
['0428_UkraineCombinedTweetsDeduped.csv.gzip',
'0430_UkraineCombinedTweetsDeduped.csv.gzip',
'0401_UkraineCombinedTweetsDeduped.csv.gzip']
temp_list = []
for file in apr_files:
    print(f"Reading in {file}")
    # unzip and read in the csv file as a dataframe
    temp = pd.read_csv(file, compression="gzip", header=0, index_col=0)
    # append dataframe to temp list
    temp_list.append(temp)
Error:
Reading in 0428_UkraineCombinedTweetsDeduped.csv.gzip
Reading in 0430_UkraineCombinedTweetsDeduped.csv.gzip
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Reading in 0401_UkraineCombinedTweetsDeduped.csv.gzip
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-5cba3ca01b1e> in <module>()
3 print(f"Reading in {file}")
4 # unzip and read in the csv file as a dataframe
----> 5 tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
6 # append dataframe to temp list
7 tmp_df_list.append(tmp_df)
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 8048: invalid start byte
I assumed that this error might be because the tweets contain special characters (like emoji, non-English characters, etc.).
I just switched to Jupyter Notebook, and it worked fine there.
As of now, I don't know what the issue with Google Colab was, though.
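If you still need it to work in Colab, one possible workaround (my own sketch, not verified against these exact files) is to decompress and decode each file manually with errors="replace", so a single bad byte doesn't abort the whole read:

import gzip
import io
import pandas as pd

def read_gzip_csv(path):
    # Decode manually so invalid UTF-8 bytes become replacement characters
    # instead of raising UnicodeDecodeError.
    with gzip.open(path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    return pd.read_csv(io.StringIO(text), header=0, index_col=0, low_memory=False)

temp_list = [read_gzip_csv(file) for file in apr_files]

Setting low_memory=False also silences the DtypeWarning shown above.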

DataFolder class not detecting training samples

I am trying to reproduce the experiments in the paper Cross Modal Focal Loss for RGBD Face Anti-Spoofing (https://arxiv.org/pdf/2103.00948.pdf). I've pointed my preprocessed directory to the MC-PixBiS-224 preprocessed data in order to train the RGBDMH-CMFL model. I've selected the grandtest protocol and pointed the annotations directory to the PROTOCOL-grand_test-curated.csv file. However, my DataFolder class fails to load any training samples: the length of the dataset, when printed, is 0.
Traceback (most recent call last):
File "bin/train_generic.py", line 22, in <module>
sys.exit(bob.learn.pytorch.scripts.train_generic.main())
File "/home/hazeeq/anaconda3/envs/bob.paper.cross_modal_focal_loss_cvpr2021/lib/python3.7/site-packages/bob/learn/pytorch/scripts/train_generic.py", line 150, in main
shuffle=True,
File "/home/hazeeq/anaconda3/envs/bob.paper.cross_modal_focal_loss_cvpr2021/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 262, in __init__
sampler = RandomSampler(dataset, generator=generator) # type: ignore
File "/home/hazeeq/anaconda3/envs/bob.paper.cross_modal_focal_loss_cvpr2021/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 104, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
(bob.paper.cross_modal_focal_loss_cvpr2021) hazeeq@hazeeq-U3033:~/test/bob.paper.cross_modal_focal_loss_cvpr2021$
Line 152 of train_generic.py refers to this section of the code, where dataloader["train"] fails to be created as a proper DataLoader object within the 'else' branch:
# Which device to use is figured out at this point, no need to use `use-gpu` flag anymore
# get data
if hasattr(configuration, "dataset"):
    dataloader = {}
    if not do_crossvalidation:
        logger.info(
            "There are {} training samples".format(
                len(configuration.dataset["train"])
            )
        )
        dataloader["train"] = torch.utils.data.DataLoader(
            configuration.dataset["train"],
            batch_size=batch_size,
            num_workers=num_workers,
            shuffle=True,
        )
    else:
        dataloader["train"] = torch.utils.data.DataLoader(
            configuration.dataset["train"],
            batch_size=batch_size,
            num_workers=num_workers,
            shuffle=True,
        )
        dataloader["val"] = torch.utils.data.DataLoader(
            configuration.dataset["val"],
            batch_size=batch_size,
            num_workers=num_workers,
            shuffle=True,
        )
        logger.info(
            "There are {} training samples".format(
                len(configuration.dataset["train"])
            )
        )
        logger.info(
            "There are {} validation samples".format(
                len(configuration.dataset["val"])
            )
        )
else:
    logger.error("Please provide a dataset in your configuration file !")
    sys.exit()

assert hasattr(configuration, "optimizer")

# train the network
if hasattr(configuration, "network"):
    trainer = GenericTrainer(
        configuration.network,
        configuration.optimizer,
        configuration.compute_loss,
        learning_rate=learning_rate,
        device=device,
        verbosity_level=verbosity_level,
        tf_logdir=output_dir + "/tf_logs",
        do_crossvalidation=do_crossvalidation,
        save_interval=save_interval,
    )
    trainer.train(dataloader, n_epochs=epochs, output_dir=output_dir, model=model)
else:
    logger.error("Please provide a network in your configuration file !")
    sys.exit()
The code also reports multiple missing files, so I am not sure whether there are files missing that should be part of the MC-PixBiS-224 preprocessed data. I have attached some of the missing-file messages below; there are more missing files than the ones shown.
...............................HLDI self.annotation_directory ./hqwmca-protocols-csv/PROTOCOL-grand_test-curated.csv
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/02.04.19/1_03_0064_0000_06_01_013-e3a1456b.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/01.04.19/1_03_0001_0000_07_00_001-c8bd4c01.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/02.04.19/1_03_0001_0000_06_01_001-48c7d79c.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/11.03.19/1_03_0523_0018_08_00_004-315ad7b2.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/02.04.19/1_03_0002_0000_06_01_002-173e70ed.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/11.10.19/1_01_0002_0000_00_00_000-51e86383.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/11.10.19/1_01_0002_0000_00_00_000-7517b634.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/07.10.19/1_01_0077_0000_00_00_000-9f7b92f8.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/07.10.19/1_01_0077_0000_00_00_000-d416451d.hdf5
Missing file: /home/Dataset/FaceAntiSpoofing/HQ-WMCA/MC-PixBiS-224/preprocessed/face-station/11.10.19/1_01_0084_0000_00_00_000-305a3a31.hdf5
The CSV files are not annotations; they just show the distribution of files across the folds of each protocol, in case you want to use the datasets outside bob. Moreover, the MC-PixBiS-224 preprocessed files do not correspond to the RGB-D data; they correspond to an earlier paper. For the RGB-D data you have to access the RAW data and preprocess it following the documentation.
Regards,
Anjith

Obtaining paths from .tfrecords file in tensorflow

Is it possible to get the paths of records (data items) from a .tfrecord file? For example, in order to get the total number of records, we can use tf.python_io.tf_record_iterator.
For example:
If I have 100 raw images and I convert them to the .tfrecords format, I can then load them into my TensorFlow model and access them. Is there a way to get the on-disk locations (paths) of these images using .tfrecords?
When you create a tfrecord file from a batch of images, it means that the data from these images is stored in the tfrecord file in bytes format. You can also store the path of the original image in the tfrecord file, e.g.:
def image_example(image_string, label, path):
    feature = {
        'label': _int64_feature(label),
        'image_raw': _bytes_feature(image_string),
        'path': _bytes_feature(path),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
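The _int64_feature and _bytes_feature helpers are assumed above; a minimal way to define them, plus a sketch of reading the stored path back out when parsing (the tf.io names assume a reasonably recent TensorFlow; older 1.x versions expose the same ops as tf.parse_single_example and tf.FixedLenFeature):

import tensorflow as tf

def _bytes_feature(value):
    # Accept either str or bytes and wrap it as a BytesList feature.
    if isinstance(value, str):
        value = value.encode("utf-8")
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# When reading the file back, the path is just another feature of each record:
feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'path': tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    return tf.io.parse_single_example(serialized, feature_description)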

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing an .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da

data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2, 2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)

data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply use the lines of code that follow the loading of the "info" file, assuming you have the right values to hand. Some of these values, i.e., the dtype and the shape of each array for constructing chunks, could presumably be obtained by looking at the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
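Alternatively, you could write the missing info file yourself and then call from_npy_stack as usual. A sketch under the assumptions of your example (files named 0.npy, 1.npy, ..., all with the same shape and dtype, stacked along axis 0); the info keys mirror the to_npy_info snippet quoted in the question:

import os
import pickle
import numpy as np
import dask.array as da

dirname = '/home/tom/data'
files = sorted(f for f in os.listdir(dirname) if f.endswith('.npy'))
first = np.load(os.path.join(dirname, files[0]))

# One chunk per file along the stacking axis, full extent on the other axes.
chunks = ((first.shape[0],) * len(files),) + tuple((s,) for s in first.shape[1:])
with open(os.path.join(dirname, 'info'), 'wb') as f:
    pickle.dump({'chunks': chunks, 'dtype': first.dtype, 'axis': 0}, f)

data = da.from_npy_stack(dirname)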

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN tutorial for TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 each for testing and validation. The files are in PNG format and I can read them and convert them into numpy.ndarray.
The CNN example in the tutorials uses a queue to fetch the records from the file list provided. I tried to create my own such binary file by reshaping my images into a 1-D array and attaching a label value at the front. So my data looks like this:
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
Each row of the above array has length 22501, where the first element is the label.
I dumped the data to a file using pickle and then tried to read from it with
tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as in cifar10_input.py to read the binary file and put the records into the record object.
Now when I read from the files, the labels and the image values are different. I can understand the reason: pickle also writes extra serialization metadata into the binary file, which changes the fixed record length.
The above example uses the filenames and passes them to a queue to fetch the files, and then reads a single record at a time from the queue.
I want to know if I can pass the numpy array as defined above, instead of the filenames, to some reader so that it can fetch records one by one from that array instead of from the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np

images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
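For instance, a small sketch (my addition) of splitting the array into several binary shards for read parallelism:

# Split the rows into a few shards and write each shard to its own binary file.
num_shards = 4
for shard_id, shard in enumerate(np.array_split(images_and_labels_array, num_shards)):
    shard.tofile("/tmp/images-%d.bin" % shard_id)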
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()
    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.
    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)
    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)
    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])
    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)
    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above, instead of the filenames, to some reader so that it can fetch records one by one from that array instead of from the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np

images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
q = tf.FIFOQueue(capacity=len(images_and_labels_array),  # capacity is the first argument
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0],
                             images_and_labels_array[:, 1:]])
...then call sess.run(enqueue_op) to populate the queue.
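For completeness, a short usage sketch (my addition, not part of the original answer) of how the training side could then dequeue examples from this queue and batch them:

# Dequeue one (label, image) pair at a time and let tf.train.batch assemble mini-batches.
label, image = q.dequeue()
image = tf.cast(image, tf.float32)
images_batch, labels_batch = tf.train.batch([image, label], batch_size=128)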
Another—more efficient—approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above, instead of the filenames, to some reader so that it can fetch records one by one from that array instead of from the files.
tf.py_func, which wraps a Python function and uses it as a TensorFlow operator, might help.
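A minimal sketch of that idea (my own illustration, assuming the images_and_labels_array defined earlier and the TF 1.x API used in this thread):

def _get_record(i):
    # Plain Python/numpy: pull one row out of the in-memory array.
    row = images_and_labels_array[i]
    return row[0:1], row[1:]  # label, image values

index = tf.placeholder(tf.int64, shape=[])
label, image = tf.py_func(_get_record, [index], [tf.uint8, tf.uint8])
label.set_shape([1])
image.set_shape([22500])
# Evaluate with e.g. sess.run([label, image], feed_dict={index: 0})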
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)