tf.decode_csv lasts too long - tensorflow

I am using TensorFlow v0.8, and strangely about 5 minutes pass between the first and the second print time.time(). I thought tf.decode_csv() would simply add an operation to the graph without doing any computation.
Why does the call to tf.decode_csv() take so long?
def main(argv=None):
    # deal with arguments
    with tf.device("/cpu:0"):
        filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once(train_set_filename + "*"))
        reader = tf.TextLineReader()
        _, line = reader.read(filename_queue)
        default = [[-1.0] for x in range(image_size * image_size * channels + 1)]
        print time.time()
        line = tf.decode_csv(line, record_defaults=default)
        print time.time()
        label = line[0]
        feature = tf.pack(list(line[1:]))
        ...

The call tf.decode_csv(line, record_defaults=default) takes a lot of time because you use so many columns.
I don't know your image_size, but if it is around 200 with 3 channels, you are asking for 200 * 200 * 3 + 1 = 120,001 columns in your csv, which is huge. You are right that TensorFlow is not doing any computation here, but it still has to build the graph properly, and with that many columns this takes a lot of time.
I strongly advise you to not use csv format for images. Instead you should store your images in JPEG format, and use tf.image.decode_jpeg().
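To get a feel for the numbers, the column count implied by record_defaults can be computed directly. This is a small sketch in plain Python; num_csv_columns is a hypothetical helper, and the 3-channel figure is an assumption, since the question does not state image_size or channels.

```python
def num_csv_columns(image_size, channels):
    """Number of CSV columns: one value per pixel per channel, plus one label."""
    return image_size * image_size * channels + 1

# Print the column count for a few square image sizes (3 channels assumed).
for size in (28, 64, 200):
    print(size, num_csv_columns(size, channels=3))
```

Each of these columns becomes a separate output tensor of the decode op, which is why graph construction slows down as the image grows.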

Related

Tensorflow tf.data.Dataset takes too much time to generate dataset. Better way to optimize it?

I have .stem.mp4 files each of which is composed of multiple audio sources.
Each length of file is 2 minutes to 6 minutes. It varies a lot.
When I try to make a tf.data.Dataset out of them, generating an input_batch takes much longer than my model needs to make a prediction for that batch.
Let me illustrate with an example.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
sample_data = tf.random.normal((5, 755200, 2)) # 5 sources of audio, stereo channel
# Index 0 along the first axis is the mixture of the audio, so this is the input
# The remaining 4 entries are the individual sources (e.g. bass, drums, vocals), so these are the output
input_mixture = sample_data[0, :, :]
target_mixtures = sample_data[1:, :, :]
target_mixtures = np.column_stack(target_mixtures)
length = 44100 * 11 # I want to split these into length of 11 seconds
strides = 44100 # 1 second stride
ds_inp = tf.data.Dataset.from_tensor_slices((input_mixture))
ds_inp = ds_inp.window(length, shift=strides, drop_remainder=True)
ds_inp = ds_inp.flat_map(lambda windows: windows.batch(length))
ds_inp = ds_inp.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)
ds_tar = tf.data.Dataset.from_tensor_slices((target_mixtures))
ds_tar = ds_tar.window(length, shift=strides, drop_remainder=True)
ds_tar = ds_tar.flat_map(lambda windows: windows.batch(length))
ds_tar = ds_tar.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)
ds_total = [ds_inp, ds_tar]
total_ds = tf.data.Dataset.zip(tuple(ds_total))
total_ds = total_ds.batch(BATCH_SIZE)
total_ds = total_ds.prefetch(tf.data.AUTOTUNE)
This is how I built a tf.data.Dataset from the given file.
When I measure how long it takes to produce one input_batch and output_batch:
%%time
for i, j in total_ds.take(1):
    pass
# Wall time: 18.3 s
My model has about 100 million variables, but its structure is fairly simple, so it takes only about 6 seconds to generate a predicted_batch from a given input_batch.
So my question is: is there any way to generate input_batch and output_batch faster?
(My assumption is that, because this pipeline windows the given arrays, there is no better way to improve this.)
Obviously, the files are all too big to be cached.
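One thing worth noting about the pipeline above: from_tensor_slices on a (755200, 2) tensor yields one dataset element per audio sample, so every 11-second window is assembled from 485,100 tiny elements. A possible direction, not benchmarked here, is to compute the window start offsets up front and slice whole windows out of the array. The sketch below only shows the offset arithmetic in plain Python; window_starts is a hypothetical helper, not a TensorFlow API.

```python
def window_starts(num_samples, length, stride):
    """Start offsets of all full windows (same effect as drop_remainder=True)."""
    return list(range(0, num_samples - length + 1, stride))

SAMPLE_RATE = 44100
starts = window_starts(num_samples=755200,
                       length=SAMPLE_RATE * 11,  # 11-second windows
                       stride=SAMPLE_RATE)       # 1-second stride
print(len(starts), starts[:3])
```

Each window is then x[s:s + length] for s in starts, so a whole window is produced by one slice instead of by iterating over individual samples.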

Example pipeline for TFRecords with chunking for long input sequences

I'm trying to optimise the input pipeline for a model I am using that uses GRUs. The data consists of a large number of files that contain time series of length 5000 with dimensionality of 50. I know that it isn't feasible to feed a single sequence of length 5000 into an RNN owing to the vanishing gradient, and you should instead try to chunk it into (5000-seq_len) overlapping chunks, where seq_len is a more manageable length, say 200 timesteps.
The most obvious method for getting this to work with TFRecords/SequenceExamples is to simply have each chunk included as a new SequenceExample within the same file. This seems massively inefficient however, as the majority of data in the resulting TFRecords file will be duplicate data.
Is there a better method of doing this? I've seen very few examples of how to use TFRecords that don't involve images, and no examples that use non-trivial sequence lengths!
For example:
import numpy as np
import tensorflow as tf
def chunk_save_tfrecords(X, file_path_prefix, seq_length):
    # Generate tfrecord writer
    result_tf_file = file_path_prefix + '.tfrecords'
    with tf.python_io.TFRecordWriter(result_tf_file) as writer:
        # Chunk the data
        for i in range(int(X.shape[0] - seq_length)):
            chunk = X[i:i+seq_length]
            data_features = [
                tf.train.Feature(
                    float_list=tf.train.FloatList(value=chunk[t]))
                for t in range(seq_length)] # FloatList per timestep
            feature_lists = tf.train.FeatureLists(
                feature_list={
                    'data': tf.train.FeatureList(feature=data_features)})
            serialized = tf.train.SequenceExample(
                feature_lists=feature_lists).SerializeToString()
            writer.write(serialized)
def save_tfrecords(X, file_path_prefix):
    # Generate tfrecord writer
    result_tf_file = file_path_prefix + '.tfrecords'
    with tf.python_io.TFRecordWriter(result_tf_file) as writer:
        data_features = [
            tf.train.Feature(
                float_list=tf.train.FloatList(value=X[t]))
            for t in range(X.shape[0])] # FloatList per timestep
        feature_lists = tf.train.FeatureLists(
            feature_list={
                'data': tf.train.FeatureList(feature=data_features)})
        serialized = tf.train.SequenceExample(
            feature_lists=feature_lists).SerializeToString()
        writer.write(serialized)
test = np.random.randn(5000,50)
save_tfrecords(test, 'test')
chunk_save_tfrecords(test, 'test_chunk', 200)
save_tfrecords creates a 1MB file, while chunk_save_tfrecords creates a 200MB file!
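The size gap follows from simple arithmetic: the chunked file stores almost every timestep about seq_length times. A rough estimate in plain Python, assuming 4 bytes per stored float and ignoring protobuf and record framing overhead (both assumptions; the helper names are hypothetical):

```python
BYTES_PER_FLOAT = 4  # assumption: FloatList values are stored as 32-bit floats

def full_sequence_bytes(timesteps, dims):
    """Payload size when the whole sequence is stored once."""
    return timesteps * dims * BYTES_PER_FLOAT

def chunked_bytes(timesteps, dims, seq_length):
    """Payload size when every overlapping chunk is stored separately."""
    num_chunks = timesteps - seq_length  # matches range() in chunk_save_tfrecords
    return num_chunks * seq_length * dims * BYTES_PER_FLOAT

print(full_sequence_bytes(5000, 50))   # 1000000
print(chunked_bytes(5000, 50, 200))    # 192000000
```

That is about 1 MB versus 192 MB, which is consistent with the file sizes reported above and suggests that storing each sequence once and cutting chunks after parsing would avoid the duplication.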

TensorFlow example, MemoryError while run text_classification_character_cnn.py

I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I use the CPU installation of TensorFlow with Python 3.5. Any ideas how to solve this? Other scripts that use a csv file for input work fine.
I was having the same issue. And after many hours of reading and googling (and seeing your unanswered question), and just comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based on what I've read about numpy, I'd bet the "size='large'" parameter causes an over-allocation of a numpy array (which throws the memory error).
Or perhaps the input data is truncated when you don't set that parameter.
Or it is something else. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without "size='large'" the load_dataset function appears to create smaller training and test data sets (about 1/1000 the size).
After playing around with the example, I realized I could manually load and use the whole data set without getting the memory error (assuming it really is loading the whole data set, as it appears to).
import csv  # needed for the replacement below
# Prepare training and testing data
## This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
#     'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)
## And this is my replacement.
x_train = []
y_train = []
x_test = []
y_test = []
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_train.append(row[2])
        y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_test.append(row[2])
        y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example now seems to evaluate the whole training data set. However, the original code will probably need to be run once to download the data and put it in the correct sub-folders. Also, even while evaluating the whole data set, little memory is used (just a few hundred MB), which makes me think that the load_dataset function is broken in some way.

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN tutorial for TensorFlow and I am trying to use the same model for my project.
The problem now is reading the data. I have around 25000 images for training and around 5000 each for testing and validation. The files are in png format, and I can read them and convert them into numpy.ndarray objects.
The CNN example in the tutorials uses a queue to fetch records from the provided file list. I tried to create my own binary file of that kind by reshaping my images into 1-D arrays and putting a label value in front of each one. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
A single row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from the file using
tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same thing as cifar10_input.py to read the binary file and put the data into the record object.
Now when I read from the file, the labels and the image values are different. I assume the reason is that pickle also writes extra information (such as braces and brackets) into the binary file, which changes the fixed-length record size.
The example passes the filenames to a queue to fetch the files, and then uses a reader to read a single record at a time from each file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
1. Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
2. Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()
    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.
    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)
    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)
    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])
    return result
3. Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)
    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
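The point above about raw bytes versus pickle can be checked with a small, TensorFlow-free sketch. It uses made-up 3-byte records (1 label byte plus 2 pixel bytes) instead of the 22501-byte records from the question; the variable names are illustrative only.

```python
import pickle

records = [[1, 12, 34], [12, 112, 43]]  # toy data: label byte + 2 pixel bytes
record_bytes = len(records[0])

# Raw row-major dump: exactly record_bytes per record, no metadata.
raw = b"".join(bytes(row) for row in records)

# Pickle adds protocol headers and opcodes, so record boundaries shift.
pickled = pickle.dumps(records)

# Fixed-length parsing of the raw dump recovers every record.
parsed = [list(raw[i:i + record_bytes])
          for i in range(0, len(raw), record_bytes)]

print(len(raw), len(pickled))
print(parsed == records)
```

A fixed-length reader applied to the pickled bytes would start each record at the wrong offset, which matches the shifted labels and pixel values described in the question.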
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
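# capacity is a required argument; sizing it to hold the whole array is an assumption.
q = tf.FIFOQueue(capacity=images_and_labels_array.shape[0],
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])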
...then call sess.run(enqueue_op) to populate the queue.
Another, more efficient approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead
of the filenames to some reader and it can fetch records one by one
from that array instead of the files.
tf.py_func, which wraps a Python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)  # the files are PNG, so decode_png rather than decode_jpeg

TensorFlow: How to apply the same image distortion to multiple images

Starting from the Tensorflow CNN example, I'm trying to modify the model to have multiple images as an input (so that the input has not just 3 input channels, but multiples of 3 by stacking images).
To augment the input, I try to use random image operations, such as flipping, contrast and brightness provided in TensorFlow.
My current solution to apply the same random distortion to all input images is to use a fixed seed value for these operations:
def distort_image(image):
    flipped_image = tf.image.random_flip_left_right(image, seed=42)
    contrast_image = tf.image.random_contrast(flipped_image, lower=0.2, upper=1.8, seed=43)
    brightness_image = tf.image.random_brightness(contrast_image, max_delta=0.2, seed=44)
    return brightness_image
This method is called multiple times, once for each image, at graph construction time, so I thought that for each image it would use the same random number sequence and consequently apply the same image operations to my whole input sequence.
# ...
# distort images
distorted_prediction = distort_image(seq_record.prediction)
distorted_input = []
for i in xrange(INPUT_SEQ_LENGTH):
    distorted_input.append(distort_image(seq_record.input[i,:,:,:]))
stacked_distorted_input = tf.concat(2, distorted_input)
# Ensure that the random shuffling has good mixing properties.
min_queue_examples = int(num_examples_per_epoch *
                         MIN_FRACTION_EXAMPLES_IN_QUEUE)
# Generate a batch of sequences and prediction by building up a queue of examples.
return generate_sequence_batch(stacked_distorted_input, distorted_prediction, min_queue_examples,
                               batch_size, shuffle=True)
In theory, this works fine. After some test runs, it really seemed to solve my problem. But after a while, I found out that I have a race condition, because I use the input pipeline of the CNN example code with multiple threads (which is the suggested method in TensorFlow to improve performance and reduce memory consumption at runtime):
def generate_sequence_batch(sequence_in, prediction, min_queue_examples,
                            batch_size):
    num_preprocess_threads = 8 # <-- !!!
    sequence_batch, prediction_batch = tf.train.shuffle_batch(
        [sequence_in, prediction],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
    return sequence_batch, prediction_batch
Because multiple threads create my examples, it is no longer guaranteed that all image operations are performed in the right order (in the sense of the right order of random operations).
At this point I got completely stuck. Does anyone know how to apply the same image distortion to multiple images?
Some thoughts of mine:
I thought about adding some synchronization around these image distortion methods, but I could not find anything provided by TensorFlow.
I tried to generate a random number myself, e.g. for the random brightness delta, using tf.random_uniform() and to use this value for tf.image.adjust_contrast(). But the result of the TensorFlow random generator is always a tensor, and I have not found a way to use this tensor as a parameter for tf.image.adjust_contrast(), which expects a simple float32 for its contrast_factor parameter.
A solution that would (partly) work would be to combine all images into one huge image using tf.concat(), apply random operations to change contrast and brightness, and split the image afterwards. But this would not work for random flipping, because flipping would (at least in my case) change the order of the images, and there is no way to detect whether tf.image.random_flip_left_right() has performed a flip or not, which would be required to fix the wrong order of images if necessary.
Here is what I came up with by looking at the code of random_flip_up_down and random_flip_left_right within TensorFlow:
def image_distortions(image, distortions):
    distort_left_right_random = distortions[0]
    mirror = tf.less(tf.pack([1.0, distort_left_right_random, 1.0]), 0.5)
    image = tf.reverse(image, mirror)
    distort_up_down_random = distortions[1]
    mirror = tf.less(tf.pack([distort_up_down_random, 1.0, 1.0]), 0.5)
    image = tf.reverse(image, mirror)
    return image
distortions = tf.random_uniform([2], 0, 1.0, dtype=tf.float32)
image = image_distortions(image, distortions)
label = image_distortions(label, distortions)
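The idea in the snippet above is to draw the random values once and reuse them for every image that must receive the same distortion. A TensorFlow-free sketch of that idea, using nested lists as stand-in images (the helper names draw_flips and apply_flips are hypothetical):

```python
import random

def draw_flips(rng):
    """Draw one (flip_left_right, flip_up_down) decision pair."""
    return rng.random() < 0.5, rng.random() < 0.5

def apply_flips(image, flips):
    """Apply an already-drawn flip decision to a 2-D image (list of rows)."""
    flip_lr, flip_ud = flips
    if flip_lr:
        image = [row[::-1] for row in image]  # reverse each row
    if flip_ud:
        image = image[::-1]                   # reverse the row order
    return image

rng = random.Random(0)
flips = draw_flips(rng)  # drawn once per example, then shared
image = [[1, 2], [3, 4]]
label = [[5, 6], [7, 8]]
print(apply_flips(image, flips), apply_flips(label, flips))
```

Because the randomness lives in flips rather than inside each distortion call, the order in which threads process the images no longer matters.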
I would do something like this using tf.case, which lets you specify what to return when a certain condition holds: https://www.tensorflow.org/api_docs/python/tf/case
import tensorflow as tf
def distort(image, x):
    # flip vertically, horizontally, both, or do nothing
    image = tf.case({
        tf.equal(x,0): lambda: tf.reverse(image,[0]),
        tf.equal(x,1): lambda: tf.reverse(image,[1]),
        tf.equal(x,2): lambda: tf.reverse(image,[0,1]),
    }, default=lambda: image, exclusive=True)
    return image
def random_distortion(image):
    x = tf.random_uniform([1], 0, 4, dtype=tf.int32)
    return distort(image, x[0])
To check that it works:
import numpy as np
import matplotlib.pyplot as plt
# create image
image = np.zeros((25,25))
image[:10,5:10] = 1.
# create subplots
fig, axes = plt.subplots(2,2)
for ax in axes.flatten():
    ax.axis('off')
with tf.Session() as sess:
    for i in range(4):
        distorted_img = sess.run(distort(image, i))
        axes[i % 2][i // 2].imshow(distorted_img, cmap='gray')
plt.show()