TensorFlow - shuffling at "batch-level" instead of "example-level"

I have a problem that I will try to explain with an example for easier understanding.
I want to classify oranges (O) and apples (A). For technical/legacy reasons (a component in the network), each batch must contain either only O or only A examples. So traditional shuffling at the example level is not possible/adequate, since I cannot afford a batch that mixes O and A examples. However, some kind of shuffling is still desirable, as it is common practice when training deep networks.
These are the steps that I take:
I first need to convert raw data/examples into TFRecords.
I shuffle the order of the raw examples, and then I create separate TFRecords that contain either only the shuffled O examples or only the shuffled A examples. Let's call this "example-level" shuffling. This takes place offline and only once.
At this point I have "clean" batches: O-batches that contain only O examples, and A-batches that contain only A examples.
I do not want to first feed the network with all the O-batches and then with all the A-batches sequentially. This would probably not help much in convergence.
Can I shuffle these batches on the "batch-level", i.e. without affecting their interior?

If you use the Dataset API it's fairly straightforward. Just zip the O and A batches, then apply a random selection function with Dataset.map():
import tensorflow as tf

ds0 = tf.data.Dataset.from_tensor_slices([0])
ds0 = ds0.repeat()
ds0 = ds0.batch(5)
ds1 = tf.data.Dataset.from_tensor_slices([1])
ds1 = ds1.repeat()
ds1 = ds1.batch(5)

def rand_select(ds0, ds1):
    rval = tf.random_uniform([])
    return tf.cond(rval < 0.5, lambda: ds0, lambda: ds1)

dataset = tf.data.Dataset.zip((ds0, ds1)).map(lambda ds0, ds1: rand_select(ds0, ds1))
iterator = dataset.make_one_shot_iterator()
ds = iterator.get_next()

with tf.Session() as sess:
    for _ in range(5):
        print(sess.run(ds))
> [0 0 0 0 0]
[1 1 1 1 1]
[1 1 1 1 1]
[0 0 0 0 0]
[0 0 0 0 0]
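On newer TensorFlow versions, a hedged alternative is tf.data.experimental.sample_from_datasets (also used further down in this thread): it picks whole batches at random from the already-batched O and A datasets, so the batch interiors stay untouched. A minimal TF 2.x sketch, with placeholder constants standing in for the real O/A examples:

import tensorflow as tf

# Placeholder datasets standing in for the real O-batches and A-batches.
ds_o = tf.data.Dataset.from_tensor_slices([0]).repeat(20).batch(5)  # "clean" O-batches
ds_a = tf.data.Dataset.from_tensor_slices([1]).repeat(20).batch(5)  # "clean" A-batches

# Randomly interleave whole batches; each emitted element is one intact batch.
mixed = tf.data.experimental.sample_from_datasets([ds_o, ds_a])
for batch in mixed.take(6):
    print(batch.numpy())  # e.g. [0 0 0 0 0], [1 1 1 1 1], ... in random batch order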

Related

What does batch, repeat, and shuffle do with TensorFlow Dataset?

I'm currently learning TensorFlow, but I'm confused by the code snippet below:
dataset = dataset.shuffle(buffer_size = 10 * batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
I know that the dataset first holds all the data, but what do shuffle(), repeat(), and batch() do to the dataset?
Please help me with an example and explanation.
Update: Here is a small Colab notebook that demonstrates this answer.
Imagine, you have a dataset: [1, 2, 3, 4, 5, 6], then:
How ds.shuffle() works
dataset.shuffle(buffer_size=3) will allocate a buffer of size 3 for picking random entries. This buffer will be connected to the source dataset.
We could picture it like this:
Random buffer
   |
   |   Source dataset where all other elements live
   |   |
   ↓   ↓
[1,2,3] <= [4,5,6]
Let's assume that entry 2 was taken from the random buffer. The free slot is filled by the next element from the source dataset, that is 4:
2 <= [1,3,4] <= [5,6]
We continue reading till nothing is left:
1 <= [3,4,5] <= [6]
5 <= [3,4,6] <= []
3 <= [4,6] <= []
6 <= [4] <= []
4 <= [] <= []
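For a concrete (hedged, TF 2.x) check of this buffer behavior:

import tensorflow as tf

ds = tf.data.Dataset.range(1, 7)       # [1, 2, 3, 4, 5, 6]
ds = ds.shuffle(buffer_size=3)         # only 3 elements are candidates at any time
print(list(ds.as_numpy_iterator()))    # e.g. [2, 1, 5, 3, 6, 4] -- varies per run,
                                       # but 6 can never be one of the first three outputs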
How ds.repeat() works
As soon as all the entries have been read from the dataset and you try to read the next element, an end-of-sequence error is raised.
That's where ds.repeat() comes into play. It re-initializes the dataset, making it look like this again:
[1,2,3] <= [4,5,6]
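A tiny (hedged, TF 2.x) illustration of repeat:

import tensorflow as tf

ds = tf.data.Dataset.range(1, 7)
print(list(ds.repeat(2).as_numpy_iterator()))
# [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6] -- the dataset restarts instead of ending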
What will ds.batch() produce
The ds.batch() will take the first batch_size entries and make a batch out of them. So, a batch size of 3 for our example dataset will produce two batch records:
[2,1,5]
[3,6,4]
As we have a ds.repeat() before the batch, the generation of the data will continue. But the order of the elements will be different, due to ds.shuffle(). What should be taken into account is that 6 will never be present in the first batch, due to the size of the random buffer.
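Putting the three together, a hedged TF 2.x sketch of the pipeline described above:

import tensorflow as tf

ds = tf.data.Dataset.range(1, 7)                   # [1, 2, 3, 4, 5, 6]
ds = ds.shuffle(buffer_size=3).repeat(2).batch(3)  # shuffle, two passes, batches of 3
for batch in ds:
    print(batch.numpy())   # e.g. [2 1 5] [3 6 4] [1 4 2] [3 5 6] -- four batches over two epochs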
The following methods of tf.data.Dataset are involved (a short sketch follows the list):
repeat(count=None): repeats the dataset count times; with the default (None or -1) it repeats indefinitely.
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None): randomly shuffles the elements of the dataset; buffer_size is the number of elements held in the buffer from which the next element is sampled.
batch(batch_size, drop_remainder=False): combines consecutive elements into batches of length batch_size; the final batch may be smaller unless drop_remainder=True.
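For instance, a hedged TF 2.x sketch of batch_size and drop_remainder:

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print([b.numpy().tolist() for b in ds.batch(4)])
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]   -- the last batch is shorter
print([b.numpy().tolist() for b in ds.batch(4, drop_remainder=True)])
# [[0, 1, 2, 3], [4, 5, 6, 7]]           -- the incomplete batch is dropped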
An example that shows looping over epochs. Upon running this script, notice the difference in:
dataset_gen1 - the shuffle operation produces more random outputs (this may be more useful while running machine learning experiments)
dataset_gen2 - the lack of a shuffle operation produces elements in sequence
Other additions in this script:
tf.data.experimental.sample_from_datasets - used to combine two datasets. Note that the shuffle operation in this case creates a buffer that samples equally from both datasets.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"          # to avoid all those prints
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"  # to avoid large "Kernel Launch Time"

import tensorflow as tf
if len(tf.config.list_physical_devices('GPU')):
    tf.config.experimental.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)

class Augmentations:

    def __init__(self):
        pass

    @tf.function
    def filter_even(self, x):
        if x % 2 == 0:
            return False
        else:
            return True

class Dataset:

    def __init__(self, aug, range_min=0, range_max=100):
        self.range_min = range_min
        self.range_max = range_max
        self.aug = aug

    def generator(self):
        dataset = tf.data.Dataset.from_generator(self._generator,
                                                 output_types=(tf.float32), args=())
        dataset = dataset.filter(self.aug.filter_even)
        return dataset

    def _generator(self):
        for item in range(self.range_min, self.range_max):
            yield(item)

# Can be used when you have multiple datasets that you wish to combine
class ZipDataset:

    def __init__(self, datasets):
        self.datasets = datasets
        self.datasets_generators = []

    def generator(self):
        for dataset in self.datasets:
            self.datasets_generators.append(dataset.generator())
        return tf.data.experimental.sample_from_datasets(self.datasets_generators)

if __name__ == "__main__":
    aug = Augmentations()
    dataset1 = Dataset(aug, 0, 100)
    dataset2 = Dataset(aug, 100, 200)
    dataset = ZipDataset([dataset1, dataset2])

    epochs = 2
    shuffle_buffer = 10
    batch_size = 4
    prefetch_buffer = 5

    dataset_gen1 = dataset.generator().shuffle(shuffle_buffer).batch(batch_size).prefetch(prefetch_buffer)
    # dataset_gen2 = dataset.generator().batch(batch_size).prefetch(prefetch_buffer)  # this will output odd elements in sequence

    for epoch in range(epochs):
        print('\n ------------------ Epoch: {} ------------------'.format(epoch))
        for X in dataset_gen1.repeat(1):  # adding .repeat() in the loop allows you to easily control the end of the loop
            print(X)
        # Do some stuff at end of loop

Solving XOR with 3 data points using Multi-Layered Perceptron

The XOR problem is known to be solvable by a multi-layer perceptron: given all 4 boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O. E.g.:
import numpy as np
np.random.seed(0)

def sigmoid(x):  # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

num_epochs = 10000
learning_rate = 1.0

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # Update weights.
    W2 += learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += learning_rate * np.dot(layer0.T, layer1_delta)
We see that we've fully trained the network to memorize the outputs for XOR:
# On the training data
[int(prediction > 0.5) for prediction in layer2]
[out]:
[0, 1, 1, 0]
If we re-feed the same inputs, we get the same output:
for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
0 [0]
But if we retrain the parameters (W1 and W2) without one of the data points, i.e.
xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
Let's drop the last row of data and use it as the unseen test.
X = xor_input[:-1]
Y = xor_output[:-1]
And with the rest of the same code, regardless of how I change the hyperparameters, it's unable to learn the XOR function and reproduce the I/O.
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(int(prediction > 0.5), y)
[out]:
0 [0]
1 [1]
1 [1]
1 [0]
Even if we shuffle the inputs/outputs:
# Shuffle the order of the inputs.
import random
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))
We can't train the XOR function fully:
for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x))  # Feed the unseen input into trained W.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction))  # Feed the unseen input into trained W.
    print(x, int(prediction > 0.5), y)
[out]:
[0 0] 1 [0]
[0 1] 1 [1]
[1 0] 1 [1]
[1 1] 0 [0]
So when the literature states that the multi-layered perceptron (aka basic deep learning) solves XOR, does it mean that it can fully learn and memorize the weights given the full set of inputs/outputs, but cannot generalize the XOR problem when one of the data points is missing?
Here's the link of the Kaggle dataset that answerers can test the network for themselves: https://www.kaggle.com/alvations/xor-with-mlp/
I think learning (generalizing) XOR and memorizing XOR are different things.
A two-layer perceptron can memorize XOR, as you have seen; that is, there exists a combination of weights where the loss is minimal and equal to 0 (the absolute minimum).
If the weights are randomly initialized, you might end up in the situation where you have actually learned XOR and not only memorized it.
Note that multi-layer perceptrons are non-convex functions, so there can be multiple minima (even multiple global minima). When the data is missing one input, there are multiple minima (all equal in value), and among them are minima that classify the missing point correctly. Hence, an MLP can learn XOR (though finding that weight combination might be hard with a missing point).
It is quite often argued that neural networks are universal function approximators and can even fit nonsensical labels. In that light, you might want to look at this work: https://arxiv.org/abs/1611.03530
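To probe the "multiple minima" point empirically, here is a hedged sketch (not from the original answer) that wraps the question's training loop into a hypothetical train() helper, trains on only three XOR points over several random seeds, and checks the held-out point; depending on the seed, the missing point may or may not be classified correctly:

import numpy as np

xor_input = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_output = np.array([[0, 1, 1, 0]]).T
X, Y = xor_input[:-1], xor_output[:-1]        # hold out [1, 1] -> 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(X, Y, hidden_dim=5, num_epochs=10000, lr=1.0, seed=0):
    np.random.seed(seed)
    W1 = np.random.random((X.shape[1], hidden_dim))
    W2 = np.random.random((hidden_dim, Y.shape[1]))
    for _ in range(num_epochs):
        layer1 = sigmoid(X.dot(W1))
        layer2 = sigmoid(layer1.dot(W2))
        layer2_delta = (Y - layer2) * layer2 * (1 - layer2)
        layer1_delta = layer2_delta.dot(W2.T) * layer1 * (1 - layer1)
        W2 += lr * layer1.T.dot(layer2_delta)
        W1 += lr * X.T.dot(layer1_delta)
    return W1, W2

for seed in range(5):
    W1, W2 = train(X, Y, seed=seed)
    held_out = sigmoid(sigmoid(np.array([1, 1]).dot(W1)).dot(W2))
    print(seed, int(held_out[0] > 0.5))       # 0 is the correct label; results vary by seed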

Interleaving multiple TensorFlow datasets together

The current TensorFlow dataset interleave functionality is basically an interleaved flat-map taking a single dataset as input. Given the current API, what's the best way to interleave multiple datasets together? Say they have already been constructed and I have a list of them. I want to produce elements from them alternately, and I want to support lists with more than 2 datasets (i.e., stacked zips and interleaves would be pretty ugly).
Thanks! :)
@mrry might be able to help.
EDIT 2: See tf.contrib.data.choose_from_datasets. It performs deterministic dataset interleaving.
EDIT: See tf.contrib.data.sample_from_datasets. Even though it performs random sampling I guess it can be useful.
Even though this is not "clean", it is the only workaround I came up with.
datasets = [tf.data.Dataset...]

def concat_datasets(datasets):
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0

ds = tf.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *args: concat_datasets(args)
)
Expanding on user2781994's answer (with edits), here is how I implemented it:
import tensorflow as tf

ds11 = tf.data.Dataset.from_tensor_slices([1,2,3])
ds12 = tf.data.Dataset.from_tensor_slices([4,5,6])
ds13 = tf.data.Dataset.from_tensor_slices([7,8,9])
all_choices_ds = [ds11, ds12, ds13]

choice_dataset = tf.data.Dataset.range(len(all_choices_ds)).repeat()
ds14 = tf.contrib.data.choose_from_datasets(all_choices_ds, choice_dataset)
# alternatively:
# ds14 = tf.contrib.data.sample_from_datasets(all_choices_ds)

iterator = ds14.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            value = sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break
        print(value)
The output is:
1
4
7
2
5
8
3
6
9
In TensorFlow 2.0:
tot_imm_dataset1 = 105
tot_imm_dataset2 = 55

e = tf.data.Dataset.from_tensor_slices(tf.cast([1, 0, 1], tf.int64)).repeat(int(tot_imm_dataset1 / 2))
f = tf.data.Dataset.range(1).repeat(int(tot_imm_dataset2 - tot_imm_dataset1 / 2))
choice = e.concatenate(f)
datasets = [dataset2, dataset1]
dataset_rgb_compl__con_patch = tf.data.experimental.choose_from_datasets(datasets, choice)
That works for me.
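For reference, a hedged TF 2.x (eager) rewrite of the session-based snippet earlier in this thread: choose_from_datasets lives under tf.data.experimental there, iterating the dataset directly replaces the iterator/Session boilerplate, and the choice dataset is made finite so iteration ends cleanly:

import tensorflow as tf

ds11 = tf.data.Dataset.from_tensor_slices([1, 2, 3])
ds12 = tf.data.Dataset.from_tensor_slices([4, 5, 6])
ds13 = tf.data.Dataset.from_tensor_slices([7, 8, 9])
all_choices_ds = [ds11, ds12, ds13]

# One choice per element: 0,1,2 repeated three times -> 9 selections in total.
choice_dataset = tf.data.Dataset.range(len(all_choices_ds)).repeat(3)
ds14 = tf.data.experimental.choose_from_datasets(all_choices_ds, choice_dataset)

for value in ds14:
    print(value.numpy())   # 1 4 7 2 5 8 3 6 9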

tensorflow reduce_mean with multidimension second argument

I came across a usage of reduce_mean with a vector as the second argument. I looked through the TensorFlow manual but couldn't find a corresponding example. The code is below:
tf.reduce_mean(train, [0,1,2])
where train is of size batchsize x H x L x 2.
I also played with some experiments but can't figure out how this second, vector-valued argument is processed:
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0,1,2])
Output = 1
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0])
Output = [[2 2 2]
[2 2 0]]
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0,1])
Output = [2 2 1]
To clarify tf.reduce_mean(train, [0,1,2]): when the second argument is a list, the mean is taken over every axis listed. For example, [0,1,2] reduces along axes 0, 1, and 2, collapsing all three of those dimensions at once.
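A short, hedged TF 2.x check of that: passing a list of axes reduces over all of them at once, and the Output = 1 above is an artifact of the tensor's integer dtype (the true mean is 20/12):

import tensorflow as tf

tensor = tf.constant([[[2, 2, 4], [2, 2, 0]],
                      [[2, 2, 0], [2, 2, 0]]])   # shape (2, 2, 3), dtype int32

print(tf.reduce_mean(tensor, [0, 1, 2]).numpy())                        # 1 -- mean over all 12 elements, truncated to int
print(tf.reduce_mean(tf.cast(tensor, tf.float32), [0, 1, 2]).numpy())   # 1.6666666 -- same reduction in float
print(tf.reduce_mean(tensor, [0, 1]).numpy())                           # [2 2 1] -- axes 0 and 1 collapsed, shape (3,)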

Understanding estimator.evaluate and outputting comparisons

Thanks for your help tensorflow community!
I have a question regarding understanding and visualizing the output of the estimator's evaluate function.
I have a DNNClassifier and have trained it on data with 10 output ranges that predictions can fall into.
After training and running
accuracy = classifier.evaluate(input_fn = test_input_fn)['accuracy']
I see my accuracy is 33.8%, and who knows how good that is (probably not good).
How can I see the output of each of the comparisons?
As the test data is run, I would like to see what the estimate is and what the actual value is. Basically a side-by-side of y and y'.
something like: [0 0 0 0 0 0 0 0 1] vs [0 0 0 0 0 0 0 0 1 0] 'false'
Rather than just seeing the aggregated overall accuracy.
Thanks!
So in the event that someone reads the question above, and understands what I was trying to do (view the output of predictions), I have a solution.
The solution is to utilize the .predict() method.
A good example is here:
https://www.tensorflow.org/get_started/estimator#classify_new_samples
My code ended up looking like:
import numpy as np
import tensorflow as tf

predict_input_fn = tf.estimator.inputs.numpy_input_fn(
    x = {"x": np.array(predict_set.data)},
    num_epochs = 1,
    shuffle = False)

predictions = list(classifier.predict(input_fn=predict_input_fn))

print("\n Predictions:")
print(len(predictions))
for p in predictions:
    print(int(p['classes'][0]))
which outputs the predictions in a column that I can copy/paste into a spreadsheet program to examine my data.
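If you also want the requested side-by-side of predicted vs. actual labels, a hedged sketch could look like the following (it assumes `predictions` and `predict_set` from the snippet above, and that `predict_set.target` holds the true labels, as in the linked estimator example):

# Hypothetical side-by-side comparison of y' (predicted) and y (actual).
for p, actual in zip(predictions, predict_set.target):
    predicted = int(p['classes'][0])
    print(predicted, actual, 'true' if predicted == int(actual) else 'false')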