What do batch, repeat, and shuffle do with a TensorFlow Dataset?

I'm currently learning TensorFlow, and the code snippet below confuses me:
dataset = dataset.shuffle(buffer_size = 10 * batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
I know that the dataset initially holds all the data, but what do shuffle(), repeat(), and batch() do to it?
Please help me with an example and explanation.

Update: Here is a small Colab notebook demonstrating this answer.
Imagine, you have a dataset: [1, 2, 3, 4, 5, 6], then:
How ds.shuffle() works
dataset.shuffle(buffer_size=3) will allocate a buffer of size 3 for picking random entries. This buffer will be connected to the source dataset.
We could imagine it like this:
Random buffer
| Source dataset where all other elements live
| |
↓ ↓
[1,2,3] <= [4,5,6]
Let's assume that entry 2 was taken from the random buffer. The free space is filled by the next element from the source dataset, that is 4:
2 <= [1,3,4] <= [5,6]
We continue reading till nothing is left:
1 <= [3,4,5] <= [6]
5 <= [3,4,6] <= []
3 <= [4,6] <= []
6 <= [4] <= []
4 <= [] <= []
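The walk-through above can be modelled in a few lines of plain Python. This is only a sketch of the idea, not TensorFlow's actual implementation; buffered_shuffle and its seed argument are names made up for illustration.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in buffer-shuffled order (illustrative model)."""
    rng = random.Random(seed)
    it = iter(source)
    buffer = []
    # Fill the buffer with the first `buffer_size` elements.
    for item in it:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element, then put the next source element in its slot.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit whatever is left in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer

order = list(buffered_shuffle([1, 2, 3, 4, 5, 6], buffer_size=3, seed=0))
print(order)
```

Whatever the seed, the first output comes from [1, 2, 3], and 6 cannot appear among the first three outputs because it only enters the buffer after three elements have been emitted.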
How ds.repeat() works
As soon as all the entries are read from the dataset and you try to read the next element, the dataset will throw an error.
That's where ds.repeat() comes into play. It will re-initialize the dataset, making it again like this:
[1,2,3] <= [4,5,6]
What will ds.batch() produce
ds.batch() takes the first batch_size entries and makes a batch out of them. So, a batch size of 3 for our example dataset will produce two batch records:
[2,1,5]
[3,6,4]
As we have a ds.repeat() before the batch, the generation of data will continue. But the order of the elements will be different, due to ds.shuffle(). Note that 6 will never be present in the first batch, due to the size of the random buffer.
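One consequence of calling repeat() before batch() is that batches can straddle epoch boundaries when the dataset size is not a multiple of the batch size. The plain-Python model below illustrates this with a five-element dataset; repeat and batch here are hypothetical stand-ins written with itertools, not the TensorFlow methods themselves.

```python
from itertools import chain, islice

def repeat(make_epoch, count):
    """Concatenate `count` passes; `make_epoch()` returns one fresh pass."""
    return chain.from_iterable(make_epoch() for _ in range(count))

def batch(items, batch_size):
    """Group consecutive items into lists of at most `batch_size` elements."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

data = [1, 2, 3, 4, 5]

# repeat first, then batch: the second batch mixes two epochs
batches = list(batch(repeat(lambda: iter(data), count=2), batch_size=3))
print(batches)        # [[1, 2, 3], [4, 5, 1], [2, 3, 4], [5]]

# batch first, then repeat: every epoch ends with its own short batch
batches_first = list(repeat(lambda: batch(data, 3), count=2))
print(batches_first)  # [[1, 2, 3], [4, 5], [1, 2, 3], [4, 5]]
```

If clean epoch boundaries matter, batching before repeating is the ordering that preserves them.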

The relevant methods of tf.data.Dataset:
repeat(count=None) Repeats the dataset count times. With the default count=None (or count=-1) the dataset is repeated indefinitely.
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None) Randomly shuffles the elements of the dataset. buffer_size is the number of elements held in the buffer from which each output element is drawn at random; a buffer at least as large as the dataset gives a full uniform shuffle.
batch(batch_size, drop_remainder=False) Combines consecutive elements into batches of batch_size elements. The last batch may be smaller unless drop_remainder=True.

An example that shows looping over epochs. When running this script, notice the difference between:
dataset_gen1 - the shuffle operation produces more random outputs (this may be more useful when running machine learning experiments)
dataset_gen2 - without the shuffle operation, elements are produced in sequence
Other additions in this script:
tf.data.experimental.sample_from_datasets - used to combine two datasets. Note that in this case the shuffle buffer is filled with elements sampled with equal probability from both datasets.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # to avoid all those prints
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"  # to avoid large "Kernel Launch Time"
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)

class Augmentations:
    def __init__(self):
        pass

    def filter_even(self, x):
        # Drop even elements, keep odd ones
        return tf.math.not_equal(x % 2, 0)

class Dataset:
    def __init__(self, aug, range_min=0, range_max=100):
        self.range_min = range_min
        self.range_max = range_max
        self.aug = aug

    def generator(self):
        dataset = tf.data.Dataset.from_generator(self._generator,
                                                 output_types=(tf.float32), args=())
        dataset = dataset.filter(self.aug.filter_even)
        return dataset

    def _generator(self):
        for item in range(self.range_min, self.range_max):
            yield item

# Can be used when you have multiple datasets that you wish to combine
class ZipDataset:
    def __init__(self, datasets):
        self.datasets = datasets
        self.datasets_generators = []

    def generator(self):
        # Build one tf.data.Dataset per source, then sample from them at random
        self.datasets_generators = [dataset.generator() for dataset in self.datasets]
        return tf.data.experimental.sample_from_datasets(self.datasets_generators)

if __name__ == "__main__":
    aug = Augmentations()
    dataset1 = Dataset(aug, 0, 100)
    dataset2 = Dataset(aug, 100, 200)
    dataset = ZipDataset([dataset1, dataset2])
    epochs = 2
    shuffle_buffer = 10
    batch_size = 4
    prefetch_buffer = 5
    dataset_gen1 = dataset.generator().shuffle(shuffle_buffer).batch(batch_size).prefetch(prefetch_buffer)
    # dataset_gen2 = dataset.generator().batch(batch_size).prefetch(prefetch_buffer) # this will output odd elements in sequence
    for epoch in range(epochs):
        print('\n ------------------ Epoch: {} ------------------'.format(epoch))
        for X in dataset_gen1.repeat(1):  # adding .repeat() in the loop allows you to easily control the end of the loop
            print(X)
        # Do some stuff at end of loop


Keras data generator predict same number of values

I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there's an issue with prediction. If, for example, I want to predict values from a test dataset of 50 samples, I use model.predict with a batch size of 5. The problem is that model.predict returns 5 values repeated 10 times instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value 50 times.
To solve this issue, I used a full batch size (50 in my example), and it worked. But I can't use this method on my whole test data because it's too large.
Do you have any other solution, or what is the problem in my approach?
My data generator code:
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32, dim1=(120,120),
                 dim2=80, n_channels=1, shuffle=True):
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)
        return ([X, Z], y)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size))
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')
            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')
        return ([X, Z], y)
How I call model.predict()
predict_params = {'list_IDs': 'indexes',
                  'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  }
# Prediction generator
prediction_generator = DataGenerator(test_index, **predict_params)
predition_results = model.predict(prediction_generator, steps = 1, verbose=1)
If we look at your __getitem__ function, we can see this code:
list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
This code will always return the same IDs, because len(indexes) is the same for every batch (at least as long as all batches contain the same number of samples), so range(len(indexes)) just loops over the first batch_size positions every time.
You are already extracting the indexes of the current batch beforehand, so the faulty line is not needed at all. The following code should work (if list_IDs holds identifiers rather than positions, look them up with [self.list_IDs[k] for k in indexes] instead):
def __getitem__(self, index):
    'Generate one batch of data'
    # Generate indexes of the batch
    list_IDs_temp = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
    # Generate data
    ([X, Z], y) = self.__data_generation(list_IDs_temp)
    return ([X, Z], y)
See if this code works and gives you different results. Expect poor predictions at first, though: because of the same bug, your model has so far only been trained on the same few data points.
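The effect of the faulty line can be reproduced without Keras. The sketch below uses made-up sample IDs and two hypothetical helper functions, buggy_batch and fixed_batch, to compare indexing with range(len(indexes)) against indexing with the batch's own indexes.

```python
batch_size = 5
list_IDs = [f"sample_{i:02d}" for i in range(50)]  # hypothetical IDs
indexes = list(range(len(list_IDs)))               # unshuffled, as for prediction

def buggy_batch(index):
    idx = indexes[index * batch_size:(index + 1) * batch_size]
    # range(len(idx)) is always 0..batch_size-1, whatever `index` is
    return [list_IDs[k] for k in range(len(idx))]

def fixed_batch(index):
    idx = indexes[index * batch_size:(index + 1) * batch_size]
    # use the positions that belong to this batch
    return [list_IDs[k] for k in idx]

print(buggy_batch(0) == buggy_batch(9))  # True: every batch is the first five samples
print(fixed_batch(9))                    # the last five samples
```

With the buggy version every batch contains the same five samples, which matches the symptom of 5 predictions repeated 10 times.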
When you use a generator you specify a batch size, and model.predict produces batch_size predictions per step. If you set steps=1, that is all the predictions you will get. To set steps, divide the number of samples you have by the batch size. For example, if you have 50 images with a batch size of 5, set steps to 10.
Ideally you want to go through your test set exactly once. The code below determines a batch size and number of steps that do that. b_max is a value you select that limits the maximum batch size; set it based on your memory size to avoid an OOM (out of memory) error. length is the number of test samples you have.
batch_size=sorted([int(length/n) for n in range(1,length+1) if length % n ==0 and length/n<=b_max],reverse=True)[0]
steps=int(length/batch_size)
For example, with length=500 and b_max=80 the result will be batch_size=50 and steps=10. Note that if length is a prime number larger than b_max, the result will be batch_size=1 and steps=length.
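The same selection can be written as a small function, which makes the rule explicit: pick the largest divisor of the sample count that does not exceed the memory limit. This is a sketch; choose_batch_size is a hypothetical name, and it assumes b_max is at least 1.

```python
def choose_batch_size(length, b_max):
    """Return (batch_size, steps) so that batch_size * steps == length
    and batch_size is the largest divisor of length not exceeding b_max."""
    batch_size = max(d for d in range(1, min(length, b_max) + 1) if length % d == 0)
    steps = length // batch_size
    return batch_size, steps

print(choose_batch_size(500, 80))  # (50, 10)
print(choose_batch_size(50, 5))    # (5, 10)
print(choose_batch_size(13, 5))    # (1, 13): 13 is prime and larger than b_max
```

Because batch_size divides length exactly, the test set is covered exactly once with no partial final batch.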
According to this solution, you need to change steps so that the generator covers every sample you want to test on. Note that steps counts batches, not samples. Try:
# One step per batch; len() of a Sequence is its number of batches
predition_results = model.predict(prediction_generator, steps = len(prediction_generator), verbose=1)

How to update data within a tensor flow dataset tf.data.Dataset?

I have some data (x,y,m) where x and y are tensors with respective dimensions (n x m) and (n x 1) (x is the data and y is the label). The data x comes in two types and the binary tensor m (n x 1) specifies which type each data point is from.
When training my model, I wish to randomly alternate between batches of type 0 or type 1 data. To do this I split the dataset into two:
#create initial dataset
init_data = tf.data.Dataset.from_tensor_slices((x,y,m))
#split dataset into two (based on m)
m0 = init_data.filter(lambda x,y,m : tf.math.equal(m,0) )
m1 = init_data.filter(lambda x,y,m : tf.math.equal(m,1) )
Sometimes there are only a few data points of type 0 or type 1. For my training it does not matter if a batch contains the same datapoint multiple times (only that the batches are chosen randomly at each training epoch). To address this I run:
#batch size for training epochs
batch_size = 100
#large buffer size for shuffling
n = 1000
#shuffle and batch the dataset (allow repeats of datapoints)
m0 = m0.repeat().shuffle(n).batch(batch_size)
m1 = m1.repeat().shuffle(n).batch(batch_size)
I can now randomly choose between which of the two datasets I get my batch from at each training epoch using the following:
#dataset to sample from at each training iteration
traindat = tf.data.experimental.sample_from_datasets([m0,m1], [0.5, 0.5])
In pseudo-code my training loop looks as follows:
#create an iterator over the dataset
it = iter(traindat)
#train for T iterations
for t in range(T):
    #sample batch
    my_sample, labels, _ = next(it)
    #forward pass
    pred = my_NN(my_sample)
    loss = my_loss(pred, labels)
    #gradient step
My problem is as follows:
Suppose that each time I complete 10% of my training loop I want to replace my data (x,y,m) with the neural network's prediction at that iteration, i.e. (pred,y,m).
To do this I currently just overwrite the init_data dataset by adding the lines:
if t % (T // 10) == 0:
    init_data = tf.data.Dataset.from_tensor_slices((pred,y,m))
and then run all the previous lines of code again to regenerate traindat and it. However, this seems grossly inefficient. Is there a better way to do this? Many thanks for any help.

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Use the same column order as TEST (set iteration order is arbitrary)
TRAIN = TRAIN[TEST.columns]
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST ['X']
df_Tr_Y = TRAIN .drop('X', axis=1)
df_Te_Y = TEST .drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R .join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S .T, df_S .X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
OK, there are a couple of issues here. I have a full working example below, but first the issues, mainly the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue which is the line df_Te_Y = TEST .iloc[ : , 1 : ], that line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
'X1': [1,1,0,1,0,1,0,0,0],
'X2': [1,0,1,0,1,0,1,0,1],
'X3': [1,1,0,1,1,0,0,0,0],
'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.

How do I split Tensorflow datasets?

I have a tensorflow dataset based on one .tfrecord file. How do I split the dataset into test and train datasets? E.g. 70% Train and 30% test?
My Tensorflow Version: 1.8
I've checked, there is no "split_v" function as mentioned in the possible duplicate. Also I am working with a tfrecord file.
You may use Dataset.take() and Dataset.skip():
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
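# shuffle() requires a buffer_size; a buffer as large as the dataset gives a full shuffle.
# reshuffle_each_iteration=False keeps the order fixed so the splits do not overlap.
full_dataset = full_dataset.shuffle(buffer_size=DATASET_SIZE, reshuffle_each_iteration=False)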
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size)
For more generality, I gave an example using a 70/15/15 train/val/test split but if you don't need a test or a val set, just ignore the last 2 lines.
Dataset.take(count): Creates a Dataset with at most count elements from this dataset.
Dataset.skip(count): Creates a Dataset that skips count elements from this dataset.
You may also want to look into Dataset.shard():
Creates a Dataset that includes only 1/num_shards of this dataset.
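To make the semantics of the three methods concrete, here is a plain-Python model built on itertools.islice. The helper functions take, skip, and shard are illustrative stand-ins that mimic what the Dataset methods return for a list of 100 elements; they are not TensorFlow code, and the split sizes use integer arithmetic to avoid float rounding.

```python
from itertools import islice

def take(items, count):
    """At most the first `count` elements."""
    return list(islice(items, count))

def skip(items, count):
    """Everything after the first `count` elements."""
    return list(islice(items, count, None))

def shard(items, num_shards, index):
    """Every num_shards-th element, starting at position `index`."""
    return list(islice(items, index, None, num_shards))

data = list(range(100))
train_size = len(data) * 70 // 100
val_size = len(data) * 15 // 100

train = take(data, train_size)
rest = skip(data, train_size)
val = take(rest, val_size)
test = skip(rest, val_size)
print(len(train), len(val), len(test))  # 70 15 15

print(shard(data, num_shards=4, index=1)[:5])  # [1, 5, 9, 13, 17]
```

The take/skip split only yields disjoint parts if the underlying order is the same on every pass, which is why the shuffle order must be fixed before splitting.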
This question is similar to this one and this one, and I am afraid we have not had a satisfactory answer yet.
Using take() and skip() requires knowing the dataset size. What if I don't know that, or don't want to find out?
Using shard() only gives 1 / num_shards of dataset. What if I want the rest?
I try to present a better solution below, tested on TensorFlow 2 only. Assuming you already have a shuffled dataset, you can then use filter() to split it into two:
import tensorflow as tf
all = tf.data.Dataset.from_tensor_slices(list(range(1, 21))) \
        .shuffle(10, reshuffle_each_iteration=False)
test_dataset = all.enumerate() \
                  .filter(lambda x,y: x % 4 == 0) \
                  .map(lambda x,y: y)
train_dataset = all.enumerate() \
                   .filter(lambda x,y: x % 4 != 0) \
                   .map(lambda x,y: y)
for i in test_dataset:
    print(i)
for i in train_dataset:
    print(i)
The parameter reshuffle_each_iteration=False is important. It makes sure the original dataset is shuffled once and no more. Otherwise, the two resulting sets may have some overlaps.
Use enumerate() to add an index.
Use filter(lambda x,y: x % 4 == 0) to take 1 sample out of 4. Likewise, x % 4 != 0 takes 3 out of 4.
Use map(lambda x,y: y) to strip the index and recover the original sample.
This example achieves a 75/25 split.
x % 5 == 0 and x % 5 != 0 gives a 80/20 split.
If you really want a 70/30 split, x % 10 < 3 and x % 10 >= 3 should do.
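The index arithmetic behind these splits can be checked in plain Python before wiring it into tf.data. The sketch below uses a hypothetical helper, split_by_index, that mirrors the enumerate/filter/map pattern on an ordinary list.

```python
def split_by_index(samples, is_test):
    """Send each sample to test or train depending on its position."""
    train, test = [], []
    for index, sample in enumerate(samples):
        (test if is_test(index) else train).append(sample)
    return train, test

samples = list(range(1, 21))

# 70/30 split: positions 0, 1, 2 out of every 10 go to the test set
train, test = split_by_index(samples, lambda i: i % 10 < 3)
print(test)                    # [1, 2, 3, 11, 12, 13]
print(len(train), len(test))   # 14 6

# 75/25 split: every fourth position goes to the test set
train4, test4 = split_by_index(samples, lambda i: i % 4 == 0)
print(len(train4), len(test4)) # 15 5
```

Because each position satisfies exactly one of the two predicates, the two parts are disjoint and together cover the whole dataset.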
As of TensorFlow 2.0.0, above code may result in some warnings due to AutoGraph's limitations. To eliminate those warnings, declare all lambda functions separately:
def is_test(x, y):
    return x % 4 == 0
def is_train(x, y):
    return not is_test(x, y)
recover = lambda x,y: y
test_dataset = all.enumerate() \
                  .filter(is_test) \
                  .map(recover)
train_dataset = all.enumerate() \
                   .filter(is_train) \
                   .map(recover)
This gives no warning on my machine. And defining is_train() as not is_test() is definitely good practice.

how to get shuffled batch from tfrecords with limited memory but large data set?

Using the TensorFlow function tf.train.shuffle_batch, we get a shuffled batch by reading the TFRecord file into memory as a queue and shuffling within the queue (if I understand it correctly). Now I have a highly ordered TFRecord file (pictures with the same label are written together) and a really large dataset (around 2,550,000 pictures). I want to feed my VGG net batches with random labels, but it is impossible and ugly to read all pictures into memory just to shuffle them. Is there any solution to this?
I thought about shuffling first and then writing the result into a TFRecord file, but I can't figure out an efficient way of doing this...
My data are saved in this way:
Here is my code getting TFRecords:
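import os
import tensorflow as tf
from PIL import Image

dst = "/Users/cory/Desktop/3_key_frame"
classes = []
for myclass in os.listdir(dst):
    if myclass.find('.DS_Store')==-1:
        classes.append(myclass)
writer = tf.python_io.TFRecordWriter("train.tfrecords")
for index, name in enumerate(classes):
    class_path = dst +'/' + name
    for img_seq in os.listdir(class_path):
        if img_seq.find('DS_Store')==-1:
            seq_pos = class_path +'/' + img_seq
            if os.path.isdir(seq_pos):
                for img_name in os.listdir(seq_pos):
                    img_path = seq_pos +'/' + img_name
                    img = Image.open(img_path)
                    img = img.resize((64,64))
                    img_raw = img.tobytes()
                    #print (img,index)
                    example = tf.train.Example(features=tf.train.Features(feature={
                        # Assumed feature layout: integer class label plus raw image bytes
                        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[index])),
                        "img_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
                    }))
                    writer.write(example.SerializeToString())
writer.close()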
I am presuming you know the list of filenames and/or the structure of your labelled dataset.
It may be worthwhile to iterate through them class by class, taking N items each time. In essence, this interleaves the classes so that you don't end up with long runs of a single label.
If I am understanding this correctly, your primary concern is when sampling your dataset from the TFRecord that a sub-set of your data may contain entirely 1 class, rather than a good representation?
If you structure it as:
0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2 ... etc
this may make the shuffle_batch more likely to create a nicer sample for training.
This is the solution I am following, as there appears to be no additional params for shuffling where you can specify to keep a uniform distribution of class labels amongst the set.
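The interleaving described in this answer could be sketched as follows. The function interleave_by_class and the file names are made up for illustration; the idea is simply to take n items from each class in turn until every class is exhausted, and to write the TFRecord in that order.

```python
from itertools import islice

def interleave_by_class(items_by_class, n):
    """Round-robin over the classes, taking up to `n` items from each per round."""
    iterators = [iter(items) for items in items_by_class.values()]
    ordered = []
    while iterators:
        remaining = []
        for it in iterators:
            chunk = list(islice(it, n))
            if chunk:
                ordered.extend(chunk)
                remaining.append(it)  # this class may still have items left
        iterators = remaining
    return ordered

# Hypothetical file lists: 8 images for each of 3 classes
items_by_class = {
    label: [f"class{label}_img{i}.jpg" for i in range(8)] for label in (0, 1, 2)
}
ordered = interleave_by_class(items_by_class, n=4)
labels = [int(name[5]) for name in ordered]
print(labels)  # 0 0 0 0 1 1 1 1 2 2 2 2, then the same pattern again
```

A smaller n mixes the classes more finely, so a modest shuffle buffer at read time already sees several classes at once.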
Supposing that your data is stored like this:
Get all the filenames in a flat list and shuffle them:
import glob
import random
filenames = glob.glob('/path/to/images/**/*.jpg')
random.shuffle(filenames)
Create a dictionary to go from label name to numerical label:
class_to_index = {'LABEL_1':0, 'LABEL_2': 1} # more classes I assume...
Now you can loop over all images and retrieve the label:
writer = tf.python_io.TFRecordWriter("train.tfrecords")
for f in filenames:
    img = Image.open(f)
    img = img.resize((64,64))
    img_raw = img.tobytes()
    label = f.split('/')[-2]
    example = tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_to_index[label]])),
        # Assumed second feature: the raw image bytes
        "img_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
    }))
    writer.write(example.SerializeToString())
writer.close()
Hope this helps :)