How to use dataset.shard in tensorflow?


I have recently been looking into the Dataset API in TensorFlow, and there is a method, dataset.shard(), which is meant for distributed computation.
This is what's stated in Tensorflow's documentation:
Creates a Dataset that includes only 1/num_shards of this dataset.
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
This method is said to return a portion of the original dataset. If I have two workers, am I supposed to do:
d_0 = d.shard(FLAGS.num_workers, worker_0)
d_1 = d.shard(FLAGS.num_workers, worker_1)
......
iterator_0 = d_0.make_initializable_iterator()
iterator_1 = d_1.make_initializable_iterator()
for worker_id in workers:
    with tf.device(worker_id):
        if worker_id == 0:
            data = iterator_0.get_next()
        else:
            data = iterator_1.get_next()
......
Because the documentation did not specify how to make subsequent calls, I am a bit confused here.
Thanks!

You should take a look at the tutorial on Distributed TensorFlow first to better understand how it works.
You have multiple workers that each run the same code, with one small difference: each worker has a different FLAGS.worker_index.
When you use tf.data.Dataset.shard, you supply this worker index and the data is split evenly between the workers: each worker receives every num_workers-th element, starting at its own index.
Here is an example with 3 workers.
dataset = tf.data.Dataset.range(6)
dataset = dataset.shard(FLAGS.num_workers, FLAGS.worker_index)
iterator = dataset.make_one_shot_iterator()
res = iterator.get_next()
# Suppose you have 3 workers in total
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(res))
We will have the output:
0, 3 on worker 0
1, 4 on worker 1
2, 5 on worker 2
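The split above is consistent with shard keeping every num_shards-th element, starting at the worker's index. A minimal plain-Python model of that selection rule follows; the `shard` helper is hypothetical and only mimics the behaviour, it is not TensorFlow code.

```python
from itertools import islice

def shard(elements, num_shards, index):
    """Keep the elements whose position p satisfies p % num_shards == index."""
    return list(islice(elements, index, None, num_shards))

num_workers = 3  # assumed, as in the example above
for worker_index in range(num_workers):
    # Each worker runs the same code with its own worker_index.
    print(worker_index, shard(range(6), num_workers, worker_index))
```

Each worker only ever calls shard once, with its own index; there is no need to build one sharded dataset per worker inside a single program.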

Related

predicting using pre-trained model becomes slower and slower

I'm using a very naive way to make predictions with a pre-trained model in Keras, but it becomes much slower as the loop goes on. Does anyone know why? I'm very new to TensorFlow.
count = 0
first = True
for nm in image_names:
    img = image.load_img(TEST_PATH + nm, target_size=(299, 299))
    img = image.img_to_array(img)
    image_batch = np.expand_dims(img, axis=0)
    processed_image = inception_v3.preprocess_input(image_batch.copy())
    prob = inception_model.predict(processed_image)
    df1 = pd.DataFrame({'photo_id': [nm]})
    df2 = pd.DataFrame(prob, columns=['feat' + str(j + 1) for j in range(prob.shape[1])])
    df = pd.concat([df1, df2], axis=1)
    header = first
    mode = 'w' if first else 'a'
    df.to_csv(outfile, index=False, header=header, mode=mode)
    first = False
    count += 1
    if count % 100 == 0:
        print('%d processed' % count)

I doubt that TensorFlow itself is slowing down. However, there is another Stack Overflow question showing that to_csv slows down when appending:
Performance: Python pandas DataFrame.to_csv append becomes gradually slower
If the images come batched you may also benefit from making larger batches rather than predicting one image at a time.
You can also explore tf.data for better data pipelining.
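As a rough sketch of both suggestions, the loop can be restructured so that the model is called once per batch and the output file is opened only once. Everything named here (`chunks`, `write_predictions`, `fake_predict_batch`) is a hypothetical stand-in; in the question's code the predict function would wrap image loading, preprocessing and `inception_model.predict` on a stacked batch.

```python
import csv
import io

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def write_predictions(names, predict_batch, out, batch_size=32):
    """Predict in batches and stream rows to one already-open file object."""
    writer = csv.writer(out)
    header_written = False
    for batch in chunks(names, batch_size):
        probs = predict_batch(batch)  # one model call per batch
        for name, row in zip(batch, probs):
            if not header_written:
                writer.writerow(["photo_id"] + ["feat%d" % (j + 1) for j in range(len(row))])
                header_written = True
            writer.writerow([name] + list(row))

# Stand-in for the real model: returns two fake "probabilities" per name.
def fake_predict_batch(batch):
    return [[0.25, 0.75] for _ in batch]

buffer = io.StringIO()  # with a real file: open(outfile, "w", newline="")
write_predictions(["a.jpg", "b.jpg", "c.jpg"], fake_predict_batch, buffer, batch_size=2)
print(buffer.getvalue())
```

Writing through a single open handle avoids reopening the file for every image, and batching reduces the number of predict calls.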

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some binary feature columns (Y). BernoulliNB predicts the probability of X given Y in the test data, based on the training data. I am trying to look up the predicted probability of the observed class for each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Keep the feature columns in the same order in TRAIN and TEST
TRAIN = TRAIN[TEST.columns]
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.

OK, there are a couple of issues here. I have a full working example below, but first the issues, mainly the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue, which is the line df_Te_Y = TEST .iloc[ : , 1 : ]. That line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data is only used for training, and the predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.
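To make the first point concrete, here is a small sketch of the look-up step in plain Python. The probability values are invented for illustration and are not the output of the model above; the point is only that a missing key, not the classifier, produces the NaN.

```python
import math

# Hypothetical predict_proba output: one dict per test row, keyed by the
# class labels seen in training (1, 2, 3, 9). The values are made up.
predicted = [
    {1: 0.70, 2: 0.10, 3: 0.10, 9: 0.10},
    {1: 0.25, 2: 0.25, 3: 0.40, 9: 0.10},
]
observed = [5, 3]  # label 5 never appeared in the training data

def lookup(row, label):
    """Probability assigned to the observed label, NaN if the label is unknown."""
    return row.get(label, math.nan)

pr = [lookup(row, label) for row, label in zip(predicted, observed)]
print(pr)  # [nan, 0.4]
```

The classifier still assigned a full probability distribution to the first row; the NaN only says that the observed label is not among the columns.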

What does batch, repeat, and shuffle do with TensorFlow Dataset?

I'm currently learning TensorFlow, but I am confused by the code snippet below:
dataset = dataset.shuffle(buffer_size = 10 * batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
I know that the dataset first holds all the data, but what do shuffle(), repeat(), and batch() do to the dataset?
Please help me with an example and explanation.

Update: Here is a small collaboration notebook for demonstration of this answer.
Imagine, you have a dataset: [1, 2, 3, 4, 5, 6], then:
How ds.shuffle() works
dataset.shuffle(buffer_size=3) will allocate a buffer of size 3 for picking random entries. This buffer will be connected to the source dataset.
We could imagine it like this:
Random buffer
|
|          Source dataset where all other elements live
|          |
↓          ↓
[1,2,3] <= [4,5,6]
Let's assume that entry 2 was taken from the random buffer. The free slot is filled by the next element from the source dataset, which is 4:
2 <= [1,3,4] <= [5,6]
We continue reading till nothing is left:
1 <= [3,4,5] <= [6]
5 <= [3,4,6] <= []
3 <= [4,6] <= []
6 <= [4] <= []
4 <= [] <= []
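The walk-through above can be modelled in a few lines of plain Python. This is only a sketch of the buffer mechanism as described here, not TensorFlow's implementation; `buffered_shuffle` is a hypothetical helper.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the elements of `source` in buffer-shuffled order."""
    rng = random.Random(seed)
    source = iter(source)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    # Emit a random buffer entry and refill its slot from the source.
    for item in source:
        pos = rng.randrange(len(buffer))
        yield buffer[pos]
        buffer[pos] = item
    # Source exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

print(list(buffered_shuffle([1, 2, 3, 4, 5, 6], buffer_size=3, seed=0)))
```

An element can only be emitted once it has entered the buffer, which is why a small buffer gives only a local shuffle; a buffer at least as large as the dataset gives a full one.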
How ds.repeat() works
As soon as all the entries are read from the dataset and you try to read the next element, the dataset will throw an error.
That's where ds.repeat() comes into play. It will re-initialize the dataset, making it again like this:
[1,2,3] <= [4,5,6]
What will ds.batch() produce
The ds.batch() will take the first batch_size entries and make a batch out of them. So, a batch size of 3 for our example dataset will produce two batch records:
[2,1,5]
[3,6,4]
As we have a ds.repeat() before the batch, the generation of the data will continue. But the order of the elements will be different, due to the ds.shuffle(). What should be taken into account is that 6 will never be present in the first batch, due to the size of the random buffer: the buffer holds only 3 elements, and 6 enters it only after three elements have already been emitted.
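A plain-Python sketch of the shuffle, repeat, batch chain may help here. It models the behaviour described above rather than calling TensorFlow, and the helpers `shuffled`, `repeat` and `batch` are hypothetical names. It can also be used to check the last claim: with a buffer of 3 and a batch size of 3, element 6 cannot land in the first batch.

```python
import random
from itertools import islice

def shuffled(source, buffer_size, rng):
    """Buffer-based shuffle: emit a random buffer entry, refill from source."""
    it = iter(source)
    buf = list(islice(it, buffer_size))
    for item in it:
        pos = rng.randrange(len(buf))
        yield buf[pos]
        buf[pos] = item
    rng.shuffle(buf)
    yield from buf

def repeat(make_pass, count):
    """Restart the pipeline `count` times, one pass per epoch."""
    for _ in range(count):
        yield from make_pass()

def batch(elements, batch_size):
    """Group consecutive elements into lists of up to batch_size items."""
    it = iter(elements)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

rng = random.Random(0)
data = [1, 2, 3, 4, 5, 6]
batches = list(batch(repeat(lambda: shuffled(data, 3, rng), count=2), 3))
print(batches)  # four batches of three: two epochs, each a fresh shuffle
```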

The following methods are available on tf.data.Dataset:
repeat(count=None) Repeats the dataset count times. With the default count=None (or -1) the dataset is repeated indefinitely.
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None) Shuffles the elements of the dataset. The buffer_size is the number of elements held in the buffer from which the next element is drawn at random; a buffer at least as large as the dataset gives a full shuffle.
batch(batch_size, drop_remainder=False) Combines consecutive elements into batches of batch_size elements. Unless drop_remainder=True, the last batch may be smaller.

Here is an example that shows looping over epochs. When you run this script, notice the difference between:
dataset_gen1 - the shuffle operation produces more random outputs (this may be more useful when running machine learning experiments)
dataset_gen2 - without the shuffle operation, elements are produced in sequence
Other additions in this script:
tf.data.experimental.sample_from_datasets - used to combine two datasets. With no weights given, it samples from both datasets with equal probability, so the shuffle buffer is filled from both.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" # to avoid all those prints
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private" # to avoid large "Kernel Launch Time"
import tensorflow as tf
if len(tf.config.list_physical_devices('GPU')):
    tf.config.experimental.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)
class Augmentations:
    def __init__(self):
        pass
    @tf.function
    def filter_even(self, x):
        # Filters out even elements, i.e. only odd elements are kept
        if x % 2 == 0:
            return False
        else:
            return True
class Dataset:
    def __init__(self, aug, range_min=0, range_max=100):
        self.range_min = range_min
        self.range_max = range_max
        self.aug = aug
    def generator(self):
        dataset = tf.data.Dataset.from_generator(self._generator
                        , output_types=(tf.float32), args=())
        dataset = dataset.filter(self.aug.filter_even)
        return dataset
    def _generator(self):
        for item in range(self.range_min, self.range_max):
            yield(item)
# Can be used when you have multiple datasets that you wish to combine
class ZipDataset:
    def __init__(self, datasets):
        self.datasets = datasets
        self.datasets_generators = []
    def generator(self):
        for dataset in self.datasets:
            self.datasets_generators.append(dataset.generator())
        return tf.data.experimental.sample_from_datasets(self.datasets_generators)
if __name__ == "__main__":
    aug = Augmentations()
    dataset1 = Dataset(aug, 0, 100)
    dataset2 = Dataset(aug, 100, 200)
    dataset = ZipDataset([dataset1, dataset2])
    epochs = 2
    shuffle_buffer = 10
    batch_size = 4
    prefetch_buffer = 5
    dataset_gen1 = dataset.generator().shuffle(shuffle_buffer).batch(batch_size).prefetch(prefetch_buffer)
    # dataset_gen2 = dataset.generator().batch(batch_size).prefetch(prefetch_buffer) # this will output odd elements in sequence
    for epoch in range(epochs):
        print ('\n ------------------ Epoch: {} ------------------'.format(epoch))
        for X in dataset_gen1.repeat(1): # adding .repeat() in the loop allows you to easily control the end of the loop
            print (X)
        # Do some stuff at end of loop

Tensorflow - shuffling at "batch-level" instead of"example-level"

I have a problem that I will try to explain with an example for easier understanding.
I want to classify oranges (O) and apples (A). For technical/legacy reasons (a component in the network), each batch should have either only O or only A examples. So traditional shuffling at example-level is not possible/adequate, since I cannot afford to have a batch that includes a mixture of O and A examples. However, some kind of shuffling is desirable, as it is common practice when training deep networks.
These are the steps that I take:
I first need to convert raw data/examples into TFRecords.
I shuffle the order of the raw examples, and then I create separate TFRecords that contain either only the shuffled O examples or only the shuffled A examples. Let's call this "example-level" shuffling. This is something that takes place offline and only once.
At this point I have "clean batches": O-batches that contain only O examples, and A-batches that contain only A examples.
I do not want to first feed the network with all the O-batches and then with all the A-batches sequentially. This would probably not help much in convergence.
Can I shuffle these batches on the "batch-level", i.e. without affecting their interior?

If you use the Dataset API, it's fairly straightforward. Just zip the O and A batches, then apply a random selection function with Dataset.map():
ds0 = tf.data.Dataset.from_tensor_slices([0])
ds0 = ds0.repeat()
ds0 = ds0.batch(5)
ds1 = tf.data.Dataset.from_tensor_slices([1])
ds1 = ds1.repeat()
ds1 = ds1.batch(5)
def rand_select(ds0, ds1):
    rval = tf.random_uniform([])
    return tf.cond(rval<0.5, lambda: ds0, lambda: ds1)
# zip is a static method, so no Dataset instance is needed
dataset = tf.data.Dataset.zip((ds0, ds1)).map(lambda ds0, ds1: rand_select(ds0, ds1))
iterator = dataset.make_one_shot_iterator()
ds = iterator.get_next()
with tf.Session() as sess:
    for _ in range(5):
        print(sess.run(ds))
> [0 0 0 0 0]
[1 1 1 1 1]
[1 1 1 1 1]
[0 0 0 0 0]
[0 0 0 0 0]
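For comparison, the underlying idea (shuffle whole batches, never their contents) can be sketched without TensorFlow. This is a different and simpler technique than the zip-and-select pipeline above: it assumes all batches fit in a Python list, and `make_batches` and the toy data are hypothetical.

```python
import random

def make_batches(examples, batch_size):
    """Split one class's examples into consecutive, single-class batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

oranges = ["O"] * 10  # toy stand-ins for the O examples
apples = ["A"] * 10   # toy stand-ins for the A examples

batches = make_batches(oranges, 5) + make_batches(apples, 5)
random.Random(0).shuffle(batches)  # reorder batches; interiors are untouched

for b in batches:
    print(b)
```

Unlike the zip-based version, nothing is discarded here: every batch is used exactly once per pass.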

Interleaving multiple TensorFlow datasets together

The current TensorFlow dataset interleave functionality is basically an interleaved flat-map that takes a single dataset as input. Given the current API, what's the best way to interleave multiple datasets together? Say they have already been constructed and I have a list of them. I want to produce elements from them alternately, and I want to support lists with more than 2 datasets (i.e., stacked zips and interleaves would be pretty ugly).
Thanks! :)
@mrry might be able to help.

EDIT 2: See tf.contrib.data.choose_from_datasets. It performs deterministic dataset interleaving.
EDIT: See tf.contrib.data.sample_from_datasets. Even though it performs random sampling I guess it can be useful.
Even though this is not "clean", it is the only workaround I came up with.
datasets = [tf.data.Dataset...]
def concat_datasets(datasets):
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0
ds = tf.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *args: concat_datasets(args)
)

Expanding on user2781994's answer (with edits), here is how I implemented it:
import tensorflow as tf
ds11 = tf.data.Dataset.from_tensor_slices([1,2,3])
ds12 = tf.data.Dataset.from_tensor_slices([4,5,6])
ds13 = tf.data.Dataset.from_tensor_slices([7,8,9])
all_choices_ds = [ds11, ds12, ds13]
choice_dataset = tf.data.Dataset.range(len(all_choices_ds)).repeat()
ds14 = tf.contrib.data.choose_from_datasets(all_choices_ds, choice_dataset)
# alternatively:
# ds14 = tf.contrib.data.sample_from_datasets(all_choices_ds)
iterator = ds14.make_initializable_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            value=sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break
        print(value)
The output is:
1
4
7
2
5
8
3
6
9
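The order of that output is a plain round-robin over the three datasets, which is what the choice sequence 0, 1, 2, 0, 1, 2, ... selects. A small Python model of the same selection logic follows; `choose_from` is a hypothetical helper that mimics the behaviour, and stopping as soon as the chosen input is exhausted is an assumption of this sketch.

```python
from itertools import cycle

def choose_from(datasets, choices):
    """Yield the next element of datasets[c] for each index c in choices."""
    iterators = [iter(d) for d in datasets]
    for c in choices:
        try:
            yield next(iterators[c])
        except StopIteration:
            return  # assumed: stop once the chosen input runs out

all_choices = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = list(choose_from(all_choices, cycle(range(len(all_choices)))))
print(result)  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```

Any other deterministic pattern can be produced by changing the choice sequence, for example two elements from the first input for every one from the second.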

In Tensorflow 2.0
tot_imm_dataset1 = 105
tot_imm_dataset2 = 55
e = tf.data.Dataset.from_tensor_slices(tf.cast([1,0,1],tf.int64)).repeat(int(tot_imm_dataset1/2))
f=tf.data.Dataset.range(1).repeat(int(tot_imm_dataset2-tot_imm_dataset1/2))
choice=e.concatenate(f)
datasets=[dataset2,dataset1]
dataset_rgb_compl__con_patch= tf.data.experimental.choose_from_datasets(datasets, choice)
That works for me
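To see what that choice sequence selects, it can be reproduced with plain Python lists and counted. This only mirrors the arithmetic of the snippet above (same sizes, same pattern) and does not touch TensorFlow.

```python
from collections import Counter

tot_imm_dataset1 = 105
tot_imm_dataset2 = 55

# e: the pattern [1, 0, 1] repeated int(105 / 2) = 52 times
e = [1, 0, 1] * int(tot_imm_dataset1 / 2)
# f: a single 0 repeated int(55 - 105 / 2) = 2 times
f = [0] * int(tot_imm_dataset2 - tot_imm_dataset1 / 2)
choice = e + f

counts = Counter(choice)
print(len(choice), counts[1], counts[0])  # 158 104 54
```

Because the list is [dataset2, dataset1], index 1 refers to dataset1 and index 0 to dataset2, so two elements are drawn from dataset1 for every one from dataset2 until the trailing zeros.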
