I read this:
More data and longer pre-training schedule benefit SSL in general.
So I wrote this:
steps = EPOCHS * (num_training_samples // BATCH_SIZE)
lr_decayed_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.00001, decay_steps=steps
)
But I would like to understand what "longer pre-training schedule" means.
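One possible reading, stated here as an assumption rather than something the quoted sentence spells out, is that a "longer schedule" means more total optimizer steps, so decay_steps grows with EPOCHS and the learning rate is stretched over a longer run. The sketch below re-implements non-cycling polynomial decay in plain Python to compare a short and a long schedule at the same step. The helper names and the sample counts are hypothetical. end_lr is passed explicitly because the Keras default end_learning_rate (0.0001) is larger than the initial rate of 0.00001 used above, which would make the rate rise instead of decay.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Learning rate at `step` for a non-cycling polynomial decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


def total_steps(epochs, num_training_samples, batch_size):
    """Total optimizer steps, computed the same way as `steps` in the question."""
    return epochs * (num_training_samples // batch_size)


# Hypothetical dataset size and batch size, for illustration only.
num_training_samples = 50_000
batch_size = 500

short_steps = total_steps(10, num_training_samples, batch_size)
long_steps = total_steps(100, num_training_samples, batch_size)

# Same step index, very different learning rates:
lr_short = polynomial_decay(1000, 1e-5, short_steps)  # short schedule is finished
lr_long = polynomial_decay(1000, 1e-5, long_steps)    # long schedule is 10% done
```

Under this reading, lengthening the schedule means raising EPOCHS (and therefore decay_steps), not changing the shape of the decay.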


What is the biggest bottleneck in the maskrcnn_benchmark repo?

I am working on a repo that makes use of the maskrcnn_benchmark repo. I have explored the benchmarking repo extensively to find the cause of its slower performance on a CPU relative to [link].
To benchmark the individual forward passes, I put a time counter around each part, which gives me the time required to compute each component. I have had a tough time pinpointing the slowest component of the entire architecture. I believe it to be the BottleneckWithFixedBatchNorm class in the maskrcnn_benchmark/modeling/backbone/ file.
I would really appreciate any help in locating the biggest bottleneck in this architecture.
I faced the same problem. The best approach is to go through the forward pass of each module in the main code and set up a timer that logs the time spent in each module's computations. We did this by adding a time logger to each class, so that every instance of the class logs its own execution time. After a thorough comparison, at least in our case, we found that the delay came from the depth of the ResNet module. Given the computational cost of ResNet, that is not surprising; the only remedies are more parallelization (for example, a bigger GPU) or a shallower ResNet.
Note that maskrcnn_benchmark has been deprecated and an updated version is available in the form of detectron2. Consider moving your code there for significant speed improvements.
BottleneckWithFixedBatchNorm is not the most expensive operation in the architecture and, despite its name, is certainly not the bottleneck. The class is not that computationally expensive and is computed in parallel even on a lower-end CPU machine (at least at the inference stage).
An example of how to better track the performance of each module, using code taken from the path maskrcnn_benchmark/modeling/backbone/:
import time  # needed for the timing added in forward()

class ResNet(nn.Module):
    def __init__(self, cfg):
        super(ResNet, self).__init__()
        # If we want to use the cfg in forward(), then we should make a copy
        # of it and store it for later use:
        # self.cfg = cfg.clone()
        # Translate string names to implementations
        stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
        stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
        transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]
        # Construct the stem module
        self.stem = stem_module(cfg)
        # Construct the specified ResNet stages
        num_groups = cfg.MODEL.RESNETS.NUM_GROUPS
        width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
        in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
        stage2_bottleneck_channels = num_groups * width_per_group
        stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
        self.stages = []
        self.return_features = {}
        for stage_spec in stage_specs:
            name = "layer" + str(stage_spec.index)
            stage2_relative_factor = 2 ** (stage_spec.index - 1)
            bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor
            out_channels = stage2_out_channels * stage2_relative_factor
            stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index - 1]
            module = _make_stage(
                transformation_module,
                in_channels,
                bottleneck_channels,
                out_channels,
                stage_spec.block_count,
                num_groups,
                cfg.MODEL.RESNETS.STRIDE_IN_1X1,
                first_stride=int(stage_spec.index > 1) + 1,
                dcn_config={
                    "stage_with_dcn": stage_with_dcn,
                    "with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
                    "deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
                },
            )
            in_channels = out_channels
            self.add_module(name, module)
            self.stages.append(name)
            self.return_features[name] = stage_spec.return_features
        # Optionally freeze (requires_grad=False) parts of the backbone
        self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)

    def _freeze_backbone(self, freeze_at):
        if freeze_at < 0:
            return
        for stage_index in range(freeze_at):
            if stage_index == 0:
                m = self.stem  # stage 0 is the stem
            else:
                m = getattr(self, "layer" + str(stage_index))
            for p in m.parameters():
                p.requires_grad = False

    def forward(self, x):
        start_timer = time.time()  # start timing this forward pass
        outputs = []
        x = self.stem(x)
        for stage_name in self.stages:
            x = getattr(self, stage_name)(x)
            if self.return_features[stage_name]:
                outputs.append(x)
        # Append the elapsed time to a log file instead of printing to stdout
        with open("timelogger.log", "a") as log_file:
            print("ResNet time :: ", time.time() - start_timer, file=log_file)
        return outputs
The only change needed is in the forward pass; every instance created from this class will then log its time (here the time is written to a file rather than to stdout).
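The per-class timer described above can also be attached without editing each forward() by hand. What follows is a minimal, framework-free sketch, not code from maskrcnn_benchmark: timed_forward, TIMINGS, SlowStage, FastStage and slowest_first are hypothetical names, and the two stage classes are stand-ins for real nn.Module subclasses.

```python
import time
from collections import defaultdict

# Hypothetical registry: accumulated seconds and call counts per class name.
TIMINGS = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def timed_forward(cls):
    """Class decorator that wraps cls.forward so every call is timed."""
    original_forward = cls.forward

    def forward(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original_forward(self, *args, **kwargs)
        finally:
            # Record the time even if forward() raises.
            entry = TIMINGS[type(self).__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start

    cls.forward = forward
    return cls


@timed_forward
class SlowStage:
    """Stand-in for an expensive backbone stage."""

    def forward(self, x):
        time.sleep(0.01)  # simulate heavy computation
        return x + 1


@timed_forward
class FastStage:
    """Stand-in for a cheap module."""

    def forward(self, x):
        return x * 2


def slowest_first():
    """Return (class name, stats) pairs sorted by total time, slowest first."""
    return sorted(TIMINGS.items(), key=lambda kv: kv[1]["seconds"], reverse=True)


x = 1
for _ in range(3):
    x = FastStage().forward(SlowStage().forward(x))

report = slowest_first()
```

One caveat with this design: when modules are nested, the time recorded for a parent includes the time of its children, so the totals should be compared between modules at the same level.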

Is it relevant to use both feature normalizer_fn and batch normalization?

Is it relevant to use both a feature normalizer_fn and batch normalization, as in the following?
feature_columns_complex_standardized = [
    tf.feature_column.numeric_column("my_feature", normalizer_fn=lambda x: (x - xMean) / xStd)
]
model1 = tf.estimator.DNNClassifier(feature_columns=feature_columns_complex_standardized,
optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1= 0.9,beta2=0.99, epsilon = 1e-08,use_locking=False),
You may have the two mixed up. Normalization is one of the methods used to bring the features of a dataset to the same scale, whereas batch normalization addresses internal covariate shift, where each hidden unit's input distribution changes every time the parameters of the previous layer are updated.
So you can use both at the same time.
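To make the distinction concrete, here is a small plain-Python sketch (not TensorFlow code). It assumes the input feature is standardized with fixed statistics computed once from the training set, while batch normalization recomputes statistics for every mini-batch of hidden activations. The function names and the numbers are hypothetical.

```python
import statistics


def standardize(values, mean, std):
    """Input-feature normalization with fixed, precomputed statistics."""
    return [(v - mean) / std for v in values]


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization of one unit over one mini-batch."""
    mu = sum(batch) / len(batch)
    var = statistics.pvariance(batch, mu)
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in batch]


# Hypothetical raw feature values from the training set.
raw_feature = [10.0, 20.0, 30.0, 40.0]
x_mean = sum(raw_feature) / len(raw_feature)
x_std = statistics.pstdev(raw_feature)
standardized = standardize(raw_feature, x_mean, x_std)

# Hypothetical hidden activations for one mini-batch; their distribution
# shifts during training, so the statistics are recomputed per batch.
hidden_batch = [0.5, 2.5, 4.5]
normalized_batch = batch_norm(hidden_batch)
```

The first function acts on the inputs once, with constants; the second acts inside the network at every training step, which is why the two do not conflict.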

Tensorflow Data API - prefetch

I am trying to use the new features of TF, namely the Data API, and I am not sure how prefetch works. In the code below
def dataset_input_fn(...):
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(lambda x: parser(...))
    dataset = dataset.map(lambda x, y: image_augmentation(...),
                          num_parallel_calls=num_threads)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
does it matter between which of the lines above I put dataset = dataset.prefetch(batch_size)? Or should it go after every operation that would have used output_buffer_size if the dataset was coming from
In a discussion on GitHub I found a comment by mrry:
Note that in TF 1.4 there will be a Dataset.prefetch() method that
makes it easier to add prefetching at any point in the pipeline, not
just after a map(). (You can try it by downloading the current nightly build.)
For example, Dataset.prefetch() will start a background thread to
populate a ordered buffer that acts like a tf.FIFOQueue, so that
downstream pipeline stages need not block. However, the prefetch()
implementation is much simpler, because it doesn't need to support as
many different concurrent operations as a tf.FIFOQueue.
So prefetch can be placed after any transformation, and it buffers the output of the transformation before it. So far I have noticed the biggest performance gains by putting it only at the very end.
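The behaviour mrry describes (a background thread filling an ordered, bounded buffer so that downstream stages need not block) can be imitated in plain Python. This is only an analogy built on the standard queue and threading modules, not how tf.data implements prefetch; the names prefetch, batches and _DONE are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the upstream iterator


def prefetch(upstream, buffer_size):
    """Yield items from `upstream`, produced ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded and ordered (FIFO)

    def fill():
        try:
            for item in upstream:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end so the consumer cannot hang

    threading.Thread(target=fill, daemon=True).start()

    while True:
        item = buffer.get()  # blocks only when the buffer is empty
        if item is _DONE:
            return
        yield item


def batches(n):
    """Stand-in for the upstream pipeline stage (for example, the batch step)."""
    for i in range(n):
        yield [i, i + 1]


result = list(prefetch(batches(5), buffer_size=2))
```

Because the wrapper only sees whatever its upstream iterator yields, wrapping the last stage buffers fully prepared batches, which is consistent with the observation that prefetching at the very end helps most.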
There is one more discussion, "Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle", where mrry explains a bit more about prefetch and the buffer.
UPDATE 2018/10/01:
Since version 1.7.0, the Dataset API (in contrib) has a prefetch_to_device option. Note that this transformation has to be the last one in the pipeline, and that contrib will be gone when TF 2.0 arrives. To have prefetch work on multiple GPUs, please use MultiDeviceIterator (for an example, see #13610).

In distributed tensorflow, how to write to summary from workers as well

I am using the Google Cloud ML distributed sample to train a model on a cluster of computers. Inputs and outputs (i.e. tfrecords, checkpoints, tfevents) are all on gs:// (Google Storage).
As in the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary so that it can be used for hyperparameter tuning, either within Cloud ML or with my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I run several evaluation steps in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information about the performance interval. In particular, the variance of the performance is important to me: I'd rather select a model with lower average performance but better worst cases.
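The selection rule described here (prefer a better worst case over a better mean) can be sketched in a few lines of plain Python. The accuracy lists and the names summarize and pick_by_worst_case are hypothetical and only illustrate the criterion.

```python
import statistics


def summarize(accuracies):
    """Mean, spread and worst case over several evaluation steps."""
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "worst": min(accuracies),
    }


def pick_by_worst_case(candidates):
    """Return the name of the candidate whose lowest accuracy is highest."""
    return max(candidates, key=lambda name: min(candidates[name]))


# Hypothetical per-batch accuracies for two models.
candidates = {
    "model_a": [0.95, 0.96, 0.70, 0.97],  # higher mean, one bad batch
    "model_b": [0.88, 0.87, 0.89, 0.88],  # lower mean, stable
}

stats = {name: summarize(values) for name, values in candidates.items()}
chosen = pick_by_worst_case(candidates)
```

With these numbers model_a has the higher mean but model_b is chosen, which is the trade-off a single averaged evaluation would hide.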
I therefore run several evaluation steps. What I would like to do is parallelize these evaluation steps, because right now only the master is evaluating. On large clusters this is a source of inefficiency, and I would like the workers to evaluate as well.
Basically, the supervisor is created as:
self.sv = tf.train.Supervisor(
    # Write summary_ops by hand.
    summary_op=None,
    # No saving; we do it manually in order to easily evaluate immediately
    # afterwards.
    save_model_secs=0)
At the end of training I call the summary writer:
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
    # I want to have an idea of statistics of accuracy
    # not just the mean, hence I run on 10 batches
    for i in range(10):
        self.global_step += 1
        # I call an evaluator, and extract the accuracy
        evaluation_values = self.evaluator.evaluate()
        accuracy_value = self.model.accuracy_value(evaluation_values)
        # now I dump the accuracy, ready to use within hptune
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(
                tag='training/hptuning/metric', simple_value=accuracy_value)
        ])
        self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from the workers as well, but I got an error: basically, summaries can be written from the master only. Is there an easy workaround? The error is: "Writing a summary requires a summary writer."
My guess is that you would create a separate summary writer on each worker yourself and write out the summaries directly instead.
I suspect you wouldn't use a supervisor for the eval processing either. Just load a session on each worker for doing eval with the latest checkpoint, and write out independent summaries.

How to implement a double queue structure in tensorflow

I am using TensorFlow to implement a neural network, and I want to achieve the following architecture: there are two queues, Q1 and Q2. Q1 is initialised with some file names, and Q2 will be filled with examples later.
Every time the session runs a step, a file name is popped from Q1 and passed to a processing part. In the processing part, data is read from the file and some number of different examples, say 32, is generated from the data. The 32 generated examples are then enqueued into Q2. Once Q2 reaches some limit, it dequeues a batch of examples.
In particular, I will generate nearly 1M examples every time a file is read, so this process must run in the background without blocking the main thread, and enqueueing into Q2 must be asynchronous.
I failed to find a solution. I have tried something like the following:
import tensorflow as tf
q1 = tf.FIFOQueue(capacity=32, dtypes=tf.int32)
init_op = q1.enqueue_many(([0, 1, 2],))
q2 = tf.FIFOQueue(capacity=64, dtypes=tf.int32)
r = q1.dequeue()
# mimic generating examples from data read from the file
for i in range(10):
    enq_op = q2.enqueue(r * 10 + i)
s = q2.dequeue()
sess = tf.InteractiveSession()
# don't know what to do
Could anyone help?
One problem I see is that you are confusing graph construction with graph execution. Your for i in range(10) loop only creates a bunch of enqueue ops; it does not actually add r*10+i to your queue.
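To show the intended data flow separately from TensorFlow's graph semantics, here is a sketch of the two-queue design using Python's standard queue and threading modules in place of TensorFlow queue ops. It is an analogy under that assumption, not a TensorFlow solution; the producer function and the None sentinel are hypothetical choices.

```python
import queue
import threading


def producer(q1, q2, examples_per_file):
    """Pop file ids from q1, expand each into examples, and push them into q2."""
    while True:
        try:
            file_id = q1.get_nowait()
        except queue.Empty:
            break  # no more files to process
        # Mimic generating examples from data read from the file.
        for i in range(examples_per_file):
            q2.put(file_id * 10 + i)  # blocks while q2 is full
    q2.put(None)  # sentinel: tells the consumer that production is finished


q1 = queue.Queue()
for file_id in [0, 1, 2]:
    q1.put(file_id)
q2 = queue.Queue(maxsize=64)

# The producer runs in the background, so the main thread is not blocked.
threading.Thread(target=producer, args=(q1, q2, 10), daemon=True).start()

examples = []
while True:
    item = q2.get()  # the main thread only waits when q2 is empty
    if item is None:
        break
    examples.append(item)
```

Here every put and get actually executes when the line runs, which is the difference from the snippet in the question, where the loop only adds enqueue ops to the graph and nothing is enqueued until a session runs them.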
I recommend going through the queue tutorial first to understand the basic concepts. Also this