what is the biggest bottleneck in maskrcnn_benchmark repo? - tensorflow

I am working on a repo that make use of the maskrcnn_benchmark repo. I have extensively, explored the bench-marking repo extensively for the cause of its slower performance on a cpu with respect to enter link description here.
In order to create a benchmark for the individual forward passes I have put a time counter for each part and it gives me the time required to calculate each component. I have had a tough time exactly pinpointing as to the slowest component of the entire architecture.I believe it to be BottleneckWithFixedBatchNorm class in the maskrcnn_benchmark/modeling/backbone/resnet.py file.
I will really appreciate any help in localisation of the biggest bottle neck in this architecture.

I have faced the same problem, the best possible solution for the same is to look inside the main code, go through the forward pass of each module and have a timer setup to log the time that is spent in the computations of each module. How we worked in it was to create an architecture where we create the time logger for each class, therefore every instance of the class will now be logging its time of execution, after through comparison, atleast in our case we have found that the reason for the delay was the depth of the Resnet module, (which given the computational cost of resnet is not a surprising factor at all, the only solution to the same is more palatalization so either ensure a bigger GPU for performing the task or reduce the depth of the Resnet network ).
I must inform that the maskrcnn_benchmark has been deprecated and an updated version of the same is available in the form of detectron2. Consider moving your code for significant speed improvements in the architecture.
BottleneckWithFixedBatchNorm is not the most expensive operation in the architecture and certainly not creating the bottleneck as all the operations instead of the name. The class isn't as computationally expensive and is computed in parallel even on a lower end CPU machine (at least in the inference stage).
An example of tracking better the performance of each module can be found with the code taken from the path : maskrcnn_benchmark/modeling/backbone/resnet.py
class ResNet(nn.Module):
def __init__(self, cfg):
super(ResNet, self).__init__()
# If we want to use the cfg in forward(), then we should make a copy
# of it and store it for later use:
# self.cfg = cfg.clone()
# Translate string names to implementations
stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]
# Construct the stem module
self.stem = stem_module(cfg)
# Constuct the specified ResNet stages
num_groups = cfg.MODEL.RESNETS.NUM_GROUPS
width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
stage2_bottleneck_channels = num_groups * width_per_group
stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
self.stages = []
self.return_features = {}
for stage_spec in stage_specs:
name = "layer" + str(stage_spec.index)
stage2_relative_factor = 2 ** (stage_spec.index - 1)
bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor
out_channels = stage2_out_channels * stage2_relative_factor
stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index -1]
module = _make_stage(
transformation_module,
in_channels,
bottleneck_channels,
out_channels,
stage_spec.block_count,
num_groups,
cfg.MODEL.RESNETS.STRIDE_IN_1X1,
first_stride=int(stage_spec.index > 1) + 1,
dcn_config={
"stage_with_dcn": stage_with_dcn,
"with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
"deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
}
)
in_channels = out_channels
self.add_module(name, module)
self.stages.append(name)
self.return_features[name] = stage_spec.return_features
# Optionally freeze (requires_grad=False) parts of the backbone
self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)
def _freeze_backbone(self, freeze_at):
if freeze_at < 0:
return
for stage_index in range(freeze_at):
if stage_index == 0:
m = self.stem # stage 0 is the stem
else:
m = getattr(self, "layer" + str(stage_index))
for p in m.parameters():
p.requires_grad = False
def forward(self, x):
start_timer=time.time()
outputs = []
x = self.stem(x)
for stage_name in self.stages:
x = getattr(self, stage_name)(x)
if self.return_features[stage_name]:
outputs.append(x)
print("ResNet time :: ", time.time()-start_timer,file=open("timelogger.log","a"))
return outputs
Only change that has to be made is in the forward pass and all the instance created out of this class will inherit the properties and log time (choose to write the same to the file instead of a simple stdout)

Related

Treatment of constraints in SLSQP optimization with openMDAO

With openMDAO, I am using FD derivatives and trying to solve a non-linearly constrained optimization problem with the SLSQP method. Any time the optimizer arrives at a point that violates one of the constraints, it just crashes with the message:
Optimization FAILED. Positive directional derivative for linesearch
For instance, if I intentionally set the initial point to an unfeasible design point, the optimizer performs 1 iteration and exits with the above error (the same happens when I start from a feasible point, but then optimizer arrives at an unfeasible point after a few iterations).
Based on the answer to the question in In OpenMDAO, is there a way to ensure that the constraints are respected before proceeding with a computation?, I'm assuming that raising the AnalysisError exception will not work in my case, is that correct? Is there any other way to prevent the optimizer from going into unfeasible regions or at least backtrack on the linesearch and try a different direction/distance? Or should the SLSQP method be only used when analytic derivatives are available?
Reproducible test case:
import numpy as np
import openmdao.api as om
class d1(om.ExplicitComponent):
def setup(self):
# Global design variables
self.add_input('r', val= [3,3,3])
self.add_input('T', val= 20)
# Coupling output
self.add_output('M', val=0)
self.add_output('cost', val=0)
def setup_partials(self):
# Finite difference all partials.
self.declare_partials('*', '*', method='fd')
def compute(self, inputs, outputs):
# define inputs
r = inputs['r']
T = inputs['T'][0]
cost = 174.42 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2])
M = 456.19 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2]) - 599718
outputs['M'] = M
outputs['cost'] = cost
class MDA(om.Group):
class ObjCmp(om.ExplicitComponent):
def setup(self):
# Global Design Variable
self.add_input('cost', val=0)
# Output
self.add_output('obj', val=0.0)
def setup_partials(self):
# Finite difference all partials.
self.declare_partials('*', '*', method='fd')
def compute(self, inputs, outputs):
outputs['obj'] = inputs['cost']
class ConCmp(om.ExplicitComponent):
def setup(self):
# Global Design Variable
self.add_input('M', val=0)
# Output
self.add_output('con', val=0.0)
def setup_partials(self):
# Finite difference all partials.
self.declare_partials('*', '*', method='fd')
def compute(self, inputs, outputs):
# assemble outputs
outputs['con'] = inputs['M']
def setup(self):
self.add_subsystem('d1', d1(), promotes_inputs=['r','T'],
promotes_outputs=['M','cost'])
self.add_subsystem('con_cmp', self.ConCmp(), promotes_inputs=['M'],
promotes_outputs=['con'])
self.add_subsystem('obj_cmp', self.ObjCmp(), promotes_inputs=['cost'],
promotes_outputs=['obj'])
# Build the model
prob = om.Problem(model=MDA())
model = prob.model
model.add_design_var('r', lower= [3,3,3], upper= [10,10,10])
model.add_design_var('T', lower= 20, upper= 220)
model.add_objective('obj', scaler=1)
model.add_constraint('con', lower=0)
# Setup the optimization
prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP', tol=1e-3, disp=True)
prob.setup()
prob.set_solver_print(level=0)
prob.run_driver()
# Printout
print('minimum found at')
print(prob.get_val('T')[0])
print(prob.get_val('r'))
print('constraint')
print(prob.get_val('con')[0])
print('minimum objective')
print(prob.get_val('obj')[0])
Based on your provided test case, the problem here is that your have a really poorly scaled objective and constraint (you also have some very strange coding choices ... which I modified).
Running the OpenMDAO scaling report shows that your objective and constraint values are both around 1e6 in magnitude:
This is quite large, and is the source of your problems. A (very rough) rule of thumb is that your objectives and constraints should be around order 1. Thats not hard and fast rule, but is generally a good starting point. Sometimes other scaling will work better, if you have very very larger or small derivatives ... but there are parts of SQP methods that are sensitive to the scaling of objective and constraint values directly. So trying to keep them roughly in the range of 1 is a good idea.
Adding ref=1e6 to both objective and constraints gave enough resolution for the numerical methods to converge the problem:
Current function value: [0.229372]
Iterations: 8
Function evaluations: 8
Gradient evaluations: 8
Optimization Complete
-----------------------------------
minimum found at
20.00006826587515
[3.61138704 3. 3.61138704]
constraint
197.20821903413162
minimum objective
229371.99547899762
Here is the code I modified (including removing the extra class definitions inside your group that didn't seem to be doing anything):
import numpy as np
import openmdao.api as om
class d1(om.ExplicitComponent):
def setup(self):
# Global design variables
self.add_input('r', val= [3,3,3])
self.add_input('T', val= 20)
# Coupling output
self.add_output('M', val=0)
self.add_output('cost', val=0)
def setup_partials(self):
# Finite difference all partials.
self.declare_partials('*', '*', method='cs')
def compute(self, inputs, outputs):
# define inputs
r = inputs['r']
T = inputs['T'][0]
cost = 174.42 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2])
M = 456.19 * T * (r[0]**2 + 2*r[1]**2 + r[2]**2 + r[0]*r[1] + r[1]*r[2]) - 599718
outputs['M'] = M
outputs['cost'] = cost
class MDA(om.Group):
def setup(self):
self.add_subsystem('d1', d1(), promotes_inputs=['r','T'],
promotes_outputs=['M','cost'])
# self.add_subsystem('con_cmp', self.ConCmp(), promotes_inputs=['M'],
# promotes_outputs=['con'])
# self.add_subsystem('obj_cmp', self.ObjCmp(), promotes_inputs=['cost'],
# promotes_outputs=['obj'])
# Build the model
prob = om.Problem(model=MDA())
model = prob.model
model.add_design_var('r', lower= [3,3,3], upper= [10,10,10])
model.add_design_var('T', lower= 20, upper= 220)
model.add_objective('cost', ref=1e6)
model.add_constraint('M', lower=0, ref=1e6)
# Setup the optimization
prob.driver = om.ScipyOptimizeDriver(optimizer='SLSQP', tol=1e-3, disp=True)
prob.setup()
prob.set_solver_print(level=0)
prob.set_val('r', 7.65)
prob.run_driver()
# Printout
print('minimum found at')
print(prob.get_val('T')[0])
print(prob.get_val('r'))
print('constraint')
print(prob.get_val('M')[0])
print('minimum objective')
print(prob.get_val('cost')[0])
Which SLSQP method are you using? There is one implementation in pyOptSparse and one in ScipyOptimizer. The one in pyoptsparse is older and doesn't respect bounds constraints. The one in Scipy is newer and does. (Yes, its very confusing that they have the same name and share some lineage... but are not the same optimizer any more)
You shouldn't raise an analysis error when you go outside the bounds. If you need strict bounds respecting, I suggest using IPopt from within pyoptsparse (if you can get it to compile) or switching to ScipyOptimizerDriver and its SLSQP implementation.
Based on your question, its not totally clear to me if you're talking about bounds constraints or inequality/equality constraints. If its the latter, then then there isn't any optimizer that would guarantee you remain in a feasible region all the time. Interior point methods like IPopt will stay inside the region much better, but not 100% of the time.
In general, with gradient based optimization its pretty critical that you make your problem smooth and continuous even when its outside the constraint areas. If there are parts of the space that you absolutely can not go into, then you need to make those variables into design variables and use bound constraints. This sometimes requires reformulating your problem formulation a little bit, possibly by adding a kind of compatibility constraint that says "design variable = computed_value". Then you can make sure that the design variable is passed into anything that requires the value to be strictly within a bound, and (hopefully) a converged answer will also satisfy your compatibility constraint.
If you provide some kind of a test case or example, I can amend my answer with a more specific suggestion.

Speeding up Inference time on GPT2 - optimizing tf.sess.run()

I am trying to optimize the inference time on GPT2. The current time to generate a sample after calling the script is 55 secs on Google Colab. I put in timestamps to try to isolate where the bottleneck is.
This is the code:
for _ in range(nsamples // batch_size):
out = sess.run(output, feed_dict={
context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
for i in range(batch_size):
generated += 1
text = enc.decode(out[i])
print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
print(text)
print("=" * 80)
The line
out = sess.run(output, feed_dict={
context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
is where the complexity lies. Does anyone have any way I can improve this piece of code ? Thank you so much!
batch_size is set to 1 in GPT2 and there is no way to change that without crashing the process. So "[context_tokens for _ in range(batch_size)]" means "[context_tokens for _ in range(1)]" means "[context_tokens]" which will not improve speed by much but is safe to implement and makes looking at the code a bit more sensible. The real complexty is you have a 6 gigabyte bohemoth in your ram that you are accessing in that session.
As a practical matter, the less tokens you send over and the less processing those tokens take the faster this part will execute. As each token needs to be sent through the GPT2 AI. But consequently the less 'intelligent' the response will be.
By the way // is an integer division operation, so nsamples // batch_size = nsamples/1 = nsamples size. And from what I have seen the nsamples was 1 when I printed its value in print(nsamples). So that for loop is another loop of one item, which means the loop can be removed.
GPT2 is just a implementation of tensorflow. Look up: how to make a graph in tensorflow; how to call a session for that graph; how to make a saver save the variables in that session and how to use the saver to restore the session. You will learn about checkpoints, meta files and other implementation that will make your files make more sense.
The tensorflow module is found in Lib, site-packages, tensorflow_core (at least in the AI Dungeon 2 Henk717 fork). Most of the processing is happening in sub directories python/ops and framework. You will see these pop up if your coding breaks the hooks tf was expecting.
If this question regards the implementation in AI Dungeon the best I have been able to implement is a recursive call to generator.generate that is exited by a try except KeyboardInterrupt: with a print(token, end = '', flush = True) for each token as they are generated. This way you are able to view each token as the AI generates it, rather that waiting for 55 sec for a ping sound.
Also, the Cuda warnings need a single quote, not double so,
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
not "3"
That will take off the cuda warnings when tensorflow is imported.
Next there are depreciations that popup from the implementation of GPT2 in tensorflow versions above 1.5.
To shut those off
tfv = tf.compat.v1
tfv.set_verbosity(tfv.logging.Error)
Is all you need. You don't need to import warnings.
Even so it is a long load time between the tf initialization, the sample initial generation and the loading of the module into ram. I added in model.shape_list(x):
the followin line
print("_",end ='',flush=True)
And at least for the module being built to localize it to the machine you can view a "progress bar" of sorts.

Restored model in tensorflow gives different results for relu operation

The weights retrieved from restored model doesn't change and the input is also constant
But the output of 'Relu:0' operation is giving different results each time.
Below is my code:
sess=tf.Session()
saver = tf.train.import_meta_graph('checkpoints/checkpoints_otherapproach_1/cameranetwork_RAID_CNN-3100.meta')
saver.restore(sess,tf.train.latest_checkpoint(checkpoint_dir='checkpoints/checkpoints_otherapproach_1/'))
images = tf.get_default_graph().get_tensor_by_name('images:0')
phase = tf.get_default_graph().get_tensor_by_name('phase:0')
Activ = tf.get_default_graph().get_tensor_by_name('network/siamese_model/convolution_1/conv_1/Relu:0')
image_array = np.zeros(shape = [1,3,128,64,3]) #*******
imagepath = 'RAiD_Dataset' + '/images_afterremoving_persons_notinallcameras/'+'test'+'/camera_'+str(1)
fullfile_name = imagepath+"/"+ 'camera_1_person_23_index_1.jpg'
image_array[0][0] = cv2.imread(fullfile_name)
image_array[0][1] = image_array[0][0]
image_array[0][2] = image_array[0][0]
image_array = image_array.astype(np.float32)
feed_dict_values ={images: image_array, phase:False}
temp2 = sess.run(Activ, feed_dict =feed_dict_values)
temp1 = sess.run(Activ, feed_dict =feed_dict_values)
print (temp1==temp2).all() #output is false
There are two possible reasons for this:
Some of the tensorflow ops inherit non-deterministic behavior from CUDA. This results in small numerical errors (which might be amplified by non-linearities). See this answer on how to try running your model on a single CPU thread. If the two arrays will turn out to be identical in this condition, then this is the case.
I'm assuming that you know the graph you are loading, but the graph itself might produce inconsistent results 'by design' due to operations deliberately introducing either randomness or inconstant data. For example, consider operations that use the random number generator or operations that update variables (e.g., tf.assign) each time Activ is evaluated.

In dynamic_rnn() is it valid to include variables in the state?

I'm implementing a RNN cell around BasicLSTMCell where I want to be able to look back on past hidden states (across batch boundaries). I'm using dynamic_rnn() and the basic pattern I use is:
def __call__(self, inputs, old_state, scope=None):
mem = old_state[2]
# [do something with with mem]
cell_out, new_state = self.cell(inputs,
(old_state[0],
old_state[1]))
h_state = new_state.h
c_state = new_state.c
# control dependency required because of self.buf_index
with tf.get_default_graph().control_dependencies([cell_out]):
new_mem = write_to_buf(self.out_buf,
cell_out,
self.buf_index)
# update the buffer index
with tf.get_default_graph().control_dependencies(new_mem):
inc_step = tf.assign(self.buf_index, (self.buf_index + 1) %
self.buf_size)
with tf.get_default_graph().control_dependencies([inc_step]):
h_state = tf.identity(h_state)
t = [c_state, h_state, new_mem]
return cell_out, tuple(t)
self.buf and self.buf_index are variables. write_to_buf() is a function that uses scatter_update() to write the new hidden states to the buffer and returns the result.
I rely on the assumption that accesses to scatter updates return value guarantee that the new variable value is used (similar to this) so that caching of variables does not mess things up.
From debug prints it seems to work but it would be nice to get some confirmation or suggestions on alternatives.

Can cond support TF ops with side effects?

The (source code) documentation for tf.cond is unclear on whether the functions to be performed when the predicate is evaluated can have side effects or not. I've done some tests but I'm getting conflicting results. For example the code below does not work:
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
adder = count.assign_add(1)
subtractor = count.assign_sub(2)
my_op = control_flow_ops.cond(pred, lambda: adder, lambda: subtractor)
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
I.e. no matter what value the predicate evaluates to, both functions are getting run, and so the net result is a subtraction of 1. On the other hand, this code snippet does work, where the only difference is that I add new ops to the graph every time my_op is called:
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
my_op = control_flow_ops.cond(pred, lambda:count.assign_add(1), lambda:count.assign_sub(2))
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
Not sure why creating new ops every time works while the other case doesn't, but I'd obviously rather not be adding nodes as the graph will eventually become too big.
Your second version—where the assign_add() and assign_sub() ops are creating inside the lambdas passed to cond()—is the correct way to do this. Fortunately, each of the two lambdas is only evaluated once, during the call to cond(), so your graph will not grow without bound.
Essentially what cond() does is the following:
Create a Switch node, which forwards its input to only one of two outputs, depending on the value of pred. Let's call the outputs pred_true and pred_false. (They have the same value as pred but that's unimportant since this is never directly evaluated.)
Build the subgraph corresponding to the if_true lambda, where all of the nodes have a control dependency on pred_true.
Build the subgraph corresponding to the if_false lambda, where all of the nodes have a control dependency on pred_false.
Zip together the lists of return values from the two lambdas, and create a Merge node for each of these. A Merge node takes two inputs, of which only one is expected to be produced, and forwards it to its output.
Return the tensors that are the outputs of the Merge nodes.
This means you can run your second version, and be content that the graph remains a fixed size, regardless of how many steps you run.
The reason your first version doesn't work is that, when a Tensor is captured (like adder or subtractor in your example), an additional Switch node is added to enforce the logic that the value of the tensor is only forwarded to the branch that actually executes. This is an artifact of how TensorFlow combines feed-forward dataflow and control flow in its execution model. The result is that the captured tensors (in this case the results of the assign_add and assign_sub) will always be evaluated, even if they aren't used, and you'll see their side effects. This is something we need to document better, and as Michael says, we're going to make this more usable in future.
The second case works because you have added the ops within the cond: this causes them to conditionally execute.
The first case it is analogous to saying:
adder = (count += 1)
subtractor = (count -= 2)
if (cond) { adder } else { subtractor }
Since adder and subtractor are outside the conditional, they are always executed.
The second case is more like saying
if (cond) { adder = (count += 1) } else { subtractor = (count -= 2) }
which in this case does what you expected.
We realize that the interaction between side effects and (somewhat) lazy evaluation is confusing, and we have a medium-term goal to make things more uniform. But the important thing to understand for now is that we do not do true lazy evaluation: the conditional acquires a dependency on every quantity defined outside the conditional that is used within either branch.