Speeding up Inference time on GPT2 - optimizing tf.sess.run() - tensorflow

I am trying to optimize the inference time on GPT2. The current time to generate a sample after calling the script is 55 secs on Google Colab. I put in timestamps to try to isolate where the bottleneck is.
This is the code:
for _ in range(nsamples // batch_size):
out = sess.run(output, feed_dict={
context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
for i in range(batch_size):
generated += 1
text = enc.decode(out[i])
print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
print(text)
print("=" * 80)
The line
out = sess.run(output, feed_dict={
context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
is where the complexity lies. Does anyone have any way I can improve this piece of code ? Thank you so much!

batch_size is set to 1 in GPT2 and there is no way to change that without crashing the process. So "[context_tokens for _ in range(batch_size)]" means "[context_tokens for _ in range(1)]" means "[context_tokens]" which will not improve speed by much but is safe to implement and makes looking at the code a bit more sensible. The real complexty is you have a 6 gigabyte bohemoth in your ram that you are accessing in that session.
As a practical matter, the less tokens you send over and the less processing those tokens take the faster this part will execute. As each token needs to be sent through the GPT2 AI. But consequently the less 'intelligent' the response will be.
By the way // is an integer division operation, so nsamples // batch_size = nsamples/1 = nsamples size. And from what I have seen the nsamples was 1 when I printed its value in print(nsamples). So that for loop is another loop of one item, which means the loop can be removed.
GPT2 is just a implementation of tensorflow. Look up: how to make a graph in tensorflow; how to call a session for that graph; how to make a saver save the variables in that session and how to use the saver to restore the session. You will learn about checkpoints, meta files and other implementation that will make your files make more sense.
The tensorflow module is found in Lib, site-packages, tensorflow_core (at least in the AI Dungeon 2 Henk717 fork). Most of the processing is happening in sub directories python/ops and framework. You will see these pop up if your coding breaks the hooks tf was expecting.
If this question regards the implementation in AI Dungeon the best I have been able to implement is a recursive call to generator.generate that is exited by a try except KeyboardInterrupt: with a print(token, end = '', flush = True) for each token as they are generated. This way you are able to view each token as the AI generates it, rather that waiting for 55 sec for a ping sound.
Also, the Cuda warnings need a single quote, not double so,
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
not "3"
That will take off the cuda warnings when tensorflow is imported.
Next there are depreciations that popup from the implementation of GPT2 in tensorflow versions above 1.5.
To shut those off
tfv = tf.compat.v1
tfv.set_verbosity(tfv.logging.Error)
Is all you need. You don't need to import warnings.
Even so it is a long load time between the tf initialization, the sample initial generation and the loading of the module into ram. I added in model.shape_list(x):
the followin line
print("_",end ='',flush=True)
And at least for the module being built to localize it to the machine you can view a "progress bar" of sorts.

Related

Ray Tune Stuck with multiple runs

Hi I am trying hyper parameter optimization with ray tune.
Below is my code implementation.
However I get stuck and can't get the result back even though there aren't any error messages.
#ray.remote
def main:
do_somthing
return loss
def ray_pick_best_hypter(config):
runs = 10
loss_avg = np.mean(ray.get([main.remote(config,run=x) for x in range(runs)]))
tune.report(loss_avg=loss_avg)
config = load_config()
analysis = ray.tune.run(ray_pick_best_hypter, config=config,progress_reporter=reporter)
The below code works fine, but I want to run multiple experiments and get the mean value.
def ray_pick_best_hypter(config):
loss_avg = ray.get([main.remote(config,run=x))
tune.report(loss_avg=loss_avg)
What is the problem in the code?
It seems you are starting multiple distributed training processes from within your trainable. Each call to main.remote() will start a new distributed task. Since you're starting 10 of them at the same time, they will try to run in parallel.
However, the default resource allocation for each trial is usually just 1 CPU - so the remote tasks cannot be scheduled.
What you can do to resolve this is to pass resources_per_trial={"cpu": 11} - that way each of your remote tasks will have their own CPU to run on.

what is the biggest bottleneck in maskrcnn_benchmark repo?

I am working on a repo that make use of the maskrcnn_benchmark repo. I have extensively, explored the bench-marking repo extensively for the cause of its slower performance on a cpu with respect to enter link description here.
In order to create a benchmark for the individual forward passes I have put a time counter for each part and it gives me the time required to calculate each component. I have had a tough time exactly pinpointing as to the slowest component of the entire architecture.I believe it to be BottleneckWithFixedBatchNorm class in the maskrcnn_benchmark/modeling/backbone/resnet.py file.
I will really appreciate any help in localisation of the biggest bottle neck in this architecture.
I have faced the same problem, the best possible solution for the same is to look inside the main code, go through the forward pass of each module and have a timer setup to log the time that is spent in the computations of each module. How we worked in it was to create an architecture where we create the time logger for each class, therefore every instance of the class will now be logging its time of execution, after through comparison, atleast in our case we have found that the reason for the delay was the depth of the Resnet module, (which given the computational cost of resnet is not a surprising factor at all, the only solution to the same is more palatalization so either ensure a bigger GPU for performing the task or reduce the depth of the Resnet network ).
I must inform that the maskrcnn_benchmark has been deprecated and an updated version of the same is available in the form of detectron2. Consider moving your code for significant speed improvements in the architecture.
BottleneckWithFixedBatchNorm is not the most expensive operation in the architecture and certainly not creating the bottleneck as all the operations instead of the name. The class isn't as computationally expensive and is computed in parallel even on a lower end CPU machine (at least in the inference stage).
An example of tracking better the performance of each module can be found with the code taken from the path : maskrcnn_benchmark/modeling/backbone/resnet.py
class ResNet(nn.Module):
def __init__(self, cfg):
super(ResNet, self).__init__()
# If we want to use the cfg in forward(), then we should make a copy
# of it and store it for later use:
# self.cfg = cfg.clone()
# Translate string names to implementations
stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]
# Construct the stem module
self.stem = stem_module(cfg)
# Constuct the specified ResNet stages
num_groups = cfg.MODEL.RESNETS.NUM_GROUPS
width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
stage2_bottleneck_channels = num_groups * width_per_group
stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
self.stages = []
self.return_features = {}
for stage_spec in stage_specs:
name = "layer" + str(stage_spec.index)
stage2_relative_factor = 2 ** (stage_spec.index - 1)
bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor
out_channels = stage2_out_channels * stage2_relative_factor
stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index -1]
module = _make_stage(
transformation_module,
in_channels,
bottleneck_channels,
out_channels,
stage_spec.block_count,
num_groups,
cfg.MODEL.RESNETS.STRIDE_IN_1X1,
first_stride=int(stage_spec.index > 1) + 1,
dcn_config={
"stage_with_dcn": stage_with_dcn,
"with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
"deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
}
)
in_channels = out_channels
self.add_module(name, module)
self.stages.append(name)
self.return_features[name] = stage_spec.return_features
# Optionally freeze (requires_grad=False) parts of the backbone
self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)
def _freeze_backbone(self, freeze_at):
if freeze_at < 0:
return
for stage_index in range(freeze_at):
if stage_index == 0:
m = self.stem # stage 0 is the stem
else:
m = getattr(self, "layer" + str(stage_index))
for p in m.parameters():
p.requires_grad = False
def forward(self, x):
start_timer=time.time()
outputs = []
x = self.stem(x)
for stage_name in self.stages:
x = getattr(self, stage_name)(x)
if self.return_features[stage_name]:
outputs.append(x)
print("ResNet time :: ", time.time()-start_timer,file=open("timelogger.log","a"))
return outputs
Only change that has to be made is in the forward pass and all the instance created out of this class will inherit the properties and log time (choose to write the same to the file instead of a simple stdout)

How to report algorithm running time?

I am running a variational auto-encoder in TensorFlow, which could take a long time. Thus I want to report the time the algorithm has been running for as a scalar on TensorBoard.
One dirty way is to hard-code the start time of the compilation into a global variable, or pass it as an argument to the model function and compute the difference with current time.
Does Tensorflow have a native way to do it?
There is the tf.train.ProfilerHook. Comes with release 1.14.
Example usage:
estimator = tf.estimator.LinearClassifier(...)
hooks = [tf.train.ProfilerHook(output_dir=model_dir, save_secs=600, show_memory=False)]
estimator.train(input_fn=train_input_fn, hooks=hooks)
Executing the hook will generate files timeline-xx.json in output_dir.
Then open chrome://tracing/ in chrome browser and load the file. You will get a time usage timeline like below.

tf.nn.rnn_cell.GRUCell were built on CPU device

I'm training a 2-layer seq2seq model now and gru_cell is used.
def create_rnn_cell():
encoDecoCell = tf.contrib.rnn.GRUCell(emb_dim)
encoDecoCell = tf.contrib.rnn.DropoutWrapper(
encoDecoCell,
input_keep_prob=1.0,
output_keep_prob=0.7
)
return encoDecoCell
encoder_mutil = tf.contrib.rnn.MultiRNNCell(
[create_rnn_cell() for _ in range(num_layers)],
)
query_encoder_emb = tf.contrib.rnn.EmbeddingWrapper(
encoder_mutil,
embedding_classes=vocab_size,
embedding_size=word_embedding
)
Timeline object is used to get the time of execution for each node in the graph and I found most operations inside GRU_cell (including MatMul) happened on CPU device which made it very slow. I installed the gpu version of tf-1.8. Any comments about this? Did I miss something here?
I guess there is something wrong with tf.variable_scope because I'm using different buckets for the training data. This is how I reuse the variable between different bucktes:
for i, bucket in enumerate(buckets):
with tf.variable_scope(name_or_scope="RNN_encoder", reuse=True if i > 0 else None) as var_scope:
query_output, query_state = tf.contrib.rnn.static_rnn(query_encoder_emb,inputs=self.query[:bucket[0]],dtype=tf.float32)
execution time screenshot
I found the problem. In the source code of EmbeddingWrapper, CPU is used.
tf.contrib.rnn.EmbeddingWrapper
I rewrote this function and now it works on GPU and is much faster. So be careful if you want to use tf.contrib.rnn.EmbeddingWrapper.

When forward using MXNet, how to do with varying 'batch size' in data_shapes?

Hi,I have a question that, how can I make predict with unfixed input data? I will try to describe in detail clear:
I use MTCNN for face detection(it's ok unfamiliar with that), and it employs 3 networks: PNet, RNet, ONet. PNet detects a mass of proposal face bounding boxes, then these boxes are coarse-to-fine by the rest net one after another, finally get precise face bbox(s). When taking an image as input to PNet, image's size is unfixed, and the output proposal box number from PNet is also unfixed, so as RNet, ONet. Reference to another MTCNN code I set a large data_shapes(e.g., image size, batch size) when I bind the module, and initialize all to zero,then make predict. That works though, Isn't that a redundant calculation? (Question 1)
PNet:
max_img_w=1000
max_img_h=1000
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det1’, 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.PNets.bind(data_shapes=[(‘data’, (1, 3, max_img_w, max_img_h))],for_training=False)
self.PNets.set_params(arg_params,aux_params)
RNet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det2’, 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.RNet.bind(data_shapes=[(‘data’, (2048,3, 24, 24))],for_training=False)
self.RNet.set_params(arg_params,aux_params,allow_missing=True)
ONet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det3’, 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.ONet.bind(data_shapes=[(‘data’, (256, 3, 48, 48))],for_training=False)
self.ONet.set_params(arg_params,aux_params,allow_missing=True)
And I try mx.mod.Module.reshape before predict, which will adjust data'shape according to last network's output, but I get this error:(Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing is that The MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primary use deprecated function to load models:
self.PNet = mx.model.FeedForward.load(‘det1’,0)
One single line to work with arbitrary data_shapes, why this function be deprecated..?(Question 3)
I found a little difference that after load model, FeedFroward takes 0MB memory before make one predict, but mx.mod.Module takes up memory once loaded, and increase obviously after making one prediction.
You can use MXNet imperative API Gluon and that will let you use different batch-sizes.
If like in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format ('-0001.params', '-symbol.json' for e.g), you can load it in Gluon that way:
ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course