Java Tensorflow Serving Prediction Client Too Slow - tensorflow

I have created a TensorFlow object detection model and served it using TensorFlow Serving. I have created a Python client to test the served model, and it takes around 40 ms to receive the full prediction.
t1 = datetime.datetime.now()
result = stub.Predict(request, 60.0) # 60 secs timeout
t2 = datetime.datetime.now()
print ((t2 - t1).microseconds / 1000)
Now, my problem is that when I do the same in Java, it takes far too long (about 10 times as much), 450 to 500 ms.
ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 9000)
        .usePlaintext(true)
        .build();
PredictionServiceGrpc.PredictionServiceBlockingStub stub = PredictionServiceGrpc.newBlockingStub(channel);
......
Instant pre = Instant.now();
Predict.PredictResponse response = stub.predict(request);
Instant curr = Instant.now();
System.out.println("time " + ChronoUnit.MILLIS.between(pre,curr));

The actual issue was that I was sending all the image pixels over the network (which was a bad idea). Changing the input to an encoded image made it much faster.
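For reference, a minimal Python sketch of an encoded-image request; the model name, signature name and input key below are assumptions and must match whatever your served model actually exports:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

with open('image.jpg', 'rb') as f:
    jpeg_bytes = f.read()

request = predict_pb2.PredictRequest()
request.model_spec.name = 'detector'                   # assumed model name
request.model_spec.signature_name = 'serving_default'  # assumed signature
# Send the compressed JPEG bytes instead of the raw pixel tensor.
request.inputs['encoded_image'].CopyFrom(              # assumed input key
    tf.make_tensor_proto([jpeg_bytes], dtype=tf.string))

result = stub.Predict(request, 60.0)  # 60 s timeout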

Related

Improve performance of broadcast multiplication and 1 - broadcast

How can I improve the last two "operations" of this code in terms of time (minus1_power_x_t*x and 1-minus1_power_x)? They take 1.73 and 1.47 seconds respectively. I ask about the last two operations because the other ones will be constants.
import time
from multiprocessing import Pool
import multiprocessing
import numpy as np

def create_ext_component_function(dim):
    x = np.array([i for i in range(dim)], dtype=float)
    t3 = time.time()
    y_shift = np.fromfunction(lambda i, j: i >> j, ((2**dim), dim), dtype=np.uint32)
    elapsed3 = time.time() - t3
    print('y_shift Elapsed: %s' % elapsed3)
    t3 = time.time()
    um = np.ones(((2**dim), dim), dtype=np.uint8)
    elapsed3 = time.time() - t3
    print('um Elapsed: %s' % elapsed3)
    t3 = time.time()
    and_list = np.bitwise_and(y_shift, um)
    elapsed3 = time.time() - t3
    print('and_list Elapsed: %s' % elapsed3)
    t3 = time.time()
    minus1_power_x_t = np.power(-1, and_list)
    elapsed3 = time.time() - t3
    print('minus1_power_x_t Elapsed: %s' % elapsed3)
    # I need to improve the last two operations
    t3 = time.time()
    minus1_power_x = minus1_power_x_t * x
    elapsed3 = time.time() - t3
    print('minus1_power*x Elapsed: %s' % elapsed3)
    t3 = time.time()
    um_minus_minus1_power = 1 - minus1_power_x
    elapsed3 = time.time() - t3
    print('um_minus_minus1_power Elapsed: %s' % elapsed3)
    return um_minus_minus1_power

if __name__ == '__main__':
    dim = 24
    print(create_ext_component_function(dim))
EDIT: Take into account that minus1_power_x_t contains only the values -1 and 1.
The problem with this code is that it creates many big temporary arrays for basic memory-bound operations. Big temporary arrays are slow to fill because of the limited speed of the RAM (unsurprisingly) and also because of page faults.
Indeed, the operating system (OS) performs the mapping between virtual memory pages and physical memory pages when a big temporary array is filled for the first time. This process is slow because the Numpy code is sequential (and so are the page faults), pages are generally small so there are many pages to map (typically 4 KB on an x86 system), and most OSes pre-fill pages with zeros for security reasons (so that a page is not filled with, say, your bank account details coming from a recently closed browser tab). Note that there are (transparent) ways to reduce the number of pages (huge pages), but they are costly in this case too because of the pre-fill.
The best way to solve this problem is to minimize the number of temporary buffers. This can be done with the out argument of many Numpy functions (e.g. subtract). You can also compute the arrays in-place for better performance. This also reduces the memory footprint, so that the memory is not swapped out to your slow storage device (swap memory). An alternative is to use a parallel implementation of Numpy or to write parallel Numba/Cython code (Numba is probably the best option here; a sketch is shown further below). Another option is to use Numexpr.
Note that using smaller data types also helps to improve the performance of the code (as the raw buffer in memory will be smaller and therefore faster to read/write/pre-fill). Using float32 is faster than float64, although it may not fit your needs. The same applies to integers (e.g. int8 vs int32 vs int64 for and_list and minus1_power_x_t).
Here is an example:
# [...]
# The dtype parameter is important to reduce the size in memory
minus1_power_x_t = np.power(-1,and_list, dtype=np.int8)
# Pre-fill a buffer in memory with 0.0 (can be reused multiple times)
buffer = np.full(minus1_power_x_t.shape, 0.0)
# Note that minus1_power_x is overwritten once um_minus_minus1_power has been computed.
# If you need both, you can use 2 pre-filled buffers (only useful if reused multiple times, e.g. in a loop).
minus1_power_x = np.multiply(minus1_power_x_t, x, out=buffer)
um_minus_minus1_power = np.subtract(1.0, minus1_power_x, out=buffer)
With this method, the multiply is about 2.5 times faster on my (Intel Xeon) machine and the subtract is about 4 times faster.
Numexpr can be used to fuse the multiply and the subtract. It also supports user-defined output buffers. Moreover, it can parallelize the computation. Here is an example:
um_minus_minus1_power = numexpr.evaluate('1.0 - minus1_power_x_t * x', out=buffer)
The Numexpr code is about 12.3 times faster on my machine. Note that using float32 arrays instead of float64 ones should be about 2 times faster.
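For completeness, here is a rough sketch of the Numba route mentioned above (not benchmarked here). It fuses the multiply and the subtract and parallelizes over rows, assuming the same array shapes as in the question:

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def fused_one_minus_prod(minus1_power_x_t, x, out):
    # out[i, j] = 1.0 - minus1_power_x_t[i, j] * x[j], computed with no temporaries
    n, m = minus1_power_x_t.shape
    for i in nb.prange(n):
        for j in range(m):
            out[i, j] = 1.0 - minus1_power_x_t[i, j] * x[j]
    return out

# Usage (the output buffer can be reused across calls):
# out = np.empty(minus1_power_x_t.shape, dtype=np.float64)
# um_minus_minus1_power = fused_one_minus_prod(minus1_power_x_t, x, out)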

Speeding up Inference time on GPT2 - optimizing tf.sess.run()

I am trying to optimize the inference time on GPT2. The current time to generate a sample after calling the script is 55 secs on Google Colab. I put in timestamps to try to isolate where the bottleneck is.
This is the code:
for _ in range(nsamples // batch_size):
    out = sess.run(output, feed_dict={
        context: [context_tokens for _ in range(batch_size)]
    })[:, len(context_tokens):]
    for i in range(batch_size):
        generated += 1
        text = enc.decode(out[i])
        print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
        print(text)
        print("=" * 80)
The line
out = sess.run(output, feed_dict={
    context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
is where the complexity lies. Does anyone have a way I can improve this piece of code? Thank you so much!
batch_size is set to 1 in GPT2 and there is no way to change that without crashing the process. So "[context_tokens for _ in range(batch_size)]" means "[context_tokens for _ in range(1)]", which means "[context_tokens]"; simplifying it will not improve speed by much, but it is safe to do and makes the code a bit more readable. The real complexity is that you have a 6-gigabyte behemoth in your RAM that you are accessing in that session.
As a practical matter, the fewer tokens you send over and the less processing those tokens take, the faster this part will execute, since each token needs to be pushed through the GPT2 model. But consequently the less 'intelligent' the response will be.
By the way, // is integer division, so nsamples // batch_size = nsamples / 1 = nsamples. And from what I have seen, nsamples was 1 when I printed its value with print(nsamples). So that for loop is another loop over a single item, which means the loop can be removed, as in the sketch below.
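A minimal sketch of the collapsed loop, assuming nsamples == batch_size == 1 as described above (the names sess, output, context, context_tokens and enc are taken from the question's snippet):

# With nsamples == batch_size == 1, both loops collapse to a single call.
out = sess.run(output, feed_dict={context: [context_tokens]})[:, len(context_tokens):]
text = enc.decode(out[0])
print("=" * 40 + " SAMPLE 1 " + "=" * 40)
print(text)
print("=" * 80)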
GPT2 is just an implementation in TensorFlow. Look up: how to make a graph in TensorFlow; how to run a session for that graph; how to make a saver save the variables in that session; and how to use the saver to restore the session. You will learn about checkpoints, meta files and other implementation details that will make the project's files make more sense.
The tensorflow module is found in Lib/site-packages/tensorflow_core (at least in the AI Dungeon 2 Henk717 fork). Most of the processing happens in the subdirectories python/ops and framework. You will see these paths pop up in tracebacks if your code breaks the hooks tf was expecting.
If this question concerns the implementation in AI Dungeon, the best I have been able to do is a recursive call to generator.generate that is exited by a try/except KeyboardInterrupt:, with a print(token, end='', flush=True) for each token as it is generated. This way you can view each token as the AI generates it, rather than waiting 55 seconds for a ping sound.
Also, the CUDA warnings can be silenced by setting TF_CPP_MIN_LOG_LEVEL to the string '3' (it must be a string, not the integer 3) before TensorFlow is imported:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
That will take off the CUDA warnings when tensorflow is imported.
Next, there are deprecation warnings that pop up from the GPT2 implementation when run on TensorFlow versions above 1.5.
To shut those off,
tfv = tf.compat.v1
tfv.logging.set_verbosity(tfv.logging.ERROR)
is all you need. You don't need to import warnings.
Even so, there is a long load time between the tf initialization, the initial sample generation and the loading of the model into RAM. In model.shape_list(x) I added the following line:
print("_", end='', flush=True)
and, at least while the model is being built and localized to the machine, you can view a "progress bar" of sorts.
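For context, a rough sketch of where that print sits; the body of shape_list here is an assumption based on the helper of the same name in the GPT-2 model.py:

import tensorflow as tf

def shape_list(x):
    """Deal with dynamic shapes in TensorFlow cleanly (assumed body)."""
    print("_", end='', flush=True)  # crude progress tick: printed once per call while the graph is built
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]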

What is the biggest bottleneck in the maskrcnn_benchmark repo?

I am working on a repo that makes use of the maskrcnn_benchmark repo. I have extensively explored the benchmarking repo for the cause of its slower performance on a CPU with respect to enter link description here.
In order to benchmark the individual forward passes I have put a time counter on each part, which gives me the time required to compute each component. I have had a tough time pinpointing exactly which is the slowest component of the entire architecture. I believe it to be the BottleneckWithFixedBatchNorm class in the maskrcnn_benchmark/modeling/backbone/resnet.py file.
I would really appreciate any help in localising the biggest bottleneck in this architecture.
I have faced the same problem. The best solution is to look inside the main code, go through the forward pass of each module, and set up a timer to log the time spent in the computations of each module. The way we worked was to create a time logger for each class, so that every instance of the class logs its execution time; a generic sketch of this idea is shown below. After thorough comparison, at least in our case, we found that the reason for the delay was the depth of the ResNet module (which, given the computational cost of ResNet, is not surprising at all; the only solution is more parallelization, so either get a bigger GPU for the task or reduce the depth of the ResNet network).
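As a rough, repo-agnostic sketch (not code from maskrcnn_benchmark), PyTorch forward hooks can attach such a timer to every sub-module without editing each class:

import time
import torch.nn as nn

def attach_timers(model: nn.Module, log_path="timelogger.log"):
    # Register hooks so every sub-module logs the wall-clock time of its forward pass.
    def pre_hook(module, inputs):
        module._t0 = time.time()

    def post_hook(module, inputs, output):
        elapsed = time.time() - module._t0
        with open(log_path, "a") as f:
            print(f"{module.__class__.__name__}: {elapsed:.4f} s", file=f)

    for m in model.modules():
        m.register_forward_pre_hook(pre_hook)
        m.register_forward_hook(post_hook)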
Note also that maskrcnn_benchmark has been deprecated and an updated version is available in the form of detectron2. Consider moving your code to it for significant speed improvements in the architecture.
BottleneckWithFixedBatchNorm is not the most expensive operation in the architecture and, despite its name, is certainly not what creates the bottleneck. The class is not that computationally expensive and is computed in parallel even on a lower-end CPU machine (at least at the inference stage).
An example of how to better track the performance of each module can be seen in the following code, taken from maskrcnn_benchmark/modeling/backbone/resnet.py:
import time  # needed for the timing added to forward()

class ResNet(nn.Module):
    def __init__(self, cfg):
        super(ResNet, self).__init__()
        # If we want to use the cfg in forward(), then we should make a copy
        # of it and store it for later use:
        # self.cfg = cfg.clone()
        # Translate string names to implementations
        stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
        stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
        transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]
        # Construct the stem module
        self.stem = stem_module(cfg)
        # Construct the specified ResNet stages
        num_groups = cfg.MODEL.RESNETS.NUM_GROUPS
        width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
        in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
        stage2_bottleneck_channels = num_groups * width_per_group
        stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
        self.stages = []
        self.return_features = {}
        for stage_spec in stage_specs:
            name = "layer" + str(stage_spec.index)
            stage2_relative_factor = 2 ** (stage_spec.index - 1)
            bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor
            out_channels = stage2_out_channels * stage2_relative_factor
            stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index - 1]
            module = _make_stage(
                transformation_module,
                in_channels,
                bottleneck_channels,
                out_channels,
                stage_spec.block_count,
                num_groups,
                cfg.MODEL.RESNETS.STRIDE_IN_1X1,
                first_stride=int(stage_spec.index > 1) + 1,
                dcn_config={
                    "stage_with_dcn": stage_with_dcn,
                    "with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
                    "deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
                }
            )
            in_channels = out_channels
            self.add_module(name, module)
            self.stages.append(name)
            self.return_features[name] = stage_spec.return_features
        # Optionally freeze (requires_grad=False) parts of the backbone
        self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)

    def _freeze_backbone(self, freeze_at):
        if freeze_at < 0:
            return
        for stage_index in range(freeze_at):
            if stage_index == 0:
                m = self.stem  # stage 0 is the stem
            else:
                m = getattr(self, "layer" + str(stage_index))
            for p in m.parameters():
                p.requires_grad = False

    def forward(self, x):
        start_timer = time.time()
        outputs = []
        x = self.stem(x)
        for stage_name in self.stages:
            x = getattr(self, stage_name)(x)
            if self.return_features[stage_name]:
                outputs.append(x)
        print("ResNet time :: ", time.time() - start_timer, file=open("timelogger.log", "a"))
        return outputs
The only change that has to be made is in the forward pass; every instance created from this class will then inherit the behaviour and log its time (here written to a file rather than plain stdout).

TensorflowJS takes too long

When I run TensorFlow.js in the browser, especially on a phone, it takes really long to predict and sometimes doesn't even work. I'm already using the optimized graph. I want to know if there is any way to speed it up, whether by running a warm-up prediction before the page loads so that the second one is faster, or anything else.
I am using the InceptionV3 architecture, and the input size is 299 by 299; if I could make that smaller it could perhaps go faster, but that would mean retraining my model. Note: I am not training with TensorFlow.js, only making predictions. Here is the relevant code:
var ctx = canvas.getContext('2d');
var file = ctx.getImageData(0,0,120,120);
const raw = tf.fromPixels(file).toFloat();
const resized = tf.image.resizeBilinear(raw, [299, 299])
const offset = tf.scalar(127);
const normalized = resized.sub(offset).div(offset);
batched = normalized.expandDims(0);
f = model.execute(batched).dataSync();

Scipy, Numpy: Audio classifier, Voice/Speech Activity Detection

I am writing a program to automatically classify recorded phone call audio files (wav files) as containing at least some human voice or not (only DTMF, dial tones, ringtones, noise).
My first approach was implementing a simple VAD (voice activity detector) using ZCR (zero crossing rate) and calculating energy, but both of these parameters confuse DTMF and dial tones with voice. This technique failed, so I implemented a trivial method that calculates the variance of the FFT between 200 Hz and 300 Hz. My numpy code is as follows:
wavefft = np.abs(fft(frame))
n = len(frame)
fx = np.arange(0,fs,float(fs)/float(n))
stx = np.where(fx>=200)
stx = stx[0][0]
endx = np.where(fx>=300)
endx = endx[0][0]
return np.sqrt(np.var(wavefft[stx:endx]))/1000
This resulted in 60% accuracy.
Next, I tried implementing a machine-learning-based approach using an SVM (Support Vector Machine) and MFCCs (Mel-frequency cepstral coefficients). The results were totally incorrect; almost all samples were marked wrongly. How should one train an SVM with MFCC feature vectors? My rough code using scikit-learn is as follows:
[samplerate, sample] = wavfile.read ('profiles/noise.wav')
noiseProfile = MFCC(samplerate, sample)
[samplerate, sample] = wavfile.read ('profiles/ring.wav')
ringProfile = MFCC(samplerate, sample)
[samplerate, sample] = wavfile.read ('profiles/voice.wav')
voiceProfile = MFCC(samplerate, sample)
machineData = []
for noise in noiseProfile:
    machineData.append(noise)
for voice in voiceProfile:
    machineData.append(voice)
dataLabel = []
for i in range(0, len(noiseProfile)):
    dataLabel.append(0)
for i in range(0, len(voiceProfile)):
    dataLabel.append(1)
clf = svm.SVC()
clf.fit(machineData, dataLabel)
What alternative approach could I implement?
If you don't have to use scipy/numpy, you might check out webrtcvad, which is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code. WebRTC's VAD uses Gaussian Mixture Models (GMMs), works well, and is very fast.
Here's an example of how you might use it:
import webrtcvad

# audio must be 16-bit mono PCM, at 8 kHz, 16 kHz or 32 kHz.
def audio_contains_voice(audio, sample_rate, aggressiveness=0, threshold=0.5):
    # Frames must be 10, 20 or 30 ms long.
    frame_duration_ms = 30
    # Assuming split_audio is a function that will split audio into
    # frames of the correct size.
    frames = split_audio(audio, sample_rate, frame_duration_ms)
    # aggressiveness tells the VAD how aggressively to filter out non-speech.
    # 0 will have the most false positives for speech, 3 the least.
    vad = webrtcvad.Vad(aggressiveness)
    num_voiced = len([f for f in frames if vad.is_speech(f, sample_rate)])
    return float(num_voiced) / len(frames) > threshold
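The split_audio helper above is assumed rather than part of webrtcvad; a minimal sketch for 16-bit mono PCM bytes might look like this:

def split_audio(audio, sample_rate, frame_duration_ms):
    # audio is raw 16-bit mono PCM as bytes, so each sample is 2 bytes.
    bytes_per_frame = int(sample_rate * frame_duration_ms / 1000) * 2
    # Drop any trailing partial frame, since the VAD requires exact frame sizes.
    return [audio[i:i + bytes_per_frame]
            for i in range(0, len(audio) - bytes_per_frame + 1, bytes_per_frame)]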