Is there a way of compiling (i.e. caching) a cupy RawKernel before calling it? - cupy

I'm writing a python application that processes a lot of images.
The computation speed of the application is important, thus I'm trying to minimize the execution time by writing cupy kernels.
For the sake of simplicity, assume that I have a cupy raw kernel below.
import cupy as cp
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void add_one(float* dimg, float* y) {
    int j = threadIdx.x;
    int i = blockIdx.x;
    int k = blockDim.x;
    int tid = k*i+j;
    y[tid] = dimg[tid] + 1;
}
''', 'add_one')
if __name__ == '__main__':
    h, w = 192, 256
    dimg_cp = cp.zeros(shape=(h, w), dtype=cp.float32)
    y = cp.zeros(shape=(h, w), dtype=cp.float32)
    # one block per row, one thread per column
    add_kernel((h,), (w,), (dimg_cp, y))
Here, 'add_kernel' reads an input matrix, adds one to every element, and writes the result to y. It works, but I believe the code can be further optimized for execution speed.
According to the link, the first call to the kernel (i.e. when it is not yet cached) incurs a compilation overhead.
I want to avoid this compilation time.
So I want to ask: is there a way to compile cp.RawKernel before calling the kernel for the first time?
Thanks in advance.

There is currently no explicit way to precompile the kernel without calling it. One easy solution is just calling it once with a small input. Note that the compiled kernel is also cached to a file, so the overhead only exists at the first execution of the script in the environment.
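The warm-up call suggested above could be wrapped in a small helper. This is only a sketch: warm_up is a hypothetical name, and the helper assumes nothing beyond the kernel(grid, block, args) call signature that add_kernel uses in the question.

```python
import time


def warm_up(kernel, grid, block, args):
    """Launch `kernel` once so that any lazy compilation happens now.

    `kernel` is anything callable as kernel(grid, block, args), which is how
    add_kernel is called in the question. Returns the wall-clock duration of
    the warm-up call in seconds.
    """
    start = time.perf_counter()
    kernel(grid, block, args)
    return time.perf_counter() - start
```

With CuPy this might be run once at start-up, for example warm_up(add_kernel, (1,), (1,), (cp.zeros(1, dtype=cp.float32), cp.zeros(1, dtype=cp.float32))), so that later launches on real images no longer pay the compilation cost.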


OpenCL Kernel and traditional loops

I'm studying OpenCL and I don't understand the relationship between a traditional loop in C/C++ code and kernel code.
To be clear, consider a situation like this:
So my question is: in a traditional loop I have the variable n as my boundary, while in kernel code I don't have it; instead I have get_global_id(0), which indexes into my array. Does this mean that I start from 0 and iterate until get_global_id reaches the maximum size of the array, n in this case? Or is it something different?
Because for this other example I don't know how to write the corresponding kernel code:
I hope my question is clear; sorry, my English is not very good.
Thanks in advance for the help, and let me know if anything is unclear!
An OpenCL kernel is written like a single iteration of a for-loop, but all iterations run in parallel and in no guaranteed order.
Consider this vector addition example in C++, where for i=0..N-1 you add the elements of the vectors one after the other:
for(int i=0; i<N; i++) { // loop index i
    C[i] = A[i]+B[i]; // compute one after the other
}
In OpenCL, the vector addition looks like the inside of this for-loop, but as a function with the kernel keyword and all vectors as parameters:
kernel void add_kernel(const global float* A, const global float* B, global float* C) {
    const int i = get_global_id(0);
    C[i] = A[i]+B[i]; // compute all loop indices i in parallel
}
You might be wondering: where is N? You pass N to the kernel from the C++ side as its "global range", so the runtime knows how many work items i to run in parallel.
Because in the OpenCL kernel every iteration runs in parallel, there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other). In your second example with A[i] = B[i-1]+B[i]+B[i+1] you do exactly that: only read from B, only write to A. The implementation with periodic boundaries can be done branch-less, see here.
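To make the correspondence concrete, here is a plain-Python emulation of the second example, written as one function per work item. It is a sketch under assumptions: the linked branch-less version is not reproduced here, so the modulo wrap-around is just one way to get periodic boundaries without an if, and run_kernel is a made-up stand-in for enqueueing a kernel with a global range.

```python
import random


def stencil_work_item(i, A, B, n):
    """Body of the kernel for a single global id i (what get_global_id(0) returns)."""
    left = (i + n - 1) % n   # wraps 0 -> n-1 without a branch
    right = (i + 1) % n      # wraps n-1 -> 0 without a branch
    A[i] = B[left] + B[i] + B[right]  # only reads B, only writes A


def run_kernel(work_item, global_range, *args):
    """Emulate a launch: call the body once per id, in an arbitrary order."""
    ids = list(range(global_range))
    random.shuffle(ids)  # order must not matter, as on a real device
    for i in ids:
        work_item(i, *args)


B = [1, 2, 3, 4]
A = [0] * len(B)
run_kernel(stencil_work_item, len(B), A, B, len(B))
```

Because every work item reads only B and writes only its own A[i], the shuffled order gives the same result as the sequential loop.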

Speeding up Inference time on GPT2 - optimizing

I am trying to optimize the inference time on GPT2. The current time to generate a sample after calling the script is 55 secs on Google Colab. I put in timestamps to try to isolate where the bottleneck is.
This is the code:
for _ in range(nsamples // batch_size):
    out = sess.run(output, feed_dict={
        context: [context_tokens for _ in range(batch_size)]
    })[:, len(context_tokens):]
    for i in range(batch_size):
        generated += 1
        text = enc.decode(out[i])
        print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
        print(text)
    print("=" * 80)
The line
out = sess.run(output, feed_dict={
    context: [context_tokens for _ in range(batch_size)]
})[:, len(context_tokens):]
is where the complexity lies. Does anyone know how I can improve this piece of code? Thank you so much!
batch_size is set to 1 in GPT2 and there is no way to change that without crashing the process. So "[context_tokens for _ in range(batch_size)]" means "[context_tokens for _ in range(1)]", which means "[context_tokens]". Writing it that way will not improve speed by much, but it is safe and makes the code a bit easier to read. The real complexity is that the session is accessing a 6 gigabyte behemoth held in your RAM.
As a practical matter, the fewer tokens you send over, and the less processing those tokens take, the faster this part will execute, because each token has to be sent through the GPT2 model. The trade-off is that the response will be less 'intelligent'.
By the way, // is the integer division operator, so nsamples // batch_size = nsamples // 1 = nsamples. And from what I have seen, nsamples was 1 when I printed its value with print(nsamples). So that for loop is another loop over one item, which means the loop can be removed.
GPT2 is just a model implemented in tensorflow. Look up: how to make a graph in tensorflow; how to call a session for that graph; how to make a saver save the variables in that session; and how to use the saver to restore the session. You will learn about checkpoints, meta files and other implementation details that will make your files make more sense.
The tensorflow module is found in Lib, site-packages, tensorflow_core (at least in the AI Dungeon 2 Henk717 fork). Most of the processing happens in the subdirectories python/ops and framework. You will see these pop up if your code breaks the hooks tf was expecting.
If this question regards the implementation in AI Dungeon, the best I have been able to implement is a recursive call to generator.generate that is exited by a try except KeyboardInterrupt:, with a print(token, end = '', flush = True) for each token as it is generated. This way you can view each token as the AI generates it, rather than waiting 55 sec for a ping sound.
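Also, to silence the Cuda warnings, set the log level as a string before tensorflow is imported:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
The value has to be the string '3', not the integer 3; single and double quotes are equivalent in Python.
That will take off the cuda warnings when tensorflow is imported.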
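Next, there are deprecation warnings that pop up from the implementation of GPT2 in tensorflow versions above 1.5.
To shut those off, alias the compat module and call the v1 API through it (for example tfv.Session instead of tf.Session):
tfv = tf.compat.v1
That is all you need. You don't need to import warnings.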
Even so, there is a long load time between the tf initialization, the initial sample generation and the loading of the module into RAM. In model.shape_list(x) I added
the following line:
print("_",end ='',flush=True)
That way, at least while the module is being built on the local machine, you can view a "progress bar" of sorts.

what is the biggest bottleneck in maskrcnn_benchmark repo?

I am working on a repo that makes use of the maskrcnn_benchmark repo. I have explored the benchmarking repo extensively to find the cause of its slower performance on a CPU compared with the linked reference.
To benchmark the individual forward passes, I put a time counter around each part, which gives me the time required to compute each component. I have had a tough time pinpointing the slowest component of the entire architecture. I believe it to be the BottleneckWithFixedBatchNorm class in the maskrcnn_benchmark/modeling/backbone/ file.
I would really appreciate any help in locating the biggest bottleneck in this architecture.
I have faced the same problem. The best solution is to look inside the main code, go through the forward pass of each module, and set up a timer to log the time spent in the computations of each module. We did this by adding a time logger to each class, so that every instance of the class logs its execution time. After thorough comparison, at least in our case, we found that the reason for the delay was the depth of the ResNet module. Given the computational cost of ResNet this is not surprising; the only remedy is more parallelization, so either use a bigger GPU for the task or reduce the depth of the ResNet network.
I must point out that maskrcnn_benchmark has been deprecated and an updated version is available in the form of detectron2. Consider moving your code there for significant speed improvements in the architecture.
BottleneckWithFixedBatchNorm is not the most expensive operation in the architecture and, despite its name, is certainly not what creates the bottleneck. The class isn't that computationally expensive and is computed in parallel even on a lower-end CPU machine (at least in the inference stage).
An example of how to better track the performance of each module can be found in the code taken from the path: maskrcnn_benchmark/modeling/backbone/
class ResNet(nn.Module):
    def __init__(self, cfg):
        super(ResNet, self).__init__()
        # If we want to use the cfg in forward(), then we should make a copy
        # of it and store it for later use:
        # self.cfg = cfg.clone()
        # Translate string names to implementations
        stem_module = _STEM_MODULES[cfg.MODEL.RESNETS.STEM_FUNC]
        stage_specs = _STAGE_SPECS[cfg.MODEL.BACKBONE.CONV_BODY]
        transformation_module = _TRANSFORMATION_MODULES[cfg.MODEL.RESNETS.TRANS_FUNC]
        # Construct the stem module
        self.stem = stem_module(cfg)
        # Construct the specified ResNet stages
        num_groups = cfg.MODEL.RESNETS.NUM_GROUPS
        width_per_group = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
        in_channels = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
        stage2_bottleneck_channels = num_groups * width_per_group
        stage2_out_channels = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
        self.stages = []
        self.return_features = {}
        for stage_spec in stage_specs:
            name = "layer" + str(stage_spec.index)
            stage2_relative_factor = 2 ** (stage_spec.index - 1)
            bottleneck_channels = stage2_bottleneck_channels * stage2_relative_factor
            out_channels = stage2_out_channels * stage2_relative_factor
            stage_with_dcn = cfg.MODEL.RESNETS.STAGE_WITH_DCN[stage_spec.index - 1]
            module = _make_stage(
                transformation_module,
                in_channels,
                bottleneck_channels,
                out_channels,
                stage_spec.block_count,
                num_groups,
                cfg.MODEL.RESNETS.STRIDE_IN_1X1,
                first_stride=int(stage_spec.index > 1) + 1,
                dcn_config={
                    "stage_with_dcn": stage_with_dcn,
                    "with_modulated_dcn": cfg.MODEL.RESNETS.WITH_MODULATED_DCN,
                    "deformable_groups": cfg.MODEL.RESNETS.DEFORMABLE_GROUPS,
                },
            )
            in_channels = out_channels
            self.add_module(name, module)
            self.stages.append(name)
            self.return_features[name] = stage_spec.return_features
        # Optionally freeze (requires_grad=False) parts of the backbone
        self._freeze_backbone(cfg.MODEL.BACKBONE.FREEZE_CONV_BODY_AT)
    def _freeze_backbone(self, freeze_at):
        if freeze_at < 0:
            return
        for stage_index in range(freeze_at):
            if stage_index == 0:
                m = self.stem  # stage 0 is the stem
            else:
                m = getattr(self, "layer" + str(stage_index))
            for p in m.parameters():
                p.requires_grad = False
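    def forward(self, x):
        start_timer = time.time()  # needs `import time` at the top of the file
        outputs = []
        x = self.stem(x)
        for stage_name in self.stages:
            x = getattr(self, stage_name)(x)
            if self.return_features[stage_name]:
                outputs.append(x)
        # append the elapsed time of this forward pass to the log file
        with open("timelogger.log", "a") as log_file:
            print("ResNet time :: ", time.time() - start_timer, file=log_file)
        return outputs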
The only change that has to be made is in the forward pass; every instance created from this class will then log its time (here written to a file instead of plain stdout).
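If editing every forward() by hand gets tedious, the same per-class logging could be attached with a class decorator. This is only a sketch of that idea, not code from maskrcnn_benchmark: log_forward_time, LOG_PATH and ToyStage are made-up names, and the toy class stands in for a real nn.Module so the snippet runs without PyTorch.

```python
import functools
import os
import tempfile
import time

# Hypothetical log location, chosen for this sketch only.
LOG_PATH = os.path.join(tempfile.gettempdir(), "timelogger.log")


def log_forward_time(cls):
    """Class decorator: wrap cls.forward so each call appends its duration to LOG_PATH."""
    original_forward = cls.forward

    @functools.wraps(original_forward)
    def timed_forward(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original_forward(self, *args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            with open(LOG_PATH, "a") as log_file:
                log_file.write(f"{type(self).__name__} time :: {elapsed:.6f}\n")

    cls.forward = timed_forward
    return cls


@log_forward_time
class ToyStage:
    """Stand-in for a module such as ResNet; it only needs a forward() method."""

    def forward(self, x):
        return [value * 2 for value in x]
```

Sorting the resulting log by class name then shows which module dominates the run time. On a GPU the timings would only be meaningful after synchronizing the device, which this sketch does not do.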

GPU Array multiplications using Pycuda on Numpy arrays

I have tried to implement Element-wise multiplication of two numpy arrays by making similar GPU arrays and performing the operations. However, the resulting execution time is much slower than the original numpy pointwise multiplication. I was hoping to get a good speedup using the GPU. zz0 is complex128 type, (64,256,16) shape numpy array and xx0 is float64 type,(16,151) shape numpy array. Can someone please help me figure out what I am doing wrong with respect to the implementation:
import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg
# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
    # expand both operands to the full 4D output shape on the host
    zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
    # copy to the GPU, multiply element-wise, copy the result back
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg,xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp
zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')
xx0_exp = np.exp(-1j*xx0)
t1 = time.time()
#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
np.save('zz0_Hyp',zz0_Hyp)
t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))
#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
np.save('zz0_Hyp_actual',zz0_Hyp_actual)
t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))
The first issue is that your timing metrics are not accurate.
Linalg compiles cuda modules on the fly, and you may see code being compiled as you run it. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw massive gains in performance, e.g.:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
However, that is still much slower than the CPU version. The next thing I did was time only the part where the actual computation happens. You aren't tiling in your numpy version, so don't time it in your cuda version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU and include that in the timing, but the copy itself takes a non-trivial amount of time, so let's remove that:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
REAL Time taken with GPU:0.3689603805541992
Wow, that contributed a lot. But we are still slower than the numpy version. Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting. To get the real speed, you would have to:
not tile
broadcast dimensions
implement this in a kernel.
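The "numpy doesn't copy for broadcasting" point can be checked on the CPU with a few lines. This is a sketch with small made-up shapes (not the ones from the question); it shows that the broadcast operands are zero-stride views of the originals while giving the same result as tiling.

```python
import numpy as np

# Small stand-ins for zz0 (w, z, y) and xx0_exp (y, x); the sizes are arbitrary.
zz = np.arange(4 * 5 * 3).reshape(4, 5, 3) + 1j
xx = np.exp(-1j * np.arange(3 * 7, dtype=np.float64).reshape(3, 7))

zz4 = zz[:, :, :, None]     # shape (4, 5, 3, 1)
xx4 = xx[None, None, :, :]  # shape (1, 1, 3, 7)

# Broadcasting: both operands are viewed as (4, 5, 3, 7) without copying.
# The broadcast axes simply get a stride of 0.
zz_view, xx_view = np.broadcast_arrays(zz4, xx4)
broadcast_result = zz4 * xx4

# Tiling: both operands are physically expanded before multiplying.
tiled_result = np.tile(zz4, (1, 1, 1, 7)) * np.tile(xx4, (4, 5, 1, 1))
```

A GPU kernel that indexes zz and xx the same way (reusing one element along the broadcast axis instead of reading a tiled copy) avoids both the host-side tiling and the extra memory traffic.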
Pycuda provides the utilities for kernel implementation, but its GPU array does not provide broadcasting. Essentially what you would have to do is this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor, how much extra work to do per thread, may be changed but affects the launch dimensions. thread group size should be (work_tile_dim, work_tile_dim/ilp_factor)
constexpr unsigned ilp_factor = 4;
//assuming c order:
// x axis contiguous out,
// y axis contiguous in zz,
// x axis contiguous in xx
// using restrict because we know that all pointers will refer to different parts of memory.
__global__ void element_wise_multiplication(
    pycuda::complex<double>* __restrict__ array_zz,
    pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w, /*size of w,z,y, dimensions used in zz*/
    unsigned array_zz_z,
    unsigned array_zz_xx_y,/*size of y,x, dimensions used in xx, but both have same y*/
    unsigned array_xx_x){
    // z dimensions in blocks often have restrictions on size that can be fairly small, and sometimes can cause performance issues on older cards, so x and y come from the x and y block indices and z and w are packed together into blockIdx.z.
    unsigned x_idx = blockIdx.x * (work_tile_dim) + threadIdx.x;
    unsigned y_idx = blockIdx.y * (work_tile_dim) + threadIdx.y;
    //blockIdx.z stores both z and w and should not overshoot; z_idx and w_idx are not used below,
    //they are shown for the sake of how to get these dimensions.
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    //we already know this part of the indexing calculation.
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_x);
    // since our input array is actually 3D, this is a different calculation
    unsigned array_zz_zw = blockIdx.z * (array_zz_xx_y);
    //ensures if our launch dimensions don't exactly match our input size, we don't
    //accidentally access out of bound memory, while branching can be bad, this isn't
    // because 99.999% of the time no branch will occur and our instruction pointer
    //will be the same per warp, meaning virtually zero cost.
    if(x_idx < array_xx_x){
        //moving over y axis to coalesce memory accesses in the x dimension per warp.
        for(unsigned i = 0; i < ilp_factor; ++i){
            //each thread handles ilp_factor rows, spaced one thread-block height apart,
            //so that the (32,8) block covers the whole 32x32 work tile
            unsigned y = y_idx + i * (work_tile_dim / ilp_factor);
            //need to also check y, these checks are virtually cost-less
            // because memory access dominates time in such simple calculations,
            // and arithmetic will be hidden by overlapping execution
            if(y < array_zz_xx_y){
                //splitting up calculation for simplicity sake
                unsigned out_array_idx = out_idx_zw + y*array_xx_x + x_idx;
                unsigned array_zz_idx = array_zz_zw + y;
                unsigned array_xx_idx = (y * array_xx_x) + x_idx;
                //actual final output.
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}
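You will have to make the launch dimensions something like:
work_tile_dim, ilp_factor = 32, 4 # must match the constants in the kernel
thread_dim = (work_tile_dim, work_tile_dim//ilp_factor, 1) # (32,8,1), threads per block
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# number of blocks per axis (the grid), rounded up so partial tiles are covered
block_dim = ((x_dim + work_tile_dim - 1)//work_tile_dim, (y_dim + work_tile_dim - 1)//work_tile_dim, wz_dim)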
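As a worked check with the shapes from the question: the number of blocks has to be rounded up with integer arithmetic, otherwise the 151 columns (not a multiple of 32) would either give a float or lose the last partial tile. The helper names below are made up for this sketch.

```python
def ceil_div(a, b):
    """Integer division rounded up, e.g. ceil_div(151, 32) == 5."""
    return (a + b - 1) // b


def launch_dims(zz_shape, xx_shape, work_tile_dim=32, ilp_factor=4):
    """Return (threads per block, blocks per grid) for the kernel sketched above.

    zz_shape is (w, z, y) and xx_shape is (y, x); the tile constants are
    assumed to match the ones compiled into the kernel.
    """
    y_dim, x_dim = xx_shape
    wz_dim = zz_shape[0] * zz_shape[1]
    threads = (work_tile_dim, work_tile_dim // ilp_factor, 1)
    blocks = (ceil_div(x_dim, work_tile_dim), ceil_div(y_dim, work_tile_dim), wz_dim)
    return threads, blocks


threads, blocks = launch_dims((64, 256, 16), (16, 151))
```

For these shapes the x axis needs 5 blocks, the last of which is only partly filled, which is why the kernel has to bounds-check x and y.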
And there are several further optimizations you may be able to take advantage of:
stage the global memory accesses of a work tile in shared memory inside the kernel. This ensures that the accesses along zz0's "y" dimension (which is really its contiguous x dimension) are coalesced when loaded into shared memory; the values are then read from shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
instead of calculating Euler's formula on the host and expanding a double into a complex double, expand it inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result while using far less memory bandwidth (see here).
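The second point relies on exp(-ix) = cos(x) - i sin(x), so the kernel only needs the real x plus one sine and one cosine. A quick host-side check of that identity (expand_phase is a made-up helper name for this sketch):

```python
import cmath
import math


def expand_phase(x):
    """Return exp(-1j * x) built from one cosine and one sine of x.

    This mirrors what sincos(-x, &s, &c) yields in the kernel:
    c = cos(-x) = cos(x) and s = sin(-x) = -sin(x).
    """
    return complex(math.cos(x), -math.sin(x))


samples = [0.0, 0.5, 2.0, 4.0]
max_error = max(abs(expand_phase(x) - cmath.exp(-1j * x)) for x in samples)
```

Passing the float64 xx0 to the kernel instead of the complex128 xx0_exp halves the bytes read for that operand.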
But note, even doing this will likely not give you the performance you want (though it will still likely be faster) unless you are on a higher-end GPU with full-rate double precision units, which most GPUs lack (they have only a few such units, so double throughput is a small fraction of float throughput). Double precision floating point units take up a lot of die space, and since GPUs are built for graphics, they don't have much use for double precision. If you want higher precision than single-precision floating point, but want to take advantage of the float hardware without the 1/8 to 1/32 throughput hit of double, you can use the techniques in this answer to achieve this on the GPU, getting you closer to 1/2 to 1/3 throughput.

How to speed up ExternalOptimizerInterface in Tensorflow?

I just ran a benchmark, and ExternalOptimizerInterface from the Tensorflow optimization package is almost twice as slow as a normal optimizer.
It makes me wonder what the point of it is. ExternalOptimizerInterface is clearly unfeasible for modern deep learning. Is there any way to speed up ExternalOptimizerInterface?
Here's a snippet of my ExternalOptimizer:
def _minimize(self, initial_val, loss_grad_func, equality_funcs,
              equality_grad_funcs, inequality_funcs, inequality_grad_funcs,
              step_callback, optimizer_kwargs, packed_bounds=None):
    self.t += 1
    current_val = initial_val
    _, grad = loss_grad_func(current_val)
    delta = - grad * self.learning_rate
    new_val = current_val + delta
    return new_val
You have not mentioned whether you are running it on a GPU or a CPU. Nevertheless, the performance difference comes from the fact that GradientDescentOptimizer uses a single optimized kernel. Your implementation using ExternalOptimizerInterface is built from primitive operations, and Tensorflow cannot optimize across kernels.
The underlying kernel ApplyGradientDescentOp is defined here.
You can run both implementations and compare them using a profiler such as tf-prof for more details.