How do I create a kernel where a linear kernel is raised to a fractional power?
I know it can be done in sklearn.gaussian_process as below:
kernel = DotProduct() ** 0.5
How can I create this kernel in GPy?
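For reference, here is a minimal, self-contained version of the sklearn construction above (this is only the sklearn side, not GPy; the data is made up for illustration):
import numpy as np
from sklearn.gaussian_process.kernels import DotProduct

# DotProduct() ** 0.5 builds an Exponentiation kernel wrapping the linear (dot-product) kernel
kernel = DotProduct() ** 0.5
X = np.random.rand(5, 2)
print(kernel)             # the composed kernel: DotProduct raised to the power 0.5
print(kernel(X).shape)    # (5, 5) Gram matrix; entries are (sigma_0**2 + x.x')**0.5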
Related
I'm a beginner with Metal and am trying to understand the Metal implementation of convolution in TFLite. After reading this line of code and finding all its usages, I'm really confused about why the work_group_launch_order matters. How does this mechanism actually work? Is it related to the way the GPU linearizes the 3D threadgroup?
[Tensorflow GitHub]
For simplicity, let me try to explain the thread dispatch strategy for a certain kernel (named ConvolutionGeneric) in TFLite.
For a given 3D thread count t=<tx, ty, tz> and threadgroup shape s=<sx, sy, sz>, TFLite calculates the number of desired threadgroups n=<nx, ny, nz> by nx=ceil(tx/sx), ny=ceil(ty/sy), and nz=ceil(tz/sz), which is absolutely normal.
In the normal way, we would dispatch <nx, ny, nz> threadgroups along the three dimensions and obtain the thread position in the grid inside the Metal kernel function from the argument gid with the attribute [[thread_position_in_grid]]. The thread position determines which area the current thread is responsible for.
However, TFLite chose a strange way: it dispatches <nz, nx, ny> threadgroups along the three dimensions and reconstructs the thread position in the grid inside the Metal kernel function from tid3d [[thread_position_in_threadgroup]] and group_id [[threadgroup_position_in_grid]] as
gid_x = group_id.y * sx + tid3d.x;
gid_y = group_id.z * sy + tid3d.y;
gid_z = group_id.x * sz + tid3d.z;
What surprises me is that this strategy really does boost performance (~10% speedup).
Can someone explain the underlying mechanism behind this unusual threadgroup dispatch strategy?
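For what it's worth, here is a plain-Python sketch (not Metal code, just the index arithmetic, with made-up sizes) illustrating that the permuted dispatch described above covers exactly the same set of grid positions as the normal one; only the mapping from threadgroup axes to logical axes changes:
from itertools import product
from math import ceil

def grid_ids_normal(t, s):
    (tx, ty, tz), (sx, sy, sz) = t, s
    nx, ny, nz = ceil(tx / sx), ceil(ty / sy), ceil(tz / sz)
    # gid is taken directly from [[thread_position_in_grid]]
    return {(gx * sx + lx, gy * sy + ly, gz * sz + lz)
            for gx, gy, gz in product(range(nx), range(ny), range(nz))
            for lx, ly, lz in product(range(sx), range(sy), range(sz))}

def grid_ids_tflite(t, s):
    (tx, ty, tz), (sx, sy, sz) = t, s
    nx, ny, nz = ceil(tx / sx), ceil(ty / sy), ceil(tz / sz)
    # <nz, nx, ny> threadgroups are dispatched; gid is rebuilt as in the question
    return {(g1 * sx + lx, g2 * sy + ly, g0 * sz + lz)
            for g0, g1, g2 in product(range(nz), range(nx), range(ny))
            for lx, ly, lz in product(range(sx), range(sy), range(sz))}

assert grid_ids_normal((16, 16, 8), (4, 4, 4)) == grid_ids_tflite((16, 16, 8), (4, 4, 4))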
I’m running an object detection routine on a server.
I set the context to the GPU, and I'm loading the model, the parameters and the data onto the GPU. The program reads from a video file or from an RTSP stream using OpenCV.
When using nvidia-smi, I see that the selected GPU's usage is at 20%, which is reasonable. However, the object detection routine is still using 750-1200% of the CPU (basically, all of the available cores of the server).
This is the code:
import cv2
import mxnet as mx
import gluoncv as gcv

def main():
    ctx = mx.gpu(3)
    # -------------------------
    # Load a pretrained model
    # -------------------------
    net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True)
    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")
    count_frame = 0
    while True:
        print(f"Frame: {count_frame}")
        # Load frame from the camera
        ret, frame = cap.read()
        if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
            cv2.destroyAllWindows()
            cap.release()
            print("Done!!!")
            break
        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        if isinstance(frame_nd, mx.ndarray.ndarray.NDArray):
            frame_nd.wait_to_read()
        # Run frame through network
        frame_nd = frame_nd.as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(frame_nd)
        if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
            class_IDs.wait_to_read()
        if isinstance(scores, mx.ndarray.ndarray.NDArray):
            scores.wait_to_read()
        if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
            bounding_boxes.wait_to_read()
        count_frame += 1
    cv2.destroyAllWindows()
    cap.release()

if __name__ == "__main__":
    main()
This is the output of nvidia-smi:
while this is the output of top:
The pre-processing operations are running on the CPU:
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
but is that enough to justify such high CPU usage? If so, can I run them on the GPU as well?
EDIT: I modified and copied the whole code, in response to Olivier_Cruchant's comment (thanks!)
Your CPU is likely busy because of the pre-processing load and the frequent back-and-forth from memory to GPU, since inference seems to be running frame by frame.
I would suggest trying the following (a rough sketch of the batching idea follows the list):
Run a batched inference (send a batch of N frames to the network) to increase GPU usage and reduce communication.
Try using NVIDIA DALI to make better use of the GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion PyTorch example).
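As a rough illustration of the first suggestion, here is a hypothetical batching helper (the function name and batch assembly are assumptions, and it presumes all transformed frames share the same shape so they can be concatenated):
import mxnet as mx

def detect_batch(net, frames_nd, ctx):
    # frames_nd: list of (1, 3, H, W) NDArrays produced by transform_test
    batch = mx.nd.concat(*frames_nd, dim=0).as_in_context(ctx)   # (N, 3, H, W)
    class_ids, scores, bboxes = net(batch)
    mx.nd.waitall()   # synchronize once per batch instead of once per frame
    return class_ids, scores, bboxes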
I added a new GPU kernel op (MyGpuOp), then used the op in Python and specified the device id with tf.device('/gpu:0').
For example:
with tf.device('/gpu:0'):
    out1 = my_gpu_op(input1)
with tf.device('/gpu:1'):
    out2 = my_gpu_op(input2)
Then I call sess.run([out1, out2]), but the two ops don't seem to run concurrently on the different GPU devices, because the runtime of sess.run([out1, out2]) is twice that of sess.run(out1).
In the C++ op wrapper (MyGpuOp), I launch the CUDA kernel by passing a stream (cudaStream_t),
like this:
// ctx is the OpKernelContext
const GPUDevice& d = ctx->eigen_device<GPUDevice>();
cudaStream_t stream = d.stream();
MyKernelName<<<grid, block, 0, stream>>>(....);
Do I need to get the device id and call cudaSetDevice(device_id) before launching the kernel function?
In tensorflow I register an op like so:
REGISTER_OP("RimeBSqrt")
.Input("stokes: FT")
.Input("alpha: FT")
.Input("frequency: FT")
.Input("ref_freq: FT")
.Output("b_sqrt: CT")
.Attr("FT: {float, double} = DT_FLOAT")
.Attr("CT: {complex64, complex128} = DT_COMPLEX64");
All of the above inputs are tensors,
but ref_freq is a scalar or 0-D tensor.
In the Compute() method of my CPU kernel
I can do the following to extract the scalar:
const Tensor & in_ref_freq = context->input(3);
FT ref_freq = in_ref_freq.tensor<FT, 1>()(0);
However, the same kind of code generates a segfault in the Compute() method of my GPU kernel, because the CPU then tries to access a block of memory on the GPU device. Is there any way to intercept this scalar value before sending it into the GPU? I'd like to avoid the following extra level of memory indirection in a CUDA kernel:
template <typename FT>
__global__ void kernel(..., FT * ref_freq, ...)
{
    FT value = ref_freq[0];
}
I don't think Attr is the right approach for ref_freq, since it is a changeable, configurable value.
CPU Tensorflow kernel code is here.
GPU Tensorflow kernel code is here.
Python variable setup code is here.
You can specify that one or more of the inputs to (or outputs from) a TensorFlow OpKernel are in "host memory", which allows you to access the value in the Compute() method. To do this you would modify your REGISTER_KERNEL_BUILDER() call to add a .HostMemory("ref_freq") instruction:
REGISTER_KERNEL_BUILDER(
    Name("RimeBSqrt")
    .Device(tensorflow::DEVICE_GPU)
    .TypeConstraint<float>("FT")
    .TypeConstraint<tensorflow::complex64>("CT")
    .HostMemory("ref_freq"),
    RimeBSqrt<tensorflow::GPUDevice, float, tensorflow::complex64>);
TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
(1) There is some limited support with Timeline for logging memory allocations. Here is an example for its usage:
from tensorflow.python.client import timeline  # needed for timeline.Timeline below

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
You can give this code a try with the MNIST example (mnist with summaries).
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximated GPU memory usage statistics. It basically simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
There's some code in tensorflow.contrib.memory_stats that will help with this:
import tensorflow as tf
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse

with tf.device('/device:GPU:0'):  # Replace with the device you are interested in
    bytes_in_use = BytesInUse()
with tf.Session() as sess:
    print(sess.run(bytes_in_use))
The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory
tf.config.experimental.get_memory_info('GPU:0')
It currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
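For example, a minimal sketch (assuming TensorFlow 2.4+ and a visible GPU named 'GPU:0'):
import tensorflow as tf

x = tf.random.normal((2048, 2048))     # placed on GPU:0 if one is visible
y = tf.matmul(x, x)                    # does some work on the GPU
info = tf.config.experimental.get_memory_info('GPU:0')
print(info['current'], info['peak'])   # both values are in bytes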
As @V.M previously mentioned, a solution that works well is using tf.config.experimental.get_memory_info('DEVICE_NAME').
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
The values of these keys are the ACTUAL memory used, not the memory allocated, which is what nvidia-smi reports.
In reality, for GPUs, TensorFlow allocates all of the available memory by default, which makes nvidia-smi useless for checking how much memory your code actually uses. Even if tf.config.experimental.set_memory_growth is set to True, TensorFlow no longer allocates the whole available memory up front, but it still allocates more memory than is actually used, and it does so in discrete steps, e.g. it allocates 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, etc.
A small note concerning get_memory_info(): it doesn't return correct values when called inside a tf.function()-decorated function. Thus, the 'peak' key should be read after executing a tf.function()-decorated function to determine the peak memory used, as in the sketch below.
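A sketch of that pattern (assuming TensorFlow 2.5+, where tf.config.experimental.reset_memory_stats is available, and a visible GPU:0):
import tensorflow as tf

@tf.function
def step(x):
    return tf.matmul(x, x)

tf.config.experimental.reset_memory_stats('GPU:0')   # reset the 'peak' counter
_ = step(tf.random.normal((2048, 2048)))
peak = tf.config.experimental.get_memory_info('GPU:0')['peak']   # read peak after the call
print(peak, 'bytes')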
For older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it only returned the used memory (with no option for determining the peak memory).
Final note: you can also consider the TensorFlow Profiler, available with TensorBoard, as @Peter mentioned.
Hope this helps :)