I'm using the gradient descent implementation below in Octave for ML.
I first tried increasing the number of CPU cores and running Octave multithreaded with OpenBLAS, but I still didn't get the results I was looking for, so I tried Nvidia's toolkit and their Tesla K80 GPU.
I'm loading Octave with the drop-in NVBLAS library, following the instructions in this article:
Drop-in Acceleration of GNU Octave
When I checked nvidia-smi, I found the GPU to be idle, although my test using a matrix-matrix multiplication yields ~9 teraflops.
Later I came to understand that the matrix-vector multiplication used in the above-mentioned implementation is not supported, as per the NVBLAS documentation.
So my question is: is there a gradient descent implementation that uses matrix-matrix multiplication, or something equivalent, that can replace the gradient descent implementation I have?
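One possible way to end up with matrix-matrix products, sketched in Python with NumPy rather than Octave: if several target columns are fitted at once (for example, one column per one-vs-all classifier), the parameter vector becomes a matrix and both products in the update become matrix-matrix multiplications. The function name, shapes, and synthetic data are illustrative assumptions, not the implementation the question refers to, and whether a given BLAS library actually offloads these calls to the GPU is not something this sketch checks.

```python
import numpy as np

def batch_gradient_descent(X, Y, alpha=0.1, num_iters=500):
    """Least-squares gradient descent for k targets at once.

    X: (m, n) design matrix, Y: (m, k) targets, returns Theta: (n, k).
    With k > 1, both products below are matrix-matrix multiplications.
    """
    m, n = X.shape
    Theta = np.zeros((n, Y.shape[1]))
    for _ in range(num_iters):
        residual = X @ Theta - Y          # (m, n) @ (n, k)
        gradient = (X.T @ residual) / m   # (n, m) @ (m, k)
        Theta -= alpha * gradient
    return Theta

# Synthetic, noise-free data (hypothetical) so the result can be checked.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_Theta = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.5]])
Y = X @ true_Theta
Theta = batch_gradient_descent(X, Y)
```

With a single target column this reduces to the usual matrix-vector form, so the trick only helps when there really are several columns to fit together.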
I have a computation that has for loops and calls to TensorFlow matrix algorithms such as tf.lstsq, along with TensorFlow iteration via tf.map_fn. I would like to profile it to see how much parallelism I am getting in tf.map_fn and in the matrix algorithms that get called.
This doesn't seem to be the use case for the TensorFlow Profiler at all, as it is organized around the neural network model training loop.
Is there a way to use the TensorFlow Profiler for arbitrary TensorFlow computations, or is the go-to move in this case to use NVIDIA tools like nvprof?
I figured out that the nvprof, nvvp and Nsight tools I was looking for are available as a Conda install of cudatoolkit-dev. Their use is described in this gist.
I have been trying to understand (and failing miserably) how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or using "direct convolution", or using 1x1 kernels.
I find it very confusing and can't wrap my head around the many different ways it is described. I thought I understood a typical convolution like PyTorch's conv2d as a mathematical operation on an image, but what does someone mean when they say they do conv2d in one of the following ways?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"
For convolution using GEMM, what I understand based on this paper is that both the input image and the filters are converted to 2D matrices using im2col and im2row operations, and then these two matrices are simply multiplied.
The 3D input image (height, width, input-channels) is converted to a 2D matrix, and the 4D kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that's what it means, then how is it different from doing "convolution using 1x1 kernels"?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
You can have a 3x3 convolution, where a square of 9 elements slides over the image (with some specified stride, dilation, etc.). In a 1x1 convolution the kernel covers a single spatial position (typically with stride 1 and no dilation), although it still spans all input channels.
So instead of a sliding window with summation over neighbouring pixels, you simply apply a linear projection to each pixel's channel values.
It is a cheap operation and is used, for example, as part of the depthwise separable convolutions found in many modern architectures, to increase or decrease the number of channels.
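To make the "linear projection of each pixel" concrete, here is a minimal NumPy sketch. The `conv1x1` helper, the (H, W, C) layout, and the sizes are assumptions chosen for illustration, not how any particular library lays out its tensors.

```python
import numpy as np

def conv1x1(image, weight):
    """1x1 convolution as a per-pixel matrix multiply.

    image: (H, W, C_in); weight: (C_out, C_in). Returns (H, W, C_out).
    """
    H, W, c_in = image.shape
    # Treat every pixel as a row vector of channels and project all rows at once.
    return (image.reshape(-1, c_in) @ weight.T).reshape(H, W, -1)

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 5, 8))
weight = rng.standard_normal((2, 8))   # reduce 8 channels to 2
out = conv1x1(image, weight)
```

Each output pixel is just `weight @ image[i, j]`, which is why a 1x1 convolution changes the number of channels without looking at neighbouring pixels.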
GEMM
In the article you provided, there is this at the top:
[...] function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms)
So BLAS is a specification that describes a set of low-level linear algebra routines and their interfaces.
There are many implementations of BLAS, tailored to specific architectures or with traits that are useful in particular contexts. For example, there is cuBLAS, which is written and optimized for GPUs (and used heavily by "higher-level" deep learning libraries like PyTorch), and Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).
Usually these are written in low-level languages (Fortran, C, assembly, C++) for maximum performance.
GEMM (GEneral Matrix Multiply) is a matrix multiplication routine that is used to implement fully connected layers and convolutions and is provided by the various BLAS implementations.
It has nothing to do with deep learning convolution per se; it is a fast matrix multiplication routine (one that takes things like cache hits into account).
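The im2col-plus-GEMM scheme described in the question can be sketched roughly as follows. This is a NumPy illustration under simplifying assumptions (stride 1, no padding, an (H, W, C) image, and no kernel flip, i.e. the cross-correlation that deep learning frameworks call convolution); the helper names are made up for the example, and a naive loop version is included only to check the result.

```python
import numpy as np

def im2col(image, kh, kw):
    """Unfold an (H, W, C) image into one row per output position."""
    H, W, C = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw * C), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each row is one flattened (kh, kw, C) patch.
            cols[i * out_w + j] = image[i:i + kh, j:j + kw, :].ravel()
    return cols, out_h, out_w

def conv2d_gemm(image, kernels):
    """kernels: (C_out, C_in, kh, kw). Returns (out_h, out_w, C_out)."""
    c_out, c_in, kh, kw = kernels.shape
    cols, out_h, out_w = im2col(image, kh, kw)
    # Reorder kernel axes to (kh, kw, C_in) so they match the patch layout.
    weight = kernels.transpose(0, 2, 3, 1).reshape(c_out, -1)
    out = cols @ weight.T   # the single matrix-matrix multiplication
    return out.reshape(out_h, out_w, c_out)

def conv2d_naive(image, kernels):
    """Reference implementation with explicit loops, used only for checking."""
    c_out, c_in, kh, kw = kernels.shape
    H, W, _ = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1, c_out))
    for o in range(c_out):
        k = kernels[o].transpose(1, 2, 0)   # (kh, kw, C_in)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, o] = np.sum(image[i:i + kh, j:j + kw, :] * k)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 6, 3))
kernels = rng.standard_normal((4, 3, 3, 3))
out_gemm = conv2d_gemm(image, kernels)
out_naive = conv2d_naive(image, kernels)
```

With a 1x1 kernel, `im2col` degenerates to a plain reshape of the image, so the 1x1 case is the special case where no patch unfolding is needed before the multiplication.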
Direct convolutions
Direct convolution computes the output straight from the definition, so you simply multiply the items with each other, which is O(n^2). There is a more efficient approach using the Fast Fourier Transform, which is O(n*log(n)). Some info is presented in this answer, and questions about this part would be better suited to the math-related Stack Exchange sites.
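As a rough illustration of the two approaches, here is a 1-D NumPy sketch (1-D only to keep it short; the function names are made up for the example). Note that this is convolution in the signal-processing sense, with the kernel flipped, unlike the cross-correlation used in deep learning layers.

```python
import numpy as np

def conv_direct(x, h):
    """Full 1-D convolution from the definition: len(x) * len(h) multiply-adds."""
    n, m = len(x), len(h)
    y = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            y[i + j] += x[i] * h[j]
    return y

def conv_fft(x, h):
    """Same result via the convolution theorem: multiply in the frequency domain."""
    size = len(x) + len(h) - 1
    # Zero-pad both signals to the full output length to avoid wrap-around.
    spectrum = np.fft.rfft(x, size) * np.fft.rfft(h, size)
    return np.fft.irfft(spectrum, size)

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.5])
y_direct = conv_direct(x, h)
y_fft = conv_fft(x, h)
```

For small kernels such as 3x3 the direct loop (or GEMM) is usually what gets used in practice; the FFT route tends to pay off only as the kernel gets large.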
I am reading the Apex AMP documentation:
A Python-only build omits:
Fused kernels required to use apex.optimizers.FusedAdam.
Fused kernels required to use apex.normalization.FusedLayerNorm.
Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp.
DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
There also seems to be a "FusedAdam" optimizer:
The Adam optimizer in Pytorch (like all Pytorch optimizers) carries
out optimizer.step() by looping over parameters, and launching a
series of kernels for each parameter. This can require hundreds of
small launches that are mostly bound by CPU-side Python looping and
kernel launch overhead, resulting in poor device utilization.
Currently, the FusedAdam implementation in Apex flattens the
parameters for the optimization step, then carries out the
optimization step itself via a fused kernel that combines all the Adam
operations. In this way, the loop over parameters as well as the
internal series of Adam operations for each parameter are fused such
that optimizer.step() requires only a few kernel launches.
The current implementation (in Apex master) is brittle and only works
with Amp opt_level O2. I’ve got a WIP branch to make it work for any
opt_level (https://github.com/NVIDIA/apex/pull/351). I recommend
waiting until this is merged then trying it.
This partially explains it. I'm left with more questions:
What is meant by kernel? A layer or an optimizer?
Is the idea of fused layer the same as a fused optimizer?
"Kernel" here refers to compute kernels: https://en.wikipedia.org/wiki/Compute_kernel
Operations like convolution are often implemented using compute kernels for better efficiency. Compute kernels can be written in C, CUDA, OpenCL or even assembly for maximum efficiency. It is therefore not surprising that "a Python-only build" does not support...
"Fusing" means combining computation steps. Basically, it's an implementation trick to run code more efficiently by combining similar operations into a single hardware (GPU, CPU or TPU) operation. Therefore, a "fused layer" is a layer whose operations benefit from a "fused" implementation.
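The flatten-then-update idea from the quoted FusedAdam description can be mimicked on the CPU with NumPy. This is only a structural sketch: NumPy launches no GPU kernels, the helper names are invented for the example, and it is not how Apex implements its fused kernel. It just shows that looping over parameters and updating one flattened buffer give the same numbers.

```python
import numpy as np

def adam_update(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one Adam step in place to arrays p, m, v of matching shape."""
    m[...] = b1 * m + (1 - b1) * g
    v[...] = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    p -= lr * m_hat / (np.sqrt(v_hat) + eps)

def step_looped(params, grads, ms, vs, t):
    """One update per parameter tensor (many small launches on a GPU)."""
    for p, g, m, v in zip(params, grads, ms, vs):
        adam_update(p, g, m, v, t)

def step_flattened(params, grads, ms, vs, t):
    """Flatten everything, update once, then copy the results back."""
    flat = [np.concatenate([a.ravel() for a in group])
            for group in (params, grads, ms, vs)]
    adam_update(*flat, t)
    for group, flat_group in zip((params, ms, vs), (flat[0], flat[2], flat[3])):
        offset = 0
        for a in group:
            a[...] = flat_group[offset:offset + a.size].reshape(a.shape)
            offset += a.size

rng = np.random.default_rng(0)
shapes = [(3, 4), (4,), (4, 2)]
grads = [rng.standard_normal(s) for s in shapes]
init = [rng.standard_normal(s) for s in shapes]

def fresh_state():
    """Copies of the initial parameters plus zeroed moment buffers."""
    return ([a.copy() for a in init],
            [np.zeros(s) for s in shapes],
            [np.zeros(s) for s in shapes])

p1, m1, v1 = fresh_state()
p2, m2, v2 = fresh_state()
for t in (1, 2):
    step_looped(p1, grads, m1, v1, t)
    step_flattened(p2, grads, m2, v2, t)
```

On a GPU the looped version would issue several small kernels per parameter, whereas a fused implementation would do the whole update in a few launches; the arithmetic is the same either way.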
I need the fastest possible matmul operation in TF for the case where one of the matrices is lower triangular. cuBLAS and BLAS have trmm functions, but it looks like TensorFlow doesn't take advantage of them.
I checked the LinearOperators implementation for the LowerTriangular case, but it is not clear whether it uses a BLAS implementation or not.
Can anyone confirm that the most optimized version is implemented by LinearOperators?
Thanks!
I've read some articles about using other distributions to model a stochastic policy in Reinforcement Learning. Usually we use a Gaussian distribution, but some have used the Beta distribution: https://en.wikipedia.org/wiki/Beta_distribution
There is already a Beta distribution class in TensorFlow, which lets people use it with tensors.
But some policy gradient methods put a constraint on the optimization process using the Kullback-Leibler divergence.
The formula contains the digamma function, which is already implemented in TensorFlow. But I can't find the beta function (or the gamma function, since they're linked) in TensorFlow, only log-gamma and incomplete gamma. And I cannot use the scipy.special.beta function because it cannot manipulate tensors (my alpha and beta parameters are produced by a neural network).
I'm not enough of a specialist in this field, and perhaps my question is foolish, but I'd really like an explanation.
Thanks a lot
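For reference, the beta function can be written entirely in terms of log-gamma: B(a, b) = exp(lgamma(a) + lgamma(b) - lgamma(a + b)). The sketch below uses plain Python's `math.lgamma` on scalars to show the identity; the assumption (not run here) is that the same expression would carry over to tensors with TensorFlow's elementwise `tf.math.lgamma`. The helper names are made up for the example.

```python
import math

def log_beta(a, b):
    """log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta(a, b):
    """Beta function recovered from its logarithm."""
    return math.exp(log_beta(a, b))

value = beta(2.0, 3.0)   # Gamma(2) * Gamma(3) / Gamma(5) = 1 * 2 / 24
```

The KL divergence between two Beta distributions only involves the logarithm of the beta function (plus digamma terms), so in that formula `log_beta` can be used directly, which also avoids overflow for large parameters.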