Why does the order of threadgroup num (work_group_launch_order) in TFLite Metal matter? - tensorflow

I'm new to Metal and trying to understand the Metal implementation of convolution in TFLite. After reading this line of code and finding all its usages, I'm confused about why work_group_launch_order matters. How does this mechanism actually work? Is it related to the way the GPU linearizes the 3D threadgroup grid?
[Tensorflow GitHub]
For simplicity, let me try to explain the thread dispatch strategy for a certain kernel (named ConvolutionGeneric) in TFLite.
For a given 3D thread number t=<tx, ty, tz> and thread group shape s=<sx, sy, sz>, TFLite calculates the number of desired thread groups n=<nx, ny, nz> by nx=ceil(tx/sx), ny=ceil(ty/sy), and nz=ceil(tz/sz), which is perfectly normal.
Normally, we would dispatch <nx, ny, nz> threadgroups for the 3 dimensions and obtain the thread position in the grid inside the Metal kernel function from the argument gid with the attribute [[thread_position_in_grid]]. The thread position determines which region the current thread is responsible for.
However, TFLite does something unusual: it dispatches <nz, nx, ny> threadgroups for the 3 dimensions and computes the thread position in the grid inside the Metal kernel function from tid3d [[thread_position_in_threadgroup]] and group_id [[threadgroup_position_in_grid]] as
gid_x = group_id.y * sx + tid3d.x;
gid_y = group_id.z * sy + tid3d.y;
gid_z = group_id.x * sz + tid3d.z;
What surprises me is that this strategy really does improve performance (~10% speedup).
Can someone explain the underlying mechanism behind this unusual threadgroup dispatch strategy?
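To make the remapping concrete, here is a minimal Python sketch (not TFLite code; the function names are made up for illustration). It enumerates every gid produced by the <nz, nx, ny> dispatch with the formulas above and checks that it covers exactly the same positions as a plain <nx, ny, nz> dispatch. Only the numbering of the threadgroups changes, not the set of work items.

```python
import math
from itertools import product


def group_counts(t, s):
    """Threadgroups per axis: ceil(t / s), element-wise."""
    return tuple(math.ceil(ti / si) for ti, si in zip(t, s))


def remapped_gids(t, s):
    """All gids computed by threads under the permuted <nz, nx, ny> dispatch."""
    nx, ny, nz = group_counts(t, s)
    sx, sy, sz = s
    dispatched = (nz, nx, ny)  # threadgroups along grid x, y, z
    gids = set()
    for gx, gy, gz in product(*(range(n) for n in dispatched)):
        for tx, ty, tz in product(range(sx), range(sy), range(sz)):
            # Same formulas as in the question: group_id.y -> x, .z -> y, .x -> z
            gids.add((gy * sx + tx, gz * sy + ty, gx * sz + tz))
    return gids


def plain_gids(t, s):
    """All gids covered by a plain <nx, ny, nz> dispatch."""
    nx, ny, nz = group_counts(t, s)
    sx, sy, sz = s
    return set(product(range(nx * sx), range(ny * sy), range(nz * sz)))
```

For example, with t=<10, 6, 4> and s=<4, 2, 2>, both functions return the same 288 positions, so the kernel does the same work either way and any speed difference has to come from the order in which the hardware schedules the groups.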

GPU What are the proper thread dimensions for a compute shader with a very large work load?

I'm working on a heightmap erosion compute shader in unity, where each point on the map is eroded separately. This is working well for small maps, but the project I'm working on requires 4096x4096 maps. This means 4096^2 = 16777216 points to simulate. With the default thread dimensions of [64,1,1], this creates 262144 thread groups, way more than the allowed limit of 65535.
My question is:
Can I simply raise the thread dimensions, and what do I have to consider in terms of performance when I do?
Is it maybe possible to simply run the shader multiple times, with different ranges of heightmap coordinates?
This is my first time working with shaders. The tutorials I've seen online quickly go too in depth into gpu hardware specifications, so I didn't pick up much from that.
With 32x32 threads per work group (1024 threads, the maximum a Direct3D 11 compute shader allows per group), you can dispatch 128x128 work groups to do what you need. Remember that 32x32 threads are invoked for each work group you dispatch, so 128x128 work groups x 32x32 threads = 16384 work groups x 1024 threads = 16777216 threads, one per point.
computeShader.Dispatch(computeShader.FindKernel("kernel"), 128, 128, 1);
[numthreads(32, 32, 1)]
void kernel(uint3 id : SV_DispatchThreadID)
{
// ...
// 0 <= id.x < 4096
// 0 <= id.y < 4096
}
As for the performance implications, the general answer is "try it out!": run your kernel with different thread group sizes and group counts. The results may vary depending on your computations and on your hardware.
If you still need to bypass the 65535 limit, you can use DispatchIndirect. It works like Dispatch, but the arguments are passed through a ComputeBuffer.
ComputeBuffer argsBuffer = new ComputeBuffer(3, sizeof(uint), ComputeBufferType.IndirectArguments);
uint[] args = { 128, 128, 1 }; // work groups
argsBuffer.SetData(args);
computeShader.DispatchIndirect(computeShader.FindKernel("kernel"), argsBuffer);
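The group-count arithmetic behind this can be checked with a small Python sketch. It is only an illustration: the helper names are hypothetical, the 65535 limit is the one quoted in the question, and the thread group sizes are example values. It also sketches the "run the shader multiple times with different ranges" idea from the question by splitting one axis into ranges.

```python
import math

MAX_GROUPS_PER_DIM = 65535  # per-dimension limit quoted in the question


def groups_needed(width, height, threads_x, threads_y):
    """Thread groups per axis needed to cover a width x height map."""
    return math.ceil(width / threads_x), math.ceil(height / threads_y)


def fits(groups):
    """True if every axis stays within the dispatch limit."""
    return all(g <= MAX_GROUPS_PER_DIM for g in groups)


def tile_ranges(size, tile):
    """Split [0, size) into (start, count) ranges, one per dispatch."""
    return [(start, min(tile, size - start)) for start in range(0, size, tile)]


# 1D layout from the question: 64 threads per group over all 4096^2 points.
flat = (math.ceil(4096 * 4096 / 64), 1)
# 2D layout with an example 8x8 thread group.
grid = groups_needed(4096, 4096, 8, 8)
```

Here `flat` is (262144, 1), which exceeds the limit, while `grid` is (512, 512), which fits. `tile_ranges(4096, 2048)` gives the two row ranges you would pass as an offset if you preferred several smaller dispatches.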
P.S.: Working on a GPU requires understanding its architecture because (1) you work at a low level, close to the hardware, and many of the features you use are implemented in hardware (e.g. textures); (2) you want to get the best performance out of your programs (e.g. by making the best use of blocks, warps, and caches) ;)

SLSQP in ScipyOptimizeDriver only executes one iteration, takes a very long time, then exits

I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6
You asked two questions whose answers ended up being unrelated to each other:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA?
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient-free method. SLSQP uses gradients. So the safe bet was that the slowdown happened when SLSQP asked for the derivatives (which COBYLA never did).
That's where I went to look first. Computing derivatives happens in two steps: a) compute partials for each component and b) solve a linear system with those partials to compute totals. The slowdown has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 of your model and saw this:
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6x1e6. That's pretty big, and you are using a DirectSolver to try to compute and store a factorization of it. That's the source of the slowdown. Using DirectSolvers is great for smaller models (a rule of thumb is that the output vector should have fewer than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
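A back-of-the-envelope sketch shows why a system of this size is a problem. It assumes the worst case of a fully dense matrix of 8-byte floats. A sparse assembled matrix would need far less storage, so treat the number as an upper bound rather than what OpenMDAO actually allocates.

```python
# Output vector length reported by the problem summary above.
n_outputs = 1_169_614

# Worst case: a dense n x n matrix of float64 values (8 bytes each).
dense_entries = n_outputs * n_outputs
dense_bytes = dense_entries * 8
dense_terabytes = dense_bytes / 1e12

# Same estimate at the rule-of-thumb size of 10000 outputs, for comparison.
small_gigabytes = 10_000 * 10_000 * 8 / 1e9
```

The dense estimate comes to roughly 10.9 TB for this model, against 0.8 GB at the 10000-element rule of thumb.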
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use a much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant. Make your N2 look like this instead:
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and provide that for direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally and use other linear solvers (sometimes in conjunction with the direct solver; see Section 5.3 of the OpenMDAO paper, and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
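As a rough illustration of the dictionary idea, here is a minimal Python sketch. It is a hypothetical stand-in, not OpenMDAO's actual classes: the variable names, the storage format, and the helper functions are all assumptions. Each sub-Jacobian is stored under its (of, wrt) key, either dense or as sparse coordinate triplets, and a derivative is propagated through a feed-forward chain x -> y -> z one sub-Jacobian at a time, without ever assembling a full matrix.

```python
# Hypothetical dictionary-style Jacobian keyed by (of-var, wrt-var).
sub_jacs = {
    ("y", "x"): {"dense": [[2.0, 0.0], [0.0, 3.0]]},
    ("z", "y"): {"rows": [0, 1], "cols": [1, 0], "vals": [5.0, 7.0], "n_rows": 2},
}


def apply_sub_jac(jac, vec):
    """Multiply one sub-Jacobian by a vector, whichever way it is stored."""
    if "dense" in jac:
        return [sum(a * v for a, v in zip(row, vec)) for row in jac["dense"]]
    out = [0.0] * jac["n_rows"]
    for r, c, val in zip(jac["rows"], jac["cols"], jac["vals"]):
        out[r] += val * vec[c]
    return out


def stored_entries(jacs):
    """Count the numbers actually kept in memory across all sub-Jacobians."""
    return sum(
        sum(len(row) for row in j["dense"]) if "dense" in j else len(j["vals"])
        for j in jacs.values()
    )


d_x = [1.0, 1.0]
d_y = apply_sub_jac(sub_jacs[("y", "x")], d_x)  # propagate x -> y
d_z = apply_sub_jac(sub_jacs[("z", "y")], d_y)  # propagate y -> z
```

Only six numbers are stored here, and pairs of variables with no declared partial take no storage at all, which is the sense in which the dictionary behaves like a sparse matrix.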
Regarding the issue of memory allocation. I borrowed this image from the openmdao docs
2) Why does SLSQP stop after one iteration?
Gradient-based optimizations are very sensitive to scaling. I plotted your objective function inside your allowed design space and got this:
So we can see that the minimum is at about 6 degrees, but the objective values are TINY (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient-based optimization. Starting by scaling your objective/constraints to order 1 is a decent rule of thumb, but it's common that you need to adjust things beyond that for more complex problems.
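To see what the ref argument does numerically, here is a small Python sketch of the scaling the driver sees. In OpenMDAO, ref is the model value that maps to 1.0 and ref0 (default 0) is the value that maps to 0.0. The helper function itself is only an illustration, not OpenMDAO code.

```python
def scale(value, ref=1.0, ref0=0.0):
    """Value seen by the optimizer: ref0 maps to 0.0 and ref maps to 1.0."""
    return (value - ref0) / (ref - ref0)


raw_obj = 0.00023868  # objective at alpha = 0.5, from the debug print

unscaled = scale(raw_obj)          # what SLSQP saw originally
scaled = scale(raw_obj, ref=1e-4)  # what SLSQP sees with ref=1e-4

# Derivatives are scaled by the same factor, 1 / (ref - ref0).
gradient_factor = 1.0 / 1e-4
```

With ref=1e-4 the optimizer works with an objective of about 2.39 instead of 2.4e-4, and gradients about 10000 times larger, so changes in alpha no longer look negligible next to the convergence tolerance.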

How to create a loss-function for an unsupervised-learning model, where the output resembles the direct input for a game agent?

I'm trying to set up a deep neural network that predicts the next move for a game agent to navigate a world. The game agent is controlled by two float inputs. The first one controls the speed (0.0 = stop/do not move, 1.0 = max. speed). The second controls the steering (-1.0 = turn left, 0.0 = straight, +1.0 = turn right).
I designed the network so that it has two output neurons, one for the speed (with a sigmoid activation) and one for the steering (with a tanh activation). The actual input I want to feed the network is the pixel data and some game state values.
To train the network I would simply run a whole game (about 2000 frames/samples). When the game is over I want to train the model. Here is where I struggle: what would my loss function look like? While playing I collect all actions/outputs from the network, the game state, and the rewards per frame/sample. When the game is done I also know whether the agent won or lost.
Edit:
This post http://karpathy.github.io/2016/05/31/rl/ got me inspired. Maybe I could use the discounted (move, turn) value pairs and multiply them by (-1) if the game agent lost and (+1) if it won. Could I then use these values as gradients to update the network's weights?
It would be nice if someone could help me out here.
All the best,
Tobs.
The problem you describe belongs to reinforcement learning, where an agent interacts with an environment and collects data: the game state, its actions, and the reward/score it receives at the end. There are many approaches.
The one you describe is a policy-gradient method. The objective is E[\sum r], where r is the score, and it has to be maximized. Its gradient is estimated as A*grad(log(p_theta)), where A is the advantage function, i.e. +1/-1 for winning/losing, and p_theta is the probability of choosing an action under the policy parameterized by theta (the neural network). If the agent wins, the update moves the policy in favor of the actions it took because of the +1, and vice versa.
Note: There are many ways to design A; in this case +1/-1 is chosen.
You can read about this in more detail here.
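The pieces above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a full training loop. The policy has to be stochastic for log(p_theta) to exist, so the sketch assumes each network output is treated as the mean of a Gaussian with a fixed standard deviation (a hypothetical choice). All function names are made up for the example.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for every frame: r_t + gamma * r_{t+1} + ..."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def gaussian_log_prob(action, mean, std):
    """log p(action) for a 1D Gaussian policy with the given mean and std."""
    var = std * std
    return -0.5 * ((action - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def surrogate_loss(log_probs, advantages):
    """Negative mean of A * log p; minimizing it ascends the policy gradient."""
    total = sum(a * lp for a, lp in zip(advantages, log_probs))
    return -total / len(log_probs)
```

In an autodiff framework you would compute the log-probabilities from the network outputs, plug in +1/-1 (or the discounted returns) as the advantages, and backpropagate through `surrogate_loss`. The advantages are treated as constants.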

Is there any good reason to transpose a tensor from NHWC to NCHW?

I often see the transpose implementation in TensorFlow code. I wonder why one would want to transpose an NHWC tensor to NCHW. Please give a good example and the reasoning behind it.
Rather than citing the documentation, you should read up on how CUDA works and think about how you would implement most operations.
The reason for NCHW generally being faster than NHWC is how the CUDA kernels are written. In CUDA you need to specify what each thread is doing, like
const int threads = 32;
dim3 block(threads, threads);
dim3 grid(up2(W / 2, threads), up2(H, threads), B);
kernel<Dtype> <<< grid, block>>> (args ...)
Here you get 3 indices threadIdx.z, threadIdx.y, threadIdx.x. These threads are organized in warps (hardware design).
You want coalesced memory transactions, which means the threads are ordered such that neighboring threads of a warp access neighboring memory addresses, so the GPU can serve them with few memory transactions.
To sum it up:
You want "threadIdx.x" to be the innermost loop, and you should organize the data layout so that it is read in a coalesced way. The ideal data structure should be accessible by
b * C * H * W + c * H * W + h * W + w
where lowercase letters denote the index and capital letters denote the shape (e.g., 0 <= w < W).
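A small Python sketch of the two index formulas makes the difference visible. The NCHW formula is the one given above in factored form; the NHWC formula is the analogous one with the channel index running fastest. The function names are just for illustration.

```python
def nchw_offset(b, c, h, w, C, H, W):
    """Flat offset in NCHW: b*C*H*W + c*H*W + h*W + w."""
    return ((b * C + c) * H + h) * W + w


def nhwc_offset(b, c, h, w, C, H, W):
    """Flat offset in NHWC: b*H*W*C + h*W*C + w*C + c."""
    return ((b * H + h) * W + w) * C + c


C, H, W = 3, 4, 5

# Distance in memory between two horizontally adjacent pixels of one channel.
nchw_w_stride = nchw_offset(0, 0, 0, 1, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W)
nhwc_w_stride = nhwc_offset(0, 0, 0, 1, C, H, W) - nhwc_offset(0, 0, 0, 0, C, H, W)
```

In NCHW, threads that differ only in w touch consecutive addresses (stride 1). In NHWC the same threads are C elements apart (stride 3 here), so a kernel that maps threadIdx.x to w reads contiguous memory only in the NCHW layout.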
In convolution operations (part of the most commonly used layers), what you are essentially doing is cropping a region in each channel and computing a dot product with a region in another channel (from another tensor). So the indices which need to run crazy fast are the height-idx and width-idx. In the end, you are adding along the channel axis (like the convolution formulae suggest). This also explains why it makes no difference to consider NWHC, NCWH.
This has an impact on how you order the data. And it is the reason you want to have the memory layout I described above.
The worst layout would be:
H, C, B, in threadIdx.z, threadIdx.y, threadIdx.x
The best layout would be:
B, C, H in threadIdx.z, threadIdx.y, threadIdx.x
The same is (mostly) true for GEMM as well (here one matrix should be transposed). There is no source available for cuDNN. But you might be interested in looking into cutlass.
From the performance guide of Tensorflow:
NHWC is the TensorFlow default and NCHW is the optimal format to use
when training on NVIDIA GPUs using cuDNN. [...] The brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. In the long term, we are working on tools to auto rewrite graphs to make switching between the formats transparent and take advantages of micro optimizations where a GPU Op may be faster using NHWC than the normally most efficient NCHW.
Essentially, cuDNN is optimized for NCHW, while CPU-only tensorflow is optimized for NHWC. Switching from one to the other is just a matter of performance maximization and/or unavailability of certain operations in a specific data format.

How could I read information off the graphics card after processing?

Supposing, for example, that I had a 10x10 "cloth" mesh, with each square being two triangles. Now, if I wanted to animate this, I could do the spring calculations on the CPU. Each vertex would have its own "spring" data and would, hopefully, bounce like whatever type of "cloth" it was supposed to represent.
However, that would involve a minimum of about 380? spring calculations per frame. Happily, the per-vertex calculations are "embarrassingly parallel" - Had I one CPU per vertex, each vertex could be run on a single CPU. GPUs, therefore, are theoretically an excellent choice for running such calculations on.
Except (and this is using DirectX/SlimDX) - I have no idea/am not sure how I would/should:
1) Send all this vertex data to the graphics card (yes, I know how to render stuff and have even written my own per-pixel and texture-blending global lighting effect file; however, it is necessary for each vertex to be able to access the position data of at least three other vertices). I suppose I could stick the relevant vertex positions and number of vertex positions in TextureCoords, but there may be a different, standard solution.
2) Read all the vertex data afterwards, so I can update the mesh in memory. Otherwise, each update will act on the exact same data to the exact same result, rather like adding 2 + 3 = 5, 2 + 3 = 5, 2 + 3 = 5 when what you want is 2 + 3 = 5 - 2 = 3 + 1.5 = 4.5.
And it may be that I'm looking in the wrong direction to do this.
Thanks.
You could use the approach you have described to pack the data into textures and write special HLSL shaders to compute spring forces and then update vertex positions. That approach is totally valid but can be troublesome when you try to debug problems because you are using texture pixels in an unconventional way (you can draw the texture and maybe write some code to watch the values in a given pixel). It is probably easier in the long run to use something like CUDA, DirectCompute, or OpenCL. CUDA allows you to "bind" the DirectX vertex-buffer for access in CUDA. Then in a CUDA kernel you can use the positions to calculate forces and then write new positions to the vertex-buffer (in parallel on the GPU) before you render the updated positions.
There is a cloth demo that uses DirectCompute in the DirectX 10/11 DirectX SDK.
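The read-old-positions, write-new-positions pattern described above can be sketched on the CPU in Python. This is only an illustration: it uses a 1D chain of unit masses instead of a cloth mesh, and the spring constant, time step, and damping values are made up. Each loop iteration reads only the previous buffer and writes only the new one, which is what lets every vertex run as its own GPU thread, and feeding the output back in as the next input is what makes each update build on the last.

```python
def step(positions, velocities, rest_length=1.0, k=10.0, dt=0.01, damping=0.98):
    """One explicit time step for a 1D spring chain; vertex 0 is pinned."""
    n = len(positions)
    new_positions = list(positions)    # write buffer, separate from the input
    new_velocities = list(velocities)
    for i in range(1, n):              # iterations are independent of each other
        force = 0.0
        for j in (i - 1, i + 1):       # springs to the left and right neighbors
            if 0 <= j < n:
                stretch = abs(positions[j] - positions[i]) - rest_length
                direction = 1.0 if positions[j] > positions[i] else -1.0
                force += k * stretch * direction
        new_velocities[i] = (velocities[i] + force * dt) * damping
        new_positions[i] = positions[i] + new_velocities[i] * dt
    return new_positions, new_velocities


# Ping-pong: the output of one step becomes the input of the next.
pos, vel = [0.0, 1.0, 3.0], [0.0, 0.0, 0.0]
for _ in range(3):
    pos, vel = step(pos, vel)
```

On the GPU the two buffers would be two vertex buffers (or textures) whose roles are swapped every frame.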