Can anyone shed any light on the output of intel_gpu_top? Specifically, what are the tasks GAM, VS, etc.? (The man page isn't much help.)
What does bitstream busy mean? It always seems to be zero...
render busy: 45%: █████████ render space: 83/131072
bitstream busy: 0%: bitstream space: 0/131072
blitter busy: 0%: blitter space: 0/131072
task percent busy
GAM: 43%: ████████▋ vert fetch: 0 (0/sec)
VS: 35%: ███████ prim fetch: 0 (0/sec)
CL: 33%: ██████▋ VS invocations: 2101845324 (1427552/sec)
SF: 33%: ██████▋ GS invocations: 0 (0/sec)
VF: 33%: ██████▋ GS prims: 0 (0/sec)
GAFS: 33%: ██████▋ CL invocations: 701123988 (475776/sec)
SOL: 32%: ██████▌ CL prims: 701708489 (475888/sec)
GS: 32%: ██████▌ PS invocations: 1254669239424 (116548992/sec)
DS: 32%: ██████▌ PS depth pass: 604287310764 (222384008/sec)
TDG: 2%: ▌
URBM: 2%: ▌
GAFM: 1%: ▎
HS: 0%:
SVG: 0%:
VFE: 0%:
I was curious as well, so here are just a few things I could grab from the reference manuals. Also of interest is the intel-gpu-tools source, and especially lib/instdone.c which describes what can appear in all Intel GPU models. This patch was also hugely helpful in translating all those acronyms!
Some may be wrong, I'd love it if somebody more knowledgeable could chime in! I'll come back to update the answer with more as I learn this stuff.
First, the three lines on the right:
The render space is probably used by regular 3D operations.
The bitstream section refers to the BSD (Bit-Stream Decoder), which handles hardware acceleration for media decoding. It does not appear on my GPU though (Skylake HD 530), so it might not be enabled/visible everywhere.
The blitter is described in vol. 11 and seems responsible for hardware acceleration of 2D operations (blitting).
Fixed function (FF) pipeline units (old-school GPU features):
VF: Vertex Fetcher (vol. 1), the first FF unit in the 3D Pipeline responsible for fetching vertex data from memory.
VS: Vertex Shader (vol.1), computes things on the vertices of each primitive drawn by the GPU. Pretty standard operation on GPUs.
HS: Hull Shader
TE: Tessellation Engine
DS: Domain Shader
GS: Geometry Shader
SOL: Stream Output Logic
CL: Clip Unit
SF: Strips and Fans (vol.1), FF unit whose main function is to decompose primitive topologies such as strips and fans into primitives or objects.
Units used for thread and pipeline management, for both FF units and GPGPU (see the Intel Open Source HD Graphics Programmer's Manual for a lot of info on how this all works):
CS: Command Streamer (vol.1), functional unit of the Graphics Processing Engine that fetches commands, parses them, and routes them to the appropriate pipeline.
TDG: Thread Dispatcher
VFE: Video Front-End
TSG: Thread Spawner
URBM: Unified Return Buffer Manager
Other stuff:
GAM: see GFX Page Walker (vol. 5), also called the Memory Arbiter; it has to do with how the GPU keeps track of its memory pages, quite similar to what the TLB (see also SLAT) does for your RAM.
SDE: South Display Engine; according to vol. 12, "the South Display Engine supports Hot Plug Detection, GPIO, GMBUS, Panel Power Sequencing, and Backlight Modulation".
I'm working on a heightmap erosion compute shader in Unity, where each point on the map is eroded separately. This works well for small maps, but the project I'm working on requires 4096x4096 maps. That means 4096^2 = 16777216 points to simulate. With the default thread dimensions of [64,1,1], this creates 262144 thread groups, way more than the allowed limit of 65535 per dispatch dimension.
My question is:
Can I simply raise the thread dimensions, and what do I have to consider in terms of performance when I do?
Is it maybe possible to simply run the shader multiple times, with different ranges of heightmap coordinates?
This is my first time working with shaders. The tutorials I've found online quickly go too deep into GPU hardware details, so I didn't pick up much from them.
With 32x32 threads per work group (HLSL limits a thread group to 1024 threads, so 64x64 is not allowed), you can Dispatch 128x128 work groups to do what you need: remember that 32x32 threads will be invoked for each work group you dispatch, so you will have 128x128 work groups x 32x32 threads = 4096 x 4096 threads executed, one per heightmap point.
computeShader.Dispatch(computeShader.FindKernel("kernel"), 128, 128, 1);
[numthreads(32, 32, 1)]
void kernel(uint3 id : SV_DispatchThreadID)
{
    // ...
    // 0 <= id.x < 4096
    // 0 <= id.y < 4096
}
As for the performance implications, the general answer is "try it out!": run your kernel with different sizes for threads and work groups. The results may vary depending on your computations and on your hardware.
But, in case you need to bypass the 65535 limit, you can use DispatchIndirect. Basically, it's the same as Dispatch but the arguments are passed through a ComputeBuffer.
ComputeBuffer argsBuffer = new ComputeBuffer(3, sizeof(uint), ComputeBufferType.IndirectArguments);
uint[] args = { 128, 128, 1 }; // work groups
argsBuffer.SetData(args);
computeShader.DispatchIndirect(computeShader.FindKernel("kernel"), argsBuffer);
PS: working on a GPU requires understanding its architecture because (1) you work at a low level, close to the hardware, and many of the features you use are actually implemented in hardware (e.g. textures); (2) you want to get the best performance out of your programs (e.g. make the best use of blocks, warps and caches) ;)
I'm doing an experiment to see which ARM SVE vector length would be the best for my chip design, or to help select which chip has the optimal vector length for my application.
How to change the vector length in a gem5 simulation to see how it affects workload performance?
For SE:
se.py --param 'system.cpu[:].isa[:].sve_vl_se = 2'
For FS:
fs.py --param 'system.sve_vl = 2'
where the values are given in multiples of 128 bits, so 2 means a vector length of 256 bits.
You can test this easily with the ADDVL instruction as shown in this example.
The names of those parameters can easily be determined by looking at the m5out/config.ini generated by a previous run.
Note however that this value is architecturally visible, so it might not be possible to checkpoint after Linux boot and then restore with a different vector length from the one used at boot in order to speed up experiments. This is likely true in general, even though the kernel itself does not run vector instructions, because the effective vector length is under software control. Maybe it is possible to start the simulator with a large vector length and then tell Linux to reduce it in software, but I'm not sure what the API is.
Tested in gem5 3126e84db773f64e46b1d02a9a27892bf6612d30.
To change the vector length, one can use command line option:
--arm-sve-vl=<vl in quadwords: one of {1, 2, 4, 8, 16}>
where the value is given in quadwords (128-bit units). So to simulate a 512-bit SVE machine, one should use:
--arm-sve-vl=4
This works both for Syscall-Emulation mode and Full System mode.
If one wants to quickly explore the space of different vector lengths, one can also change it during the simulation (only in Full System mode). For example, to change the default SVE vector length to 256 bytes, put the following line in your bootscript, before running the benchmark:
echo 256 >/proc/sys/abi/sve_default_vector_length
You can get more information on https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf.
I'm using a generic RTL-SDR dongle to receive frames of the Z-Wave protocol, with real Z-Wave devices. I'm using scapy-radio and I've also downloaded EZ-Wave. However, neither of them implements blocks for all Z-Wave data rates, modulations and codings. I've received some frames using the original EZ-Wave solution, but I assume I can't receive frames at all data rates, codings and modulations with it. Now I'm trying to build on their blocks to implement all of them.
The Z-Wave protocol uses these data rates, modulations and codings:
9.6 kbps - FSK - Manchester
40 kbps - FSK - NRZ
100 kbps - GFSK - NRZ
These are my current blocks (not able to receive anything at all right now):
For example, I will explain my view on blocks for receiving at
9.6 kbps - FSK - Manchester
RTL-SDR Source
variable center_freq = 869500000
variable r1_freq_offset = 800e3
Ch0: Frequency: center_freq_3-r1_freq_offset, so I've got 868.7 MHz on the RTL-SDR Source block.
Frequency Xlating FIR Filter
Center frequency = -800 kHz to get to 868.95 MHz (Europe). To be honest, I'm not sure why I do this and I need an explanation. I'm trying to model these blocks on the EZ-Wave implementation for 40 kbps FSK NRZ (as I assume it is). They use a sample rate of 2M and different configurations, which I did not understand.
Taps = firdes.low_pass(1,samp_rate_1,samp_rate_1/2,5e3,firdes.WIN_HAMMING). I don't understand what the transition bw (5e3 in my case) should be.
Sample rate = 19.2e3, because the data rate/baud is 9.6 kbps and, according to the Nyquist–Shannon sampling theorem, the sampling rate should be at least double the data rate, so 2*9.6 = 19.2. So I'm trying to resample the default 2M from the source down to 19.2k.
Simple squelch
I use the default value (-40) and I'm not sure whether I should change it or not.
Quadrature Demod
should do the FSK demodulation, and I use the default value of gain. I'm not sure whether this is the right way to do FSK demodulation.
Gain = 2*(samp_rate_1)/(2*math.pi*20e3/8.0)
Low Pass Filter
Sample rate = 19.2k to use the same new sample rate
Cutoff Freq = 9.6k, I assume this according to https://nccgroup.github.io/RFTM/fsk_receiver.html
Transition width = 4.8k, which is half the cutoff frequency
Clock Recovery MM
Most of the parameters are default.
Omega = 2, because samp_rate/baud = 19.2k/9.6k = 2
Binary Slicer
is for getting the binary representation of the signal
Zwave PacketSink 9.6
should do the Manchester decoding.
I would like to ask what I should change in my blocks to properly receive Z-Wave frames at all data rates, modulations and codings. When I start receiving, I'm able to see messages from my devices in the FFT sink and the Waterfall sink. The message debug doesn't print packets (like the original EZ-Wave solution does), but only
Looking for sync : 575555aa
Looking for sync : 565555aa
Looking for sync : aa5555aa
which should be the value of frame_shift_register, according to the C code for Manchester decoding (ZWave PacketSink 9.6). I've seen a similar post, but this is a bit different and, to be honest, I'm stuck here.
I will be grateful for any help.
Let's look at the GFSK case. First of all, the sampling rate of the RTL source, 2M, is pretty high. For the maximum data rate, 100 kbps GFSK, a sample rate of say 400~500k will do just fine. There is also the power squelch block. This block prevents signals below a certain threshold from passing. That is not good, because it filters out low-power signals that may contain information. There is also a sample-rate mismatch between the lowpass filter and the MM clock recovery block. The output of the clock recovery block should be 100 kbaud (because for GFSK, sample rate = symbol rate). Using an omega value of 2 and working backward, the input to the MM block should be 200k. But the lowpass filter produces samples at 2M, 10 times more than expected. You have to do proper decimation.
I implemented a GFSK receiver once for our CubeSat. Timing recovery was done by the PFB block, which is more reliable than the MM one. You can find the paper here:
https://www.researchgate.net/publication/309149646_Software-defined_radio_transceiver_for_QB50_CubeSat_telemetry_and_telecommand
Some more details on the receiver could also be found here:
GFSK modulation/demodulation with GNU Radio and USRP
M.
I appreciate your answer; I've changed my sample rates. Now I'm still working on 9.6 kbps, FSK demodulation and Manchester decoding. Currently, the output from my M&M clock recovery looks like this:
I would like to ask what you think about this signal. As I said, it should be FSK demodulated and then I should use Manchester decoding. Do I still need to use the PFB block? Primarily I have to do 9.6 kbps, FSK and Manchester, so I will look at 100 kbps GFSK NRZ if there is some time left.
Sample rate is 1M because of RTL-SDR dongle limitations (225001 to 300000 and 900001 to 3200000).
Current blocks:
I don't understand:
Taps of the Frequency Xlating FIR Filter: firdes.low_pass(1,samp_rate_1,40e3,20e3,firdes.WIN_HAMMING)
Cutoff Freq and Transition Width of the Low Pass filter
Clock Recovery M&M as well, so consider its values "random".
ClockRecovery Output:
I was trying to use the PFB block according to your work on ResearchGate. However, I was unsuccessful, because I still don't understand all the science behind the clock recovery.
I do the low-pass filtering twice because the original Z-Wave blocks from scapy-radio for 40 kbps, FSK and NRZ coding are built like this (and it works):
So I thought it would just be a matter of changing a few parameters and the decoder (Zwave PacketSink 9.6).
I also uploaded my current blocks here.
Moses Browne Mwakyanjala, I'm also trying to implement that thing according to your work.
Maybe there is a problem with the clock recovery and the Manchester decoding. Manchester decoding uses the transitions 0->1 and 1->0 to encode 0s and 1s. How can I properly configure the clock recovery to achieve the correct sample rate and transitions for Manchester decoding? The Manchester decoder (Z-Wave PacketSink 9.6) is able to find the preamble but ends up only looking for sync.
I would also like to ask where I can find the modulation index "h" mentioned in your work.
Thank you
Edit: The results of the proposed solutions have been added at the end of the question.
I'm starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each element holds a bunch of information (around 200 bytes). Every step, every element accesses its neighbors' information and accumulates it to prepare its own update. After that, there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create a one-dimensional OpenCL buffer and fill it with structs representing the elements, each of which has an "int neighbors[6]" where I store the indices of its neighbors in the buffer. I launch a kernel that consults the neighbors and accumulates their information into element variables that are not consulted in this step, and then I launch another kernel that uses those variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct{
    float4 var1;
    float4 var2;
    float4 nextStepVar1;
    int neighbors[8];
    int var3;
    int nextStepVar2;
    bool var4;
} Element;
__kernel void step1(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    for (int i=0; i < 6; ++i){
        if (elem.neighbors[i] != -1){
            //Gather information of the neighbor and accumulate it in elem.nextStepVars
        }
    }
    elements[id] = elem;
}
__kernel void step2(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    //update elem variables by using elem.nextStepVariables
    //restart elem.nextStepVariables
    elements[id] = elem;   // write the updated element back
}
Right now, my OpenCL implementation takes basically the same time as my C++ implementation.
So, the question is: How would you (the experts :P) address this problem?
I have read about 3D images for storing the information and changing the neighborhood access pattern by switching to a 3D NDRange. I have also read about __local memory: first load the whole neighborhood of a workgroup, synchronize with a barrier, and then use it, so that accesses to global memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Edit: Third and fifth optimizations proposed by Huseyin Tugrul were already in the code. As mentioned here, to make structs behave properly, they need to satisfy some restrictions, so it is worth understanding that to avoid headaches.
Edit 1: Applying the seventh optimization proposed by Huseyin Tugrul performance increased from 7 fps to 60 fps. In a more general experimentation, the performance gain was about x8.
Edit 2: Applying the first optimization proposed by Huseyin Tugrul, performance increased by about x1.2. I think the real gain is higher, but it is hidden by another bottleneck not yet solved.
Edit 3: Applying the 8th and 9th optimizations proposed by Huseyin Tugrul didn't change performance, because of the lack of significant code taking advantage of these optimizations, worth trying in other kernels though.
Edit 4: Passing invariant arguments (such as n_elements or the workgroup size) to the kernels as #DEFINEs instead of kernel arguments, as mentioned here, increased performance by around x1.33. As explained in the document, this is because of the aggressive optimizations the compiler can do when it knows those values at compile time (a small host-side sketch of this follows after this edit list).
Edit 5: Applying the second optimization proposed by Huseyin Tugrul, but using 1 bit per neighbor and bitwise operations to check whether a neighbor is present (so if neighbors & 1 != 0 the top neighbor is present, if neighbors & 2 != 0 the bottom neighbor is present, if neighbors & 4 != 0 the right neighbor is present, etc.), increased performance by a factor of x1.11. I think this was mostly because of the reduced data transfer, since data movement was, and still is, my bottleneck. Soon I will try to get rid of the dummy variables used to pad my structs.
Edit 6: By eliminating the structs I was using and creating separate buffers for each property, I eliminated the padding variables, saved space, and was able to optimize the global memory accesses and the local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing, despite the programmatic complexity and reduced readability.
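For the build-time defines from Edit 4, this is roughly what the host side looks like (a minimal sketch; the function, option string and the n_elements/wg_size names are only illustrative):

#include <stdio.h>
#include <CL/cl.h>

/* Build the program with the invariants baked in as preprocessor defines;
 * 'program' and 'device' are assumed to exist already. */
static cl_int build_with_defines(cl_program program, cl_device_id device,
                                 int n_elements, int wg_size)
{
    char options[128];
    snprintf(options, sizeof(options),
             "-Dn_elements=%d -Dwg_size=%d", n_elements, wg_size);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}

/* In the kernel, n_elements is then a compile-time constant:
 *     if (get_global_id(0) >= n_elements) return; */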
According to your step1 and step2, you are not making your GPU cores work hard. What is your kernel's complexity? What is your GPU usage? Did you check with a monitoring program like Afterburner? Mid-range desktop gaming cards can handle 10k threads, each doing 10k iterations.
Since you are working only with neighbours, the ratio of data size to computation is too big and your kernels may be bottlenecked by VRAM bandwidth. Your main system RAM could be roughly as fast as your PCIe bandwidth, and that could be the issue.
1) Use of dedicated cache could get each thread's own grid cell into private registers (fastest), and the neighbours into a __local array, so the comparisons/calculations are done only on-chip. The outline (with a kernel sketch after it):
Load current cell into __private
Load neighbours into __local
start looping for local array
get next neighbour into __private from __local
compute
end loop
(if it has many neighbours, lines after "Load neighbours into __local" can be in another loop that gets from main memory by patches)
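A minimal sketch of that outline, assuming the Element struct from the question; the __local tile holds one Element per work item and the accumulation line is only a placeholder for the real gather:

__kernel void step1_cached(__global Element *elements, int nelements,
                           __local Element *tile)
{
    int lid  = (int)get_local_id(0);
    int lsz  = (int)get_local_size(0);
    int base = (int)get_group_id(0) * lsz;
    int gid  = base + lid;

    if (gid < nelements)
        tile[lid] = elements[gid];        // stage my own cell on-chip
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid >= nelements)
        return;

    Element elem = tile[lid];             // __private copy of my cell
    for (int i = 0; i < 6; ++i) {
        int n = elem.neighbors[i];
        if (n == -1)
            continue;
        Element neigh;
        if (n >= base && n < base + lsz)
            neigh = tile[n - base];       // neighbour already on-chip
        else
            neigh = elements[n];          // neighbour owned by another work group
        elem.nextStepVar1 += neigh.var1;  // placeholder accumulation
    }
    elements[gid] = elem;
}

The host gives the tile its size with clSetKernelArg(kernel, 2, local_size * sizeof(Element), NULL).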
What is your GPU? Nice, it is a GTX 660. You should have 64 kB of controllable cache (local memory) per compute unit. CPUs have only around 1 kB of registers, and they are not addressable for array operations.
2) Shorter indexing could mean using a single byte as the stored neighbour index instead of an int. Saving precious L1 cache space on these "id" fetches is important, so that other threads can hit the L1 cache more often!
Example:
0=neighbour from left
1=neighbour from right
2=neighbour from up
3=neighbour from down
4=neighbour from front
5=neighbour from back
6=neighbour from upper left
...
...
so you can derive the neighbour index from a single byte instead of a 4-byte int, which decreases main memory accesses, at least for neighbour lookups. Your kernel derives the neighbour index from the table above using its compute power, not its memory power, because you do it from core registers (__private). If your total grid size is constant, this is very easy: just add 1 to the actual cell id, add 256 to it, or add 256*256, and so on (see the sketch below).
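For example (a sketch, assuming a constant 256x256x256 grid stored linearly as x + 256*y + 256*256*z; the names are illustrative):

__constant int dir_offset[6] = { -1, +1, -256, +256, -256 * 256, +256 * 256 };

inline int neighbor_index(int id, uchar dir)   // dir is one byte, 0..5, as in the table above
{
    return id + dir_offset[dir];
}

// usage inside a kernel, with a __global uchar *dirs holding 6 direction codes per cell:
//   int n = neighbor_index(id, dirs[id * 6 + i]);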
3) Optimum object size could mean making your struct/cell object size a multiple of 4 bytes. If your total object size is around 200 bytes, you can pad it, or augment it with some empty bytes, to make it exactly 200, 220 or 256 bytes.
4) Branchless code (Edit: it depends!): use fewer if-statements. Using if-statements can make computation much slower. Rather than checking for -1 as the end-of-neighbour marker, you can use another way, because lightweight cores are not as capable as heavyweight ones. You can use surface buffer cells to wrap the surface, so that computed cells always have 6 neighbours and you get rid of if (elem.neighbors[i] != -1). Worth a try, especially on a GPU.
Just computing all neighbours is faster than doing the if-statement. Just multiply the contribution by zero when it is not a valid neighbour. How can we know it is not a valid neighbour? By using a byte array of 6 elements per cell (parallel to the neighbour id array), with invalid = 0 and valid = 1, and multiplying the result by it (a sketch follows below).
The if-statement is inside a loop that runs six times. Loop unrolling gives a similar speed-up if the workload in the loop is relatively light.
But if all threads within the same warp go into the same if-or-else branch, they don't lose performance. So this depends on whether your code diverges or not.
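A sketch of that masked version (assuming border cells were given dummy neighbour indices that are always safe to read, and the valid[] byte array described above; the accumulation line is a placeholder):

__kernel void step1_branchless(__global Element *elements,
                               __global const uchar *valid,   // 6 flags per cell, 0 or 1
                               int nelements)
{
    int id = get_global_id(0);
    if (id >= nelements)
        return;

    Element elem = elements[id];
    for (int i = 0; i < 6; ++i) {
        int n      = elem.neighbors[i];           // always a readable index
        float mask = (float)valid[id * 6 + i];    // 0.0f or 1.0f
        Element neigh = elements[n];
        elem.nextStepVar1 += mask * neigh.var1;   // contributes nothing when invalid
    }
    elements[id] = elem;
}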
5) Data element reordering: you can move the int[8] member to the top of the struct so that memory accesses line up better and the smaller members at the bottom can be read in a single read operation.
6) Size of workgroup: trying different local workgroup sizes can give 2-3x performance. Starting from 16 up to 512 gives different results. For example, AMD GPUs like integer multiples of 64, while NVIDIA GPUs like integer multiples of 32. Intel does fine from 8 upwards, since it can meld multiple compute units together to work on the same workgroup.
7) Separation of variables (only if you can't get rid of the if-statements): separate the comparison fields from the struct. This way you don't need to load a whole struct from main memory just to compare an int or a boolean. Only when the comparison passes do you load the struct from main memory (if you already have the local-memory optimization, put this test before it, so loading into local memory is only done for the selected neighbours). A sketch follows below.
This optimisation makes the best case (no neighbour, or only one neighbour) considerably faster. It does not affect the worst case (maximum number of neighbours).
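A sketch (neighbor_count is an assumed extra per-cell byte buffer, filled on the host from the neighbors array):

__kernel void step1_split(__global Element *elements,
                          __global const uchar *neighbor_count,
                          int nelements)
{
    int id = get_global_id(0);
    if (id >= nelements)
        return;

    if (neighbor_count[id] == 0)     // a 1-byte read decides the cheap case
        return;                      // no ~200-byte struct load at all

    Element elem = elements[id];     // only loaded when there is work to do
    // ... gather from the neighbours as in step1 ...
    elements[id] = elem;
}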
8a) Magic: use shifting instead of dividing by a power of 2, and do something similar for modulo. Put "f" at the end of floating literals (1.0f instead of 1.0) to avoid automatic conversion from double to float.
8b) Magic-2: the -cl-mad-enable compiler option can increase multiply+add speed.
9) Latency hiding: execution configuration optimization. You need to hide memory access latency and take care of occupancy.
Get the maximum latency, in cycles, of your instructions and of a global memory access.
Then divide the memory latency by the instruction latency.
Now you have the ratio of arithmetic instructions per memory access needed to hide the latency.
If you need N instructions to hide the memory latency and you have only M instructions in your code, then you will need N/M warps (wavefronts) to hide the latency, because a thread on the GPU can do arithmetic while other threads fetch things from memory.
10) Mixed-type computing: after memory access is optimized, swap or move some instructions where applicable to get better occupancy, and use the half type to help floating-point operations where precision is not important.
11) Latency hiding again: try your kernel code with only the arithmetic (comment out all memory accesses and initialize the values with 0 or something you like), then try your kernel code with only the memory access instructions (comment out the calculations/ifs).
Compare the kernel times with the original kernel time. Which one affects the original time more? Concentrate on that.
12) Lane & bank conflicts: correct any LDS lane conflicts and global memory bank conflicts, because accesses to the same address can get serialized and slow the process down (newer cards have a broadcast ability to reduce this).
13) Using registers: try to replace any independent locals with privates, since your GPU can give nearly 10 TB/s of throughput from registers.
14) Not using registers: don't use too many registers, or they will spill to global memory and slow the process down.
15) Minimalistic approach for occupancy: look at local/private usage to get an idea of occupancy. If you use a lot of local and private memory, fewer threads can run on the same compute unit, leading to lower occupancy. Lower resource usage gives a higher chance of good occupancy (if you have enough total threads).
16) Gather/scatter: when the neighbours are different particles (like an n-body NNS) at random addresses in memory it may be hard to apply, but a gather-read optimization can give 2x-3x speed on top of the previous optimizations (it needs the local memory optimization to work): it reads from memory in order instead of randomly and reorders the data as needed in local memory to share between (scatter to) threads.
17) Divide and conquer: in case the buffer is too big and is copied between host and device, making the GPU wait idle, divide it in two, send the halves separately, start computing as soon as one arrives, and send the results back concurrently at the end. Even process-level parallelism can push a GPU to its limits this way. Also, the L2 cache of the GPU may not be enough for the whole data set. This is cache-tiled computing, but done implicitly instead of through direct use of local memory.
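A rough host-side sketch of the two-halves idea (assuming two in-order queues, the kernel arguments already set, a host-side Element mirroring the kernel struct, and that each half only reads neighbours from its own half or gets a halo uploaded with it; all names are illustrative):

#include <CL/cl.h>

static void run_in_two_halves(cl_command_queue q0, cl_command_queue q1,
                              cl_kernel kernel, cl_mem dev_buf,
                              Element *host_ptr, size_t nelements)
{
    cl_command_queue q[2] = { q0, q1 };
    size_t half = nelements / 2;

    for (int h = 0; h < 2; ++h) {
        size_t off_items = (size_t)h * half;             /* element offset */
        size_t off_bytes = off_items * sizeof(Element);  /* byte offset    */
        size_t gsize = half;
        cl_event up, run;

        /* non-blocking upload of this half */
        clEnqueueWriteBuffer(q[h], dev_buf, CL_FALSE, off_bytes,
                             half * sizeof(Element), host_ptr + off_items,
                             0, NULL, &up);
        /* kernel on this half only, starting at the global offset */
        clEnqueueNDRangeKernel(q[h], kernel, 1, &off_items, &gsize, NULL,
                               1, &up, &run);
        /* non-blocking download of this half's results */
        clEnqueueReadBuffer(q[h], dev_buf, CL_FALSE, off_bytes,
                            half * sizeof(Element), host_ptr + off_items,
                            1, &run, NULL);
        clReleaseEvent(up);
        clReleaseEvent(run);
    }
    clFinish(q[0]);
    clFinish(q[1]);
}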
18) Bandwidth from memory qualifiers: when the kernel needs some extra 'read' bandwidth, you can use the '__constant' qualifier (instead of __global) on parameters that are small and read-only. If those parameters are too large for constant memory, you can still get good streaming by declaring the '__global' pointer const (the __read_only/__write_only access qualifiers apply to images rather than buffers). These give mostly hardware-specific performance: on AMD's HD5000 series, constant is good; maybe the GTX 660 is faster through its cache, so const (plus restrict, below) may be more useful, or Nvidia may serve __constant from its constant cache. A signature sketch follows this item.
Have three parts of the same buffer, one as __global const, one as __constant and one as just __global (if setting them up doesn't cost more than the reads gain).
I just tested my card with the AMD APP SDK examples: LDS bandwidth shows 2 TB/s, while constant is 5 TB/s (same-index access instead of linear/random) and main memory is 120 GB/s.
Also, don't forget to add restrict to kernel parameters where possible. This lets the compiler do more optimizations on them (if you are not aliasing them).
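A sketch of what such a kernel signature could look like (parameter names are illustrative; dir_offset stands for any small read-only table):

__kernel void step1_qualified(__global const Element * restrict elements_in,
                              __global Element * restrict elements_out,
                              __constant int * restrict dir_offset,  // small read-only table
                              int nelements)
{
    int id = get_global_id(0);
    if (id >= nelements)
        return;
    Element elem = elements_in[id];
    // ... gather as before, using dir_offset for neighbour lookups ...
    elements_out[id] = elem;
}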
19) On modern hardware, transcendental functions are faster than the old bit-hack versions (like the Quake 3 fast inverse square root).
20) Now there is OpenCL 2.0, which enables spawning kernels inside kernels, so you can further increase the resolution at a 2D grid point and offload it to a workgroup when needed (something like dynamically increasing vorticity detail at the edges of a fluid).
A profiler can help with all of these, but any FPS indicator will do if only a single optimization is applied per step.
Even if the benchmarking is not for architecture-dependent code paths, you could try using a multiple of 192 dots per row in your compute space, since your GPU has a multiple of that many cores, and benchmark whether that keeps the GPU more occupied and yields more floating-point operations per second.
There must still be some room for optimization after all these options, but I don't know whether it would harm your card or be feasible within your project's production time. For example:
21) Lookup tables: when there is 10% of memory bandwidth headroom left but no compute power headroom, offload 10% of the work items to a LUT version that fetches precomputed values from a table. I didn't try it, but something like this should work:
8 compute groups
2 LUT groups
8 compute groups
2 LUT groups
so they are evenly distributed among the "threads in flight" and take advantage of the latency-hiding machinery. I'm not sure whether this is a preferable way of doing science.
22) Z-order pattern: traversing the neighbours in a Z-order (Morton) pattern increases the cache hit rate. A higher cache hit rate leaves some global memory bandwidth for other jobs, so overall performance increases. But this depends on the size of the cache, the data layout and some other things I don't remember.
23) Asynchronous neighbour traversal:
iteration-1: Load neighbor 2 + compute neighbor 1 + store neighbor 0
iteration-2: Load neighbor 3 + compute neighbor 2 + store neighbor 1
iteration-3: Load neighbor 4 + compute neighbor 3 + store neighbor 2
so each loop body has no dependency chain and is fully pipelined on the GPU processing elements, and OpenCL has a special instruction for asynchronously loading/storing global data using all the cores of a workgroup. Check this:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
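A sketch of how that could look here; async_work_group_copy only handles built-in types, so the tile is copied as float4 words, assuming Element is padded to a multiple of 16 bytes (option 3), and the names are illustrative:

#define ELEM_FLOAT4S (sizeof(Element) / sizeof(float4))

__kernel void step1_async(__global Element *elements, int nelements,
                          __local Element *tile)        // one Element per work item
{
    int lsz  = (int)get_local_size(0);
    int base = (int)get_group_id(0) * lsz;

    // the whole work group issues the same copy; it runs in the background
    event_t e = async_work_group_copy((__local float4 *)tile,
                                      (__global const float4 *)(elements + base),
                                      (size_t)lsz * ELEM_FLOAT4S, 0);

    // ... independent work (index math etc.) can overlap with the copy here ...

    wait_group_events(1, &e);                            // tile is now valid

    int gid = base + (int)get_local_id(0);
    if (gid >= nelements)
        return;
    Element elem = tile[get_local_id(0)];
    // ... gather / compute as in step1 ...
    elements[gid] = elem;
}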
Maybe you can even divide the computing part in two and have one part use transcendental functions and the other part use add/multiply, so that the add/multiply operations don't wait for a slow sqrt. If there are at least several neighbours to traverse, this should hide some latency behind the other iterations.
Supposing, for example, that I had a 10x10 "cloth" mesh, with each square being two triangles. Now, if I wanted to animate this, I could do the spring calculations on the CPU. Each vertex would have its own "spring" data and would, hopefully, bounce like whatever type of "cloth" it was supposed to represent.
However, that would involve a minimum of about 380? spring calculations per frame. Happily, the per-vertex calculations are "embarrassingly parallel" - Had I one CPU per vertex, each vertex could be run on a single CPU. GPUs, therefore, are theoretically an excellent choice for running such calculations on.
Except (and this is using DirectX/SlimDX) - I have no idea/am not sure how I would/should:
1) Send all this vertex data to the graphics card (yes, I know how to render stuff and have even written my own per-pixel and texture-blending global lighting effect file; however, it is necessary for each vertex to be able to access the position data of at least three other vertices). I suppose I could stick the relevant vertex positions and number of vertex positions in TextureCoords, but there may be a different, standard solution.
2) Read all the vertex data back afterwards, so I can update the mesh in memory. Otherwise, each update will act on the exact same data with the exact same result, rather like adding 2 + 3 = 5, 2 + 3 = 5, 2 + 3 = 5 when what you want is 2 + 3 = 5 - 2 = 3 + 1.5 = 4.5.
And it may be that I'm looking in the wrong direction to do this.
Thanks.
You could use the approach you have described: pack the data into textures and write special HLSL shaders to compute the spring forces and then update the vertex positions. That approach is totally valid, but it can be troublesome when you try to debug problems, because you are using texture pixels in an unconventional way (you can draw the texture and maybe write some code to watch the values in a given pixel). It is probably easier in the long run to use something like CUDA, DirectCompute, or OpenCL. CUDA allows you to "bind" the DirectX vertex buffer for access in CUDA; then, in a CUDA kernel, you can use the positions to calculate forces and write the new positions back to the vertex buffer (in parallel on the GPU) before you render the updated positions.
There is a cloth demo that uses DirectCompute in the DirectX 10/11 DirectX SDK.