Long latency instruction - optimization

I would like a long-latency, single-uop x86¹ instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using fsqrt, but I'm wondering if there is something better.
Ideally, the instruction will score well on the following criteria:
Long latency
Stable/fixed latency
One or a few uops (especially: not microcoded)
Consumes as few uarch resources as possible (load/store buffers, page walkers, etc)
Able to chain (latency-wise) with itself
Able to chain input and output with GP registers
Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc, resources it consumes)
So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
¹ On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.

Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel, so that leaves FP, where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c 2 uops) and movd r32, xmm (8c 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle 1 uop in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake. (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN)
You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. Probably cheaper to just and with 0 and use movd, unless port-5 pressure is a non-issue.
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm - 1 uop for p0, latency = 15-16, throughput = one per 9-12 cycles.
On Broadwell and earlier, it was 3 uops (2p0 p15), but Skylake I think widened the SIMD divider (in preparation for AVX512 I guess).

vsqrtss might be somewhat better than fsqrt, since it at least allows relatively easy chaining with GP registers (GP <-> vector is just a movd away).
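As a rough illustration of that kind of chain, here is a minimal C sketch using SSE intrinsics (my own example, not from the question; a compiler may constant-fold or otherwise break the dependency chain, so check the generated asm or write the inner loop in asm directly):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    (void)argv;
    /* argc is normally 1, so x starts as the bit-pattern of 0.0f (sqrt(0.0) = 0.0,
     * so the input and therefore the latency stay fixed), but the compiler can't
     * prove that, which discourages it from folding the chain away. */
    uint32_t x = (uint32_t)(argc - 1);
    for (long i = 0; i < 100000000; i++) {
        __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128((int)x)); /* movd xmm, r32 */
        v = _mm_sqrt_ss(v);                                     /* sqrtss: the long-latency link */
        x = (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));   /* movd r32, xmm */
    }
    printf("%u\n", x);   /* keep the result live */
    return 0;
}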


GTX 970 bandwidth calculation

I am trying to calculate the theoretical bandwidth of the GTX 970. As per the specs given at:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock is 7 Gbps
Memory bus width = 256 bits
Bandwidth = 7*256*2/8 (*2 because it is a DDR)
= 448 GB/s
However, in the specs it is given as 224GB/s
Why is there a factor of 2 difference? Am I making a mistake? If so, please correct me.
Thanks
The 7 Gbps seems to be the effective clock, i.e. including the data rate. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account the transfers on both rising and falling edges. The trouble is that your extra factor of 2 for DDR double-counts this.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your formula is correct, but the memory clock is wrong. The GeForce GTX 970's actual memory clock is 1753 MHz (see https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620).
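Putting the corrected numbers together: 1753 MHz × 4 (GDDR5 effectively transfers four times per clock) × 256 bit / 8 = 224.4 GB/s, which matches the quoted 224 GB/s.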

Does multiply by 1.0 take less time than usual multiplications

Do x86 (and x86-64) processors optimise away the multiplication if one of the operands happens to be 1.0?
PS: I do not mean the compiler optimising away a multiplication by a constant 1.0.
That's not something I've seen mentioned in docs about instruction latencies or microarchitectures of Intel or AMD CPUs.
I suspect it doesn't happen, because variable latency would interfere with pipelined execution units. (multiple results coming out of the same execution unit in the same clock cycle = extra complexity). Also, there are probably other bits of logic (uop scheduling / queueing, result forwarding networks) that are designed around every uop having known latency. (except for special cases like division / sqrt).
IIRC, one analyst, maybe Agner Fog or David Kanter, suggested that some uops might have been possible to implement with 2 cycle latency, but instead take 3 cycles to match the other uops that their execution port can handle. So constant latency for operations appears to be a big deal for Intel CPU designs, to the extent that it was worth making an operation slower.
Note that we're only talking about latency here. If your multiply isn't part of a loop-carried dependency chain, or you have enough independent multiplies, you can keep the multiplier(s) going with one operation per clock.
Haswell CPUs can sustain a throughput of 2 FP vector multiplies per clock. (256b vectors of 4 doubles or 8 floats). Latency = 5 clock cycles for the result to be ready, regardless of input. Or 1 vector integer multiply per clock. (The vector multiply ALU is on port 0. The vector FP multipliers are on port 0 and port 1).
Avoid multiplying when you can; it leads to long dependency chains. (Usually this comes up with integer multiplies used to calculate loop indices. Compilers do a lot better when you write your loop to increment the counter by 16, instead of multiplying i++ by 16 as an array index.)
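If you want to check this empirically, a rough sketch (my own, not from the answer): time a latency-bound chain of dependent multiplies with a multiplier of 1.0 versus one very close to 1.0. If the hardware special-cased ×1.0, the first loop would be measurably faster; it isn't. The volatile read hides the multiplier's value from the compiler so the loop isn't folded away.

#include <stdio.h>
#include <time.h>

static double chain(double m, long iters)
{
    double acc = 1.234;
    for (long i = 0; i < iters; i++)
        acc *= m;                 /* each multiply depends on the previous one */
    return acc;
}

int main(void)
{
    const long N = 200000000;
    for (int pass = 0; pass < 2; pass++) {
        volatile double mult = pass ? 1.0000001 : 1.0;  /* value hidden from the optimizer */
        clock_t t0 = clock();
        double r = chain(mult, N);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("multiplier %.7f: %.3f s (result %g)\n", (double)mult, secs, r);
    }
    return 0;
}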

SCrypt Lookup Gap Negative Effect

I'm developing a Litecoin miner for a processor that has only 32 KB of internal memory. So I was looking at the scrypt algorithm; for Litecoin it uses N = 1024, which gives approximately 2^10 * 1 * 128 bytes = 128 KB of memory use.
So I was looking into GPU algorithms that have the lookup-gap parameter. For reading I'm using the Kepler code from CudaMiner:
https://github.com/cbuchner1/CudaMiner/blob/master/kepler_kernel.cu (Line 535)
So I understand that the lookup gap is a tradeoff between CPU and memory: the higher it is, the higher my CPU use and the lower my memory use. What I didn't understand is how exactly it works.
In the code I have
int pos = c_N_1/LOOKUP_GAP, loop = 1 + (c_N_1-pos*LOOKUP_GAP);
That will make it store/look up only every LOOKUP_GAP-th scratchpad entry (if it's 2: 0, 2, 4, 6, 8, 10, ...), but where does the extra CPU use of the algorithm come from?
My implementation will not be highly optimized; the goal is just to get it running.
I also saw an FPGA implementation that uses interpolation ( https://github.com/kramble/FPGA-Litecoin-Miner ); this is even stranger to me. I don't know how they could interpolate the values in the scratchpad.
Thanks!
The increased CPU usage comes when you do not hit a pre-calculated entry. With a lookup gap of 2 you still calculate entries 0-1023, but you only store 0, 2, 4, etc. So if you need the data for scratchpad entry 3, you have to recalculate it on the fly from the data of entry 2. This is extra calculation versus having every entry stored. As the lookup gap increases, the number of on-the-fly calculations you do increases.
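A minimal C sketch of that idea (the names GAP, mix_block and load_entry are mine, and mix_block is only a stand-in for scrypt's real BlockMix/Salsa20-8 step; the point is only the store-every-GAP-th-entry plus recompute-on-the-fly structure):

#include <string.h>

#define GAP 2      /* the lookup gap */
#define N   1024   /* scrypt N for Litecoin */

/* Placeholder for scrypt's mixing step over a 128-byte block;
 * the real miner's BlockMix/Salsa20-8 function goes here. */
static void mix_block(unsigned char block[128])
{
    for (int j = 0; j < 128; j++)
        block[j] ^= (unsigned char)(block[(j + 1) & 127] + j);
}

/* scratch[] holds only N/GAP entries: stored entry k is scratchpad index k*GAP.
 * To read scratchpad index i, start from the nearest stored entry at or below i
 * and re-apply the mixing step; that re-mixing is the extra CPU work that grows
 * with the gap. */
static void load_entry(unsigned char out[128],
                       const unsigned char scratch[N / GAP][128], int i)
{
    int stored = i / GAP;          /* nearest stored entry at or below i */
    int extra  = i - stored * GAP; /* how many steps must be recomputed  */
    memcpy(out, scratch[stored], 128);
    while (extra--)
        mix_block(out);
}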

How to optimize OpenCL code for neighbors accessing?

Edit: Proposed solutions results are added at the end of the question.
I'm starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each element has a bunch of information (around 200 bytes). Every step, every element accesses its neighbors' information and accumulates it to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create a 1-dimensional OpenCL buffer and fill it with structs representing the elements, which have an "int neighbors[6]" field where I store the indices of the neighbors in the buffer. I launch a kernel that consults the neighbors and accumulates their information into element variables not consulted in this step, and then I launch another kernel that uses these variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct {
    float4 var1;
    float4 var2;
    float4 nextStepVar1;
    int neighbors[8];
    int var3;
    int nextStepVar2;
    bool var4;
} Element;

__kernel void step1(__global Element *elements, int nelements) {
    int id = get_global_id(0);
    if (id >= nelements) {
        return;
    }
    Element elem = elements[id];
    for (int i = 0; i < 6; ++i) {
        if (elem.neighbors[i] != -1) {
            // Gather information from the neighbor and accumulate it in elem.nextStepVars
        }
    }
    elements[id] = elem;
}

__kernel void step2(__global Element *elements, int nelements) {
    int id = get_global_id(0);
    if (id >= nelements) {
        return;
    }
    Element elem = elements[id];
    // update elem variables by using elem.nextStepVariables
    // reset elem.nextStepVariables
    elements[id] = elem;
}
Right now, my OpenCL implementation takes basically the same time as my C++ implementation.
So, the question is: How would you (the experts :P) address this problem?
I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Edit: Third and fifth optimizations proposed by Huseyin Tugrul were already in the code. As mentioned here, to make structs behave properly, they need to satisfy some restrictions, so it is worth understanding that to avoid headaches.
Edit 1: Applying the seventh optimization proposed by Huseyin Tugrul performance increased from 7 fps to 60 fps. In a more general experimentation, the performance gain was about x8.
Edit 2: Applying the first optimization proposed by Huseyin Tugrul performance increased about x1.2 . I think that the real gain is higher, but hides because of another bottleneck not yet solved.
Edit 3: Applying the 8th and 9th optimizations proposed by Huseyin Tugrul didn't change performance, because of the lack of significant code taking advantage of these optimizations, worth trying in other kernels though.
Edit 4: Passing invariant arguments (such as n_elements or the workgroup size) to the kernels as #define constants instead of kernel args, as mentioned here, increased performance by around x1.33. As explained in the document, this is because of the aggressive optimizations that the compiler can do when it knows the values at compile time.
Edit 5: Applying the second optimization proposed by Huseyin Tugrul, but using 1 bit per neighbor and using bitwise operations to check if a neighbor is present (so, if (neighbors & 1) != 0, the top neighbor is present; if (neighbors & 2) != 0, the bottom neighbor is present; if (neighbors & 4) != 0, the right neighbor is present; etc.), increased performance by a factor of x1.11. I think this was mostly because of the data-transfer reduction, since data movement was, and still is, my bottleneck. Soon I will try to get rid of the dummy variables used to add padding to my structs.
Edit 6: By eliminating the structs that I was using, and creating separated buffers for each property, I eliminated the padding variables, saving space, and was able to optimize the global memory access and local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing this, despite the programmatic complexity and unreadability.
According to your step1 and step2, you are not making your gpu core work hard. What is your kernel's complexity? What is your gpu usage? Did you check with monitoring programs like afterburner? Mid-range desktop gaming cards can get 10k threads each doing 10k iterations.
Since you are working with only nearest neighbours, the ratio of data size to calculation is high and your kernels may be bottlenecked by VRAM bandwidth. Your main system RAM may also be only as fast as your PCIe bandwidth, and that could be the issue.
1) Use of Dedicated Cache: get the thread's own grid cell into private registers (that is fastest), then the neighbours into a __local array, so the comparisons/calculations are done only on-chip (a kernel sketch follows the pseudocode below).
Load current cell into __private
Load neighbours into __local
start looping for local array
get next neighbour into __private from __local
compute
end loop
(if a cell has many neighbours, the lines after "Load neighbours into __local" can be in another loop that fetches from main memory in batches)
What is your GPU? Nice, it is a GTX 660: you should have 64 kB of controllable cache per compute unit. CPUs have only about 1 kB of registers, and they are not addressable for array operations.
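A minimal sketch of this pattern applied to the step1 kernel above (TILE, and the assumption that a workgroup covers a contiguous run of cells so most neighbours fall inside the tile, are mine; the accumulation body is elided as in the original):

#define TILE 64

__kernel void step1_tiled(__global Element *elements, int nelements)
{
    __local Element tile[TILE];
    int gid  = get_global_id(0);
    int lid  = get_local_id(0);
    int base = get_group_id(0) * TILE;

    if (gid < nelements)
        tile[lid] = elements[gid];           /* cooperative load of the tile */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid >= nelements)
        return;

    Element elem = tile[lid];                /* current cell in __private */
    for (int i = 0; i < 6; ++i) {
        int n = elem.neighbors[i];
        if (n == -1)
            continue;
        Element nb;
        if (n >= base && n < base + TILE)
            nb = tile[n - base];             /* neighbour already on-chip */
        else
            nb = elements[n];                /* out-of-tile: read __global */
        /* ... accumulate nb into elem.nextStepVar1 / nextStepVar2 ... */
    }
    elements[gid] = elem;
}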
2) Shorter Indexing: use a single byte as the stored neighbour index instead of an int. Saving precious L1 cache space on "id" fetches is important, so that other threads can hit the L1 cache more often!
Example:
0=neighbour from left
1=neighbour from right
2=neighbour from up
3=neighbour from down
4=neighbour from front
5=neighbour from back
6=neighbour from upper left
...
...
So you can derive the neighbour index from a single byte instead of a 4-byte int, which decreases main-memory traffic at least for neighbour accesses. Your kernel derives the neighbour index from the table above using its compute power, not memory bandwidth, because it is done from core registers (__private). If your total grid size is constant, this is very easy: just add 1 to the actual cell id, add 256 to it, or add 256*256 to it, and so on (see the snippet below).
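For example, an OpenCL C helper along these lines (GRID = 256 and the helper name are assumptions for a fixed GRID x GRID x GRID layout):

#define GRID 256

/* 0=left, 1=right, 2=up, 3=down, 4=front, 5=back */
int neighbour_index(int id, uchar dir)
{
    const int off[6] = { -1, +1, -GRID, +GRID, -GRID * GRID, +GRID * GRID };
    return id + off[dir];
}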
3) Optimum Object Size: make your struct/cell-object size a multiple of 4 bytes. If your total object size is around 200 bytes, pad it or augment it with some empty bytes to make it exactly 200, 220 or 256 bytes.
4) Branchless Code (Edit: it depends!): use fewer if-statements. Using if-statements can make computation much slower, because lightweight GPU cores are not as capable as heavyweight CPU cores. Rather than checking for -1 as an invalid neighbour index, you can use another way: add surface-buffer cells to wrap the surface, so that computed cells always have 6 neighbours and you get rid of if (elem.neighbors[i] != -1). Worth a try, especially on a GPU.
Just computing all neighbours is faster than doing the if-statement: simply multiply the contribution by zero when it is not a valid neighbour. How can we know that it is not a valid neighbour? By using a byte array of 6 elements per cell (parallel to the neighbour id array) with invalid = 0, valid = 1, and multiplying the result by it (see the sketch after this item).
The if-statement is inside a loop that iterates six times. Loop unrolling gives a similar speed-up if the workload in the loop is relatively light.
But if all threads within the same warp go into the same if-or-else branch, they don't lose performance, so this depends on whether your code actually diverges.
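A minimal branchless sketch of step1's inner loop (the accumulation of var1 into nextStepVar1 is a stand-in I picked, and here the existing -1 test is reused as a 0/1 weight instead of a separate stored byte array):

__kernel void step1_branchless(__global Element *elements, int nelements)
{
    int id = get_global_id(0);
    if (id >= nelements)
        return;

    Element elem = elements[id];
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    for (int i = 0; i < 6; ++i) {
        int   n     = elem.neighbors[i];
        float valid = (float)(n != -1);      /* 1.0f if the neighbour exists, else 0.0f */
        int   safe  = max(n, 0);             /* never index elements[-1]                */
        acc += valid * elements[safe].var1;  /* contributes nothing when invalid        */
    }
    elem.nextStepVar1 += acc;
    elements[id] = elem;
}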
5) Data Elements Reordering: you can move the int[8] member to the top of the struct, so that memory accesses coalesce better and the smaller members towards the bottom can be read in a single read operation.
6) Size of Workgroup: trying different local workgroup sizes can give 2-3x performance. Starting from 16 and going up to 512 gives different results. For example, AMD GPUs like integer multiples of 64 while NVIDIA GPUs like integer multiples of 32; Intel does fine from 8 upwards, since it can meld multiple compute units together to work on the same workgroup.
7) Separation of Variables (only if you can't get rid of the if-statements): separate the comparison fields from the struct. This way you don't need to load a whole struct from main memory just to compare an int or a boolean. Only when the comparison passes do you load the struct from main memory (if you already have the local-memory optimization, put this check before it, so that loading into local memory is only done for the selected neighbours).
This optimisation makes the best case (no neighbour, or only one neighbour) considerably faster. It does not affect the worst case (maximum number of neighbours).
8a) Magic: use shifting instead of dividing by a power of 2, and masking instead of modulo. Put "f" at the end of floating-point literals (1.0f instead of 1.0) to avoid automatic promotion to double and conversion back to float (see the small helpers below).
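In practice (generic helpers, not the questioner's code):

uint div8(uint i)    { return i >> 3; }     /* i / 8 for a power-of-two divisor */
uint mod8(uint i)    { return i & 7u; }     /* i % 8                            */
float halve(float x) { return x * 0.5f; }   /* 0.5f, not 0.5 (a double literal) */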
8b) Magic-2: the -cl-mad-enable compiler option can increase multiply+add speed.
9) Latency Hiding: execution-configuration optimization. You need to hide memory access latency and take care of occupancy.
Get maximum cycles of latency for instructions and global memory access.
Then divide memory latency by instruction latency.
Now you have the ratio of: arithmetic instruction number per memory access to hide latency.
If you need N instructions to hide the memory latency and you have only M instructions in your code, then you will need N/M warps (wavefronts) in flight to hide the latency, because a GPU thread can do arithmetic while other threads are fetching things from memory.
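As a worked example with illustrative numbers (not measured on this GPU): if a global memory access costs about 400 cycles and an arithmetic instruction about 4 cycles, then N = 400 / 4 = 100 instructions are needed to cover one access; a kernel with M = 20 arithmetic instructions per access then needs roughly N / M = 5 warps/wavefronts in flight per scheduler to keep the ALUs busy.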
10) Mixed Type Computing: after memory access is optimized, swap or move some instructions where applicable to get better occupancy, and use the half type to help floating-point operations where precision is not important.
11) Latency Hiding again: try your kernel with only the arithmetic (comment out all memory accesses and initialize the values with 0 or something you like), then try your kernel with only the memory access instructions (comment out the calculations/ifs).
Compare both kernel times with the original kernel time. Which is affecting the original time more? Concentrate on that.
12) Lane & Bank Conflicts: correct any LDS lane conflicts and global memory bank conflicts, because accesses to the same address/bank can be serialized, slowing the process (newer cards have a broadcast ability to reduce this).
13) Using registers: try to replace any independent locals with privates, since your GPU can give nearly 10 TB/s of throughput from registers.
14) Not Using Registers: don't use too many registers, or they will spill to global memory and slow the process.
15) Minimalistic Approach for Occupancy: look at local/private usage to get an idea of occupancy. If you use a lot of local and private memory, fewer threads can be resident on the same compute unit, leading to lower occupancy. Less resource usage gives a higher chance of full occupancy (if you have enough total threads).
16) Gather/Scatter: when the neighbours are different particles (as in an n-body neighbour search) at random addresses in memory this may be hard to apply, but a gather-read optimization can give 2x-3x speed on top of the earlier optimizations (it needs the local-memory optimization to work): read from memory in order instead of randomly, then reorder the data as needed in local memory to share it between (scatter to) threads.
17) Divide and Conquer: when the buffer is too big and is copied between host and device, making the GPU wait idle, divide it in two, send the halves separately, start computing as soon as one arrives, and send the results back concurrently at the end. Even process-level parallelism can push a GPU to its limits this way. Also, the GPU's L2 cache may not be enough for the whole data set; this is cache-tiled computing done implicitly, instead of using local memory directly.
18) Bandwidth from memory qualifiers: when a kernel needs some extra read bandwidth, you can use the '__constant' keyword (instead of __global) on parameters that are small and read-only. If those parameters are too large, you can still get good streaming from the '__read_only' qualifier (after the '__global' qualifier). Similarly, '__write_only' increases throughput, but these give mostly hardware-specific performance. On AMD's HD5000 series, constant is good. Maybe the GTX 660 is faster with its cache, so __read_only may be more usable (or does Nvidia use the cache for __constant?).
Have three parts of the same buffer, one as __global __read_only, one as __constant and one as just __global (if building them doesn't cost more than the reads' benefit).
I just tested my card using the AMD APP SDK examples: LDS bandwidth shows 2 TB/s while constant is 5 TB/s (same-index access rather than linear/random) and main memory is 120 GB/s.
Also don't forget to add restrict to kernel pointer parameters where possible. This lets the compiler do more optimizations on them (if you are not aliasing them). See the sketch below.
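An illustrative kernel (the names and the trivial math are mine): a small read-only table goes in __constant, and the big buffers stay __global with const/restrict so the compiler knows they don't alias.

__kernel void scale(__global const float * restrict in,
                    __global float * restrict out,
                    __constant float *coeffs,    /* small, read-only table */
                    int n)
{
    int id = get_global_id(0);
    if (id < n)
        out[id] = in[id] * coeffs[id & 7];   /* assumes an 8-entry table */
}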
19) On modern hardware, transcendental functions are faster than old bit-hack versions (like the Quake 3 fast inverse square root).
20) There is now OpenCL 2.0, which enables spawning kernels inside kernels, so you can locally increase the resolution of a 2D grid and offload that to a workgroup when needed (something like dynamically increasing vorticity detail at the edges of a fluid).
A profiler can help for all those, but any FPS indicator can do if only single optimization is done per step.
Even if the benchmarking is not for architecture-dependent code paths, you could try making the number of points per row in your compute space a multiple of 192, since your GPU has a multiple of that number of cores, and benchmark whether it keeps the GPU more occupied and yields more GFLOPS.
There must still be some room for optimization after all these options, but I don't know whether it is safe for your card or feasible within the production time of your project. For example:
21) Lookup tables: when there is 10% of memory-bandwidth headroom left but no compute headroom, offload 10% of the work-items to a LUT version that gets precomputed values from a table. I didn't try it, but something like this should work:
8 compute groups
2 LUT groups
8 compute groups
2 LUT groups
so they are evenly distributed among the threads in flight and take advantage of the latency hiding. I'm not sure whether this is a preferable way of doing science.
22) Z-order pattern: traversing neighbours in a Z-order (Morton) pattern increases the cache hit rate. A higher cache hit rate saves global memory bandwidth for other jobs, so overall performance increases. But this depends on the size of the cache, the data layout and some other things I don't remember.
23) Asynchronous Neighbor Traversal
iteration-1: Load neighbor 2 + compute neighbor 1 + store neighbor 0
iteration-2: Load neighbor 3 + compute neighbor 2 + store neighbor 1
iteration-3: Load neighbor 4 + compute neighbor 3 + store neighbor 2
so each loop body has no dependency chain and is fully pipelined on the GPU's processing elements, and OpenCL has built-in functions for asynchronously loading/storing global data using all work-items of a workgroup. Check this:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
Maybe you can even divide the computing part in two, and have one part use transcendental functions and the other part use add/multiply, so that the add/multiply operations don't wait for a slow sqrt. If there are at least several neighbors to traverse, this should hide some latency behind the other iterations.
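A minimal sketch of the async copy builtin (the buffer names, CHUNK and the trivial stand-in compute are mine): the whole workgroup cooperatively copies a chunk into __local, waits once, then works on it. A double-buffered version would issue the copy for chunk c+1 before computing on chunk c, as in the iteration scheme above.

#define CHUNK 128

__kernel void copy_then_compute(__global const float * restrict src,
                                __global float * restrict dst,
                                int nchunks)
{
    __local float buf[CHUNK];
    int lid = get_local_id(0);

    for (int c = get_group_id(0); c < nchunks; c += get_num_groups(0)) {
        event_t ev = async_work_group_copy(buf, src + (size_t)c * CHUNK,
                                           CHUNK, 0);
        wait_group_events(1, &ev);                        /* chunk is now in __local */
        for (int i = lid; i < CHUNK; i += get_local_size(0))
            dst[(size_t)c * CHUNK + i] = buf[i] * 2.0f;   /* stand-in compute        */
        barrier(CLK_LOCAL_MEM_FENCE);                     /* done with buf; reuse it */
    }
}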

Calculating the maximum physical rate (Nyquist performance limitation) of an ADC onboard a microcontroller

I'm trying to evaluate the maximum physical rate (Nyquist performance limit) of the A/Ds integrated on board various PIC microcontrollers.
However, to do the calculation requires parameters that I'm not finding explicitly stated in the datasheets, specifically Tacq, Fosc, TAD, and divisor parameters.
I've proceeded by making some assumptions but would be helpful to have a sanity check -- am I doing the maximum physical rate calculations correctly?
For illustration purposes only, I've taken the simplest possible PIC10F220 that has an ADC. This is to focus specifically on the interpretation of Tacq, Fosc, TAD, and divisor parameters, and not to suggest that any practical functionality could be implemented on this very basic chip. (This is to Clifford's points in the comments below.)
Calculation:
Nyquist Performance Analysis of PIC10F220
- Runs at clock speed of 8MHz.
- Has an instruction cycle of 0.5us [4 clock steps per instruction]
So:
- Get Tacq = 6.06 us [acquisition time for ADC, assuming chip temp. = 50 °C]
[from datasheet p34]
- Set Fosc := 8MHz [? should this be internal clock speed ?]
- Set divisor := 4 [? assuming this is 4 from 4 clock steps per CPU instruction ?]
- This gives TAD = 0.5us [TAD = 1/(Fosc/divisor) ]
- Get conversion time is 13*TAD [from datasheet p31]
- This gives conversion time 6.5 us
- So ADC duration is 12.56 us [? Tacq + 13*TAD]
Assuming 10 instructions for a simple load/store/threshold done in real-time before the next sample (this is just a stub -- the point is the rest of the calculation):
- This adds another 5 us [0.5 us per instruction]
- To give total ADC and handling time of 17.56 us [ 12.56 us + 5 us ]
- before the sampling loop repeats [? Again Tacq ? + 13*TAD + handling ]
- If this is correct, then the max sampling rate is 56.9 ksps [ 1/ total time ]
- So the Nyquist frequency for this sampling rate is 28 kHz. [1/2 sampling rate]
Which means the (theoretical) performance of this system --- chip's A/D with the hypothetical real-time handling code --- is for signals that are bandlimited to 28 kHz.
Is this a correct assignment / interpretation of the data sheet in obtaining Tacq, Fosc, TAD, and divisor parameters and using them to obtain the maximum physical rate, or Nyquist performance limit, of this chip?
Thanks,
You're not going to be able to do much processing in 8 instructions, but assuming you're just doing something simple like storing the incoming samples to a buffer, or detecting a threshold, then your analysis looks good.
The actual chips I'm considering for the design are the dsPIC33FJ128MC804 (with 16b A/D) or dsPIC30F3014 (with 12b A/D).
That is an important distinction; the dsPIC ADC supports ping-pong DMA transfers of multiple channels simultaneously, so it can minimise the effective software overhead per sample. That makes the calculation a somewhat different one. You need to determine, from the sample rate and the DMA buffer size, the time between sample-buffer interrupts; that is how much processing time you have to deal with each buffer. If you are using Microchip's DSP library, it gives precise cycle-time formulae for each algorithm, and block processing is considerably more efficient than sample-by-sample processing.
My last project was on a dsPIC33 with two channels sampled at 48 kHz and 32-word sample buffers (giving 667 us to process each pair of buffers). The software processing was therefore entirely independent of the sampling, since by using DMA they take place simultaneously.