Is there a relation between single and double precision in NVIDIA Tesla? - gpu

In the model Tesla K20 the peak single-precision floating point performance is about 3.52 TFlops but the double-precision is 1.17 TFlops,so the ratio is 3. The Tesla K20X has 3.95 and 1.31, and Tesla K40 has 4.29 and 1.43 TFlops, the ratio seems to repeat. My question is if there is a reason to the ratio be 3 and not 2, that seems logical to me because the difference between single and double precision. I am learning about GPUS and GPGPUS, so i don't know very much about it.
In the second page of this pdf there is a specs table.
NVIDIA-Tesla-Kepler-Family-Datasheet.pdf

The models you listed are all based on Kepler architecture, which has peak double precision rate equal to 1/3 of peak single precision rate. This is the way NVIDIA has built this piece of hardware. For comparison, Fermi, which is the previous hardware generation, had the ratio of 1/2 between peak double and single precision rate.
You may refer to NVIDIA documentation for instruction throughput, by instruction type and hardware generation:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput
You will notice that consumer-grade products (GeForce GTX) typically have much lower double-to-single precision rate - 1/8, 1/12, 1/24, and even 1/32, depending on hardware version.

Related

Performance drop in matrix multiplication for certain sizes on AMD Polaris

I have an OpenCL code that multiplies 2 matrices (GEMM) with M=4096, N=4096 and K=16. (i.e. matrices 4096 x 16 floats)
I run it on Polaris 560, 16CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed very strange performance drops for this size, matrix multiplication with this size has ~8-10 GFlops performance while if I change N to 4095 or 4097 I'm getting around 130-150Gflops. I notices similar behaviour with other GEMM libraries like clblas or miopengemm - I'm getting significant performance drop for this particular size of 4096x16 and changing N by 1 boosts the performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (8x8 block per threads).
I tried changing matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
I tested same code with ROCm 3.7 OpenCL, Clover OpenCL and even with Windows OpenCL driver - same behavior.
There is no such issue with nvidia gtx 960 having same number of gpu cores (threads) and same memory type/size.
I suspect that this is somehow cache/collision related but I don't understand how it happens. Thus I don't know how to work-around it.
Finally I found that clBlas library (developed for AMD originally) handles special case of lda % 1024==0, ldb % 1024==0 probably due to cache
https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
I found that the better way was to rearrange blocks in z-curve order instead of queuing several kernels.
https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
To handle cases M!=N or M != 1<<n I just increased number of work groups on M/N to neares 1<<n and groups that don't have jobs exit in the begging not adding too much overhead.
z-order improved performance x4 times.

Long latency instruction

I would like a long-latency single-uop x861 instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using fsqrt, but I'm wondering is there is something better.
Ideally, the instruction will score well on the following criteria:
Long latency
Stable/fixed latency
One or a few uops (especially: not microcoded)
Consumes as few uarch resources as possible (load/store buffers, page walkers, etc)
Able to chain (latency-wise) with itself
Able to chain input and out with GP registers
Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc, resources it consumes)
So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
1 On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.
Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel so yes, use FP where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c 2 uops) and movd r32, xmm (8c 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle 1 uop in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake. (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN)
You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. Probably cheaper to just and with 0 and use movd, unless port-5 pressure is a non-issue.
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm - 1 uop for p0, latency = 15-16, throughput = 9-12.
On Broadwell and earlier, it was 3 uops (2p0 p15), but Skylake I think widened the SIMD divider (in preparation for AVX512 I guess).
vsqrtss might be somewhat better than fsqrt since it at least satisfies relatively easy chaining with GP registers (since GP <-> vector is just a movd away).

GTX 970 bandwidth calculation

I am trying to calculate the theoretical bandwidth of gtx970. As per the specs given in:-
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock is 7Gb/s
Memory bus width = 256
Bandwidth = 7*256*2/8 (*2 because it is a DDR)
= 448 GB/s
However, in the specs it is given as 224GB/s
Why is there a factor 2 difference? Am i making a mistake, if so please correct me.
Thanks
The 7 Gbps seems to be the effective clock, i.e. including the data rate. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account the transfer on both rising and falling edges. The trouble is the factor of 2 for DDR.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your format is correct, but the memory clock is wrong. GeForce GTX 970's memory clock is 1753MHz(refers to https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620).

Precision of up to 1 gram

I know what precision means. But what does it mean to have a precision of up to 1 gram when talking about weighing machines? Does it mean if actual weight is 100 grams, it may show 99 grams, 100 next time and 101 the third time?
Thanks in advance
It means that the different values you would get, differs by atmost 1 gram.
The key here is the difference between precision and accuracy
If you fired 100 arrows, high precision would mean each arrow falls on the same point, regardless of whether that point is the bullseye. In other words, it refers to the variability of the distribution
Accuracy, on the other hand, is whether the mean/center of that distribution is located around your intended target (the bullseye in the arrow example)
High precision + low accuracy would mean each arrow will be tightly grouped, but wont necessarily be grouped at the target
Low precision + high accuracy means your arrows will be distributed widely in an area, but the center of that area/distribution will be the intended target/bullseye
So in your weighing machine example, precision of 1 gram refers to variability of all your weighings. The lower the number, the more consistent the measurement

Does multiply by 1.0 take less time than usual multiplications

Does x86(64 too) processor optimise away the multiplication if one of the operands of multiplication happens to be 1.0?
PS:I do not mean compiler optimising a constant multiplication of 1.0.
That's not something I've seen mentioned in docs about instruction latencies or microarchitectures of Intel or AMD CPUs.
I suspect it doesn't happen, because variable latency would interfere with pipelined execution units. (multiple results coming out of the same execution unit in the same clock cycle = extra complexity). Also, there are probably other bits of logic (uop scheduling / queueing, result forwarding networks) that are designed around every uop having known latency. (except for special cases like division / sqrt).
IIRC, one analyst, maybe Agner Fog or David Kanter, suggested that some uops might have been possible to implement with 2 cycle latency, but instead take 3 cycles to match the other uops that their execution port can handle. So constant latency for operations appears to be a big deal for Intel CPU designs, to the extent that it was worth making an operation slower.
Note that we're only talking about latency here. If your multiply isn't part of a loop-carried dependency chain, or you have enough independent multiplies, you can keep the multiplier(s) going with one operation per clock.
Haswell CPUs can sustain a throughput of 2 FP vector multiplies per clock. (256b vectors of 4 doubles or 8 floats). Latency = 5 clock cycles for the result to be ready, regardless of input. Or 1 vector integer multiply per clock. (The vector multiply ALU is on port 0. The vector FP multipliers are on port 0 and port 1).
Avoid multiplying when you can, it leads to long dependency chains. (Usually this comes up for integer multiplies to calculate loop indices. Compilers do a lot better when you write your loop to increment the counter by 16, instead of multiplying i++ by 16 as an array index.)