Impact of disabled branch prediction on AMD Ryzen, regarding SPECTRE - branch

AMD states here that the are not affected by Meltdown (i.e. Variant 3), likely not by Variant 2 but by Variant 1 (both SPECTRE). Therefore AMD has released an Microcodeupdate for FAM 17H (i.e. Ryzen) and switched OFF branch prediction. What kind of branch prediction is here turned OFF?
Sadly I only own a mobile SandyBridge* and an AMD Phenomen X4 905e (the updated microcode is only for Ryzen). Can some execute benchmarks like here with a sorted array?
The mentioned update from SUSE just speaks about branch prediction which could mean either Variant 1 (if/else) or Variant 2 (branch jump buffer). Also the people from Google use the term branch prediction for Variant 1 and Variant 2. So it is ambigious what actually was turned OFF.
Why should AMD turn OFF Variant 2 (branch jump buffer), if AMD is likely not affected, due a different implementation. Or is it Variant 1 (if/else), which should slow down probably a lot of work cases dramatically. A pre sorted array is a likely usage case, where branch predicition improves performance.
Thanks
*Add here some finnish profanity...

Seems like AMD has Indirect Branch Restricted Speculation (ibrs) now in usage, restriced but not turned off completely.
AMD Defaults:
Due to the differences in underlying hardware implementation, AMD X86 systems are not vulnerable to variant #3. The correct default values will be set on AMD hardware based on dynamic checks during the boot sequence.
pti 0 ibrs 0 ibpb 2 -> fix variant #1 #2 if the microcode update is applied
pti 0 ibrs 2 ibpb 1 -> fix variant #1 #2 on older processors that can disable indirect branch prediction without microcode updates
https://access.redhat.com/articles/3311301
So the notice from SUSE isn't really correct.

Related

Performance drop in matrix multiplication for certain sizes on AMD Polaris

I have an OpenCL code that multiplies 2 matrices (GEMM) with M=4096, N=4096 and K=16. (i.e. matrices 4096 x 16 floats)
I run it on Polaris 560, 16CU GPU.
Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl
I noticed very strange performance drops for this size, matrix multiplication with this size has ~8-10 GFlops performance while if I change N to 4095 or 4097 I'm getting around 130-150Gflops. I notices similar behaviour with other GEMM libraries like clblas or miopengemm - I'm getting significant performance drop for this particular size of 4096x16 and changing N by 1 boosts the performance several times.
The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (8x8 block per threads).
I tried changing matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.
I tested same code with ROCm 3.7 OpenCL, Clover OpenCL and even with Windows OpenCL driver - same behavior.
There is no such issue with nvidia gtx 960 having same number of gpu cores (threads) and same memory type/size.
I suspect that this is somehow cache/collision related but I don't understand how it happens. Thus I don't know how to work-around it.
Finally I found that clBlas library (developed for AMD originally) handles special case of lda % 1024==0, ldb % 1024==0 probably due to cache
https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
I found that the better way was to rearrange blocks in z-curve order instead of queuing several kernels.
https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
To handle cases M!=N or M != 1<<n I just increased number of work groups on M/N to neares 1<<n and groups that don't have jobs exit in the begging not adding too much overhead.
z-order improved performance x4 times.

SLSQP in ScipyOptimizeDriver only executes one iteration, takes a very long time, then exits

I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6
You asked two questions whos answers ended up being unrelated to eachother:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient free method. SLSQP uses gradients. So the solid bet was that slow down happened when SLSQP asked for the derivatives (which COBYLA never did).
Thats where I went to look first. Computing derivatives happens in two steps: a) compute partials for each component and b) solve a linear system with those partials to compute totals. The slow down has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 of your model and saw this:
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6x1e6. Thats pretty big, and you are using a DirectSolver to try and compute/store a factorization of it. Thats the source of the slow down. Using DirectSolvers is great for smaller models (rule of thumb, is that the output vector should be less than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use a much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant. Make your N2 look like this instead:
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and provide that for direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally, and use other linear solvers (sometimes in conjunction with the direct solver--- see Section 5.3 of OpenMDAO paper, and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
Regarding the issue of memory allocation. I borrowed this image from the openmdao docs
2) Why does SLSQP stop after one iteration?
Gradient based optimizations are very sensitive to scaling. I ploted your objective function inside your allowed design space and got this:
So we can see that the minimum is at about 6 degrees, but the objective values are TINY (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient based optimization. Starting by scaling your objective/constraints to order-1 is a decent rule of thumb, but its common that you need to adjust things beyond that for more complex problems.

Is there any way to make Z3 use multiple cores (multithreaded) for big problems?

I am attempting to move from a commercial solver to Z3 for large integer satisfiability problem. By "large" I mean that the model I am trying to solve has on the order of 300,000 integers and 300,000 (assert (=... statements, each with a combination of perhaps 8-16 variables.
Our commercial solver took 1353 seconds to solve the big problem. Our commercial solver is actually an optimizer and this was solved as a mixed integer optimization problem. The problem transformed into an integer problem with 5,093,121 variables, 9901 constraints, 63,450,472 zeros, 5,093,120 integers, and it was solved in 4690 iterations. However, it was a simple SAT problem, so I'm hoping to move this to Z3 and ditch the commercial optimizer.
As I indicated, the commercial optimizer took 1353 seconds, but it was also allowed to use 32 cores and indications are that I used many of them (I didn't track how many cores it ended up using).
I would like Z3 to be able to use multiple cores. At the present time it doesn't seem that it does. Is there any way to make it do so? Failing that, is there another SMT solver that will?
Z3 does support parallel processing, see: https://theory.stanford.edu/~nikolaj/programmingz3.html#sec-parallel-z3
Parameters can be set on the command line. So to make z3 use 4 threads and process file solve.z3, use:
z3 parallel.enable=true parallel.threads.max=4 solve.z3
Note that if parallel.enable is set to true, Z3 will default to the number of processors.
Unfortunately this feature is rather poorly documented. Please do report your findings if you try it out!

Long latency instruction

I would like a long-latency single-uop x861 instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using fsqrt, but I'm wondering is there is something better.
Ideally, the instruction will score well on the following criteria:
Long latency
Stable/fixed latency
One or a few uops (especially: not microcoded)
Consumes as few uarch resources as possible (load/store buffers, page walkers, etc)
Able to chain (latency-wise) with itself
Able to chain input and out with GP registers
Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc, resources it consumes)
So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
1 On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.
Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel so yes, use FP where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c 2 uops) and movd r32, xmm (8c 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle 1 uop in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake. (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN)
You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. Probably cheaper to just and with 0 and use movd, unless port-5 pressure is a non-issue.
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm - 1 uop for p0, latency = 15-16, throughput = 9-12.
On Broadwell and earlier, it was 3 uops (2p0 p15), but Skylake I think widened the SIMD divider (in preparation for AVX512 I guess).
vsqrtss might be somewhat better than fsqrt since it at least satisfies relatively easy chaining with GP registers (since GP <-> vector is just a movd away).

Does multiply by 1.0 take less time than usual multiplications

Does x86(64 too) processor optimise away the multiplication if one of the operands of multiplication happens to be 1.0?
PS:I do not mean compiler optimising a constant multiplication of 1.0.
That's not something I've seen mentioned in docs about instruction latencies or microarchitectures of Intel or AMD CPUs.
I suspect it doesn't happen, because variable latency would interfere with pipelined execution units. (multiple results coming out of the same execution unit in the same clock cycle = extra complexity). Also, there are probably other bits of logic (uop scheduling / queueing, result forwarding networks) that are designed around every uop having known latency. (except for special cases like division / sqrt).
IIRC, one analyst, maybe Agner Fog or David Kanter, suggested that some uops might have been possible to implement with 2 cycle latency, but instead take 3 cycles to match the other uops that their execution port can handle. So constant latency for operations appears to be a big deal for Intel CPU designs, to the extent that it was worth making an operation slower.
Note that we're only talking about latency here. If your multiply isn't part of a loop-carried dependency chain, or you have enough independent multiplies, you can keep the multiplier(s) going with one operation per clock.
Haswell CPUs can sustain a throughput of 2 FP vector multiplies per clock. (256b vectors of 4 doubles or 8 floats). Latency = 5 clock cycles for the result to be ready, regardless of input. Or 1 vector integer multiply per clock. (The vector multiply ALU is on port 0. The vector FP multipliers are on port 0 and port 1).
Avoid multiplying when you can, it leads to long dependency chains. (Usually this comes up for integer multiplies to calculate loop indices. Compilers do a lot better when you write your loop to increment the counter by 16, instead of multiplying i++ by 16 as an array index.)