Ignite avoids rebalancing using secretly kept offheap data

Steps to reproduce the situation:
1. Configure Ignite 2.14.0 with backups=1 and no persistence.
2. Start 3 server nodes (N1, N2, N3).
3. Fill it with data somehow, occupying about 25% of heap (with the onheap cache enabled).
4. Kill N1.
5. See (via Grafana or something like that) that CacheSize, OffheapUsedSize and OffHeapEntriesCount on N2 and N3 rose by 50%, which is fine. The TotalRebalancedBytes metric on N2 and N3 increased during this operation.
6. Start N1 again.
7. See that CacheSize and OffHeapEntriesCount on N2 and N3 decreased back by 1/3, which is understandable, but OffheapUsedSize on N2 and N3 didn't - it stays at 150% of the initial value (!). TotalRebalancedBytes on N2 and N3 stays still.
8. Kill N1 again.
9. See that CacheSize and OffheapUsedSize on N2 and N3 rose by 50% again, but OffHeapEntriesCount stays at the same 150% of the initial value. TotalRebalancedBytes stays still (!).
10. Raise N1 again.
11. Restart N2 and N3, wait for rebalancing.
12. Kill N1 again.
13. See (via Grafana or something like that) that CacheSize, OffheapUsedSize and OffHeapEntriesCount on N2 and N3 rose by 50%, which is fine. The TotalRebalancedBytes metric on N2 and N3 increased during this operation, exactly as in #5.
So, we see that the second and subsequent losses of N1 don't require any rebalancing on N2 and N3 as long as N2 and N3 keep running, and something sneakily occupies offheap space during that time.
It looks like after the first loss of N1, N2 and N3 remember N1's data in their offheap even after its return. And this "memory" is reflected only in OffheapUsedSize, but not in OffHeapEntriesCount.
I've googled and browsed the docs and haven't found relevant information. What is the name of this feature? Where can I read about it? Is it configurable?

OffheapUsedSize, i.e. the offheap size, doesn't shrink automatically. Ignite doesn't return memory back to the OS; instead it marks a memory page as free and ready for reuse. That's why there is no metric increase at step #9.
If you need the memory returned to the OS, you can use the defragmentation feature. Not sure if it's available in Apache Ignite, though.
OffHeapEntriesCount represents the actual number of records, regardless of the real memory consumption at the moment. It's the most informative of these metrics for tracking real-time data distribution.
I suppose everything went as expected up to step #9. Personally, I'd expect TotalRebalancedBytes to increase at step #9 for N2 and N3, just like it did in #5. But that might not be the case if you have a manually configured baseline topology. For an in-memory cache, the auto-adjustment delay is 0 by default, meaning an immediate baseline change and a rebalance trigger. If there is a non-zero delay and it was a short restart, you probably could have skipped a rebalance.
"So, we see that the second and subsequent losses of N1 don't require any rebalancing on N2 and N3 as long as N2 and N3 keep running, and something sneakily occupies offheap space during that time."
But that diverges from the observation: the third restart of N1 does cause TotalRebalancedBytes to increase, as you mentioned in the last step, doesn't it?
"The TotalRebalancedBytes metric on N2 and N3 increased during this operation, exactly as in #5."

Related

Long latency instruction

I would like a long-latency single-uop x86[1] instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using fsqrt, but I'm wondering if there is something better.
Ideally, the instruction will score well on the following criteria:
Long latency
Stable/fixed latency
One or a few uops (especially: not microcoded)
Consumes as few uarch resources as possible (load/store buffers, page walkers, etc)
Able to chain (latency-wise) with itself
Able to chain input and output with GP registers
Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc, resources it consumes)
So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
[1] On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.
Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel so yes, use FP where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c 2 uops) and movd r32, xmm (8c 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle 1 uop in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
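In GNU C, setting those MXCSR bits could look like this (a tiny helper with a name of my own choosing, not part of any benchmark framework; the intrinsics are the standard SSE ones):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON */

    /* Call once per thread before the timed loop: sets FTZ and DAZ in MXCSR so
       subnormal inputs/outputs can't trigger microcode assists on the sqrtss chain. */
    static void set_daz_ftz(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }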
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake. (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN)
You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. Probably cheaper to just and with 0 and use movd, unless port-5 pressure is a non-issue.
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm - 1 uop for p0, latency = 15-16, throughput = 9-12.
On Broadwell and earlier, it was 3 uops (2p0 p15), but Skylake I think widened the SIMD divider (in preparation for AVX512 I guess).
vsqrtss might be somewhat better than fsqrt since it at least satisfies relatively easy chaining with GP registers (since GP <-> vector is just a movd away).
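To make the chaining concrete, here is a minimal GNU C inline-asm sketch of such a dependency chain (my own illustrative code following the and-with-0 plus movd suggestion above; the function name and iteration scheme are arbitrary):

    #include <stdint.h>

    /* Serial dependency chain: GP reg -> movd -> sqrtss -> movd -> GP reg.
       "and $0, reg" pins the input bit-pattern to 0.0 (sqrt(0.0) = 0.0) without
       breaking the dependency, since and-with-0 is not a recognized zeroing idiom. */
    static uint32_t sqrt_chain(uint32_t seed, long iters)
    {
        uint32_t x = seed;
        for (long i = 0; i < iters; i++) {
            __asm__ volatile(
                "and    $0, %k0        \n\t"   /* force a known input, keep the dep */
                "movd   %k0, %%xmm0    \n\t"   /* GP -> XMM, carries the chain      */
                "sqrtss %%xmm0, %%xmm0 \n\t"   /* the long-latency link             */
                "movd   %%xmm0, %k0    \n\t"   /* XMM -> GP, feeds next iteration   */
                : "+r"(x)
                :
                : "xmm0", "cc");
        }
        return x;
    }

Per-iteration latency is then roughly movd + sqrtss + movd, serialized through x; timing many iterations and dividing gives the chain latency.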

Vector functions of STREAM benchmark

I am currently doing a small research project for school, where I am to test the memory bandwidth performance of a hypervisor compared to the virtualised machines it creates and manages.
Due to the timeframe of the project, only one of the vector functions tested by STREAM will be analysed. My thought process is to look at the results from the "Copy" function, since this is the most basic function, which performs no arithmetic, as stated at the bottom of https://www.cs.virginia.edu/stream/ref.html
After all, this is a memory bandwidth performance test.
I have yet, though, to find any Google post that proves or disproves my theory. Is there anyone here who can shine some light on this topic?
STREAM Copy and the other three tests are usually written in plain C without explicit vectorization, but the loops are simple and most compilers are able to optimize them into a vectorized variant. The kernel line in https://www.cs.virginia.edu/stream/ref.html is the full code of the loop; there are three arrays a, b, c of the same size, preinitialized with some floating-point data. Each element is a double (8 bytes, typically).
The table below shows how many bytes and FLOPs are counted in each iteration of the STREAM loops.
The test consists of multiple repetitions of the four kernels, and the best results of (typically) 10 trials are chosen.
------------------------------------------------------------------
name      kernel                     bytes/iter    FLOPS/iter
------------------------------------------------------------------
COPY:     a(i) = b(i)                    16             0
SCALE:    a(i) = q*b(i)                  16             1
SUM:      a(i) = b(i) + c(i)             24             1
TRIAD:    a(i) = b(i) + q*c(i)           24             2
------------------------------------------------------------------
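To make the table concrete, a minimal self-contained sketch of the Copy kernel and the bandwidth arithmetic might look like this (not the actual STREAM source; the array size, the timer, and the single trial are simplifications of mine):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 20000000L   /* 160 MB per array: large enough to defeat the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)        /* the Copy kernel: a(i) = b(i) */
            a[i] = b[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* 16 bytes per iteration: 8 read from b + 8 written to a */
        printf("Copy: %.2f GB/s (a[0]=%g)\n", 16.0 * N / secs / 1e9, a[0]);

        free(a); free(b);
        return 0;
    }

The real benchmark repeats each kernel about 10 times and reports the best trial, as noted above.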
More recent variants of the test are NERSC's (http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/stream/) and HPCC's (http://icl.cs.utk.edu/hpcc/), both based on http://www.cs.virginia.edu/stream/

O(n log n) with input of n^2

Could somebody explain to me why, when you have an algorithm A that has a time complexity of O(n log n) and you give it an input of size n^2, you get the following: O(n^2 log n)?
I understand that it becomes O(n^2 log n^2) and then O(n^2 * 2 * log n), but why does the 2 disappear?
It disappears because time complexity does not care about things that have no effect when n increases (such as a constant multiplier). In fact, it often doesn't even care about things that have less effect.
That's why, if your program runtime can be calculated as n^3 - n + 7, the complexity is the much simpler O(n^3). You can think of what happens as n approaches infinity. In that case, all the other terms become totally irrelevant compared to the first. That's when you're adding terms.
It's slightly different when multiplying since even lesser terms will still have a substantial effect (because they're multiplied by the thing having the most effect, rather than being added to).
For your specific case, O(n^2 log n^2) becomes O(n^2 * 2 * log n). Then you can remove all factors that have no effect on the outcome as n increases - that's the 2.
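Written out as one chain, the simplification is just:

$$O(n^2 \log n^2) = O(n^2 \cdot 2 \log n) = O(2 \cdot n^2 \log n) = O(n^2 \log n),$$

because $\log n^2 = 2 \log n$ and constant factors are dropped in big-O.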

How to orchestrate members in a cluster to read new input from a single file once the current job is done?

I am working on a global optimization using brute force. I am wondering if it is possible to complete the following task with Fortran MPI file I/O:
I have three nodes, A, B, C. I want these nodes to search for the optima over six sets of parameter inputs, which are arranged in the following matrix:
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9
1.1 1.2 1.3
1.4 1.5 1.6
1.7 1.8 1.9
A row vector represents a set of parameter inputs. The order of which node reads in which set of parameter inputs does not matter. All I need is to orchestrate nodes A, B, and C to run through the six sets of parameters, obtain the corresponding value of the penalty function, and save the output to a single file.
For example, node A pulls the first set, node B the second, and node C the third. Each node takes a while to finish its respective computation. Since the computation time varies across nodes, it is possible that C is the first to finish the first-round computation, followed by B and then A. In such a case, I want node C to subsequently pull the fourth set of inputs, node B to pull the fifth, and node A to read in the last set.
A <--- 0.1 0.2 0.3
B <--- 0.4 0.5 0.6
C <--- 0.7 0.8 0.9
C <--- 1.1 1.2 1.3
B <--- 1.4 1.5 1.6
A <--- 1.7 1.8 1.9
What troubles me is that the order in which nodes read sets for the second-round computation is not known in advance, due to the uncertainty in the run time of each node. So I would like to know if there is a way to program my code dynamically with MPI file I/O to meet this parallel need. Can anyone show me a code template to solve this problem?
Thank you very much.
Lee
As much as it pains me to suggest it, this might be the one good use of MPI "Shared file pointers". These work in Fortran, too, but I'm going to get the syntax wrong.
Each process can read a row from the file with MPI_File_read_shared. This independent I/O routine will update a global "shared file pointer" bit of state. Should B or C finish their work quickly, they can call MPI_File_read_shared again. If A is slow, whenever it calls MPI_File_read_shared it will read whatever has not been dealt with yet.
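A minimal C sketch of that pattern (the question asks about Fortran, but the MPI calls are the same; the binary layout of three raw doubles per row, the file names params.bin / results.bin, and the evaluate() placeholder are assumptions for illustration):

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder for the real penalty-function evaluation (hypothetical). */
    static double evaluate(const double p[3])
    {
        return p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* "params.bin" is assumed to hold the 6x3 matrix as raw doubles, row by row. */
        MPI_File in, out;
        MPI_File_open(MPI_COMM_WORLD, "params.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &in);
        MPI_File_open(MPI_COMM_WORLD, "results.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &out);

        double row[3];
        MPI_Status st;
        for (;;) {
            /* The shared file pointer advances atomically across ranks, so
               whichever rank calls first gets the next unread row. */
            MPI_File_read_shared(in, row, 3, MPI_DOUBLE, &st);
            int got;
            MPI_Get_count(&st, MPI_DOUBLE, &got);
            if (got < 3)                  /* no complete row left: done */
                break;

            double result[4] = { row[0], row[1], row[2], evaluate(row) };
            /* Append parameters + penalty to the single shared output file. */
            MPI_File_write_shared(out, result, 4, MPI_DOUBLE, &st);
        }

        MPI_File_close(&in);
        MPI_File_close(&out);
        MPI_Finalize();
        return 0;
    }

Each rank works at its own pace and simply calls MPI_File_read_shared again when it finishes a row, which gives exactly the dynamic assignment described in the question.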
Some warnings:
shared file pointers don't get a lot of attention.
The global bit of shared state is typically... a hidden file. So yeah, it might not scale terribly well. Should be fine for a few tens of processes, though.
the global bit of shared state is stored on a file system. Some file systems like PVFS do not support the locking required to ensure this shared state is always correct.

Master theorem with log n

Here's a problem.
I am really confused about the part where c is equal to 0.5. Overall, I am confused about how the log n can become n^(0.5). Couldn't I just let c be equal to 100, which would mean 100 < d, which results in a different case being used? What am I missing here?
You of course could set c = 100 so that n^c is a (very, veeery) rough asymptotic upper bound on log(n), but this would give you a horrendous and absolutely useless estimate of your runtime T(n).
What it tells you is that every polynomial function n^c grows faster than the logarithm, no matter how small c is, as long as it remains positive. You could take c = 0.0000000000001; it would seem to grow ridiculously slowly in the beginning, but at some point it would become larger than log(n) and diverge to infinity much faster than log(n) does. Therefore, in order to get rid of the n^2 log(n) term and be able to apply the polynomial-only version of the Master theorem, you upper-bound the logarithmic term by something that grows slowly enough (but still faster than log(n)). In this example, n^c with c = 0.5 is sufficient, but you could also take c = 10^{-10000} "just to make sure".
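In symbols, the bound being used here is just:

$$\log n = O(n^{c}) \ \text{for every fixed } c > 0, \qquad \text{so} \qquad n^2 \log n = O(n^{2+c}), \ \text{e.g. } O(n^{2.5}) \ \text{for } c = 0.5.$$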
Then you apply the Master theorem, and get a reasonable (and sharp) asymptotic upper bound for your T(n).