MIPS Branch Instructions

I am learning MIPS right now, and as I was reading the documentation, it said:
An 18-bit signed offset (the 16-bit offset field shifted left 2 bits)
I was wondering why, for branch instructions, the offset is multiplied by 4. The documentation also stated that this makes the range for branch instructions 128 KB, because the 32 KB range is multiplied by 4. Does this multiplication only apply to branch instructions, or does it apply to jump instructions as well?
Thanks!

I was wondering why, for branch instructions, the offset is multiplied by 4.
All instructions must be word-aligned. It follows that both the branch origin and the destination are word-aligned, which in turn means the offset will always be word-aligned too. So it would be a waste to store the two least significant bits of the offset, since they will always be 0. Instead, the available bits in the instruction word can encode 18-bit offsets by storing only the 16 most significant bits.
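As a concrete illustration, here is a minimal sketch (assuming MIPS32; the helper name is hypothetical) of how the byte offset is reconstructed from the 16-bit field:

#include <cstdint>

// Compute a MIPS branch target from the 16-bit offset field:
// sign-extend, shift left by 2 to restore the two implicit zero bits,
// and add to PC + 4 (the address of the delay-slot instruction).
uint32_t branch_target(uint32_t pc, int16_t offset_field) {
    int32_t byte_offset = int32_t(offset_field) << 2;  // 18-bit signed offset
    return pc + 4 + uint32_t(byte_offset);
}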
Does this multiplication only apply to branch instructions, or does it apply to jump instructions as well?
It's the same for jump instructions, although they differ in other ways: the target of a jump is not PC-relative, but relative to the start of the 256 MB-aligned region that the PC is currently in.
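A companion sketch (same caveats; hypothetical helper name) for the J-type target computation:

#include <cstdint>

// Compute a MIPS J/JAL target: the 26-bit instr_index field is shifted
// left by 2 and combined with the top 4 bits of the delay-slot address,
// i.e. the current 256 MB region.
uint32_t jump_target(uint32_t pc, uint32_t instr_index) {
    return ((pc + 4) & 0xF0000000u) | ((instr_index & 0x03FFFFFFu) << 2);
}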

Homomorphic encryption using Palisade library

To all homomorphic encryption experts out there:
I'm using the PALISADE library:
int plaintextModulus = 65537;
float sigma = 3.2;
SecurityLevel securityLevel = HEStd_128_classic;
uint32_t depth = 2;
//Instantiate the crypto context
CryptoContext<DCRTPoly> cc = CryptoContextFactory<DCRTPoly>::genCryptoContextBFVrns(
plaintextModulus, securityLevel, sigma, 0, depth, 0, OPTIMIZED);
Could you please explain (all) the parameters? I'm especially interested in the plaintext modulus, depth, and sigma.
Secondly I am trying to make a Packed Plaintext with the cc above.
cc->MakePackedPlaintext(array);
What is the maximum size of the array? On my local machine (8 GB RAM), when the array is larger than ~8000 int64 values I get a free(): invalid next size (normal) error.
Thank you for asking the question.
The plaintext modulus (denoted as t here) is a critical parameter for BFV, as all operations are performed mod t. In other words, when you choose t, you have to make sure that all computations do not wrap around, i.e., do not exceed t. Otherwise you will get an incorrect answer, unless your goal is to compute something mod t.
sigma is the distribution parameter (for the error distribution used by the underlying Learning With Errors problem). You can just set it to 3.2. No need to change it.
Depth is the multiplicative depth of the circuit you are trying to compute. It has nothing to do with the size of vectors. Basically, if you have AxBxCxD, you have depth 3 with a naive approach. BFV also supports more efficient binary-tree evaluation, i.e., (AxB)x(CxD); this option reduces the depth to 2.
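For instance, a hedged sketch using PALISADE's EvalMult (the ciphertext variables A..D are illustrative):

// Naive left-to-right chain: multiplicative depth 3
auto naive = cc->EvalMult(cc->EvalMult(cc->EvalMult(A, B), C), D);
// Balanced binary tree: multiplicative depth 2
auto tree = cc->EvalMult(cc->EvalMult(A, B), cc->EvalMult(C, D));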
BFV is a scheme that supports packing. By default, the size of a packed ciphertext is equal to the ring dimension (something like 8192 for the example you mentioned). This means you can pack up to 8192 integers in your case. To support larger arrays/vectors, you would need to break them into batches of 8192 each and encrypt each batch separately.
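A minimal sketch of that batching, reusing the cc from your setup (the keyPair name and example data are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> data(20000, 1);    // example input, longer than one batch
size_t slots = cc->GetRingDimension(); // e.g. 8192 for these parameters
for (size_t i = 0; i < data.size(); i += slots) {
    std::vector<int64_t> batch(data.begin() + i,
                               data.begin() + std::min(i + slots, data.size()));
    Plaintext pt = cc->MakePackedPlaintext(batch); // one batch per plaintext
    auto ct = cc->Encrypt(keyPair.publicKey, pt);  // encrypt each batch separately
}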
Regarding your application, the CKKS scheme would probably be a much better option (I will respond on the application in more detail in the other thread).
I have some experience with the SEAL library which also uses the BFV encryption scheme. The BFV scheme uses modular arithmetic and is able to encrypt integers (not real numbers).
For the parameters you're asking about:
The Plaintext Modulus is an upper bound for the input integers. If this parameter is too low, your integers might overflow (depending on how large they are, of course)
The Sigma is the distribution parameter for Gaussian noise generation
The Depth is the circuit depth which is the maximum number of multiplications on a path
Also, for the Packed Plaintext you should use vectors, not arrays; maybe that will fix your problem. If not, try lowering the size and splitting the data across several vectors if necessary.
You can determine the ring dimension (generated by the crypto context based on your parameter settings) by using cc->GetRingDimension() as shown in line 113 of https://gitlab.com/palisade/palisade-development/blob/master/src/pke/examples/simple-real-numbers.cpp

Efficient way to create masking kreg values [duplicate]

This question already has answers at: BMI for generating masks with AVX512.
One of the benefits of Intel's AVX-512 extension is that nearly all operations can be masked by providing in addition to the vector register a kreg which specifies a mask to apply to the operation: elements excluded by the mask may either be set to zero or retain their previous value.
A particularly common use of the kreg is to create a mask that excludes N contiguous elements at the beginning or end of a vector, e.g., for the first or final iteration in a vectorized loop where fewer than a full vector's worth of elements would be processed. For example, for a loop over 121 int32_t values, the first 112 elements could be handled by 7 full 512-bit vectors, but that leaves 9 elements to be handled by masked operations that touch only the first 9 lanes.
So the question is, given a (runtime valued) integer r which is some value in the range 0 - 16 representing remaining elements, what's the most efficient way to load a 16-bit kreg such that the low r bits are set and the remaining bits unset? KSHIFTLW seems unsuitable for the purpose because it only takes an immediate.
BMI2 bzhi does exactly what you want: Zero High Bits Starting with Specified Bit Position. Every CPU with AVX512 so far has BMI2.
__mmask16 k = _bzhi_u32(-1UL, r);
This costs 2 instructions, both single-uop: mov-immediate and bzhi. It's even single-cycle latency. (Or 3 cycles on KNL)
For r=0, it zeros all the bits, giving 0.
For r=1, it leaves only the low bit (bit #0), giving 1.
For r=12, it zeros bit #12 and higher, leaving 0x0FFF (12 bits set).
For r>=32, BZHI leaves all 32 bits set (and sets CF).
The INDEX is specified by bits 7:0 of the second source operand
If you had a single-vector-at-a-time cleanup loop that runs after an unrolled vector loop, you could even use this every loop iteration, counting the remaining length down towards zero instead of doing a separate last-vector cleanup. It leaves all bits set for large lengths. But this costs 2 uops inside the loop, including a port-5 kmovw, and means your main loop would have to use masked instructions. It only works for r<=255, because bzhi looks only at the low byte, not the full integer index. But the mov reg, -1 can be hoisted out of the loop, because bzhi doesn't destroy its source.
PS. Normally I think you'd want to arrange your cleanup to handle 1..16 elements, (or 0..15 if you branch to maybe skip it). But the full 17-possibility 0..16 makes sense if this cleanup also handles small lengths that never enter the main loop at all, and len=0 is possible. (And your main loop exits with length remaining = 1..16 so the final iteration can be unconditional)
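For completeness, a hedged sketch of a masked tail (assuming AVX-512F plus BMI2; the function name is illustrative):

#include <immintrin.h>
#include <cstdint>

// Sum the final r elements (0..16) of an int32 array using a
// zero-masked load, so the masked-off lanes contribute nothing.
int32_t sum_tail(const int32_t* p, unsigned r) {
    __mmask16 k = _bzhi_u32(-1U, r);             // low r bits set
    __m512i v = _mm512_maskz_loadu_epi32(k, p);  // masked lanes load as 0
    return _mm512_reduce_add_epi32(v);           // horizontal add
}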

How to change the gem5 ARM SVE vector length?

I'm doing an experiment to see which ARM SVE vector length would be the best for my chip design, or to help select which chip has the optimal vector length for my application.
How do I change the vector length in a gem5 simulation to see how it affects workload performance?
For SE:
se.py --param 'system.cpu[:].isa[:].sve_vl_se = 2'
For FS:
fs.py --param 'system.sve_vl = 2'
where the values are given in multiples of 128 bits, so 2 means length 256.
You can test this easily with the ADDVL instruction as shown in this example.
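Alternatively, here is a hedged userspace check (assuming an SVE-enabled toolchain, e.g. compiled with -march=armv8-a+sve) that prints the vector length via the ACLE intrinsic svcntb():

#include <arm_sve.h>
#include <cstdio>

int main() {
    // svcntb() returns the number of bytes in an SVE vector register
    std::printf("SVE vector length: %llu bits\n",
                (unsigned long long)svcntb() * 8);
    return 0;
}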
The names of these parameters can easily be determined by looking at the m5out/config.ini generated by a previous run.
Note however that this value is architecturally visible, so it might not be possible to checkpoint after Linux boot and restore with a different vector length than the one used at boot in order to speed up experiments. This is likely true in general, even though the kernel itself does not run vector instructions, because there is software control of the effective vector length. Maybe it is possible to start the simulator with a big vector length and then tell Linux to reduce it in software, but I'm not sure what the API is.
Tested in gem5 3126e84db773f64e46b1d02a9a27892bf6612d30.
To change the vector length, one can use command line option:
--arm-sve-vl=<vl in quadwords: one of {1, 2, 4, 8, 16}>
where vl is given in quadwords (multiples of 128 bits). So to simulate a 512-bit SVE machine, one should use:
--arm-sve-vl=4
This works both for Syscall-Emulation mode and Full System mode.
If one wants to quickly explore the space of different vector lengths, one can also change the length during the simulation (only in Full System mode). For example, to change the default SVE vector length (the kernel takes the value in bytes), put the following line in your bootscript before running the benchmark:
echo 256 >/proc/sys/abi/sve_default_vector_length
You can find more information at https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf.

GNU Radio - bit rate

I probably have a very stupid/simple question for GNU Radio users.
I have a Random Source as a source of bits [-1, 1], and I want to multiply every bit with a cosine to make a BPSK modulator.
The problem is that the bits are generated as fast as possible (their rate has nothing to do with samp_rate). Within one period of the cosine, the Random Source generates many bits.
The question is: how can I slow down the bit generation rate?
Thanks for any help.
(I don't want to use DPSK Mod :))
Strictly speaking you cannot delay the generation of bits. However, you can increase the duration of each symbol. This can be done with the Repeat block of GNU Radio. This block takes a parameter called interpolation that corresponds to the number of times an input item will be repeated at the output.
So you find the period of your cosine in samples, let's say p. Each random bit produced by the Random Source block is then repeated p times by the Repeat block, which stretches each random symbol over one full cosine period. Then you pass the resulting samples to the Multiply block of your flowgraph, as in the sketch below.
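A hedged sketch of that flowgraph using the GNU Radio C++ API (3.8+ block names; the fixed +/-1 pattern stands in for the Random Source):

#include <gnuradio/top_block.h>
#include <gnuradio/analog/sig_source.h>
#include <gnuradio/blocks/multiply.h>
#include <gnuradio/blocks/null_sink.h>
#include <gnuradio/blocks/repeat.h>
#include <gnuradio/blocks/vector_source.h>

int main() {
    double samp_rate = 32000.0;
    double carrier_freq = 1000.0;
    int p = int(samp_rate / carrier_freq);  // samples per cosine period

    auto tb = gr::make_top_block("bpsk");
    auto bits = gr::blocks::vector_source_f::make(
        std::vector<float>{1.f, -1.f, -1.f, 1.f}, true);   // stand-in bit source
    auto rep = gr::blocks::repeat::make(sizeof(float), p); // hold each bit p samples
    auto carrier = gr::analog::sig_source_f::make(
        samp_rate, gr::analog::GR_COS_WAVE, carrier_freq, 1.0);
    auto mult = gr::blocks::multiply_ff::make();
    auto sink = gr::blocks::null_sink::make(sizeof(float));

    tb->connect(bits, 0, rep, 0);      // bits -> repeat
    tb->connect(rep, 0, mult, 0);      // repeated bits -> multiply
    tb->connect(carrier, 0, mult, 1);  // cosine -> multiply
    tb->connect(mult, 0, sink, 0);
    tb->start();  // repeating source never finishes, so start/stop instead of run()
    tb->stop();
    tb->wait();
    return 0;
}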

Encoding - Efficiently send sparse boolean array

I have a 256 x 256 boolean array. This array is constantly changing, and the set bits are practically randomly distributed.
I need to send a current list of the set bits to many clients as they request them.
The following numbers are approximations.
If I send the coordinates for each set bit:
set bits    data transfer (bytes)
       0       0
     100     200
     300     600
     500    1000
    1000    2000
If I send the distance (scanning from left to right) to the next set bit:
set bits    data transfer (bytes)
       0       0
     100     256
     300     300
     500     500
    1000    1000
The typical number of bits that are set in this sparse array is around 300-500, so the second solution is better.
Is there a way I can do better than this without much added processing overhead?
Since you say "practically randomly distributed", let's assume that each location is a Bernoulli trial with probability p. p is chosen to get the fill rate you expect. You can think of the length of a "run" (your option 2) as the number of Bernoulli trials necessary to get a success. It turns out this number of trials follows the Geometric distribution (with probability p). http://en.wikipedia.org/wiki/Geometric_distribution
What you've done so far in option #2 is to recognize the maximum run length for each value of p, and reserve that many bits to send each run. Note that this maximum length is only probabilistic, and the scheme will fail if you get really, really unlucky and all your set bits are clustered at the beginning and end of the array.
As @Mike Dunlavey recommends in the comments, Huffman coding, or some other form of entropy coding, can redistribute the bits spent according to the frequency of each run length. That is, short runs are much more common, so use fewer bits to send those lengths. The theoretical limit for this encoding efficiency is the entropy of the distribution, which you can look up on that Wikipedia page and evaluate for different probabilities. In your case, this entropy ranges from 7.5 bits per run (for 1000 entries) to 10.8 bits per run (for 100).
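A hedged sketch reproducing those figures (assuming the 256 x 256 = 65536-cell array; bits per run is the geometric distribution's entropy divided by p):

#include <cmath>
#include <cstdio>

// Entropy of a geometric run-length distribution, in bits per run:
// H(p) / p, where H is the binary entropy function.
double bits_per_run(double p) {
    return (-p * std::log2(p) - (1 - p) * std::log2(1 - p)) / p;
}

int main() {
    // Prints roughly 10.8, 9.2, 8.5, and 7.5 bits/run respectively
    for (int set_bits : {100, 300, 500, 1000})
        std::printf("%4d set bits: %4.1f bits/run\n",
                    set_bits, bits_per_run(set_bits / 65536.0));
    return 0;
}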
Actually, this means you can't do much better than you're currently doing for the 1000-entry case: 8 bits = 1 byte per value. For the case of 100 entries, you're currently spending 20.5 bits per run instead of the theoretically possible 10.8, so that end has the highest chance for improvement. And in the case of 300, I think you haven't reserved enough bits to represent these sequences: the entropy comes out to 9.23 bits per run, and you're currently sending 8. You will find many cases where the gap between set bits exceeds 256, which will overflow your representation.
All of this, of course, assumes that things really are random. If they're not, you need a different entropy calculation. You can always compute the entropy directly from your data with a histogram, and decide whether it's worth pursuing a more complicated option.
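A hedged sketch of that empirical check (the run lengths are the gaps your scanner produces; names are illustrative):

#include <cmath>
#include <map>
#include <vector>

// Empirical entropy, in bits per run, from a histogram of observed run lengths.
double empirical_bits_per_run(const std::vector<int>& runs) {
    std::map<int, int> hist;
    for (int r : runs) ++hist[r];
    double h = 0.0;
    double n = double(runs.size());
    for (const auto& kv : hist) {
        double p = kv.second / n;
        h -= p * std::log2(p);  // Shannon entropy contribution of this length
    }
    return h;  // lower bound on average bits per encoded run
}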
Finally, also note that real-life entropy coders only approximate the entropy. Huffman coding, for example, has to assign an integer number of bits to each run length. Arithmetic coding can assign fractional bits.