What does alignment to 16-byte boundary mean in x86 - optimization

Intel's official optimization guide has a chapter on converting from MMX commands to SSE where they state the fallowing statment:
Computation instructions which use a memory operand that may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation that uses instead register operands.
(chapter 5.8 Converting from 64-bit to 128-bit SIMD Integers, pg. 5-43)
I can't understand what they mean by "may not be aligned to a 16-byte boundary", could you please clarify it and give some examples?

Certain SIMD instructions, which perform the same instruction on multiple data, require that the memory address of this data is aligned to a certain byte boundary. This effectively means that the address of the memory your data resides in needs to be divisible by the number of bytes required by the instruction.
So in your case the alignment is 16 bytes (128 bits), which means the memory address of your data needs to be a multiple of 16. E.g. 0x00010 would be 16 byte aligned, while 0x00011 would not be.
How to get your data to be aligned depends on the programming language (and sometimes compiler) you are using. Most languages that have the notion of a memory address will also provide you with means to specify the alignment.

I'm guessing here, but could it be that "may not be aligned to a 16-byte boundary" means that this memory location has been aligned to a smaller value (4 or 8 bytes) before for some other purposes and now to execute SSE instructions on this memory you need to load it into a register explicitly?

Data that's aligned on a 16 byte boundary will have a memory address that's an even number — strictly speaking, a multiple of two. Each byte is 8 bits, so to align on a 16 byte boundary, you need to align to each set of two bytes.
Similarly, memory aligned on a 32 bit (4 byte) boundary would have a memory address that's a multiple of four, because you group four bytes together to form a 32 bit word.

Related

Homomorphic encryption using Palisade library

To all homomorphic encryption experts out there:
I'm using the PALISADE library:
int plaintextModulus = 65537;
float sigma = 3.2;
SecurityLevel securityLevel = HEStd_128_classic;
uint32_t depth = 2;
//Instantiate the crypto context
CryptoContext<DCRTPoly> cc = CryptoContextFactory<DCRTPoly>::genCryptoContextBFVrns(
plaintextModulus, securityLevel, sigma, 0, depth, 0, OPTIMIZED);
could you please explain (all) the parameters especially intrested in ptm, depth and sigma.
Secondly I am trying to make a Packed Plaintext with the cc above.
cc->MakePackedPlaintext(array);
What is the maximum size of the array? On my local machine (8GB RAM) when the array is larger than ~8000 int64 I get an free(): invalid next size (normal) error
Thank you for asking the question.
Plaintext modulus t (denoted as t here) is a critical parameter for BFV as all operations are performed mod t. In other words, when you choose t, you have to make sure that all computations do not wrap around, i.e., do not exceed t. Otherwise you will get an incorrect answer unless your goal is to compute something mod t.
sigma is the distribution parameter (used for the underlying Learning with Errors problem). You can just set to 3.2. No need to change it.
Depth is the multiplicative depth of the circuit you are trying to compute. It has nothing to with the size of vectors. Basically, if you have AxBxCxD, you have a depth 3 with a naive approach. BFV also supports more efficient binary tree evaluation, i.e., (AxB)x(CxD) - this option will reduce the depth to 2.
BFV is a scheme that supports packing. By default, the size of packed ciphertext is equal to the ring dimension (something like 8192 for the example you mentioned). This means you can pack up to 8192 integers in your case. To support larger arrays/vectors, you would need to break them into batches of 8192 each and encrypt each one separately.
Regarding your application, the CKKS scheme would probably be a much better option (I will respond on the application in more detail in the other thread).
I have some experience with the SEAL library which also uses the BFV encryption scheme. The BFV scheme uses modular arithmetic and is able to encrypt integers (not real numbers).
For the parameters you're asking about:
The Plaintext Modulus is an upper bound for the input integers. If this parameter is too low, it might cause your integers to overflow (depending on how large they are of course)
The Sigma is the distribution parameter for Gaussian noise generation
The Depth is the circuit depth which is the maximum number of multiplications on a path
Also for the Packed Plaintext, you should use vectors not arrays. Maybe that will fix your problem. If not, try lowering the size and make several vectors if necessary.
You can determine the ring dimension (generated by the crypto context based on your parameter settings) by using cc->GetRingDimension() as shown in line 113 of https://gitlab.com/palisade/palisade-development/blob/master/src/pke/examples/simple-real-numbers.cpp

Efficient way to create masking kreg values [duplicate]

This question already has answers here:
BMI for generating masks with AVX512
(2 answers)
Closed 3 years ago.
One of the benefits of Intel's AVX-512 extension is that nearly all operations can be masked by providing in addition to the vector register a kreg which specifies a mask to apply to the operation: elements excluded by the mask may either be set to zero or retain their previous value.
A particularly common use of the kreg is to create a mask that excludes N contiguous elements at the beginning or end of a vector, e.g., as the first or final iteration in a vectorized loop where less than a full vector would be processed. E.g., for a loop over 121 int32_t values, the first 112 elements could be handled by 7 full 512-bit vectors, but that leaves 9 elements left over which could be handled by masked operations which operate only on the first 9 elements.
So the question is, given a (runtime valued) integer r which is some value in the range 0 - 16 representing remaining elements, what's the most efficient way to load a 16-bit kreg such that the low r bits are set and the remaining bits unset? KSHIFTLW seems unsuitable for the purpose because it only takes an immediate.
BMI2 bzhi does exactly what you want: Zero High Bits Starting with Specified Bit Position. Every CPU with AVX512 so far has BMI2.
__mmask16 k = _bzhi_u32(-1UL, r);
This costs 2 instructions, both single-uop: mov-immediate and bzhi. It's even single-cycle latency. (Or 3 cycles on KNL)
For r=0, it zeros all the bits giving 0.
For r=1, it leaves only the low bit (bit #0) giving 1
For r=12, it zeros bit #12 and higher, leaving 0x0FFF (12 bits set)
For r>=32 BZHI leaves all 32 bits set (and sets CF)
The INDEX is specified by bits 7:0 of the second source operand
If you had a single-vector-at-a-time cleanup loop that runs after an unrolled vector loop, you could even use this every loop iterations, counting the remaining length down towards zero, instead of a separate last-vector cleanup. It leaves all bits set for high lengths. But this costs 2 uops inside the loop, including port 5 kmovw, and means your main loop would have to use masked instructions. This only works for r<=255 because it only looks at the low byte, not the full integer index. But the mov reg, -1 can be hoisted because bzhi doesn't destroy it.
PS. Normally I think you'd want to arrange your cleanup to handle 1..16 elements, (or 0..15 if you branch to maybe skip it). But the full 17-possibility 0..16 makes sense if this cleanup also handles small lengths that never enter the main loop at all, and len=0 is possible. (And your main loop exits with length remaining = 1..16 so the final iteration can be unconditional)

Why are large numpy arrays 64-byte aligned but not smaller ones

The following code:
prev=[]
addresses=[]
for i in range(10000):
a = np.ones(x).astype(np.float32)
prev.append(a)
address = a.__array_interface__['data'][0]
assert(address % 64 == 0)
assert((address not in addresses))
addresses.append(address)
Will not raise an assertionError for values of x > 252 suggesting that arrays bigger than 253, (or bigger than 505 when using float16) are aligned differently to smaller arrays. What is the reason for this?
I am on a OSX (Intel(R) Core(TM) i7-6920HQ CPU # 2.90GHz) running numpy 1.12.1
Your test loop isn't accomplishing exactly what you expect. Since only one array exists in memory at a time, it's quite possible - indeed LIKELY - that new ones will be allocated at the same memory address as the one just freed. You'd have to do something like append the arrays to a list (thus making them all exist in memory simultaneously) to actually test 10000 distinct allocations.
However, I can easily believe that you're seeing a real effect, as it's perfectly reasonable for a memory allocator to use different strategies based on the size of the block being allocated. For example, at some point the allocator may stop trying to use memory it already has, and start requesting entire memory pages directly from the operating system. Once that threshold is reached, you'd find that everything is aligned on a much higher power-of-2 boundary than 64 - perhaps 4096. You seem to be hitting some intermediate threshold at 1024 bytes (including overhead), it might be interesting to test for 128/256/512/1024 byte alignment.
Here is my guess: Using aligned memory typically involves allocating a larger block, and then releasing the upfront bytes that are allocated before the alignment boundary.
This is insignificant for large arrays, but for small arrays the fragmentation and overhead introduced likely outweights the benefits.

Finding the correct alignment for host visible memory

ppData points to a pointer in which is returned a host-accessible
pointer to the beginning of the mapped range. This pointer minus
offset must be aligned to at least
VkPhysicalDeviceLimits::minMemoryMapAlignment.
I want to allocate a Vec3 float in a uniform buffer. A Vec3 float is 12bytes big.
VkMemoryRequirements { size: 16, alignment: 16, memory_type_bits: 15 }
Vulkan reports that it has to be aligned to 16 bytes, which means that the size of the allocation is now 16 instead of 12. So Vulkan already handled this for me.
minMemoryMapAlignment on my GPU is 64 bytes. What exactly does this mean for my allocation? Does this mean that I can not use the size from a VkMemoryRequirements for my allocation? And instead of allocating 16bytes here, I would have to allocate 64bytes?
Update:
For a 12 byte allocation with a 16 byte alignment and 64 bytes minMemoryMapAlignment. I would still allocate only 16 bytes and then call:
vkMapMemory(device, memory, 0, 16, 0, &mapped);
But the ptr returned from vkMapMemory is actually not 16 bytes but 64 bytes wide? And all the relevant data is in the first 12 bytes and the rest is just "padded" memory? So in practice this basically means that I don't need to use minMemoryMapAlignment at all?
There is nothing in the spec that restricts the size of the allocation like that. The paragraph you quoted means that the mapping will be aligned to minMemoryMapAlignment and you can then tell the compiler to use aligned memory accesses when accessing it. What will happen is that when the memory is mapped the later 48 bytes are wasted space in the host's memory space. That is unlikely to matter though.
This is why people keep saying to allocate larger blocks and subdivide them as needed. That way you can put 4 of those vkBuffers into a single 64 byte allocation (which you will need if you want to pipeline the rendering).
It's highly unlikely that that single vec3 is the only thing you need memory for, so take a look at your other allocations and see which ones you can combine.

What is the difference between “SHA-2” and “SHA-256”

I'm a bit confused on the difference between SHA-2 and SHA-256 and often hear them used interchangeably. I think SHA-2 a "family" of hash algorithms and SHA-256 a specific algorithm in that family. Can anyone clear up the confusion.
The SHA-2 family consists of multiple closely related hash functions. It is essentially a single algorithm in which a few minor parameters are different among the variants.
The initial spec only covered 224, 256, 384 and 512 bit variants.
The most significant difference between the variants is that some are 32 bit variants and some are 64 bit variants. In terms of performance this is the only difference that matters.
On a 32 bit CPU SHA-224 and SHA-256 will be a lot faster than the other variants because they are the only 32 bit variants in the SHA-2 family. Executing the 64 bit variants on a 32 bit CPU will be slow due to the added complexity of performing 64 bit operations on a 32 bit CPU.
On a 64 bit CPU SHA-224 and SHA-256 will be a little slower than the other variants. This is because due to only processing 32 bits at a time, they will have to perform more operations in order to make it through the same number of bytes. You do not get quite a doubling in speed from switching to a 64 bit variant because the 64 bit variants do have a larger number of rounds than the 32 bit variants.
The internal state is 256 bits in size for the two 32 bit variants and 512 bits in size for all four 64 bit variants. So the number of possible sizes for the internal state is less than the number of possible sizes for the final output. Going from a large internal state to a smaller output can be good or bad depending on your point of view.
If you keep the output size fixed it can in general be expected that increasing the size of the internal state will improve security. If you keep the size of the internal state fixed and decrease the size of the output, collisions become more likely, but length extension attacks may become easier. Making the output size larger than the internal state would be pointless.
Due to the 64 bit variants being both faster (on 64 bit CPUs) and likely to be more secure (due to larger internal state), two new variants were introduced using 64 bit words but shorter outputs. Those are the ones known as 512/224 and 512/256.
The reasons for wanting variants with output that much shorter than the internal state is usually either that for some usages it is impractical to use such a long output or that the output need to be used as key for some algorithm that takes an input of a certain size.
Simply truncating the final output to your desired length is also possible. For example a HMAC construction specify truncating the final hash output to the desired MAC length. Due to HMAC feeding the output of one invocation of the hash as input to another invocation it means that using a hash with shorter output results in a HMAC with less internal state. For this reason it is likely to be slightly more secure to use HMAC-SHA-512 and truncate the output to 384 bits than to use HMAC-SHA-384.
The final output of SHA-2 is simply the internal state (after processing length extended input) truncated to the desired number of output bits. The reason SHA-384 and SHA-512 on the same input look so different is that a different IV is specified for each of the variants.
Wikipedia:
The SHA-2 family consists of six hash functions with digests (hash
values) that are 224, 256, 384 or 512 bits: SHA-224, SHA-256, SHA-384,
SHA-512, SHA-512/224, SHA-512/256.