MUL vs IMUL instructions in NASM

Is there any reason why the MUL instruction exists only in single-operand form?
The IMUL instruction comes in three different forms (with one, two, or three operands), which is much more convenient. From a technical point of view I don't see any reason why MUL couldn't have a two- or three-operand form as well.

It has to do with how the instructions are encoded. On the original 8086 the opcode space was scarce, so the Intel engineers looked for ways to economize. One trick was to extend the field that selects the operation (multiplication in this case) into the part of the encoding that would otherwise specify a register operand, which meant MUL could encode only one explicit operand. Because a multiplication requires two operands, the other one was hard-wired to the accumulator (AL/AX, later EAX), with the double-width result going to AX, DX:AX, or EDX:EAX. Later processors added new, longer encodings that could pack more information into a single instruction, which is what allowed IMUL to gain its more useful multi-operand forms.
There is an interesting parallel today with IP addresses running out.

It's not that NASM doesn't support it - on the CPU, the signed version of the instruction simply supports more variants than the unsigned version.

Three-operand imul, as well as the two-operand form with an immediate operand (which is an alias of the three-operand form), was introduced with the 186 instruction set. Later, the 386 added a two-operand form with a register and an r/m operand.
All of these new forms have in common that the multiplication is either 16 bits times 16 bits (possibly with a sign-extended 8-bit immediate) with a 16-bit result, or 32 bits times 32 bits with a 32-bit result. In these cases, the low 16 or low 32 bits of the result are the same for imul as they would be for an equivalent mul; only the flags (CF and OF) may differ. So a separate multi-operand mul isn't needed. Presumably the designers went with the mnemonic imul because the forms with an 8-bit immediate do sign-extend that immediate.
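The low-half equivalence is easy to check in a high-level language. Here is a minimal Rust sketch (an illustration, not from the answer) showing that signed and unsigned multiplication agree in the truncated result:

fn main() {
    let a: i32 = -7;
    let b: i32 = 123_456;
    // Signed multiply, keeping only the low 32 bits (what imul r32, r/m32 produces).
    let signed_low = a.wrapping_mul(b);
    // Unsigned multiply of the same bit patterns (what mul would leave in EAX).
    let unsigned_low = (a as u32).wrapping_mul(b as u32);
    // The truncated products are bit-identical; only CF/OF would differ.
    assert_eq!(signed_low as u32, unsigned_low);
}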

Related

Is it possible to get the native CPU size of an integer in Rust?

For fun, I'm writing a bignum library in Rust. My goal (as with most bignum libraries) is to make it as efficient as I can. I'd like it to be efficient even on unusual architectures.
It seems intuitive to me that a CPU will perform arithmetic faster on integers with the native number of bits for the architecture (i.e., u64 for 64-bit machines, u16 for 16-bit machines, etc.). As such, since I want to create a library that is efficient on all architectures, I need to take the target architecture's native integer size into account. The obvious way to do this would be to use the cfg attribute target_pointer_width. For instance, to define the smallest type which will always be able to hold more than the maximum native int size:
#[cfg(target_pointer_width = "16")]
type LargeInt = u32;
#[cfg(target_pointer_width = "32")]
type LargeInt = u64;
#[cfg(target_pointer_width = "64")]
type LargeInt = u128;
However, while looking into this, I came across this comment. It gives an example of an architecture where the native int size differs from the pointer width, so my solution will not work for all architectures. Another potential solution would be to write a build script that generates a small module defining LargeInt based on the size of a usize (which we can get via std::mem::size_of::<usize>()). However, this has the same problem as above, since usize is based on the pointer width as well. A final obvious solution is to simply keep a map of native int sizes for each architecture, but that is inelegant and doesn't scale well, so I'd like to avoid it.
So, my questions: is there a way to find the target's native int size, preferably before compilation, in order to reduce runtime overhead? Is this effort even worth it? That is, is there likely to be a significant difference between using the native int size as opposed to the pointer width?
It's generally hard (or impossible) to get compilers to emit optimal code for BigNum stuff; that's why https://gmplib.org/ has its low-level primitive functions (mpn_... docs) hand-written in assembly for various target architectures, with tuning for different micro-architectures, e.g. https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/core2/mul_basecase.asm for the general case of multi-limb * multi-limb numbers, and https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/aors_n.asm for mpn_add_n and mpn_sub_n (Add OR Sub = aors), tuned for the Sandy Bridge family, which doesn't have partial-flag stalls so it can loop with dec/jnz.
Understanding what kind of asm is optimal can help when writing code in a higher-level language. In practice, though, you can't even get close to that, so it sometimes makes sense to use a different technique, like only using values up to 2^30 in 32-bit integers (as CPython does internally, getting the carry-out via a right shift; see the section about Python in this). In Rust you do have access to overflowing_add to get the carry-out, but using it is still hard.
For practical use, writing Rust bindings for GMP is probably your best bet, unless that already exists.
Using the largest chunks possible is very good; on all current CPUs, add reg64, reg64 has the same throughput and latency as add reg32, reg32 or an 8-bit add. So you get twice as much work done per instruction, and carry propagates through 64 bits of result in 1 cycle of latency.
(There are alternate ways to store BigInteger data that can make SIMD useful; @Mysticial explains in Can long integer routines benefit from SSE?, e.g. 30 value bits per 32-bit int, allowing you to defer normalization until after a few addition steps. But every use of such numbers has to be aware of these issues, so it's not an easy drop-in replacement.)
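As a sketch of how that representation's normalization step works (assuming only a few additions have been deferred, so each limb still fits in 32 bits):

/// Re-normalize limbs that hold 30 value bits each, propagating the excess.
fn normalize(limbs: &mut [u32]) {
    let mut carry = 0u32;
    for limb in limbs.iter_mut() {
        let v = *limb + carry;       // fits in 32 bits given only a few deferred adds
        *limb = v & ((1 << 30) - 1); // keep the low 30 value bits
        carry = v >> 30;             // excess moves to the next limb
    }
}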
In Rust, you probably want to just use u64 regardless of the target, unless you really care about small-number (single-limb) performance on 32-bit targets. Let the compiler build u64 operations for you out of add / adc (add with carry).
The only thing that might need to be ISA-specific is if u128 is not available on some targets. You want to use 64 * 64 => 128-bit full multiply as your building block for multiplication; if the compiler can do that for you with u128 then that's great, especially if it inlines efficiently.
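As a sketch of that building block (the helper name full_mul is just for illustration):

/// 64x64 -> 128-bit full multiply, the primitive for multi-limb multiplication.
/// With optimizations enabled, x86-64 compilers typically lower this to one mul.
fn full_mul(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64) // (low half, high half)
}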
See also discussion in comments under the question.
One stumbling block for getting compilers to emit efficient BigInt addition loops (even inside the body of one unrolled loop) is writing an add that takes a carry input and produces a carry output. Note that x += 0xff..ff + carry=1 needs to produce a carry out even though 0xff..ff + 1 wraps to zero. So in C or Rust, x += y + carry has to check for carry out in both the y+carry and the x+= parts.
It's really hard (probably impossible) to convince compiler back-ends like LLVM to emit a chain of adc instructions. An add/adc pair is doable when you don't need the carry-out from the adc, or perhaps when the compiler is doing it for you for u128::overflowing_add.
Often compilers will materialize the carry flag as a 0 / 1 in a register instead of using adc. You can hopefully avoid that for at least pairs of u64 by combining the input u64 values into a u128 and using u128::overflowing_add. That should cost no extra asm instructions, because a u128 already has to be stored across two 64-bit registers, just like two separate u64 values.
So combining up to u128 could just be a local optimization for a function that adds arrays of u64 elements, to get the compiler to suck less.
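For concreteness, here is a minimal Rust sketch of such a full adder (the names add_carry and add_limbs are mine, not an established API); this is the pattern the compiler would have to recognize to emit adc:

/// One 64-bit limb of a BigInt addition: x + y + carry_in -> (sum, carry_out).
fn add_carry(x: u64, y: u64, carry_in: bool) -> (u64, bool) {
    let (s1, c1) = x.overflowing_add(y);
    let (s2, c2) = s1.overflowing_add(carry_in as u64);
    // At most one of the two additions can carry out, so OR is exact.
    (s2, c1 | c2)
}

/// a += b, limb by limb; returns the final carry-out.
fn add_limbs(a: &mut [u64], b: &[u64]) -> bool {
    let mut carry = false;
    for (x, &y) in a.iter_mut().zip(b) {
        let (s, c) = add_carry(*x, y, carry);
        *x = s;
        carry = c;
    }
    carry
}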
In my library ibig, this is what I do (a simplified sketch follows the list):
Select architecture-specific size based on target_arch.
If I don't have a value for an architecture, select 16, 32 or 64 based on target_pointer_width.
If target_pointer_width is not one of these values, use 64.
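Sketched with cfg attributes (illustrative only; the real list of architectures in ibig is longer):

// Architecture-specific overrides first (not ibig's exact list).
#[cfg(target_arch = "x86_64")]
type Word = u64;
#[cfg(target_arch = "x86")]
type Word = u32;

// Otherwise fall back to the pointer width...
#[cfg(all(not(any(target_arch = "x86_64", target_arch = "x86")),
          target_pointer_width = "16"))]
type Word = u16;
#[cfg(all(not(any(target_arch = "x86_64", target_arch = "x86")),
          target_pointer_width = "32"))]
type Word = u32;

// ...and default to 64 bits everywhere else.
#[cfg(all(not(any(target_arch = "x86_64", target_arch = "x86")),
          not(any(target_pointer_width = "16", target_pointer_width = "32"))))]
type Word = u64;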

Conditional instructions in AVX2

Can you give the list of conditional instructions available in AVX2?
So far I've found the following:
_mm256_blendv_* for selection from a and b based on mask c
Are there something like conditional multiply and conditional add, etc.?
Also, for instructions taking an imm8 control (like _mm256_blend_*), could you explain how to get that imm8 after a vector comparison?
The Intel Intrinsics Guide lists gather, load, and store as the operations that take a mask. The immediate imm8 in blend_epi16 is not programmable at run time, unless self-modifying code or a jump table is considered an option. It is still possible to derive a control value using pext from BMI2 to compact the odd-positioned bits of half of the movemask result: movemask gives 32 independent mask bits in AVX2, but each of the 8 bits in the blend_epi16 immediate controls four bytes, i.e. one 16-bit element in each 128-bit lane.
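A sketch of that derivation with Rust's std::arch intrinsics (the C versions are identical in spirit; the function name blend_control is hypothetical):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Compact the 32-bit byte mask from _mm256_movemask_epi8 into 8 bits,
/// one per 16-bit element of the low 128-bit lane. The result could
/// index a 256-entry jump table of immediate blends.
#[target_feature(enable = "avx2,bmi2")]
unsafe fn blend_control(cmp: __m256i) -> u32 {
    let m = _mm256_movemask_epi8(cmp) as u32; // 32 bits, one per byte
    _pext_u32(m, 0x0000_AAAA)                 // odd bits of the low lane
}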
AVX512 introduces optional zero-masking and merge-masking for almost all instructions.
Before that, to do a conditional add, mask one operand (with vandps, or vandnps for the inverse) before the add, instead of using vblendvps on the result. This is why packed-compare instructions/intrinsics produce all-zeros or all-ones elements.
0.0 is the additive identity element, so adding it is a no-op. (Except for IEEE semantics of -0.0 and +0.0, I forget how that works exactly).
Masking a constant input instead of blending the result avoids making the critical path longer, for something like conditionally adding 1.0.
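A minimal sketch of that idiom, written here with Rust's std::arch intrinsics (the same sequence is usually written in C): conditionally add 1.0 to the lanes that compare less-than:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Per lane: if v < limit then v + 1.0 else v.
/// Masking the constant input keeps the add off a longer blend path.
#[target_feature(enable = "avx")]
unsafe fn add_one_where_less(v: __m256, limit: __m256) -> __m256 {
    let ones = _mm256_set1_ps(1.0);
    let mask = _mm256_cmp_ps::<_CMP_LT_OQ>(v, limit); // all-ones or all-zeros per lane
    let addend = _mm256_and_ps(mask, ones);           // 1.0 where true, 0.0 where false
    _mm256_add_ps(v, addend)                          // adding 0.0 is a no-op
}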
Conditional multiply is more cumbersome because 0.0 is not the multiplicative identity. You need to multiply by 1.0 to keep a value unchanged, and you can't easily produce that with an AND or ANDN with a compare result. You can blendv an input, or you can do the multiply and blendv the output.
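Continuing the sketch above (same imports), here is the input-blend variant for a conditional multiply:

/// Per lane: if mask then a * b else a. Blend 1.0 into the multiplier input.
#[target_feature(enable = "avx")]
unsafe fn mul_where(a: __m256, b: __m256, mask: __m256) -> __m256 {
    let ones = _mm256_set1_ps(1.0);
    let multiplier = _mm256_blendv_ps(ones, b, mask); // b where mask, 1.0 elsewhere
    _mm256_mul_ps(a, multiplier)
}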
The alternative to blendv is at least 3 boolean instructions, like AND/ANDN/OR, but that's usually not worth it. Note, though, that Haswell runs vblendvps and vpblendvb as 2 uops for port 5, so they are a potential bottleneck compared to integer booleans that can run on any port. Skylake runs them as 2 uops for any port. It could still make sense to do something to keep a blendv off the critical path, though.
Masking an input operand or blending the result is generally how you do branchless SIMD conditionals.
BLENDV is usually at least 2 uops, so it's slower than an AND.
Immediate blends are much more efficient, but you can't use them, because the imm8 blend control has to be a compile-time constant embedded into the instruction's machine code. That's what immediate means in an assembly-language context.

Do integers, whose size is not a power of two, make sense?

This is an 8-bit architecture with a word size of 16 bits, and I now need a 48-bit integer variable. My understanding is that libm implements 8-, 16-, 32- and 64-bit operations (addition, multiplication, signed and unsigned).
So in order to do calculations, I must store the value in a 64-bit signed or unsigned integer. Correct?
If so, what is there to prevent general routines from being used? For example, for addition (written out in code after the list):
1) start with the LSB of both variables
2) add them up, together with the carry from the previous byte
3) if more bytes are available, continue; otherwise we are done
4) shift both variables one byte to the right
5) goto 1)
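In code (a Rust sketch of the scheme above; the same shape works in C), the loop looks like this:

/// Add two little-endian byte arrays of equal length in place: a += b.
/// Returns the final carry-out.
fn add_bytes(a: &mut [u8], b: &[u8]) -> bool {
    let mut carry = 0u16;
    for (x, &y) in a.iter_mut().zip(b) {
        let sum = *x as u16 + y as u16 + carry; // widen so the carry is visible
        *x = sum as u8;                         // low 8 bits back into this byte
        carry = sum >> 8;                       // carry into the next byte
    }
    carry != 0
}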
libm implements the routines for the standard sizes of types, and the compiler chooses the right one to use for each expression.
If you want to implement your own types, you can. If you want to use the usual operators, then you have to get into the compilation process to get the compiler to choose yours.
You could implement the operations as functions, say add(int48_t, int48_t), but then the compiler won't be able to do optimizations like constant folding, etc.
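For illustration, here is a minimal sketch of such a type (written in Rust; the same shape works in C with a struct and an add function; the name U48 is hypothetical). The value lives in a 64-bit integer and is masked back to 48 bits after every operation:

/// A 48-bit unsigned integer stored in the low bits of a u64.
#[derive(Clone, Copy, Debug, PartialEq)]
struct U48(u64);

const MASK48: u64 = (1u64 << 48) - 1;

impl U48 {
    fn new(v: u64) -> Self {
        U48(v & MASK48)
    }
    /// Wrapping 48-bit addition: add in 64 bits, then truncate.
    fn add(self, other: U48) -> U48 {
        U48(self.0.wrapping_add(other.0) & MASK48)
    }
}

fn main() {
    let a = U48::new(0xFFFF_FFFF_FFFF); // the maximum 48-bit value
    assert_eq!(a.add(U48::new(1)), U48::new(0)); // wraps at 2^48
}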
So, there is nothing stopping you from implementing your own custom compiler, but is it really necessary? Do you really need to save that space? If so, then go for it!
That is correct; saving a couple of bytes is (in almost all cases) not worth the trouble of implementing your own logic.

Where does the limitation of 10^15 in D.J. Bernstein's 'primegen' program come from?

At http://cr.yp.to/primegen.html you can find the sources of a program that uses Atkin's sieve to generate primes. As the author says it may take a few months for him to answer an e-mail (which I understand, he surely is a busy man!), I'm posting this question here.
The page states that 'primegen can generate primes up to 1000000000000000'. I am trying to understand why that is. There is of course a limit at 2^64 ~ 2 * 10^19 (the size of a long unsigned int), because this is how the numbers are represented. I know for sure that if there were a huge prime gap (> 2^31), printing the numbers would fail, but in this range I don't think such a gap exists.
Either the author underestimated the bound (and really it is around 10^19), or there is a place in the source code where an arithmetic operation can overflow, or something like that.
The funny thing is that you actually MAY run it for numbers > 10^15:
./primes 10000000000000000 10000000000000100
10000000000000061
10000000000000069
10000000000000079
10000000000000099
and if you believe Wolfram Alpha, it is correct.
Some facts I have "reverse-engineered":
numbers are sieved in batches of 1,920 * PRIMEGEN_WORDS = 3,932,160 numbers (see the primegen_fill function in primegen_next.c)
PRIMEGEN_WORDS controls how big a single sieving pass is; you can adjust it in primegen_impl.h to fit your CPU cache
the implementation of the sieve itself is in the primegen.c file; I assume it is correct. What you get is a bitmask of primes in pg->buf (see the primegen_fill function)
the bitmask is analyzed and the primes are stored in the pg->p array
I see no point where the overflow may happen.
I wish I was on my computer to look, but I suspect you would have different success if you started at 1 as your lower bound.
Just from the algorithm, I would conclude that the upper bound comes from 32-bit arithmetic.
The page mentions the Pentium III as a target CPU, so my guess is that the code is quite old and does not use 64-bit operations.
2^32 is approximately 4 * 10^9. The sieve of Atkin (which the program uses) requires about N^(1/2) bits, since it uses a big bitfield; for N = 10^15 that is roughly 3.2 * 10^7 bits, only a few megabytes. So within a 2^32 address space you can (conservatively) handle N ≈ 10^15. Since this rough bound is conservative (the system and other programs occupy memory, address ranges are reserved for I/O, ...), the real upper bound might be higher.

Assembly code for optimized bitshifting of a vector

I'm trying to write a routine that will logically bit-shift every element of a vector n positions to the right, as efficiently as possible, for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are significant in the result). I would like three routines for each type, depending on the processor's capabilities (SSE2 supported, only MMX supported, only the standard instruction set supported). So I need 12 functions in total.
I have already worked out by myself how to back up and restore the registers I need, how to write a loop, how to copy data into regular registers or MMX registers, and how to shift logically by one position.
Because I'm not familiar with assembly language, that's about it.
Which registers should I use for each instruction set?
How can I best exploit the fact that the large vector (an image) will be in L1 cache?
How do I find the next element of the vector (a pointer kind of thing)? I know I can do a mov by address, and I assume I have to increment the address by 1, 2 or 4 depending on my data type.
Although I have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what I'm trying to do with MMX, for a logical right shift by 1 on each DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
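For example, the DWORD->DWORD case with SSE2 collapses to a few lines of intrinsics. Sketched here with Rust's std::arch (the same calls live in emmintrin.h for C; the function name shift_right_u32 is illustrative):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Logical right shift of every u32 element by n bits, 4 elements at a time.
/// Assumes data.len() is a multiple of 4; a scalar loop would handle the tail.
#[target_feature(enable = "sse2")]
unsafe fn shift_right_u32(data: &mut [u32], n: i32) {
    for chunk in data.chunks_exact_mut(4) {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let shifted = _mm_srl_epi32(v, _mm_cvtsi32_si128(n)); // count from a register
        _mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, shifted);
    }
}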
Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics-based too, which makes it far more portable (though from the looks of it you're using GCC, so you might need an MSVC-to-GCC intrinsics mapping).