Want to understand load imputed by floating point instructions - optimization

At the outset: this may be a part-discussion, part-problem-solving kind of question. No intent to offend anyone.
I have written, in 64-bit assembly, an MT Prime based random number generator for 64 bits. The generator function has to be called about 8 billion times to populate an array of size 2048x2048x2048, producing a random number in the range 1..small_value (usually 32).
I then had two possible next steps:
(a) Keep generating numbers, compare each with the limits [1..32], and discard those that fall outside. The run time for this logic is 181,817 ms, measured by calling the clock() function.
(b) Take the 64-bit random number output in RAX, scale it using the FPU to lie in [0..1], and then scale it up to the desired range [1..32]. The code sequence for this is as below:
mov word ptr initialize_random_number_scaling,dx
fnclex ; clears status flag
call generate_fp_random_number ; returns a random number in ST(0) between [0..1]
fimul word ptr initialize_random_number_scaling ; Mults ST(0) & stores back in ST(0)
mov word ptr initialize_random_number_base,ax ; Saves base to a memory
fiadd word ptr initialize_random_number_base ; adds the base to the scaled fp number
frndint ; rounds off the ST(0)
fist word ptr initialize_random_number_result ; and stores this number to result.
ffree st(0) ; releases ST(0)
fincstp ; Logically pops the FPU
mov ax, word ptr initialize_random_number_result ; and saves it to AX
And the instructions in generate_fp_random_number are as below :
shl rax,1 ; RAX gets the original 64 bit random number using MT prime algorithm
shr ax,1 ; Clear top bit
mov qword ptr random_number_generator_act_number,rax ; Save the number in memory as we cannot move to ST(0) a number from register
fild qword ptr random_number_generator_max_number ; Load 7FFFFFFFFFFFFFFFh
fild qword ptr random_number_generator_act_number ; Load our number
fdiv st(0),st(1) ; We return the value through ST(0) itself, divide our random number with max possible number
fabs
ffree st(1) ; release the st(1)
fld1 ; push to top of stack a 1.0
fcomip st(0), st(1) ; compares our number in ST(1) with ST(0) and sets CF.
jc generate_fp_random_get_next_no ; if ST(0) (=1.0) < ST(1) (our no), we need a new no
fldz ; push to top of stack a 0.0
fcomip st(0),st(1) ; if ST(0) (=0.0) >ST(1) (our no) clears CF
jnc generate_fp_random_get_next_no ; so if the number is above zero the CF will be set
fclex
The problem is that, just by adding these instructions, the run time jumps to a whopping 5,633,963 ms! I have also written the above using XMM registers as an alternative, and the difference is absolutely marginal (5,633,703 ms).
Would anyone kindly guide me on how much load these additional instructions add to the total run time? Is the FPU really this slow? Or am I missing a trick? As always, all ideas are welcome, and I am grateful for your time and effort.
Env: Windows 7 64-bit on an Intel 2700K CPU overclocked to 4.4 GHz, 16 GB RAM, debugged in the VS 2012 Express environment.

"mov word ptr initialize_random_number_base,ax ; Saves base to a memory"
If you want maximum speed, you must find out how to keep instructions and writable data in different sections of memory.
Rewriting data in the same area of cache as the executing code creates a "self-modifying code" situation.
Your toolchain may separate them for you; it may not.
You need to know this, because assembly code that runs into this problem can be 10 to 50 times slower.
"All modern processors cache code and data memory for efficiency. Performance of assembly-language code can be seriously impaired if data is written to the same block of memory as that in which the code is executing, because it may cause the CPU repeatedly to reload the instruction cache (this is to ensure self-modifying-code works correctly). To avoid this, you should ensure that code and (writable) data do not occupy the same 2 Kbyte block of memory. "
http://www.bbcbasic.co.uk/bbcwin/manual/bbcwina.html#cache

There's a ton of stuff in your code that I can see no reason for. If there was a reason, feel free to correct me, but otherwise here are my alternatives:
For generate_fp_random_number
shl rax, 1
shr rax, 1
mov qword ptr act_number, rax
fild qword ptr max_number
fild qword ptr act_number
fdivrp ; divide actual by max and pop
; and that's it. It's already within bounds.
; It can't be outside [0, 1] by construction.
; It can't be < 0 because we just divided two positive numbers,
; and it can't be > 1 because we divided by the max it could be
For the other thing:
mov word ptr scaling, dx
mov word ptr base, ax
call generate_fp_random_number
fimul word ptr scaling
fiadd word ptr base
fistp word ptr result ; just save that thing
mov ax, word ptr result
; the default rounding mode is round to nearest,
; so the slow frndint is unnecessary
Also note the complete lack of ffree's etc. By making the right instruction pop, it all just worked out. It usually does.
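For readers who find the FPU sequence hard to follow, here is a minimal C sketch of the same arithmetic (clear the top bit, divide by the maximum, multiply by the scale, add the base, round). The name scale_random and its parameters are hypothetical, a double stands in for the 80-bit x87 register, and a scale of 31 with a base of 1 is assumed to be how the range [1..32] is reached.

```c
#include <stdint.h>

/* Hypothetical helper: maps a raw 64-bit value to an integer in
   [base, base + scale], following the shl/shr, fild, fdivrp, fimul,
   fiadd, fistp sequence shown above. */
int scale_random(uint64_t raw, int scale, int base)
{
    uint64_t cleared = (raw << 1) >> 1;                 /* clear the top bit */
    double unit = (double)cleared / (double)INT64_MAX;  /* lies in [0, 1] */
    double scaled = unit * scale + base;                /* fimul, then fiadd */
    /* Round to nearest for the non-negative values used here. Ties go up,
       whereas the x87 default rounding mode sends ties to the even integer. */
    return (int)(scaled + 0.5);
}
```

With scale 31 and base 1, a raw value of 0 gives 1 and a raw value with all bits set gives 32. Because of the rounding, the two end values are each about half as likely as the values in between.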

Related

How is a hex file converted into binary in a microcontroller?

I am new to embedded programming. I am using a compiler to convert source code into hex, which I then burn into the microcontroller. My question is: a microcontroller (like all ICs) supports binary numbers only (0 & 1). Then how does it work with a hex file?
The software that loads the program/data into the flash reads whatever format it supports, which may be Intel HEX, Motorola S-record, ELF, COFF, a raw binary, or something else, and then does the right thing to program the flash with just the relevant ones and zeros.
First of all, the PC you are using right now has a processor inside, which works just like any other microcontroller. You are using it to browse the internet, although it's all "1s and 0s on the inside". And I am presuming your actual firmware doesn't come even close to running what your PC is running at this moment.
microcontroller will support binary numbers only (0 & 1)
Your idea that a "microcontroller only supports binary numbers (0 & 1)" is a misconception. At its lowest level, yes, a microcontroller contains a bunch of transistors, and each of them can store only two states of information (a bit).
But the reason for this is simply because this is a practical way to physically store one small chunk of data.
If you check the assembly instruction manual for your uC architecture, you will see a large number of instructions operating on different data widths (bits grouped into 8, 16 or larger chunks). If your controller is, say, 16-bit, then this will be the basic word size for most instructions, and the one that will be the most efficient. When programming in C, this will also be the size of the "special" int type which all smaller integral types get expanded to.
In other words, bits are just building blocks of your hardware, and most of the time shouldn't even concern you at the firmware level, let alone higher application levels. Compare it to a human life form: human body is made of cells, but is also capable of doing more than a single-cell organism, isn't it?
I am using a compiler to convert source code into hex
Actually, you are using the compiler to create the machine code for your particular microcontroller architecture. "Hex", or more precisely the Intel Hex file format, is just one of several file formats used for storing machine code in a file, and for convenience it is a plain-text ASCII file which you can easily open in Notepad.
To clarify, let's say you wrote a simple line of C code like this:
a = b + c;
Your compiler needs to know which architecture you are targeting, in order to convert this to machine code. For a fictional uC architecture, this will first get compiled to the following fictional assembly language:
// compiler decides that a,b,c will be stored at addresses 0x1000, 0x1004, 0x1008
mov ax, (0x1004) // move value from address 0x1004 to accumulator
add ax, (0x1008) // add value from address 0x1008 to accumulator
mov (0x1000), ax // move value from accumulator to address 0x1000
Each of these instructions has its own instruction opcode, which can be found inside the assembly instruction manual. If the instruction operates on one or more parameters, uC will know that the bytes following the instruction are data bytes:
// mov ax, (addr) --> opcode 0x10
// add ax, (addr) --> opcode 0x20
// mov (addr), ax --> opcode 0x30
mov ax, (0x1004) // 0x10 (0x10 0x04)
add ax, (0x1008) // 0x20 (0x10 0x08)
mov (0x1000), ax // 0x30 (0x10 0x00)
Now you've got your machine-code, which, written as hex values, becomes:
10 10 04 20 10 08 30 10 00
And converted to binary becomes:
00010000000100000000010000100000...
To transfer this to your controller, you will use a file format which your flash uploader knows how to read, which is what Intel Hex is most commonly used for.
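To make the file format less abstract, here is a rough C sketch that turns a block of machine-code bytes into one Intel HEX data record. The function name ihex_data_record is hypothetical; the record layout it follows is a colon, a byte count, a 16-bit load address, a record type of 00 for data, the data bytes, and a checksum that is the two's complement of the sum of all preceding bytes, each written as two hex characters.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats one Intel HEX data record (type 00) into out.
   out must have room for at least 2 * len + 12 characters.
   Returns the number of characters written, not counting the terminator. */
int ihex_data_record(char *out, uint16_t addr, const uint8_t *data, uint8_t len)
{
    /* Running sum of every byte in the record; the type byte 00 adds nothing. */
    uint8_t sum = (uint8_t)(len + (addr >> 8) + (addr & 0xFF));
    int n = sprintf(out, ":%02X%04X00", (unsigned)len, (unsigned)addr);

    for (size_t i = 0; i < len; i++) {
        n += sprintf(out + n, "%02X", (unsigned)data[i]);
        sum = (uint8_t)(sum + data[i]);
    }
    /* Checksum: two's complement of the low byte of the sum. */
    n += sprintf(out + n, "%02X", (unsigned)((0x100u - sum) & 0xFFu));
    return n;
}
```

Feeding it the nine example bytes 10 10 04 20 10 08 30 10 00 with load address 0 (the address is an assumption) produces the text line :090000001010042010083010005B, which is what a flash uploader would read and turn back into the raw bytes.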
Once transferred to your microcontroller, it will be stored as a bunch of bits in its flash memory, but the controller is designed to read these bits in chunks of 8 or more bits, and evaluate them as instruction opcodes or data, depending on the context. For the example above, it will read the first 8 bits, and seeing that it's the instruction opcode 0x10 (which takes an additional address parameter), it will read the next two bytes to form the address 0x1004. It will then execute the instruction and advance the instruction pointer.
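The fetch-and-decode behaviour described above can be sketched in C as a tiny interpreter for the fictional machine. Everything here is an assumption made for illustration: the opcode names, the function run, the single accumulator, and a simplified memory model with one 16-bit word per address.

```c
#include <stddef.h>
#include <stdint.h>

/* Fictional machine from the example: every instruction is one opcode byte
   followed by a two-byte address, high byte first. */
enum { OP_LOAD = 0x10, OP_ADD = 0x20, OP_STORE = 0x30 };

/* Hypothetical interpreter. mem must have 65536 entries (one 16-bit word per
   address, a simplification). Returns 0 on success, -1 on an unknown opcode
   or a truncated instruction. */
int run(const uint8_t *code, size_t len, uint16_t *mem)
{
    uint16_t ax = 0;  /* accumulator */
    size_t ip = 0;    /* instruction pointer */

    while (ip + 3 <= len) {
        uint8_t op = code[ip];                                   /* fetch opcode */
        uint16_t addr = (uint16_t)((code[ip + 1] << 8) | code[ip + 2]);

        switch (op) {
        case OP_LOAD:  ax = mem[addr];                   break;  /* mov ax, (addr) */
        case OP_ADD:   ax = (uint16_t)(ax + mem[addr]);  break;  /* add ax, (addr) */
        case OP_STORE: mem[addr] = ax;                   break;  /* mov (addr), ax */
        default:       return -1;
        }
        ip += 3;                                                 /* advance */
    }
    return ip == len ? 0 : -1;
}
```

Running the nine example bytes with b = 2 stored at 0x1004 and c = 3 at 0x1008 leaves 5 at 0x1000, which is a = b + c.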
Hex, Decimal, Binary, they are all just ways of representing a number.
AA in hex is the same as 170 in decimal and 10101010 in binary (and 252 in octal).
The reason the hex representation is used is that it is very convenient when working with microcontrollers, as one hex character corresponds to exactly one nibble. Hence F is 1111, FF is 1111 1111, and so forth.
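As a quick check that these are only different spellings of one number, here is a small C sketch that writes the same value in any base from 2 to 16. The helper name to_base is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: writes value in the given base (2..16) into out,
   most significant digit first, and returns the number of digits.
   out needs room for one character per bit of an unsigned, plus a terminator. */
size_t to_base(unsigned value, unsigned base, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    char tmp[sizeof(unsigned) * 8];   /* worst case: one digit per bit */
    size_t n = 0;

    do {                              /* collect digits, least significant first */
        tmp[n++] = digits[value % base];
        value /= base;
    } while (value != 0);

    for (size_t i = 0; i < n; i++)    /* reverse into the output buffer */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

Calling it on the value 170 with bases 16, 10, 2 and 8 gives AA, 170, 10101010 and 252 respectively; the value stored in the variable never changes, only the text used to display it.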

mov command with vs. without parenthesis?

I don't understand what is the difference between these two statements
mov [var] , 10
and
mov var,10
in assembly?
For a variable like this:
var: db 0
The instruction mov var,10 would not be allowed by NASM, because in NASM syntax writing var like that (without square brackets) means that you want the address of var as an immediate. And there's no variant of mov that takes an immediate, immediate operand pair.
Adding the square brackets makes it a reference to an address in memory. So mov [var], 10 means store the value 10 at var. Actually you'd have to specify the size of the value to store as well, e.g. mov byte [var], 10. Otherwise NASM doesn't know if you want to store a byte, a word, or a dword, because the immediate 10 could be represented in any of those sizes.
Note that in MASM/TASM syntax mov var, 10 and mov [var], 10 would mean the same thing in this case (they would both have the same meaning as mov [var], 10 in NASM syntax).
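If C is more familiar, a loose analogy may help: in NASM a bare var behaves like &var in C (an address, which is just a number), while [var] behaves like the variable itself (the memory at that address). The sketch below only illustrates that analogy; the function names are invented and C is not a faithful model of the assembler.

```c
#include <stdint.h>

static uint8_t var = 0;   /* plays the role of `var: db 0` */

/* NASM `var` without brackets: the address itself, a plain number. */
uintptr_t address_of_var(void)
{
    return (uintptr_t)&var;
}

/* NASM `mov byte [var], 10`: store one byte at that address. */
void store_ten(void)
{
    *(uint8_t *)address_of_var() = 10;
}

/* NASM `mov al, [var]`: load the byte stored at that address. */
uint8_t read_var(void)
{
    return var;
}
```

Writing the equivalent of mov var, 10 would mean assigning 10 to the address itself, which makes no more sense in C (&var = 10 does not compile) than it does to NASM.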

x86 64 AT&T , moving part of register into another register

I'd like to move one byte from register rdx to register rbx, like this:
mov %rdx , (%rbx,%r15,1)
where rdx contains 0x33, r15 is the index, and rbx contains 0 at the start.
I have tried using this method in many ways, always ending with a SIGSEGV error.
In the end, I want the rbx register to contain an array of successive rdx values.
You can shift the bytes in one at a time, like this:
# Calculate first dl
...
mov %dl,%bl
# Calculate next dl
...
shlq $8,%rbx
mov %dl,%bl
# Calculate next dl
...
shlq $8,%rbx
mov %dl,%bl
etc. This assumes that you want the first byte in the msb, and the last byte in the lsb. The reverse order is a bit more complicated, but not much.
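The same packing can be written as a short C sketch, which may make the byte order easier to see. Both function names are hypothetical; the second function shows one possible way to get the reverse order and is not taken from the answer.

```c
#include <stddef.h>
#include <stdint.h>

/* First byte ends up most significant: shift left by 8, then place the new
   byte in the low 8 bits (shlq $8,%rbx followed by mov %dl,%bl).
   With more than 8 bytes, the earliest ones are shifted out. */
uint64_t pack_first_in_msb(const uint8_t *bytes, size_t count)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < count; i++)
        acc = (acc << 8) | bytes[i];
    return acc;
}

/* Reverse order (an assumption about what is wanted): the first byte stays
   least significant and each later byte is placed 8 bits higher.
   Only the first 8 bytes fit in 64 bits. */
uint64_t pack_first_in_lsb(const uint8_t *bytes, size_t count)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < count && i < 8; i++)
        acc |= (uint64_t)bytes[i] << (8 * i);
    return acc;
}
```

For the bytes 0x33, 0x44, 0x55 the first function returns 0x334455 and the second returns 0x554433.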

Using a Variable to Point to a Specific Character in a String

I am working on a program where I need to be able to change characters at specific places within a string as the user moves through it, and I'd like to use a variable to store the user's position within the string.
While working on other parts of the program, I temporarily used the code below, where buffer is my string:
mov eax, buffer
mov byte [eax + 14], '#'
In the finished program, I'd like to use something like:
mov byte [eax + position], '#'
However, when I use the line above, with position set to 14, I get a segmentation fault. How can I use a variable to point to a specific spot in the string?
EDIT: The position variable is set as follows:
segment .data
position db 14
Okay... [eax + position] is eax plus the address of position. You want [eax + [position]], and there is no such addressing mode. Do something like mov ecx, [position] first. Make position dd, not db! Then [eax + ecx] should work.
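In C terms, the fix amounts to reading the value of position into a register-sized variable first and then using that value as the index. The sketch below assumes position is a 32-bit variable, as suggested above; the function name mark_position and the bounds check are additions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t position = 14;            /* `position dd 14` */

/* Hypothetical helper: writes '#' at buffer[position].
   Returns 0 on success, -1 if position lies outside the buffer. */
int mark_position(char *buffer, size_t length)
{
    uint32_t index = position;            /* mov ecx, [position] */
    if (index >= length)
        return -1;                        /* avoid writing out of bounds */
    buffer[index] = '#';                  /* mov byte [eax + ecx], '#' */
    return 0;
}
```

The faulting version corresponds to using the address of position as the index, which points far outside the string.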

Is there a hardware unit called " 2's complement "?

I understand that in order to do a subtraction you should apply a 2's complement transformation to the second number.
Is there dedicated hardware that checks the MSB and, if it is found to be 1, does the transformation?
Also, is this system used for subtraction of floating point numbers?
The Two's Complement operation is implemented in most languages with the unary - operator. It is only used with signed integer types. It can be implemented in an ALU as either a distinct negation (e.g. NEG) instruction or rolled into another operation, for example when you use a subtract (e.g. SUB) instruction instead of an add (e.g. ADD) instruction.
Your first question is unclear because "the last bit" could refer to either the most-significant bit (MSB) or least significant bit (LSB). In a signed integer, the MSB indicates sign; checking for a negative is usually implemented as the N bit in the condition code register, which is updated from the result of the last instruction executed (though several instructions do not change the condition code register). Computing the two's complement only if the original number is negative is the absolute value (e.g. ABS) operation. Checking the LSB just tells you if the integer is even or odd.
Floating point numbers use a separate sign bit, so 0 and -0 are distinct values. Two's complement does not work with floating point values; a different approach must be used.
EDIT: An example. Consider the following C code:
#include <stdlib.h>

int do_math(int a, int b)
{
    return a - b;
}

int main(int argc, char* const argv[])
{
    /* Both argv[1] and argv[2] are read below, so two arguments are required. */
    if(argc < 3)
        return 0;
    return do_math(atoi(argv[1]), atoi(argv[2]));
}
This can be run with:
$ gcc -O0 foo.c -o foo
$ ./foo 20 10; echo $?
10
On x86_64, the function do_math() contains the following code:
_do_math:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -8(%rbp), %edx
movl -4(%rbp), %eax
subl %edx, %eax
leave
ret
The first two lines are the preamble, setting up the stack frame for the function. The next four lines store the input parameters, which arrive in the registers %edi and %esi, to the stack and then load them back into %eax and %edx (since optimization was disabled, the compiler keeps every variable in memory rather than in registers).
Then the key instruction: subl, which takes the second parameter (%edx, the x86's Extended DX register, 32 bits in size) and subtracts it from the first parameter (%eax, the x86's Extended AX register, also 32 bits in size), storing the result back into %eax. In the ALU, the subl instruction takes the first parameter as-is and adds the two's complement of the second parameter. Conceptually, it calculates the two's complement by inverting the second parameter's bits (similar to the ~ operator in C) and then using a dedicated adder to add 1. This step could be pipelined, it could be optimized so both it and the final addition complete in one cycle, or the designers could go a step further and roll the two's complement logic into the ALU's adder chain.
The last two lines clean up the stack and return. (The x86 calling conventions return the result in %eax.)
EDIT 2: Use the -S option to gcc to generate an assembly file (same name as input file except .c suffix is replaced with .s). For example: gcc -O0 foo.c -S (Had I not turned off the optimizer with -O0, the entire do_math() function could have been inlined into main(), making it much harder to see.)
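The claim that a subtraction is an addition of the two's complement can be checked directly in C. The following is a small sketch using unsigned 32-bit arithmetic, where wrap-around is well defined; the function name is invented for the example.

```c
#include <stdint.h>

/* Computes a - b without a subtraction: invert b's bits, add 1, then add
   the result to a. Unsigned arithmetic wraps modulo 2^32, so the result has
   the same bit pattern a hardware subtract would produce. */
uint32_t sub_via_twos_complement(uint32_t a, uint32_t b)
{
    uint32_t negated = ~b + 1u;   /* two's complement of b */
    return a + negated;
}
```

For 20 and 10 this returns 10, and for 10 and 20 it returns 0xFFFFFFF6, which is the 32-bit two's complement pattern for -10.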
Look, you never have to check the number. If a number is negative, it is already stored in 2's complement form in memory, and when you use that number the CPU handles it in your calculations by itself. You don't need to check anything; you just perform the operations.
No and no.
The transformation is done by code running on the CPU.