x86 assembly, optimize conditional move

This is a threaded-code command, as in Forth: it checks whether 0 is on top of the stack (edi) and, if so, skips the next command by dereferencing the command pointer (ebx).
_ifleap:
mov eax, [edi]
add edi, 4
test eax, eax
cmovz ebx, [ebx]
mov ebx, [ebx]
jmp [ebx + 12]
Is there a way to optimize this? Fewer lines, faster execution, better CPU support?
The idea is to check whether [edi] is zero and, if so, execute mov ebx, [ebx]; otherwise do nothing. edi must be incremented by 4 (it is a sort of stack pointer). Of course cmovz is i686 only, but using a label seems overkill for this task.
(Yes, I have the x86 instruction set reference, but it is huge and takes a long time to master, and I only use assembly occasionally, so I am looking for expert advice.)
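For readers who want the intended behaviour spelled out, the command can be modelled in C. This is only a sketch under assumptions: the struct cmd layout, its id field and the function name ifleap are hypothetical and chosen for illustration; only the convention that the first field of a command links to the next command is taken from the assembly above. The final jmp [ebx + 12] is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command cell: the first field links to the next command,
   mirroring the [ebx] dereference in the assembly. */
struct cmd {
    struct cmd *next;
    int id; /* illustrative payload, not part of the original code */
};

/* Pops one value from the data stack (*sp plays the role of edi) and
   returns the command to run next (the role of ebx after both loads). */
static struct cmd *ifleap(struct cmd *ip, int32_t **sp)
{
    assert(ip != NULL && sp != NULL && *sp != NULL);
    int32_t top = **sp;  /* mov eax, [edi] */
    *sp += 1;            /* add edi, 4 */
    if (top == 0)        /* test eax, eax / cmovz ebx, [ebx] */
        ip = ip->next;   /* skip one command */
    return ip->next;     /* mov ebx, [ebx] */
}
```

A compiler is free to turn the if into either a branch or a conditional move, so this shows the semantics rather than a particular instruction sequence.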

Related

Passing arrays to NASM DLL, pointer value gets reset to zero

I am passing three arrays of doubles from Python (3.6.2) into a DLL written in 64-bit NASM (Windows) using CTypes. The pointers to the arrays are in rcx, rdx, r8 and r9.
On entry, I extract the pointers into three separate arrays, called a_in_data,
b_in_data, and c_in_data. The elements of those arrays are (1) pointer (2) data type and (3) length.
In the area preceded by "Test #1" in the code below we check the value at b_in_data[0] and we get a valid pointer (just remove the comment symbols and jump to the end).
In the area preceded by "Test #2" we check the value at b_in_data[0] and we get zero. Nothing has written to b_in_data[0] by this point, but somehow it gets set back to zero.
The same happens in the block following for c_in_data. For some reason, the first code block (headed by "Extract data type and length") zeroes out the first value in b_in_data and c_in_data.
I have identified the line that is causing the problem; it's followed by the comment "THIS LINE IS THE PROBLEM, BUT IT'S NOT CLEAR WHY."
The Python code is long, but if it helps to reproduce this, please ask and I will post it. Here is the NASM code:
; Header Section
[BITS 64]
export TryThemAll
section .data
a_in_data: dd 0, 0, 0
b_in_data: dd 0, 0, 0
c_in_data: dd 0, 0, 0
out_array_pointer: dd 0
call_var_length: dd 0
section .text
finit
; _________________
TryThemAll:
push rdi
push rbp
push qword rcx
pop qword [a_in_data]
push qword rdx
pop qword [b_in_data]
push qword r8
pop qword [c_in_data]
push qword r9
pop qword [out_array_pointer]
; Test #1
; Now the value at b_in_data[0] is the pointer we just extracted from rdx
;mov rbp,b_in_data
;mov rax,qword [rbp]
;jmp out_here
;_______
; Extract data type and length
mov rdi,[out_array_pointer]
mov rbp,a_in_data
movsd xmm0,qword [rdi] ;Data type for a_in
cvttsd2si rax,xmm0
mov [rbp+8],rax ; THIS LINE IS THE PROBLEM, BUT IT'S NOT CLEAR WHY
movsd xmm0,qword [rdi+8] ;Length for a_in
cvttsd2si rax,xmm0
mov [rbp+16],rax
mov rbp,b_in_data
movsd xmm0,qword [rdi+16] ;Data type for b_in
cvttsd2si rax,xmm0
mov [rbp+8],rax
movsd xmm0,qword [rdi+24] ;Length for b_in
cvttsd2si rax,xmm0
mov [rbp+16],rax
; Test #2
; Now the value at [0] in b_in_data is zero !!!
mov rbp,b_in_data
mov rax,qword [rbp]
jmp out_here
mov rbp,c_in_data
movsd xmm0,qword [rdi+32] ;Data type for c_in
cvttsd2si rax,xmm0
mov [rbp+8],rax
movsd xmm0,qword [rdi+40] ;Length for c_in
cvttsd2si rax,xmm0
mov [rbp+16],rax
;_______
out_here:
pop rbp
pop rdi
ret
Thanks in advance for any help.

The solution to this problem was quite simple. The three arrays a_in_data, b_in_data and c_in_data were defined contiguously in the .data section as "dd" but should have been defined as "dq" so that each element occupies eight bytes instead of four. Naturally, successive 8-byte writes had the effect of overwriting adjacent values.
I recently switched from 32-bit MASM to 64-bit NASM and I'm still getting used to NASM syntax and 64-bit assembly programming, so I'm still making some elementary mistakes.
Thanks, Peter, for the time you took on this. You made some other interesting points. For example, I've switched to using lea (load effective address) instead of moving the pointer to rbp (e.g., mov rbp,b_in_data).
Thanks again, and thanks to Michael Petch for adding the other tags.
BTW, these data are all converted to 64-bit integers, so the struc is not necessary -- they are not mixed types.
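The overlap can be reproduced outside the assembler. The following C sketch is an illustration under assumptions, not the original program: the helper names and the 64-byte buffer are made up, and a little-endian host is assumed, as on x86-64. It stores a pointer-sized value at the start of the second record and then performs the 8-byte store at offset 8 of the first record, once with 12-byte records (three dd) and once with 24-byte records (three dq).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at a byte offset, like `mov [rbp+off], rax`. */
static void store64(uint8_t *base, size_t off, uint64_t v)
{
    memcpy(base + off, &v, sizeof v);
}

/* Load a 64-bit value from a byte offset, like `mov rax, qword [rbp+off]`. */
static uint64_t load64(const uint8_t *base, size_t off)
{
    uint64_t v;
    memcpy(&v, base + off, sizeof v);
    return v;
}

/* Returns the pointer slot of the second record after the first record's
   type field has been written. `stride` is the record size in bytes:
   12 for three dd, 24 for three dq. */
static uint64_t second_pointer_after_write(size_t stride, uint64_t ptr,
                                           uint64_t type)
{
    uint8_t data[64] = {0};
    assert(stride >= 8 && stride + 8 <= sizeof data);
    store64(data, stride, ptr);  /* pop qword [b_in_data] */
    store64(data, 8, type);      /* mov [rbp+8], rax with rbp = a_in_data */
    return load64(data, stride); /* mov rax, qword [b_in_data] */
}
```

With a stride of 12 the store at offset 8 covers bytes 8 to 15 and so wipes the low four bytes of the value stored at offset 12; with a stride of 24 the value survives. In the original layout the next store, at offset 16, then covers the upper four bytes as well.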

Error: "unknown opcode skipped: 32"

I wrote an 8086 program, and as far as I can tell it runs fine, but when it gets to the part where I declare the variables, the emulator gives me an error. When trying to run the line temp db 0x0F, the emulator says:
unknown opcode skipped: 32
not 8086 instruction - not supported yet.
Here's my full program:
org 100h
mov ah, temp ;put variables into registers
mov al, changed
mov dx, result
lea bx, temp ;get address of temp and put into bx
add dx, [bx] ;add value at the address in bx to result
lea bx, changed ;get address of changed and put into bx
add dx, [bx] ;add value at the address in bx to result
temp db 0x0F ;declare and initialize variables
changed db 32h
result dw 0
Is this consequential to how the program functions, and how do I fix it?
EDIT: sigjuice solved the problem, as you can see in the comments. Here's the final version of the program that runs correctly:
.CODE
org 100h
mov ah, temp ;put variables into registers
mov al, changed
mov dx, result
lea bx, temp ;get address of temp and put into bx
add dx, [bx] ;add value at the address in bx to result
lea bx, changed ;get address of changed and put into bx
add dx, [bx] ;add value at the address in bx to result
.DATA
temp db 0x0F ;declare and initialize variables
changed db 32h
result dw 0

add dx, [bx] ;add value at the address in bx to result
temp db 0x0F ;declare and initialize variables
In this part of your program there's nothing that stops the CPU from executing the data at the temp label as if it were an instruction.
Although adding the .CODE and .DATA assembler directives (perhaps suggested by @sigjuice) seemingly solves the problem, this is typically not what you use when writing a .COM executable. It's a .COM executable because you used the org 100h directive.
What your program really needs is a way to return to the operating system. Since this is EMU8086 the preferred way is using the DOS.TerminateWithReturncode function.
add dx, [bx] ;add value at the address in bx to result
; Exit to the operating system
mov ax, 4C00h ;AH=4Ch function number, AL=0 exitcode (0 most often means OK)
int 21h ;DOS system call
; Now beyond this point nothing gets executed inadvertently
temp db 0Fh ;declare and initialize variables
I can't really advise returning to the operating system using a mere ret instruction, because this method requires that the SS:SP registers are set as they were when the program started. This will not always be the case. Better to use this DOS function, which does not rely on any specific register setting.
lea bx, temp ;get address of temp and put into bx
add dx, [bx] ;add value at the address in bx to result
lea bx, changed ;get address of changed and put into bx
add dx, [bx] ;add value at the address in bx to result
Nothing to do with your original problem but as a bonus:
Because temp and changed are both byte-sized variables, the word-sized additions don't just add the variables alone but also the byte that happens to follow them in memory! Sometimes this is intentional (I sincerely doubt this is the case here!), but you need to make sure that you understand this.
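To make the effect concrete, here is a small C model of the three variables as they sit in memory. It is a sketch only: the array image and the function names are assumptions for illustration, and the 16-bit reads are assembled by hand in little-endian order, as the 8086 reads them.

```c
#include <assert.h>
#include <stdint.h>

/* Memory image of: temp db 0Fh / changed db 32h / result dw 0 */
static const uint8_t image[4] = {0x0F, 0x32, 0x00, 0x00};

/* Reads a 16-bit little-endian word, as `add dx, [bx]` does. */
static uint16_t load16(unsigned off)
{
    assert(off + 1 < sizeof image);
    return (uint16_t)(image[off] | (image[off + 1] << 8));
}

/* What the program computes: word-sized reads at byte-sized variables. */
static uint16_t sum_as_words(void)
{
    uint16_t dx = load16(2);          /* mov dx, result   -> 0000h */
    dx = (uint16_t)(dx + load16(0));  /* word at temp     -> 320Fh */
    dx = (uint16_t)(dx + load16(1));  /* word at changed  -> 0032h */
    return dx;
}

/* What was presumably intended: add only the two bytes. */
static uint16_t sum_as_bytes(void)
{
    return (uint16_t)(image[0] + image[1]);
}
```

The word read at temp picks up changed as its high byte, so the word-sized version yields 3241h where the byte-sized sum is 41h.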

Dynamic Variable creation in assembly? (x86 Assembly)

Let's say I have a PROC in my assembly code like so:
.CODE
PROC myProc
MOV EAX, 00000001b
MOV EBX, 00001101b ; binary 1101 = 13; without the b suffix this is decimal 1101
RET
ENDP myProc
I want to MOV 1, into the EAX register, and move 13 into the EBX register in my procedure, however I want to create two variables local to my PROC, assigning var a the value of 1, and var b the value of 13, and from there MOVing [a] into EAX, and [b] into EBX. I have had many ideas about this before, perhaps creating space on the stack for the variables, or something like:
.CODE
PROC myProc
PUSH ESP
PUSH EBP
MOV ESP, 00000001
MOV EBP, 00001101
MOV EAX, [ESP]
MOV EBX, [EBP]
ENDP myProc
But this still really isn't dynamic variable creation, I am just writing and reading data back and forth between registers. So in essence I am trying to figure out how to create variable in assembly at run-time. I would appreciate any help.

Variables are a high-level concept. An asm implementation of a C function will typically have a variable live in a register for some of the time, but maybe at other times it's live in a different register, or in memory at some location once it's no longer needed (or you ran out of registers).
In asm you don't really have variables (other than static storage), except by using comments to keep track of what means what. Just move data around and produce a meaningful result.
Avoid memory whenever possible. Look at C compiler output: any decent compiler will keep everything in registers as much as possible.
int foo(int a, int b) {
int c = a + 2*b;
int d = 2*a + b;
return c + d;
}
This function compiles to the following 32-bit code with gcc6.2 -O3 -fverbose-asm (on the Godbolt compiler explorer). Notice how gcc attaches variable names to registers with comments.
mov ecx, DWORD PTR [esp+4] # a, a
mov edx, DWORD PTR [esp+8] # b, b
lea eax, [ecx+edx*2] # c,
lea edx, [edx+ecx*2] # d,
add eax, edx # tmp94, d
ret

It seems like you're using MASM syntax. The standard MASM approach to create local variables is
.CODE
PROC myProc
LOCAL a: DWORD
LOCAL b: DWORD
; Initialize those vars
MOV a, 00000001b
MOV b, 00001101b ; binary 1101 = 13; without the b suffix this is decimal 1101
RET
ENDP myProc
The LOCAL directive creates space on the stack for the variables using EBP relative indexing.
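As a rough C analogue, assuming nothing beyond standard C, two automatic variables play the role of the LOCAL declarations. The function name my_proc and the use of volatile are choices made for this sketch; volatile is there only to push the compiler toward keeping the values in stack slots instead of folding them into registers or constants.

```c
#include <assert.h>

/* Two locals that live for the duration of the call, then are read back,
   mirroring MOV EAX, a / MOV EBX, b. */
static int my_proc(void)
{
    volatile int a = 1;   /* LOCAL a: DWORD */
    volatile int b = 13;  /* LOCAL b: DWORD */
    int eax = a;          /* load from a's slot */
    int ebx = b;          /* load from b's slot */
    assert(eax == 1 && ebx == 13);
    return eax + ebx;
}
```

Without volatile an optimizing compiler would most likely not allocate any stack space here at all.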

enter low power mode within u-boot, wake up on interrupt

I am trying to implement a low-power "deep sleep" functionality in u-boot, triggered by a button press. The button press is handled by Linux, and a magic code is set to tell u-boot "stay asleep, do not reboot".
printf ("\nDisable interrupts to restore them later\n");
rupts = disable_interrupts();
printf ("\nEnable interrupts to enable magic wakeup later\n");
enable_interrupts();
printf ("\nSuspending. Press button to restart\n");
while(probe_button()/*gpio probe*/){
#if 1
//FIXME recheck if that one actually needs an unmasked interrupt or any is ok
__asm__ __volatile__(
"mcr p15, 0, %0, c7, c0, 4\n" /* read cp15 */
"mov %0, %0"
: "=r" (tmp)
:
: "memory"
);
#else
udelay (10000);
#endif
}
if (rupts) {
printf ("\nRe-Enabling interrupts\n");
enable_interrupts();
}
Unfortunately the power dissipation does not change at all (I have a power dissipation measurement tied to the chip), whether hot-spinning is used or not. Beyond that, if I use the Wait-For-Interrupt CP15 instruction, it never wakes up. The button is attached to one of the GPIOs. The platform is Marvell Kirkwood, ARM9EJ-S based.
I enabled some CONFIG_IRQ_* options manually and created implementations for arch_init_irq() as well as do_irq(); I think that is where my issue is.
According to the CP15 instruction docs, it should be enough that an interrupt gets triggered (no matter whether it is masked or not!).
Can anyone tell me what I am doing wrong or what needs to be done beyond the code above?
Thanks a lot in advance!

I'm not sure if it is the only reason your approach isn't saving power, but your inline assembly isn't correct. According to this article you need to execute:
MOV R0, #0
MCR p15, 0, r0, c7, c0, 4
but your inline assembly
__asm__ __volatile__(
"mcr p15, 0, %0, c7, c0, 4\n" /* read cp15 */
"mov %0, %0"
: "=r" (tmp)
:
: "memory"
);
produces
0: ee073f90 mcr 15, 0, r3, cr7, cr0, {4}
4: e1a03003 mov r3, r3
8: e12fff1e bx lr
I am not sure what your intent is, but mov r3, r3 doesn't have any effect, so you are making the coprocessor call with an arbitrary value. You need to set r3 (the ARM source register for mcr) before the mcr call. Btw, when you put 'memory' in the clobber list it means
... will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
Try this line,
asm("MOV R0, #0\n MCR p15, 0, r0, c7, c0, 4" : : : "r0");
it produces
c: e3a00000 mov r0, #0 ; 0x0
10: ee070f90 mcr 15, 0, r0, cr7, cr0, {4}
For power saving in general, I would recommend this article at ARM's web site.
Bonus section:
A small answer to your claim on backward compatibility of this coprocessor-supplied WFI:
ARMv7 processors (including Cortex-A8, Cortex-A9, Cortex-R4 and Cortex-M3) all implement the WFI instruction to enter "wait for interrupt" mode. On these processors, the coprocessor write used on earlier processors will always execute as a NOP. It is therefore possible to write code that will work across ARMv6K, ARMv6T2 and all profiles of ARMv7 by executing both the MCR and WFI instruction, though on ARM11MPCore this will cause "wait for interrupt" mode to be entered twice. To write fully portable code that enters "wait for interrupt" mode, the CPUID register must be read at runtime to determine whether "wait for interrupt" is available and the instruction needed to enter it.

IA-32 | reordering issue in a multiprocessor environment

I want to perform an atomic 'and' operation on IA-32.
Please consider the following situation:
; processor 0
lea edx, var
mov ecx, mask
mov eax, [edx]
lock and [edx], ecx
; processor 1
lea edx, var
mov eax, 0xff
xchg [edx], eax
I'm not sure whether the store to 'var' by processor 1 can occur between the load and the store to 'var' by processor 0.
So, is this working or do I need to spin lock like this:
; processor 0
push ebx
lea edx, var
mov ecx, mask
@@loop:
mov ebx, [edx]
mov eax, ebx
and eax, ecx
lock cmpxchg [edx], eax
cmp eax, ebx
jne @@loop
pop ebx
Thanks for any answer. Best regards.
EDIT:
In other words:
I want to perform the conjunction in 'Processor 0' and need to fetch the initial value.

An xchg that references memory automatically locks the bus (or locks the cache when/if the data is already in the cache). See the Intel reference manual, §8.3.1. (Warning: I haven't looked hard recently, but Intel used to rearrange their web site, invalidating links fairly quickly. If so, Googling for something like "intel reference 3a" should turn it up).
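Regarding the first snippet in the question: the plain mov load and the following lock and are two separate operations, so a store from the other processor can land between them even though each one is atomic by itself. As an illustration of the "AND and fetch the initial value" operation the question asks for, here is a sketch using C11 <stdatomic.h> instead of hand-written assembly. The function names are made up for this example; atomic_fetch_and and atomic_compare_exchange_weak are standard C11.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically ANDs *var with mask and returns the value that was stored
   there immediately before the AND. */
static uint32_t fetch_and_old(_Atomic uint32_t *var, uint32_t mask)
{
    assert(var != NULL);
    return atomic_fetch_and(var, mask);
}

/* The same operation written as an explicit compare-and-swap retry loop. */
static uint32_t fetch_and_cas(_Atomic uint32_t *var, uint32_t mask)
{
    assert(var != NULL);
    uint32_t old = atomic_load(var);
    /* On failure `old` is refreshed with the current value, so the new
       value old & mask is recomputed on the next attempt. */
    while (!atomic_compare_exchange_weak(var, &old, old & mask)) {
    }
    return old;
}
```

In the retry loop the expected old value and the new value are kept in separate variables; on x86, cmpxchg likewise takes the expected value in eax and the replacement in its register operand.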
