How to pass function parameters into inline assembly blocks without assigning them to register variables in C++ [duplicate] - g++

This question already has answers here:
How to access C variable for inline assembly manipulation?
(2 answers)
How to invoke a system call via syscall or sysenter in inline assembly?
(2 answers)
How do I pass inputs into extended asm?
(1 answer)
Closed 3 years ago.
I am trying to write a function which prints a string to stdout without including <cstdio> or <iostream>.
For this I am trying to pass two parameters (a const char* and a const unsigned) into the asm(...) block in C++ code and invoke the write syscall.
This works fine:
void writeInAsm(const char* str, const unsigned len) {
register const char* arg3 asm("rsi") = str;
register const unsigned arg4 asm("rdx") = len;
asm(
"mov rax, 1 ;" // write syscall
"mov rdi, 1 ;" // file descriptor 1 - stdout
"syscall ;"
);
}
Is it possible to do this without those first two lines in which I assign parameters to registers?
The following lines don't work:
mov rsi, str;
// error: relocation R_X86_64_32S against undefined symbol `str' can not be used when making a PIE object; recompile with -fPIC
// compiled with -fPIC - still got this error
mov rsi, [str];
// error: relocation R_X86_64_32S against undefined symbol `str' can not be used when making a PIE object; recompile with -fPIC
// compiled with -fPIC - still got this error
mov rsi, dword ptr str;
// incorrect register `rsi' used with `l' suffix
mov rsi, dword ptr [str];
// incorrect register `rsi' used with `l' suffix
I am compiling with g++ -masm=intel. I am on an x86_64 Intel® Core™ i7-7700HQ CPU @ 2.80GHz × 8, Ubuntu 19.04 with the 5.0.0-36-generic kernel (if it matters).
$ g++ --version
g++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0
Edit: According to Compiler Explorer, the following can be used:
void writeInAsm(const char* str, const unsigned len) {
asm(
"mov rax, 1 ;"
"mov rdi, 1 ;"
"mov rsi, QWORD PTR [rbp-8] ;"
"mov edx, DWORD PTR [rbp-12] ;"
"syscall ;"
);
}
But is it always the rbp register, and how does it change with a larger number of parameters?
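For reference, the usual fix (a sketch, assuming x86-64 Linux; the long return value is my own choice, not from the question) is to let the compiler bind the parameters itself via extended-asm input constraints, instead of guessing frame offsets:

```cpp
// Extended asm: the compiler places str/len into rsi/rdx itself via
// the constraint letters, so no register variables and no hardcoded
// [rbp-8] offsets are needed.
long writeInAsm(const char* str, unsigned len) {
    long ret;
    asm volatile(
        "syscall"
        : "=a"(ret)    // rax: bytes written (or -errno)
        : "a"(1),      // rax = 1 (write syscall number)
          "D"(1),      // rdi = 1 (stdout)
          "S"(str),    // rsi = buffer
          "d"(len)     // rdx = length
        : "rcx", "r11", "memory");  // syscall clobbers rcx and r11
    return ret;
}
```

This also answers the rbp question: the operands are not tied to the frame pointer at all, so it keeps working at any optimization level and with any number of parameters. Because the template contains only the bare syscall mnemonic, it compiles the same with or without -masm=intel.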

Related

Variable exporting between asm files

In asm file 1, I try to export a variable and use it in another file.
I've tried to find out how to do that from manuals & tutorials, but without success.
So, how can I share a global variable between asm files?
// File 1
// Here is saved value of a register (r10) to a variable.
.section .data
.global r10_save
r10_save_addr: .word r10_save
.section .text
ldr r13, =r10_save_addr // Load address for the global variable to some reg (r13)
str r13, [r10] // Save r13 to global variable
// File 2
// Here the intention is to use the variable (that should have r10 value stored).
.section .data
str_r10:
.asciz "r10 = 0x"
strlen_r10 = .-str_r10
.section .text
/* Here I want to use the reference of a variable
which has got its value in other file.
*/
mov r0, $1 //
ldr r1, =str_r10 // address of text string
ldr r2, =strlen_r10 // number of bytes to write
mov r7, $4 //
swi 0
You can use .extern to reference the value of global variables:
// File 2
// Here the intention is to use the variable (that should have r10 value stored).
.section .data
str_r10:
.asciz "r10 = 0x"
strlen_r10 = .-str_r10
.section .text
/* Here I want to use the reference of a variable
which has got its value in other file.
*/
.extern r10_save // or EXTERN DATA(r10_save)
mov r0, $1 //
ldr r1, =str_r10 // address of text string
ldr r2, =strlen_r10 // number of bytes to write
mov r7, $4 //
swi 0
Then you can access r10_save in the second file too.

Which is the fastest way to find the last N bits of an integer?

Which algorithm is fastest for returning the last n bits in an unsigned integer?
1.
return num & ((1 << bits) - 1)
2.
return num % (1 << bits)
3.
let shift = num.bitWidth - bits
return (num << shift) >> shift
(where bitWidth is the width of the integer, in bits)
Or is there another, faster algorithm?
This is going to depend heavily on what compiler you have, what the optimization settings are, and what size of integers you're working with.
My hypothesis going into this was that the answer would be "the compiler will be smart enough to optimize all of these in a way that's better than whatever you'd choose to write." And in some sense, that's correct. Consider the following three pieces of code:
#include <stdint.h>
#include <limits.h>
uint32_t lastBitsOf_v1(uint32_t number, uint32_t howManyBits) {
return number & ((1 << howManyBits) - 1);
}
uint32_t lastBitsOf_v2(uint32_t number, uint32_t howManyBits) {
return number % (1 << howManyBits);
}
uint32_t lastBitsOf_v3(uint32_t number, uint32_t howManyBits) {
uint32_t shift = sizeof(number) * CHAR_BIT - howManyBits;
return (number << shift) >> shift;
}
Over at the godbolt compiler explorer with optimization turned up to -Ofast with -march=native enabled, we get this code generated for the three functions:
lastBitsOf_v1(unsigned int, unsigned int):
bzhi eax, edi, esi
ret
lastBitsOf_v2(unsigned int, unsigned int):
bzhi eax, edi, esi
ret
lastBitsOf_v3(unsigned int, unsigned int):
mov eax, 32
sub eax, esi
shlx edi, edi, eax
shrx eax, edi, eax
ret
Notice that the compiler recognized what you were trying to do with the first two versions of this function and completely rewrote the code to use the bzhi x86 instruction, which zeroes all bits of a register at and above a given index, keeping just the low bits. In other words, the compiler was able to reduce each function to a single assembly instruction! On the other hand, the compiler didn't recognize what the last version was trying to do, so it actually generated the code as written and performed the shifts and the subtraction.
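To make the bzhi semantics concrete, here is a small portable sketch (my own illustration, not part of the original answer) that computes the same thing for a 32-bit operand:

```cpp
#include <cstdint>

// Emulates bzhi for a 32-bit operand: zero every bit at position
// `index` and above; for index >= 32 the value passes through intact.
static uint32_t emulated_bzhi(uint32_t value, uint32_t index) {
    if (index >= 32) return value;
    return value & ((UINT32_C(1) << index) - 1);
}
```

For example, emulated_bzhi(0xFF, 4) keeps only the low four bits, giving 0x0F.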
But that's not the end of the story. Imagine that the number of bits to extract is known in advance. For example, suppose we want the lower 13 bits. Now, watch what happens with this code:
#include <stdint.h>
#include <limits.h>
uint32_t lastBitsOf_v1(uint32_t number) {
return number & ((1 << 13) - 1);
}
uint32_t lastBitsOf_v2(uint32_t number) {
return number % (1 << 13);
}
uint32_t lastBitsOf_v3(uint32_t number) {
return (number << 19) >> 19;
}
These are literally the same functions, just with the bit amount hardcoded. Now look at what gets generated:
lastBitsOf_v1(unsigned int):
mov eax, edi
and eax, 8191
ret
lastBitsOf_v2(unsigned int):
mov eax, edi
and eax, 8191
ret
lastBitsOf_v3(unsigned int):
mov eax, edi
and eax, 8191
ret
All three versions get compiled to the exact same code. The compiler saw what we're doing in each case and replaced it with this much simpler code that's basically the first version.
After seeing all of this, what should you do? My recommendation would be the following:
Unless this code is an absolute performance bottleneck - as in, you've measured your code's runtime and you're absolutely certain that the code for extracting the low bits of numbers is what's actually slowing you down - I wouldn't worry too much about this at all. Pick the most readable code that you can. I personally find option (1) the cleanest, but that's just me.
If you absolutely must get every ounce of performance out of this that you can, rather than taking my word for it, I'd recommend tinkering around with different versions of the code and seeing what assembly gets generated in each case and running some performance experiments. After all, if something like this is really important, you'd want to see it for yourself!
Hope this helps!

Direct2D COM calls returning 64-bit structs and C++Builder 2010

I'm trying to get the size of a Direct2D Bitmap and getting an immediate crash.
// props and target etc all set up beforehand.
CComPtr<ID2D1Bitmap> &b;
target->CreateBitmap(D2D1::SizeU(1024,1024), frame.p_data, 1024* 4, &props, &b));
D2D_SIZE_U sz = b->GetPixelSize(); // Crashes here.
All other operations using the bitmap (including drawing it) work correctly. It's just returning the size that seems to be the problem.
Based on articles like this one by Rudy Velthuis, my suspicion is that it's some incompatibility between C++Builder 2010 and how COM functions return 64-bit structures: http://rvelthuis.de/articles/articles-convert.html
The Delphi declaration of GetPixelSize looks like this: (from D2D1.pas)
// Returns the size of the bitmap in resolution dependent units, (pixels).
procedure GetPixelSize(out pixelSize: TD2D1SizeU); stdcall;
... and in D2D1.h it's
//
// Returns the size of the bitmap in resolution dependent units, (pixels).
//
STDMETHOD_(D2D1_SIZE_U, GetPixelSize)(
) CONST PURE;
Can I fix this without rewriting the D2D headers?
All suggestions welcome - except upgrading from C++Builder 2010 which is more of a task than I'm ready for at the moment.
"getInfo" is a function derived from Delphi code which can work around the problem.
void getInfo(void* itfc, void* info, int vmtofs)
{
asm {
push info            // pass pointer to the return result
mov eax,itfc         // eax points to the interface
push eax             // pass pointer to the interface
mov eax,[eax]        // eax points to the VMT
add eax,vmtofs       // eax points to the address of the virtual function
call dword ptr [eax] // call the function
}
}
Disassembly of code generated by CBuilder, which results in a crash:
Graphics.cpp.162: size = bmp->GetSize();
00401C10 8B4508 mov eax,[ebp+$08]
00401C13 FF7004 push dword ptr [eax+$04]
00401C16 8D55DC lea edx,[ebp-$24]
00401C19 52 push edx
00401C1A 8B4D08 mov ecx,[ebp+$08]
00401C1D 8B4104 mov eax,[ecx+$04]
00401C20 8B10 mov edx,[eax]
00401C22 FF5210 call dword ptr [edx+$10]
00401C25 8B4DDC mov ecx,[ebp-$24]
00401C28 894DF8 mov [ebp-$08],ecx
00401C2B 8B4DE0 mov ecx,[ebp-$20]
00401C2E 894DFC mov [ebp-$04],ecx
"bmp" is declared as
ID2D1Bitmap* bmp;
Code to call "getInfo":
D2D1_SIZE_F size;
getInfo(bmp,&size,0x10);
You get 0x10 (vmtofs) from the disassembly line "call dword ptr [edx+$10]".
You can call "GetPixelSize", "GetPixelFormat" and others through "getInfo":
D2D1_SIZE_U ps;// = bmp->GetPixelSize();
getInfo(bmp,&ps,0x14);
D2D1_PIXEL_FORMAT pf;// = bmp->GetPixelFormat();
getInfo(bmp,&pf,0x18);
"getInfo" works with "STDMETHOD_ ... CONST PURE;" methods that return a result.
STDMETHOD_(D2D1_SIZE_F, GetSize)(
) CONST PURE;
For this method, C++Builder generates malfunctioning code.
In case of
STDMETHOD_(void, GetDpi)(
__out FLOAT *dpiX,
__out FLOAT *dpiY
) CONST PURE;
the C++Builder-generated code works fine, since GetDpi returns void.
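For background (my own sketch, not C++Builder-specific code): a struct too large for the register return convention is typically returned through a hidden pointer that the caller passes as an extra argument, and the crash above is consistent with caller and callee disagreeing about where that hidden argument sits relative to the this pointer. Conceptually the lowering looks like this:

```cpp
struct SizeU { unsigned width, height; };

// What a by-value struct return conceptually lowers to: the caller
// allocates the result object and passes its address as a hidden
// extra parameter.
static void GetPixelSize_lowered(SizeU* hiddenResult) {
    hiddenResult->width = 1024;
    hiddenResult->height = 1024;
}

// The source-level "SizeU GetPixelSize()" as the compiler sees it.
static SizeU GetPixelSize_source() {
    SizeU s;
    GetPixelSize_lowered(&s);  // compiler-inserted call shape
    return s;
}
```

If the callee instead expects the hidden pointer after this (or not at all), it writes the result through a garbage address, which matches the immediate crash.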

Parameter passing to a subroutine in x64 inline assembly (VC++ 2015 with Intel C++ Compiler 2017)

I coded
main()
{
unsigned char *memory;
unsigned int a=15;
float sigma=5.0f;
gaussian_filter();
}
gaussian_filter()
{
unsigned __int64 evacuate_rbp;
unsigned __int64 evacuate_rsp;
__asm
{
mov eax, a
...
mov evacuate_rbp, rbp
mov evacuate_rsp, rsp
...
mov rsp, memory
...
movss xmm0, sigma
....
mov rbp, evacuate_rbp
mov rsp, evacuate_rsp
}
}
I want to use the rbp and rsp registers as address indexes into memory without passing them as parameters.
When building with the Intel C++ compiler, an error occurs. Why?

Speeding up the loop

I have the following piece of code:
for chunk in imagebuf.chunks_mut(4) {
let temp = chunk[0];
chunk[0] = chunk[2];
chunk[2] = temp;
}
For an array of 40000 u8s, it takes about 2.5 ms on my machine, compiled using cargo build --release.
The following C++ code takes about 100 us for the exact same data (verified by implementing it and using FFI to call it from rust):
for(;imagebuf!=endbuf;imagebuf+=4) {
char c=imagebuf[0];
imagebuf[0]=imagebuf[2];
imagebuf[2]=c;
}
I'm thinking it should be possible to speed up the Rust implementation to perform as fast as the C++ version.
The Rust program was built using cargo --release, the C++ program was built without any optimization flags.
Any hints?
I cannot reproduce the timings you are getting. You probably have an error in how you measure (or I have 😉). On my machine both versions run in exactly the same time.
In this answer, I will first compare the assembly output of both, the C++ and the Rust version. Afterwards I will describe how to reproduce my timings.
Assembly comparison
I generated the assembly code with the amazing Compiler Explorer (Rust code, C++ Code). I compiled the C++ code with optimizations activated (-O3), too, to make it a fair game (C++ compiler optimizations had no impact on the measured timings though). Here is the resulting assembly (Rust left, C++ right):
example::foo_rust: | foo_cpp(char*, char*):
test rsi, rsi | cmp rdi, rsi
je .LBB0_5 | je .L3
mov r8d, 4 |
.LBB0_2: | .L5:
cmp rsi, 4 |
mov rdx, rsi |
cmova rdx, r8 |
test rdi, rdi |
je .LBB0_5 |
cmp rdx, 3 |
jb .LBB0_6 |
movzx ecx, byte ptr [rdi] | movzx edx, BYTE PTR [rdi]
movzx eax, byte ptr [rdi + 2] | movzx eax, BYTE PTR [rdi+2]
| add rdi, 4
mov byte ptr [rdi], al | mov BYTE PTR [rdi-2], al
mov byte ptr [rdi + 2], cl | mov BYTE PTR [rdi-4], dl
lea rdi, [rdi + rdx] |
sub rsi, rdx | cmp rsi, rdi
jne .LBB0_2 | jne .L5
.LBB0_5: | .L3:
| xor eax, eax
ret | ret
.LBB0_6: |
push rbp +-----------------+
mov rbp, rsp |
lea rdi, [rip + panic_bounds_check_loc.3] |
mov esi, 2 |
call core::panicking::panic_bounds_check#PLT |
You can immediately see that C++ does in fact produce a lot less assembly (without optimization, C++ produced nearly as many instructions as Rust does). I am not sure about all of the additional instructions Rust produces, but at least half of them are for bounds checking. But this bounds checking is, as far as I understand, not for the actual accesses via [] but done just once every loop iteration. This is just for the case that the slice's length is not divisible by 4. But I guess the Rust assembly could still be better (even with bounds checks).
As mentioned in the comments, you can remove bounds checking by using get_unchecked() and get_unchecked_mut(). Note, however, that this did not influence the performance in my measurements!
Lastly: you should use [T]::swap(i, j) here.
for chunk in imagebuf.chunks_mut(4) {
chunk.swap(0, 2);
}
This, again, did not notably influence performance. But it's shorter and better code.
Measuring
I used this C++ code (in foocpp.cpp):
extern "C" void foo_cpp(char *imagebuf, char *endbuf);
void foo_cpp(char* imagebuf, char* endbuf) {
for(;imagebuf!=endbuf;imagebuf+=4) {
char c=imagebuf[0];
imagebuf[0]=imagebuf[2];
imagebuf[2]=c;
}
}
I compiled it with:
gcc -c -O3 foocpp.cpp && ar rvs libfoocpp.a foocpp.o
Then I used this Rust code to measure everything:
#![feature(test)]
extern crate libc;
extern crate test;
use test::black_box;
use std::time::Instant;
#[link(name = "foocpp")]
extern {
fn foo_cpp(start: *mut libc::c_char, end: *const libc::c_char);
}
pub fn foo_rust(imagebuf: &mut [u8]) {
for chunk in imagebuf.chunks_mut(4) {
let temp = chunk[0];
chunk[0] = chunk[2];
chunk[2] = temp;
}
}
fn main() {
let mut buf = [0u8; 40_000];
let before = Instant::now();
foo_rust(black_box(&mut buf));
black_box(buf);
println!("rust: {:?}", Instant::now() - before);
// ----------------------------------
let mut buf = [0u8 as libc::c_char; 40_000];
let before = Instant::now();
let ptr = buf.as_mut_ptr();
let end = unsafe { ptr.offset(buf.len() as isize) };
unsafe { foo_cpp(black_box(ptr), black_box(end)); }
black_box(buf);
println!("cpp: {:?}", Instant::now() - before);
}
The black_box() all over the place prevents the compiler from optimizing where it isn't supposed to. I executed it with (nightly compiler):
LIBRARY_PATH=.:$LIBRARY_PATH cargo run --release
Giving me (i7-6700HQ) values like these:
rust: Duration { secs: 0, nanos: 30583 }
cpp: Duration { secs: 0, nanos: 30810 }
The times fluctuate a lot (way more than the difference between both versions). I am not exactly sure why the additional assembly generated by Rust does not result in a slower execution, though.