I am using a third-party library which relies on thread_local. As a result, my program calls __tls_init() repeatedly, even on every iteration of some loops (I haven't checked all of them), although the thread_local variables have already been unconditionally initialized by an earlier call within the same function (in fact, near the start of the whole program).
The first instructions in __tls_init() on my x86_64 are
cmpb $0, %fs:__tls_guard@tpoff
je .L530
ret
.L530:
pushq %rbp
pushq %rbx
subq (some stack space), %rsp
movb $1, %fs:__tls_guard@tpoff
so the first time this is called in each thread, the value at %fs:__tls_guard@tpoff is set to 1 and further calls return immediately. But still, this means paying the overhead of a call every time a thread_local variable is about to be accessed, right?
Note that this is a statically linked (in fact generated!) function, so the compiler "knows" it begins with this condition, and it is perfectly conceivable that flow analysis could find that calling this function more than once is unnecessary. But it doesn't.
Is it possible to get rid of the superfluous call __tls_init instructions, or at least, to stop the compiler from emitting them in time-critical sections?
Example situation from actual compilation: (-O3)
pushq %r13
movq %rdi, %r13
pushq %r12
pushq %rbp
pushq %rbx
movq %rsi, %rbx
subq $88, %rsp
call __tls_init // always gets called
movq (%rbx), %rdi
call <some local function>
movq 8(%rax), %r12
subq (%rax), %r12
movq %rax, %rbp
sarq $4, %r12
cmpq $1, %r12
jbe .L6512
leaq -2(%r12), %rax
movq $0, (%rsp)
leaq 48(%rsp), %rbx
movq %rax, 8(%rsp)
.L6506:
call __tls_init // needless and called potentially very many times!
movq %rsp, %rsi
movq %rsp, %rdi
addq $8, %rbx
call <some other local function>
movq %rax, -8(%rbx)
leaq 80(%rsp), %rax
cmpq %rbx, %rax
jne .L6506 // cycle
Update: the source code of the above is overly complicated. Here's a MWE:
void external(int);
struct X {
volatile int a; // to prevent optimizing to a constexpr
X() { a = 5; } // to enforce calling a c-tor for thread_local
void f() { external(a); } // to prevent disregarding the value of a
};
thread_local X x;
void f() {
x.f();
for(int j = 0; j < 10; j++)
x.f(); // x is totally initialized now
}
If you look at this compiled with maximum optimization settings in Compiler Explorer, you'll notice the same phenomenon: fs:__tls_guard@tpoff is redundantly checked against 0 on every iteration of the loop after a 1 has been stored there, namely at label .L4 (assuming the output stays the same), even though __tls_init is inlined in this super-simple case.
Although this question is about G++, Clang (see Compiler Explorer) makes this even more obvious.
One could say that the external function call could overwrite the stored value in this example. But then what would be guaranteed at all? By that logic it could also violate the calling conventions. In these respects the compiler simply has to assume the callee will play nice. Besides, there were no external functions in my main code above, which is a single translation unit, just a rather large one (it turns out that in small examples like the MWE the compiler will detect and remove the extraneous tests, showing that it must be possible somehow).
I don't know if there is any compiler option to eliminate the tls call, but your specific code could be optimized by using a pointer to the TLS object in the function:
void f() {
auto ptr = &x;
ptr->f();
for(int j = 0; j < 10; j++)
ptr->f();
}
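A self-contained variant of the same idea, as a rough sketch: the address of the thread_local object is taken once before the loop, so the loop body only dereferences an ordinary pointer. The names Counter, counter and bumpRepeatedly are made up for illustration and are not part of the code in the question; whether the compiler actually drops the repeated guard checks still has to be confirmed in the generated assembly.

```cpp
#include <cassert>

struct Counter {
    int value;
    Counter() : value(5) {}   // non-trivial constructor, so the TLS object needs dynamic initialization
    void bump() { ++value; }
};

// Hypothetical stand-in for the library's thread_local object.
thread_local Counter counter;

// Returns how many increments were applied, so the result does not
// depend on any earlier calls made on the same thread.
int bumpRepeatedly(int iterations) {
    assert(iterations >= 0);
    Counter* local = &counter;        // single TLS lookup (with its guard check) before the loop
    const int before = local->value;
    for (int i = 0; i < iterations; ++i)
        local->bump();                // ordinary pointer dereference inside the loop
    return local->value - before;
}
```

Calling bumpRepeatedly(10) applies ten increments through the cached pointer. The pointer must not be handed to another thread, since it refers to the calling thread's instance only.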
I was digging into what a block really is, and I found that in _Block_copy_internal() we assign _NSConcreteMallocBlock to result->isa. But _NSConcreteMallocBlock is an array of 32 void * elements, which confused me a lot. Why is _NSConcreteMallocBlock defined as an array of pointers? And how does dyld link _NSConcreteMallocBlock to the NSMallocBlock class?
Declaring it as 32 pointers simply reserves enough space for the class object that will be put there later.
The comment in https://opensource.apple.com/source/libclosure/libclosure-65/data.c says:
"These data areas are set up by Foundation to link in as real classes post facto."
Foundation is closed-source so you cannot see how that works and what content they put into that space.
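The general technique of reserving raw storage and filling it in later can be sketched in C++ as follows. This is only an illustration of the idea, not what Foundation actually does: FakeClass, reserved_block_class and install_class are hypothetical names, and the real class object layout is different.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for a class object; the real layout is not shown in the source.
struct FakeClass {
    const void* isa;
    const char* name;
    int version;
};

// Storage reserved up front as 32 pointers, mirroring the void *[32] declaration.
void* reserved_block_class[32] = {nullptr};

static_assert(sizeof(FakeClass) <= sizeof(reserved_block_class),
              "reserved area must be large enough for the object");
static_assert(alignof(FakeClass) <= alignof(void*),
              "reserved area must be suitably aligned");

// Constructs the object inside the reserved area after the fact.
FakeClass* install_class(const char* name) {
    return new (reserved_block_class) FakeClass{nullptr, name, 1};
}
```

Anything that took the address of reserved_block_class before install_class ran ends up pointing at the installed object afterwards, which is why an isa pointer can refer to the reserved area before it has been filled in.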
Thanks, I understood this after disassembling the CoreFoundation and Foundation frameworks, where I found this code:
___CFMakeNSBlockClasses:
0000000000008029 leaq 0x4522b0(%rip), %rdi ## literal pool for: "__NSStackBlock"
0000000000008030 callq 0x1d4858 ## symbol stub for: _objc_lookUpClass
0000000000008035 movq 0x46f07c(%rip), %rdx ## literal pool symbol address: __NSConcreteStackBlock
000000000000803c movq %rdx, %rcx
000000000000803f subq $-0x80, %rcx
0000000000008043 leaq 0x4522a5(%rip), %rsi ## literal pool for: "__NSStackBlock__"
000000000000804a movq %rax, %rdi
This assembly corresponds to the following Objective-C code:
Class __NSStackBlock = _objc_lookUpClass("__NSStackBlock");
objc_initializeClassPair_internal(__NSStackBlock, "__NSStackBlock__", &__NSConcreteStackBlock, &__NSConcreteStackBlock+0x80);
The assembly for the method -(BOOL) f { return true; } (on my iMac) is:
test`-[AppDelegate f]:
0x1000014d0 <+0>: pushq %rbp
0x1000014d1 <+1>: movq %rsp, %rbp
0x1000014d4 <+4>: movb $0x1, %al
0x1000014d6 <+6>: movq %rdi, -0x8(%rbp)
0x1000014da <+10>: movq %rsi, -0x10(%rbp)
-> 0x1000014de <+14>: movsbl %al, %eax
0x1000014e1 <+17>: popq %rbp
0x1000014e2 <+18>: retq
(to generate this I set a breakpoint on the return statement and Debug -> Debug Workflow -> Always show disassembly).
I was surprised it is eight instructions.
pushq %rbp
movq %rsp, %rbp
:
popq %rbp
retq
^ this seems to be standard boilerplate for managing the stack and returning.
movb $0x1, %al
movsbl %al, %eax
^ this loads hex 00 00 00 01 into EAX, which is the register used for the return value.
movq %rdi, -0x8(%rbp)
movq %rsi, -0x10(%rbp)
^ but what are these doing? Aren't the above 6 lines sufficient?
EDIT: I found http://www.idryman.org/blog/2014/12/02/writing-64-bit-assembly-on-mac-os-x/ helpful.
In ObjC there are two implicit parameters to every method, self and _cmd. These are passed in %rdi and %rsi (those are the rules of the 64-bit ABI). They're being saved to the stack in case those registers get overwritten by another function call somewhere in this method. If you turn on optimizations, you'll see that those instructions are removed (since the saved values are never actually needed).
It's the function prologue and epilogue.
https://en.wikipedia.org/wiki/Function_prologue
I am using Parse to populate a table view. When I try to load the table rows, I get the error shown below.
libswiftCore.dylib`swift_dynamicCastObjCClassUnconditional:
0x103710991: je 0x1037109ac ; swift_dynamicCastObjCClassUnconditional + 44
0x103710993: movq 0x7f236(%rip), %rsi ; "isKindOfClass:"
0x1037109a0: callq 0x10371346a ; symbol stub for: objc_msgSend
0x1037109aa: je 0x1037109b3 ; swift_dynamicCastObjCClassUnconditional + 51
0x1037109b3: leaq 0xc158(%rip), %rax ; "Swift dynamic cast failed"
0x1037109ba: movq %rax, 0x87427(%rip) ; gCRAnnotations + 8
My code line is :
let array:NSArray = self.cartoonData.reverseObjectEnumerator().allObjects
self.cartoonData = array as NSMutableArray
I think this is the line that causes the error, but I don't know how to fix it.
The error seems to speak for itself:
Swift dynamic cast failed
which I interpret as: array cannot be cast to NSMutableArray. You should create a mutable copy of it:
self.cartoonData = NSMutableArray(array: array)
Consider the following C99 code:
#include <stdio.h>
#include <stdint.h>
struct baz { uint64_t x, y; };
uint64_t foo(uint64_t a, uint64_t b, struct baz c)
{
return a + b + c.x + c.y;
}
void bar(uint64_t a, uint64_t b, struct baz c)
{
printf("%lu\n", a);
}
The behavior I expect, when compiled with gcc -O3, is that c is passed in registers to both foo and bar, is accessed using registers in foo, and is entirely ignored in bar. GCC produces code which does this for foo. However, in bar, c is copied from registers to the stack, and is then promptly ignored:
.file "pbv.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB22:
.cfi_startproc
leaq (%rcx,%rdx), %rdx
leaq (%rdx,%rdi), %rdi
leaq (%rdi,%rsi), %rax
ret
.cfi_endproc
.LFE22:
.size foo, .-foo
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%lu\n"
.text
.p2align 4,,15
.globl bar
.type bar, @function
bar:
.LFB23:
.cfi_startproc
movq %rdx, -24(%rsp)
movl $.LC0, %esi
movq %rdi, %rdx
xorl %eax, %eax
movl $1, %edi
movq %rcx, -16(%rsp)
jmp __printf_chk
.cfi_endproc
.LFE23:
.size bar, .-bar
.ident "GCC: (Ubuntu/Linaro 4.4.6-11ubuntu2) 4.4.6"
.section .note.GNU-stack,"",@progbits
(Note that a and b are passed in %rdi and %rsi, and c is passed in %rdx and %rcx.)
The only reason I can surmise for this is some sort of ABI requirement (e.g. for interaction with longjmp). I cannot find any optimization (-f) options for GCC, nor GCC-specific annotations which inhibit this behavior. Annotating c with register does not help.
This happens with different targets as well. (Notably, on the TileGX, foo has space allocated and deallocated on the stack, but nothing is stored there.) I have tested both GCC 4.4.6 and 4.6.1.
Is this expected behavior or a bug in GCC? Either way, is there some way to work around it (beside using call-by-reference or ensuring bar can be a leaf)?
This shortcoming is the same as mentioned in bug 44194, the patch for which is present in the very latest version of GCC (4.7.2).
The cause is roughly that the call to printf (or any function) is considered to be able to access anything in memory, including stack-based locals. The patch causes stack-based locals not to be considered reachable by the callee.
For this example, I am working with objective-c, but answers from the broader C/C++ community are welcome.
@interface BSWidget : NSObject {
float tre[3];
}
@property(assign) float* tre;
.
- (void)assignToTre:(float*)triplet {
tre[0] = triplet[0];
tre[1] = triplet[1];
tre[2] = triplet[2];
}
.
- (void)copyToTre:(float*)triplet {
memcpy(tre, triplet, sizeof(tre) );
}
So between these two approaches, and considering the fact that these setter functions will only generally handle dimensions of 2,3, or 4...
What would be the most efficient approach for this situation?
Will gcc generally reduce these to the same basic operations?
Thanks.
A quick test seems to show that the compiler, when optimising, replaces the memcpy call with the instructions to perform the assignment.
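Disassembling the following code, compiled both unoptimised and with -O2, shows that in the optimised case the testMemcpy function does not contain a call to memcpy.
#include <stdlib.h>
#include <string.h>
struct test { int a; char b; }; /* assumed definition; its 8-byte size matches the disassembly below */
struct test src = { .a=1, .b='x' };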
void testMemcpy(void)
{
struct test *dest = malloc(sizeof(struct test));
memcpy(dest, &src, sizeof(struct test));
}
void testAssign(void)
{
struct test *dest = malloc(sizeof(struct test));
*dest = src;
}
Unoptimised testMemcpy, with a memcpy call as expected
(gdb) disassemble testMemcpy
Dump of assembler code for function testMemcpy:
0x08048414 <+0>: push %ebp
0x08048415 <+1>: mov %esp,%ebp
0x08048417 <+3>: sub $0x28,%esp
0x0804841a <+6>: movl $0x8,(%esp)
0x08048421 <+13>: call 0x8048350 <malloc@plt>
0x08048426 <+18>: mov %eax,-0xc(%ebp)
0x08048429 <+21>: movl $0x8,0x8(%esp)
0x08048431 <+29>: movl $0x804a018,0x4(%esp)
0x08048439 <+37>: mov -0xc(%ebp),%eax
0x0804843c <+40>: mov %eax,(%esp)
0x0804843f <+43>: call 0x8048340 <memcpy@plt>
0x08048444 <+48>: leave
0x08048445 <+49>: ret
Optimised testAssign
(gdb) disassemble testAssign
Dump of assembler code for function testAssign:
0x080483f0 <+0>: push %ebp
0x080483f1 <+1>: mov %esp,%ebp
0x080483f3 <+3>: sub $0x18,%esp
0x080483f6 <+6>: movl $0x8,(%esp)
0x080483fd <+13>: call 0x804831c <malloc@plt>
0x08048402 <+18>: mov 0x804a014,%edx
0x08048408 <+24>: mov 0x804a018,%ecx
0x0804840e <+30>: mov %edx,(%eax)
0x08048410 <+32>: mov %ecx,0x4(%eax)
0x08048413 <+35>: leave
0x08048414 <+36>: ret
Optimised testMemcpy does not contain a memcpy call
(gdb) disassemble testMemcpy
Dump of assembler code for function testMemcpy:
0x08048420 <+0>: push %ebp
0x08048421 <+1>: mov %esp,%ebp
0x08048423 <+3>: sub $0x18,%esp
0x08048426 <+6>: movl $0x8,(%esp)
0x0804842d <+13>: call 0x804831c <malloc@plt>
0x08048432 <+18>: mov 0x804a014,%edx
0x08048438 <+24>: mov 0x804a018,%ecx
0x0804843e <+30>: mov %edx,(%eax)
0x08048440 <+32>: mov %ecx,0x4(%eax)
0x08048443 <+35>: leave
0x08048444 <+36>: ret
Speaking from a C background, I recommend using direct assignment. That version of the code is more obvious as to your intent, and less error-prone if your array changes in the future and adds extra indices that your function doesn't need to copy.
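As a rough C++ sketch of the two setters side by side (Widget and settersAgree are hypothetical names, not from the question), the following checks that both approaches leave the same bytes in the array. It also shows why sizeof(tre) is only safe here: tre is a real array member, so sizeof yields 3 * sizeof(float); if tre were a float* it would yield the size of a pointer instead.

```cpp
#include <cassert>
#include <cstring>

struct Widget {
    float tre[3];

    // Element-by-element assignment, as in assignToTre.
    void assignToTre(const float* triplet) {
        tre[0] = triplet[0];
        tre[1] = triplet[1];
        tre[2] = triplet[2];
    }

    // Bulk copy, as in copyToTre; sizeof(tre) is the size of the whole array.
    void copyToTre(const float* triplet) {
        std::memcpy(tre, triplet, sizeof(tre));
    }
};

static_assert(sizeof(Widget::tre) == 3 * sizeof(float),
              "tre is an array, so sizeof covers all three elements");

// Returns true when both setters produce byte-identical contents.
bool settersAgree(const float* triplet) {
    Widget a{}, b{};
    a.assignToTre(triplet);
    b.copyToTre(triplet);
    return std::memcmp(a.tre, b.tre, sizeof(a.tre)) == 0;
}
```

This only demonstrates that the results match; which version is faster on a given compiler and target still has to be checked in the generated code or by profiling.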
The two are not strictly equivalent. memcpy is typically implemented as a loop that copies the data in fixed-size chunks (that may be smaller than a float), so the compiler probably won't generate the same code for the memcpy case. The only way to know for sure is to build it both ways and look at the emitted assembly in a debugger.
Even if the memcpy call is inlined, it will probably result in more code and slower execution time. The direct assignment case should be more efficient (unless your target platform requires special code to handle float datatypes). This is only an educated guess, however; the only way to know for sure is to try it both ways and profile the code.
memcpy:
1. Do function prolog.
2. Initialize counter and pointers.
3. Check if have bytes to copy.
4. Copy memory.
5. Increment pointer.
6. Increment pointer.
7. Increment counter.
8. Repeat 3-7 3 or 11 more times.
9. Do function epilog.
Direct assignment:
1. Copy memory.
2. Copy memory.
3. Copy memory.
As you see, direct assignment is much faster.