My elf file's sections are as follows
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS fffffffff7e020b0 000000b0
0000000000074f50 0000000000000000 AX 0 0 16
[ 2] .rodata PROGBITS fffffffff7e77000 00075000
0000000000014000 0000000000000000 AM 0 0 16
[ 3] .eh_frame_hdr PROGBITS fffffffff7e8b000 00089000
000000000000000c 0000000000000000 A 0 0 4
[ 4] .data PROGBITS fffffffff7e8b010 00089010
0000000000001ff0 0000000000000000 WA 0 0 16
[ 5] .bss NOBITS fffffffff7e8d000 0008b000
0000000000014000 0000000000000000 WA 0 0 4096
[ 6] .got PROGBITS fffffffff7ea1000 0009f000
0000000000001000 0000000000000000 WA 0 0 8
[ 7] .tdata PROGBITS fffffffff7ea2000 000a0000
000000000000b000 0000000000000000 WAT 0 0 8
[ 8] .tbss NOBITS fffffffff7ead000 000ab000
000000000000012a 0000000000000000 WAT 0 0 8
[ 9] .debug_abbrev PROGBITS 0000000000000000 000ab000
000000000001a8d4 0000000000000000 0 0 1
[10] .debug_info PROGBITS 0000000000000000 000c58d4
000000000016c624 0000000000000000 0 0 1
[11] .debug_aranges PROGBITS 0000000000000000 00231ef8
0000000000024130 0000000000000000 0 0 1
[12] .debug_ranges PROGBITS 0000000000000000 00256028
000000000004bfa0 0000000000000000 0 0 1
[13] .debug_str PROGBITS 0000000000000000 002a1fc8
0000000000123ecd 0000000000000001 MS 0 0 1
[14] .debug_pubnames PROGBITS 0000000000000000 003c5e95
000000000005da3c 0000000000000000 0 0 1
[15] .debug_pubtypes PROGBITS 0000000000000000 004238d1
00000000000a1e07 0000000000000000 0 0 1
[16] .debug_frame PROGBITS 0000000000000000 004c56d8
0000000000056c10 0000000000000000 0 0 8
[17] .debug_line PROGBITS 0000000000000000 0051c2e8
00000000000b5c55 0000000000000000 0 0 1
[18] .debug_loc PROGBITS 0000000000000000 005d1f3d
000000000000628c 0000000000000000 0 0 1
[19] .symtab SYMTAB 0000000000000000 005d81d0
0000000000015a20 0000000000000018 21 2565 8
[20] .shstrtab STRTAB 0000000000000000 005edbf0
00000000000000da 0000000000000000 0 0 1
[21] .strtab STRTAB 0000000000000000 005edcca
000000000004783e 0000000000000000 0 0 1
and the program headers are
Elf file type is EXEC (Executable file)
Entry point 0xfffffffff7e39be0
There are 2 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0xfffffffff7e02000 0x0000000000000000
0x00000000000ab000 0x00000000000ab12a RWE 0x1000
TLS 0x00000000000a0000 0xfffffffff7ea2000 0x00000000000a0000
0x000000000000b000 0x000000000000b12a RW 0x8
Section to Segment mapping:
Segment Sections...
00 .text .rodata .eh_frame_hdr .data .bss .got .tdata
01 .tdata .tbss
The TLS section's total size is as expected being the size of tdata + tbss sections.
From the TLS handling document, i expect that tdata and tbss sections will be right before the fs register's pointed value and thus the total size would be 0xb12a.
But one of the variables which use thread_local have an offset of fs - 0xb130 which is outside of the expected TLS size.
I am trying to understand why the variable is not offset to at-most 0xb12a but more than that?
Although the size of TLS here is 0xb12a. The alignment of 0x8 will make the TLS pointer move to 0xb130 which is the address of variable observed here.
I have a Linux executable file (which is an ELF file) called a.out.
I use
readelf -hl a.out
and get output
ELF Header:
Magic: 7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - GNU
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x4008e0
Start of program headers: 64 (bytes into file)
Start of section headers: 910680 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 6
Size of section headers: 64 (bytes)
Number of section headers: 33
Section header string table index: 30
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000c9be6 0x00000000000c9be6 R E 200000
LOAD 0x00000000000c9eb8 0x00000000006c9eb8 0x00000000006c9eb8
0x0000000000001c98 0x0000000000003550 RW 200000
NOTE 0x0000000000000190 0x0000000000400190 0x0000000000400190
0x0000000000000044 0x0000000000000044 R 4
TLS 0x00000000000c9eb8 0x00000000006c9eb8 0x00000000006c9eb8
0x0000000000000020 0x0000000000000050 R 8
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 10
GNU_RELRO 0x00000000000c9eb8 0x00000000006c9eb8 0x00000000006c9eb8
0x0000000000000148 0x0000000000000148 R 1
Section to Segment mapping:
Segment Sections...
00 .note.ABI-tag .note.gnu.build-id .rela.plt .init .plt .text __libc_freeres_fn __libc_thread_freeres_fn .fini .rodata __libc_subfreeres __libc_atexit .stapsdt.base __libc_thread_subfreeres .eh_frame .gcc_except_table
01 .tdata .init_array .fini_array .jcr .data.rel.ro .got .got.plt .data .bss __libc_freeres_ptrs
02 .note.ABI-tag .note.gnu.build-id
03 .tdata .tbss
04
05 .tdata .init_array .fini_array .jcr .data.rel.ro .got
According to the above output, this ELF file should have size
ELF header size : 64
+
Section headers : 64*33 = 2112
+
Program headers : 56*6 = 336
+
Segments : 834090 = 0x00000000000c9be6 + 0x0000000000001c98 + 0x0000000000000044 + 0x0000000000000020 + 0x0000000000000148
= 64 + 2112 + 336 + 834090 = 836602
However, /bin/ls reports file size 912792.
where are the 912792 - 836602 = 76190 bytes?
which parts did I forget to count?
UPDATE
according to #Jonathon Reinhart, I re count the size using section sizes instead of segment size by
echo "print 0x020 + 0x024 + 0x0f0 + 0x01a + 0x0a0 + 0x09eda4 + 0x02529 + 0x0de + 0x09 + 0x01d320 + 0x050 + 0x08 + 0x01 + 0x08 + 0x0b04c + 0x0b2 + 0x020 + 0x010 + 0x010 + 0x08 + 0x0e4 + 0x010 + 0x068 + 0x01ad0 + 0x023 + 0x0f18 + 0x0169 + 0x0b100 + 0x0685a" | /usr/bin/python
The section size values used above are extracted from the output of readelf -S a.out and the size of section type of "NOBITS" were not counted.
The above code output 909459 which is bigger than segment size 834090,
however, the total size is still not equal to the file size.
64 + 2112 + 336 + 909459 = 911971 != 912792
still missing 912792 - 911971 = 821 bytes.
update2 [solved]
according to the comment, I have test the offset. The result tells me that the adding pad does exist.
I have a question,about elf program segments offsize in file. For example , a program readelf -f xx -W like this:
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
INTERP 0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x4ca8e6 0x4ca8e6 R E 0x200000
LOAD 0x4cb000 0x0000000000acb000 0x0000000000acb000 0x035db8 0x04ed80 RW 0x200000
DYNAMIC 0x4ed4c8 0x0000000000aed4c8 0x0000000000aed4c8 0x000230 0x000230 RW 0x8
NOTE 0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R 0x4
TLS 0x4cb000 0x0000000000acb000 0x0000000000acb000 0x000010 0x000018 R 0x10
GNU_EH_FRAME 0x3dcf04 0x00000000007dcf04 0x00000000007dcf04 0x024c64 0x024c64 R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame .gcc_except_table
03 .tdata .init_array .fini_array .jcr .data.rel.ro .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.ABI-tag .note.gnu.build-id
06 .tdata .tbss
07 .eh_frame_hdr
08
The first load begin at offset 0x000000 and the size is 0x4ca8e6. why the second offset not (0x000000 + 0x4ca8e6), I see the (0x4cb000 - 0x4ca8e6) content, all 0. I can't get it. What the rule about the offset in file?
The first load begin at offset 0x000000 and the size is 0x4ca8e6. why the second offset not (0x000000 + 0x4ca8e6)
Because the loader mmaps LOAD segments directly into memory, for each LOAD segment the following must be true: (p_vaddr - p_offset) % page_size == 0.
On x86_64 the maximum page size is 2MiB (0x200000). This places severe restriction on the second (and subsequent) LOAD segment location.
I am trying to write a small library for highly optimised x86-64 bit operation code and am fiddling with inline asm.
While testing this particular case has caught my attention:
unsigned long test = 0;
unsigned long bsr;
// bit test and set 39th bit
__asm__ ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
printf("test = %lu, bsr = %d\n", test, bsr);
compiles and runs fine in both gcc and icc, but when I inspect the assembly I get differences
gcc -S -fverbose-asm -std=gnu99 -O3
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
movq -8(%rbp), %rax
movq %rax, -16(%rbp)
## InlineAsm Start
bsrq -16(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
I am wondering why so complicated? I am writing high performance code in which the number of instructions is critical. I am especially wondering why gcc makes a copy of my variable test before passing it to the second inline asm?
Same code compiled with icc gives far better results:
xorl %esi, %esi # test = 0
movl $.L_2__STRING.0, %edi # has something to do with printf
orl $32832, (%rsp) # part of function initiation
xorl %eax, %eax # has something to do with printf
ldmxcsr (%rsp) # part of function initiation
btsq $39, %rsi #106.0
bsrq %rsi, %rdx #109.0
call printf #111.2
despite the fact that gcc decides to keep my variables on the stack rather then in registers, what I do not understand is why make a copy of test before passing it to the second asm?
If I put test in as an input/output variable in the second asm
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) , "+rm" (test) );
then those lines disappear.
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
## InlineAsm Start
bsrq -8(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
Is this gcc screwed up optimisation or am I missing some vital compiler switches? I do have icc for my production system, but if I decide to distribute the source code at some point then it will have to be able to compile with gcc too.
compilers used:
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)
icc Version 12.0.2
I've tried your example on Linux like this (making it "evil" by forcing a stack ref/loc for test through using &test in the printf:):#include <stdio.h>
int main(int argc, char **argv)
{
unsigned long test = 0;
unsigned long bsr;
// bit test and set 39th bit
asm ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
asm ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
printf("test = %lu, bsr = %d, &test = %p\n", test, bsr, &test);
return 0;
}
and compiled it with various versions of gcc -O3 ... to the following results:
code generated gcc version
================================================================================
400630: 48 83 ec 18 sub $0x18,%rsp 4.7.2,
400634: 31 c0 xor %eax,%eax 4.6.2,
400636: bf 50 07 40 00 mov $0x400750,%edi 4.4.6
40063b: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
400640: 48 0f ba e8 27 bts $0x27,%rax
400645: 48 89 44 24 08 mov %rax,0x8(%rsp)
40064a: 48 89 c6 mov %rax,%rsi
40064d: 48 0f bd d0 bsr %rax,%rdx
400651: 31 c0 xor %eax,%eax
400653: e8 68 fe ff ff callq 4004c0
[ ... ]
---------------------------------------------------------------------------------
4004f0: 48 83 ec 18 sub $0x18,%rsp 4.1
4004f4: 31 c0 xor %eax,%eax
4004f6: bf 28 06 40 00 mov $0x400628,%edi
4004fb: 48 8d 4c 24 10 lea 0x10(%rsp),%rcx
400500: 48 c7 44 24 10 00 00 00 00 movq $0x0,0x10(%rsp)
400509: 48 0f ba e8 27 bts $0x27,%rax
40050e: 48 89 44 24 10 mov %rax,0x10(%rsp)
400513: 48 89 c6 mov %rax,%rsi
400516: 48 0f bd d0 bsr %rax,%rdx
40051a: 31 c0 xor %eax,%eax
40051c: e8 c7 fe ff ff callq 4003e8
[ ... ]
---------------------------------------------------------------------------------
400500: 48 83 ec 08 sub $0x8,%rsp 3.4.5
400504: bf 30 06 40 00 mov $0x400630,%edi
400509: 31 c0 xor %eax,%eax
40050b: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
400513: 48 89 e1 mov %rsp,%rcx
400516: 48 0f ba 2c 24 27 btsq $0x27,(%rsp)
40051c: 48 8b 34 24 mov (%rsp),%rsi
400520: 48 0f bd 14 24 bsr (%rsp),%rdx
400525: e8 fe fe ff ff callq 400428
[ ... ]
---------------------------------------------------------------------------------
4004e0: 48 83 ec 08 sub $0x8,%rsp 3.2.3
4004e4: bf 10 06 40 00 mov $0x400610,%edi
4004e9: 31 c0 xor %eax,%eax
4004eb: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
4004f3: 48 0f ba 2c 24 27 btsq $0x27,(%rsp)
4004f9: 48 8b 34 24 mov (%rsp),%rsi
4004fd: 48 89 e1 mov %rsp,%rcx
400500: 48 0f bd 14 24 bsr (%rsp),%rdx
400505: e8 ee fe ff ff callq 4003f8
[ ... ]
and while there's a significant difference in the created code (including whether the bsr acceesses test as register or memory), none of the tested revs recreate the assembly that you've shown. I'd suspect a bug in the 4.2.x version you used on MacOSX, but then I don't have either your testcase nor that specific compiler version available.
Edit: The code above is obviously different in the sense that it forces test into the stack; if that is not done, then all "plain" gcc versions I've tested do a direct pair bts $39, %rsi / bsr %rsi, %rdx.
I have found, though, that clang creates different code there: 140: 50 push %rax
141: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
149: 31 f6 xor %esi,%esi
14b: 48 0f ba ee 27 bts $0x27,%rsi
150: 48 89 34 24 mov %rsi,(%rsp)
154: 48 0f bd d6 bsr %rsi,%rdx
158: bf 00 00 00 00 mov $0x0,%edi
15d: 30 c0 xor %al,%al
15f: e8 00 00 00 00 callq printf#plt>so the difference seems to be indeed between the code generators of clang/llvm and "gcc proper".
I have made the following dummy code for testing
/tmp/test.c contains the following:
#include "test.h"
#include <stdio.h>
#include <stdlib.h>
struct s* p;
unsigned char *c;
void main(int argc, char ** argv) {
memset(c, 0, 10);
p->a = 10;
p->b = 20;
}
/tmp/test.h contains the following:
struct s {
int a;
int b;
};
I compile and run objdump as follows:
cd /tmp
gcc -c test.c -o test.o
objdump -gdsMIntel test.o
I get the following output:
test.o: file format elf32-i386
Contents of section .text:
0000 5589e5a1 00000000 c7000000 0000c740 U..............#
0010 04000000 0066c740 080000a1 00000000 .....f.#........
0020 c7000a00 0000a100 000000c7 40041400 ............#...
0030 00005dc3 ..].
Contents of section .comment:
0000 00474343 3a202855 62756e74 752f4c69 .GCC: (Ubuntu/Li
0010 6e61726f 20342e36 2e332d31 7562756e naro 4.6.3-1ubun
0020 74753529 20342e36 2e3300 tu5) 4.6.3.
Contents of section .eh_frame:
0000 14000000 00000000 017a5200 017c0801 .........zR..|..
0010 1b0c0404 88010000 1c000000 1c000000 ................
0020 00000000 34000000 00410e08 8502420d ....4....A....B.
0030 05700c04 04c50000 .p......
Disassembly of section .text:
00000000 <main>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: a1 00 00 00 00 mov eax,ds:0x0 ;;;; should be the address of unsigned char *c
8: c7 00 00 00 00 00 mov DWORD PTR [eax],0x0 ;;;; setting 10 bytes to 0
e: c7 40 04 00 00 00 00 mov DWORD PTR [eax+0x4],0x0
15: 66 c7 40 08 00 00 mov WORD PTR [eax+0x8],0x0
1b: a1 00 00 00 00 mov eax,ds:0x0
20: c7 00 0a 00 00 00 mov DWORD PTR [eax],0xa ;;;; p->a = 10;
26: a1 00 00 00 00 mov eax,ds:0x0
2b: c7 40 04 14 00 00 00 mov DWORD PTR [eax+0x4],0x14 ;;;; p->b = 20;
32: 5d pop ebp
33: c3 ret
In the above disassembly, I find that:
In the case of c, the following is done:
mov eax, ds:0x0
mov DWORD PTR [eax], 0
In the case of p->a the following is done:
mov eax, ds:0x0
mov DWORD PTR [eax],0x0
In that case, are both c and p->a located in the same address (ds:0x0)?
The .o file is not yet executable, there'll be relocation entries pointing to those 0's which the linker will fixup later. Since your source file seems mostly self-contained, can you drop the -c flag to produce an (fully linked) executable rather than a relocatable object file?