In an ELF file, how does the address for _start get detemined? - elf

I've been reading the ELF specification and cannot figure out where the program entry point and _start address come from.
It seems like they should have to be in a pretty consistent place, but I made a few trivial programs, and _start is always in a different place.
Can anyone clarify?

The _start symbol may be defined in any object file. Normally it is generated automatically (it corresponds to main in C). You can generate it yourself, for instance in an assembler source file:
.globl _start
_start:
// assembly here
When the linker has processed all object files it looks for the _start symbol and puts its value in the e_entry field of the elf header. The loader takes the address from this field and makes a call to it after it has finished loading all sections in memory and is ready to execute the file.

Take a look at the linker script ld is using:
ld -verbose
The format is documented at: https://sourceware.org/binutils/docs-2.25/ld/Scripts.html
It determines basically everything about how the executable will be generated.
On Binutils 2.24 Ubuntu 14.04 64-bit, it contains the line:
ENTRY(_start)
which sets the entry point to the _start symbol (goes to the ELF header as mentioned by ctn)
And then:
. = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
which sets the address of the first headers to 0x400000 + SIZEOF_HEADERS.
I have modified that address to 0x800000, passed my custom script with ld -T and it worked: readelf -s says that _start is at that address.
Another way to change it is to use the -Ttext-segment=0x800000 option.
The reason for using 0x400000 = 4Mb = getconf PAGE_SIZE is to start at the beginning of the second page as asked at: Why is the ELF execution entry point virtual address of the form 0x80xxxxx and not zero 0x0?
A question describes how to set _start from the command line: Why is the ELF entry point 0x8048000 not changeable with the "ld -e" option?
SIZEOF_HEADERS is the size of the ELF + program headers, which are at the beginning of the ELF file. That data gets loaded into the very beginning of the virtual memory space by Linux (TODO why?) In a minimal Linux x86-64 hello world with 2 program headers it is worth 0xb0, so that the _start symbol comes at 0x4000b0.

I'm not sure but try this link http://www.docstoc.com/docs/23942105/UNIX-ELF-File-Format
at page 8 it is shown where the entry point is if it is executable. Basically you need to calculate the offset and you got it.
Make sure to remember the little endianness of x86 ( i guess you use it) and reorder if you read bytewise edit: or maybe not i'm not quit sure about this to be honest.

Related

modify build-id in the notes section of the elf file

I need to modify a build-id in the notes section of the ELF file. I see there are plenty of tools to read elf but not to modify them. I found elfedit but it doesn't seem to do what I need. Is it even possible?
Here is the output of readelf
$ readelf -n myelffile
Displaying notes found in: .note.ABI-tag
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 3.14.0
Displaying notes found in: .note.gnu.build-id
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: d75a086c288c582036b0562908304bc3a8033235
I'm trying to modify .note.gnu.build-id section.
Is it even possible?
Yes. This is one of the easier modifications, since the data in the note is completely arbitrary, and no other data refer to it.
All you have to do is find the .note section, decode each note in turn until you find the one with NT_GNU_BUILD_ID type, and overwrite its data with same-length bytes of your choosing.
Are you aware of the linker --build-id 0x.... option which allows you to put in whatever hex data you desire at link time? If you can relink your binary, then you wouldn't need to modify the build-id note, as the linker will happily put your data there during the initial link.

what is the x86 opcode for assembly variables and constants?

i know that every instruction has opcodes.
i could find opcodes for mov , sub instructions.
but what is the opcode for variables and it's types.
we use assembler directives to define a variable and constant?
how they are represented in x86 opcodes?
nasm assembler x86: segment .bss
largest resb 2 ; reserves two bytes for largest
segment .data
number1 DW 12345 ; defines a constant number1
i tried online this https://defuse.ca/online-x86-assembler.htm#disassembly assembly to opcode conveter. but when i used nasm code to define a variable it shows error!
There's no opcode for variables. There are even no variables in machine code.
There is CPU and memory. Memory contains some values (bytes).
The CPU has cs:ip instruction pointer, pointing to memory address, where is the next instruction to execute, so it will read byte(s) from that address, and interpret them as opcode, and execute it as an instruction.
Whether you have in memory stored data or machine code doesn't matter, both are byte values.
What makes part of memory "data" or "variable" is the logical interpretation created by the running code, it's the code which does use certain part of memory only as "data/variables" and other part of memory as "code" (or eventually as both at the same time, like in this DOS 51B long COM code drawing Greece flag on screen, where the XLAT instruction is using the code opcodes also as source data for blue/white strips configuration).
Whether you write in your source:
x:
add al,al
or
x:
db 0x00, 0xC0
Doesn't matter, the resulting machine code is identical (in both cases the CPU will execute add al,al when pointed to that memory to be executed as instruction, and mov ax,[x] will set ax to 0xC000 in both cases, when used as "variable".
You may want to check listing file from the assembler (-l <listing_file_name> command line option for nasm) to see yourself there's no way to tell which bytes are code and which are data.
Assembler directives like segment, resb, or dw are not instructions and do not correspond to opcodes. That's why they are directives instead of instructions. Roughly speaking, there are two kinds of directives:
one kind of directive configures the assembler. For example, the segment directive configures the assembler to continue assembly in the section you provided.
another kind of directive emits data. For example, the dw directive emits the given datum into the object file. This can be used to place arbitrary data into memory for use with your program.

How to convert ELF file to binary file?

My understanding is that a binary file is the hex-codes of the instructions of the processor (can be loaded into memory & start executing from entry point) and a ELF file is the same with NO-Fixed memory addresses assigned for data etc...
Now, how can I convert ELF to binary?
How the conversion works? I mean how the memory addresses are assigned?
In general
An ELF file does not need to use "NO-Fixed memory addresses". In fact, the typical ELF executable file (ET_EXEC) is using a fixed address.
A binary file is usually understood as a file containing non-text data. In the context of programs, it is usually understood to mean the compiled form of the program (in opposition to the source form which is usually a bunch of text files). ELF file are binary files.
Now you might want to know how the ELF file is transformed into the in-memory-representation of the program: the ELF file contains additional information such as where in the program (virtual) address-space each segment of the program should be loaded, which dynamic-libraries should be loaded, how to link the main program and the dynamic libraries together, how to initialise the program, where is the entry point of the program, etc.
One important part of an executable or shared-object is the location of the segments which must be loaded into the program address space. You can look at them using readelf -l:
$ readelf -l /bin/bash
Elf file type is EXEC (Executable file)
Entry point 0x4205bc
There are 9 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x00000000000001f8 0x00000000000001f8 R E 8
INTERP 0x0000000000000238 0x0000000000400238 0x0000000000400238
0x000000000000001c 0x000000000000001c R 1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000f1a74 0x00000000000f1a74 R E 200000
LOAD 0x00000000000f1de0 0x00000000006f1de0 0x00000000006f1de0
0x0000000000009068 0x000000000000f298 RW 200000
DYNAMIC 0x00000000000f1df8 0x00000000006f1df8 0x00000000006f1df8
0x0000000000000200 0x0000000000000200 RW 8
NOTE 0x0000000000000254 0x0000000000400254 0x0000000000400254
0x0000000000000044 0x0000000000000044 R 4
GNU_EH_FRAME 0x00000000000d6af0 0x00000000004d6af0 0x00000000004d6af0
0x000000000000407c 0x000000000000407c R 4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 10
GNU_RELRO 0x00000000000f1de0 0x00000000006f1de0 0x00000000006f1de0
0x0000000000000220 0x0000000000000220 R 1
Each LOAD (PT_LOAD) entry describes a segment which must be loaded in the program address-space.
Reading and processing this information is the job of the ELF loaders: on your typical OS this is done in part by the kernel and in part by the dynamic-linker (ld.so, also called "program interpreter" in ELF parlance).
ARM plain binary files
(I don't really known about ARM stuff.)
You're apparently talking about embedded platforms. On ARM, a plain binary file contains the raw content of the initial memory of the program. It does not contain things such as string tables, symbol tables, relocation tables, debug informations but only the data of the (PT_LOAD) segments.
It is a binary file, not hex-encoded. The vhx files are hex-encoded.
Plain binary files can be generated from the ELF files with fromelf.
The basic idea here is that each PT_LOAD entry of a ELF file is dumped at its correct position in the file and remaining gaps (if any) between them are filled with zeros.
The ELF file already has addresses assigned in the p_vaddr field of each segment so this conversion process does not need to determine addresses: this has already been done by the link editor (and the linker script).
References
ARM ELF file format
I came here while searching for "convert .elf into binary file" (with arm files in mind, though).
It turned out that the easiest way in my case was to use
arm-none-eabi-objcopy -O binary kernel.elf kernel.bin
I do not understand what do you want to say but ELF(Executable Linkable format) is a new executable format. ofcourse its sections including .text need to be mapped in memory for execution. but if you want to convert ELF into binary check what is the difference between ELF files and bin files. some answers contain information how to change ELF into other binary format
in order to clear how ELF is loaded into memory check http://www.gsp.com/cgi-bin/man.cgi?topic=elf. if you have still some problems come with specific question.

Determining symbol addresses using binutils/readelf

I am working on a project where our verification test scripts need to locate symbol addresses within the build of software being tested. This might be used for setting breakpoints or reading static data from memory. What I am after is to create a map file containing symbol names, base address in memory, and size. Our build outputs an ELF file which has the information I want. I've been trying to use the readelf, nm, and objdump tools to try and to gain the symbol addresses I need.
I originally tried readelf -s file.elf and that seemed to access some symbols, particularly those which were written in assembler. However, many of the symbols that I wanted were not in there - specifically those that originated within our Ada code.
I used readelf --debug-dump file.elf to dump all debug information. From that I do see all symbols, including those that were in the Ada code. However, the format seems to be in the DWARF format. Does anyone know why these symbols would not be output by readelf when I ask it to list the symbolic information? Perhaps there is simply an option I am missing.
Now I could go to the trouble of writing a custom DWARF parser to get the information but if I can get it using one of the Binutils (nm, readelf, objdump) then I'd really like prefer a standard solution.
DWARF is the debug information and tries to reflect the relation of the original source code. Taking following code as an example
static int one() {
// something
return 1;
}
int main(int ac, char **av) {
return one();
}
After you compile it using gcc -O3 -g, the static function one will be inlined into main. So when you use readelf -s, you will never see the symbol one. However, when you use readelf --debug-dump, you can see one is a function which is inlined.
So, in this example, compiler does not prohibit you use optimization with -g, so you can still debug the executable. In that example, even the function is optimized and inlined, gdb still can use DWARF information to know the function and source/line from current code block inside inlined function.
Above is just a case of compiler optimization. There might be plenty of reasons that could lead to mismatch symbols address between readelf -s and DWARF.

Is the ELF .notes section really needed?

On Linux, I'm trying to strip a statically linked ELF file to the bare essentials. When I run:
strip --strip-unneeded foo
or
strip --strip-all foo
The resulting file still has a fat .notes section that appears to be full of funky strings.
Is the .notes section really needed or can I safely force it out with --remove-section?
Thanks for any help.
From experience and from looking at the man page for strip, it looks like strip isn't supposed to get rid of any and all sections and strings that aren't needed; just symbols. Quoth the man page:
GNU strip discards all symbols from object files objfile.
That being said, from experience, strip, even without --strip-all, removes sections unneeded for loading, such as .symtab and .strtab, and you can, as you note, remove sections you want it with --remove-section.
As an example of a .notes section, I took /bin/ls from my Ubuntu 11.10 64-bit box:
$ readelf -Wn /bin/ls
Notes at offset 0x00000254 with length 0x00000020:
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 2.6.15
Notes at offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 3e6f3159144281f709c3c5ffd41e376f53b47952
That encompasses the .note.ABI-tag section and the .note.gnu.build-id section. It looks like they contain data that isn't necessary to load the program, but also isn't standard, and isn't known by strip to not be necessary for the proper running of the program, since an ELF can have any number of additional "unknown" sections that aren't safe to remove. So rather using a virtual whitelist (which would fail miserably), it uses a blacklist of sections that it knows it can get rid of, and does so.
Short version: these sections don't seem to be standard and could be used for various things, so strip can't know it's safe to remove them. But based on the info inside the one I took above, if it's your own program, it's almost certainly safe to remove it.