Locating ELF shared library exports at runtime

Is it possible to extract the exported symbols of a loaded shared library using only its memory image?
I'm talking about the symbols listed in .dynsym section. As I understand, we can go this way:
Locate the base address of the library. For example, by reading /proc/<pid>/maps we can find the memory areas mapped from the library file on disk, and then look for the ELF magic bytes to find the ELF header, which gives us the base address.
Find the PT_DYNAMIC segment from the program headers. Parse the ELF header, then iterate over the program headers to find the segment which contains the .dynamic section.
Extract the location of the dynamic symbol table. Iterate over the ElfN_Dyn structs to find the ones whose d_tag is DT_STRTAB or DT_SYMTAB. These give us the addresses of the string table (which holds the symbol names) and of the dynamic symbol table itself.
And this is where I stumbled. The .dynamic section has a tag for the size of the string table (DT_STRSZ), but there is no indication of the symbol table's size: it only records the size of a single entry (DT_SYMENT). How can I retrieve the number of entries in the symbol table?
It should be possible to infer that from the size of the .dynsym section, but ELF files are mapped into memory as segments. The section header table is not required to be loaded into memory and can only be (reliably) accessed by reading the corresponding file.
I believe it must be possible, because the dynamic linker has to know the size of the symbol table. Then again, the loader may have stored it somewhere when the file was loaded, and the linker may just be using that cached value. Though it seems somewhat silly to load the symbol table into memory but not a handful of bytes with its size alongside it.
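The walk in step 3 can be sketched on a synthetic PT_DYNAMIC payload. The d_tag constants below are the real values from <elf.h>; the addresses are made-up placeholders, since this is a sketch of the parsing logic, not of reading a live process:

```python
import struct

# d_tag values from <elf.h>
DT_NULL, DT_STRTAB, DT_SYMTAB, DT_STRSZ, DT_SYMENT = 0, 5, 6, 10, 11

# A synthetic Elf64 PT_DYNAMIC payload: each Elf64_Dyn entry is two
# little-endian 64-bit words (d_tag, d_un), terminated by DT_NULL.
dynamic = struct.pack(
    "<10Q",
    DT_SYMTAB, 0x1000,   # hypothetical address of .dynsym
    DT_STRTAB, 0x2000,   # hypothetical address of .dynstr
    DT_STRSZ,  0x180,    # size of the string table in bytes
    DT_SYMENT, 24,       # sizeof(Elf64_Sym)
    DT_NULL,   0,
)

def parse_dynamic(blob):
    """Collect d_tag -> d_un pairs until DT_NULL."""
    tags = {}
    for off in range(0, len(blob), 16):
        d_tag, d_un = struct.unpack_from("<2Q", blob, off)
        if d_tag == DT_NULL:
            break
        tags[d_tag] = d_un
    return tags

tags = parse_dynamic(dynamic)
print(hex(tags[DT_SYMTAB]), hex(tags[DT_STRTAB]))  # where .dynsym/.dynstr live
```

In a real process you would read the entries at the address given by the PT_DYNAMIC program header instead of a hand-built buffer, but the tag scan is the same.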

The size of the dynamic symbol table must be inferred from the symbol hash table (DT_HASH or DT_GNU_HASH): this answer gives some code which does that.
The standard hash table (which is not used on GNU systems anymore) is quite simple. Its header consists of two 32-bit words, nbucket and nchain, and the spec says of the second one:
The number of symbol table entries should equal nchain
The GNU hash table is more complicated.
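Both derivations can be sketched against hand-built tables. The layouts below are the documented ones; the bucket and chain contents are dummies chosen only to make the walk terminate correctly:

```python
import struct

def dynsym_count_from_hash(blob):
    # DT_HASH layout: nbucket, nchain, bucket[nbucket], chain[nchain],
    # all 32-bit words; nchain equals the number of .dynsym entries.
    _nbucket, nchain = struct.unpack_from("<2I", blob, 0)
    return nchain

def dynsym_count_from_gnu_hash(blob):
    # DT_GNU_HASH layout (ELF64): nbuckets, symoffset, bloom_size,
    # bloom_shift (32-bit each), bloom[bloom_size] (64-bit words),
    # buckets[nbuckets] (32-bit), then one 32-bit chain word per symbol
    # from symoffset onward; bit 0 marks the end of a bucket's chain.
    nbuckets, symoffset, bloom_size, _shift = struct.unpack_from("<4I", blob, 0)
    bucket_off = 16 + bloom_size * 8
    buckets = struct.unpack_from("<%dI" % nbuckets, blob, bucket_off)
    last = max(buckets)
    if last < symoffset:          # no hashed symbols at all
        return symoffset
    chain_off = bucket_off + nbuckets * 4
    while True:                   # walk the highest bucket's chain to its end
        h, = struct.unpack_from("<I", blob, chain_off + (last - symoffset) * 4)
        last += 1
        if h & 1:
            return last

# Hand-built DT_HASH table for 7 symbols (bucket/chain values are dummies).
sysv = struct.pack("<2I", 3, 7) + struct.pack("<3I", 1, 2, 0) \
     + struct.pack("<7I", 0, 3, 4, 0, 5, 6, 0)

# Hand-built DT_GNU_HASH table: 1 bucket starting at symbol 1, then 6
# chain words for symbols 1..6, the last with its end-of-chain bit set.
gnu = struct.pack("<4I", 1, 1, 1, 0) + struct.pack("<Q", 0) \
    + struct.pack("<I", 1) + struct.pack("<6I", 2, 2, 2, 2, 2, 3)

print(dynsym_count_from_hash(sysv), dynsym_count_from_gnu_hash(gnu))  # 7 7
```

Note the asymmetry: DT_HASH states the count directly, while DT_GNU_HASH only bounds it, so you must walk the chain of the highest-numbered bucket until the terminator bit.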

Related

Some questions about ELF file format

I am trying to learn how ELF files are structured and probably how to make one manually.
I am working on aarch64 Linux OS, the ELF files I am inspecting are of elf64-littleaarch64 format.
Also, I am trying to learn on my own; however, I got stuck on some questions...
When I do xxd code, the first number in each line of the output specifies the offset of the bytes in the file. But when I do objdump -D code, the first number is something like 4000b0, which corresponds to 000000b0 in xxd. Why is there a four at the beginning?
In objdump, the bytecode is for example 11000a94, which 'means' add w20, w20, #2 in assembly. I know that 11 is the opcode, but what does 000a94 mean? I thought it should be the operands, but I am adding the value 2 and can't find the number 2 in it.
If you have a good article to read, or can help me explain this, I will be very grateful!
xxd shows the offset of the bytes within the file on disk. objdump -D shows (tentatively) the address in memory where those bytes will be loaded when the program is run. It is common for them to differ by a round number. In particular, 0x400000 may correspond to one higher-level page table entry; see Why Linux/gnu linker chose address 0x400000? which is for x86-64, but I think ARM64 is similar (haven't checked). It doesn't have anything to do with the fact that 0x40 is ASCII @; that's just a coincidence.
Note that if ASLR is in use, the actual memory address will be randomly chosen every time the program is run, and will not match what objdump shows you, though the difference will still be a multiple of the page size.
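A side note on the second question (why the 2 is not visible in 11000a94): AArch64 instruction fields are not byte-aligned, so they do not line up with hex digits. A sketch of pulling the fields out of the ADD (immediate) encoding, based on the documented ARMv8 bit layout:

```python
insn = 0x11000A94            # objdump showed: add w20, w20, #2

rd    =  insn        & 0x1F  # bits  4..0  : destination register
rn    = (insn >> 5)  & 0x1F  # bits  9..5  : first source register
imm12 = (insn >> 10) & 0xFFF # bits 21..10 : the immediate operand
opc   =  insn >> 23          # bits 31..23 : sf/op/S + "100010" = ADD immediate

print(rd, rn, imm12)         # 20 20 2  ->  add w20, w20, #2
```

The immediate 2 sits in bits 21..10, straddling hex-digit boundaries, which is why it cannot be spotted in the raw hex word.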
Well, I was too quick asking this question, but now I will answer it too.
40 at the beginning of the addresses in objdump is the hex representation of the char "@", which means "at" and points to an address, very simple!
Little Endian has CPU addresses stored in 5 bits instead of 6 or 8. That means, that I should look for the binary value of the objdump code: 11000a94 --> 10001000000000000101010010100, where it can be divided into [10001][00000000000010][10100][10100] with [opcode][value][first address][second address]
Both answers are wrong, see the accepted answer.
I will still leave them here, though.

Minimal PDF size according to specs

I'm reading PDF specs and I have a few questions about the structure it has.
First of all, the file signature is %PDF-n.m (8 bytes).
After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
After that, there should be a body, an xref table, a trailer and a %%EOF.
What would be the minimal file size of a PDF, assuming there isn't anything at all (no objects whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
Third and last question: if there were more than one body+xref+trailer section, where would the offset just before the %%EOF be pointing to? The first or the last xref table?
First of all, the file signature is %PDF-n.m (8 bytes). After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I was trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
Which docs do you have? The PDF specification ISO 32000-1 says:
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater.
Thus, those at least 4 bytes of binary data are not immediately following the file signature without any structure but they are on a comment line! This implies that they are
preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyway) and
followed by an end-of-line, i.e. CR, LF, or CR LF.
So it is easy to recognize while parsing. In particular it merely is a special case of a comment line and nothing to treat specially.
(sigh, I just saw you and @Jongware cleared that up in comments while I wrote this...)
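For a parser this just means treating the binary line like any other comment: read to the end of the line and discard. A minimal sketch, following the spec's CR / LF / CR LF end-of-line rule (the sample bytes are made up):

```python
def skip_comments(data, pos):
    # Skip any run of comment lines starting at pos; a comment runs
    # from '%' to the next CR, LF, or CR LF.
    while pos < len(data) and data[pos:pos + 1] == b"%":
        while pos < len(data) and data[pos] not in (0x0D, 0x0A):
            pos += 1
        if data[pos:pos + 2] == b"\r\n":
            pos += 2
        elif pos < len(data):
            pos += 1
    return pos

# A binary comment line (4 bytes with codes >= 128) followed by real content.
sample = b"%\xe2\xe3\xcf\xd3\r\n1 0 obj"
print(skip_comments(sample, 0))  # 7, the index of "1 0 obj"
```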
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?
As you commented you indeed mean a minimal valid PDF.
Please have a look at the question What is the smallest possible valid PDF? on Stack Overflow; there are some attempts to create minimal PDFs adhering more or less strictly to the specification. Reading e.g. @plinth's answer, you will see stuff that is not quite PDF anymore but is still accepted by Adobe Reader.
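To get a feel for the sizes involved, here is a sketch that assembles a small PDF with just the required catalog, pages and page objects plus a correctly computed xref table. This aims at the structure the spec requires, not at the absolute minimum byte count, and the object contents are only what common readers accept:

```python
def build_minimal_pdf():
    header = b"%PDF-1.4\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
        b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n",
    ]
    out, offsets = header, []
    for obj in objects:                 # record each object's byte offset
        offsets.append(len(out))
        out += obj
    startxref = len(out)                # where the xref table begins
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"     # entry for the free object 0
    for off in offsets:                 # 20-byte in-use entries
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % startxref
    return out

pdf = build_minimal_pdf()
print(len(pdf))  # a few hundred bytes
```

The point of computing the offsets programmatically is that the xref entries and the startxref value must be exact byte positions, which is what makes hand-writing minimal PDFs fiddly.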
Third and last question: If there were more than one body+xref+trailer sections, where would be offset just before the %%EOF be pointing to?
Normally it would be the last cross reference table/stream as the usual use case is
you start with a PDF which has but one cross reference section;
you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
etc...
The exception is the case of linearized documents in which the offset before the %%EOF points to the initial cross references which in turn point to the section at the end of the file as previous. For details cf. Annex F of ISO 32000-1.
And as you can of course apply incremental updates to a linearized document, you can have mixed forms.
In general it is best for a parser to be able to parse any order of partial cross references. And don't forget, there are not only cross reference sections but also alternatively cross reference streams.

ELF segments mem size vs. file size

I have read a couple of ELF specification documents but haven't found answers for the below questions yet
1) When segment memory size is greater than segment file size, should the ELF segment downloader fill the segment in memory with zeros as specified by memsize?
2) Can there be a case where a section should be filled with a constant other than zero, i.e. a general case "constant fill" section?
3) What is the right way to identify a .const segment in an ELF executable file?
The per-section flags value does not carry such information and seems to be limited. I have seen implementations of ELF segment downloaders where they don't download segments with a file size of zero at all.
Thanks!
It's a long overdue answer, but anyway...
When segment memory size is greater than segment file size, should the ELF segment downloader fill the segment in memory with zeros as specified by memsize?
==> I think so. Some sections like .bss (uninitialized data) take no space in the ELF file but must occupy space in memory when the file is loaded. The loader zero-fills that region, which is how the C runtime can guarantee such data is zero before main() is entered.
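The loader's handling of p_filesz < p_memsz can be sketched like this (a simplified model of one segment; a real loader works with mapped pages rather than byte strings):

```python
def load_segment(file_bytes, p_offset, p_filesz, p_memsz):
    # Allocate memsz bytes (bytearray() is zero-initialized), then copy
    # the filesz bytes that exist in the file; the tail (e.g. .bss)
    # simply stays zero.
    image = bytearray(p_memsz)
    image[:p_filesz] = file_bytes[p_offset:p_offset + p_filesz]
    return bytes(image)

seg = load_segment(b"\x7fELF...data\x01\x02", p_offset=7, p_filesz=6, p_memsz=10)
print(seg)  # b'data\x01\x02\x00\x00\x00\x00'
```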
Can there be a case where a section should be filled with a constant other than zero, i.e. a general case "constant fill" section?
==> Yes, a fill pattern can be specified, but at link time rather than in the ELF program headers: the GNU linker script FILL(expression) directive (or the =fillexp attribute on an output section) sets the pattern used for gaps within a section. Note that this only affects bytes actually present in the file; the region beyond filesz up to memsz is always zero-filled by the loader.
What is the right way to identify a .const segment in elf executable file?
==> I think you could place such data in a named section, e.g.
unsigned int __attribute__((section(".const"))) data[0x1000];
and then locate that section by name through the section header string table, keeping in mind that section headers are not loaded at runtime and may even be stripped from the file.

String table in ELF

I get some symbol and I get (a hexdump of) an ELF file. How can I know in which section this symbol appears?
What is the difference between .strtab and .shstrtab? Is there another array of symbol strings?
When I get an index for the symbol names table, is it an index in .strtab or in .shstrtab?
For the first question, we would need the hex dump of the ELF file to answer properly.
For the second question -
strtab stands for String Table
shstrtab stands for Section Header String Table.
When we read the ELF header, we see that the ElfN_Ehdr structure contains a member called e_shstrndx. This is the index, within the section header table, of the .shstrtab section; each section header in turn has an sh_name member which is an offset into .shstrtab giving that section's name.
strtab is the string table for all other references. When you read symbols from an ELF object, every symbol structure (Elf32_Sym) has a member called st_name. This is an offset into strtab where the string name of that symbol is found.
Can you please elaborate more on array of symbol strings? Also, what do you mean by names table?
You can refer to the following link -
Reading ELF String Table on Linux from C
Hope this answers your question.
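The st_name lookup described above is just scanning for the terminating NUL from the given offset. A sketch with a hand-built string table (the names are made up; a real .strtab likewise begins with a NUL byte so that offset 0 means "no name"):

```python
strtab = b"\x00main\x00printf\x00_init\x00"

def symbol_name(strtab, st_name):
    # st_name is a byte offset into the table; the name runs to the next NUL
    end = strtab.index(b"\x00", st_name)
    return strtab[st_name:end].decode("ascii")

print(symbol_name(strtab, 1), symbol_name(strtab, 6))  # main printf
```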
I will take a stab at the first question since Samir answered the second one so well.
The symbol's name will be in one of the STRTAB sections, and then there will be an entry in the symbol table (one of the SYMTAB or DYNSYM sections) which references that string by an offset in the containing section. The entry in the symbol table can tell you the index of the section this symbol is found in, but not where it is used.
For that you need to check the relocation table, contained in sections of type REL; common names include .rel.dyn and .rel.plt. A relocation table lists all the references to symbols within one other section, i.e. code and relocation sections are paired. Each entry in the table is one "usage" of a symbol and contains the offset in the corresponding section where the usage occurs, plus the index of the symbol in the symbol table.
If you can use the readelf utility, you can easily use readelf -r <binary> | grep <symbol name> to get all the references to a symbol.
If you are set on using hexedit/cannot use readelf, then you would need to
Find the offset of the symbol name string in the binary, determine which section that is in, and then compute the offset of the string within that section;
Look through all the entries in the symbol table and find which one(s) match that name (st_name == offset of string in the string section);
Look through all entries in each relocation table to find usages of that symbol in the section corresponding to that table. The r_info field of each entry contains the index of the symbol table entry it references (the index occupies part of r_info, at different bit positions for 32- and 64-bit ELF).
All relocation entries matching that symbol table index are usages of your string somewhere.
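The r_info bit layout mentioned above looks like this; these are the ELF32_R_SYM / ELF64_R_SYM (and _R_TYPE) macros from <elf.h>, transcribed to Python:

```python
def elf32_r_sym(r_info):  return r_info >> 8           # high 24 bits
def elf32_r_type(r_info): return r_info & 0xFF         # low 8 bits
def elf64_r_sym(r_info):  return r_info >> 32          # high 32 bits
def elf64_r_type(r_info): return r_info & 0xFFFFFFFF   # low 32 bits

# e.g. a 64-bit relocation entry referencing symbol 5 with type 7
r_info = (5 << 32) | 7
print(elf64_r_sym(r_info), elf64_r_type(r_info))  # 5 7
```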
More info:
Symbol table: https://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-79797.html
Relocation table: https://docs.oracle.com/cd/E19683-01/816-1386/6m7qcoblj/index.html#chapter6-54839

dll files compared to gzip files

Okay, the title isn't very clear.
Given a byte array (read from a database blob) that represents EITHER the sequence of bytes contained in a .dll or the sequence of bytes representing the gzip'd version of that dll, is there a (relatively) simple signature that I can look for to differentiate between the two?
I'm trying to puzzle this out on my own, but I've discovered I can save a lot of time by asking for help. Thanks in advance.
Check if its first two bytes are the gzip magic number 0x1f 0x8b (see RFC 1952). Or just try to gunzip it; the operation will fail if the DLL is not gzip'd.
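Both checks can be sketched like this; the "DLL" below is just a stand-in byte string (real PE/DLL files begin with the two bytes "MZ"):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def looks_gzipped(blob):
    return blob[:2] == GZIP_MAGIC

def ungzip_if_needed(blob):
    # The magic check is cheap; decompression is the authoritative test,
    # since it raises on anything that is not really a gzip stream.
    if looks_gzipped(blob):
        try:
            return gzip.decompress(blob)
        except OSError:
            pass            # magic matched by accident; treat as raw
    return blob

raw = b"MZ" + b"\x00" * 62  # stand-in for DLL contents
packed = gzip.compress(raw)
print(looks_gzipped(raw), looks_gzipped(packed))  # False True
```

A two-byte magic check can in principle produce a false positive on arbitrary data, which is why falling back to the actual decompression is the safer test.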
A gzip file should be fairly straightforward to identify, as it consists of a header, a footer and some other distinguishable elements in between.
From Wikipedia:
"gzip" is often also used to refer to the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a time stamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
You might also try determining whether the gzip file contains any members, as each member also has its own header.
You can find specific information on this file format (specifically the member header which is linked) here.