How to obtain number of entries in ELF's symbol table? - elf

Consider standard hello world program in C compiled using GCC without any switches. As readelf -s says, it contains 64 symbols. It says also that .symtab section is 1024 bytes long. However each symbol table entry has 18 bytes, so how it is possible it contains 64 entries? It should be 56 entries. I'm constructing my own program which reads symbol table and it does not see those "missing" entries as it reads till section end. How readelf knows how long to read?

As one can see in elf.h, symbol entry structure looks like that:
typedef struct elf32_sym {
Elf32_Word st_name;
Elf32_Addr st_value;
Elf32_Word st_size;
unsigned char st_info;
unsigned char st_other;
Elf32_Half st_shndx;
} Elf32_Sym;
Elf32_Word and Elf32_Addr are 32 bit values, `Elf32_Half' is 16 bit, chars are 8 bit. That means that size of structure is 16 not 18 bytes. Therefore 1024 bytes long section gives exactly 64 entries.

The entries are aligned to each other and padded with blanks, therefore the size mismatch. Check out this mailthread for a similar discussion.
As for your code, I suggest to check out the source for readelf, especially the function process_symbol_table() in binutils/readelf.c.

The file size of an ELF data type can differ from the size of its in-memory representation.
You can use the elf32_fsize() and elf64_fsize() functions in libelf to retrieve the file size of an ELF data type.

Related

What exactly is the size of an ELF symbol (both for 64 & 32 bit) & how do you parse it

According to oracles documentation on the ELF file format a 64 bit elf symbol is 30 bytes in size (8 + 1 + 1 + 4 + 8 + 8), However when i use readelf to print out the sections headers of an elf file, & then inspect the "EntSize" (entry size) member of the symbol table section header, it reads that the symbol entries are in fact only hex 0x18 (dec 24) in size.
I have attached a picture of readelfs output next to the oracle documentation. The highlighted characters under "SYMTAB" is the "EntSize" member.
As i am about to write an ELF parser i am curious as to which i should believe? the read value of the EntSize member or the documentation?
I have also attempted to look for an answer in this ELF documentation however it doesn't seem to go into any detail of the 64 bit ELF structures.
It should be noted that the ELF file i run readelf on, in the above picture, is a 64bit executable
EICLASS, the byte just after the ELF magic number, contains the "class" of the ELF file, with the value "2" (in hex of course) meaning a 64 bit class.
When the 32 bit standard was drafted there were competing popular 64 bit architectures. The 32 bit standard was a bit vague about the 64 bit standard as it was quite possible at that time to imagine multiple competing 64 bit standards
https://www.uclibc.org/docs/elf-64-gen.pdf
should cover the 64 bit standard with better attention to the 64 bit layouts.
The way you "parse" it is to read the bytes in the order described in the struct.
typedef struct {
Elf64_Word st_name;
unsigned char st_info;
unsigned char st_other;
Elf64_Half st_shndx;
Elf64_Addr st_value;
Elf64_Xword st_size;
} Elf64_Sym;
The first 8 bytes are a st_name, the next byte is a st_info, and so on. Of course, it is critical to know where the struct "starts" within the file, and the spec above should help with that.
"64" in this case means a "64 bit entry", byte means an 8 bit entry, and so on.
the Elf64_Sym has 8+1+1+8+8+8 bytes in it, or 34 bytes.

What GZip extra field subfields exist?

RFC 1952 (GZIP File Format Specification) section 2.3.1.1 reads:
2.3.1.1. Extra field
If the FLG.FEXTRA bit is set, an "extra field" is present in
the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2| LEN |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<email#hidden> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use. The following
IDs are currently defined:
SI1 SI2 Data
---------- ---------- ----
0x41 ('A') 0x70 ('P') Apollo file type information
LEN gives the length of the subfield data, excluding the 4
initial bytes.
Do any subfield types exist beyond the AP given in the RFC? A web search doesn't find a list; neither is there any mention on GZip's Wikipedia page, the GNU homepage, in the gzip source code, or on Stack Overflow.
As far as I know, there is no such registry being maintained. Jean-loup no longer works on gzip.
Here is one more subfield in use:
The BGZF format (which is gzip-conformant) developed for use in bioinformatics, uses the subfield type "BC", to indicate the size of the current block. This is used to make parallel decompression easy.
From the specification at http://samtools.github.io/hts-specs/SAMv1.pdf :
Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of
payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer
gives the size of the containing BGZF block minus one.

how hex file is converting into binary in microcontroller

I am new to embedded programming. I am using a compiler to convert source code into hex and I will burn into microcontroller. My question is: microntroller (all ICs) will support binary numbers only (0 & 1). Then how it is working with hex file?
the software that loads the program/data into the flash reads whatever format it support which may be intel hex, motorola srecord, elf, coff, or a raw binary or other. and then do the right thing to program the flash with just the relevant ones and zeros.
First of all, the PC you are using right now has a processor inside, which works just like any other microcontroller. You are using it to browse the internet, although it's all "1s and 0s on the inside". And I am presuming your actual firmware doesn't come even close to running what your PC is running at this moment.
microntroller will support binary numbers only (0 & 1)
Your idea that "microntroller only supports binary numbers (0 & 1)" is a misconception. At it's very low level, yes, microcontroller contains a bunch of transistors, and each of them can store only two states of information (a bit).
But the reason for this is simply because this is a practical way to physically store one small chunk of data.
If you check the assembly instruction manual for your uC architecture, you will see a large number of instructions operating on different data widths (bits grouped into 8, 16 or larger chunks). If your controller is, say, 16-bit, then this will the basic word size for most instructions, and the one that will be the most efficient. When programming in C, this will also be the size of the "special" int type which all smaller integral types get expanded to.
In other words, bits are just building blocks of your hardware, and most of the time shouldn't even concern you at the firmware level, let alone higher application levels. Compare it to a human life form: human body is made of cells, but is also capable of doing more than a single-cell organism, isn't it?
i am using compiler to convert source code into hex
Actually, you are using the compiler to create the machine code for your particular microcontroller architecture. "Hex", or more precisely Intel Hex file format, is just one of several file formats used for storing the machine code into a file, and it's by convenience a plain-text ASCII file which you can easily open in Notepad.
To clarify, let's say you wrote a simple line of C code like this:
a = b + c;
Your compiler needs to know which architecture you are targeting, in order to convert this to machine code. For a fictional uC architecture, this will first get compiled to the following fictional assembly language:
// compiler decides that a,b,c will be stored at addresses 0x1000, 1004, 1008
mov ax, (0x1004) // move value from address 0x1004 to accumulator
add ax, (0x1008) // add value from address 0x1008 to accumulator
mov (0x1000), ax // move value from accumulator to address 0x1000
Each of these instructions has its own instruction opcode, which can be found inside the assembly instruction manual. If the instruction operates on one or more parameters, uC will know that the bytes following the instruction are data bytes:
// mov ax, (addr) --> opcode 0x10
// add ax, (addr) --> opcode 0x20
// mov (addr), ax --> opcode 0x30
mov ax, (0x1004) // 0x10 (0x10 0x04)
add ax, (0x1008) // 0x20 (0x10 0x08)
mov (0x1000), ax // 0x30 (0x10 0x00)
Now you've got your machine-code, which, written as hex values, becomes:
10 10 04 20 10 08 30 10 00
And converted to binary becomes:
0001000000010000000010000100000...
To transfer this to your controller, you will use a file format which your flash uploader knows how to read, which is what Intel Hex is most commonly used for.
Once transferred to your microcontroller, it will be stored as a bunch of bits in its flash memory, but the controller is designed to read these bits in chunks of 8 or more bits, and evaluate them as instruction opcodes or data, depending on the context. For the example above, it will read first 8 bits, and seeing that it's an instruction opcode 0x10 (which takes an additional address parameter), it will read the next two bytes to form the address 0x1004. It will then execute the instruction and advance the instruction pointer.
Hex, Decimal, Binary, they are all just ways of representing a number.
AA in hex is the same as 170 in decimal and 10101010 in binary (and 252 or Octal).
The reason the hex representation is used is because it is very convenient when working with microcontrollers as one hex character fits into 1 nibble. Hence F is 1111, FF is 1111 1111 and so fourth.

Redis int representation of a string is bigger when the string is more than 7 bytes but smaller otherwise

I'm trying to reduce Redis's objects size as much as I can and I've taken this whole week to experiment with it.
While testing different data representations I found out that an int representation of the string "hello" results in a smaller object. It may not look like much, but if you have a lot of data it can make a difference between using a few GB memory vs dozens of it.
Look at the following example (you can try it yourself if you want):
> SET test:1 "hello"
> debug object test:1
> Value at:0xb6c9f380 refcount:1 encoding:raw serializedlength:6 lru:9535350 lru_seconds_idle:7
In particular you should look at serializedlength which is 6 (bytes) in this case.
Now, look at the following int representation of it:
> SET test:2 "857715"
> debug object test:2
> Value at:0xb6c9f460 refcount:1 encoding:int serializedlength:5 lru:9535401 lru_seconds_idle:2
As you see, it results in a byte shorter object (note also encoding:int which I think is suggesting that ints get handled in a more efficient way).
With the string "hello w" (you'll see in a few moments why I didn't use "hello world" instead) we get an even bigger saving when it's represented as an int:
> SET test:3 "hello w"
> SET test:4 "857715023" <- Int representation. Notice that I inserted a "0", if I don't, it results in a bigger object and the encoding is set to "raw" instead (after all a space is not an int).
>
> debug object test:3
> Value at:0xb6c9f3a0 refcount:1 encoding:raw serializedlength:8 lru:9535788 lru_seconds_idle:6
> debug object test:4
> Value at:0xb6c9f380 refcount:1 encoding:int serializedlength:5 lru:9535809 lru_seconds_idle:5
It looks cool as long as you don't exceed 7 bytes string.. Look at what happens by a "hello wo" int representation:
> SET test:5 "hello wo"
> SET test:6 "85771502315"
>
> debug object test:5
> Value at:0xb6c9f430 refcount:1 encoding:raw serializedlength:9 lru:9535907 lru_seconds_idle:9
> debug object test:6
> Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:5
As you can see the int (12 bytes) is bigger than the string representation (9 bytes).
My question here is, what's going on behind the scenes when you represent a string as an int, that it is smaller until you reach 7 bytes?
Is there a way to increase this limit as you do with "list-max-ziplist-entries/list-max-ziplist-value" or a clever way to optimize this process so that it always (or nearly) results in a smaller object than a string?
UPDATE
I've further experimented with other tricks, and you can actually have smaller ints than string, regardless of its size, but that would involve a little more work as of data structure modelling.
I've found out that if you split the int representation of a string in chunks of ~8 numbers each, it ends up being smaller.
Take as an example the word "Hello World Hi Universe" and create both a string and int SET:
> HMSET test:7 "Hello" "World" "Hi" "Universe"
> HMSET test:8 "74111114" "221417113" "78" "2013821417184"
The results are as follows:
> debug object test:7
> Value at:0x7d12d600 refcount:1 encoding:ziplist serializedlength:40 lru:9567096 lru_seconds_idle:296
>
> debug object test:8
> Value at:0x7c17d240 refcount:1 encoding:ziplist serializedlength:37 lru:9567531 lru_seconds_idle:2
As you can see we got the int set smaller by 3 bytes.
The problem in this will be how to organize such a thing, but it shows that it's possible nonetheless.
Still, don't know where this limit is set. The ~700K persistent use of memory (even when you have no data inside) makes me think that there is a pre-defined "pool" dedicated to the optimization of int sets.
UPDATE2
I think I've found where this intset "pool" is defined in Redis source.
At line 81 in the file redis.h there is the def REDIS_SHARED_INTEGERS set to 10000
REDISH_SHARED_INTEGERS
I suspect it's the one defining the limit of an intset byte length.
I have to try to recompile it with an higher value and see if I can use a longer int value (it'll most probably allocate more memory if it's the one I think of).
UPDATE3
I want to thank Antirez for the reply! Didn't expect that.
As he made me notice, len != memory usage.
I got further in my experiment and saw that the objects get already slightly compressed (serialized). I may have missed something from the Redis documentation.
The confirmation comes from analyzing a Redis key wih the command redis-memory-for-key key, which actually returns the memory usage and not the serialized length.
For example, let's take the "hello" string and int we used before, and see what's the result:
~ # redis-memory-for-key test:1
Key "test:1"
Bytes 101
Type string
~ #
~ # redis-memory-for-key test:2
Key "test:2"
Bytes 87
Type string
As you can notice the intset is smaller (87 bytes) than the string (101 bytes) anyway.
UPDATE4
Surprisingly a longer intset seems to affect its serializedlength but not memory usage..
This makes it possible to actually build a 2digit-char mapping while it still being more memory efficient than a string, without even chunking it.
By 2digit-char mapping I mean that instead of mapping "hello" to "85121215" we map it to digits with a fixed length of 2 each, prefixing it with "0" if digit < 10 like "0805121215".
A custom script would then proceed by taking every two digit apart and converting them to their equivalent char:
08 05 12 12 15
\ | | | /
h e l l o
This is enough to avoid disambiguation (like "o" and "ae" which both result in the digit "15").
I'll show you this works by creating another set and therefore analyzing its memory usage like I did before:
> SET test:9 "0805070715"
Unix shell
----------
~ # redis-memory-for-key test:9
Key "test:9"
Bytes 87
Type string
You can see that we have a memory win here.
The same "hello" string compressed with Smaz for comparison:
>>> smaz.compress('hello')
'\x10\x98\x06'
// test:10 would be unfair as it results in a byte longer object
SET post:1 "\x10\x98\x06"
~ # redis-memory-for-key post:1
Key "post:1"
Bytes 99
Type string
My question here is, what's going on behind the scenes when you represent a
string as an int, that it is smaller until you reach 7 bytes?
Notice that the integer you supplied as test #6 is no longer actually encoded
as an integer, but as raw:
SET test:6 "85771502315"
Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:
So we see that a "raw" value occupies one byte plus the length of its string representation. In memory
you get that plus the overhead of the value.
The integer encoding, I suspect, encodes a number as a 32-bit integer; then it will always
need five bytes, one to tell its type, and four to store those 32 bits.
As soon as you overflow the maximum representable integer in 32 bits, which is either 2 billions or 4 depending on whether you use a sign or not, you need to revert to raw encoding.
So probably
2147483647 -> five bytes (TYPE_INT 0x7F 0xFF 0xFF 0xFF)
2147483649 -> eleven bytes (TYPE_RAW '2' '1' '4' '7' '4' '8' '3' '6' '4' '9')
Now, how can you squeeze a string representation PROVIDED THAT YOU ONLY USE AN ASCII SET?
You can get the string (140 characters):
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and convert each character to a six-bit representation; basically its index in the string
"ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."
which is the set of all the characters you can use.
You can now encode four such "text-only characters" in three "binary characters", a sort of "reverse base 64 encoding"; base64 encoding will get three binary characters and create a four-byte sequence of ASCII characters.
If we were to code it as groups of integers, we would save a few bytes - maybe get it down
to 130 bytes - at the cost of a larger overhead.
With this type of "reverse base64" encoding, we can get 140 character to 35 groups of four characters, which become a string of 35x3 = 105 binary characters, raw encoded to 106 bytes.
As long, I repeat, as you never use characters outside the range above. If you do, you can
enlarge the range to 128 characters and 7 bits, thus saving 12.5% instead of 25%; 140 characters will then become 126, raw encoded to 127 bytes, and you save (141-127) = 14 bytes.
Compression
If you have much longer strings, you can compress them (i.e., you use a function such as deflate() or gzencode() or gzcompress() ). Either straight; in which case the above string becomes 123 bytes. Easy to do.
Compressing many small strings: the Rube Goldberg approach
Since compression algorithms learn, and at the beginning they dare assume nothing, small strings will not compress greatly. They're "all beginning", so to speak. Just as an engine, when running cold the performances are inferior.
If you have a "corpus" of text these strings come from, you can use a time-consuming trick that "warms up" the compression engine and may double (or better) its performances.
Suppose you have two strings, COMMON and TARGET (the second one is the one you're interested in). If you z-compressed COMMON you would get, say, ZCMN. If you compressed TARGET you would get ZTRGT.
But as I said, since the gz compression algorithm is stream oriented, and it learns as it goes by, the compression ratio of the second half of any text (provided there aren't freakish statistical distribution changes between halves) is always appreciably higher than that of the first half.
So if you were to compress, say, COMMONTARGET, you'd get ZCMGHQI.
Notice that the first part of the string, as far as almost the end, is the same as before. Indeed if you compressed COMMONFOOBAR, you'd get something like ZCMQKL. And the second part is compressed better than before, even if we count the area of overlap as belonging entirely to the second string.
And this is the trick. Given a family of strings (TARGET, FOOBAR, CASTLE BRAVO), we compress not the strings, but the concatenation of those strings with a large prefix. Then we discard from the result the common compressed prefix. Thus TARGET is taken from the compression of COMMONTARGET (which is ZCMGHQI), and becomes GHQI instead of ZTRGT, with a 20% gain.
The decoder does the reverse: given GHQI, it first applies the common compressed prefix ZCM (which it must know); then it decodes the result, and finally discards the common uncompressed prefix, of which it need only know the length beforehand.
So the first sentence above (140 characters) becomes 123 when compressed by itself; if I take the rest of the Declaration and use it as a prefix, it compresses to 3355 bytes. This prefix plus my 140 bytes becomes 3409 bytes, of which 3352 are common, leaving 57 bytes.
At the cost of storing once the uncompressed prefix in the encoder, and the compressed prefix once in the decoder, and the whole thingamajig running five times as slow, I can now get those 140 bytes down to 57 instead of 123 - less than half of before.
This trick works great for small strings; for larger ones, the advantage isn't worth the pain. Also, different prefixes yield different results. The best prefixes are those that contain most of the sequences that are likely to appear in the string pool, ordered by increasing length.
Added bonus: the compressed prefix also doubles as a sort of weak encryption, as without that, you can't easily decode the compressed strings, even if you might be able to recover some pieces thereof.

GMP variable's bit size

In GMP library,
_mp_size holds the number of limbs of an integer..
we can create integers of size
1 limb(32bits),2 limbs(64bits),3 limbs(96bits)...so on. using mpz_init or mpz_random functions..
cant we create an integer variable of size 8bit or 16 bit.. other than multiples of 32 bit size ???
can you code for that??
thank you ..
The GNU GMP library is for numbers that exceed the ranges provided by the standard C types. Use (unsigned) char or (unsigned) short for 8 and 16 bit integers, respectively.
This would be of limited utility, because most modern processors use at least a 32-bit word size.
I don't think you can. Here's an excerpt from a discussion at http://gmplib.org/list-archives/gmp-discuss/2004-June/001200.html:
The limb size is compiled into the
library, and is determined from the
available types of tghe [sic] processor and
the host environment.