What is the bit width of a single webassembly instruction? - serialization

I know that webassembly currently supports a 32 bit architecture, so I am supposing that, like RISCV32, that its base instruction set has instructions which are 32 bit wide (Of course, RISCV32 supports 16-bit compressed instructions and 48-bit ones as well). RISC-V's instructions are interpreted mostly as left-endian (in terms of bit indices).
For example, in RISC-V, we can have an instruction like lui (load upper-immediate to register), that embeds a 20-bit immediate into an instruction, has a 5-bit field to encode the desitination register, and a 7-bit format to specify the opcode. Among other things, the opcode contains two bits at the beginning that connote whether the instruction is compressed or not. This is encoded in the specification, where lui has an LUI opcode.:
RISC-V instructions have a variety of different layouts specified in the specification as well, and for example, the lui instruction takes the "U" format, so we know exactly where the 20-bit field is and where the 5-bit destination register is in the serialization:
What is the bit width of a wasm instruction? What are the possible layouts of a wasm instruction? Are there compressed instruction formats for webassembly, such as 16-bit instructions for very common operations?
If webassembly instructions are variable-width, how is the width of an instruction encoded for the interpreter?

Binary WASM bytecode has variable-length instruction, not fixed-width like a RISC CPU. https://en.wikipedia.org/wiki/WebAssembly#Code_representation has an example.
It's not intended to be executed directly, but rather JITed into native machine code, thus a fixed-width format that would require multiple instructions for some 32 or 64-bit constants would make more work for the JIT optimizer. And would be less compact in the WASM binary format, and more instructions to parse.
Much better for the JIT optimizer to know the ultimate goal is to materialize a whole constant, since some ISAs will be able to do that in one instruction, and others will need it split up in different parts depending on the ISA. e.g. 20:12 for RISC-V, 16:16 for ARM movw/movk or MIPS, or if the constant only has set bits in a narrow region, ARM rotated immediates can maybe still use one instruction. Or AArch64 bit-pattern immediates can materialize a constant like 0x01010101 (or 0x0101010101010101) in a single 32-bit instruction.
TL:DR: Don't make the JIT put the pieces back together before breaking back down into asm that works for the target machine.
And in general, variable-length isn't much of a problem for a stream that will be parsed once by software anyway, not decoded repeatedly by hardware every time through a loop.

Examples
A lot of webassembly instructions take up one byte. For example, the left shift instructions are i32.shl andi64.shl and take single byte opcodes 0x74 and 0x86 without any subsequent values, while the i32.const instruction for example starts with 0x41 and takes from 2 to 6 bytes.
Instruction
Opcode
i32.const
0x41
i64.const
0x42
f32.const
0x43
f64.const
0x44
-
-
i32.shl
0x74
i64.shl
0x86
-
-
i32.eqz
0x45
i32.eq
0x46
i64.eqz
0x50
i64.eq
0x51
And so on. The values here are taken from the MDN website. See the Numeric Instructions.
Encoding Numbers
Some instructions such as the const above require specifying the immediate, which increases the overall size of the instruction. The immediates are encoded in LEB128, and the variant depends on whether the integer is signed or unsigned. Those are normally given in the specification.
LEB128 is roughly this: bits are padded to a multiple of seven, split into groups and the last bit is used to determine whether the end is reached. Those numbers are constrained to their maximum width. Floating point numbers are encoded in IEE-754
The const instructions are followed by the respective literal.
All other numeric instructions are plain opcodes without any immediates.
Source: https://webassembly.github.io/spec/core/binary/instructions.html#numeric-instructions

Wasm instructions are represented with a unique opcode (typically 1 byte, more for newer instruction), followed by the encodings of immediate operands, for instructions that have them. There is no specific length, it depends on both the opcode and the immediate values.
For example:
i32.add is opcode 0x6A with no immediates;
i64.const i is opcode 0x42, followed by a variable-length encoding of i in LEB128 format;
br_table l* ld is opcode 0x0E, followed by a variable-length encoding of the length of l* in LEB128, followed by as many variable-length encodings of the label indices in l*, followed by the variable-length encoding of label index ld.
See the binary grammar in the specification for details. A Wasm decoder is essentially "parsing" the binary input according to this grammar.

Here are some citations from the current specification v2.0 related to the instructions (as "seen" by the specification itself):
some instructions also have static immediate arguments, typically
indices or type annotations, which are part of the instruction itself.
Some instructions are structured in that they bracket nested sequences of instructions.
In relation to the nesting:
Implementations typically impose additional restrictions on a number of aspects of a WebAssembly module or execution
Then, one of the noted implementation limitations is:
the nesting depth of structured control instructions
As the nesting depth of the instructions is not strictly defined by the specification, but its left to the implementation to choose, that means that there is no limit of the instructions length regardless are they encoded as binary or text, as per the specification.
Even if we ignore the structured instructions (as we should not), there are many instructions having vectors as arguments. The vectors length is limited to 2^32-1. If my memory serves me right, there was and an instruction having vector of vectors as an argument.

Related

Some questions about ELF file format

I am trying to learn how ELF files are structured and probably how to make one manually.
I am working on aarch64 Linux OS, the ELF files I am inspecting are of elf64-littleaarch64 format.
Also I try to learn by myself, however I got stuck with some questions...
When I do xxd code, the first number in each line of the output specifies the address of bytes in the file. But when objdump -D code, the first number is something like 4000b0, however corresponds to 000000b0 in xxd. Why is there a four at the beginning?
In objdump, the bytecode is for example 11000a94, which 'means'
add w20, w20, #2 in assembly. I know, that 11 is the opcode, but what does 000a94 mean? I thought, it should be the parameters, but I am adding the value 2 and can't find the number 2 in it.
If you have a good article to read, or can help me explain this, I will be very grateful!
xxd shows the offset of the bytes within the file on disk. objdump -D shows (tentatively) the address in memory where those bytes will be loaded when the program is run. It is common for them to differ by a round number. In particular, 0x400000 may correspond to one higher-level page table entry; see Why Linux/gnu linker chose address 0x400000? which is for x86-64 but I think ARM64 is similar (haven't checked). It doesn't have anything to do with the fact that 0x40 is ASCII #; that's just a coincidence.
Note that if ASLR is in use, the actual memory address will be randomly chosen every time the program is run, and will not match what objdump shows you, though the difference will still be a multiple of the page size.
Well, I was too fast asking this question, but now, I will answer it too.
40 at the beginning of the addresses in objdump is the hex representation of the char "#", which means "at" and points to an address, very simple!
Little Endian has CPU addresses stored in 5 bits instead of 6 or 8. That means, that I should look for the binary value of the objdump code: 11000a94 --> 10001000000000000101010010100, where it can be divided into [10001][00000000000010][10100][10100] with [opcode][value][first address][second address]
Both answers are wrong, see the accepted answer.
I will still let them here, though

How to manually construct a gzip so that compressed file is larger than original?

Suppose a 1KB file called data.bin, If it's possible to construct a gzip of it data.bin.gz, but much larger, how to do it?
How much larger could we theoretically get in GZIP format?
You can make it arbitrarily large. Take any gzip file and insert as many repetitions as you like of the five bytes: 00 00 00 ff ff after the gzip header and before the deflate data.
Summary:
With header fields/general structure: effect is unlimited unless it runs into software limitations
Empty blocks: unlimited effect by format specification
Uncompressed blocks: effect is limited to 6x
Compressed blocks: with apparent means, the maximum effect is estimated at 1.125x and is very hard to achieve
Take the gzip format (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) and play with it as much as you like.
Header
There are a whole bunch of places to exploit:
use optional fields (original file name, file comment, extra fields)
bluntly append garbage (GNU gzip will issue a warning when decompressing)
concatenate multiple gzip archives (the format allows that, the resulting uncompressed data is, likewise, the concatenation or all chunks).
An interesting side effect (a bug in GNU gzip, apparently): gzip -l takes the reported uncompressed size from the last chunk only (even if it's garbage) rather than adding up values from all. So you can make it look like the archive is (absurdly) larger/smaller than raw data.
These are the ones that are immediately apparent, you may be able to find yet other ways.
Data
The general layout of "deflate" format is (RFC1951):
A compressed data set consists of a series of blocks, corresponding to
successive blocks of input data. The block sizes are arbitrary,
except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
Full blocks
The 00 00 00 ff ff that Mark Adler suggests is essentially an empty, non-final block (RFC1951 section 3.2.3. for the 1st byte, 3.2.4. for the uncompressed block itself).
Btw, according to gzip overview at the official site and the source code, Mark is the author of the decompression part...
Uncompressed blocks
Using non-empty uncompressed blocks (see prev. section for references), you can at most create one for each symbol. The effect is thus limited to 6x.
Compressed blocks
In a nutshell: some inflation is achievable but it's very hard and the achievable effect is limited. Don't waste your time on them unless you have a very good reason.
Inside compressed blocks (section 3.2.5.), each chunk is [<encoded character(8-9 bits>|<encoded chunk length (7-11 bits)><distance back to data(5-18 bits)>], with lengths starting at 3. A 7-9-bit code unambiguously resolves to a literal character or a specific range of lengths. Longer codes correspond to larger lengths/distances. No space/meaningless stuff is allowed between chunks.
So the maximum for raw byte chunks is 9/8 (1.125x) - if all the raw bytes are with codes 144 - 255.
Playing with reference chunks isn't going to do any good for you: even a reference to a 3-byte sequence gives 25/24 (1.04x) at most.
That's it for static Huffman tables. Looking through the docs on dynamic ones, it optimizes the aforementioned encoding for the specific data or something. So, it should allow to make the ratio for the given data closer to the achievable maximum, but that's it.

how hex file is converting into binary in microcontroller

I am new to embedded programming. I am using a compiler to convert source code into hex and I will burn into microcontroller. My question is: microntroller (all ICs) will support binary numbers only (0 & 1). Then how it is working with hex file?
the software that loads the program/data into the flash reads whatever format it support which may be intel hex, motorola srecord, elf, coff, or a raw binary or other. and then do the right thing to program the flash with just the relevant ones and zeros.
First of all, the PC you are using right now has a processor inside, which works just like any other microcontroller. You are using it to browse the internet, although it's all "1s and 0s on the inside". And I am presuming your actual firmware doesn't come even close to running what your PC is running at this moment.
microntroller will support binary numbers only (0 & 1)
Your idea that "microntroller only supports binary numbers (0 & 1)" is a misconception. At it's very low level, yes, microcontroller contains a bunch of transistors, and each of them can store only two states of information (a bit).
But the reason for this is simply because this is a practical way to physically store one small chunk of data.
If you check the assembly instruction manual for your uC architecture, you will see a large number of instructions operating on different data widths (bits grouped into 8, 16 or larger chunks). If your controller is, say, 16-bit, then this will the basic word size for most instructions, and the one that will be the most efficient. When programming in C, this will also be the size of the "special" int type which all smaller integral types get expanded to.
In other words, bits are just building blocks of your hardware, and most of the time shouldn't even concern you at the firmware level, let alone higher application levels. Compare it to a human life form: human body is made of cells, but is also capable of doing more than a single-cell organism, isn't it?
i am using compiler to convert source code into hex
Actually, you are using the compiler to create the machine code for your particular microcontroller architecture. "Hex", or more precisely Intel Hex file format, is just one of several file formats used for storing the machine code into a file, and it's by convenience a plain-text ASCII file which you can easily open in Notepad.
To clarify, let's say you wrote a simple line of C code like this:
a = b + c;
Your compiler needs to know which architecture you are targeting, in order to convert this to machine code. For a fictional uC architecture, this will first get compiled to the following fictional assembly language:
// compiler decides that a,b,c will be stored at addresses 0x1000, 1004, 1008
mov ax, (0x1004) // move value from address 0x1004 to accumulator
add ax, (0x1008) // add value from address 0x1008 to accumulator
mov (0x1000), ax // move value from accumulator to address 0x1000
Each of these instructions has its own instruction opcode, which can be found inside the assembly instruction manual. If the instruction operates on one or more parameters, uC will know that the bytes following the instruction are data bytes:
// mov ax, (addr) --> opcode 0x10
// add ax, (addr) --> opcode 0x20
// mov (addr), ax --> opcode 0x30
mov ax, (0x1004) // 0x10 (0x10 0x04)
add ax, (0x1008) // 0x20 (0x10 0x08)
mov (0x1000), ax // 0x30 (0x10 0x00)
Now you've got your machine-code, which, written as hex values, becomes:
10 10 04 20 10 08 30 10 00
And converted to binary becomes:
0001000000010000000010000100000...
To transfer this to your controller, you will use a file format which your flash uploader knows how to read, which is what Intel Hex is most commonly used for.
Once transferred to your microcontroller, it will be stored as a bunch of bits in its flash memory, but the controller is designed to read these bits in chunks of 8 or more bits, and evaluate them as instruction opcodes or data, depending on the context. For the example above, it will read first 8 bits, and seeing that it's an instruction opcode 0x10 (which takes an additional address parameter), it will read the next two bytes to form the address 0x1004. It will then execute the instruction and advance the instruction pointer.
Hex, Decimal, Binary, they are all just ways of representing a number.
AA in hex is the same as 170 in decimal and 10101010 in binary (and 252 or Octal).
The reason the hex representation is used is because it is very convenient when working with microcontrollers as one hex character fits into 1 nibble. Hence F is 1111, FF is 1111 1111 and so fourth.

Bzip2 block header: 1AY&SY

This is the question about bzip2 archive format. Any Bzip2 archive consists of file header, one or more blocks and tail structure. All blocks should start with "1AY&SY", 6 bytes of BCD-encoded digits of the Pi number, 0x314159265359. According to the source of bzip2:
/*--
A 6-byte block header, the value chosen arbitrarily
as 0x314159265359 :-). A 32 bit value does not really
give a strong enough guarantee that the value will not
appear by chance in the compressed datastream. Worst-case
probability of this event, for a 900k block, is about
2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
For a compressed file of size 100Gb -- about 100000 blocks --
only a 48-bit marker will do. NB: normal compression/
decompression do *not* rely on these statistical properties.
They are only important when trying to recover blocks from
damaged files.
--*/
The question is: Is it true, that all bzip2 archives will have blocks with start aligned to byte boundary? I mean all archives created by reference implementation of bzip2, the bzip2-1.0.5+ utility.
I think that bzip2 may parse the stream not as byte stream but as bit stream (the block itself is encoded by huffman, which is not byte-aligned by design).
So, in other words: If grep -c 1AY&SY greater (huffman may generate 1AY&SY inside block) or equal to count of bzip2 blocks in the file?
BZIP2 looks at a bit stream.
From http://blastedbio.blogspot.com/2011/11/random-access-to-bzip2.html:
Anyway, the important bits are that a BZIP2 file contains one or more
"streams", which are byte aligned, each containing one (zero?) or more
"blocks", which are not byte aligned, followed by an end of stream
marker (the six bytes 0x177245385090 which is the square root of pi as
a binary coded decimal (BCD), a four byte checksum, and empty bits for
byte alignment).
The bzip2 wikipedia article also alludes to bit-block alignment (see the File Format section), which seems to be inline from what I remember from school (had to implement the algorithm...).

Can Fortran read bytes directly from a binary file?

I have a binary file that I would like to read with Fortran. The problem is that it was not written by Fortran, so it doesn't have the record length indicators. So the usual unformatted Fortran read won't work.
I had a thought that I could be sneaky and read the file as a formatted file, byte-by-byte (or 4 bytes by 4 bytes, really) into a character array and then convert the contents of the characters into integers and floats via the transfer function or the dreaded equivalence statement. But this doesn't work: I try to read 4 bytes at a time and, according to the POS output from the inquire statement, the read skips over like 6000 bytes or so, and the character array gets loaded with junk.
So that's a no go. Is there some detail in this approach I am forgetting? Or is there just a fundamentally different and better way to do this in Fortran? (BTW, I also tried reading into an integer*1 array and a byte array. Even though these codes would compile, when it came to the read statement, the code crashed.)
Yes.
Fortran 2003 introduced stream access into the language. Prior to this most processors supported something equivalent as an extension, perhaps called "binary" or similar.
Unformatted stream access imposes no record structure on the file. As an example, to read data from the file that corresponds to a single int in the companion C processor (if any) for a particular Fortran processor:
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_INT
INTEGER, PARAMETER :: unit = 10
CHARACTER(*), PARAMETER :: filename = 'name of your file'
INTEGER(C_INT) :: data
!***
OPEN(unit, filename, ACCESS='STREAM', FORM='UNFORMATTED')
READ (unit) data
CLOSE(unit)
PRINT "('data was ',I0)", data
You may still have issues with endianess and data type size, but those aspects are language independent.
If you are writing to a language standard prior to Fortran 2003 then unformatted direct access reading into a suitable integer variable may work - it is Fortran processor specific but works for many of the current processors.