Are there unused bits in aarch64 instruction encoding? - armv8

As per this link about aarch64 instruction encoding, there are unused bits in some instructions, like the x bits in the listing below for the LDR instruction. But I can't find any documentation about unused bits in the ARMv8 manual. Are these unused bits valid according to the ARMv8 manual?
xxx1 1101 x1ii iiii iiii iinn nnnt tttt - ldr Ft ADDR_UIMM12

That link is from 2012, which is when the ARMv8 architecture was released, so there was not a lot of information about it at the time. The 'x' in that listing relates to how that particular project decodes the instruction; I'm not sure how they derived it, and it does not look correct to me.
You can find all the values for the encoding in the ARM Architecture Reference Manual; look at the LDR instructions that use immediate values (e.g. LDR (immediate), page 693, specifically the Unsigned offset variant on the next page).
You will see there that the two most significant bits are used for the size of the register (size == 10 is for W registers (32 bits) and size == 11 for X registers (64 bits)).
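To make the field layout concrete, here is a minimal Python sketch that pulls those fields out of an LDR (immediate, unsigned offset) word, assuming the bit positions described in the manual (size in bits [31:30], imm12 in [21:10], Rn in [9:5], Rt in [4:0]); the example word 0xF9400420 is assumed to be ldr x0, [x1, #8]:
def decode_ldr_imm_unsigned(insn):
    size = (insn >> 30) & 0x3     # 10 -> W register (32-bit), 11 -> X register (64-bit)
    imm12 = (insn >> 10) & 0xFFF  # unsigned immediate, scaled by the access size
    rn = (insn >> 5) & 0x1F       # base register
    rt = insn & 0x1F              # transfer register
    offset = imm12 << size        # byte offset = imm12 * (1 << size)
    return size, rt, rn, offset
print(decode_ldr_imm_unsigned(0xF9400420))  # (3, 0, 1, 8) -> ldr x0, [x1, #8]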
In the ARM Architecture Reference Manual, when an encoding is not used, it usually says Unallocated Encoding or Reserved Encoding or something similar.
Also, there are plenty of free encodings left, probably reserved for future use, for example for the Scalable Vector Extension (SVE). You can see all the used and free encodings in the slides presented by Nigel Stephens at the Hot Chips 28 conference on August 22, 2016; look at slide 8, where the grey squares are free, unused encodings.

Related

How to change the gem5 ARM SVE vector length?

I'm doing an experiment to see which ARM SVE vector length would be the best for my chip design, or to help select which chip has the optimal vector length for my application.
How to change the vector length in a gem5 simulation to see how it affects workload performance?
For SE:
se.py --param 'system.cpu[:].isa[:].sve_vl_se = 2'
For FS:
fs.py --param 'system.sve_vl = 2'
where the values are given in multiples of 128 bits, so 2 means a 256-bit vector length.
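As a small convenience, a sketch like the following (the helper name is just illustrative) turns a target length in bits into the two --param strings above:
def sve_params(bits):
    # bits must be a non-zero multiple of 128, matching gem5's vector-length steps
    assert bits % 128 == 0 and bits > 0
    vl = bits // 128
    return (f"system.cpu[:].isa[:].sve_vl_se = {vl}",  # SE mode parameter
            f"system.sve_vl = {vl}")                   # FS mode parameter
print(sve_params(256))  # ('system.cpu[:].isa[:].sve_vl_se = 2', 'system.sve_vl = 2')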
You can test this easily with the ADDVL instruction as shown in this example.
The name of those parameters can be easily determined by looking at a m5out/config.ini generated from a previous run.
Note however that this value is architecturally visible, so it might not be possible to checkpoint after Linux boot and restore with a different vector length than the one used at boot in order to speed up experiments. This is likely true in general even though the kernel itself does not run vector instructions, because there is software control of the effective vector length. Maybe it is possible to set a big vector length on the simulator to start with and then tell Linux to reduce it somehow in software, but I'm not sure what the API is.
Tested in gem5 3126e84db773f64e46b1d02a9a27892bf6612d30.
To change the vector length, one can use the command-line option:
--arm-sve-vl=<vl in quadwords: one of {1, 2, 4, 8, 16}>
where vl is given in 128-bit quadwords. So for a simulation of a 512-bit SVE machine, one should use:
--arm-sve-vl=4
This works both for Syscall-Emulation mode and Full System mode.
If one wants to quickly explore the space of different vector lengths, one can also change it during the simulation (only in Full system mode). For example, to change the SVE length to 256, put the following line in your bootscript, before running the benchmark:
echo 256 >/proc/sys/abi/sve_default_vector_length
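If you prefer doing this from a guest-side Python script rather than echo, a rough sketch would be the following (same /proc path and value as the bootscript line above; check the kernel's SVE documentation for the exact unit this sysctl expects):
PATH = "/proc/sys/abi/sve_default_vector_length"
with open(PATH, "w") as f:     # same effect as the echo line above
    f.write("256")
with open(PATH) as f:          # read back to confirm the new default
    print("default SVE vector length:", f.read().strip())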
You can get more information on https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf.

MIPS Branch Instructions

I am learning MIPS right now, and as I was reading the documentation, it said:
An 18-bit signed offset (the 16-bit offset field shifted left 2 bits)
I was wondering why exactly for branch instructions the offset is being multiplied by 4? The documentation also stated that this makes the range for branch instructions 128 KB, because the 32 KB range is multiplied by 4. Does this multiplication only apply to branch instructions, or does it apply to jump instructions as well?
Thanks!
I was wondering why exactly for branch instructions the offset is being multiplied by 4?
All instructions must be word-aligned. From that it follows that both the origin and the destination are word-aligned, which in turn means that the offset also always will be word-aligned. So it would be a waste to store the two least significant bits of the offset since they always will be 0. Instead we can use the available bits in the instruction word to encode 18-bit offsets by storing only the 16 most significant bits.
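A small Python sketch of that arithmetic, just to illustrate the idea: the stored 16-bit field is the byte offset shifted right by 2, and the hardware recovers the 18-bit offset by sign-extending and shifting left by 2.
def encode_branch_offset(byte_offset):
    assert byte_offset % 4 == 0           # branch origin and target are word-aligned
    return (byte_offset >> 2) & 0xFFFF    # store only the upper 16 of the 18 bits
def decode_branch_offset(field):
    value = field - 0x10000 if field & 0x8000 else field  # sign-extend 16 bits
    return value << 2                                      # restore the two implicit zeros
print(hex(encode_branch_offset(-8)))  # 0xfffe
print(decode_branch_offset(0xFFFE))   # -8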
Does this multiplication only apply to branch instructions, or does it apply to jump instructions as well?
It's the same for jump instructions, though jump instructions differ in other ways: the target field for a jump is not PC-relative, but relative to the start of the 256 MB-aligned region that the PC is currently in.
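For illustration, here is a rough sketch of the J-type target calculation as usually described: the 26-bit index is shifted left by 2 and combined with the top 4 bits of the address of the instruction in the delay slot (PC + 4), i.e. the current 256 MB region.
def jump_target(pc, instr_index):
    # top 4 bits come from PC + 4; the 26-bit index, shifted left by 2, fills in the rest
    return ((pc + 4) & 0xF0000000) | ((instr_index & 0x3FFFFFF) << 2)
print(hex(jump_target(0x00400018, 0x100000)))  # 0x400000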

What does alignment to 16-byte boundary mean in x86

Intel's official optimization guide has a chapter on converting from MMX instructions to SSE, where they state the following:
Computation instructions which use a memory operand that may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation that uses instead register operands.
(chapter 5.8 Converting from 64-bit to 128-bit SIMD Integers, pg. 5-43)
I can't understand what they mean by "may not be aligned to a 16-byte boundary", could you please clarify it and give some examples?
Certain SIMD instructions, which perform the same operation on multiple data elements, require that the memory address of that data is aligned to a certain byte boundary. This effectively means that the address of the memory your data resides in needs to be divisible by the number of bytes required by the instruction.
So in your case the alignment is 16 bytes (128 bits), which means the memory address of your data needs to be a multiple of 16. E.g. 0x00010 would be 16 byte aligned, while 0x00011 would not be.
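A quick way to check that divisibility rule, as a sketch:
def is_aligned(addr, boundary=16):
    return addr % boundary == 0   # equivalently (addr & (boundary - 1)) == 0 for powers of two
print(is_aligned(0x00010))  # True  - 0x10 is a multiple of 16
print(is_aligned(0x00011))  # False - 0x11 is not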
How to get your data to be aligned depends on the programming language (and sometimes compiler) you are using. Most languages that have the notion of a memory address will also provide you with means to specify the alignment.
I'm guessing here, but could it be that "may not be aligned to a 16-byte boundary" means that this memory location has been aligned to a smaller value (4 or 8 bytes) before for some other purposes and now to execute SSE instructions on this memory you need to load it into a register explicitly?
Data that's aligned on a 16-byte boundary has a memory address that's a multiple of 16; 16 bytes is 128 bits, the width of an SSE register.
Similarly, memory aligned on a 32-bit (4-byte) boundary has a memory address that's a multiple of four, because four bytes are grouped together to form a 32-bit word.

Calculate size in Hex Bytes

What is the proper way to calculate the size in hex bytes of a code segment? I am given:
IP = 0848 CS = 1488 DS = 1808 SS = 1C80 ES = 1F88
The practice exercise I am working on asks what is the size (in hex bytes) of the code segment and gives these choices:
A. 3800 B. 1488 C. 0830 D. 0380 E. none of the above
The correct answer is A. 3800, but I haven't a clue as to how to calculate this.
How to calculate the length:
Note CS. Find the segment register that's nearest to it, but greater.
Take the difference between the two, and multiply by 0x10 (read: tack on a 0).
In your example, DS is closest. 1808 - 1488 == 380, and 380 x 10 == 3800 (all values in hex).
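The same calculation in a short Python sketch, using the register values from the question (real mode, where each segment-register unit is 16 bytes):
CS, DS = 0x1488, 0x1808
paragraphs = DS - CS              # 0x380 sixteen-byte paragraphs
size_in_bytes = paragraphs * 0x10
print(hex(size_in_bytes))         # 0x3800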
BTW, this only works on the 8086 and other, similarly boneheaded CPUs, and in real mode on x86. In protected mode on x86 (which is to say, unless you're writing a boot sector or a simple DOS program), the value of the segment register has very little to do with the size of the segment, and thus the stuff above simply doesn't apply.

x86 binary bloat - 32-bit offsets when 8-bits would do

I'm using clang+LLVM 2.9 to compile various workloads for x86 with the -Os option. Small binary size is important and I must use static linking. All binaries are 32-bit.
I notice that many instructions use addressing modes with 32-bit displacements when only 8 bits are actually used. For example:
89 84 24 d4 00 00 00 mov %eax,0xd4(%esp)
Why didn't the compiler/assembler choose the compact 8-bit displacement?
89 44 24 d4 mov %eax,0xd4(%esp)
In fact, these wasted addressing bytes are over 2% of my entire binary!
I looked at LLVM's link-time optimization and tried --emit-llvm, but neither mentioned nor helped with this issue.
Is there some link-time optimization that can use knowledge of the actual displacements to choose the smaller instruction form?
Thanks for any help!
In x86, displacements are signed. This allows you to access data on both sides of the base address, so the range of an 8-bit displacement is -128 to 127. Your instruction references data 212 bytes forward (0xD4 in decimal). If 0xD4 had been encoded as an 8-bit displacement, it would mean -44 in decimal, which is not what you wanted.
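A tiny sketch showing why 0xD4 can't serve as an 8-bit displacement here:
def as_signed8(byte):
    return byte - 0x100 if byte & 0x80 else byte  # two's-complement interpretation
print(0xD4)              # 212 - what the 32-bit displacement means
print(as_signed8(0xD4))  # -44 - what the same byte would mean as a disp8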