How are bytes mapped to AXI4 bus on a little endian system - hardware

If a program running on a little endian processor writes the value 0xaabbccdd uncached to address 0, and the processor uses a 32-bit wide AXI4 bus, are bits 31-24 of WDATA 0xaa or 0xdd?
AXI does not expose byte-addressable memory; it can only read or write a full data bus width (32 bits in this case). The question is how byte addresses are mapped to data bus bits.
Section A3.4.3 of the AXI spec (rev E) discusses "byte invariant" endianness, but doesn't seem to explain the order of bytes on the data bus.

For the little-endian case in the question, bits 31-24 of WDATA are 0xAA. The byte-lane mapping on the bus is fixed and independent of endianness: WDATA[8n+7:8n] always carries the byte at address offset n of the transfer. What the processor's endianness determines is which byte of the value ends up at which address; a little-endian store of 0xaabbccdd puts 0xdd at offset 0 and 0xaa at offset 3, so 0xaa lands on WDATA[31:24]. See this answer for an explanation of byte-invariant endianness.
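If it helps to see the layout concretely, here is a small host-side C sketch (illustrative only; the WDATA lane numbers in the comments assume the byte-invariant rule above, not anything the compiler knows about):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Shows how a little-endian store of 0xaabbccdd lays out bytes by address.
 * On AXI, byte lane n (WDATA[8n+7:8n]) carries the byte at address offset n,
 * so printed byte i corresponds to WDATA[8i+7:8i]. */
int main(void) {
    uint32_t value = 0xaabbccddu;
    uint8_t bytes[4];
    memcpy(bytes, &value, sizeof bytes);   /* byte order as stored in memory */

    for (int i = 0; i < 4; i++) {
        printf("address offset %d -> 0x%02x (WDATA[%d:%d])\n",
               i, bytes[i], 8 * i + 7, 8 * i);
    }
    /* On a little-endian host: offset 0 = 0xdd ... offset 3 = 0xaa,
     * so WDATA[31:24] = 0xaa. */
    return 0;
}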

Related

OpenCL maximum size of private memory per work item

I have an AMD RX 570 4G.
OpenCL tells me that I can use a maximum of 256 work groups and 256 work items per group...
Let's say I use all 256 work groups with 256 work items in each of them.
Now, what is the maximum size of private memory per work item?
Is private memory equal to the total VRAM (4 GB) divided by the total number of work items (256x256)?
Or is it equal to the cache? If so, how?
VRAM is represented in OpenCL as global memory.
Private memory is initially allocated from the register file. Your RX 570 is from AMD's Polaris architecture, a.k.a. GCN 4, where each compute unit (64 shader processors) has access to 256 vector (SIMD) registers (each 64x32 bits wide) and 512 32-bit scalar registers. That works out to about 66 KiB per CU, but it's not as simple as just quoting that total.
A workgroup will always be scheduled on a single compute unit, so if you assign it 256 work items, then it will have to perform every vector instruction 4 times in sequence (64 x 4 = 256) and the vector registers will (simplifying slightly) effectively have to be treated as 64 256-entry registers.
Scalar registers are used for data and calculations which are identical on each work item, e.g. incrementing a loop counter, holding buffer base pointers, etc.
Private memory will usually spill to global if you use more than will fit in your register file. So performance simply drops.
So essentially, on GCN, your optimal workgroup size is usually 64. Use as little private memory as possible; definitely aim for less than half of the available register file, so that more than one workgroup can be scheduled at a time and latency from memory accesses can be papered over. Otherwise your shader cores will spend a lot of time just waiting for data to arrive or be written out.
Cache is used for OpenCL local and constant memory spaces. (Constant will again spill to global if you try to use too much. The size of local memory can be checked via the OpenCL API and again is divided among workgroups scheduled on the same compute unit, so if you use more than half, only one group can run on a CU, etc.)
I don't know where you're getting a limit of 256 workgroups from; the limit is essentially set by whether the GPU uses 32-bit or 64-bit addressing. Most applications won't get close to 4 billion work items even in the 32-bit case.
Private memory space is registers on the GPU die (0 cycle access latency) and not related to the amount of VRAM (global memory space) at all. The amount of private memory depends on the device (private memory per compute unit).
I don't know the private memory size for the RX 570, but for the older HD 7000 series GPUs it is 256 kB per CU. With a work group size of 256, that gives 1 kB per work item, which is equal to 256 float variables.
Cache size determines the size of local and constant memory space.
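If you want to check the queryable limits on your own device rather than guess, a minimal clGetDeviceInfo sketch along these lines should work (error handling omitted; picking the first GPU of the first platform is an assumption):

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  /* first GPU assumed */

    cl_ulong global_mem, local_mem;
    size_t max_wg_size;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof global_mem, &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof local_mem, &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof max_wg_size, &max_wg_size, NULL);

    printf("global memory (VRAM): %llu bytes\n", (unsigned long long)global_mem);
    printf("local memory:         %llu bytes\n", (unsigned long long)local_mem);
    printf("max work-group size:  %zu\n", max_wg_size);
    /* Note: private memory per work item is not directly queryable in OpenCL;
     * it depends on register usage and spilling as described above. */
    return 0;
}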

Are there unused bits in aarch64 instruction encoding?

As per this link about aarch64 instruction encoding, there are unused bits in some instructions, like the x bits in the listing below for the LDR instruction. But I can't find any documentation about unused bits in the ARMv8 manual. Are these unused bits valid according to the ARMv8 manual?
xxx1 1101 x1ii iiii iiii iinn nnnt tttt - ldr Ft ADDR_UIMM12
That link is from 2012, which is when the ARMv8 architecture was released, so there was not a lot of information about it at the time. The 'x' in that listing is related to how that tool decodes the instruction; I'm not sure how they do it, and it does not look correct to me.
You can find all the values for the encoding in the ARM Architecture Reference Manual; look at the LDR instructions that use immediate values (e.g. LDR (immediate), page 693, specifically the Unsigned offset variant on the following page).
You will see there that the two most significant bits encode the size of the register (size == 10 for W registers (32 bits) and size == 11 for X registers (64 bits)).
In the ARM Architecture Reference Manual, encodings that are not used are usually marked Unallocated Encoding, Reserved Encoding, or something similar.
Also, there are plenty of free encodings available, probably kept for future use, for example for the Scalable Vector Extension (SVE). You can see all the used and free encodings in the slides presented by Nigel Stephens at the Hot Chips 28 conference on August 22, 2016; look at slide 8, where the grey squares are free, unused encodings.
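As a rough illustration of how those fields line up, here is a small C decoder sketch for the integer LDR (immediate, unsigned offset) form; the bit positions are my reading of the manual and the example word is assumed, so double-check both against your revision:

#include <stdint.h>
#include <stdio.h>

/* Pulls apart an A64 LDR (immediate, unsigned offset) encoding.
 * Field positions as described above:
 *   size[31:30]  opc[23:22]  imm12[21:10]  Rn[9:5]  Rt[4:0]
 * The byte offset is imm12 scaled by the access size (8 for X, 4 for W). */
static void decode_ldr_uimm(uint32_t insn) {
    unsigned size  = (insn >> 30) & 0x3;
    unsigned imm12 = (insn >> 10) & 0xfff;
    unsigned rn    = (insn >> 5)  & 0x1f;
    unsigned rt    =  insn        & 0x1f;

    printf("size=%u (%s), imm12=%u, Rn=%u, Rt=%u\n",
           size, size == 3 ? "X/64-bit" : size == 2 ? "W/32-bit" : "other",
           imm12, rn, rt);
}

int main(void) {
    decode_ldr_uimm(0xF9400421u);  /* assumed to be LDR X1, [X1, #8]; verify with a disassembler */
    return 0;
}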

Finding the correct alignment for host visible memory

ppData points to a pointer in which is returned a host-accessible
pointer to the beginning of the mapped range. This pointer minus
offset must be aligned to at least
VkPhysicalDeviceLimits::minMemoryMapAlignment.
I want to allocate a Vec3 of floats in a uniform buffer. A Vec3 of floats is 12 bytes big.
VkMemoryRequirements { size: 16, alignment: 16, memory_type_bits: 15 }
Vulkan reports that it has to be aligned to 16 bytes, which means that the size of the allocation is now 16 instead of 12. So Vulkan has already handled this for me.
minMemoryMapAlignment on my GPU is 64 bytes. What exactly does this mean for my allocation? Does it mean that I cannot use the size from VkMemoryRequirements for my allocation, and that instead of allocating 16 bytes here I would have to allocate 64 bytes?
Update:
For a 12-byte allocation with a 16-byte alignment and a 64-byte minMemoryMapAlignment, I would still allocate only 16 bytes and then call:
vkMapMemory(device, memory, 0, 16, 0, &mapped);
But the pointer returned from vkMapMemory then effectively covers not 16 bytes but 64 bytes? And all the relevant data is in the first 12 bytes, with the rest just being "padded" memory? So in practice this basically means that I don't need to use minMemoryMapAlignment at all?
There is nothing in the spec that restricts the size of the allocation like that. The paragraph you quoted means that the mapping will be aligned to minMemoryMapAlignment, and you can then tell the compiler to use aligned memory accesses when accessing it. What will happen is that, when the memory is mapped, the remaining 48 bytes are wasted space in the host's address space. That is unlikely to matter, though.
This is why people keep saying to allocate larger blocks and subdivide them as needed. That way you can put 4 of those VkBuffers into a single 64-byte allocation (which you will need if you want to pipeline the rendering).
It's highly unlikely that that single vec3 is the only thing you need memory for, so take a look at your other allocations and see which ones you can combine.
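A rough sketch of that combine-and-suballocate idea (the function name and memoryTypeIndex are placeholders; error handling and the usual vkGetBufferMemoryRequirements / vkBindBufferMemory plumbing are omitted):

#include <vulkan/vulkan.h>

/* Round a size or offset up to the next multiple of 'alignment' (a power of two). */
static VkDeviceSize align_up(VkDeviceSize value, VkDeviceSize alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Suballocate four 16-byte uniform slots out of one 64-byte allocation.
 * 'device' and 'memoryTypeIndex' are assumed to come from the usual setup. */
void example(VkDevice device, uint32_t memoryTypeIndex) {
    const VkDeviceSize slotSize  = align_up(12, 16);  /* 12-byte vec3 rounded to 16 bytes */
    const VkDeviceSize totalSize = slotSize * 4;      /* 64 bytes                         */

    VkMemoryAllocateInfo info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize = totalSize,
        .memoryTypeIndex = memoryTypeIndex,
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &info, NULL, &memory);

    void *mapped;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);
    /* Slot i lives at (char *)mapped + i * slotSize; the mapping itself is
     * already aligned to minMemoryMapAlignment by the driver. */
    (void)mapped;
}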

What are the 0 bytes at the end of an Ethernet frame in Wireshark?

After the ARP payload in a frame there are many 0 bytes. Does anyone know the reason for the existence of these 0 bytes?
Check the Ethernet II section in Wireshark's packet-detail pane; all the 0 bytes are labelled as padding.
Ethernet requires that all packets be at least 60 bytes long (64 bytes if you include the Frame Check Sequence at the end), so if a packet is less than 60 bytes long (including the 14-byte Ethernet header), additional padding bytes have to be added to the end of the packet.
(Those padding bytes will not show up on packets sent by the machine running Wireshark; the padding is added by the Ethernet hardware, and packets being sent by the machine capturing the traffic are given to the program before being handed to the hardware, so they haven't been padded.)
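The arithmetic for the ARP case is easy to check; a tiny sketch, assuming the usual 28-byte ARP-for-IPv4 payload:

#include <stdio.h>

int main(void) {
    const int eth_header  = 14;  /* dst MAC + src MAC + EtherType                 */
    const int arp_payload = 28;  /* ARP for IPv4 over Ethernet                    */
    const int min_frame   = 60;  /* minimum frame size excluding the 4-byte FCS   */

    int frame   = eth_header + arp_payload;             /* 42 bytes               */
    int padding = frame < min_frame ? min_frame - frame : 0;
    printf("padding needed: %d bytes\n", padding);      /* 18 zero bytes          */
    return 0;
}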

What does alignment to 16-byte boundary mean in x86

Intel's official optimization guide has a chapter on converting from MMX instructions to SSE where they state the following:
Computation instructions which use a memory operand that may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation that uses instead register operands.
(chapter 5.8 Converting from 64-bit to 128-bit SIMD Integers, pg. 5-43)
I can't understand what they mean by "may not be aligned to a 16-byte boundary", could you please clarify it and give some examples?
Certain SIMD instructions, which perform the same operation on multiple data elements, require that the memory address of that data is aligned to a certain byte boundary. This effectively means that the address of the memory your data resides in needs to be divisible by the number of bytes required by the instruction.
So in your case the alignment is 16 bytes (128 bits), which means the memory address of your data needs to be a multiple of 16. E.g. 0x00010 would be 16 byte aligned, while 0x00011 would not be.
How to get your data to be aligned depends on the programming language (and sometimes compiler) you are using. Most languages that have the notion of a memory address will also provide you with means to specify the alignment.
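A small sketch of what that looks like in C with SSE intrinsics (a compiler with SSE2 enabled is assumed; the buffer names are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <stdalign.h>
#include <emmintrin.h>

int main(void) {
    alignas(16) int32_t aligned_buf[4] = {1, 2, 3, 4};  /* address is a multiple of 16 */
    int32_t plain_buf[5] = {0, 1, 2, 3, 4};
    int32_t *maybe_unaligned = &plain_buf[1];           /* may not be 16-byte aligned  */

    printf("aligned_buf address %% 16 = %u\n", (unsigned)((uintptr_t)aligned_buf % 16));

    __m128i a = _mm_load_si128((const __m128i *)aligned_buf);       /* MOVDQA: needs 16-byte alignment */
    __m128i b = _mm_loadu_si128((const __m128i *)maybe_unaligned);  /* MOVDQU: works for any address   */
    __m128i sum = _mm_add_epi32(a, b);

    alignas(16) int32_t out[4];
    _mm_store_si128((__m128i *)out, sum);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}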
I'm guessing here, but could it be that "may not be aligned to a 16-byte boundary" means that this memory location has been aligned to a smaller value (4 or 8 bytes) before for some other purposes and now to execute SSE instructions on this memory you need to load it into a register explicitly?
Data that's aligned on a 16-byte boundary has a memory address that's a multiple of 16. Don't confuse this with 16-bit (2-byte) alignment, which only requires the address to be an even number.
Similarly, memory aligned on a 32-bit (4-byte) boundary has a memory address that's a multiple of four, because four bytes are grouped together to form a 32-bit word.