GPU Memory Read Instruction Flow, Operand collector

GPU Memory Read Instruction Flow, Operand collector - gpu

I am trying to learn the architecture of a GPU with GPGPU-Sim and I am confused with the flow of memory operations. Lets say I have arithmetic instruction like a = b + c. Before doing the calculation, memory load operations are required for b and c. Load instructions for these are sent to the memories.First of all cache tags are checked.
In case of a miss, the request is being added to MSHR and sent to the lower memory via interconnection network from gpu cores. When the request returns to the core from interconnection network, it is added to a some kind of memory response fifo. Then cache lines are filled by ejecting those requests from the response fifo.
In case of a hit, data are available at cache.
In both cases, our data for arithmetic instruction units are available in caches. I know that operand collector collects required operands for issuing warps, but the part confuses me is where does the operand collector collects those operands from? Per thread registers? If so, when do these registers get required data from caches?

Found the answer. One memory request response from memory response fifo is popped each cycle when the fifo is not empty and writeback stage is not stalled.The popped memory request response gets written to the single ported register file banks. SIMD execution units load required registers for arithmetic instructions from those register file banks when needed. Information about operand collector and those register file banks are available online and pantented by NVIDIA.

Related

Recovering from OOM on DirectByteBuffer allocation on a WebFlux Web Server

I'm not sure if this makes sense so please comment if I need to provide more info:
My webserver is used to upload files (receives files as Multipart/form-data and uploads them to another service). Using WebFlux, the controller defines the argument as a #RequestPart(name = "payload") final Part payload which wraps the header and Flux.
Reactor / Netty uses DirectByteBuffers to accomodate the payload. If the request handler cannot get enough direct memory to handle the request, it's gonna fail on an OOM and return 500. So this is normal / expected.
However, what's supposed to happen after?
I'm running load tests by sending multiple requests at the same time (either lots of requests with small files or less requests with bigger files). Once I get the first 500 due to an OOM, the system becomes unstable. Some requests will go through, and other fails with OOM (even requests with very small payload can fail).
This behaviour leds me to believe the allocated Pooled buffers are not shared between IO Channels? However this seems weird, it makes the system very easy to DDOS?
From the tests I did, I get the same behaviour using unpooled databuffers, although for a different reason. I do see the memory being unallocated when doing jcmd <PID> VM.native_memory but they aren't released to the OS according to metrics & htop. For instance, the reserved memory shown by jcmd goes back down but htop still reports the previous-high amount and it eventually OOM.
So Question :
Is that totally expected or am I missing a config value somewhere?
Setup :
Spring-boot 2.5.5 on openjdk11:jdk-11.0.10_9
Netty config :
-Dio.netty.allocator.type=pooled -Dio.netty.leakDetectionLevel=paranoid -Djdk.nio.maxCachedBufferSize=262144 -XX:MaxDirectMemorySize=1g -Dio.netty.maxDirectMemory=0

Is there a protocol or well-defined procedure for instruments to send their measurement results to control PC's over GPIB?

With a control PC, I am addressing a R&S ESPI Receiver device to perform a frequency scan and return the measurement results back via BAT-EMC control software and a NI GPIB-USB controller in between. My target is to track the binary measurement data (Definite Length Block Data according to IEEE 488.2) sent to the control PC to understand how the device is deciding on the byte size of each binary block sent.
The trace shows that binary blocks are sent with no consistent pattern or rule!
E.g, running the same scan with the same frequency range and step twice may result in a different distribution of the measurement values' bytes on binary blocks (and possibly different total number of blocks sent), although the amount of data delivered is the same.
Any help to figure out how the device and control software are communicating the measurement data?
PS: The NI trace at the level of GPIB controller is not showing that the control software is specifying a byte size when querying for the next block, neither is the instrument sending this piece of info when it is issuing a service request so that it is queried for more available data by the software (according to the trace).

Make sure that you are giving enough time for the instrument to respond. Possibly you are sending commands from the PC which would assert the ATN line and interrupt the response. You should be able to configure the instrument to send one result. Configure the instrument as a listener and talker and set the instrument to send only one response per trigger. Then send the group execute trigger (GET) and read the results off the bus. When it’s done measure how long it took for that packet to get sent. If you are sending triggers before the full response you will be terminating the output stream. I suspect this because the streams are randomly different.
I’m just starting to learn GPIB so please write back what happened.

Shared receive buffer for USB endpoints?

I'm developing a USB device driver for a microcontroller (Atmel/Microchip SAMD21, but I think the question is a general one). I need multiple endpoints for control & data, and the USB hardware uses per-endpoint descriptors to (among other things) locate buffers for input and output data.
Since IN data is polled at the host's discretion it makes sense that each endpoint has its own IN buffer, so that any endpoint's data (if it has any to send) is immediately available when polled.
But as far as incoming data from SETUP & OUT transactions is concerned, it occurs to me that I can save memory by configuring all endpoints to use a shared buffer. It seems wasteful for each endpoint to have its own buffer when, given the nature of USB transactions, only one such transaction can occur at a time.
Obviously this approach requires that transaction interrupts are handled sufficiently quickly that the shared buffer is freed and prepared for a new transaction in time for whatever the next transaction might be - but this is already a requirement for the control endpoint, where some SETUP transactions are immediately followed by an OUT.
So, assuming the timing is feasible, is there any other reason why such an approach wouldn't work?

Probably not.
Normally, the USB module on a microcontroller handles OUT packets by keeping track of which packet buffers it has written data to, and it waits for your firmware to say it is done processing the buffer before accepting more data from the computer and overwriting the buffer. If an endpoint has no buffers available to receive more data, but the computer sends an OUT packet to the endpoint, the USB module typically responds to the computer with a NAK packet, which tells the computer it should retry later. In this situation, your firmware can take pretty much as long as it wants to handle the OUT packets.
By having multiple endpoints configured to use the same buffer, you mess up this system. When you receive an OUT packet on any of your endpoints, the USB module would (probably) not know that multiple endpoints use the same buffer, so it would not issue NAK packets on your other OUT endpoints. If it receives another OUT packet right away, it would write it to the same buffer, overwriting the previous packet. Therefore, whenever you receive a packet, your code would have to rush as fast as it can to do something like copying the data out of that buffer, disabling other OUT endpoints, or reassigning buffers.
Even if you can actually get this to work, it means that your scheme to save a little bit of memory turns the servicing of USB events into a real-time task (i.e. a task that requires responses from your code in a few microseconds). If you want to add another real-time task to your system later, it will be very difficult, because you always have to be ready to be interrupted by your USB-handling code.
The SAMD21 has tons of memory (32 KB) so you probably don't need to worry about optimizing this part of it.

I agree with David's Response. You didn't mention the speed of the device you are creating. A low-speed would need just a few 8-byte buffers. A full-speed, a few 64-byte buffers. High-speed, maybe eight 64-byte buffers, depending on your use. A super-speed device, your still only talking a few 512-byte buffers.
I would create a ring buffer for each endpoint. This way you are not moving data around. You are simply using a pointer that points to an entry within a memory ring. A full-speed device with a control endpoint, an interrupt endpoint, and two bulk endpoints, each endpoint having sixteen 64-byte entries per ring, is still only a total of 4k RAM, 1/8th of the total RAM.
However, I am not familiar with the SAMD21, so please check the specification to be sure this will work.

asio - the design reason of async_write_some may not transmit all of the data

From user view, the property of "may not transmit all of the data" is a trouble thing. That will cause handler calls more than one time(may be).
The free function async_write ensure handler call only once, but it requires caller must call it in sequence or the data written will be interleaving. For network application usage, this is more bad than handler be called more than once.
If user want to handler called only once and data written is correct, user need to to do something.
I want to ask is: why asio not just make socket::async_write_some transmit all data?

I want to ask is: why asio not just make socket::async_write_some
transmit all data?
Opposed to async_write, socket::async_write_some is lower-level method.
The OS network stack is designed with send buffers and receive buffers. This buffers are required to be limited with some amount of memory. When you send many data over socket, receiving side can be more slow than sending and/or there can be network speed issues.
This is the reason why socket send buffers are limited and as a result system's syscalls like write or writev should be able to notify user program that system cannot accept chunk of data right now. With socket in async mode its even more critical. So, socket syscalls cannot work in async manner without signaling program to hold on.
So, the async_write_some as a mid-level wrapper to writev is required to support partial writes. In other hand async_write is composed operation and can call async_write_some many times in order to send buffers until operation is complete or possibly failed. It calls completion handler only once, not for each chunk of data passed to network stack.
If user want to handler called only once and data written is correct,
user need to to do something.
Nothing special, just to use async_write, not socket::async_write_some.

When do USB Hosts require a zero-length IN packet at the end of a Control Read Transfer?

I am writing code for a USB device. Suppose the USB host starts a control read transfer to read some data from the device, and the amount of data requested (wLength in the Setup Packet) is a multiple of the Endpoint 0 max packet size. Then after the host has received all the data (in the form of several IN transactions with maximum-sized data packets), will it initiate another IN transaction to see if there is more data even though there can't be more?
Here's an example sequence of events that I am wondering about:
USB enumeration process: max packet size on endpoint 0 is reported to be 64.
SETUP-DATA-ACK transaction starts a control read transfer, wLength = 128.
IN-DATA-ACK transaction delivers first 64 bytes of data to host.
IN-DATA-ACK transaction delivers last 64 bytes of data to host.
IN-DATA-ACK with zero-length DATA packet? Does this transaction ever happen?
OUT-DATA-ACK transaction completes Status Phase of the transfer; transfer is over.
I tested this on my computer (Windows Vista, if it matters) and the answer was no: the host was smart enough to know that no more data can be received from the device, even though all the packets sent by the device were full (maximum size allowed on Endpoint 0). I'm wondering if there are any hosts that are not smart enough, and will try to perform another IN transaction and expect to receive a zero-length data packet.
I think I read the relevant parts of the USB 2.0 and USB 3.0 specifications from usb.org but I did not find this issue addressed. I would appreciate it if someone can point me to the right section in either of those documents.
I know that a zero-length packet can be necessary if the device chooses to send less data than the host requested in wLength.
I know that I could make my code flexible enough to handle either case, but I'm hoping I don't have to.
Thanks to anyone who can answer this question!

Read carefully USB specification:
The Data stage of a control transfer from an endpoint to the host is complete when the endpoint does one of
the following:
Has transferred exactly the amount of data specified during the Setup stage
Transfers a packet with a payload size less than wMaxPacketSize or transfers a zero-length packet
So, in your case, when wLength == transfer size, answer is NO, you don't need ZLP.
In case wLength > transfer size, and (transfer size % ep0 size) == 0 answer is YES, you need ZLP.

In general, USB uses a less-than-max-length packet to demarcate an end-of-transfer. So in the case of a transfer which is an integer multiple of max-packet-length, a ZLP is used for demarcation.
You see this in bulk pipes a lot. For example, if you have a 4096 byte transfer, that will be broken down into an integer number of max-length packets plus one zero-length-packet. If the SW driver has a big enough receive buffer set up, higher-level SW receives the entire transfer at once, when the ZLP occurs.
Control transfers are a special case because they have the wLength field, so ZLP isn't strictly necessary.
But I'd strongly suggest SW be flexible to both, as you may see variations with different USB host silicon or low-level HCD drivers.

I would like to expand on MBR's answer. The USB specification 2.0, in section 5.5.3, says:
The Data stage of a control transfer from an endpoint to the host is
complete when the endpoint does one of the following:
Has transferred exactly the amount of data specified during the Setup stage
Transfers a packet with a payload size less than wMaxPacketSize or transfers a zero-length packet
When a Data stage is complete, the Host Controller advances to the
Status stage instead of continuing on with another data transaction.
If the Host Controller does not advance to the Status stage when the
Data stage is complete, the endpoint halts the pipe as was outlined in
Section 5.3.2. If a larger-than-expected data payload is received from
the endpoint, the IRP for the control transfer will be
aborted/retired.
I added emphasis to one of the sentences in that quote because it seems to specifically say what the device should do: it should "halt" the pipe if the host tries to continue the data phase after it was done, and it is done if all the requested data has been transmitted (i.e. the number of bytes transferred is greater than or equal to wLength). I think halting refers to sending a STALL packet.
In other words, the device does not need a zero-length packet in this situation and in fact the USB specification says it should not provide one.

You don't have to. (*)
The whole point of wLength is to tell the host the maximum number of bytes it should attempt to read (but it might read less !)
(*) I have seen devices that crash when IN/OUT requests were made at incorrect time during control transfers (when debugging our host solution). So any host doing what you are worried about, would of killed those devices and is hopefully not in the market.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas