What is the minimum latency of USB 3.0?

First up, I don't know much about USB, so apologies in advance if my question is wrong.
In USB 2.0 the polling interval was 0.125ms, so the best possible latency for the host to read some data from the device was 0.125ms. I'm hoping for reduced latency in USB 3.0 devices, but I'm finding it hard to learn what the minimum latency is. The USB 3.0 spec says, "USB 2.0 style polling has been replaced with asynchronous notifications", which implies the 0.125ms polling interval may no longer be a limit.
I found some benchmarks for a USB 3.0 SSD that suggest data can be read from the device in slightly less than 0.125ms, and that includes all time spent in the host OS and the device's flash controller.
http://www.guru3d.com/articles_pages/ocz_enyo_usb_3_portable_ssd_review,8.html
Can someone tell me what the lowest possible latency is? A theoretical answer is fine. An answer including the practical limits of the various versions of Linux and Windows USB stacks would be awesome.
To head off the "tell me what you're trying to achieve" question, I'm creating a debug interface for the ASICs my company designs, i.e. a PC connects to one of our ASICs via a debug dongle. One possible use case is to implement conditional breakpoints when the ASIC hardware only implements simple breakpoints. To do so, I need to determine when a simple breakpoint has been hit, evaluate the condition, and if it is false, set the processor running again. The simple breakpoint may be hit millions of times before the condition becomes true. We might implement the debug dongle on an FPGA or an off-the-shelf USB 3.0 enabled micro-controller.
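To make the latency sensitivity concrete, here is a rough sketch of the host-side loop I have in mind. The dongle_* functions are hypothetical placeholders for whatever the dongle's API ends up being; the point is that every iteration costs several USB round trips, so round-trip latency multiplied by millions of hits dominates.

#include <stdbool.h>
#include <stdint.h>

extern void dongle_wait_for_breakpoint(void);    /* blocks until the simple breakpoint fires (hypothetical) */
extern uint32_t dongle_read_reg(unsigned regno); /* fetch a CPU register over the debug link (hypothetical) */
extern void dongle_resume(void);                 /* set the target processor running again (hypothetical) */

static bool condition_holds(uint32_t r0)
{
    return r0 == 0xDEADBEEF;                     /* example condition */
}

void run_until_condition(void)
{
    for (;;) {
        dongle_wait_for_breakpoint();            /* at least one round trip */
        if (condition_holds(dongle_read_reg(0))) /* at least one more round trip */
            break;
        dongle_resume();                         /* and one more to resume */
    }
}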

Answering my own question...
I've come to realise that this question kind-of misses the point of USB 3.0. Unlike 2.0, it is not a shared-bus system. Instead it uses a point-to-point link between the host and each device (I'm oversimplifying but the gist is true). With USB 2.0, the 125 us polling interval was critical to how the bus was time-division multiplexed between devices. However, because 3.0 uses point-to-point links, there is no multiplexing to be done and thus the polling interval no longer exists. As a result, the latency on packet delivery is much less than with USB 2.0.
In my experiments with a Cypress FX-3 devkit, I have found that it is easy enough to get a round trip from a Windows application to the device and back with an average latency of 30 us. I suspect that the vast majority of that time is spent in various OS delays, e.g. the user-space to kernel-space mode switch and the DPC latency within the driver.
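For reference, the measurement was essentially a tight write/read loop timed with the high-resolution counter. A minimal sketch, assuming the device is bound to WinUSB and using 0x01/0x81 as placeholder endpoint addresses (the Cypress tooling provides its own API, so treat this purely as an illustration):

#include <windows.h>
#include <winusb.h>

/* Returns the average host->device->host round trip in microseconds,
 * given an already-opened WINUSB_INTERFACE_HANDLE. */
double average_roundtrip_us(WINUSB_INTERFACE_HANDLE h, int iterations)
{
    LARGE_INTEGER freq, t0, t1;
    UCHAR out[4] = { 0 }, in[4];
    ULONG transferred;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < iterations; i++) {
        WinUsb_WritePipe(h, 0x01, out, sizeof(out), &transferred, NULL); /* host -> device */
        WinUsb_ReadPipe(h, 0x81, in, sizeof(in), &transferred, NULL);    /* device -> host */
    }
    QueryPerformanceCounter(&t1);

    return (double)(t1.QuadPart - t0.QuadPart) * 1e6
           / (double)freq.QuadPart / (double)iterations;
}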

I've got a couple of resources for you. One I've just downloaded is the complete spec, several PDFs zipped up for USB 3.0, and here is a short excerpt from pages 58-59 (USB 3_r1.0_06_06_2011.pdf):
USB 2.0 transmits SOF/uSOF at fixed 1 ms/125 μs intervals. A device driver may change the interval with small finite adjustments depending on the implementation of host and system software. USB 3.0 adds mechanism for devices to send a Bus Interval Adjustment Message that is used by the host to adjust its 125 μs bus interval up to +/-13.333 μs.
In addition, the host may send an Isochronous Timestamp Packet (ITP) within a relaxed timing window from a bus interval boundary.
Here is one more resource which looked interesting, dealing with calculating latency.
You make a good point about operating system latency issues, especially on non-real-time operating systems.
I might suggest that you check on SuperUser too; maybe someone has other ideas. Cheers!

I dispute the marked answer.
On Windows there is no way to achieve the stated roundtrip latency over USB, SuperSpeed (3.0) or not. The documentation states:
The number of isochronous packets must be a multiple of the number of packets per frame.
https://learn.microsoft.com/en-us/windows-hardware/drivers/usbcon/transfer-data-to-isochronous-endpoints
The number of packets per frame is given by bInterval, which also determines the polling interval. E.g. if you want to achieve a transfer every microframe (125 usec), you will need to submit 8 transfers per URB (USB Request Block), which means a scheduling service interval of 1 ms.
Anything else requires your own kernel-mode driver or is out-of-spec.
On RT Linux I can confirm roundtrips of 2*125usec + some overhead.
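For comparison, a round trip on Linux can be measured from user space with libusb. A minimal sketch, with 0x01/0x81 as placeholder bulk endpoints; the numbers you see will depend on endpoint type, host controller and kernel configuration (PREEMPT_RT mainly reduces scheduling jitter):

#include <libusb-1.0/libusb.h>
#include <time.h>

/* Time one host->device->host round trip over bulk endpoints, in microseconds. */
static double roundtrip_us(libusb_device_handle *dev)
{
    unsigned char out[4] = { 0 }, in[4];
    int transferred;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    libusb_bulk_transfer(dev, 0x01, out, sizeof(out), &transferred, 1000); /* OUT */
    libusb_bulk_transfer(dev, 0x81, in, sizeof(in), &transferred, 1000);   /* IN  */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}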

Excerpts from embedded.com: "USB 3.0 vs USB 2.0: A quick reference summary for the busy engineer"
Communication architecture differences
USB 2.0 employs a communication architecture where the data transaction must be initiated by the host. The host will frequently poll the device and ask for data, and the device may only transmit data once it has been requested by the host. The high polling frequency not only increases power consumption, it increases transmission latency because the data can only be transmitted when the device is polled by the host. USB 3.0 improves upon this communication model and reduces transmission latency by minimizing polling and also allowing devices to transmit data as soon as it is ready.
...
Timestamp enhancements
Unlike USB 2.0 cameras, which can range in accuracy from 0 to 125 us, the timestamp originating from USB 3.0 cameras is more precise, and mimics the accuracy of the 1394 cycle timer of FireWire cameras.
...
USB 3.0 -- or Super-speed USB -- overcomes all these key limitations of other specifications, with six (over IEEE 1394b) to nine (over USB 2.0) times higher bandwidth, better error management, higher power supply, ... and lower latency and jitter times.
P.S. It also mentions "longer cable lengths" for USB 3.0, but another paragraph contradicts this and says up to 5 m for USB 2.0 and up to 3 m for USB 3.0.

Related

WebRTC Datachannel for high bandwidth application

I want to send unidirectional streaming data over a WebRTC datachannel, and am looking for the best configuration options (high BW, low latency/jitter) and others' experience with expected bitrates in this kind of application.
My test program sends chunks of 2k, with a bufferedAmountLowThreshold event callback of 2k, and calls send again until bufferedAmount exceeds 16k. Using this in Chrome, I achieve ~135 Mbit/s on LAN and ~20 Mbit/s from/to a remote connection which has a 100 Mbit/s WAN connection on both ends.
What is the limiting factor here?
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
My ultimate application will use the google-webrtc library on Android - I'm only using JS for prototyping. Can I set options in the library to increase the bitrate that I cannot set in the official JS APIs?
There are many variables that impact throughput and it also highly depends on how you've measured it. But I'll list a couple of things I have adjusted to increase the throughput of WebRTC data channels.
Disclaimer: I have not done these adjustments for libwebrtc but for my own WebRTC data channel library called RAWRTC, which btw also compiles for Android. However, both use the same SCTP library underneath, both use some OpenSSL-ish library and UDP sockets, so all of this should be applicable to libwebrtc.
Note that WebRTC data channel implementations using usrsctp are usually CPU bound when executed on the same machine, so keep that in mind when testing. With RAWRTC's default settings, I'm able to achieve ~520 Mbit/s on my i7 5820k. From my own tests, both Chrom(e|ium) and Firefox were able to achieve ~350 Mbit/s with default settings.
Alright, so let's dive into adjustments...
UDP Send/Receive Buffer Size
The default send/receive buffer of UDP sockets in Linux is quite small. If you can, you may want to adjust it.
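For example, something along these lines (the 4 MiB value is just an assumption matching the end results below; note the kernel caps the value at net.core.rmem_max / net.core.wmem_max, so you may need to raise those via sysctl as well):

#include <sys/socket.h>
#include <stdio.h>

void enlarge_udp_buffers(int sock)
{
    int size = 4 * 1024 * 1024;   /* 4 MiB, matching the end results below */
    socklen_t len = sizeof(size);

    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

    /* Check what the kernel actually applied (Linux reports twice the requested value). */
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("effective receive buffer: %d bytes\n", size);
}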
DTLS Cipher Suites
Most Android devices have ARM processors without hardware AES support. ChaCha20 usually performs better in software and thus you may want to prefer it.
(This is what RAWRTC negotiates by default, so I have not included it in the end results.)
SCTP Send/Receive Buffer Size
The default send/receive window size of usrsctp, the SCTP stack used by libwebrtc, is 256 KiB which is way too small to achieve high throughput with moderate delay. The theoretical maximum throughput is limited by mbits = (window / (rtt_ms / 1000)) / 131072. So, with the default window of window=262144 and a fairly moderate RTT of rtt_ms=20, you will end up with a theoretical maximum of 100 Mbit/s.
The practical maximum is below that... actually, way lower than the theoretical maximum (see my test results). This may be a bug in the usrsctp stack (see sctplab/usrsctp#245).
The buffer size has been increased in Firefox (see bug 1051685) but not in libwebrtc used by Chrom(e|ium).
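For the window/RTT bound above, a tiny helper makes the numbers easy to play with (the values used here are just the examples already mentioned):

#include <stdio.h>

/* Theoretical maximum throughput in Mbit/s for a given window and RTT:
 * bytes per second divided by 131072 (the number of bytes in 1 Mbit). */
static double max_throughput_mbit(double window_bytes, double rtt_ms)
{
    return (window_bytes / (rtt_ms / 1000.0)) / 131072.0;
}

int main(void)
{
    printf("%.1f Mbit/s\n", max_throughput_mbit(262144, 20));          /* default window -> 100.0  */
    printf("%.1f Mbit/s\n", max_throughput_mbit(4 * 1024 * 1024, 20)); /* 4 MiB window  -> 1600.0 */
    return 0;
}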
Release Builds
Optimisation level 3 makes a difference (duh!).
Message Size
You probably want to send 256 KiB sized messages.
Unless you need to support Chrome < ??? (sorry, I currently don't know where it landed...), then the maximum message size is 64 KiB (see issue 7774).
Unless you also need to support Firefox < 56, in which case the maximum message size is 16 KiB (see bug 979417).
It also depends on how much you send before you pause sending (i.e. the buffer's high water mark), and when you continue sending after the buffer has been drained (i.e. the buffer's low water mark). My tests have shown that targeting a high water mark of 1 MiB and setting a low water mark of 256 KiB results in adequate throughput.
This reduces the number of API calls and can increase throughput.
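As a sketch of that high/low water mark pattern (the dc_* functions are hypothetical stand-ins for whatever data channel API you use; in the W3C API they correspond to bufferedAmount, send and the bufferedamountlow event):

#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE      (256 * 1024)   /* 256 KiB messages */
#define HIGH_WATER_MARK (1024 * 1024)  /* pause sending above 1 MiB buffered */

extern size_t dc_buffered_amount(void);               /* hypothetical: bytes queued but not yet sent */
extern void dc_send(const uint8_t *data, size_t len); /* hypothetical: queue one message */

/* Call this initially and again from the buffered-amount-low callback
 * (with the threshold set to the 256 KiB low water mark). */
void pump(const uint8_t *chunk /* CHUNK_SIZE bytes of payload */)
{
    while (dc_buffered_amount() < HIGH_WATER_MARK)
        dc_send(chunk, CHUNK_SIZE);
}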
End Results
Using optimisation level 3 with default settings on RAWRTC brought me up to ~600 Mbit/s.
Based on that, increasing the SCTP and UDP buffer sizes to 4 MiB brought me up further to ~700 Mbit/s, with one CPU core at 100% load.
However, I believe there is still room for improvements but it's unlikely to be low-hanging.
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
Open about:webrtc in Firefox or chrome://webrtc-internals in Chrom(e|ium) and look for the chosen ICE candidate pair. Or use Wireshark.

Obtaining a fast ADC sample rate in embedded linux with an external ADC

I've been given the task of getting ADC samples onto an embedded Linux computer at the highest rate I can (up to about 300 kSPS). I am playing with several different platforms (Odroid, Edison), but early on I realized the limitations of using the built-in ADCs from within Linux, and of the timing (I am relatively new to this).
Right now I am reliably getting 150 kSPS using a Teensy 3.2 with a very basic swapping buffer, a PDB, and the USB connection. USB writes take 2.5 us regardless of my buffer size, so any faster and the ADC read interrupt collides with the USB write and I get nothing.
My question is: Would using an external ADC chip enable faster speeds? I see chips on Digikey and Mouser advertising 600 kSPS and higher with SPI and even parallel outputs... but I feel like the bottleneck is the Teensy with its USB writes. Even if it could (and I am sure it could) read values 600k times a second, how do you get them onto the computer without falling behind?
Also, it is for long-term collection, so I can't just store everything and write it out once the collection is over. The Edison has a built-in microcontroller, but no SPI is implemented yet.
Edit:
To clarify, my question is whether there is any way to get large amounts of data very fast into my embedded Linux device programmatically, or whether there is some layer between a fast SPI device and the computer that I don't know about. So far my mentors have suggested I 1) learn to write a device driver for the SPI device or 2) recompile an image with RT_PREEMPT.
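To illustrate what I mean by "programmatically": one such layer is the kernel's spidev interface, which lets you drive an SPI device from user space without writing a full driver. A minimal sketch, assuming the controller is exposed as /dev/spidev0.0 and a hypothetical external ADC that returns 2 bytes per conversion (at 300 kSPS that is only ~600 KB/s of raw data; the hard part is keeping the conversion timing even, which is where a kernel driver or RT_PREEMPT comes in):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/spi/spidev.h>

/* Clock in 'len' bytes of conversions from the ADC in one SPI transaction. */
int read_adc_block(int fd, uint8_t *rx, size_t len)
{
    struct spi_ioc_transfer tr;

    memset(&tr, 0, sizeof(tr));
    tr.rx_buf = (unsigned long)rx;  /* receive buffer */
    tr.len = len;
    tr.speed_hz = 10000000;         /* 10 MHz SPI clock, adjust for your ADC */
    tr.bits_per_word = 8;

    return ioctl(fd, SPI_IOC_MESSAGE(1), &tr);  /* returns bytes transferred or -1 */
}

int main(void)
{
    uint8_t buf[4096];
    int fd = open("/dev/spidev0.0", O_RDWR);

    if (fd < 0)
        return 1;
    read_adc_block(fd, buf, sizeof(buf));
    close(fd);
    return 0;
}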

Configuration registers for LPC bus in Poulsbo System Controller Hub (US15W)

We have a system based around an Atom Z510/Intel SCH US15W Q7 card (running Debian Linux.) We need to transfer blocks of data from a device on the Low Pin Count Bus. As far as I know this chipset does not provide DMA facilities, meaning the processor has to read the data out a byte at a time in a software loop. (The device driver actually implements this using the "rep insb" x86 instructions so the loop is actually implemented by the CPU if I understand correctly.)
This is far from optimal, but it should be possible to hit a transfer rate of 14Mb/s. Instead we can barely manage 4Mb/s, with transactions on the bus no closer than 2us apart even though each read to the slave device is done in 560ns. I don't believe other traffic on the bus is to blame, but am still investigating.
My question is:
Does any one know if there are any configuration registers on the SCH that could affect the LPC bus timing?
I cannot find any useful information on the device on the Intel website, nor have I spotted anything in the Linux kernel code that appears to be fiddling with any such registers (but I'm a noob when it comes to Linux kernel stuff).
I'm not an x86 expert so any other factors that might come into play or any other 'war stories' relating to this device would be good to know about too.
Edit: I have found the datasheet. I've not seen anything in it that explains this behaviour, but I am investigating the possibility of mapping our device as a firmware device as the firmware bus cycles don't seem to suffer the same delays.
For the record, the solution was to modify the FPGA firmware such that the chip's data in/out register was mapped to four adjacent addresses, and the driver was modified to use 32-bit port I/O (inl/insl) instructions. Although the SCH does not implement 32-bit LPC read/write operations, the result is 4 back-to-back 8-bit operations followed by the same dead time as I was getting previously with a single byte, meaning it averages about 1us per byte. Not ideal, but still a doubling in throughput.
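To illustrate the change, a user-space equivalent of the 32-bit access looks like this (the real fix was in the kernel driver, using inl/insl instead of inb/insb; the port address 0x1000 is a placeholder, not the real register):

#include <sys/io.h>
#include <stdint.h>

#define DATA_PORT 0x1000   /* placeholder: the register mapped to four adjacent byte addresses */

uint32_t read_four_bytes(void)
{
    /* Requires iopl(3) (or ioperm() for ports below 0x400) and root privileges. */
    return inl(DATA_PORT); /* one 32-bit read -> 4 back-to-back 8-bit LPC reads on this chipset */
}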
It transpires the firmware cycles were quicker because the SCH transfers 64 bytes at a time from the firmware flash - after 64 bytes there is the same 1.4us gap, indicating this is the per-transaction latency of the device. Exploiting this may have been slightly quicker than the above solution, however the trade-off is that it is limited to 64-byte chunks and each byte takes longer (680ns IIRC) due to the additional cycles required to do a firmware read.

Control stepper motors via USB

I'm making a USB device to control stepper motors. I've done this before using a parallel port. Because these ports do not exist on current motherboards, I decided to implement USB communication between my device and the PC (host).
To achieve my objective, I equipped the device with a Freescale microcontroller that has a 12 Mbps USB module.
My USB device must receive 4 bytes (one for each motor driver) at a given time, because every byte is a step that should move the motor.
On the PC (host), a user application processes a text file with the trajectory information and generates the coordinates, sending bytes at a certain rate for each motor (the timing is what determines the acceleration and speed of the motors).
Using the parallel port this was an easy task, because each byte was sent sequentially at a time determined by the user app.
Doing a little research about the full-speed USB protocol, I understood that a frame is sent every 1 ms.
So I can send 4 bytes, or many more, every 1 ms, but I cannot manage the timing like I did with the parallel port.
My microcontroller can send up to 64 bytes per frame (based on the transfer types: Control, Bulk, Interrupt, Isochronous).
question 1:
I want to know in what way I can send 4-byte packets faster than every 1 ms?
question 2:
What type of transfer can advise me for these type of devices?
Thanks.
Like Ricardo said, USB-serial will suffice.
As for the type of transfer, try implementing a CDC stack and use your SCI receiver to listen for PC commands. That will give you a receive buffer which will meet your needs.
Initialize your SCI (baud, etc)
Enable receiver and interrupt
On data receive, move it to your 4-byte command buffer
Clear receive buffer, wait for more
When you have all 4 bytes, fire off the steppers! Four bytes should take only microseconds; a sketch of the receive interrupt follows.
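A hedged sketch of that receive interrupt, with placeholder register and function names (SCID and the RDRF behaviour are in the spirit of Freescale parts; check your device's header for the real names):

#include <stdint.h>

extern volatile uint8_t SCID;                     /* placeholder: SCI data register */
extern void step_motors(uint8_t a, uint8_t b, uint8_t c, uint8_t d); /* placeholder */

static volatile uint8_t cmd[4];
static volatile uint8_t cmd_idx;

void SCI_rx_isr(void)                             /* hook this to the SCI receive vector */
{
    uint8_t byte = SCID;                          /* reading the data register clears the RDRF flag */

    cmd[cmd_idx++] = byte;                        /* accumulate the 4-byte command */
    if (cmd_idx == 4) {
        cmd_idx = 0;
        step_motors(cmd[0], cmd[1], cmd[2], cmd[3]); /* one step pattern per motor driver */
    }
}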
Check with Freescale to see if your processor is supported.
http://cache.freescale.com/files/microcontrollers/doc/support_info/USB_STACK_RELEASE_NOTES_V4.1.1.pdf?fpsp=1
There might even be some sample code to get you started.
-Cheers
I am achieving the same goal (driving/control CNC machines) like this:
The USB device is just a synchronous parallel I/O port. Using continuous bulk transfers, one pipe as input and one as output, I was able to achieve synchronous 64-bit parallel communication with a ~70 kHz sample rate. It uses traffic around (in) 4.27 + (out) 4.27 Mbit/s, which is the limit for my MCU and code. Higher speeds cause jitter on the output due to USB event interrupts.
How to do it (on MCU side)
I have two FIFOs, one for incoming and one for outgoing data. I have a timer interrupt occurring at the sample-rate frequency. In it I read the inputs and feed them into the first FIFO, and read data from the other FIFO and send it to the outputs.
On top of that, the USB task is called (inside the same interrupt), checking the FIFOs for data to send and for incoming data from USB, and handling the transfer itself.
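A rough sketch of that scheme (FIFO_SIZE, read_inputs, write_outputs and usb_task are placeholders, not a specific vendor API):

#include <stdint.h>

#define FIFO_SIZE 4096                      /* power of two -> cheap index wrap */

typedef struct {
    uint8_t data[FIFO_SIZE];
    volatile uint16_t head, tail;
} fifo_t;

static fifo_t in_fifo, out_fifo;            /* device -> host, host -> device */

extern uint8_t read_inputs(void);           /* placeholder: sample the input pins */
extern void write_outputs(uint8_t v);       /* placeholder: drive the output pins */
extern void usb_task(void);                 /* placeholder: move FIFO data over the bulk pipes */

void timer_isr(void)                        /* fires once per sample period */
{
    /* capture inputs into the outgoing-to-host FIFO */
    in_fifo.data[in_fifo.head++ & (FIFO_SIZE - 1)] = read_inputs();

    /* drive outputs if the host has supplied data */
    if (out_fifo.tail != out_fifo.head)
        write_outputs(out_fifo.data[out_fifo.tail++ & (FIFO_SIZE - 1)]);

    usb_task();                             /* service the bulk endpoints */
}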
I chose ATMEL AT32UC3A chips for this task. After long and painful research I decided on these MCUs because they have enough memory for both FIFOs and the program, so there is no need for an additional IC. They come in a TQFP package which I can use (BGA is not an option for me). They have HS USB (most USB MCUs have only FS, like yours). They run at 66 MHz. They support many interesting features (I did interesting projects with them in the past) and of course I have experience with ATMEL MCUs.
So if you want to achieve something similar, then:
start with bulk transfer (PC -> USB -> MCU -> output)
add FIFO if needed
I do not know the sample rate you need. The old LPT ports could handle 80-196 kHz depending on the manufacturer. The modern ones are much, much slower (which is silly and sad).
measure the critical sample rate
you need an oscilloscope or very good hearing for this. The output data must be synchronous, with no holes in it, no jitter, etc.
If any of these are present, you have to lower the sample rate. My setup could handle even a 1 MHz sample rate, but USB jitter was present (sometimes a USB event froze the sending for longer than one sample...), so I achieved only 70 kHz of stable output.
if you also need inputs, add them
but only if the output is working as it should. Do not forget to lower the sample rate after this too... Use separate bulk pipes and FIFOs for input and output.

Why is USB not interrupt-driven?

What's the rationale behind making USB a polling mechanism rather than interrupt-driven? The reasons I can come up with are:
Leave control of processing efficiency and granularity to OS, rather than the device itself.
Prevent "interrupt storms" by faulty devices.
Some explanations on the net that I found say that it's mostly because of the nature of USB devices. They are mostly microcontroller-based systems which cannot queue larger transfers and therefore require short interrupt intervals, and such short intervals may not be the most efficient. Is that true?
Could there be other reasons?
The overarching premise of the development of USB was "cheap chips". This was achieved through the use of polling, which reduces the need for a higher-level arbitration protocol.
FireWire, which did allow for interrupts from the devices and even DMA, was much more expensive. So USB won in the low-cost field, and FireWire in the low-latency/low-overhead/... field. Due to history, USB more or less won.
What's the rationale behind making USB a polling mechanism rather than interrupt-driven?
This seems to be anti-USB FUD (as in Fear-Uncertainty-Doubt).
The reason is that this simplifies things at the hardware level quite a bit - no more collisions, for example. USB is half-duplex to reduce the number of wires in the cable, so only one side can talk anyway.
While USB uses polling on the wire, once you use it in software you will notice that you do have interrupts in USB. The only issue is a slight increase in latency - negligible in most use cases. Since the polling is usually realized in hardware IIRC, software only gets notified if there is new data.
On the software level, there are so-called "interrupt endpoints" - and guess what, every HID device uses them: mice, keyboards and joysticks are HID.
There are three ways to avoid data transfer collisions on a bus:
Have a somewhat complex bus management protocol. Such a protocol has to be rather sophisticated, as when too simple, it will make the bus rather slow (see Token Ring, which is rather simple, yet inefficient). However, having a sophisticated protocol makes all components expensive, as all of them require management logic and need to understand how the bus actually works (see FireWire).
Don't avoid them at all, allow them but detect and handle them. This is also somewhat complex and the bus cannot guarantee any speed or latency, as if there are constantly collisions, throughput will fall and latency will rise (see Ethernet without switches, see WiFi).
Have one bus master that controls who can use the bus at which time and for how long. This is inexpensive, as only the master must be sophisticated and the master can give any possible guarantee regarding speed or latency.
And (3) is how USB works. Not even the master USB chips need to be sophisticated, as the master is usually a computer with a fast CPU that can perform all bus management in software.
USB device chips are dumb as toast. They don't need to understand the nifty details of the bus. They just have to look for packets addressed to them, which are either control packets (e.g. requesting metadata or selecting a configuration), data packets sent by the master, or poll requests from the master saying "if you have something to send, the bus is now yours".
To make sure the master polls them on time, they hand out a simple description table on request, that explains which endpoints they offer, how often those need to be polled, and how much data they will at most transfer when being polled. The master can use that information to build up a poll schedule that ensures that all devices are polled on time and get the bus for long enough to allow their maximum transfer size. Of course, this is not possible under all circumstances. If you connect too many devices that require very frequent polling and always want to send a lot of data, your system may refuse to add a new device with the error, that its poll requirements cannot be satisfied anymore. Yet that situation is rare in practice and USB is limited to 127 devices (hubs count as devices, so does the master itself).
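That "description table" is the device's set of descriptors. As an illustration, a standard interrupt-IN endpoint descriptor carries exactly the two numbers the master needs for its schedule: the maximum packet size and the polling interval (field layout per the USB spec; the concrete values below are just an example):

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  bLength;          /* 7 */
    uint8_t  bDescriptorType;  /* 0x05 = ENDPOINT */
    uint8_t  bEndpointAddress; /* 0x81 = IN endpoint 1 */
    uint8_t  bmAttributes;     /* 0x03 = interrupt */
    uint16_t wMaxPacketSize;   /* at most this many bytes per poll */
    uint8_t  bInterval;        /* poll me every N (micro)frames */
} usb_endpoint_descriptor_t;
#pragma pack(pop)

/* e.g. a mouse might report: 8 bytes max, polled every 10 ms at full speed */
static const usb_endpoint_descriptor_t ep_desc = { 7, 0x05, 0x81, 0x03, 8, 10 };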
Power management works in a similar way. Every device tells the master how much power it needs and the master ensures that the bus can still deliver that much, taking active hubs into account. If you connect another device and the bus cannot power it anymore, adding the device will fail with an error.
This allows a fairly complex, powerful, and fast bus system, yet with components that don't even need a real CPU. The simplest USB chips out there are just bridges to a serial data line (like an internal RS-232 bus or an I2C bus); there is nothing really configurable and they cannot run software or have a firmware one could update. They just place incoming data packets into a buffer and then send the buffer content bit for bit over the serial bus, and they receive serial data into another buffer and return the buffer content when being polled. As for telling the master the configuration (including device and vendor IDs, as well as human-readable strings), they simply send the content of a small external EPROM. Things cannot get much simpler than that, yet such a chip is enough to build plenty of USB hardware already.