WebRTC DataChannel for a high-bandwidth application

I want to send unidirectional streaming data over a WebRTC datachannel, and I'm looking for the best configuration options (high bandwidth, low latency/jitter) and others' experience with expected bitrates in this kind of application.
My test program sends chunks of 2 KiB, with bufferedAmountLowThreshold set to 2 KiB; the event callback calls send again until bufferedAmount exceeds 16 KiB. Using this in Chrome, I achieve ~135 Mbit/s on the LAN and ~20 Mbit/s to/from a remote peer, with a 100 Mbit/s WAN connection on both ends.
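For reference, the send loop in my prototype looks roughly like this (simplified sketch; dc is an already-open RTCDataChannel and nextChunk() is a placeholder for my data source):

```javascript
const CHUNK_SIZE = 2 * 1024;        // 2 KiB per send() call
const HIGH_WATER_MARK = 16 * 1024;  // stop queueing above 16 KiB buffered
const LOW_WATER_MARK = 2 * 1024;    // resume once the buffer drains to 2 KiB

dc.bufferedAmountLowThreshold = LOW_WATER_MARK;

function pump() {
  // Keep queueing chunks until the channel's internal buffer is "full".
  while (dc.bufferedAmount <= HIGH_WATER_MARK) {
    dc.send(nextChunk(CHUNK_SIZE)); // nextChunk() stands in for the real data source
  }
}

// Fires when bufferedAmount drops to or below bufferedAmountLowThreshold.
dc.addEventListener('bufferedamountlow', pump);
pump();
```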
What is the limiting factor here?
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
My ultimate application will use the google-webrtc library on Android - I'm only using JS for prototyping. Are there options I can set in the library to increase the bitrate that I cannot set through the official JS APIs?

There are many variables that impact throughput and it also highly depends on how you've measured it. But I'll list a couple of things I have adjusted to increase the throughput of WebRTC data channels.
Disclaimer: I have not made these adjustments for libwebrtc but for my own WebRTC data channel library called RAWRTC, which btw also compiles for Android. However, both use the same SCTP library underneath, both use some OpenSSL-ish library and UDP sockets, so all of this should be applicable to libwebrtc.
Note that WebRTC data channel implementations using usrsctp are usually CPU bound when executed on the same machine, so keep that in mind when testing. With RAWRTC's default settings, I'm able to achieve ~520 Mbit/s on my i7 5820k. From my own tests, both Chrom(e|ium) and Firefox were able to achieve ~350 Mbit/s with default settings.
Alright, so let's dive into adjustments...
UDP Send/Receive Buffer Size
The send/receive buffers of UDP sockets on Linux are quite small by default. If you can, you may want to increase them.
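libwebrtc obviously doesn't go through Node, but just to illustrate the socket-level knob, this is roughly what raising the buffers looks like on a plain UDP socket in Node.js (a sketch; on Linux the kernel caps the requested sizes at net.core.rmem_max / net.core.wmem_max, so you may need to raise those sysctls as well):

```javascript
const dgram = require('dgram');

// Request 4 MiB send/receive buffers for this UDP socket
// (the equivalent of setsockopt() with SO_SNDBUF / SO_RCVBUF).
const socket = dgram.createSocket({
  type: 'udp4',
  recvBufferSize: 4 * 1024 * 1024,
  sendBufferSize: 4 * 1024 * 1024,
});

socket.bind(0, () => {
  // The effective sizes may be capped by net.core.rmem_max / net.core.wmem_max.
  console.log('recv buffer:', socket.getRecvBufferSize());
  console.log('send buffer:', socket.getSendBufferSize());
});
```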
DTLS Cipher Suites
Most Android devices have ARM processors without hardware AES support. ChaCha20 usually performs better in software and thus you may want to prefer it.
(This is what RAWRTC negotiates by default, so I have not included it in the end results.)
SCTP Send/Receive Buffer Size
The default send/receive window size of usrsctp, the SCTP stack used by libwebrtc, is 256 KiB which is way too small to achieve high throughput with moderate delay. The theoretical maximum throughput is limited by mbits = (window / (rtt_ms / 1000)) / 131072. So, with the default window of window=262144 and a fairly moderate RTT of rtt_ms=20, you will end up with a theoretical maximum of 100 Mbit/s.
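Plugging a few numbers into that formula shows the effect (quick sketch; 4 MiB is the window size I end up using in the end results below):

```javascript
// Theoretical maximum throughput for a given SCTP window and RTT:
// mbits = (window / (rtt_ms / 1000)) / 131072
function maxThroughputMbit(windowBytes, rttMs) {
  return (windowBytes / (rttMs / 1000)) / 131072;
}

console.log(maxThroughputMbit(256 * 1024, 20));      // default 256 KiB window:  100 Mbit/s
console.log(maxThroughputMbit(4 * 1024 * 1024, 20)); // 4 MiB window:           1600 Mbit/s
```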
The practical maximum is below that... actually, way lower than the theoretical maximum (see my test results). This may be a bug in the usrsctp stack (see sctplab/usrsctp#245).
The buffer size has been increased in Firefox (see bug 1051685) but not in libwebrtc used by Chrom(e|ium).
Release Builds
Optimisation level 3 makes a difference (duh!).
Message Size
You probably want to send 256 KiB sized messages.
Unless you need to support Chrome < ??? (sorry, I currently don't know where it landed...), then the maximum message size is 64 KiB (see issue 7774).
Unless you also need to support Firefox < 56, in which case the maximum message size is 16 KiB (see bug 979417).
It also depends on how much you send before you pause sending (i.e. the buffer's high water mark), and when you continue sending after the buffer has been drained (i.e. the buffer's low water mark). My tests have shown that targeting a high water mark of 1 MiB and setting a low water mark of 256 KiB results in adequate throughput.
This reduces the number of API calls and can increase throughput.
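If you need to interoperate with the older browsers mentioned above, you can (in JS at least) read the negotiated limit at runtime instead of hard-coding a message size. Something along these lines (a sketch, assuming pc is your RTCPeerConnection and its SCTP transport is already established):

```javascript
// Pick the largest message size the peer advertises, capped at 256 KiB.
function chooseMessageSize(pc) {
  const negotiated = pc.sctp && pc.sctp.maxMessageSize; // RTCSctpTransport.maxMessageSize
  const fallback = 16 * 1024;   // conservative value for very old peers
  const target = 256 * 1024;    // what we would like to send per message
  return Math.min(target, negotiated || fallback);
}
```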
End Results
Using optimisation level 3 with default settings on RAWRTC brought me up to ~600 Mbit/s.
Based on that, increasing the SCTP and UDP buffer sizes to 4 MiB brought me up further to ~700 Mbit/s, with one CPU core at 100% load.
However, I believe there is still room for improvement, but it's unlikely to be low-hanging fruit.
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
Open about:webrtc in Firefox or chrome://webrtc-internals in Chrom(e|ium) and look for the chosen ICE candidate pair. Or use Wireshark.
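If you'd rather check programmatically, the stats API exposes the selected candidate pair. A rough sketch (browser JS, assuming pc is your RTCPeerConnection):

```javascript
async function usesTurnRelay(pc) {
  const stats = await pc.getStats();
  for (const report of stats.values()) {
    // The nominated, succeeded pair is the one actually carrying traffic.
    if (report.type === 'candidate-pair' && report.state === 'succeeded' && report.nominated) {
      const local = stats.get(report.localCandidateId);
      const remote = stats.get(report.remoteCandidateId);
      // candidateType 'relay' means the traffic goes through a TURN server.
      return (local && local.candidateType === 'relay') ||
             (remote && remote.candidateType === 'relay');
    }
  }
  return false; // no active pair found (yet)
}
```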

Related

Resource usage of a static web server

I came across this question in a blog post. It was asked by Mozilla in their internship interview. (Blog Post)
You are running a HTTP server (nginx, Apache, etc) that is configured to serve static files off the local filesystem of your modern, multi-core server connected to a gigabit network. A handful of clients start requesting the same 4kb static file as fast as they can. What system resource do you think will be exhausted first?
a. CPU
b. Disk / I/O
c. Memory
d. Network
e. Other
In my opinion, none of these would be exhausted on a modern machine with Nginx/Apache. Won't the web server cache such a small file and just keep serving it? Also, for repeated requests it can easily send a Not-Modified header.
In the case of Apache, since it handles multiple clients by spawning threads, I'd guess the CPU would be exhausted first, but for a "handful" of clients that won't matter.
I wanted to know what others have to say about this question.
It reeeeeeeeally depends. 4k is that magical size that fits into just about all caches and buffers at their default settings, so it is easy (and fast) to pass around. Memory is not a limiting factor here, as web servers operate on file handles, not entire files. In this case I would assume they keep it right in memory, but that would be one file per worker instance, which would usually come down to 4kb * (num_cores + 1) at most - not really an issue.
One could argue that either memory or disk speed is the issue. But the former is negligible when methods like sendfile are properly configured, enabling a zero-copy approach, and the latter amortizes over time once a copy of the file has been loaded into memory.
Lastly, there's the interface and the CPU(s). Overall, CPU time tends to be a lot cheaper than network time, so I would expect the NIC to be the bottleneck long before the CPU - if at all.
The question is a bit unspecific on the location of the clients. If they are connected to the same GbE network, they could indeed have the power to saturate your NIC with their requests. If not, some intermediary could become the limiting factor.
Now let us assume those clients are in our network and we have a single-homed 10GbE NIC here, connected via 8 lanes (which is fairly standard IMHO): PCIe 3.0 x8 is specified at 7,877 MB/s. A Core i7 3770 has a bus speed of 5 GT/s, which translates to roughly 8 GB/s over 8 lanes. Assuming no other I/O-intensive workload, this CPU could easily saturate the NIC.
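To put a rough number on what saturating the NIC means for the 4kb file from the question (a back-of-envelope sketch, ignoring HTTP and TCP/IP header overhead):

```javascript
// How many 4 KiB responses per second does it take to fill the link?
function requestsToSaturate(linkGbit, fileKiB) {
  const linkBytesPerSec = (linkGbit * 1e9) / 8;
  return Math.round(linkBytesPerSec / (fileKiB * 1024));
}

console.log(requestsToSaturate(1, 4));  // gigabit: roughly 30,500 requests/s
console.log(requestsToSaturate(10, 4)); // 10GbE:   roughly 305,000 requests/s
```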
So in summary: Network/NIC saturation before CPU saturation before anything else.

What is the minimum latency of USB 3.0

First up, I don't know much about USB, so apologies in advance if my question is wrong.
In USB 2.0 the polling interval was 0.125ms, so the best possible latency for the host to read some data from the device was 0.125ms. I'm hoping for reduced latency in USB 3.0 devices, but I'm finding it hard to learn what the minimum latency is. The USB 3.0 spec says, "USB 2.0 style polling has been replaced with asynchronous notifications", which implies the 0.125ms polling interval may no longer be a limit.
I found some benchmarks for USB 3.0 SSDs that suggest data can be read from the device in just slightly less than 0.125ms, and that includes all the time spent in the host OS and the device's flash controller.
http://www.guru3d.com/articles_pages/ocz_enyo_usb_3_portable_ssd_review,8.html
Can someone tell me what the lowest possible latency is? A theoretical answer is fine. An answer including the practical limits of the various versions of Linux and Windows USB stacks would be awesome.
To head off the "tell me what you're trying to achieve" question: I'm creating a debug interface for the ASICs my company designs, i.e. a PC connects to one of our ASICs via a debug dongle. One possible use case is to implement conditional breakpoints when the ASIC hardware only implements simple breakpoints. To do so, I need to detect when a simple breakpoint has been hit, evaluate the condition, and if it is false, set the processor running again. The simple breakpoint may be hit millions of times before the condition becomes true. We might implement the debug dongle on an FPGA or an off-the-shelf USB 3.0 enabled micro-controller.
Answering my own question...
I've come to realise that this question kind-of misses the point of USB 3.0. Unlike 2.0, it is not a shared-bus system. Instead it uses a point-to-point link between the host and each device (I'm oversimplifying but the gist is true). With USB 2.0, the 125 us polling interval was critical to how the bus was time-division multiplexed between devices. However, because 3.0 uses point-to-point links, there is no multiplexing to be done and thus the polling interval no longer exists. As a result, the latency on packet delivery is much less than with USB 2.0.
In my experiments with a Cypress FX-3 devkit, I have found that it is easy enough to get a round trip from a Windows application to the device and back with an average latency of 30 us. I suspect that the vast majority of that time is spent in various OS delays, e.g. the user-space to kernel-space mode switch and the DPC latency within the driver.
I've got a couple of resources for you. One, which I've just downloaded, is the complete spec ... several PDFs zipped up for USB 3. Here is a short excerpt from pages 58-59 (USB 3_r1.0_06_06_2011.pdf):
USB 2.0 transmits SOF/uSOF at fixed 1 ms/125 μs intervals. A device driver may change the interval with small finite adjustments depending on the implementation of host and system software. USB 3.0 adds mechanism for devices to send a Bus Interval Adjustment Message that is used by the host to adjust its 125 μs bus interval up to +/-13.333 μs.
In addition, the host may send an Isochronous Timestamp Packet (ITP) within a relaxed timing window from a bus interval boundary.
Here is one more resource which looked interesting and which deals with calculating latency.
You make a good point about operating system latency issues, especially in not real time operating systems.
I might suggest that you check on SuperUser too, maybe someone has other ideas. CHEERS
I dispute the marked answer.
On Windows there is no way to achieve the stated roundtrip latency over USB. SuperSpeed (3.0) or not. The documentation states:
The number of isochronous packets must be a multiple of the number of packets per frame.
https://learn.microsoft.com/en-us/windows-hardware/drivers/usbcon/transfer-data-to-isochronous-endpoints
The packets per frame is given by bInterval and also determines the polling interval. E.g. if you want to achieve a transfer every microframe (125 usec) you will need to submit 8 transfers per URB (USB Request Block), which means a scheduling service interval of 1 ms.
Anything else requires your own kernel-mode driver or is out-of-spec.
On RT Linux I can confirm roundtrips of 2*125usec + some overhead.
Excerpts from embedded.com: "USB 3.0 vs USB 2.0: A quick reference summary for the busy engineer"
Communication architecture differences
USB 2.0 employs a communication architecture where the data transaction must be initiated by the host. The host will frequently poll the device and ask for data, and the device may only transmit data once it has been requested by the host. The high polling frequency not only increases power consumption, it increases transmission latency because the data can only be transmitted when the device is polled by the host. USB 3.0 improves upon this communication model and reduces transmission latency by minimizing polling and also allowing devices to transmit data as soon as it is ready.
...
Timestamp enhancements
Unlike USB 2.0 cameras, which can range in accuracy from 0 to 125 us, the timestamp originating from USB 3.0 cameras is more precise, and mimics the accuracy of the 1394 cycle timer of FireWire cameras.
...
USB 3.0 -- or Super-speed USB -- overcomes key limitations of other specifications with six (over IEEE 1394b) to nine (over USB 2.0) times higher bandwidth, better error management, higher power supply, ... and lower latency and jitter times.
P.S. It also mentions "longer cable lengths" for USB 3.0, but another paragraph contradicts this and says up to 5 m for USB 2.0 and up to 3 m for USB 3.0.

Faster USB HID output

I'm attempting to speed up a rather sluggish bootloader. Currently I'm sending data on a single USB HID output endpoint, and as it's a low-speed device I'm apparently limited to one 8-byte packet per 10 ms interval for a whopping 800 bytes/second.
Is it possible to increase the reporting frequency somehow? Or to use multiple output endpoints in a single interface or as part of a composite device? Or perhaps to abuse the control endpoint to send additional data?
Better compression is always an alternative I suppose, but it's an area of diminishing returns, and redesigning the hardware to allow full-speed USB isn't really an option.
For the record I'd be happy with a Windows-only solution.
Or perhaps to abuse the control endpoint to send additional data?
You can use "Vendor specific requests" for that. The TI TUSB3410 Chip works that way AFAIK. Many USB stacks have the hooks for them already in place.
This requires a driver or libusb on the host side, however.
I was able to speed up the upload by orders of magnitude by using SET_REPORT requests on the control endpoint, instead of declaring a separate interrupt out endpoint. That way you get all of the bandwidth available for control transfers.
Also using a larger report split into multiple segments helped reduce the number of SETUP packets needed.
Who says you are limited to an 8-byte packet per 10ms? I don't know the exact numbers off the top of my head, but I know you can send larger packets than that. I did an HID device and was using 64-byte packets. I think I could go larger, but that limit is probably hardware-specific. What hardware are you using?
Also, have you consulted USB in a NutShell?
The actual limit is 8 bytes every 10 ms for low-speed devices and 64 bytes every 1 ms for full-speed devices, per interrupt endpoint.
So it seems that the first thing to try is switching to full-speed mode, if the hardware supports it. The next thing on the list is using multiple endpoints. If you really want to get the highest possible transfer rate, the HID class is a bad choice.

Suggestions to increase TCP-level throughput

We have an application requirement where we'll be receiving messages from around 5-10 clients at a rate of 500 KB/sec, doing some internal logic, and then distributing the received messages among 30-35 other network entities.
What TCP-level or thread-level optimizations are suggested?
Sometimes programmers can "shoot themselves in the foot". One example is attempting to increase a linux user-space application's socket buffer size with setsockopt/SO_RCVBUF. On recent Linux distributions, this deactivates auto-tuning of the receive window, leading to poorer performance than what would have been seen had we not pulled the trigger.
~4 Mbit/sec (8 x 500 KB/sec) per TCP connection is well within the capability of well-written code without any special optimizations. This assumes, of course, that your target machine's clock rate is measured in GHz and that it isn't low on RAM.
When you get into the range of 60-80 Mbits/sec per TCP connection, then you begin to hit some bottlenecks that might need profiling and countermeasures.
So to answer your question, unless you're seeing trouble, no TCP or thread optimizations are suggested.

How can I calculate an optimal UDP packet size for a datastream?

I have a short radio link with a data source attached that needs a throughput of 1280 Kbps over IPv6, using a UDP stop-and-wait protocol, with no other clients or noticeable noise sources in the area. How on earth can I calculate the best packet size to minimise overhead?
UPDATE
I thought it would be an idea to show my working so far:
IPv6 has a 40 byte header, so including ACK responses, that's 80 bytes overhead per packet.
To meet the throughput requirement, 1280 K/p packets need to be sent per second, where p is the packet payload size in bits.
So by my reckoning that means the total overhead is (1280 K/p)*(80), and throwing that into Wolfram gives a function with no minimum, so no 'optimal' value.
I did a lot more math trying to shoehorn bit error rate calculations in there, but came up against the same thing; if there's no minimum, how do I choose the optimal value?
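To illustrate the dead end, here's a quick script over a range of payload sizes; the overhead fraction simply keeps falling as the payload grows, which is the 'no minimum' I keep running into (the 80 bytes are the IPv6 headers on the data packet and its ACK, as above):

```javascript
// Fraction of transmitted bytes that is overhead, for a given payload size.
// 80 bytes = 40-byte IPv6 header on the data packet + 40 bytes on the ACK.
const OVERHEAD_BYTES = 80;

function overheadFraction(payloadBytes) {
  return OVERHEAD_BYTES / (payloadBytes + OVERHEAD_BYTES);
}

// 1232 bytes = maximum UDP payload at the IPv6 minimum MTU of 1280 bytes.
for (const p of [128, 256, 512, 1024, 1232]) {
  console.log(p, (overheadFraction(p) * 100).toFixed(1) + '%');
}
// The fraction decreases monotonically - there is no interior minimum.
```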
Your best bet is to use a simulation framework for networks. This is a hard problem, and doesn't have an easy answer.
NS2 or SimPy can help you devise a discrete event simulation to find optimal conditions, if you know your model in terms of packet loss.
Always work with the largest packet size available on the network, then in deployment configure the network MTU for the most reliable setting.
Consider latency requirements: how is the payload being generated? Do you need to wait for sufficient data before sending a packet, or can you send immediately?
The radio channel is already optimized for noise at the low packet level; you will usually have other demands on the implementation, such as power requirements: sending in heavy batches or a light continuous load.