Suggestions to increase TCP-level throughput - optimization

We have an application requirement where we'll be receiving messages from around 5-10 clients at a rate of 500KB/sec, doing some internal logic, and then distributing the received messages among 30-35 other network entities.
What TCP-level or thread-level optimizations are suggested?

Sometimes programmers can "shoot themselves in the foot". One example is attempting to increase a Linux user-space application's socket buffer size with setsockopt/SO_RCVBUF. On recent Linux distributions, this deactivates auto-tuning of the receive window, leading to poorer performance than we would have seen had we not pulled the trigger.

~4Mbits/sec (8 x 500KB/sec) per TCP connection is well within the capability of well-written code without any special optimizations. This assumes, of course, that your target machine's clock rate is measured in GHz and that the machine isn't low on RAM.
When you get into the range of 60-80 Mbits/sec per TCP connection, then you begin to hit some bottlenecks that might need profiling and countermeasures.
So to answer your question, unless you're seeing trouble, no TCP or thread optimizations are suggested.

Related

WebRTC Datachannel for high bandwidth application

I want to send unidirectional streaming data over a WebRTC datachannel, and am looking for the best configuration options (high BW, low latency/jitter) and others' experience with expected bitrates in this kind of application.
My test program sends chunks of 2k, with a bufferedAmountLowThreshold event callback of 2k and calls send again until bufferedAmount exceeds 16k. Using this in Chrome, I achieve ~135Mbit/s on LAN and ~20Mbit/s from/to a remote connection, that has 100Mbit/s WAN connection on both ends.
What is the limiting factor here?
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
My ultimate application will use the google-webrtc library on Android - I'm only using JS for prototyping. Can I set options to speed up bitrate in the library, that I cannot do in official JS APIs?
There are many variables that impact throughput and it also highly depends on how you've measured it. But I'll list a couple of things I have adjusted to increase the throughput of WebRTC data channels.
Disclaimer: I have not done these adjustments for libwebrtc but for my own WebRTC data channel library called RAWRTC, which btw also compiles for Android. However, both use the same SCTP library underneath, both use some OpenSSL-ish library and UDP sockets, so all of this should be applicable to libwebrtc.
Note that WebRTC data channel implementations using usrsctp are usually CPU bound when executed on the same machine, so keep that in mind when testing. With RAWRTC's default settings, I'm able to achieve ~520 Mbit/s on my i7 5820k. From my own tests, both Chrom(e|ium) and Firefox were able to achieve ~350 Mbit/s with default settings.
Alright, so let's dive into adjustments...
UDP Send/Receive Buffer Size
The default send/receive buffer of UDP sockets in Linux is quite small. If you can, you may want to increase it.
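As a rough illustration of that adjustment, here is a minimal Python sketch that asks the kernel for larger UDP buffers and reads back what was actually granted. The helper name `request_udp_buffers` and the 4 MiB figure are assumptions chosen for illustration (4 MiB mirrors the value used in the end results), not something libwebrtc or RAWRTC exposes in this form.

```python
import socket


def request_udp_buffers(sock, size_bytes):
    """Request larger send/receive buffers and return the sizes the kernel reports.

    Hypothetical helper for illustration. The kernel may grant less than
    requested: on Linux the value is capped by net.core.rmem_max and
    net.core.wmem_max, and getsockopt reports double the stored request
    to account for bookkeeping overhead.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return granted_rcv, granted_snd


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        rcv, snd = request_udp_buffers(sock, 4 * 1024 * 1024)
        print(f"receive buffer: {rcv} bytes, send buffer: {snd} bytes")
```

If the reported sizes come back far below the request, the system-wide limits are probably the constraint and would need raising by an administrator.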
DTLS Cipher Suites
Most Android devices have ARM processors without hardware AES support. ChaCha20 usually performs better in software and thus you may want to prefer it.
(This is what RAWRTC negotiates by default, so I have not included it in the end results.)
SCTP Send/Receive Buffer Size
The default send/receive window size of usrsctp, the SCTP stack used by libwebrtc, is 256 KiB which is way too small to achieve high throughput with moderate delay. The theoretical maximum throughput is limited by mbits = (window / (rtt_ms / 1000)) / 131072. So, with the default window of window=262144 and a fairly moderate RTT of rtt_ms=20, you will end up with a theoretical maximum of 100 Mbit/s.
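To make the arithmetic concrete, the formula above can be written as a small Python sketch. The function names are mine, and the 4 MiB window in the usage example is only an illustrative value taken from the end results further down.

```python
BYTES_PER_MBIT = 131072  # 1024 * 1024 / 8, the divisor used in the formula above


def max_throughput_mbit(window_bytes, rtt_ms):
    """Window-limited throughput: at most one full window per round trip."""
    # Same as window / (rtt_ms / 1000), written to avoid dividing by a small float.
    bytes_per_second = window_bytes * 1000 / rtt_ms
    return bytes_per_second / BYTES_PER_MBIT


def window_for_throughput(target_mbit, rtt_ms):
    """Inverse of the above: window size needed to sustain target_mbit."""
    return target_mbit * BYTES_PER_MBIT * rtt_ms / 1000


if __name__ == "__main__":
    # Default usrsctp window of 256 KiB at 20 ms RTT.
    print(max_throughput_mbit(262144, 20))           # 100.0
    # An illustrative 4 MiB window at the same RTT.
    print(max_throughput_mbit(4 * 1024 * 1024, 20))  # 1600.0
```

Note that this is only the upper bound imposed by the window; as the answer goes on to say, measured throughput is lower.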
The practical maximum is below that... actually, way lower than the theoretical maximum (see my test results). This may be a bug in the usrsctp stack (see sctplab/usrsctp#245).
The buffer size has been increased in Firefox (see bug 1051685) but not in libwebrtc used by Chrom(e|ium).
Release Builds
Optimisation level 3 makes a difference (duh!).
Message Size
You probably want to send 256 KiB sized messages.
If you need to support Chrome < ??? (sorry, I currently don't know where it landed...), the maximum message size is 64 KiB (see issue 7774).
If you also need to support Firefox < 56, the maximum message size is 16 KiB (see bug 979417).
It also depends on how much you send before you pause sending (i.e. the buffer's high water mark), and when you continue sending after the buffer has been drained (i.e. the buffer's low water mark). My tests have shown that targeting a high water mark of 1 MiB and setting a low water mark of 256 KiB results in adequate throughput.
This reduces the number of API calls and can increase throughput.
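The high/low water mark pacing can be sketched as a small state machine. This is a simplified Python model, not a WebRTC API: `PacedSender`, `try_send` and `on_drained` are hypothetical names, with `buffered` standing in for the data channel's bufferedAmount and `low_water` for bufferedAmountLowThreshold.

```python
class PacedSender:
    """Toy model of send pacing with a high and a low water mark."""

    def __init__(self, high_water, low_water):
        self.high_water = high_water
        self.low_water = low_water
        self.buffered = 0    # bytes queued but not yet sent on the wire
        self.paused = False  # True once the high water mark has been reached

    def try_send(self, size):
        """Queue a message unless paused. Returns True if it was queued."""
        if self.paused:
            return False
        self.buffered += size
        if self.buffered >= self.high_water:
            self.paused = True
        return True

    def on_drained(self, size):
        """Called when `size` bytes have left the buffer."""
        self.buffered = max(0, self.buffered - size)
        # Resume only once the buffer has fallen to the low water mark.
        if self.paused and self.buffered <= self.low_water:
            self.paused = False


if __name__ == "__main__":
    KIB = 1024
    sender = PacedSender(high_water=1024 * KIB, low_water=256 * KIB)
    sent = 0
    while sender.try_send(256 * KIB):
        sent += 1
    print(sent, sender.paused)  # 4 True
    sender.on_drained(768 * KIB)
    print(sender.paused)        # False
```

The gap between the two marks is what keeps the sender from toggling between paused and running on every message.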
End Results
Using optimisation level 3 with default settings on RAWRTC brought me up to ~600 Mbit/s.
Based on that, increasing the SCTP and UDP buffer sizes to 4 MiB brought me up further to ~700 Mbit/s, with one CPU core at 100% load.
I believe there is still room for improvement, but it's unlikely to be low-hanging fruit.
How can I see if the data is truly going peer to peer directly, or whether a TURN server is used?
Open about:webrtc in Firefox or chrome://webrtc-internals in Chrom(e|ium) and look for the chosen ICE candidate pair. Or use Wireshark.

Are there any advantages of using multiple threads for a file upload?

I got asked in an interview recently to design a file upload feature. After the initial discussion, the interviewer asked if I could design it for multiple threads. My thought was that, as the network bandwidth is limited and the internet is connected through a serial data connection, the network bottleneck will kick in long before the CPU bottleneck, and a multi-threaded implementation would yield only a limited performance improvement. But the interviewer was hell-bent on the multi-threaded approach. What are the arguments in favor of a multi-threaded upload approach? (I recently came to know that AWS has a library which permits uploads on multiple threads. So there should be some advantages I am unaware of.)
A TCP connection can be limited in rate even on a high-speed network because of the bandwidth delay product.
A high bandwidth-delay product is an important problem case in the design of protocols such as Transmission Control Protocol (TCP) in respect of TCP tuning, because the protocol can only achieve optimum throughput if a sender sends a sufficiently large quantity of data before being required to stop and wait until a confirming message is received from the receiver, acknowledging successful receipt of that data. If the quantity of data sent is insufficient compared with the bandwidth-delay product, then the link is not being kept busy and the protocol is operating below peak efficiency for the link.
One easy way to work around TCP limitations on connections with large bandwidth-delay products is to run multiple streams in parallel.
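As a back-of-the-envelope sketch of why parallel streams help, the following Python snippet computes the bandwidth-delay product and estimates how many window-limited streams it would take to fill a link. The function names and the example numbers (a 1000 Mbit/s link, 100 ms RTT, a 64 KiB window per stream) are assumptions for illustration only; real TCP stacks with window scaling and auto-tuning often use far larger windows.

```python
import math


def bdp_bytes(bandwidth_mbit, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to keep the link busy.

    Uses decimal megabits (1 Mbit = 1,000,000 bits).
    """
    return bandwidth_mbit * 1_000_000 / 8 * rtt_ms / 1000


def streams_needed(bandwidth_mbit, rtt_ms, window_bytes):
    """Estimate how many streams, each limited to window_bytes in flight, fill the link."""
    return max(1, math.ceil(bdp_bytes(bandwidth_mbit, rtt_ms) / window_bytes))


if __name__ == "__main__":
    print(bdp_bytes(1000, 100))                  # 12500000.0
    print(streams_needed(1000, 100, 64 * 1024))  # 191
```

A single stream limited to 64 KiB in flight would use only a small fraction of such a link, which is the gap that multi-part uploads try to close.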

How to improve throughput of TUN interface when using Erlang TUNCTL

I'm using TUNCTL with {active, true} to get UDP packets from a TUN interface. The process gets the packets and sends them to a different process that does work and sends them to yet another process that pushes them out a different interface using gen_udp. The same process repeats in the opposite direction, I use gen_udp to get packets and send them to a TUN interface.
I start seeing overruns on the incoming TUN interface when CPU load is close to 50%, about 2500 packets/sec. I never lose any packets on the gen_udp side, only with tunctl. Why is my application not getting all the packets from the TUN interface when the CPU is not overloaded? My process has no messages in its message queue.
I've played with process priorities and buffer sizes, which didn't do much. Total CPU load makes a bit of a difference. I managed to lower CPU load, but even though I saw a slight increase in TUN interface throughput, it now seems to max out at a lower CPU load, say 50% instead of 60%.
Is TUNCTL/Procket not able to read packets fast enough or is TUNCTL/Procket not getting enough CPU time for some reason? My theory is that Erlang Scheduler doesn't know how much time it needs as it's calling a NIF and it doesn't know about the number of unhandled messages on the TUN interface. Do I need to get my hands dirty with C++ and/or write my own NIF? MSANTOS HELP!
As expected, it was a problem with TUNCTL not getting enough CPU time when active is true. I used procket:read which gets the packet from the TUN buffer. Using this approach lets you specify how often to check the buffer, which tells Erlang Scheduler how much time your process needs. This let me load the CPU up to 100% if needed and allowed me to get all the packets from TUN interface that I needed. Bottleneck solved.

What properties make MQTT have a high latency?

The only possible reason that I could think of is the low overhead, i.e. a fixed header size of only 2 bytes minimum, leading to a low packet size. Are there other factors in the design of the protocol?
EDIT: I am sorry, I made a mental typo (?). As @Shashi pointed out, I actually meant high latency, low bandwidth.
MQTT is designed for devices with a small memory footprint, low network bandwidth, etc. Devices such as sensors, energy meters and pacemakers are ideal use cases for MQTT. Low latency means high speed. For low latency you require a different protocol, such as Reliable Multicast running over Gigabit Ethernet or InfiniBand networks.
One of the key factors is that the TCP connection an MQTT client establishes is reused all the time. That means you don't have to establish a new connection for every exchange, as is the case with classic HTTP. Also, as you already suspected, the very low packet size is key here: typical MQTT messages don't have much overhead over the raw TCP packet.
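To show where the 2-byte minimum comes from, here is a small Python sketch of the MQTT 3.1.1 fixed header: one byte for packet type and flags, followed by a variable-length "remaining length" field of one to four bytes. The function names are mine; the packet type numbers (3 for PUBLISH, 12 for PINGREQ) are the ones defined by the MQTT 3.1.1 specification.

```python
MAX_REMAINING_LENGTH = 268_435_455  # largest value the 4-byte encoding can hold


def encode_remaining_length(length):
    """Encode the remaining-length field: 7 bits per byte, high bit means 'more follows'."""
    if not 0 <= length <= MAX_REMAINING_LENGTH:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        byte = length % 128
        length //= 128
        if length > 0:
            byte |= 0x80  # continuation bit
        out.append(byte)
        if length == 0:
            return bytes(out)


def fixed_header(packet_type, flags, remaining_length):
    """Build the fixed header: type in the high nibble, flags in the low nibble."""
    first = (packet_type << 4) | (flags & 0x0F)
    return bytes([first]) + encode_remaining_length(remaining_length)


if __name__ == "__main__":
    PUBLISH, PINGREQ = 3, 12
    print(fixed_header(PINGREQ, 0, 0).hex())    # c000 (a complete 2-byte packet)
    print(fixed_header(PUBLISH, 0, 200).hex())  # 30c801
```

A keep-alive ping is therefore two bytes on top of TCP/IP, and a PUBLISH with up to 127 bytes of topic and payload still needs only a 2-byte fixed header.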
To save more bandwidth on unreliable networks, the persistent session feature of MQTT allows clients to subscribe only once and on reconnect the subscriptions are retained for the client. For subscribing clients this can drastically reduce the overhead as the subscription message is only sent once.
Another reason, it seems, is the Last Will and Testament feature, which is useful to have on high-latency, low-bandwidth and unreliable networks.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded dual quad-core Xeon machine with tons of RAM and a fast SSD RAID. It is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.