How to improve throughput of TUN interface when using Erlang TUNCTL

How to improve throughput of TUN interface when using Erlang TUNCTL - udp

I'm using TUNCTL with {active, true} to get UDP packets from a TUN interface. The process gets the packets and sends them to a different process that does work and sends them to yet another process that pushes them out a different interface using gen_udp. The same process repeats in the opposite direction, I use gen_udp to get packets and send them to a TUN interface.
I start seeing overruns on the incoming TUN interface when CPU load is close to 50%, about 2500 packets/sec. I don't loose any packets on gen_udp side ever, only with tunctl. Why is my application not getting all the packets from the TUN interface when CPU is not overloaded? My process has no messages in it's message queue.
I've played with process priorities and buffer sizes, which didn't do much. Total CPU load makes a bit of a difference. I managed to lower CPU load, but even though I saw a slight increase in TUN interface throughput, it now seems to max out at a lower CPU load, say 50% instead of 60%.
Is TUNCTL/Procket not able to read packets fast enough or is TUNCTL/Procket not getting enough CPU time for some reason? My theory is that Erlang Scheduler doesn't know how much time it needs as it's calling a NIF and it doesn't know about the number of unhandled messages on the TUN interface. Do I need to get my hands dirty with C++ and/or write my own NIF? MSANTOS HELP!

As expected, it was a problem with TUNCTL not getting enough CPU time when active is true. I used procket:read which gets the packet from the TUN buffer. Using this approach lets you specify how often to check the buffer, which tells Erlang Scheduler how much time your process needs. This let me load the CPU up to 100% if needed and allowed me to get all the packets from TUN interface that I needed. Bottleneck solved.

Related

Operating System Basics

I am reading process management,and I have a few doubts-
What is meant by an I/o request,for E.g.-A process is executing and
hence it is in running state,it is in waiting state if it is waiting
for the completion of an I/O request.I am not getting by what is meant by an I/O request,Can you
please give an example to elaborate.
Another doubt is -Lets say that a process is executing and suddenly
an interrupt occurs,then the process stops its execution and will be
put in the ready state,is it possible that some other process began
its execution while the interrupt is also being processed?

Regarding the first question:
A simple way to think about it...
Your computer has lots of components. CPU, Hard Drive, network card, sound card, gpu, etc. All those work in parallel and independent of each other. They are also generally slower than the CPU.
This means that whenever a process makes a call that down the line (on the OS side) ends up communicating with an external device, there is no point for the OS to be stuck waiting for the result since the time it takes for that operation to complete is probably an eternity (in the CPU view point of things).
So, the OS fires up whatever communication the process requested (call it IO request), flags the process as waiting for IO, and switches execution to another process so the CPU can do something useful instead of sitting around blocked waiting for the IO request to complete.
When the external device finishes whatever operation was requested, it generates an interrupt, so the OS is informed the work is done, and it can then flag the blocked process as ready again.
This is all a very simplified view of course, but that's the main idea. It allows the CPU to do useful work instead of waiting for IO requests to complete.
Regarding the second question:
It's tricky, even for single CPU machines, and depends on how the OS handles interrupts.
For code simplicity, a simple OS might for example, whenever an interrupt happens process the interrupt in one go, then resume whatever process it decides it's appropriate whenever the interrupt handling is done. So in this case, no other process would run until the interrupt handling is complete.
In practice, things get a bit more complicated for performance and latency reasons.
If you think about an interrupt lifetime as just another task for the CPU (From when the interrupt starts to the point the OS considers that handling complete), you can effectively code the interrupt handling to run in parallel with other things.
Just think of the interrupt as notification for the OS to start another task (that interrupt handling). It grabs whatever context it needs at the point the interrupt started, then keeps processing that task in parallel with other processes.

I/O request generally just means request to do either Input , Output or both. The exact meaning varies depending on your context like HTTP, Networks, Console Ops, or may be some process in the CPU.
A process is waiting for IO: Say for example you were writing a program in C to accept user's name on command line, and then would like to print 'Hello User' back. Your code will go into waiting state until user enters their name and hits Enter. This is a higher level example, but even on a very low level process executing in your computer's processor works on same basic principle
Can Processor work on other processes when current is interrupted and waiting on something? Yes! You better hope it does. Thats what scheduling algorithms and stacks are for. However the real answer depending on what Architecture you are on, does it support parallel or serial processing etc.

Synopsys USB OTG Controller (2.65a) occasionally truncates isochronous IN in USB device mode

I'm using a Synopsys OTG core in device mode. Programming an isochronous IN high speed endpoint (USB 2.0) for the maximum transfer per microframe (3 packets of 1024 bytes) using a periodic FIFO dedicated to this endpoint. It works 99+% of the time. But occasionally the transfer is truncated. For example, the first 1024 bytes will go onto the bus with the DATA0 PID (instead of the correct DATA2 PID) and the remaining 2048 bytes will not be sent. Since I've programmed the packet count, multicount, max packet size and transfer size correctly I'm not sure what is causing this.
Obviously this is a very specific question and I don't have much hope of getting an answer, but I figured a shot in the dark was worth a try. Thanks in advance.

Isochronous transfers does not guarantee packet delivery. So if host controller has other active transfers, it will silently drop isochronous packets. If you need guaranteed packed delivery, you should use bulk transfers (but then it will not guarantee delivery time).
Isochronous is ideal for applications, like sound or video streaming, where you need constant delivery time, but loss of some frames is ok.
The specification places limits on the bus, allowing no more than 90% of any frame to be allocated for periodic transfers (Interrupt and Isochronous) on a full speed bus. On high speed buses this limitation gets reduced to no more than 80% of a microframe can be allocated for periodic transfers. (c) http://www.beyondlogic.org/usbnutshell/usb4.shtml

Answering my own question in case it may help someone else. It appears this OTG controller has a bug where the TX FIFO does not always empty properly. I found a successful workaround is to flush the FIFO after each TX. It's quick and the truncation symptom goes away.

Direct memory access DMA - how does it work?

I read that if DMA is available, then processor can route long read or write requests of disk blocks to the DMA and concentrate on other work. But, DMA to memory data/control channel is busy during this transfer. What else can processor do during this time?

First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.
The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.
As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.
Nowadays, however, the memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fair large cache, so it can often execute some instruction without using main memory at all.

Well the key point to note is that the CPU bus is always partly used by the DMA and the rest of the channel is free to use for any other jobs/process to run. This is the key advantage of DMA over I/O. Hope this answered your question :-)

But, DMA to memory data/control channel is busy during this transfer.
Being busy doesn't mean you're saturated and unable to do other concurrent transfers. It's true the memory may be a bit less responsive than normal, but CPUs can still do useful work, and there are other things they can do unimpeded: crunch data that's already in their cache, receive hardware interrupts etc.. And it's not just about the quantity of data, but the rate at which it's generated: some devices create data in hard real-time and need it to be consumed promptly otherwise it's overwritten and lost: to handle this without DMA the software may may have to nail itself to a CPU core then spin waiting and reading - avoiding being swapped onto some other task for an entire scheduler time slice - even though most of the time further data's not even ready.

During DMA transfer, the CPU is idle and has no control over memory bus. CPU is put in idle state by using high impedance state

Suggestions to Increase tcp level throughput

we have an application requirement where we'll be receiving messages from around 5-10 clients at a rate of 500KB/sec and doing some internal logic then distrubuting the received messages among 30-35 other network entities.
What all the tcp level or thread level optimizations are suggested ?

Sometimes programmers can "shoot themselves in the foot". One example is attempting to increase a linux user-space application's socket buffer size with setsockopt/SO_RCVBUF. On recent Linux distributions, this deactivates auto-tuning of the receive window, leading to poorer performance than what would have been seen had we not pulled the trigger.

~4Mbits/sec (8 x 500KB/sec) per TCP connection is well within the capability of well written code without any special optimizations. This assumes, of course, that your target machine's clock rate is measured in GHz and isn't low on RAM.
When you get into the range of 60-80 Mbits/sec per TCP connection, then you begin to hit some bottlenecks that might need profiling and countermeasures.
So to answer your question, unless you're seeing trouble, no TCP or thread optimizations are suggested.

How CPU finds ISR and distinguishes between devices

I should first share all what I know - and that is complete chaos. There are several different questions on the topic, so please don't get irritated :).
1) To find an ISR, CPU is provided with a interrupt number. In x86 machines (286/386 and above) there is a IVT with ISRs in it; each entry of 4 bytes in size. So we need to multiply interrupt number by 4 to find the ISR. So first bunch of questions is - I am completely confused in mechanism of CPU receiving the interrupt. To raise an interrupt, firstly device shall probe for IRQ - then what ? The interrupt number travels "on IRQ" towards CPU? I also read something like device putting ISR address on data bus ; whats that then ? What is the concept of devices overriding the ISR. Can somebody tell me few example devices where CPU polls for interrupts? And where does it finds ISR for them ?
2) If two devices share an IRQ (which is very much possible), how does CPU differs amongst them ? What if both devices raise an interrupt of same priority simultaneously. I got to know there will be masking of same type and low priority interrupts - but how this communication happens between CPU and device controller? I studied the role of PIC and APIC for this problem, but could not understand.
Thanks for reading.
Thank you very much for answering.

CPUs don't poll for interrupts, at least not in a software sense. With respect to software, interrupts are asynchronous events.
What happens is that hardware within the CPU recognizes the interrupt request, which is an electrical input on an interrupt line, and in response, sets aside the normal execution of events to respond to the interrupt. In most modern CPUs, what happens next is determined by a hardware handshake particular to the type of CPU, but most of them receive a number of some kind from the interrupting device. That number can be 8 bits or 32 or whatever, depending on the design of the CPU. The CPU then uses this interrupt number to index into the interrupt vector table, to find an address to begin execution of the interrupt service routine. Once that address is determined, (and the current execution context is safely saved to the stack) the CPU begins executing the ISR.
When two devices share an interrupt request line, they can cause different ISRs to run by returning a different interrupt number during that handshaking process. If you have enough vector numbers available, each interrupting device can use its own interrupt vector.
But two devices can even share an interrupt request line and an interrupt vector, provided that the shared ISR is clever enough to go back to all the possible sources of the given interrupt, and check status registers to see which device requested service.
A little more detail
Suppose you have a system composed of a CPU, and interrupt controller, and an interrupting device. In the old days, these would have been separate physical devices but now all three might even reside in the same chip, but all the signals are still there inside the ceramic case. I'm going to use a powerPC (PPC) CPU with an integrated interrupt controller, connected to a device on a PCI bus, as an example that should serve nicely.
Let's say the device is a serial port that's transmitting some data. A typical serial port driver will load bunch of data into the device's FIFO, and the CPU can do regular work while the device does its thing. Typically these devices can be configured to generate an interrupt request when the device is running low on data to transmit, so that the device driver can come back and feed more into it.
The hardware logic in the device will expect a PCI bus interrupt acknowledge, at which point, a couple of things can happen. Some devices use 'autovectoring', which means that they rely on the interrupt controller to see to it that the correct service routine gets selected. Others will have a register, which the device driver will pre-program, that contains an interrupt vector that the device will place on the data bus in response to the interrupt acknowledge, for the interrupt controller to pick up.
A PCI bus has only four interrupt request lines, so our serial device will have to assert one of those. (It doesn't matter which at the moment, it's usually somewhat slot dependent..) Next in line is the interrupt controller (e.g. PIC/APIC), that will decide whether to acknowledge the interrupt based on mask bits that have been set in its own registers. Assuming it acknowledges the interrupt, it either then obtains the vector from the interrupting device (via the data bus lines), or may if so programmed use a 'canned' value provided by the APIC's own device driver. So far, the CPU has been blissfully unaware of all these goings-on, but that's about to change.
Now it's time for the interrupt controller to get the attention of the CPU core. The CPU will have its own interrupt mask bit(s) that may cause it to just ignore the request from the PIC. Assuming that the CPU is ready to take interrupts, it's now time for the real action to start. The current instruction usually has to be retired before the ISR can begin, so with pipelined processors this is a little complicated, but suffice it to say that at some point in the instruction stream, the processor context is saved off to the stack and the hardware-determined ISR takes over.
Some CPU cores have multiple request lines, and can start the process of narrowing down which ISR runs via hardware logic that jumps the CPU instruction pointer to one of a handful of top level handlers. The old 68K, and possibly others did it that way. The powerPC (and I believe, the x86) have a single interrupt request input. The x86 itself behaves a bit like a PIC, and can obtain a vector from the external PIC(s), but the powerPC just jumps to a fixed address, 0x00000500.
In the PPC, the code at 0x0500 is probably just going to immediately jump out to somewhere in memory where there's room enough for some serious decision making code, but it's still the interrupt service routine. That routine will first go to the PIC and obtain the vector, and also ask the PIC to stop asserting the interrupt request into the CPU core. Once the vector is known, the top level ISR can case out to a more specific handler that will service all the devices known to be using that vector. The vector specific handler then walks down the list of devices assigned to that vector, checking interrupt status bits in those devices, to see which ones need service.
When a device, like the hypothetical serial port, is found wanting service, the ISR for that device takes appropriate actions, for example, loading the next FIFO's worth of data out of an operating system buffer into the port's transmit FIFO. Some devices will automatically drop their interrupt request in response to being accessed, for example, writing a byte into the transmit FIFO might cause the serial port device to de-assert the request line. Other devices will require a special control register bit to be toggled, set, cleared, what-have-you, in order to drop the request. There are zillions of different I/O devices and no two of them ever seem to do it the same way, so it's hard to generalize, but that's usually the way of it.
Now, obviously there's more to say - what about interrupt priorities? what happens in a multi-core processor? What about nested interrupt controllers? But I've burned enough space on the server. Hope any of this helps.

I Came over this Question like after 3 years.. Hope I Can help ;)
The Intel 8259A or simply the "PIC" has 8 pins ,IRQ0-IRQ7, every pin connects to a single device..
Lets suppose that u pressed a button on the keyboard.. the voltage of the IRQ1 pin, which is connected to the KBD, is High.. so after the CPU gets interrupted, acknowledge the Interrupt bla bla bla... the PIC does simply add 8 to the number of the IRQ line so IRQ1 means 1+8 which means 9
SO the CPU sets its CS and IP on the 9th entry in the vector table.. and because the IVT is an array of longs it just multiply the number of cells by 4 ;)
CPU.CS=IVT[9].CS
CPU.IP=IVT[9].IP
the ESR deals with the device through the I/O ports ;)
Sorry for my bad english .. am an Arab though :)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas