Imbalanced IRQs on virtio devices

I noticed in top on my Linux server that one CPU had a far higher number of software interrupts than the other 7 cores. Digging further, I found that this core is pinned to a particular IRQ, which happens to belong to a virtio device. In fact, each core has an affinity to a particular virtio device:
virtio0-config
virtio0-control
virtio0-event
virtio0-request
virtio2-config
virtio2-input.0
virtio2-output.0
virtio3-config
virtio3-input.0
virtio3-output.0
virtio4-config
virtio4-input.0
virtio4-output.0
In this list, virtio4-input.0 in particular has a very high number of interrupts, and I am not able to figure out what is special about this device. Any clues would be very helpful. The machine in question is a Nutanix VM running on a Linux host.

IIRC, that's your virtualised KVM network device (virtio4), and -input.0 is its input queue. I don't know why, but the interrupts appear to be handled by only one CPU. You can read more about someone's investigation, and their attempts to spread the IRQ handling over multiple CPUs, here:
http://www.9bitwizard.eu/packets-part-2
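If you want to experiment with spreading the load yourself, the standard interface is to write a hex CPU mask to /proc/irq/&lt;n&gt;/smp_affinity. Below is a minimal sketch, assuming you have already looked up the IRQ number for virtio4-input.0 in /proc/interrupts; the IRQ number and mask are command-line arguments, not values from the original post:

```c
/* Sketch: pin an IRQ to a set of CPUs by writing a hex mask to
 * /proc/irq/<irq>/smp_affinity. Requires root. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hex-cpu-mask>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%s/smp_affinity", argv[1]);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }
    fprintf(f, "%s\n", argv[2]);   /* e.g. "f" = CPUs 0-3, "2" = CPU 1 only */
    fclose(f);
    return EXIT_SUCCESS;
}
```

Note that irqbalance, if running, may rewrite the affinity behind your back, and some hypervisors ignore affinity for MSI-X vectors entirely, so treat this as an experiment rather than a guaranteed fix.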

Related

Configuration registers for LPC bus in Poulsbo System Controller Hub (US15W)

We have a system based around an Atom Z510/Intel SCH US15W Q7 card (running Debian Linux). We need to transfer blocks of data from a device on the Low Pin Count bus. As far as I know this chipset does not provide DMA facilities, meaning the processor has to read the data out a byte at a time in a software loop. (The device driver actually implements this using the x86 "rep insb" instruction, so the loop is actually implemented by the CPU, if I understand correctly.)
This is far from optimal, but it should be possible to hit a transfer rate of 14Mb/s. Instead we can barely manage 4Mb/s, with transactions on the bus no closer than 2 µs apart, even though each read to the slave device is done in 560 ns. I don't believe other traffic on the bus is to blame, but am still investigating.
My question is:
Does any one know if there are any configuration registers on the SCH that could affect the LPC bus timing?
I cannot find any useful information about the device on the Intel website, nor have I spotted anything in the Linux kernel code that appears to be fiddling with any such registers (but I'm a noob when it comes to Linux kernel stuff).
I'm not an x86 expert so any other factors that might come into play or any other 'war stories' relating to this device would be good to know about too.
Edit: I have found the datasheet. I've not seen anything in it that explains this behaviour, but I am investigating the possibility of mapping our device as a firmware device as the firmware bus cycles don't seem to suffer the same delays.
For the record, the solution was to modify the FPGA firmware so that the chip's data in/out register was mapped to four adjacent addresses, and to modify the driver to use 32-bit in/out instructions (inl/outl). Although the SCH does not implement 32-bit LPC read/write operations, the result is 4 back-to-back 8-bit operations followed by the same dead time I was getting previously with a single byte, meaning it averages about 1 µs per byte. Not ideal, but still a doubling in throughput.
It transpires the firmware cycles were quicker because the SCH transfers 64 bytes at a time from the firmware flash; after 64 bytes there is the same 1.4 µs gap, indicating this is the per-transaction latency of the device. Exploiting this may have been slightly quicker than the above solution; however, the trade-off is that it is limited to 64-byte chunks, and each byte takes longer (680 ns IIRC) due to the additional cycles required to do a firmware read.
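For illustration, here is a minimal user-space sketch of that access pattern; the real code belongs in a kernel driver, and DEV_BASE and the 64-byte transfer size are assumptions, not values from the actual hardware:

```c
/* Sketch of the 32-bit port-I/O trick described above: the FPGA maps
 * its 8-bit data register at four consecutive I/O addresses, so one
 * inl() is decoded by the LPC bridge as four back-to-back 8-bit reads
 * with only one inter-transaction gap. x86 Linux, needs root. */
#include <stdint.h>
#include <stdio.h>
#include <sys/io.h>   /* inl, ioperm */

#define DEV_BASE 0x300   /* hypothetical I/O base of the FPGA device */

int main(void)
{
    if (ioperm(DEV_BASE, 4, 1) != 0) {
        perror("ioperm");
        return 1;
    }

    uint8_t buf[64];
    for (size_t i = 0; i < sizeof buf; i += 4) {
        uint32_t w = inl(DEV_BASE);   /* four 8-bit LPC cycles */
        buf[i + 0] = w & 0xff;
        buf[i + 1] = (w >> 8) & 0xff;
        buf[i + 2] = (w >> 16) & 0xff;
        buf[i + 3] = (w >> 24) & 0xff;
    }

    fwrite(buf, 1, sizeof buf, stdout);
    return 0;
}
```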

USB in an embedded system without RTOS

I have no experience of embedded USB stacks so my question is, can I run it without an OS?
Of course it must be possible to run it without an OS, but will things be MUCH easier if I have one?
I want to use it to save data to an attached USB mass storage device.
If your USB device is on-chip, your chip vendor will almost certainly have example code for USB that may include mass storage. You won't need an OS, but interrupt handling will be necessary and a file system too.
Your USB controller will need host or OTG capability - if it is only device capable, then you cannot connect to another USB device, only a host.
The benefit of an OS - or at least a simple RTOS kernel - is that you can schedule file system activity concurrently with other processing tasks. The OS in that case would not necessarily make things easier, but it may make your system more responsive to critical tasks and events.
I have used USB stacks in the past with PIC18F2550 (8-bit) and LPC1343 (32-bit ARM Cortex-M3) microcontrollers without any problems.
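To make the no-OS case concrete, here is a bare-metal "super-loop" sketch. The usb_host_task(), msd_mounted(), sample_ready() and read_sample() functions are placeholders for whatever your vendor's stack and your application provide; the f_* calls are from ChaN's FatFs, a common choice for the filesystem layer:

```c
/* Sketch: servicing a vendor USB host stack and FatFs from a simple
 * super-loop, with no RTOS. All extern functions are placeholders. */
#include "ff.h"            /* FatFs */

extern void usb_host_task(void);    /* placeholder: vendor stack poll */
extern int  msd_mounted(void);      /* placeholder: mass storage ready? */
extern int  sample_ready(void);     /* placeholder: data source */
extern unsigned read_sample(void);  /* placeholder */

int main(void)
{
    FATFS fs;
    FIL log;
    UINT written;
    int file_open = 0;

    for (;;) {
        usb_host_task();   /* run enumeration/transfer state machine */

        if (msd_mounted() && !file_open) {
            f_mount(&fs, "", 1);                       /* mount the drive */
            f_open(&log, "log.bin", FA_WRITE | FA_OPEN_APPEND);
            file_open = 1;
        }

        if (file_open && sample_ready()) {
            unsigned s = read_sample();
            f_write(&log, &s, sizeof s, &written);     /* append a record */
        }
    }
}
```

The point of the structure is that the USB stack's state machine is polled on every pass, so it keeps running while the application decides when to touch the filesystem; an RTOS would let you do the same with separate tasks and priorities.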

How to port uClinux to a microcontroller

I have a Stellaris LM4F232 evaluation board. I have ported FreeRTOS and SYS/BIOS to it and successfully developed a GPS tracking application. But I have always wanted to port uClinux to this board. My questions are:
i) Is there any material on porting uClinux to an arbitrary microcontroller?
ii) What knowledge do I need to do this?
iii) What is the roadmap to achieve it?
I have googled a lot and didn't find the right information. I have seen posts saying that it is difficult, but I can't see why. Any help?
Linux, even uClinux, requires considerable memory resources; you'd want to start with at least 2MB for the boot device and 16MB of RAM (although a minimal system can be booted in as little as 4MB). On a microcontroller, this means that you must have external memory.
Another issue is that Cortex-M devices are optimised to run code from on-chip flash memory, having separate buses for ROM and RAM so that data and instructions can be fetched simultaneously. uClinux must run from external RAM, which has a detrimental effect on performance, and you will be unlikely to achieve the 1.25 MIPS per MHz the CM4 is otherwise capable of. It is possible to arrange for time-critical code to be placed in on-chip flash if necessary, but that is of course a limited resource.
Some good advice on the issues of deploying Linux on a Cortex-M device can be found here.
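As a sketch of that last point: with GCC you can direct individual functions into a named section and have the linker script keep that section in on-chip flash. The section name and the matching linker-script entry are assumptions here:

```c
/* Place a time-critical function in a dedicated section; the linker
 * script must map ".fastcode" to on-chip flash. */
#define FASTCODE __attribute__((section(".fastcode")))

FASTCODE void timer_isr(void)
{
    /* time-critical work here, fetched from on-chip flash */
}
```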
I would suggest having a look at Buildroot, which as far as I know can be built for this board.
Adding to @Clifford's answer: you can use U-Boot (a bootloader), which is already configured for many boards; if your board is not on the list, you can adapt an existing configuration.

What is the minimum latency of USB 3.0?

First up, I don't know much about USB, so apologies in advance if my question is wrong.
In USB 2.0 the polling interval was 0.125ms, so the best possible latency for the host to read some data from the device was 0.125ms. I'm hoping for reduced latency in USB 3.0 devices, but I'm finding it hard to learn what the minimum latency is. The USB 3.0 spec says, "USB 2.0 style polling has been replaced with asynchronous notifications", which implies the 0.125ms polling interval may no longer be a limit.
I found some benchmarks for a USB 3.0 SSD that suggest data can be read from the device in slightly less than 0.125 ms, and that includes all time spent in the host OS and the device's flash controller.
http://www.guru3d.com/articles_pages/ocz_enyo_usb_3_portable_ssd_review,8.html
Can someone tell me what the lowest possible latency is? A theoretical answer is fine. An answer including the practical limits of the various versions of Linux and Windows USB stacks would be awesome.
To head off the "tell me what you're trying to achieve" question: I'm creating a debug interface for the ASICs my company designs, i.e. a PC connects to one of our ASICs via a debug dongle. One possible use case is to implement conditional breakpoints when the ASIC hardware only implements simple breakpoints. To do so, I need to determine when a simple breakpoint has been hit, evaluate the condition, and, if it is false, set the processor running again. The simple breakpoint may be hit millions of times before the condition becomes true. We might implement the debug dongle on an FPGA or an off-the-shelf USB 3.0 enabled microcontroller.
Answering my own question...
I've come to realise that this question kind-of misses the point of USB 3.0. Unlike 2.0, it is not a shared-bus system. Instead it uses a point-to-point link between the host and each device (I'm oversimplifying but the gist is true). With USB 2.0, the 125 us polling interval was critical to how the bus was time-division multiplexed between devices. However, because 3.0 uses point-to-point links, there is no multiplexing to be done and thus the polling interval no longer exists. As a result, the latency on packet delivery is much less than with USB 2.0.
In my experiments with a Cypress FX-3 devkit, I have found that it is easy enough to get a round trip from a Windows application to the device and back with an average latency of 30 µs. I suspect that the vast majority of that time is spent in various OS delays, e.g. the user-space to kernel-space mode switch and the DPC latency within the driver.
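For anyone wanting to reproduce this kind of measurement, here is a rough sketch using libusb-1.0 that times a bulk OUT + bulk IN loopback. The VID/PID and endpoint addresses are placeholders for your own device, and the device firmware is assumed to echo each packet straight back:

```c
/* Sketch: average USB round-trip latency via a bulk loopback.
 * VID/PID and endpoints 0x01 (OUT) / 0x81 (IN) are placeholders. */
#include <libusb-1.0/libusb.h>
#include <stdio.h>
#include <time.h>

#define VID 0x04b4   /* placeholder */
#define PID 0x00f0   /* placeholder */

int main(void)
{
    libusb_context *ctx;
    if (libusb_init(&ctx) != 0)
        return 1;

    libusb_device_handle *h = libusb_open_device_with_vid_pid(ctx, VID, PID);
    if (!h || libusb_claim_interface(h, 0) != 0) {
        fprintf(stderr, "device not found or busy\n");
        return 1;
    }

    unsigned char buf[64] = {0};
    int transferred;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++) {
        /* error handling omitted for brevity in this sketch */
        libusb_bulk_transfer(h, 0x01, buf, sizeof buf, &transferred, 100);
        libusb_bulk_transfer(h, 0x81, buf, sizeof buf, &transferred, 100);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average round trip: %.1f us\n", us / 1000.0);

    libusb_release_interface(h, 0);
    libusb_close(h);
    libusb_exit(ctx);
    return 0;
}
```

Averaging over many iterations matters: a single round trip measured in isolation can be wildly skewed by timer resolution and scheduling noise.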
I've got a couple of resources for you. One, which I've just downloaded, is the complete spec: several PDFs zipped up for USB 3.0. Here is a short excerpt from pages 58-59 (USB 3_r1.0_06_06_2011.pdf):
USB 2.0 transmits SOF/uSOF at fixed 1 ms/125 μs intervals. A device driver may change the interval with small finite adjustments depending on the implementation of host and system software. USB 3.0 adds mechanism for devices to send a Bus Interval Adjustment Message that is used by the host to adjust its 125 μs bus interval up to +/-13.333 μs.
In addition, the host may send an Isochronous Timestamp Packet (ITP) within a relaxed timing window from a bus interval boundary.
Here is one more resource that looked interesting, which deals with calculating latency.
You make a good point about operating system latency issues, especially in not real time operating systems.
I might suggest that you check on SuperUser too; maybe someone has other ideas. Cheers!
I dispute the marked answer.
On Windows there is no way to achieve the stated round-trip latency over USB, SuperSpeed (3.0) or not. The documentation states:
The number of isochronous packets must be a multiple of the number of packets per frame.
https://learn.microsoft.com/en-us/windows-hardware/drivers/usbcon/transfer-data-to-isochronous-endpoints
The packets per frame is given by bInterval, which also determines the polling interval. E.g., if you want to achieve a transfer every microframe (125 µs), you will need to submit 8 transfers per URB (USB Request Block), which means a scheduling service interval of 1 ms.
Anything else requires your own kernel-mode driver or is out-of-spec.
On RT Linux I can confirm round trips of 2 × 125 µs plus some overhead.
Excerpts from embedded.com: "USB 3.0 vs USB 2.0: A quick reference summary for the busy engineer"
Communication architecture differences
USB 2.0 employs a communication architecture where the data transaction must be initiated by the host. The host will frequently poll the device and ask for data, and the device may only transmit data once it has been requested by the host. The high polling frequency not only increases power consumption, it increases transmission latency because the data can only be transmitted when the device is polled by the host. USB 3.0 improves upon this communication model and reduces transmission latency by minimizing polling and also allowing devices to transmit data as soon as it is ready.
...
Timestamp enhancements
Unlike USB 2.0 cameras, which can range in accuracy from 0 to 125 us, the timestamp originating from USB 3.0 cameras is more precise, and mimics the accuracy of the 1394 cycle timer of FireWire cameras.
...
USB 3.0 -- or SuperSpeed USB -- overcomes key limitations of other specifications with six (over IEEE 1394b) to nine (over USB 2.0) times higher bandwidth, better error management, higher power supply, ... and lower latency and jitter times.
P.S. It also mentions "longer cable lengths" for USB 3.0, but another paragraph contradicts this, saying up to 5 m for USB 2.0 and up to 3 m for USB 3.0.

How are interrupts handled by dual processor machines?

I have an idea of how interrupts are handled by a dual core CPU. I was wondering about how interrupt handling is implemented on a board with more than one physical processor.
Is any of the interrupt responsibility determined by the physical board's configuration? Each processor must be able to handle some types of interrupts, like disk I/O. Or is there some circuitry to manage and dispatch interrupts to the appropriate processor? My guess is that the scheme must be processor-neutral, so that any processor and core can run the interrupt handler.
If a core is waiting on a disk read, will that core be the one to run the interrupt handler when the disk is ready?
On x86 systems each CPU gets its own local APIC (Advanced Programmable Interrupt Controller); the local APICs are wired to each other and to an I/O APIC that handles routing device interrupts to the local APICs.
The OS can program the APICs to determine which interrupts get routed to which CPUs (or to let the APICs make that decision).
I imagine that a multi-core CPU would have a local APIC for each core, but I'm honestly not certain about that.
See these links for more details:
http://osdev.berlios.de/pic.html
http://www.microsoft.com/whdc/archive/io-apic.mspx
http://en.wikipedia.org/wiki/Intel_APIC_Architecture
What you're interested in is SMP processor affinity. Here is an excellent article about how it is handled in Linux. The Advanced Programmable Interrupt Controller (APIC) is how you manage this in a modern system. Basically, the default would be for all interrupts to go to processor 0 unless the OS uses this interface to set things up properly. Also, you don't necessarily want the core that issued a command to wait on a particular interrupt; you want the less-loaded cores to receive it.
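A quick way to see how deliveries are actually distributed is to watch /proc/interrupts, where each column is a per-CPU delivery count. Here is a small sketch that filters it for a device-name substring; this is just a convenience, equivalent to grep:

```c
/* Sketch: show the /proc/interrupts header plus any lines matching a
 * substring (e.g. "virtio" or "eth0"), so per-CPU counts are visible. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    FILE *f = fopen("/proc/interrupts", "r");
    if (!f) {
        perror("/proc/interrupts");
        return 1;
    }

    char line[1024];
    int first = 1;
    while (fgets(line, sizeof line, f)) {
        /* always print the CPU header row, then only matching devices */
        if (first || argc < 2 || strstr(line, argv[1]))
            fputs(line, stdout);
        first = 0;
    }
    fclose(f);
    return 0;
}
```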
I already asked this question a while back. Maybe it can offer you some insight :)
how do interrupts in multicore/multicpu machines work
I would say that it would depend on the hardware manufacturer...
However, this link makes me believe most are probably handled by the primary processor and/or the first core.