Is there a difference in the implementation of boost::asio between Unix (AIX, HP-UX) and Linux (RedHat, Ubuntu)? - boost-asio

In Windows the Completion Event Queue is implemented by the operating system, and is associated with an I/O completion port.
http://www.boost.org/doc/libs/1_37_0/doc/html/boost_asio/overview/core/async.html
But which of these mechanisms (select, epoll, or kqueue) is used for maximum performance on Unix, and which on Linux? And is there a difference in the implementation of boost::asio between Unix (AIX, HP-UX) and Linux (RedHat, Ubuntu)?

The platform-specific implementation notes describe the supported platforms and how event demultiplexing is implemented on each. Newer Linux kernels (2.6 and later) use epoll(7), while older kernels fall back to select(2). AIX and HP-UX both use select(2).
kqueue(2) is used on BSD systems, including Mac OS X and iOS. It is very similar to epoll on Linux.
Generally speaking, epoll will perform better than select. select performs a linear scan over the list of n descriptors on every call, which is O(n). With epoll, descriptors are registered up front with the epoll instance via an epoll_event structure, so modifying the set of events to wait for is somewhat more expensive, but waiting for those events is very fast, effectively O(1) in the number of registered descriptors.
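For illustration, here is a minimal sketch of the raw epoll pattern on Linux (asio hides this behind its reactor; the descriptor and timeout below are placeholders):

    // Minimal epoll sketch: register descriptors once, then wait.
    #include <sys/epoll.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int epfd = epoll_create1(0);                  // create the epoll instance
        if (epfd < 0) { std::perror("epoll_create1"); return 1; }

        int fd = 0;                                   // e.g. STDIN_FILENO; a socket in practice
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        // Registration is the comparatively expensive part...
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
            std::perror("epoll_ctl");
            return 1;
        }

        // ...while waiting scales with the number of *ready* descriptors,
        // not with the total number registered (unlike select()).
        epoll_event ready[16];
        int n = epoll_wait(epfd, ready, 16, 1000 /* ms timeout */);
        std::printf("%d descriptor(s) ready\n", n);

        close(epfd);
        return 0;
    }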

Related

When to use full system FS vs syscall emulation SE with userland programs in gem5?

Since syscall emulation is much easier to set up, I'm wondering what the advantages of using full system emulation are when running a userland program.
Or, in other words, what interesting aspects are modeled in full system mode but not in syscall emulation mode, and when are they significant?
It is mentioned in the docs at: http://gem5.org/Splash_benchmarks that full system is
Realistic: you're getting the actual Linux thread scheduler to schedule your threads
Is this the only advantage, or are there other advantages for users who are optimizing their applications or investigating the micro-architecture?
I also suspect that the MMU simulation is another important feature that is only modeled properly in full system mode, and could affect program performance.
Full system mode should be preferred (when it is possible to use it). There are benefits to using it, primarily fidelity in the simulation which is not possible with system call emulation mode. (The kernel interactions with an application can be important depending on the study that a researcher is trying to conduct.) Also, the user does not need to worry about implementing (or debugging) the system call implementation.
With that said, system call emulation mode can be useful under the right conditions. It is faster to run application code because there is no kernel running in the background. There is also no system noise if you want to mitigate it entirely. Arguably, it is easier to bootstrap a new device model as well. You can work on the model without driver support and make magic happen through fake interfaces. (It saves you having to model the bare-metal interface perfectly or having to write your own device driver.)
Your comments about dynamic linking and multi-threading support are related. If dynamic linking is fixed, you should be able to use your system's pthreads library and can forget about linking with m5threads entirely. The pthread library support has existed in the simulator for a while now (the system calls necessary for it to work properly are implemented).
However, there's a caveat to the threading implementation. You need to preallocate enough thread contexts at the start of simulation (by invoking with the -n option on the se.py script).
To elaborate, there is no operating system running in the background to schedule threads on the processors. (I use the terms threads and processors very loosely here.) To sidestep the scheduling problem, you have to preallocate enough processors so that the threads can be created on calls to clone/execve. The constraint is that you can never have more threads than processors (unlike a real system, where the operating system can schedule them as it pleases), as illustrated in the sketch below.
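For illustration, a workload like the following hedged sketch (the thread count is arbitrary) would need se.py to be invoked with at least four thread contexts, e.g. -n 4, because the three workers plus the main thread can never be multiplexed onto fewer simulated processors:

    // Hedged sketch: 3 worker threads + the main thread = 4 thread contexts
    // needed in SE mode (e.g. se.py -n 4), since there is no OS scheduler
    // to time-share them on fewer simulated CPUs.
    #include <thread>
    #include <vector>
    #include <cstdio>

    static void worker(int id) {
        std::printf("thread %d running on its own simulated CPU\n", id);
    }

    int main() {
        const int kThreads = 3;
        std::vector<std::thread> pool;
        for (int i = 0; i < kThreads; ++i)
            pool.emplace_back(worker, i);
        for (auto& t : pool)
            t.join();
        return 0;
    }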
The configuration scripts probably do not behave how a researcher would want them to behave for a multi-threaded workload. The researcher would need to verify that the caches were configured correctly and that they are sharing certain cache levels like a real machine would do. If the application calls clone/execve many times, it may not be possible to cause the generated configuration to behave realistically.
Your last statement about modeling accelerators is incorrect. The AMD GFX8 model does use system call emulation mode. (Also, we developed a NIC model which was never publicly released.) It involves creating a fake driver and manipulating it through the same ioctl interfaces that a real driver would use. Linux treats everything like a file so the driver is opened through the open system call interface and you can capture it there. There are other things which you might need to do (like map mmio ranges in the configuration), but the driver interface is the main piece. The application interacts with the driver and the driver interacts with the accelerator model.
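As a hedged sketch of the user-space side of that flow (the device path, ioctl request code, and command structure below are invented for illustration and are not the actual GFX8 interfaces):

    // The application opens a device node and issues ioctls exactly as it
    // would against a real driver; in SE mode the simulator can intercept
    // these open()/ioctl() system calls and route them to a fake driver.
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdio>

    struct accel_cmd {              // hypothetical command structure
        unsigned int  opcode;
        unsigned long buffer_addr;
    };

    int main() {
        int fd = open("/dev/accel0", O_RDWR);      // path is a placeholder
        if (fd < 0) { std::perror("open"); return 1; }

        accel_cmd cmd{0x1, 0};                     // illustrative command
        if (ioctl(fd, 0xC008A001, &cmd) < 0)       // illustrative request code
            std::perror("ioctl");

        close(fd);
        return 0;
    }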
Advantages of SE:
sometimes easier to set up benchmarks, if all the syscalls you need are implemented, and if you have just the right cross compiler (which, of course, is not properly documented anywhere).
SE runs Dhrystone about 2x as fast: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/00d282d912173b72c63c0a2cc893a97d45498da5#user-mode-vs-full-system-benchmark That benchmark makes essentially no syscalls (only some informational ones before and after the actual benchmark runs)
it is easier to get greater visibility and control of what the application is doing since the kernel is not running in parallel. E.g. stats will be just for the application, GDB will be just for the application: thread-aware gdb for the Linux kernel
Disadvantages of SE:
in practice, harder to set up benchmarks, because it is too fragile / has too many restrictions.
If your content does not work immediately out of the box, it is easier to just create or download a full system image and go for that instead, which is much more reliable.
Here is a sample minimal working Ubuntu setup if you are still interested: How to compile and run an executable in gem5 syscall emulation mode with se.py?
less representative, since no actual OS is running
no dynamic linking for ARM as of June 2018: How to run a dynamically linked executable syscall emulation mode se.py in gem5?
if you want to evaluate an accelerator like a GPU, you will have to create a slightly custom interface for it, since there is no kernel driver running on top of the kernel as usual.
Brandon has pointed out in his answer that this has in fact been done before: https://stackoverflow.com/a/56371006/9160762
So my recommendation is:
try SE first. If it works, great. If it doesn't, try to fix it quickly, since most problems are trivial. Having the SE setup will save you a lot of time over full system, and it is often representative enough.
otherwise, use FS mode. It is just simpler to set up, more representative, and the performance hit is acceptable for most.
You could also use SE first, and then move to FS to further validate only your most important SE results, since FS is slower and you can therefore validate fewer different setups.

Can HID Input/Feature reports have side-effects for the device?

To talk to an embedded device, I'm [ab]using the USB HID protocol, as seems to be common.
Due to limitations of the device stack (it's a Freescale Kinetis KL26 with their SDK) it supports only Output and GetFeature on EP0, and Interrupt Input on EP1. I'm mostly using EP0 for I/O.
What's prompted my question is the discovery that a Linux host downloads all Feature Reports at enumeration time (which is a problem, because on my device, they have side-effects). Is there any reliable way to stop the host getting reports except when an application explicitly requests them? [I've tried flagging all reports as "Volatile" but it doesn't take the hint.]
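(For reference, when an application does request a report explicitly on a Linux host, it goes through something like this hidraw sketch; the device node and report ID are placeholders:)

    // Explicitly pull a Feature report from user space via hidraw.
    #include <linux/hidraw.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("/dev/hidraw0", O_RDWR);     // device node is an assumption
        if (fd < 0) { std::perror("open"); return 1; }

        unsigned char buf[64] = {0};
        buf[0] = 0x01;                             // hypothetical report ID
        // HIDIOCGFEATURE issues a GET_FEATURE request to the device on demand.
        int res = ioctl(fd, HIDIOCGFEATURE(sizeof(buf)), buf);
        if (res < 0)
            std::perror("HIDIOCGFEATURE");
        else
            std::printf("read %d bytes of feature report\n", res);

        close(fd);
        return 0;
    }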
More generally: Does HID impose any rules to say that getting Feature or Input Reports must be idempotent or free from side-effects (kinda analogous to REST)? I can't find any such rule in the HID class specification, but maybe it's implicit.
Thanks

Virtualization architecture on mainframe (z/Architecture)

I have studied with interest the hardware virtualization extensions added by Intel and AMD to the x86 architecture (known as VMX and SVM, respectively). While these are still relatively recent additions to x86 CPUs, my understanding is that the mainframe architecture has made extensive use of virtualization since the 1970s-80s, for instance in the form of the venerable z/VM operating system. Even nested virtualization has been used.
My question is: is there public documentation of the hardware facilities provided by the z/Architecture that the z/VM operating system uses to implement this virtualization? I.e., the control registers and data structures that the hardware implements to allow the hypervisor to maintain the guest state and trap the necessary instructions? Another thing I am curious about is whether the z/Architecture supports second-level address translation (which was added later to VMX and SVM).
Just to get it out of the way: System/370 and all its descendants support virtualization as is (they satisfy the classical virtualization requirements). In that sense, no special hardware support has ever been needed, as opposed to the Intel architecture.
The performance improvements for VM guests on System/370, XA, ESA, etc., all the way through z/Architecture, have traditionally been implemented using the DIAG (diagnose) instruction as well as microcode (now millicode) assists. In modern terms, this is more like paravirtualization. The facilities are documented; you can start here, for instance.
Update - after reading extensive comments, a few notes and clarifications.
S/370 and its descendants never needed specialized hardware virtualization support to correctly run guest operating systems - not because the virtualization was part of the initial design and requirements - it wasn't, but because the architecture was properly designed to support secure multiuser environment. Popek and Goldberg's virtualization requirements are actually very weak - in essence, that only privileged instructions can affect system configuration. This requirement was part of even S/370's predecessor, System/360, well before first virtualized systems appeared.
Performance improvements for VM guests proceeded along two lines.
First, paravirtualization approach - essentially developing well-architected API for guest-hypervisor communication. It's been used not only for performance, but for a wide variety of other services such as inter-VM communication. The API is documented in the manual referred to above.
Second, microcode extensions (VM microcode assist) that performed some performance-sensitive hypervisor logic at the microcode level, essentially the hardware level. That is not paravirtualization; it is hardware virtualization support proper. But in early 370 machines this support was not architected, meaning it was model-dependent and subject to change. With 370/XA, IBM introduced a proper architectural way to support high-performance virtualization, the Start Interpretive Execution (SIE) instruction. This instruction is not documented in Principles of Operation, but rather in a separate publication, IBM System/370 XA Interpretive Execution. (This document is referenced multiple times in Principles of Operation. The link refers to the first version of the document; you can download version 2 here. I am not sure if that publication was ever updated - probably this is the latest version.) Additionally, the I/O subsystem provided VM assists too.
I failed to mention the SIE instruction and the manual that documents it in my original answer, which is a crucial part of the story. I am grateful to the author of the question and the extensive comments that prodded me to check my memory and realize that I had skipped an important bit of technical background. This presentation provides an excellent overview of z/VM core facilities, covering additional aspects including memory management, I/O, networking, etc.
The SIE instruction is how virtualization software accesses the z/Architecture Interpretive Execution Facility (IEF). The exact details of the interface have not been published since the early 1990s.
This is a hardware-based capability. IEF provides two levels of virtualization. The first level is used by firmware (via the SIE instruction) to create partitions. In each partition you can run an operating system. One of those operating systems is z/VM. It uses the SIE instruction (running within the context of the first level SIE instruction) to run virtual machines. You can run any z/Architecture operating system in a virtual machine, including z/VM itself.
The SIE instruction takes as input the description of a virtual server (partition or virtual machine). The hardware then runs the virtual server's instruction stream, stopping only when it needs help from whatever issued the SIE instruction, whether it be the partition hypervisor or the z/VM hypervisor.

Is the Operating System a process?

I am just now learning about OSes and I stumbled upon this question from my class' lecture notes. In our class, we define a process as a program in execution and I know that an OS is itself a program. So by this definition, an OS is a process.
At the same time processes can be switched in or out via a context switch, which is something that the OS manages and handles. But what would handle the OS itself when it isn't running?
Also if it is a process, does the OS have a process control block associated with it?
There was an older question on this site that I looked at, but I felt as if the answers weren't clear enough to really outline WHY the OS is/isn't a process so I thought I'd ask again here.
First of all, an OS is multiple parts. The core piece is the kernel, which is not a process. It is a framework for running processes. In practice, a process is more than just a "program in execution". On a system with an MMU, a process usually runs in its own virtual address space. The kernel, however, is usually mapped into all processes. It's always there.
Other ancillary parts of the OS exist to make it usable. The OS may have processes that it runs as part of its management. For example, Linux has many kernel threads that are independently scheduled tasks, but these are often not crucial to the OS's operation.
Short answer: No.
Here's as good a definition of "Operating System" as any:
https://en.wikipedia.org/wiki/Operating_system
An operating system (OS) is system software that manages computer hardware and software resources and provides common services for computer programs. The operating system is a component of the system software in a computer system. Application programs usually require an operating system to function.
Even "system-level processes" (like "init" on Linux, or "svchost.exe" on Windows) rely on the "operating system" ... but are not themselves the operating system.
Agreeing with some of the comments above/below.
The OS is not a process. However, there are a few design variants that give the opposite illusion.
For example, if you are running FreeRTOS, there is no such thing as a separate OS address space and process address space; everything runs as a single process, and the FreeRTOS framework provides APIs that allow synchronization of the different tasks.
The operating system is just a set of APIs (system calls) and utilities that help achieve multi-processing, resource sharing, etc. For example, schedule() is a core OS function that handles the multi-processing capabilities of the OS.
In that sense, the OS is not a process, although it attaches to every process that runs on the CPU - otherwise, how would a process make use of the OS's APIs?
It is more like a soul for the body (the hardware), if you will.
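To make that concrete, here is a tiny hedged example: when a program makes a system call, the kernel code that services it runs in that same process's context (in kernel mode) rather than in some separate "OS process":

    // The kernel is entered from this process, not scheduled as its own process.
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Direct system call: the CPU switches to kernel mode, runs kernel
        // code on behalf of this process, and returns with the result.
        long pid = syscall(SYS_getpid);
        std::printf("getpid() returned %ld\n", pid);
        return 0;
    }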
It is not just one process but a set of (kernel) processes required to run the user processes in the system. PID 0 is the parent of all processes, providing scheduler/swapping functionality to the rest of the kernel/user processes, but it is not the only one. These kernel processes (with the help of kernel drivers) provide accessor functionality (through system calls) to the user processes.
It depends upon what you are calling the "operating system".
It depends upon what operating system you are talking about.
That said and at the risk of gross oversimplification, most of what one calls "the operating system" is generally executed from user processes while in kernel mode. The entry into kernel occurs either through an interrupt, trap or fault.
A context switch usually happens in one of two ways. Either a process causes a trap or fault, entering kernel mode to do something (like write to the disk); while in kernel mode, the process realizes it would have to wait, so it yields by switching the context to another process. Or a timer causes an interrupt, which forces the process into kernel mode; the process then determines which process should be executed next and switches the process context.
Some operating systems do have their own kernel processes that perform system functions, but that is increasingly rare.
Most operating systems have components that have their own processes.

High resolution timer on Coldfire (MCF5328)

I've inherited an embedded project that requires some simple, per-function performance profiling. It consists of a Coldfire (MCF5328) running uClinux (2.6.17.7-uc1).
I'm not an expert on either the Coldfire, or uClinux (or Linux for that matter), so excuse my ignorance.
In Windows I would simply use QueryPerformanceCounter to access the x86 high-resolution timer. Record the counter before and after and compare the difference.
I've learned that Linux has a number of variations on QueryPerformanceCounter:
clock_gettime/res
getnstimeofday
ktime_x
Or even access to the Time Stamp Counter via
get_cycles
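On a desktop Linux kernel I would expect to use the first of these, roughly like this (sketch only):

    // Read a monotonic clock before and after the code under test.
    #include <ctime>
    #include <cstdio>

    int main() {
        timespec start{}, end{};
        clock_gettime(CLOCK_MONOTONIC, &start);

        // ... code to profile ...

        clock_gettime(CLOCK_MONOTONIC, &end);
        long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                     + (end.tv_nsec - start.tv_nsec);
        std::printf("elapsed: %lld ns\n", ns);
        return 0;
    }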
None of these are available on the uClinux build this device is running. So it appears that the OS has no high-resolution timer access.
Does this mean that the Coldfire itself provides no such feature? Or did the author of the uClinux port leave them out? Is there something on the hardware that I can use, and how would I go about using it?
Given how old your kernel is, you may not have support for high-resolution timers.
If you are writing a kernel driver, the APIs are different. If get_cycles() is stubbed out, it probably means your CPU architecture doesn't support a cycle counter. Since your kernel is very old, do_gettimeofday is probably the best you can do, short of writing a driver to directly query some timer hardware.
I ended up using one of the four DMA Timers on the Coldfire. It was a simple matter to enable the timer as a free-running, non-interrupt-generating counter. This provides a 12.5 ns resolution counter (at 80 MHz).
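A hedged sketch of that approach is below. The DTIM0 base address, register offsets, and mode bits are written from memory and should be verified against the MCF5328x reference manual; on a no-MMU uClinux target the memory-mapped registers can be accessed directly.

    #include <cstdint>
    #include <cstdio>

    // Assumed base address of DMA Timer 0 -- verify against the reference manual.
    #define DTIM0_BASE 0xFC070000u
    #define DTMR0 (*(volatile uint16_t *)(DTIM0_BASE + 0x00))  // mode register
    #define DTCN0 (*(volatile uint32_t *)(DTIM0_BASE + 0x0C))  // free-running counter

    static void timer_init() {
        // Prescaler 0, internal bus clock, free-running, no interrupts, enabled.
        DTMR0 = 0x0003;
    }

    int main() {
        timer_init();
        uint32_t start = DTCN0;
        // ... code to profile ...
        uint32_t end = DTCN0;
        // Each tick is 12.5 ns at an 80 MHz internal bus clock.
        std::printf("elapsed: %lu ticks\n", (unsigned long)(end - start));
        return 0;
    }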