VMCALL is quite similar to the SYSENTER instruction, differing in that SYSENTER is meant for system calls (a fast transition to the OS), while VMCALL is for hypercalls (a transition to the hypervisor).
My question: SYSENTER does not save the CPU state, so does the same apply to VMCALL? Issuing a VMCALL causes a VM exit, but I am not sure whether it saves the guest CPU state to the associated VMCS structure.
If it does save the CPU state, then how exactly can we pass arguments in a hypercall?
The VMCS region is divided into six areas, one of which is the guest-state area.
The guest-state area stores RIP, RSP and RFLAGS on every VM exit; the rest of the guest GPRs are still live in hardware immediately after a VM exit.
VMCALL does nothing but cause an unconditional VM exit. How registers are used to pass arguments is left to the API of the VMM.
From the Linux KVM API documentation:
Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively. The hypercall number should be placed in rax and the return value will be placed in rax. No other registers will be clobbered unless explicitly stated by the particular hypercall.
From the Intel 64 and IA-32 Architectures Software Developer's Manual:
this instruction does nothing more than cause a VM exit, registering the appropriate exit reason.
From the above I conclude that VMCALL itself does not save the general-purpose registers; only RIP, RSP and RFLAGS go into the VMCS, so hypercall arguments can simply be passed in the GPRs, which are still live when the VMM takes control.
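To make the argument passing concrete, here is a minimal guest-side sketch following the KVM register convention quoted above. It is only an illustration: the wrapper name and the two-argument signature are my own, and on AMD hardware the instruction would be VMMCALL rather than VMCALL.

```c
#include <stdint.h>

/* Hypothetical hypercall wrapper: number in rax, arguments in rbx/rcx,
 * result back in rax, per the KVM convention quoted above. */
static inline long hypercall2(unsigned long nr,
                              unsigned long a0, unsigned long a1)
{
    long ret;

    asm volatile("vmcall"
                 : "=a"(ret)
                 : "a"(nr), "b"(a0), "c"(a1)
                 : "memory");
    return ret;
}
```

Because the VM exit leaves the GPRs untouched, the VMM can simply read them out of the saved vCPU state after the exit and write the return value back into rax before resuming the guest.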
Each process's virtual address space consists of user space and kernel space. As pointed out by many articles, the kernel space of all processes is mapped to the same physical memory, i.e. there is only one kernel in physical memory. But each process has its own kernel stack, which is part of its kernel space. How does the same mapping work for all processes when they have different kernel stacks?
Note: This is the OS-agnostic answer. Details do vary slightly with the OS in question (e.g. Darwin and continuations), and possibly with the architecture (ARMv8, x86, etc.).
When a process performs a system call, the user mode state (registers) is saved, including the user mode stack pointer. At that point, a kernel mode stack pointer is loaded, which is usually maintained somewhere in the thread control block.
You are correct in saying that there is only one kernel space. It follows that, in theory, one thread in kernel space could easily see and/or tamper with any other thread's kernel memory (just as threads of the same process can "see" each other in user space). This, however, is (almost always) theory only, since kernel code presumably respects memory boundaries (as user-mode code is assumed to, with thread-local storage, etc.). That said, "almost always", because if the kernel code can be exploited, then all of kernel memory is laid bare to the exploiter, and can potentially be read and/or compromised.
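A minimal, OS-agnostic sketch of the idea (all names and sizes here are hypothetical): every thread's control block records its own private kernel stack, even though all of those stacks live in the one shared kernel address space.

```c
#include <stdint.h>

#define KSTACK_SIZE 8192   /* assumed per-thread kernel stack size */

/* Hypothetical thread control block. */
struct thread_control_block {
    uint64_t user_sp;        /* user-mode stack pointer saved on entry     */
    uint64_t user_regs[16];  /* saved user-mode general-purpose registers  */
    uint8_t *kernel_stack;   /* base of this thread's private kernel stack */
};

/* Pseudologic for system-call / interrupt entry:
 *
 *   tcb->user_sp = current_sp();               // save user-mode state
 *   set_sp(tcb->kernel_stack + KSTACK_SIZE);   // switch to this thread's kernel stack
 *
 * The kernel mapping is the same for everyone, but each thread's
 * kernel_stack points to a different region within it. */
```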
Just for fun, I'm trying to design a more complex Z80 CP/M system with a lot of peripheral devices. While reading the documentation I stumbled over an (undocumented?) behaviour of the Z80 CPU when it accepts an interrupt in IM0.
When an interrupt occurs, the Z80 activates M1 and IORQ to signal the external device: "Hey, give me an opcode". All is well if the opcode is rst 00h or something like that. But the documentation says that ANY opcode of ANY instruction can be given to the CPU, for instance a CALL.
But now comes the undocumented part: "The first byte of a multi-byte instruction is read during the interrupt acknowledge cycle. Subsequent bytes are read in by a normal memory read sequence."
A "normal memory read sequence". How can I determine, if the CPU wants to get a byte from memory or instead the next byte from the device?
EDIT: I think I found a (good?) solution: I can detect the start of the interrupt acknowledge cycle by analyzing IORQ and M1, and I can detect the next "normal" opcode fetch by analyzing MREQ and M1. This way I can install a flip-flop triggered by these two ANDed signals, i.e. the flip-flop is 1 as long as the CPU reads data from the I/O device. This 1 I can use to inhibit the bus drivers to and from the memory.
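As a sanity check, the gating described in this edit can be modelled in a few lines of C (signal names are shown active-high for readability; the real Z80 signals are active-low):

```c
#include <stdbool.h>

/* 1 while the CPU should fetch instruction bytes from the interrupt
 * controller instead of from memory. */
static bool feed_active;

/* Call once per bus cycle with the decoded control signals. */
void bus_cycle(bool m1, bool iorq, bool mreq)
{
    if (m1 && iorq)           /* interrupt acknowledge cycle: start feeding */
        feed_active = true;
    else if (m1 && mreq)      /* next normal opcode fetch: stop feeding     */
        feed_active = false;
}

/* Output used to inhibit the memory bus drivers while feeding. */
bool memory_inhibited(void)
{
    return feed_active;
}
```

One thing to watch: if the injected instruction is a CALL, the CPU also performs stack-push write cycles before the next M1 cycle, and those writes do have to reach memory, so the inhibit signal may need to spare write cycles.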
My intentions? I'm designing an interrupt controller with 8 prioritized inputs in a CPLD. Its registers hold a 16-bit address for each interrupt pin. Just for fun :-)
My understanding is that the peripheral device is required:
to know how many bytes it needs to feed;
to respond to normal read cycles following the IORQ cycle; and
to arrange that whatever would normally respond to memory read cycles does not do so for the duration.
Also the behaviour was documented by Zilog in an application note, from which your quote originates (presumably uncredited).
In practice I guess 99.99% of IM0 users just use an RST and 99.99% of the rest use a known-size instruction like CALL xxxx.
(Also, I'm aware of a few micros that effectively guaranteed not to put anything onto the bus during the interrupt acknowledge cycle; with the open-collector, pulled-up data bus the CPU then reads 0xFF, i.e. RST 38h, which turns IM0 into a synonym for IM1.)
The interrupt behavior is reasonably documented in the Z80 manual:
Interrupt modes: IM2 allows the device to supply an 8-bit value that is combined with the I register into a 16-bit pointer to the handler address (sketched below). That is at least halfway to the desired 16-bit direct address.
How to set the interrupt modes
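For reference, here is a small C sketch of how the IM2 pointer mentioned above is formed (purely illustrative; the function is not from any library): the I register supplies the high byte, the byte placed on the bus by the peripheral supplies the low byte, and the CPU fetches the 16-bit handler address from that location.

```c
#include <stdint.h>

/* Compute the address the Z80 would jump to in IM2, given the I register,
 * the byte supplied by the device during the acknowledge cycle, and a
 * 64 KB view of memory. */
uint16_t im2_handler_address(uint8_t i_reg, uint8_t device_byte,
                             const uint8_t mem[65536])
{
    uint16_t table_entry = (uint16_t)((i_reg << 8) | device_byte);

    /* Little-endian 16-bit fetch of the handler address from the table. */
    return (uint16_t)(mem[table_entry] |
                      (mem[(uint16_t)(table_entry + 1)] << 8));
}
```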
My understanding is that the M1 + IORQ combination is used because there was no pin left for a dedicated interrupt-acknowledge response. A fun detail is also that the Zilog I/O chips such as the PIO, SIO and CTC watch for the RETI instruction (as the CPU fetches it) to learn that the CPU is ready to accept another interrupt.
I am using an STM32F103 chip with a Cortex-M3 core in a project. According to section 3.3.1 (Cortex-M3 instructions) of the manual, loading a 32-bit word with a single LDR instruction takes 2 CPU cycles to complete (assuming the destination is not PC).
My understanding is that this is only true for reading from internal memories (Flash or internal SRAM).
When reading from an external SRAM via the FSMC, the read must take more cycles to complete. During the read operation, does the CPU stall until the FSMC is able to put the data together? In other words, do I lose CPU cycles when accessing external memories?
Thank you.
Edit 1: Also assume all accesses are aligned 32-bit accesses.
LDR and STR instructions are not interruptible. The FSMC is bridged from the AHB and can run at a much slower rate, as you already know. For reads, the pipeline will stall until the data is ready, and this may increase worst-case interrupt latency. A write may or may not stall the pipeline, depending on configuration. The reference manual says there is a two-word write buffer, but it appears it may only be used to buffer bursting memories. If you were using a CRAM (PSRAM) with a bursting interface, subsequent writes would likely not complete before the next instruction is executing, but a subsequent read would stall (longer) to allow the write to finish before initiating the read.
If you use LDM and STM instructions to perform multiple reads or writes, these instructions are interruptible, and it is implementation-defined whether they restart from the beginning or continue where they left off when execution returns. I haven't been able to find out how ST has chosen to implement this behavior. In either case, each individual bus transaction should not be interrupted.
In regards to LDRD and STRD for working on 64-bit values, I found this discussion which references the following from the ARM-ARM:
"... LDRD, ... STRD, ... instructions are executed as a sequence of
word-aligned word accesses. Each 32-bit word access is guaranteed to
be single-copy atomic. The architecture does not require subsequences
of two or more word accesses from the sequence to be single-copy
atomic."
So, it appears that LDRD and STRD are likely to function the same way LDM and STM function.
The STM32F1xx FSMC has programmable wait states; if that is not set to zero for your memory, then reads will indeed take additional cycles. The data bus for the external memory is either 16 or 8 bits wide, so 32-bit accesses will also take additional cycles. The write FIFO can cause the insertion of wait states as well.
On the other hand, the Cortex-M3 is a Harvard architecture core with different memories on different buses, so instruction and data fetches can occur simultaneously, minimising to some extent processor stalling.
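One way to answer the "do I lose cycles" question empirically is to time a load with the Cortex-M3 DWT cycle counter. A rough sketch, assuming the standard CMSIS device header and that your external SRAM sits at the FSMC Bank1 base (0x60000000 is an assumption; use whatever your board maps):

```c
#include "stm32f10x.h"                                /* CMSIS device header      */

#define EXT_SRAM ((volatile uint32_t *)0x60000000)    /* assumed FSMC Bank1 base  */

static uint32_t cycles_for_read(volatile uint32_t *p)
{
    uint32_t start, end;

    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace/DWT unit    */
    DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;       /* start the cycle counter  */

    start = DWT->CYCCNT;
    (void)*p;                                         /* the LDR under test       */
    end = DWT->CYCCNT;

    return end - start;   /* includes a couple of cycles of counter-read overhead */
}
```

Comparing the result for an internal-SRAM address against EXT_SRAM shows the extra wait states directly; the difference is time the core spends stalled.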
In my opinion:
soft reset: boots from the reset vector.
hard reset: pulls the electrical reset line of the CPU.
A hard reset certainly means that the whole CPU chip and all its peripherals are reset. The causes could be many: the reset pin pulled externally, clock failure, on-chip low-voltage detection, watchdog, illegal instruction traps, etc.
A soft reset probably means a "dirty" branch back to the reset vector, where the reset vector code restores all CPU core registers including the stack pointer. I would say this is very questionable practice and I'm not sure what good it would do. The main problem is that the MCU's peripheral hardware registers will -not- get reset to their defaults when you do this. It is almost impossible not to make any assumptions about the reset state of such registers, especially since the average MCU comes with 1000+ of them nowadays. So with this soft & dirty reset, you will most likely end up with behaviour somewhere between:
subtle intermittent bugs <= my program <= complete haywire
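For concreteness, the "soft & dirty" reset being criticised here is typically nothing more than this (sketched for a Cortex-M-style part with its vector table at address 0; not a recommendation):

```c
typedef void (*reset_handler_t)(void);

static void dirty_soft_reset(void)
{
    /* Word 1 of the vector table holds the reset handler's address;
     * word 0 holds the initial stack pointer, which is not reloaded here. */
    reset_handler_t reset =
        (reset_handler_t)(*(volatile unsigned long *)0x00000004UL);

    reset();   /* peripherals keep whatever state they had -- hence the bugs */
}
```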
More far-fetched, a soft reset could mean a reset caused by software. In that case it could be writing the wrong value to the watchdog register to force a reset, failing to refresh the watchdog, or executing an illegal instruction. These will most likely cause a complete reset of the whole chip.
This can vary from chip to chip, I assume. A hard reset is probably agreed to be the reset line on the device (pin, ball, etc.) which, when pulled in a certain direction, puts some or all of the chip into reset. A soft reset could be as simple as a branch to zero or to the reset vector, or it could be a register you write, or a bit in a register, that causes a hard reset or something close to one. Imagine layers inside the chip: the hard reset hits the outer layer, the soft reset hits some inner layer, possibly not the whole chip; for example, maybe you don't want to fall off the PCIe bus, so you leave that logic alone. Normally JTAG (or some portion of it), for example, shouldn't be touched by either reset. When software pulls a reset line it kills itself, so who is going to release that reset? Something in hardware; there are many ways to solve this, but if it is solved with something that has a digital component, that digital section shouldn't get hit by the reset the software triggered, or you again get stuck, unable to release it.
On an Intel platform, a soft reset (writing 0x4 to port 0xcf9) is a warm CPU reset, i.e. a reset while the CPU is running. A warm reset (writing 0x6 to port 0xcf9) is a host reset without a power cycle, and a hard reset (writing 0xe to port 0xcf9) is a host reset with a power cycle. A global reset is a reset of the Intel ME combined with a host reset.
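A sketch of what those port 0xCF9 (Reset Control Register) writes look like with GCC inline assembly; this must run in ring 0 (or with I/O privilege), and the exact bit choreography varies by chipset, so treat it as illustrative only:

```c
#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    asm volatile("outb %0, %1" : : "a"(val), "Nd"(port));
}

static void cf9_hard_reset(void)
{
    outb(0xCF9, 0x02);   /* set SYS_RST first without triggering the reset         */
    outb(0xCF9, 0x0E);   /* SYS_RST | RST_CPU | FULL_RST: host reset + power cycle */
}
```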
A cold CPU reset is the assertion of RESET# while power is initially being supplied to the CPU. A warm CPU reset is when INIT# or RESET# is asserted while V_cc and CLK remain within specified operating limits. If you only assert INIT#, the CPU just flushes the BTB and TLBs, initialises only the integer registers and goes to the restart MSROM routine (no longer just 0xFFFFFFF0 on UEFI systems). If you assert RESET#, it flushes the caches as well and initialises the FP registers, not just the integer registers. This is the initial state of the registers, I think, before the microcode begins. If you assert both INIT# and RESET# together, it performs a BIST as well. I think in this case it re-performs the BSP selection process, a.k.a. the MP initialisation protocol, because a BIPI is sent to all-including-self after the BIST completes, and I think it also performs BSP selection when there is no BIST, i.e. just a RESET# when warm/cold (this talks about sending a BIPI after an optional BIST on reset). On modern Intel CPUs, I think RESET# is one per socket and resets all cores, and is tied to the PCH PLTRST#, whereas INIT is sent by the PCH over DMI in a PCIe VLW transaction and is distributed on a core-by-core basis to the cores specified in a CPU register like QPIPNCB.
A warm reset is an assertion of PLTRST# by the PCH, which goes to many components, and the system stays in S0. On a hard reset, the system cycles down through SLP_S0# to SLP_S5# and then back up through SLP_S5# to SLP_S0# to end up in S0 C0 (when PLTRST# is eventually deasserted); this resets DRAM, which PLTRST# on its own doesn't do. SLP_S0# through SLP_S5# all high means the CPU is in S0 C0. SLP_S0# low means it is in S0 Cx; SLP_S0# and SLP_S3# low means it is in S3; SLP_S0#, SLP_S3# and SLP_S4# low means it is in S4, and so on.
A cold reset, I think, is when the system boots from G3 and needs to go through PCH RTCRST# and EC RSMRST# before returning to the state it was in before G3, which could be DeepSx, S5 or S4. But you will see people call the hard reset a cold reset, and the cold reset a cold boot. I would probably use the terms hard reset and cold boot. A warm boot would be an S3 resume and a cold boot would be booting from S4/S5/G3; maybe you could call S4/S5 a hard boot and G3 a cold boot.
It can mean whatever the system designer wants it to mean. There is no generic definition. For example, the content of RAM may be maintained through a soft reset, but not through a hard reset, or it may simply be the difference between an external hardware reset signal and a software RESET instruction.
I was just reading up on how Linux works in my OS book when I came across this:
[...] the kernel is created as a single, monolithic binary. The main reason is to improve performance. Because all kernel code and data structures are kept in a single address space, no context switches are necessary when a process calls an operating-system function or when a hardware interrupt is delivered.
That sounded quite amazing to me; surely it must store the process's context before running off into kernel mode to handle an interrupt. But OK, I'll buy it for now. A few pages on, while describing a process's scheduling context, it says:
Both system calls and interrupts that occur while the process is executing will use this stack.
"this stack" being the place where the kernel stores the process's registers and such.
Isn't this a direct contradiction of the first quote? Am I misinterpreting it somehow?
I think the first quote is referring to the differences between a monolithic kernel and a microkernel.
Linux being monolithic, all its kernel components (device drivers, scheduler, VM manager) run at ring 0. Therefore, no context switch is necessary when performing system calls and handling interrupts.
Contrast microkernels, where components like device drivers and IPC providers run in user space, outside of ring 0. Therefore, this architecture requires additional context switches when performing system calls (because the performing module might reside in user space) and handling interrupts (to relay the interrupts to the device drivers).
"Context switch" could mean one of a couple of things, both relevant: (1) switching from user to kernel mode to process the system call, or an involuntary switch to kernel mode to process an interrupt against the interrupt stack, or (2) switching to run another user process in user space, with a jump to kernel space in between the two.
Any movement from user space to kernel space implies saving enough user-space state to return to it reliably. If the kernel-space code then decides that, while you're no longer running the user code for that process, it's time to let another user process run, that's where the second kind of switch comes in.
So at the least you're talking about 2-3 stacks or places to store a "context": hardware interrupts need a kernel-level stack to record what to return to; user method/subroutine calls use a standard stack for getting that done; and so on.
The original Unix kernels (and the model isn't that different now for this part) ran system calls like a short-order cook processing breakfast orders: move this over on the stove to make room for the order of bacon that just arrived, start the bacon, go back to the first order. All of it in the kernel, switching context as needed. It was not some huge monitoring application, which probably drove the IBM and DEC software folks mad.
When making a system call in Linux, a switch is made from user space to kernel space (ring 3 to ring 0). Each process has an associated kernel-mode stack that is used by the system call. On entry to the system call, the process's user-mode CPU registers are saved on this kernel-mode stack; it is distinct from the user-mode stack, which the process uses for its user-space execution.
When a process is in kernel mode (or user mode), calling functions of the same mode does not require a context switch. This is what the first quote refers to.
The second quote refers to the kernel mode stack, and not the user-mode stack.
Having said this, I must mention Linux optimisations where no transition to kernel space is needed to execute a system call, i.e. all processing related to the system call is done in user space itself (thus no context switch). vsyscall and the vDSO are such techniques. The idea behind them is quite simple: export to user space the data that is required to satisfy the corresponding system call. More info can be found in this LWN article.
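As a quick illustration of the vDSO path: on a typical x86-64 Linux system the call below is usually resolved entirely in user space, so no ring transition happens even though it looks like a system call.

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;

    /* On most modern glibc/kernel combinations this goes through the vDSO
     * rather than a real syscall instruction. */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```

Running it under strace should show no clock_gettime system call when the vDSO path is taken.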
In addition to this, there have been research projects in which all execution happens in the same ring: user-space programs and the OS code both reside in the same ring. The idea is to get rid of the overhead of ring switches. Microsoft's Singularity OS is one such project.