Race condition in KVM with hypercall KVM_HC_KICK_CPU

To implement efficient spinlocks in a VM environment, the KVM documentation says that
a vCPU waiting for a spinlock can execute the HLT instruction and let the spinlock-holder
vCPU get a chance to run; the holder vCPU can then execute the KVM_HC_KICK_CPU hypercall to wake the waiting vCPU.
Now here is my question:
Imagine the sequence of instructions below:
CHECK_SPIN_LOCK_FLAG
// <------------ the waiting vCPU gets scheduled out right before executing hlt
hlt
Now, when the spinlock-holder vCPU runs, releases the spinlock, and then
tries to kick the waiting vCPU, there is nothing to do because that vCPU is not halted (it is still runnable). However,
when the waiting vCPU is scheduled back in, it will execute the HLT instruction and stay halted, having missed the kick.
Is this a race condition in this hypercall design?
The following is an excerpt from Documentation/virt/kvm/x86/hypercalls.rst:
5. KVM_HC_KICK_CPU
------------------
:Architecture: x86
:Status: active
:Purpose: Hypercall used to wakeup a vcpu from HLT state
:Usage example:
A vcpu of a paravirtualized guest that is busywaiting in guest
kernel mode for an event to occur (ex: a spinlock to become available) can
execute HLT instruction once it has busy-waited for more than a threshold
time-interval. Execution of HLT instruction would cause the hypervisor to put
the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
is used in the hypercall for future use.
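For concreteness, the waking side in the guest kernel boils down to a single hypercall with the target's APIC ID in a1. Here is a minimal sketch, assuming the kvm_hypercall2() helper and the x86_cpu_to_apicid per-cpu mapping from the guest kernel (treat the exact headers and the kick_vcpu() name as assumptions, not the documented API):

#include <linux/kvm_para.h>   /* KVM_HC_KICK_CPU, kvm_hypercall2() (assumed header) */
#include <asm/smp.h>          /* x86_cpu_to_apicid (assumed header) */

/* Sketch: kick the vCPU backing logical CPU 'cpu' out of HLT.
 * Per the documentation above, a0 is reserved for future use (pass 0)
 * and a1 carries the target vCPU's APIC ID. */
static void kick_vcpu(int cpu)
{
        u32 apicid = per_cpu(x86_cpu_to_apicid, cpu);

        kvm_hypercall2(KVM_HC_KICK_CPU, 0, apicid);
}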

Okay, I am not sure whether I have solved the problem or not,
but it seems there is one more hypercall:
7. KVM_HC_SCHED_YIELD
---------------------
:Architecture: x86
:Status: active
:Purpose: Hypercall used to yield if the IPI target vCPU is preempted
a0: destination APIC ID
:Usage example: When sending a call-function IPI-many to vCPUs, yield if
any of the IPI target vCPUs was preempted.
We can use the above hypercall before kicking the CPU, to make sure the
target vCPU was indeed in the halted state. If it was preempted instead, we yield to it and only kick once the target is halted rather than preempted.
Solution as of now:
struct mutex_t {
    uint64_t intent_cpus;
    uint64_t waiting_cpus;
    uint64_t m;
};
get_lock () {
// rcx has the cpu id, rdx has the address of the mutex
retry:
xor %rax, %rax          // expected value of m: 0 (unlocked)
mov $1, %rbx            // value to store: 1 (locked)
// set the intent bit for this cpu
lock bts %rcx, (%rdx)
// try to take the lock: if m == %rax then m = %rbx; sets ZF on success
lock cmpxchg %rbx, 16(%rdx)
jz skip_hlt
// lock is held by someone else; mark ourselves as waiting for the kick
lock bts %rcx, 8(%rdx)
hlt
// woken by KVM_HC_KICK_CPU: clear the waiting bit and try again
lock btr %rcx, 8(%rdx)
jmp retry
skip_hlt:
// reset the intent bit for this cpu (bit offset is the source operand; modifies CF)
lock btr %rcx, (%rdx)
}
release_lock() { // rdx has the mutex address
// yield_to_vcpu() and kick_the_vcpu() stand for the KVM_HC_SCHED_YIELD
// and KVM_HC_KICK_CPU hypercalls with the target's APIC ID
xor %rax, %rax
lock xchgq %rax, 16(%rdx)       // m = 0, release the lock
bsf (%rdx), %rcx                // rcx = least significant set intent bit; ZF set if none
jz no_cpus_with_intent
check_for_intent:
bt %rcx, (%rdx)
jnc cpu_does_not_have_intent
bt %rcx, 8(%rdx)
jc yield_once_and_return
/* it has intent, but no waiting bit yet:
   yield so that it can either halt or take the lock */
yield_to_vcpu(%rcx)
jmp check_for_intent
yield_once_and_return:
/* here it is either halted or was preempted
   just before executing hlt */
yield_to_vcpu(%rcx)
kick_the_vcpu(%rcx)
cpu_does_not_have_intent:
no_cpus_with_intent:
}
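For what it's worth, here is the same release-side idea as a guest-kernel C sketch. The yield_to_vcpu()/kick_the_vcpu() placeholders above would map onto the hypercalls roughly like this; apicid_of() is a hypothetical mapping from a bit index to the target vCPU's APIC ID, and the bitmask handling is deliberately loose:

#include <linux/kvm_para.h>     /* kvm_hypercall1/2, KVM_HC_* */
#include <linux/bitops.h>       /* test_bit, find_first_bit */

struct pv_mutex {
        unsigned long intent_cpus;   /* vCPUs that want the lock */
        unsigned long waiting_cpus;  /* vCPUs that have (or are about to) HLT */
        unsigned long m;             /* 0 = free, 1 = held */
};

/* Sketch of release_lock(); apicid_of() is hypothetical. */
static void release_lock(struct pv_mutex *mu)
{
        int cpu;

        xchg(&mu->m, 0UL);                            /* drop the lock */
        cpu = find_first_bit(&mu->intent_cpus, 64);
        if (cpu >= 64)
                return;                               /* nobody has intent */

        /* intent set but not yet waiting: yield until it halts or takes the lock */
        while (test_bit(cpu, &mu->intent_cpus) &&
               !test_bit(cpu, &mu->waiting_cpus))
                kvm_hypercall1(KVM_HC_SCHED_YIELD, apicid_of(cpu));

        if (test_bit(cpu, &mu->intent_cpus)) {
                /* halted, or preempted right before hlt: yield once, then kick */
                kvm_hypercall1(KVM_HC_SCHED_YIELD, apicid_of(cpu));
                kvm_hypercall2(KVM_HC_KICK_CPU, 0, apicid_of(cpu));
        }
}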


How do I prevent the Raspberry Pi Pico from taking ALL background actions and interrupts to create a pulse in software?

My goal is to create a single pulse with the RPI pico by clearing and setting a hardware pin in software. I am attempting to do this in software because I did not see a way to provide a single non-repetitive pulse through one of the timer channels.
The resolution of the pulse width can be as much as 32ns, which should be easy to achieve with a 125MHz clock. Any pulse longer than 1us on the hardware pin will physically destroy the circuit.
In the simplest form the code should initialize a pin to the high state, pull the pin low, wait, and then set the pin high. There should be a way to predictably adjust the time between the low and high states.
In the code below, I would expect every NOP between the gpio_clr and gpio_set commands to increase the pulse width by either 8ns or 16ns. However, there is no consistent relation between the pulse width and the number of NOP instructions between the gpio_clr and gpio_set commands. Sometimes the increment is 16ns, other times the pulse is nearly a microsecond.
I have tried porting the C code to assembly and it did not change the outcome. Neither did save_and_disable_interrupts(). When placed into a while loop, the first pulse is usually over a microsecond and the other pulses are usually a consistent width.
When I view the disassembly of the C code, there are no instructions between the gpio_clr and gpio_set functions.
I have the impression that the Pico is taking a background action in between clearing and setting the pin. I am hoping someone knows how to make this code execute sequentially, as it was written.
I would accept an answer that demonstrated how to use a timer channel to provide a single, non-repetitive pulse with 32ns of resolution. The alarm functions seem to have approximately 7us of overhead which makes them unusable.
#include "pico/stdlib.h"
#include "hardware/sync.h"
int main() {
    const uint pulse_len = 1;
    gpio_init(6);
    sio_hw->gpio_set = 1u<<6;
    gpio_set_dir(6, GPIO_OUT);
    sio_hw->gpio_set = 1u<<6;
    uint i=0;
    //__asm("cpsid if");
    uint32_t istatus = save_and_disable_interrupts();
    //provide a single pulse
    sio_hw->gpio_clr = 1u<<6;
    //__asm("nop"); // I would expect each NOP to increase pulse width by the same amount
    //__asm("nop");
    //__asm("nop");
    //__asm("nop");
    //__asm("nop");
    //__asm("nop");
    sio_hw->gpio_set = 1u<<6;
    restore_interrupts(istatus);
    //__asm("cpsie if");
    while(true);
}

How is the cpu clock connected to other components

How is the CPU clock connected to other components? And what do people mean when they say all operations start at a clock tick?
The CPU clock drives the CPU. Internally there is a bus system, basically a bundle of electrical connections. Someone, and only one device at a time, may put its data on it. For example a register output, or the ALU result. (Well, they don't want to, they're told to do so by the control unit, which makes sure only one entity may access the bus in write mode.)
This operation is not instantaneous: the electrical signal may fluctuate several times while it propagates through the logic gates until it stabilizes, and some signals arrive earlier than others. This depends on gate delays and on capacitive and inductive effects that slow the signals down.
Because of this, no one reads the data off the bus until the clock triggers. The clock pulse indicates that enough time has passed that the signals ought to be safe, and the signal on the bus is assumed to be stable.
This is done by simply using an and-gate or an edge detector with the clock signal on the devices that want to read from the bus.
Example:
Data-Bus ----/8----- [ ]
Data-In ----------- [ R ]
Data-Out ----------- [ E ]
Clk ----------- [ G ]
' Data-Out may be asynchronous like this, though not recommended, or on falling/low clock pulse, or the Data-Out signal is clock synced:
Data-Bus[0] = Data-Out AND Data[0]
Data-Bus[1] = Data-Out AND Data[1]
Data-Bus[2] = Data-Out AND Data[2]
[...]
' Data-In will almost always be clock synced
If (Data-In AND Rising-Clk-Edge)
{
Data[0] = Data-Bus[0]
Data[1] = Data-Bus[1]
[...]
}
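To make the "read only on the clock edge" idea concrete, here is a tiny C model of one clocked register on a shared bus. It is purely illustrative; the names and structure are not from any real simulator:

#include <stdint.h>
#include <stdio.h>

/* A minimal model of one 8-bit register attached to a shared data bus.
 * The register drives the bus only while data_out is enabled, and it
 * latches the bus only on a rising clock edge while data_in is enabled. */
struct reg8 {
    uint8_t value;
    int data_in;    /* Write-Enable from the control unit */
    int data_out;   /* Output-Enable from the control unit */
    int prev_clk;   /* last clock level, for edge detection */
};

static void reg_tick(struct reg8 *r, int clk, uint8_t *bus)
{
    if (r->data_out)
        *bus = r->value;                 /* drive the bus (simplified) */

    if (r->data_in && clk && !r->prev_clk)
        r->value = *bus;                 /* latch on the rising edge only */

    r->prev_clk = clk;
}

int main(void)
{
    struct reg8 a = { .value = 0, .data_in = 1, .data_out = 0, .prev_clk = 0 };
    uint8_t bus = 0x05;                  /* someone else is driving 5 */

    reg_tick(&a, 0, &bus);               /* clock low: nothing latched */
    reg_tick(&a, 1, &bus);               /* rising edge: register captures 5 */
    printf("A = %u\n", a.value);         /* prints 5 */
    return 0;
}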
This is of course highly dependent on your actual hardware. For example, Read-Enable, Write-Enable and Output-Enable can be active low, etc.
There is a great YouTube series by a guy actually building a CPU on a breadboard. While this is of course overly simplified compared to what a modern CPU does, it helps to understand the basics.
The CPU clock itself is usually not directly connected to other hardware; instead the CPU generates the trigger pulses that tell other devices it is now safe to read from or write to the bus.
Each instruction may be made up from several microinstructions. For example:
LDA #5, Load 5 into the A-Register
' Fetch
Put IP on Address bus, Enable memory out => Opcode for LDA-Immediate is now on Data-Bus
Write-Enable Instruction Register, Increment IP
' Decode with combinatorial logic
' CU realizes it needs another word from memory (the value) and it needs to go to the A-Register
' Execute
Put IP on Address bus, Enable memory out => #5 now on data-bus
Increment IP, Write-Enable A-Register => #5 now in A-Register
' Done
All of this is driven by the CPU clock; the system clock has little to do with it.

After a process calls syscall wait(), who will wake it up?

I have a general idea that a process can be in a ready queue from which the CPU selects the candidate to run next. And there are other queues on which a process waits for (broadly speaking) events. I know from OS courses a long time ago that there are wait queues for IO and interrupts. My questions are:
There are many events a process can wait on. Is there a wait queue corresponding to each such event?
Are these wait queues created/destroyed dynamically? If so, which kernel module is responsible for managing these queues? The scheduler? Are there any predefined queues that will always exist?
To eventually get a waiting process off a wait queue, does the kernel have a way of mapping from each actual event (either hardware or software) to the wait queue, and then remove ALL processes on that queue? If so, what mechanisms does a kernel employ?
To give an example:
....
pid = fork();
if (pid == 0) { // child process
// Do something for a second;
}
else { // parent process
wait(NULL);
printf("Child completed.");
}
....
wait(NULL) is a blocking system call. I want to know the rest of the journey the parent process goes through. My take of the story line is as follows, PLEASE correct me if I miss crucial steps or if I am completely wrong:
Normal system call setup through libc runtime. Now parent process is in kernel mode, ready to execute whatever is in wait() syscall.
wait(NULL) creates a wait queue where the kernel can later find this queue.
wait(NULL) puts the parent process onto this queue and creates an entry in some map that says "If I (the kernel) ever receive a software interrupt, signal, or whatever that indicates that the child process is finished, the scheduler should come look at this wait queue".
The child process finishes and the kernel somehow notices this fact. The kernel context switches to the scheduler, which looks up in the map to find the wait queue the parent process is on.
Scheduler moves the parent process to ready queue, does its magic and sometime later the parent process is finally selected to run.
Parent process is still in kernel mode, inside wait(NULL) syscall. Now the main job of rest of the syscall is to exit kernel mode and eventually return the parent process to user land.
The process continues its journey on the next instruction, and may later be waiting on other wait queues until it finishes.
PS: I am hoping to learn the inner workings of the OS kernel: what stages a process goes through in the kernel and how the kernel interacts with and manipulates these processes. I do know the semantics and the contract of the wait() syscall API, and that is not what I want to know from this question.
Let's explore the kernel sources. First of all, it seems all the
various wait routines (wait, waitid, waitpid, wait3, wait4) end up in the
same system call, wait4. These days you can find system calls in the
kernel by looking for the macros SYSCALL_DEFINE1 and so on, where the number
is the number of parameters, which for wait4 is coincidentally 4. Using the
google-based freetext search in the Free Electrons Linux Cross
Reference we eventually find the definition:
1674 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
1675 int, options, struct rusage __user *, ru)
Here the macro seems to split each parameter into its type and name. This
wait4 routine does some parameter checking, copies them into a wait_opts
structure, and calls do_wait(), which is a few lines up in the same file:
1677 struct wait_opts wo;
1705 ret = do_wait(&wo);
1551 static long do_wait(struct wait_opts *wo)
(I'm missing out lines in these excerpts as you can tell by the
non-consecutive line numbers).
do_wait() sets another field of the structure to the name of a function,
child_wait_callback() which is a few lines up in the same file. Another
field is set to current. This is a major "global" that points to
information held about the current task:
1558 init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
1559 wo->child_wait.private = current;
The structure is then added to a queue specifically designed for a process
to wait for SIGCHLD signals, current->signal->wait_chldexit:
1560 add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
Let's look at current. It is quite hard to find its definition as it
varies per architecture, and following it to find the final structure is a
bit of a rabbit warren. E.g. current.h:
6 #define get_current() (current_thread_info()->task)
7 #define current get_current()
then thread_info.h
163 static inline struct thread_info *current_thread_info(void)
165 return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
55 struct thread_info {
56 struct task_struct *task; /* main task structure */
So current points to a task_struct, which we find in sched.h
1460 struct task_struct {
1461 volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
1659 /* signal handlers */
1660 struct signal_struct *signal;
So we have found current->signal out of current->signal->wait_chldexit,
and the struct signal_struct is in the same file:
670 struct signal_struct {
677 wait_queue_head_t wait_chldexit; /* for wait4() */
So the add_wait_queue() call we had got to above refers to this
wait_chldexit structure of type wait_queue_head_t.
A wait queue is simply an initially empty, doubly-linked list of structures that contain a
struct list_head, defined in types.h:
184 struct list_head {
185 struct list_head *next, *prev;
186 };
The call add_wait_queue() (in wait.c) temporarily locks the structure and,
via an inline function in wait.h, calls list_add(), which you can find in list.h.
This sets the next and prev pointers appropriately to add the new item on
the list.
An empty list has the two pointers pointing at the list_head structure.
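From memory, the heart of list_add() is just four pointer assignments; roughly like this (a sketch, see include/linux/list.h for the real code):

/* Insert 'new' between two known consecutive entries 'prev' and 'next'. */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next  = next;
        new->prev  = prev;
        prev->next = new;
}

/* list_add() inserts right after the head, i.e. at the front of the list. */
static inline void list_add(struct list_head *new, struct list_head *head)
{
        __list_add(new, head, head->next);
}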
After adding the new entry to the list, the wait4() system call sets a
flag that will remove the process from the runnable queue on the next
reschedule and calls do_wait_thread():
1573 set_current_state(TASK_INTERRUPTIBLE);
1577 retval = do_wait_thread(wo, tsk);
This routine calls wait_consider_task() for each child of the process:
1501 static int do_wait_thread(struct wait_opts *wo, struct task_struct *tsk)
1505 list_for_each_entry(p, &tsk->children, sibling) {
1506 int ret = wait_consider_task(wo, 0, p);
which goes very deep but in fact is just trying to see if any child already
satisfies the syscall, and we can return with the data immediately. The
interesting case for you is when nothing is found, but there are still running
children. We end up calling schedule(), which is when the process gives
up the cpu and our system call "hangs" for a future event.
1594 if (!signal_pending(current)) {
1595 schedule();
1596 goto repeat;
1597 }
When the process is woken up, it will continue with the code after
schedule() and again go through all the children to see if the wait
condition is satisfied, and probably return to the caller.
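In other words, do_wait() follows the classic sleep loop you will see all over the kernel. Schematically it looks like the sketch below; this is not the literal kernel code, and my_wq, my_wait and condition_is_satisfied() are placeholders:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

/* Schematic sleep loop: mark yourself sleeping, re-check the condition,
 * and only then give up the CPU. */
static int sleep_until_condition(wait_queue_head_t *my_wq,
                                 wait_queue_entry_t *my_wait)
{
        int ret = 0;

        add_wait_queue(my_wq, my_wait);
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);  /* leave the run queue on next schedule() */
                if (condition_is_satisfied())           /* e.g. a child has already exited */
                        break;
                if (signal_pending(current)) {          /* interrupted by a signal */
                        ret = -ERESTARTSYS;
                        break;
                }
                schedule();                             /* give up the CPU until wake_up() */
        }
        set_current_state(TASK_RUNNING);
        remove_wait_queue(my_wq, my_wait);
        return ret;
}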
What wakes up the process to do that? A child dies and generates a SIGCHLD
signal.
In signal.c
do_notify_parent() is called by a process as it dies:
1566 * Let a parent know about the death of a child.
1572 bool do_notify_parent(struct task_struct *tsk, int sig)
1656 __wake_up_parent(tsk, tsk->parent);
__wake_up_parent() calls __wake_up_sync_key() and uses exactly the
wait_chldexit wait queue we set up previously.
exit.c
1545 void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
1547 __wake_up_sync_key(&parent->signal->wait_chldexit,
1548 TASK_INTERRUPTIBLE, 1, p);
I think we should stop there, as wait() is clearly one of the more
complex examples of a system call and the use of wait queues. You can find
a simpler presentation of the mechanism in this 3 page Linux Journal
article from 2005. Many things
have changed, but the principle is explained. You might also buy the books
"Linux Device Drivers" and "Linux Kernel Development", or check out the
earlier editions of these that can be found online.
For the "Anatomy Of A System Call" on the way from user space to the kernel
you might read these lwn articles.
Wait queues are heavily used throughout the kernel whenever a task
needs to wait for some condition. A grep through the kernel sources finds
over 1200 calls of init_waitqueue_head() which is how you initialise a
waitqueue you have dynamically created by simply kmalloc()-ing the space
to hold the structure.
A grep for the DECLARE_WAIT_QUEUE_HEAD() macro finds over 150 uses of
this declaration of a static waitqueue structure. There is no intrinsic
difference between these. A driver, for example, can choose either method
to create a wait queue, often depending on whether it can manage
many similar devices, each with their own queue, or is only expecting one device.
No central code is responsible for these queues, though there is common
code to simplify their use. A driver, for example, might create an empty
wait queue when it is installed and initialised. When you use it to read data from some
hardware, it might start the read operation by writing directly into the
registers of the hardware, then queue an entry (for "this" task, i.e. current) on its wait queue to give up
the cpu until the hardware has the data ready.
The hardware would then interrupt the cpu, and the kernel would call the
driver's interrupt handler (registered at initialisation). The handler code
would simply call wake_up() on the wait queue, for the kernel to
put all tasks on the wait queue back in the run queue.
When the task gets the cpu again, it continues where it left off (in
schedule()) and checks that the hardware has completed the operation, and
can then return the data to the user.
So the kernel is not responsible for the driver's wait queue, as it only
looks at it when the driver calls it to do so. There is no mapping from the
hardware interrupt to the wait queue, for example.
If there are several tasks on the same wait queue, there are variants of
the wake_up() call that can be used to wake up only 1 task, or all of
them, or only those that are in an interruptable wait (i.e. are designed to
be able to cancel the operation and return to the user in case of a
signal), and so on.
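As a concrete illustration of that driver pattern, a minimal sketch might look like this; the device names and the data-ready flag are made up, while the wait-queue calls are the regular kernel API:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/interrupt.h>

static DECLARE_WAIT_QUEUE_HEAD(mydev_wq);   /* static wait queue for this driver */
static int mydev_data_ready;                /* condition the reader waits on */

/* Called from the read path: start the hardware, then sleep until
 * the interrupt handler says the data has arrived. */
static int mydev_wait_for_data(void)
{
        mydev_data_ready = 0;
        /* ... write device registers to start the transfer ... */
        return wait_event_interruptible(mydev_wq, mydev_data_ready);
}

/* Interrupt handler registered at init time: it does not touch the run
 * queue itself, it only wakes whoever sleeps on the wait queue. */
static irqreturn_t mydev_irq(int irq, void *dev_id)
{
        mydev_data_ready = 1;
        wake_up(&mydev_wq);
        return IRQ_HANDLED;
}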
In order to wait for a child process to terminate, a parent process will just execute a wait() system call. This call will suspend the parent process until any of its child processes terminates, at which time the wait() call returns and the parent process can continue.
The prototype for the wait() call is:
#include <sys/types.h>
#include <sys/wait.h>
pid_t wait(int *status);
The return value from wait is the PID of the child process which terminated. The parameter to wait() is a pointer to a location which will receive the child's exit status value when it terminates.
When a process terminates it executes an exit() system call, either directly in its own code, or indirectly via library code. The prototype for the exit() call is:
#include <stdlib.h>
void exit(int status);
The exit() call has no return value as the process that calls it terminates and so couldn't receive a value anyway. Notice, however, that exit() does take a parameter value - status. As well as causing a waiting parent process to resume execution, exit() also returns the status parameter value to the parent process via the location pointed to by the wait() parameter.
In fact, wait() can return several different pieces of information via the value to which the status parameter points. Consequently, a macro is provided called WEXITSTATUS() (accessed via <sys/wait.h>) which can extract and return the child's exit status. The following code fragment shows its use:
#include <sys/wait.h>
int statval, exstat;
pid_t pid;
pid = wait(&statval);
exstat = WEXITSTATUS(statval);
In fact, the version of wait() that we have just seen is only the simplest version available under Linux. The new POSIX version is called waitpid. The prototype for waitpid() is:
#include <sys/types.h>
#include <sys/wait.h>
pid_t waitpid(pid_t pid, int *status, int options);
where pid specifies what to wait for, status is the same as the simple wait() parameter and options allows you to specify that a call to waitpid() should not suspend the parent process if no child process is ready to report its exit status.
The various possibilities for the pid parameter are:
< -1 wait for a child whose PGID is -pid
-1 same behavior as standard wait()
0 wait for child whose PGID = PGID of calling process
> 0 wait for a child whose PID = pid
The standard wait() call is now redundant as the following waitpid() call is exactly equivalent:
#include <sys/wait.h>
int statval;
pid_t pid;
pid = waitpid(-1, &statval, 0);
It is possible for a child process which only executes for a very short time to terminate before its parent process has had the chance to wait() for it. In these circumstances the child process will enter a state, known as a zombie state, in which all its resources have been released back to the system except for its process data structure, which holds its exit status. When the parent eventually wait()s for the child, the exit status is delivered immediately and then the process data structure can also be released back to the system.
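To tie the options parameter and the zombie discussion together, here is a small user-space sketch of a parent that polls for a dead child with WNOHANG instead of blocking (illustrative only):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {             /* child */
        sleep(1);
        exit(42);
    }

    for (;;) {                  /* parent: poll without blocking */
        int statval;
        pid_t done = waitpid(pid, &statval, WNOHANG);

        if (done == 0) {        /* child has not terminated yet */
            printf("child not finished yet\n");
            sleep(1);
        } else if (done == pid) {
            if (WIFEXITED(statval))
                printf("child exited with status %d\n", WEXITSTATUS(statval));
            break;              /* child reaped, no zombie left behind */
        } else {
            perror("waitpid");
            break;
        }
    }
    return 0;
}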

Implementing SPI slave ISR on PIC32?

I have two PIC32MX microcontrollers that are connected over a 1.53MHz SPI bus with Chip Select. I am having trouble getting my slave side interrupt service routine to transmit data correctly. As a test case, I'm having the master send out two bytes (0x01, 0x00) every 10 ms. The slave is supposed to receive the 0x01 command id and respond with a 0x02 when the master sends the 2nd byte (the dummy 0x00).
Ideally each transfer should look like this.
Master Slave
0x01 0x00
0x00 0x02
I'm really not sure where to start with the slave interrupt though. I'm using a fifo buffer called airsysTx to hold data that needs to be shifted out the next time the master makes a request. The slave receives the 0x01 from the master just fine and writes 0x02 to the fifo buffer when it does. I'm not sure how to code the interrupt so that it will be sure to transmit correctly. The code I have below is a good start, but it's wrong. Suggestions?
/*******************************************************************************
* Interrupt service routine for SPI3 interrupts from Air MCU.
* The user's code at this vector should perform any application specific
* operations and MUST clear the SPI3 interrupt flags before exiting.
******************************************************************************/
void __ISR(_SPI_3_VECTOR, ipl7) _SPI3Interrupt()
{
    BYTE MasterCMD;
    SET_D1();   // Set debug LED
    // RX INTERRUPT
    if(IFS0bits.SPI3RXIF) // receive data available in SPI3BUF Rx buffer
    {
        MasterCMD = SPI3BUF;
        if(MasterCMD == 0x01)
        {
            airsysTxFlush();
            airsysTxWrite(0x02);
        }
    }
    // Transmit data if needed.
    if(SPI3STATbits.SPITBE)
    {
        if(!airsysTxIsEmpty())
        {
            SPI3BUF = airsysTxRead();
        }
        else
        {
            // Else write 0 to the tx buffer to clear the spi shift reg
            SPI3BUF = 0x00;
        }
    }
    IFS0bits.SPI3RXIF = 0;
    IFS0bits.SPI3TXIF = 0;
    IFS0bits.SPI3EIF = 0;
    SPI3STATbits.SPIROV = 0; // clear the Overflow
    CLEAR_D1(); // CLEAR debug LED
} // end ISR
What this code is actually transmitting is something like this:
Master Slave
0x01 0x02
0x00 0x01
Generally you can't write a slave SPI driver to interact in the way you describe, because you can't control the timing precisely as a slave. What generates your ISR: is it Rx of the first byte from the master, or assertion of chip select?
As the slave, you need to have set up the data bytes you want to transmit before the master starts the transaction. You usually don't have time to react to the first byte. There are a couple of ways to do this:
1) You could use a protocol where master does a 1 or 2 byte write-only transaction that tells the slave what it wants to read. Then master waits a few milliseconds to allow the slave to prepare the response. Then master does a read-only transaction to get the slave response.
2) If using DMA or a FIFO, the slave preloads the first padding byte(s) into the fifo before the master starts the transaction. Then, when you get the ISR, you put the remaining response data into the fifo (without a flush). You need enough pad bytes to accommodate the slave's ISR latency in forming the response. So, for example, you may define your protocol such that the master knows the first N bytes of the response are pad bytes, followed by response data. The padding requirement depends on your master clock speed and the slave CPU speed/interrupt latency. A sketch of this approach follows below.
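Applied to the transfer in this question, option 2 might look roughly like the sketch below. It reuses the SPI3 register names and airsysTx* fifo helpers from the code above, but the one-pad-byte protocol (the master treats the first reply byte as padding) is an assumption:

/* Sketch: keep the response one byte behind the master by preloading a
 * pad byte, so the shift register never underruns while the ISR runs. */

void spi3_slave_start(void)
{
    airsysTxFlush();
    SPI3BUF = 0x00;          /* pad byte sits in the transmit buffer/shift reg */
    /* ... SPI3 already configured for slave mode with RX interrupt enabled ... */
}

void __ISR(_SPI_3_VECTOR, ipl7) _SPI3Interrupt(void)
{
    if (IFS0bits.SPI3RXIF)               /* a byte arrived from the master */
    {
        BYTE cmd = SPI3BUF;
        if (cmd == 0x01)
            airsysTxWrite(0x02);         /* queue response; no flush, keep the pad byte */
    }

    if (SPI3STATbits.SPITBE)             /* keep the transmitter topped up */
        SPI3BUF = airsysTxIsEmpty() ? 0x00 : airsysTxRead();

    IFS0bits.SPI3RXIF = 0;
    IFS0bits.SPI3TXIF = 0;
    SPI3STATbits.SPIROV = 0;             /* clear any receive overflow */
}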

Zilog Z80 - How to use Interrupt mode 1 (IM 1 Instruction)

I want to use IM 1 interrupt mode on Z80.
In interrupt mode 1 the processor jumps to address 38h in memory (am I right?) and then continues with the interrupt. How can I specify this in my code?
I have read about:
defs count[, fill] / ds count[, fill]: this pseudo-instruction
inserts a block of bytes into the code segment
I need some sample source code.
Kind regards,
Rafał R.
First off, I don't have a Z80 in front of me.
Referencing: Z80asm directives
Use org to 'manually' locate a 'function' at a specified address.
So, to write an IM1 handler:
org 0x38
; IM1 handler
ld a, 100 ; ... whatever
ret
Also, I'm not sure what your normal starting address is, but the original Z80s started at location 0. If this is the case you should JP past the 0x38 handler very early in your code. (You only have 56 bytes to play with.)
Happy Coding!
In IM 1, upon spotting a pending interrupt (which is sampled on the rising edge of the last cycle before the end of an opcode; the IRQ line is just sampled, unlike NMI) IFF1 and 2 are cleared and an RST 38h is executed. So you should end up with the PC at 0x38, interrupts disabled and the old program counter on the top of the stack. You'll want to do whatever you have to do to respond to the interrupt, then perform an EI, RET or EI, RETI (there being no difference here because the two IFF flags have the same value following the interrupt acknowledge).
On a Z80 the PC is set to 0 upon power up or reset so probably you already have some control over the code down at that end of memory. Exact syntax depends on your assembler, but you probably want something like:
org 0
; setup initial state here, probably JP somewhere at the end
; possibly squeeze in another routine if you've the space
org 0x38
; respond to interrupt
EI
RET
I have figured out what to do when you are not starting from 0h:
org 1800h
START: ;Do the start, but it can't take more than 0x38 (56) bytes before the handler
LD SP, 0x2000 ;Initialize SP!
JP MAIN ;Continue to rest of the program
ds 0x1838-$,0 ;Allocate block of memory for interrupt handler
INT:
;interrupt sub
LD E, 0
LD A, E
OUT (066), A
EI
RETI
ds 0x1840-$,0 ;Alloc space for the rest of program.
MAIN:
;Rest of program here
If you lay it out like this, the system ends up executing a JP 01838h at address 038h, so the handler is reached correctly. Also, remember to initialize the stack pointer; if you don't, you will not be able to return from the interrupt handler to the program.