Intentional infinite loop in Metal shader function doesn't block the GPU - gpu

I am building a command-line tool using Metal. I intentionally put a for(;;) inside my kernel function, and I expect to see the mac's display will freeze until timeout happens. I tried in some Metal application with a MTLView, and the machine got hanged.
The error message is
Execution of the command buffer was aborted due to an error during execution.
Caused GPU Timeout Error (IOAF code 2)
But this doesn't happen to the command-line tool I am building. I don't know why it doesn't block the GPU. I thought all of my threads will keep taking the GPU resource and thus the display will freeze.
Why this is not happening?

Related

exceptions vs interrupts in RiscV?

I read that exceptions are raised by instructions and interrupts are raised by external events.
But I couldn't understand few things:
In which type we stop immediately and in which we let the current line of code finish?
In both cases and after doing 1 we run the next program, is that right?
The terminology is used differently by different authors.
The hardware exception mechanism is used by both software caused exceptions, and by external interrupts.
In which type we stop immediately and in which we let the current line of code finish?
The processor can do what its designers want up to a point.
If there is an external interrupt, the processor may choose to finish some of the work in progress before taking the interrupt.
If there is a software caused interrupt, the processor simply cannot execute that instruction (or any instructions after) and must abort its execution, and take the exception without completing the offending instruction.
In both cases and after doing 1 we run the next program, is that right?
In the case of external interrupt, the most efficient is to resume the program that was interrupted, as the caches and page tables are hot for that process.  However, an external interrupt may change the runnable status of another higher priority process, or may end the timeslice of the interrupted process, indicating another process should now get CPU time instead of the one interrupted.
In the case of a software caused exception, it is could be a non-resumable situation (null pointer dereference), in which case, yes another process get the CPU after.  It could be a resumable situation, like I/O request or page fault.  If servicing the situation requires interfacing with peripherals, that may block the interrupted process, so another process would get some CPU time.

Would a Vulkan program run on a device without gpu (discrete or integrated)?

Perhaps this question could be rephrased as 'what would happen if I were to try and run a Vulkan program on a cpu-only build'.
I'm wondering whether the program would run but not produce output, crash or not build in the first place (although I expect the building process to be for a cpu architecture instead of a gpu architecture).
Would it use the on-motherboard graphics to produce output? In that case, what would happen if the program was run on a cpu-only server?
Depends on how the program initialized vulkan.
Any build can have the vulkan loader installed this is the dynamically loaded library that finds the actual driver, if that is missing the program would be unable to load the loader and may either fail to start or show an error message, depending on how they try and load that.
If no device is available then the number of devices is 0. This is again up to the application to manage. Either by going for an alternative graphics API (opengl) or a error message and failing to start.

How can I reset the CUDA error to success with Driver API after a trap instruction?

I have a kernel, which might call asm("trap;") inside kernel. But when that happens, the CUDA error code is set to launch fail, and I cannot reset it.
In CUDA Runtime API, we can use cudaGetLastError to get the last error and in the mean time, reset it to cudaSuccess.
Is there a way to do that with Driver API?
This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function.
There are two types of CUDA runtime errors: "sticky" and "non-sticky". "non-sticky" errors are those which do not corrupt the context. For example, a cudaMalloc request that is asking for more than the available memory will fail, but it will not corrupt the context. Such an error is "non-sticky".
Errors that involve unexpected termination of a CUDA kernel (including your trap example, also in-kernel assert() failures, also runtime detected execution errors such as out-of-bounds accesses) are "sticky". You cannot clear "sticky" errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset() (which eliminates all device allocations, and wipes out the context).
The corresponding driver API function is cuDevicePrimaryCtxReset()
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

how to detect ECC error in memroy testing under UEFI shell

I wrote a EFI binary file to test physical DIMMs under UEFI shell, the process is quite simple - first write a test pattern in to a physical address, then read it out and compare with the original pattern.
However, the DIMMs might encounter correctable or uncorrectable errors. Normally all the correctable ECC would be corrected by hardware automatically and BIOS would handle this (log this error and clean the error registers), uncorrectable errors would typically caused BIOS to issue a NMI, then system hang.
The problem is my test program doesn't know error happens - correctable errors are masked by BIOS FW and uncorrectable errors make system hang...
Is there any method to let the test program know ECC error happens? I would appreciate any advice you may have. Thanks!
I believe that to do this your program will need ultimate control of the hardware. That means it needs to boot completely and remove the EFI environment.
Once you have done that then your program can handle all of the interrupts and CPU registers that indicate ECC errors.
Once done your program would do a soft reset and that would boot the system back into EFI.

Linux Kernel Code Execution Contexts

When a process executing in the user space issues a system call or triggers an exception, it enters into the kernel space and kernel starts executing on behalf of the process. Kernel is said to be executing in the process context. Similarly when an interrupt occurs kernel executes in the interrupt context. I have studied about kernel execution in kernel thread, where kernel processes runs in the background.
My Questions are :
Does the kernel execute in any other contexts?
Suppose a process in the user space never executes a system call or triggers an exception or no interrupt occurs, does the kernel code ever execute ?
The kernel runs periodically, it sets a timer to fire an interrupt at some predefined frequency (100 Hz (Linux 2.4/x86), 1000Hz (early Linux 2.6/x86), 250Hz (newer Linux 2.6/x86)).
The kernel need to do this in order to do preemptive multitasking. OTOH, OSes only doing cooperative multitasking (Windows 3.1, classic Mac OS) needn't do this, and only switch tasks on response to some call from the running task (which could lead to runaway tasks hanging the whole system).
Note that there is some effort to optimize the use of this timer: newer Linux is smarter when there are no runnable tasks, it sets the timer as far in the future as it can, to allow the CPU to sleep longer and deeper, and preserve power (the CONFIG_NOHZ kernel config option). Running powertop will show the number of wakeups per second, which on an idle system can be much lower than the 250 wakeups per second you'd expect of a traditional implementation.
Suppose a process in the user space never executes a system call or triggers an exception or no interrupt occurs, does the kernel code ever execute ?
Assume you have a process p that is running the following code: while(1);. This code will never call into the kernel and won't cause any faults. (It might have set an alarm(3) earlier, causing a signal to be delivered in the future, or it might exceed the setrlimit(2) CPU limit, in which cases the kernel will deliver a signal to the process.)
Or, if another process sends p a signal via kill(2), the kernel will deliver that signal to the process as well.
The signal delivery will either cause a signal handler to run, do nothing (if the signal is ignored or masked), or take the default signal action (which might be nothing or termination).
And, of course, the process execution can be interrupted so the processor can handle interrupts; or a higher-priority process can preempt it.