There are many different types of traps listed in processor datasheets, e.g. BusFault, MemManage Fault, Usage Fault and Address Error.
What is their purpose? How can they be utilized in fault handling?
Traps are essentially subroutine calls that are forced by the processor when it detects something unusual in your stream of instructions. (Some processors make them into interrupts, but that's mostly just pushing more context onto the stack; this gets more interesting if the trap includes a switch between user and system address spaces).
This is useful for handling conditions that occur rarely but need to be addressed, such as division by zero. Normally, it is useless overhead to have an extra pair of instructions to test the divisor for zero before executing a divide instruction, since the divisor is never expected to be zero. So the architects have the processor do this check in parallel with the actual divide as part of the divide instruction, and cause the processor to trap to a divide-by-zero routine if the divisor is zero. Another interesting case is illegal-memory-address; clearly, you don't want to have to code a test to check each address before you use it.
Often there are a variety of fault conditions of potential interest and the processor by design will pass control to a different trap routine (often set up as a vector) for each different type of fault.
Once the processor has a trap facility, the CPU architects find lots of uses for it. A common use is debugger breakpoints, and another is trap-to-OS for executing an operating system call.
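To make the breakpoint use concrete, here is a minimal sketch, assuming an x86 CPU and a Unix-like OS that delivers the breakpoint trap to the process as SIGTRAP; the handler name and message are just illustrative:

#include <signal.h>
#include <stdio.h>

/* int3 is the one-byte breakpoint trap instruction a debugger plants in code. */
static volatile sig_atomic_t trapped = 0;

static void trap_handler(int signo)
{
    (void)signo;
    trapped = 1;                      /* just note that the trap was taken */
}

int main(void)
{
    signal(SIGTRAP, trap_handler);
    __asm__ volatile ("int3");        /* force the breakpoint trap ourselves */
    printf("breakpoint trap %s taken\n", trapped ? "was" : "was not");
    return 0;
}

Run under a debugger, that same instruction would simply stop the program at that point instead.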
Microprocessors have traps for various fault conditions. They are synchronous interrupts that allow the running OS / software to take appropriate action on the error. Traps interrupt program flow and set register bits to indicate the fault. Debugger breakpoints are also implemented using traps.
In a typical computing environment, the operating system takes care of CPU traps triggered by user processes. Let's consider what happens when I run the following program:
int main(void)
{
    volatile int a = 1, b = 0;
    a = a % b; /* div by zero */
    return 0;
}
An error message was displayed, and my box kept running as if nothing had happened. My operating system's approach to fault handling in this case was to kill the offending process and inform the user with the error message "Floating point exception".
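For completeness, the process can also field this trap itself instead of dying. A minimal sketch using the classic sigsetjmp/siglongjmp idiom, assuming a POSIX system that delivers the fault as SIGFPE as described above:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recover;

static void fpe_handler(int signo)
{
    (void)signo;
    siglongjmp(recover, 1);   /* never return into the faulting instruction */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = fpe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGFPE, &sa, NULL);

    volatile int a = 1, b = 0;
    if (sigsetjmp(recover, 1) == 0)
        a = a % b;            /* traps; the kernel turns it into SIGFPE */
    else
        printf("recovered from the divide-by-zero trap\n");
    return 0;
}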
Traps in kernel mode are more problematic. It is not as straightforward for the OS to take corrective action when it is itself at fault, since for a system process there is no underlying layer of protection. This is why faulty device drivers can cause real problems.
When working on bare metal, without the comforting protection of an operating system, the situation is much the same as the one above. The number one objective for achieving continuous and correct operation is to catch all potential trap conditions before they can trigger any traps, using assertions and higher-level error handlers. Consider traps the last line of defense, a safety net you don't intentionally want to fall into.
Defining behaviors for trap handlers is worth some thought, even if they "should never happen". They will be executed when things go wrong in an unanticipated manner, in the most extreme case because a cosmic ray flipped a bit in RAM. Unfortunately, there is no single correct answer to what error handlers should do.
Code Complete, 2nd ed:
The style of error processing that is most appropriate depends on the kind of software the error occurs in and generally favors more correctness or more robustness. Strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; no result is better than an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that leads to results that are inaccurate sometimes.
Clearly, my operating system's fault handling is designed with robustness in mind; I can execute flawed code and do pretty much anything without crashing the system. Designing solely for robustness would mean a recovery attempt whenever possible, and if all else fails, reset. This is a suitable approach if your product is e.g. a toy.
Safety-critical applications need a bit more paranoia and should favor correctness instead: when a fault is detected, write an error log and shut down. We don't want our radiation therapy unit to pick dosage levels from invalid garbage values.
The ARMv7-M (not to be confused with the ARM7 or the ARMv7-A) Cortex-M3 technical reference manual, which may also be part of one of the newer ARM ARMs (ARM Architecture Reference Manuals), has a section describing each of these faults.
Now the whys, versus the whats, are perhaps at the root of the question. The why is usually so that you have a chance to recover. Imagine your set-top box or telephone hits one of these faults: do you want it to hang, or to try to recover if possible? Unless you are expecting one of these faults (which in this context you shouldn't be; x86 systems and some of their faults are a completely different story), surviving long enough to hit one usually means you end up pulling the trigger on yourself (the software killing itself by resetting the processor or system). You can go through the long list and try to find ones you can recover from. Divide by zero: how is the exception handler to know what math mistake led to this? In general it can't. Unaligned load or store: how is the handler to know what that code was trying to do? Like divide by zero, it is probably a software bug. Undefined instruction: the code went into the weeds and executed data; by that point you are most likely already too far gone to recover. And for any kind of memory bus fault, the handler cannot repair the hardware.
You have to go through every fault and, for each one, define how you are going to handle it: all the ways you could have gotten to that fault, and the way you get out of, or handle, each of those paths. Occasionally you might be able to recover; otherwise you need a default action, for example hanging the processor in an infinite loop in the handler so that a software engineer, if available, can get in with a debugger and find where the code stopped. Or have a watchdog timer, inside or outside the chip depending on the chip and board design (outside the chip, the WDT will often reset the whole board). You might have some non-volatile memory in which you attempt to store the fault before letting or causing the reset, although the time and code it takes to do that might lead you into another fault, depending on what is failing.
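As a sketch of such a default action on an ARMv7-M part: the fault status register addresses below are architecturally defined, while nvram_log() is a hypothetical routine standing in for whatever non-volatile logging the board actually has.

#include <stdint.h>

/* Hypothetical non-volatile logging hook -- not a real library call. */
extern void nvram_log(uint32_t key, uint32_t value);

/* ARMv7-M fault status registers, at their architecturally defined addresses. */
#define SCB_CFSR (*(volatile uint32_t *)0xE000ED28)  /* Configurable Fault Status */
#define SCB_HFSR (*(volatile uint32_t *)0xE000ED2C)  /* HardFault Status */

void HardFault_Handler(void)   /* the usual CMSIS vector-table name */
{
    /* Record what we can, then refuse to continue: correctness over robustness. */
    nvram_log(0xFA01, SCB_CFSR);
    nvram_log(0xFA02, SCB_HFSR);

    for (;;) {
        /* Spin with outputs left in a safe state; let the watchdog reset the
           board, or let an attached debugger inspect the stopped system. */
    }
}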
Simply put, they allow you to execute code when something exceptional happens in the processor. They're sometimes used by the OS for error recovery.
I'm asking because I wonder how robust I should make my programs against device losses.
Should I only expect devices to be lost in the case of, say, hardware errors, driver bugs, improper API usage or non-terminating shader programs; or should I also expect device loss in such cases as, say, suspending and resuming my laptop, minimizing the application window, or just randomly because the implementation felt like it?
It's unfortunately going to vary by GPU, driver, and OS, which leads to the somewhat vague spec wording that krOoze quoted:
A logical device may become lost because of hardware errors, execution timeouts, power management events and/or platform-specific events.
For reference, there is nothing in the Android OS itself that would require a device loss -- e.g. it doesn't force a device loss when an app goes into the background or the screen is turned off.
But it's likely that some driver/hardware combinations will report a device-lost error if there is a GPU exception (or reset), unless the driver can guarantee that nothing from your VkDevice could have been affected. That's a surprisingly difficult guarantee to make: even if your queues weren't running at the time the problem occurred, some of your data might still have been sitting in dirty cache lines, and if the reset invalidates those lines instead of writing them back to memory, your data will be corrupted. An exception/reset can be caused by hardware or driver bugs, or by any app on the system hitting a watchdog timeout (an infinite loop in a shader is the easy example, but even a shader that is making progress yet simply takes too long can trigger it).
In practice, these should be fairly rare events, and I believe (without data) that these days it's primarily caused by hotplug (rare) or misbehaving hardware/driver/app events rather than more routine things like device sleep.
Since testing your recovery code is going to be difficult and it'll therefore likely be buggy, my recommendation would be to just do something heavy-handed but simple, like saving application state and either restarting your app automatically, or quitting and asking the user to restart. Depending on what you're building, it might be reasonable to do something more sophisticated like tearing down and restarting+restoring your renderer system without taking down the rest of your app.
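For illustration, a heavy-handed sketch of that approach: the application hooks (save_application_state, teardown_renderer, rebuild_renderer) are hypothetical names, and only vkQueueSubmit and VK_ERROR_DEVICE_LOST come from the Vulkan API itself.

#include <vulkan/vulkan.h>
#include <stdio.h>

/* Hypothetical application hooks -- illustrative names, not a real API. */
extern void save_application_state(void);
extern void teardown_renderer(void);   /* destroy the VkDevice and everything made from it */
extern void rebuild_renderer(void);    /* recreate device, swapchain, pipelines, ... */

/* Treat VK_ERROR_DEVICE_LOST from a submission as "save, tear down, rebuild". */
void submit_frame(VkQueue queue, const VkSubmitInfo *submit, VkFence fence)
{
    VkResult res = vkQueueSubmit(queue, 1, submit, fence);
    if (res == VK_ERROR_DEVICE_LOST) {
        fprintf(stderr, "device lost: saving state and rebuilding the renderer\n");
        save_application_state();
        teardown_renderer();
        rebuild_renderer();
    }
}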
I just wanted to know the basic difference among them.
I have found in some places that a TRAP is essentially also called a software interrupt, or something like an exception.
Also, what is the basic difference between a software interrupt and an exception?
A software interrupt can be generated by the INT instruction, but a TRAP can only be generated by certain scenarios like divide by zero? Is that right?
Kindly give a suitable answer to this query which covers software interrupts, traps, and exceptions.
The terminology is indeed a bit blurry and may depend on the CPU vendor.
What is clear is that a hardware interrupt is triggered by a hardware signal and makes the CPU enter a predefined ISR. These are exceptions triggered by (typically external) hardware.
The notion of a trap varies a bit between CPU vendors. On non-Intel CPUs (e.g. a 68000 or PowerPC), traps can be software interrupts: those CPUs have a dedicated instruction for them, TRAP #xx on the 68000. On x86 the corresponding instruction is INT xxx, on ARM it is SWI/SVC, and on PowerPC the system-call instruction sc (or the tw/twi trap instructions) plays the same role. This is an exception deliberately triggered by a user program (typically used to enter the operating system).
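As a concrete sketch of the "enter the operating system" use, assuming 32-bit x86 Linux, where int $0x80 is the trap into the kernel and write is system call number 4:

#include <stddef.h>

/* Invoke the kernel directly through the software-interrupt/trap instruction,
   bypassing the C library (32-bit x86 Linux ABI assumed). */
static long sys_write(int fd, const void *buf, size_t len)
{
    long ret;
    __asm__ volatile ("int $0x80"            /* trap into the kernel */
                      : "=a" (ret)           /* result comes back in eax */
                      : "0" (4),             /* eax = 4: the write system call */
                        "b" (fd), "c" (buf), "d" (len)
                      : "memory");
    return ret;
}

int main(void)
{
    sys_write(1, "hello via trap\n", 15);
    return 0;
}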
Traps in the Intel world are exceptional conditions, like division by zero or other errors such as invalid memory access (but they might also be triggered by the single-step flag being set). Other CPU vendors simply call that an exception. These are exceptions typically triggered by erroneous programs or by conditions the CPU is not able to handle during normal program flow.
And all of them are normally called exceptions.
I'm trying to figure out this basic scenario:
Suppose my CPU receives an exception or an interrupt. What I do know is that the CPU starts to perform an interrupt service routine (it looks at the IDTR register to locate the IDT, and goes to the appropriate entry to retrieve the ISR address), but in what context does that code run?
Meaning, if I have a thread currently running and it generates an interrupt of some sort, in which context will the ISR run: in the initial process that "holds" the thread, or in some other magical thread?
Thanks!
Interesting question, which raises a few different issues.
The first is that interrupts don’t actually run inside of any thread from the CPU’s perspective. Indeed, the CPU itself is barely aware of threads; it may know a bit more if it has hyper threading or some similar technology, but a thread is really an operating system thing (or, sometimes, an application thing).
The second is that ISRs (Interrupt Service Routines) generally run at some elevated privilege level; you don’t really say which processor family you’re talking about, so it’s difficult to be specific, but modern processors normally have at least one special mode that they enter for handling interrupts — often with its own register bank. One might also ask, as part of your question, whose page table is active during an interrupt?
Third is the question of whose memory map ISRs have when they are entered. The answer, again, is going to be highly processor specific; it’s possible to imagine architectures that disable paging on ISR entry, other architectures that switch automatically to an interrupt page table, and (probably the most common approach) those that decide not to bother doing anything about the page table when entering an ISR.
The fourth is that some operating systems have policies of their own on these kinds of things. A common approach on modern operating systems is to make ISRs themselves as short as possible, and where any significant work needs to be done, convert the interrupt into some kind of event that can be handled by a kernel thread (or even, potentially, by a user thread). In this kind of system, the code that actually handles an interrupt may well be running in a specific thread, though it probably isn’t actually an interrupt service routine at that point.
Summary:
ISRs themselves don’t really run in the context of any given thread.
ISRs may run with the page table of the interrupted thread (depends on architecture).
ISRs may start with a copy of that thread’s registers (depends on architecture).
In modern systems, ISRs commonly try to schedule an event and then exit quickly. That event might be handled by a specific thread (e.g. for processor exceptions, it’s usually delivered as a signal or Structured Exception or similar to the thread that caused it); or by a pool of threads (e.g. to service I/O in the kernel).
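A user-space analogue of that "record the event, do the work later in a normal context" pattern, in POSIX C: SIGINT stands in for the hardware interrupt, and the main loop stands in for the kernel thread.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_event = 0;

/* The "interrupt handler": do as little as possible, just record the event. */
static void on_interrupt(int signo)
{
    (void)signo;
    got_event = 1;
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    for (;;) {
        pause();                     /* sleep until some "interrupt" arrives */
        if (got_event) {
            got_event = 0;
            printf("handling the deferred event in ordinary thread context\n");
        }
    }
}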
If you’re interested in the specifics for x86 (I guess you are, as you use some Intel specific terms in your question), you need to look at the Intel 64 and IA-32 Architectures Software Developer’s Manual, volume 3B, and you’ll need to look at the operating system documentation. x86 is a very complicated architecture compared to some others — for instance, it can optionally perform a task switch on interrupt delivery (if you put a “task gate” in the IDT), in which case it will certainly have its own set of registers and quite possibly its own page table; even if this feature is used by a given operating system, there is no guarantee that x86 tasks map straightforwardly (or at all) to operating system processes and/or threads.
Is it possible that a computer gives a wrong result due to a hardware error? For example, if I told the CPU to calculate 6 times 9 (both integers) many times, would all the calculations give the correct answer? If there is a possibility that some of the calculations go wrong, why is that, and is there any mechanism that blocks the wrong answer inside the CPU?
There are a few possibilities:
Operating the CPU outside of its specification (too much heat, too much voltage) could result in erratic behavior.
If an interrupt fires in the middle of a non-atomic operation, and the interrupt modifies the data involved, weird behavior can happen. For example: you're attempting a 16-bit operation on an 8-bit processor, perhaps calculating A * B, and a timer interrupt fires and changes the value of A halfway through the multiplication. (This is really considered a software bug, not a hardware fault; see the sketch after this list.)
There are always cosmic rays. Chips are so small these days that you can't really blame any problems on them, but they are a concern if you have a multi-year autonomous system.
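Here is the sketch promised above, showing the non-atomic-read hazard and the usual fix; disable_interrupts()/enable_interrupts() are hypothetical stand-ins for the platform's real intrinsics.

#include <stdint.h>

extern void disable_interrupts(void);   /* hypothetical platform intrinsics */
extern void enable_interrupts(void);

volatile uint16_t timer_count;          /* incremented by a timer interrupt */

/* On an 8-bit CPU this 16-bit read takes two bus accesses; the interrupt can
   fire between them, so the two halves come from different counts. */
uint16_t read_count_racy(void)
{
    return timer_count;
}

/* Fix: make the two-byte read atomic with respect to the interrupt. */
uint16_t read_count_safe(void)
{
    uint16_t copy;
    disable_interrupts();
    copy = timer_count;
    enable_interrupts();
    return copy;
}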
As for preventing faults: during the Space Race, the launch computer used triple-redundant logic to verify every computation. STMicroelectronics has a line of fault-tolerant dual-core microcontrollers that run the same code on both cores in lockstep, and a fault condition is raised if the two cores disagree.
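The voting part of that idea fits in a few lines. A sketch, where unit_a_result() and friends are hypothetical stand-ins for three redundant computation units (the historical machines and the ST parts do this voting in hardware):

#include <stdint.h>

/* Hypothetical results from three redundant computation units. */
extern int32_t unit_a_result(void);
extern int32_t unit_b_result(void);
extern int32_t unit_c_result(void);

/* Return the value at least two of the three inputs agree on. */
static int32_t vote3(int32_t a, int32_t b, int32_t c)
{
    if (a == b || a == c)
        return a;
    return b;      /* b == c; if all three disagree, there is no majority */
}

int32_t voted_result(void)
{
    return vote3(unit_a_result(), unit_b_result(), unit_c_result());
}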
An educational principle is: there is no such thing as a stupid question. The basic idea behind this is that people learn by asking.
I was asked: "Can you show and explain, at a programming level, what bad things would happen if every task could execute all instructions?"
I gave this code:

int main(void)
{
    __asm__("cli");   /* disable interrupts: a privileged instruction */
    while (1);
}
and explained it (on a UP system, the machine is frozen for good).
Then I was asked: "Is it possible to give an example where the system does not freeze, even though this clearing of interrupts is done?"
I modified the previous example:
int i;

int main(void)
{
    __asm__("cli");    /* disable interrupts */
    i = i / 0;         /* page fault on the not-yet-present data page, then divide by zero */
    while (1);
}
and explained it.
Trivially: if we have demand paging, i = i / 0 first causes a page fault (the data page is not present), so another task can be scheduled to run, with interrupts enabled, during the disk read; later on, the divide by zero throws this task away for good.
But those answers were based on UP. What about SMP? I must admit the answers are incomplete.
It is still easy enough to construct one:
#include <unistd.h>

int i;

int main(void)
{
    for (i = 0; i < 100; i++)       /* suppose we have fewer than 100 CPUs */
        if (fork()) {
            sleep(5);               /* the forking task (most probably) has time to do all the forks */
            __asm__("cli");
            while (1);
        }
}
which will disable interrupts for all CPUs, because every CPU gets a poisonous task to run.
Even such a "stupid" question revealed many things that are good for a beginner to learn: privileged instructions, paging, fault handling, scheduling during DMA, fork, and so on.
But a minor doubt remains (shame on me) about the first program running on an SMP system.
Will one CPU be out permanently or not?
The other CPUs continue and can send a re_schedule() IPI message.
What happens then?
It is easy to speculate that the frozen CPU does not wake up, because interrupts are disabled.
But to be perfectly sure, one must know more.
My question was:
Is the Inter Processor Interrupt (IPI) maskable or non-maskable?
I mean in the most common "popular" implementations?
Excuse my stupid question. It can't be very difficult to find an answer. I will seek it.
I mean, it has an ordinary interrupt pin/vector number (which would suggest it is maskable, I guess).
My own answer - correct?
I studied the issue, since nobody else seemed to take it up, and came to the following thoughts:
For important real-time applications we have long had watchdog timers (hardware that interrupts the CPU so that it must somehow answer "I am alive").
For example, we might have a main control computer and a standby computer that takes over if the main computer goes down.
What about Linux?
What kind of watchdog do we have, if any?
We can compile the Linux kernel with or without the watchdog.
What does the Linux watchdog do?
On many (!) x86/x86-64 systems the hardware has a feature that enables us to generate 'watchdog NMI interrupts'.
It's even possible to disable the NMI watchdog in run-time by writing "0" to /proc/sys/kernel/nmi_watchdog.
If any CPU in the system does not execute the periodic local timer interrupt for more than 5 seconds, the APIC tries to fix the situation with a non-maskable interrupt (the CPU executes the handler and kills the offending process)!
(SCC Linux is a different case with respect to the NMI.)
My answers (in the original question) were based on a system without the watchdog!
It is problematic to answer at a general level and give examples based on some fixed system. The answers can be correct or not depending on the CPU, configuration, and settings.
Anyway, did talking about the NMI make some sense? Did it?
If the CPU didn't restrict access to some instructions, it would be too easy to accidentally or deliberately cause a catastrophe.
push $0          # eight bytes of zero on the stack...
push $0          # ...serving as a 6-byte IDT descriptor: limit = 0, base = 0
lidt (%esp)      # load IDTR with that descriptor
int $42          # raise interrupt 42, far beyond the 1-byte IDT limit
This code sequence will reset an x86 processor. Here's why:
The lidt instruction loads the IDTR register with an interrupt descriptor table (IDT) at linear address 0, with a size of one byte.
The int $42 raises interrupt 42, which can't work because it is beyond the one-byte limit of the IDT.
The CPU then tries to raise a general protection fault, interrupt 13. This fails too, because interrupt 13 is also beyond the one-byte limit.
The CPU then tries to raise a double fault exception, interrupt 8. This fails too, because interrupt 8 is beyond the limit of the IDT.
This is known as a triple-fault. The CPU does a shutdown bus cycle to tell the motherboard that it is now ignoring everything and stopping execution. The motherboard asserts reset, rebooting the machine.
This is actually negligible compared to what code could do. A code sequence could easily hijack the machine altogether: it could destroy all of the data on the hard drive, send all of your files to a malicious server on the internet, change your password, enable remote access, or connect out to a malicious server and grant an attacker unlimited shell access. There is no limit to what a program could do.
Processors have privileged instructions for two reasons. The primary purpose is to protect the operating system from buggy programs that might accidentally do something to bring down or hijack the whole machine. The secondary purpose is to restrict deliberately malicious programs from doing the same.