Should I expect "device lost" conditions as normal under Vulkan? - error-handling

I'm asking because I wonder how robust I should make my programs against device losses.
Should I only expect devices to be lost in the case of, say, hardware errors, driver bugs, improper API usage or non-terminating shader programs; or should I also expect device loss in such cases as, say, suspending and resuming my laptop, minimizing the application window, or just randomly because the implementation felt like it?

It's unfortunately going to vary by GPU, driver, and OS, which leads to the somewhat vague spec wording that krOoze quoted:
A logical device may become lost because of hardware errors, execution timeouts, power management events and/or platform-specific events.
For reference, there is nothing in the Android OS itself that would require a device lost -- e.g. it doesn't force a device-lost when an app goes into the background or the screen is turned off.
But it's likely that some driver/hardware combinations will report a device loss after a GPU exception (or reset) unless the driver can guarantee that nothing belonging to your VkDevice could have been affected. That's a surprisingly difficult guarantee to make: even if your queues weren't running when the problem occurred, some of your data might still have been sitting in dirty cache lines, and if the reset invalidates those lines instead of writing them back to memory, your data is corrupted. An exception/reset can be caused by hardware or driver bugs, or by any app on the system hitting a watchdog timeout (an infinite loop in a shader is the easy example, but even a shader that makes progress while simply taking too long can trigger it).
In practice, these should be fairly rare events, and I believe (without data) that these days it's primarily caused by hotplug (rare) or misbehaving hardware/driver/app events rather than more routine things like device sleep.
Since testing your recovery code is going to be difficult and it'll therefore likely be buggy, my recommendation would be to just do something heavy-handed but simple, like saving application state and either restarting your app automatically, or quitting and asking the user to restart. Depending on what you're building, it might be reasonable to do something more sophisticated like tearing down and restarting+restoring your renderer system without taking down the rest of your app.
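The heavy-handed approach boils down to one pattern: check every submit for VK_ERROR_DEVICE_LOST and bail out through a save-and-restart path. A minimal C sketch of that pattern, with the Vulkan calls replaced by hypothetical stubs (`submit_frame`, `save_state_and_schedule_restart`) so the control flow stands alone:

```c
#include <assert.h>

/* Stand-ins for the Vulkan result codes involved; in a real program
 * these come from <vulkan/vulkan.h>. */
typedef enum { VK_SUCCESS = 0, VK_ERROR_DEVICE_LOST = -4 } VkResult;

/* Hypothetical frame-submit stub that "loses" the device on frame 2,
 * standing in for vkQueueSubmit/vkQueuePresentKHR. */
static VkResult submit_frame(int frame) {
    return (frame == 2) ? VK_ERROR_DEVICE_LOST : VK_SUCCESS;
}

static int state_saved;

/* Hypothetical recovery hook: persist whatever the app needs, then
 * either restart the process or tear down and rebuild the renderer. */
static void save_state_and_schedule_restart(void) {
    state_saved = 1;
}

/* Returns the number of frames rendered before the device was lost. */
static int render_loop(int max_frames) {
    for (int frame = 0; frame < max_frames; ++frame) {
        if (submit_frame(frame) == VK_ERROR_DEVICE_LOST) {
            /* Everything derived from the VkDevice is now invalid;
             * don't try to limp along with stale handles. */
            save_state_and_schedule_restart();
            return frame;
        }
    }
    return max_frames;
}
```

The key design point is that recovery is all-or-nothing: once the device is lost, every handle derived from it is suspect, so the sketch does not attempt to reuse anything.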

Related

How to interpret Hardware watchdog exceptions on a ESP chip?

For one of our projects we have a hardware watchdog reset which happens on roughly 0.1% of our devices each day, resulting in many unwanted hardware resets.
We are trying to figure out what causes this Hardware Watchdog reset, but have failed to find anything relevant in our code which would result in this behavior.
We are using the Arduino core version 2.4.2. We are not sure how long the problem has been affecting our solution, since we had other issues which have now mostly been resolved.
Luckily our devices send us their reboot reasons when they reconnect, there we are receiving the following:
ResetReason=Hardware Watchdog;ResetInfo=Fatal exception:4 flag:1 (WDT)
epc1:0x40102329 epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000
depc:0x00000000;
We ran the exception through the EspStackTraceDecoder and ended up with:
0x40102329: wDev_ProcessFiq at ??:?
Searching through various projects that have asked similar questions, most seemed to involve a DNS query -- but not all, so it seems to be a more general issue?
What additional information could we extract that might help us identify the issue?
Some Additional Information
Memory is stable and we have ~15-17 KB of free heap, depending on the mode and the amount of data queued in the send/receive queues.
Our side of the code uses yield, delay etc. so the S/W watchdog should always be fed. This also applies to the Async callback code.
Check whether you are doing any invalid memory reads. The main reason for a HW WDT reset is that it triggers when the software (or CPU) is no longer making progress -- your CPU might have been stuck executing some instruction and never returned.
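To make the "no longer making progress" failure mode concrete, here is a toy C model of a hardware watchdog (the counter, timeout, and step costs are all invented for illustration): the main loop feeds the dog each time it yields, and the dog only fires when a single work step runs longer than the timeout without yielding.

```c
#include <assert.h>

/* Toy model of a watchdog: a countdown the main loop must refresh.
 * If any single work step runs longer than the timeout without
 * yielding (the "stuck CPU" case), the dog bites. */
static int wdt_counter;
enum { WDT_TIMEOUT = 5 };

static void wdt_feed(void) { wdt_counter = WDT_TIMEOUT; }

/* step_cost[i] is how many time units step i takes.
 * Returns 1 if the watchdog fired during the run, 0 otherwise. */
static int run_loop(const int *step_cost, int nsteps) {
    wdt_feed();
    for (int i = 0; i < nsteps; ++i) {
        for (int t = 0; t < step_cost[i]; ++t)   /* time passes */
            if (--wdt_counter == 0)
                return 1;                        /* reset: WDT fired */
        wdt_feed();                              /* loop yields here */
    }
    return 0;
}
```

This is why calling `yield()`/`delay()` regularly keeps the S/W watchdog happy, but a blocking read or a tight loop inside one step still trips the H/W one.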

can a computer work with software interrupts only?

Can software interrupts do some of what hardware interrupts do?
Can the system detect power failure and the like, and then rely only on software interrupts?
Then we won't need special hardware like interrupt controllers.
While this may be technically possible I doubt you'll end up with a system that's stable or even reliable. Interrupts are especially important as hardware because they, well, interrupt the processing of other tasks asynchronously. This allows physical components at their lowest level to quickly and correctly respond to events.
Let's play out the scenario you mention and imagine a component on the motherboard detecting a power failure. Without an interrupt the best it can do is write to a register or cache. It must then rely on another piece of hardware or even the operating system to check that value. This basically means periodic polling, which is not as efficient. Furthermore, if a long-running task is currently hogging the resources needed to check the value, you have no deterministic way of knowing when that check might occur. It could be near-instant, or it could be a second from now. If it's the latter, your computer loses power and shuts down before it can react.
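The nondeterministic detection latency of polling can be sketched in a few lines of C (the flag, the work chunks, and the failure timing are all hypothetical): the power-fail flag is only noticed between work chunks, so the worst-case reaction time is bounded by the longest chunk, not by the hardware event itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a memory-mapped status register that power-monitoring
 * hardware would write to (hypothetical; real hardware would use a
 * fixed address and volatile access). */
static bool power_fail_flag = false;

/* One "chunk" of foreground work; the flag can change mid-chunk but
 * is only noticed after the chunk finishes. */
static void do_work_chunk(int i) {
    if (i == 3)                 /* power failure happens during chunk 3 */
        power_fail_flag = true;
}

/* Polling loop: returns the chunk index at which the failure was
 * finally noticed, or -1 if it never was. */
static int run_until_power_fail(int max_chunks) {
    for (int i = 0; i < max_chunks; ++i) {
        do_work_chunk(i);
        if (power_fail_flag)    /* the periodic check */
            return i;
    }
    return -1;
}
```

A hardware interrupt removes exactly this gap: the handler runs as soon as the event occurs, regardless of what chunk the foreground code is in.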

Programmable "real-time" MIDI processing

In my band, all musicians have both hands busy at any time. However, we want to add whole synthesizer chords (1/4 .. whole note length), maybe triggered by a simple foot switch every time (because playing along a sequencer is currently too difficult for us).
Some time ago I wrote a (Windows) console application in C (MinGW) that converted incoming MIDI events to text, piped that text to an external program (AWK script), and re-converted that external program's text output back to MIDI events.
Basically every sort of filtering or event generation was possible; I actually produced chords triggered by simple control messages; I kept note-ONs in memory to be able to -OFF them whenever a new chord was sent, etc. - the actual processing (execution) times were not a problem at all(!)
But I had to accept that not only latency, but also the notoriously unpredictable OS scheduling of user applications (when, and for how long, a process gets to run) made this concept practically worthless, at least for "real-time" use. There were always clearly perceivable delays of unpredictable duration.
I read about user-mode driver programming and downloaded some resources, but somehow stopped working on that project without a real result.
Apart from that specific project, I even have some experience in writing small "virtual" machines that allow for expressing exactly the variables, conditionals and math, stored as a token tree and processed quite fast. Maybe there is also the option to embed Lua, V8, or anything like that. So calling another (external) program is not necessarily the issue here, since that can be avoided.
The problem that remains is that the processing as a whole is still done by a (user) application. So I figure there is no way around a (user mode) driver, in this scenario.
Alternatively, I was even considering (more "real-time") hardware - a Raspi or the like - but then the MIDI interface may be an additional challenge.
Is there any hardware or software solution (or project) available that may serve as a base for such a _Generic MIDI filter/processor_? Apart from predictable timing behaviour, it is desirable not to need a (C) compilation environment when building filters/rules, since that "creative" step will probably happen in our rehearsal room (laptop available), which is certainly not a "programming lab". Text-based "programs" are fine - for long-term I'll maybe build a GUI for wiring/generating rules anyway.
MIDI is handled pretty well in Windows. I'm not sure of the source of the original problems you had. No doubt there is some latency, though.
You can handle this in real time with a microcontroller. The good news is that you don't even have to build the hardware. Off-the-shelf controllers are available for this. For example: http://www.midisolutions.com/prodevp.htm
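For the software route, the core of such a filter is small. Here is a hedged C sketch of one rule from the question -- a foot-switch control change expanded into a chord of Note Ons -- assuming 3-byte channel messages and an invented rule (CC 64 pressed on channel 0 triggers a C-major triad):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: when controller 64 (sustain pedal / foot switch)
 * goes down on channel 0, emit a C-major triad as Note On messages.
 * `in` is one 3-byte MIDI message; `out` must hold at least 9 bytes.
 * Returns the number of output bytes written. */
static size_t expand_event(const uint8_t *in, uint8_t *out) {
    if (in[0] == 0xB0 && in[1] == 64 && in[2] >= 64) {  /* CC64 pressed */
        static const uint8_t chord[] = { 60, 64, 67 };  /* C, E, G */
        size_t n = 0;
        for (size_t i = 0; i < 3; ++i) {
            out[n++] = 0x90;      /* Note On, channel 0 */
            out[n++] = chord[i];  /* key number */
            out[n++] = 100;       /* velocity */
        }
        return n;
    }
    /* pass everything else through unchanged */
    out[0] = in[0]; out[1] = in[1]; out[2] = in[2];
    return 3;
}
```

A rule table mapping (status, controller) pairs to note lists would let you express such rules as text, which matches the "no compiler in the rehearsal room" requirement; on a microcontroller the same function runs with microsecond-scale, predictable timing.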

Is there any feature of programming that automatically detects computational repetition?

I'm new to programming, taking MIT's 6.00. While watching the Dynamic Programming lecture a simple question occurred to me: Is there any kind of built-in feature (for computers in general) to detect repetitive tasks and compensate?
I realize that's quite vague. I was working on my grandfather's computer because he had been complaining that it was slow. Indeed, it would lag for up to 15 seconds at a time, waiting for programs to open, etc. When I upgraded the RAM, the problem was gone. So if the computer was constantly having to write page ins and page outs to disk, why couldn't it have just popped up a little message suggesting a RAM upgrade? That would save quite a bit of time.
Computers are good at performing tasks quickly but slow code can be, well, slow. Can that be automated? Is this even a legitimate question?
In the example you describe the code isn't slow because it's reading/writing to disk. It's slow because it isn't actually doing anything but instead is waiting for the OS to page in and out to disk.
Also, a RAM upgrade isn't always the solution to frequent paging (say, a buggy program leaking memory).
It's not really possible in the general sense for the OS to detect what all the possible issues are and suggest a solution. That is in fact a variation of the Halting Problem.
It's impossible in general for a computer to know whether something is slow because it's running an operation that fundamentally takes a long time to finish, or whether it's taking more time than it really should.
Also, even if you've identified that an operation is slow, it's even more difficult to diagnose the precise reason why. Sometimes it's because you need more RAM, other times because of a slow network, a slow disk, or a slow CPU. This is even harder if the checker is running on the same machine it is monitoring, since the checker experiences the slowness itself.
However, there are several things that can be done in certain limited situations. Many popular OSes (e.g. Windows, Linux, Android) can detect a slow response to user input, and if the application fails to respond within a certain period, will offer to either give it more time or force-close it (Android), draw the not-responding window in grayscale (Linux), or draw it in a bluish tint (Windows).
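At the level of a single program, the closest thing to automatic repetition detection is memoization, the technique behind the Dynamic Programming lecture mentioned in the question: cache each result so repeated subproblems are computed once. No mainstream runtime applies it unasked; the programmer adds the cache. A minimal C sketch with Fibonacci:

```c
#include <assert.h>
#include <stdint.h>

/* Naive recursion recomputes the same subproblems exponentially often;
 * the cache below "detects" that repetition by remembering results.
 * Zero doubles as the "not computed yet" sentinel, which is safe here
 * because fib(n) > 0 for all n >= 1 and n < 2 returns early. */
static uint64_t cache[94];   /* fib(93) is the largest that fits in 64 bits */

static uint64_t fib(int n) {
    if (n < 2) return (uint64_t)n;
    if (cache[n] == 0)                   /* first time we see this n */
        cache[n] = fib(n - 1) + fib(n - 2);
    return cache[n];                     /* every later call is a lookup */
}
```

With the cache, fib(50) takes about 50 additions instead of billions of recursive calls; some languages make this nearly automatic (e.g. Python's functools.lru_cache), but the programmer still has to opt in.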

What are traps?

There are many different types of traps listed in processor datasheets, e.g. BusFault, MemManage Fault, Usage Fault and Address Error.
What is their purpose? How can they be utilized in fault handling?
Traps are essentially subroutine calls that are forced by the processor when it detects something unusual in your stream of instructions. (Some processors make them into interrupts, but that's mostly just pushing more context onto the stack; this gets more interesting if the trap includes a switch between user and system address spaces).
This is useful for handling conditions that occur rarely but need to be addressed, such as division by zero. Normally, it is useless overhead to have an extra pair of instructions to test the divisor for zero before executing a divide instruction, since the divisor is never expected to be zero. So the architects have the processor do this check in parallel with the actual divide as part of the divide instruction, and cause the processor to trap to a divide-by-zero routine if the divisor is zero. Another interesting case is illegal-memory-address; clearly, you don't want to have to code a test to check each address before you use it.
Often there are a variety of fault conditions of potential interest and the processor by design will pass control to a different trap routine (often set up as a vector) for each different type of fault.
Once the processor has a trap facility, the CPU architects find lots of uses. A common use is debugger breakpoints, and trap-to-OS for the purpose of executing an operating system call.
Microprocessors have traps for various fault conditions. They are synchronous interrupts that allow the running OS / software to take appropriate action on the error. Traps interrupt program flow and set register bits to indicate the fault. Debugger breakpoints are also implemented using traps.
In a typical computing environment, the operating system takes care of CPU traps triggered by user processes. Let's consider what happens when I run the following program:
int main(void)
{
    volatile int a = 1, b = 0;
    a = a % b; /* div by zero */
    return 0;
}
An error message was displayed, and my box kept running like nothing happened. My operating system's approach to fault handling in this case was to kill the offending process and inform the user with the error message Floating point exception.
Traps in kernel mode are more problematic. It is not as straightforward for the OS to take corrective action if it is itself at fault. For a system process there is no underlying layer of protection. This is why faulty device drivers can cause real problems.
When working on bare metal, without the comforting protection of an operating system, the situation is much like the kernel-mode case above. The number one objective for achieving continuous and correct operation is to catch all potential trap conditions before they ever trigger a trap, using assertions and higher-level error handlers. Consider traps the last line of defense, a safety net you don't want to fall into intentionally.
Defining behaviors for trap handlers is worth some thought, even if they "should never happen". They will be executed when things go wrong in an unanticipated manner, be it due to cosmic rays altering RAM in the most extreme case. Unfortunately, there is no single correct answer to what error handlers should do.
Code Complete, 2nd ed:
The style of error processing that is most appropriate depends on the kind of software the error occurs in and generally favors more correctness or more robustness. Strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; no result is better than an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that leads to results that are inaccurate sometimes.
Clearly, my operating system's fault handling is designed with robustness in mind; I can execute flawed code and do pretty much anything without crashing the system. Designing solely for robustness would mean a recovery attempt whenever possible, and if all else fails, reset. This is a suitable approach if your product is e.g. a toy.
Safety critical applications need a bit more paranoia and should favor correctness instead; when a fault is detected, write error log, shutdown. We don't want our radiation therapy unit to pick dosage levels from invalid garbage values.
The ARMv7-M architecture (not to be confused with the ARM7 or the ARMv7-A) Cortex-M3 technical reference manual, which may also be part of one of the newer ARM ARMs (ARM Architecture Reference Manuals), has a section describing each one of these faults.
Now the whys versus the whats are perhaps at the root of the question. The why is usually so you have a chance to recover. Imagine your set-top box or telephone hits one of these faults: do you want it to hang, or to try to recover if possible? Unless you are expecting one of these faults (which in this context you shouldn't be; x86 systems and some of their faults are a completely different story), if you survive long enough to hit one, you would most likely end up pulling the trigger on yourself (the software killing itself by resetting the processor/system). You can go through the long list and try to find ones you can recover from. Divide by zero: how is the exception handler to know what math mistake led to this? In general it can't. Unaligned load or store: how is the handler to know what that code was trying to do? Like divide by zero, it is probably a software bug. Undefined instruction: the code went into the weeds and executed data; by this point you are most likely already too far gone to recover. Any kind of memory bus fault: the handler cannot repair the hardware.
You have to go through every fault and, for each one, define how you are going to handle it: all the ways you could have gotten to that fault, and the ways you can get out of or handle each of those paths. On occasion you might be able to recover; otherwise you need a default action -- hang the processor in an infinite loop in the handler, for example, so that a software engineer, if available, can use a debugger to get in and find where the code stopped. Or have a watchdog timer, inside or outside the chip depending on the chip and board design (when outside the chip, the WDT will often reset the whole board). You might have some non-volatile memory in which you attempt to store the fault before letting or causing the reset, though the time and code it takes to do that might lead you to another fault, depending on what is failing.
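That store-the-fault-then-reset default can be modeled in a few lines of C (the magic value, record layout, and reset flag are invented for illustration; on real hardware the record would live in reset-surviving RAM or flash, and the reset would come from starving the watchdog or writing a system-reset register):

```c
#include <assert.h>
#include <stdint.h>

/* Simulated "non-volatile" fault record; on a real MCU this would be a
 * reserved RAM region that survives reset, or a flash/EEPROM page. */
static struct { uint32_t magic; uint32_t fault_pc; } crash_log;
static int reset_requested;

/* Default handler pattern: record the minimum needed for post-mortem
 * debugging, then request a reset. Keep this path short -- the more it
 * does, the more likely it faults again while things are failing. */
static void hard_fault_handler(uint32_t faulting_pc) {
    crash_log.magic = 0xDEADFA11u;   /* marks the record as valid */
    crash_log.fault_pc = faulting_pc;
    reset_requested = 1;             /* real code: watchdog or SYSRESETREQ */
}

/* After reboot, startup code checks for a valid record and can report
 * the faulting address before clearing it. */
static int had_fault_last_boot(void) {
    return crash_log.magic == 0xDEADFA11u;
}
```

This is the same mechanism the ESP question above relies on: the device reports its reset reason and faulting address (epc1) after it reconnects, which is often the only diagnostic you get from the field.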
Simply put, they allow you to execute code when something happens in the processor. They're sometimes used by the OS for error recovery.