How to interpret Hardware Watchdog exceptions on an ESP chip?

For one of our projects we have a hardware watchdog reset which occurs on roughly 0.1% of our devices each day, resulting in many unwanted hardware resets.
We are trying to figure out what causes this hardware watchdog reset, but have failed to find anything in our code that would result in this behavior.
We are using Arduino core version 2.4.2. We are not sure when the problem first crept into our solution, since we had other issues which have now mostly been resolved.
Luckily our devices send us their reboot reason when they reconnect, where we receive the following:
ResetReason=Hardware Watchdog;ResetInfo=Fatal exception:4 flag:1 (WDT)
epc1:0x40102329 epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000
depc:0x00000000;
We ran this through the EspStackTraceDecoder and ended up with:
0x40102329: wDev_ProcessFiq at ??:?
A search across various projects that have asked similar questions showed that most, but not all, involved a DNS query, so it seems to be a more general issue.
What additional information could we extract that might help us identify the issue?
Some Additional Information
Memory is stable and we have ~15-17 KB of free heap, depending on the mode and the amount of data in the send/receive queues.
Our side of the code uses yield(), delay(), etc., so the S/W watchdog should always be fed. This also applies to the async callback code.

Check whether you are doing any invalid memory reads. The main reason for a HW WDT reset is that it triggers when the software (or the CPU) is no longer making progress:
your CPU might be stuck executing some instruction sequence and never returns.


Should I expect "device lost" conditions as normal under Vulkan?

I'm asking because I wonder how robust I should make my programs against device losses.
Should I only expect devices to be lost in the case of, say, hardware errors, driver bugs, improper API usage or non-terminating shader programs; or should I also expect device loss in such cases as, say, suspending and resuming my laptop, minimizing the application window, or just randomly because the implementation felt like it?
It's unfortunately going to vary by GPU, driver, and OS, which leads to the somewhat vague spec wording that krOoze quoted:
A logical device may become lost because of hardware errors, execution timeouts, power management events and/or platform-specific events.
For reference, there is nothing in the Android OS itself that would require a device lost -- e.g. it doesn't force a device-lost when an app goes into the background or the screen is turned off.
But it's likely that some driver/hardware combinations will report a device lost error if there is a GPU exception (or reset), unless the driver can guarantee that nothing from your VkDevice could have been affected. That's a surprisingly difficult guarantee to make: even if your queues weren't running at the time the problem occurred, some of your data might still have been sitting in dirty cache lines, and if the reset invalidates those lines instead of writing them back to memory, your data is corrupted. An exception/reset can be caused by hardware or driver bugs, or by any app on the system hitting a watchdog timeout (an infinite loop in a shader is the easy example, but even a shader that makes progress yet simply takes too long can trip it).
In practice, these should be fairly rare events, and I believe (without data) that these days it's primarily caused by hotplug (rare) or misbehaving hardware/driver/app events rather than more routine things like device sleep.
Since testing your recovery code is going to be difficult and it'll therefore likely be buggy, my recommendation would be to just do something heavy-handed but simple, like saving application state and either restarting your app automatically, or quitting and asking the user to restart. Depending on what you're building, it might be reasonable to do something more sophisticated like tearing down and restarting+restoring your renderer system without taking down the rest of your app.

Inter Processor Interrupt usage

An educational principle is: there is no such thing as a stupid question. The basic idea behind this is that people learn by asking.
I was asked: "Can you show and explain, at a programming level, what bad thing will happen if every task could execute all instructions?"
I did give the code
main() {
    _asm_("cli;");
    while (1);
}
and explained it (the system freezes for good on a UP machine).
Then I was asked: "Is it possible to give an example where the system does not freeze even though this clearing of interrupts is done?"
I modified the previous example:
int i;
main() {
    _asm_("cli;");
    i = i / 0;
    while (1);
}
and explained it.
Trivially: if we have demand paging, i=i/0 first causes a page fault (the data page is not present), so another task can be scheduled to run with interrupts enabled during the disk read; later, the divide by zero will throw this task away for good.
But those answers assume a uniprocessor (UP) system. What about SMP? I must admit the answers are incomplete.
It is still easy enough to construct:
int i;
main() {
    for (i = 0; i < 100; i++)  /* suppose we have fewer than 100 CPUs */
        if (fork()) {
            sleep(5);  /* the forking task (most probably) has time to do all the forks */
            _asm_("cli;");
            while (1);
        }
}
which will disable interrupts on all CPUs, because every CPU gets a poisonous task to run.
Even this far, a "stupid" question has revealed many things worth learning for a beginner: privileged instructions, paging, fault handling, scheduling during DMA, fork...
But a minor doubt remains (shame on me) about the first program running on SMP.
Will one CPU be out permanently, or not?
The other CPUs continue and can send a re_schedule() IPI message.
What happens then?
It is easy to speculate that the frozen CPU does not wake up, because interrupts are disabled.
But to be perfectly sure, one must know more.
My question was:
Is the Inter Processor Interrupt (IPI) maskable or non-maskable?
I mean in the most common, "popular" implementations.
Excuse my stupid question; it can't be very difficult to find an answer, and I will keep looking for it.
I mean the interrupt pin/vector number (which tells whether it is maskable, I guess).
My own answer - correct?
I studied the issue, because nobody else did, and came to the following thoughts:
With important real-time applications we have long had watchdog timers (hardware interrupting the CPU so that it must somehow answer "I am alive").
For example, we may have a main control computer and a standby computer that takes over the system if the main computer goes down.
What about Linux?
What kind of watchdog do we have, if any?
We can compile the Linux kernel with or without the watchdog.
What does the Linux watchdog do?
On many(!) x86/x86-64 machines there is a feature that enables us to generate 'watchdog NMI interrupts'.
It is even possible to disable the NMI watchdog at run time by writing "0" to /proc/sys/kernel/nmi_watchdog.
If any CPU in the system does not execute the periodic local timer interrupt for more than 5 seconds, the APIC tries to fix the situation with a non-maskable interrupt (the CPU executes the handler and kills the offending process)!
(SCC Linux is a different case with regard to the NMI.)
My answers (in the original question) were based on a system without a watchdog!
It is problematic to answer at a general level while giving examples based on some fixed system; the answers can be correct or not depending on the CPU, configuration, and settings.
Anyway, did talking about the NMI make some sense? Did it?
If the CPU didn't restrict access to some instructions, it would be too easy to accidentally or deliberately cause a catastrophe.
    push $0
    push $0
    lidt (%esp)
    int $42
This code sequence will reset an x86 processor. Here's why:
1. The lidt instruction loads the IDTR register with an interrupt descriptor table (IDT) at linear address 0, with a limit of one byte.
2. int $42 raises interrupt 42, which cannot work because vector 42 is beyond the one-byte limit of the IDT.
3. The CPU tries to raise a general protection fault, interrupt 13. This fails too, because vector 13 is beyond the one-byte limit.
4. The CPU then tries to raise a double fault exception, interrupt 8. This fails as well; vector 8 is also beyond the limit of the IDT.
This is known as a triple fault. The CPU performs a shutdown bus cycle to tell the motherboard that it is now ignoring everything and stopping execution. The motherboard asserts reset, rebooting the machine.
This is actually negligible compared to what code could do. A code sequence could easily hijack the machine altogether: it could start destroying all of the data on the hard drive, send all of your files to a malicious server on the internet, change your password, enable remote access, or connect out to a malicious server and grant an attacker unlimited shell access. There is no limit to what a program could do.
Processors have privileged instructions for two reasons: the primary purpose is to protect the operating system from buggy programs that might accidentally bring down or hijack the whole machine; the secondary purpose is to restrict deliberately malicious programs from doing the same.

When to use windowed watchdog for embedded systems

This post is not for asking how to use it, but when.
There is a lot of documentation about windowed watchdogs (WW), and most microcontrollers already include one. Every vendor states that WWs are meant for safety applications, but no one says more than that about the topic.
I would like to be pointed to specific examples, but examples that could be a little more than "for a car's brakes system".
We all know that a WW must be fed neither too early nor too late, but how does this help improve safety?
Thank you!!
The overall point of a Watchdog is to ensure that the firmware is executing as expected. The theory is that if your firmware can periodically kick the watchdog, then the other functions it is responsible for are also happening.
From a system design standpoint, they're the last level of fail-safe. It's basically saying: "We don't know what the system is doing, because it's not able to kick the watchdog. So, reset the device and hope the problem goes away."
They can protect you from accidental infinite loops, stack corruptions, RAM bit twiddles, etc.
A Windowed Watchdog is a better solution than a single-sided watchdog, as the window can protect against more failure modes. For example, with a single-sided watchdog, if the loop you're stuck in includes the watchdog kick, you'd never know you had a problem. With a Windowed Watchdog, you have a better chance of resetting, due to the likelihood of kicking too fast.
So, to answer your question. You'd use a Windowed Watchdog any time you wanted to be reasonably sure that the firmware is doing what it is supposed to, or to fall back to a safe state if it's not. They are generally focused on in safety systems, but all embedded devices can benefit from their use. (For example, a house thermostat is not considered a safety-critical system, however if it completely locks up and requires someone to remove the batteries to restart it that would be an annoyance.)

Does SPI really need waiting loop?

I am using msp430f5418, with IAR Embedded workbench 5.10.
A graphical LCD (ST7565R) is connected through SPI to the MSP.
MSP master uses 8-bit, MSB first mode with SMCLK.
Normally we have to check the busy bit before transferring a byte using SPI, right?
But for my case, even if I send data continuously without checking the busy bit, it works fine and I can view the display data correctly.
Can anybody explain why it is working?
Is there any need to check the ready bit, or is it safe to skip it?
Thank you,
Your software is probably slow enough that the SPI transaction completes every time. If you can verify that this is the case, and always will be, then you can argue for not adding more code to do the check. Removing the code that does the check might speed up your routine just enough to be too fast for the SPI interface and cause collisions.
In general you should make sure one thing finishes before another starts, and you can make sure of that through hardware features, analysis, or experiments. If the hardware has the feature and you somehow determine you don't need the check, it is still a good idea to do a performance test with and without the check. If the performance is not critical, or there isn't much difference, it is probably safer to leave the check in: somewhere down the road, even if your code is heavily commented with warnings, a compiler or code change might be just enough to make it too fast without the check.

About Watchdog Timer

Can anyone tell me whether we should enable or disable the watchdog while the startup/boot code executes? My friend told me that we usually disable the watchdog in boot code. Can anyone tell me the advantages or disadvantages of doing so?
It really depends on your project. The watchdog is there to help you ensure that your program won't get "stuck" while executing code. -- If there is a chance that your program may hang during the boot-procedure, it may make sense to incorporate the watchdog there too.
That being said, I generally start the watchdog at the end of my boot-up procedures.
Usually the WD (watchdog) is enabled after the boot-up procedure, because this is when the program enters its "loop" and periodically kicks the WD. During boot-up, by which I suppose you mean the linear initialization of hardware and peripherals, there is much less periodicity in your code, and it is hard to insert a WD kicking cycle.
Production code should always enable the watchdog. Hobby and/or prototype projects are obviously a special case that may not require the watchdog.
If the watchdog is enabled during boot, there is a special case which must be considered. Erasing and writing flash memory take a long time (erasing an entire device may take seconds to complete), so you must ensure that your erase and write routines periodically service the watchdog to prevent a reset.
If you're debugging, you want it off, or the device will reboot on you when you try to step through code. Otherwise it's up to you. I've seen watchdogs save projects' butts, and I've seen watchdogs lead to inadvertent reboot loops that caused customers to clog up the support lines and cost the company a ton.
You make the call.
The best practice is to have the watchdog activate automatically on power-up. If your hardware is not designed for that, then switch it on as soon as possible. Generally I set the watchdog up for a long duration during boot-up, but once I am past boot-up I go for a short timeout and service the watchdog regularly.
You might not always be around to reset a board that hung after a plant shutdown and restart at a remote location. Or the board is located in an inaccessible basement crawl space and did not restart after a power dip. Easy lab practices are not real-world best practices.
Try to design your hardware so that your software can check the reset cause at boot-up and report it. If you get a watchdog timeout you need to know, because it is a failure of your system, and ignoring it can cause problems later.
It is easier to debug with the watchdog off, but during development regularly test with the watchdog on to ensure everything is on track.
I always have it enabled. What is the advantage of disabling it? So what if I have to reset it during the bootup code?
Watchdogs IMHO serve two related, but distinct, primary purposes, along with a third, less strongly related purpose: (1) ensure that in all cases where the system is knocked out of whack, it will eventually recover; (2) ensure that when hardware is enabled which must not go too long without service, anything that would prevent such servicing shuts down the system reasonably quickly; (3) provide a means by which a system can go to sleep for a while without sleeping forever.
While disabling a watchdog during a boot loader may not interfere with purpose #2, it may interfere with purpose #1. My preference is to leave watchdogs enabled during a boot loader, and have the boot loader hit the watchdog any time something happens to indicate that the system is really supposed to be in the boot loader (e.g. every time it receives a valid boot-loader-command packet). On one project where I didn't do this, and just had the boot loader blindly feed the watchdog, static zaps could sometimes knock units into bootloader mode, where they would sit forever. Having the watchdog kick the system out of the boot loader when no actual boot-loading is going on alleviates that problem.
Incidentally, if I were designing my 'ideal' embedded-watchdog circuit, I would have a hardware-configurable parameter for maximum watchdog time, and would have software settings for 'requested watchdog time' and 'maximum watchdog time'. Initially, both software settings would be set to maximum; any time the watchdog is fed, the time would be set to the minimum of the three settings. Software could change the 'requested watchdog time' any time, to any value; the 'maximum watchdog time' setting could be decreased at any time, but could only be increased via system reset.
BTW, I might also include a "periodic reset" timer, which would force the system to unconditionally reset at some interval. Software would not be able to override the behavior of this timer, but would be able to query it and request a reset early. Even systems which try to do everything right with a watchdog can still fall into states which are 'broken' but the watchdog gets fed just fine. If periodic scheduled downtime is acceptable, periodic resets can avoid such issues. One may minimize the effect of such resets on system usefulness by performing them early whenever it wouldn't disrupt some action in progress which would be disrupted. For example, if the reset interval is set to seven hours one could, any time the clock got down to one hour, ask that no further actions be requested, wait a few seconds to see if anyone tried to send an action just as they were asked to stop, and if no actions were requested, reset, and then invite further requests. A request which would have been sent just as the system was about to reset would be delayed until after the reset occurred, but provided no requests would take longer than an hour to complete, no requests would be lost or disrupted.
Fewer transistors switching, I suppose, so a minuscule power saving. Depending on how much you sleep, this might actually be a big saving. Your friend might be referring to the practice of turning off the WDT when you're actually doing something, then turning it on when you sleep. There's a nice little note that Microchip gives about their PICs:
"If the WDT is disabled during normal operation (FWDTEN = 0), then the SWDTEN bit (RCON<5>) can be used to turn on the WDT just before entering Sleep mode"