Hardware Watchdog for OH on Raspberry Pi

zisisp · August 27, 2019, 9:19am

Can you tell us what configuration you used for the watchdog?
Did you just follow the post’s configuration or did you use something else?
I have a similar problem.
RPi seems to have crashed.
- Does not send any data.
- Does not respond to ssh.
- Is powered on and Ethernet is connected.
When I ping it, it replies.
After that, I can ssh.
When I ran top, I momentarily saw 300% CPU, then 100%, and then it dropped to 4%.

As a rule of thumb, I would set the reset threshold to 150-200% load and see how it goes.

peterhoeg · May 27, 2022, 8:15am

Apologies in advance for both semi-hijacking as well as necrobumping this issue, but I think there is validity to the request being made here.

It seems to me that people are talking past each other a little bit, so let me try to recap the situation.

The problem: openHAB, the OS or the machine on which it runs can all become unresponsive for a variety of reasons. The cause of this (bad hardware, insufficient power supply, solar flares, bad karma, software bugs or any number of other things) isn’t really that interesting when you want service restored as quickly as possible. If things aren’t running, automation becomes quite tricky.

Now, there are some mechanisms that will help with at least some of this. As an example, if the machine has a hardware watchdog (which is what OP is talking about), watchdog software on the machine can be told to configure that hardware and ping it (think heartbeat, not ICMP ping) on a regular schedule. If the deadline is missed, the hardware will trigger a reboot of the machine, so even a dead OS can be brought back to life. On Linux distributions that use it, systemd will do this with a single line in /etc/systemd/system.conf. There is also other watchdog software for systems not using systemd.
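To make the heartbeat idea concrete, here is a minimal sketch of what watchdog software does with the Linux watchdog device. It is not how systemd implements it; the function name pet_watchdog and its parameters are made up for illustration, and the device path /dev/watchdog is the usual default but may differ on a given machine.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, beats=3):
    """Write a heartbeat to the watchdog device `beats` times (hypothetical helper)."""
    # Opening the device arms the hardware timer. buffering=0 makes sure
    # every write reaches the driver immediately.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(beats):
            wd.write(b"\0")       # any write resets the countdown
            time.sleep(interval)  # must stay shorter than the hardware timeout
        # "Magic close": writing 'V' just before closing asks the driver to
        # disarm the timer (ignored if the driver was loaded with nowayout).
        wd.write(b"V")
```

Do not try this casually on a real machine: if the process dies without the final 'V' write, the board resets. The device can typically be opened by only one process at a time, so this would not run alongside systemd's own watchdog handling.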

The Linux kernel can also be told to reboot on kernel panic (the kernel.panic sysctl sets the delay in seconds before it does so).

systemd takes it further with programs adapted for it: systemd expects the application to send a heartbeat every X seconds, and if it doesn’t, either the application or the machine itself is restarted. The action is configurable.

Unresponsive can of course mean a lot of things and what I describe above is obviously only a small part of it, but some cases can be dealt with when all you care about is “get it back up and running”.

I am also of course not advocating not looking into why it required a restart in the first place but if I’m not physically in front of the device, I would much prefer a forced reboot and restoring the service and then looking into why it crashed later rather than have no automation at all.

The hardware thing has nothing to do with openHAB as such. Anyone on a machine with a hardware watchdog and an OS that supports it can go right ahead and enable it.

The service watchdog will require cooperation from openHAB: it needs to use the sd_notify call (or equivalent) to signal its status, saying both “now I’m ready” and “I’m still alive”. It would be very neat if we could do that in openHAB.
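For anyone wondering what the sd_notify side involves, here is a rough sketch of the protocol in Python. It is only an illustration (openHAB itself is Java and would need an equivalent), and the helper name heartbeat_interval and its default value are my own. systemd passes the socket address in the NOTIFY_SOCKET environment variable and, when WatchdogSec= is set on the unit, the timeout in microseconds in WATCHDOG_USEC.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" or "WATCHDOG=1" to systemd.

    Returns False when NOTIFY_SOCKET is not set, i.e. the process was not
    started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = path.encode()
    if addr.startswith(b"@"):
        # A leading "@" denotes a socket in the Linux abstract namespace.
        addr = b"\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True


def heartbeat_interval(default: float = 30.0) -> float:
    """Seconds between heartbeats: half of WATCHDOG_USEC, or `default` if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else default
```

A service would call sd_notify("READY=1") once startup has finished and then sd_notify("WATCHDOG=1") every heartbeat_interval() seconds, but only while its own health checks pass. The unit would need Type=notify and WatchdogSec= set for systemd to act on the messages.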

mstormi · May 27, 2022, 8:27am

Err, I’m unsure what you want to tell or request with your post.
Given the post before is 3 years old and if I understood you correctly you tell anyone who wants to have a hardware watchdog to take care of that on his own (because as you rightly comment that this is all unrelated to openHAB), I suggest you create another fresh thread and describe your request for a service watchdog in detail there.
Note though that no software can supervise itself. So you must not address OH developers and expect any of them to develop any sort of service watchdog, that would be broken by design.
If you have one available already (systemd based?), you can implement any “I’m alive” signal using openHAB rules yourself, that you don’t need any new OH feature for.
Show your systemd setup, eventually it can be generalized and added to openHABian for others to also benefit from it.

peterhoeg · June 2, 2022, 11:42am

Err, I’m unsure what you want to tell or request with your post.

It was really two-fold. First of all, the way I read the openHAB response was “if you need to restart your machine, you’ve got other problems so go deal with that”, which I think is missing the point - there are definitely mechanisms (as outlined in my post) that can be trivially implemented to provide at least some added service reliability.

The other reason was simply to get the conversation started (and thank you for responding) on what can and possibly should be done to improve the experience for everyone.

you tell anyone who wants to have a hardware watchdog to take care of that on his own (because as you rightly comment that this is all unrelated to openHAB)

I’m saying it can be done outside of openHAB (the application) but should possibly be considered part of the official best practice and, as a consequence, probably also of openHABian. Side note: I haven’t been able to find any documentation on best practices. Would you be interested in taking a PR for that, again assuming that there is consensus on what constitutes “best practice”?

I suggest you create another fresh thread and describe your request for a service watchdog in detail there.

Will do.

Note though that no software can supervise itself.
So you must not address OH developers and expect any of them to develop any sort of service watchdog, that would be broken by design.

Of course, I wouldn’t expect them to either.

If you have one available already (systemd based?), you can implement any “I’m alive” signal using openHAB rules yourself, that you don’t need any new OH feature for.

While you probably can (I’m not sure it’s possible to read environment variables from rules in order to obtain the socket to use for communicating with systemd), the question is of course whether it is the best way to do it, given that openHAB itself would have a much better view of its own health. In any case, I’ll go through that in the new post.

Show your systemd setup, eventually it can be generalized and added to openHABian for others to also benefit from it.

That’s fortunately very straightforward. Add the following to /etc/systemd/system.conf:

[Manager]
RuntimeWatchdogSec=1m

mstormi · June 2, 2022, 5:20pm

This can and probably will quickly turn into some dogmatic discussion but here goes:
IMHO that statement IS very much to the point.
Watchdog reboots are just workarounds, not solutions, and will not by themselves result in an acceptable service level. They can even make things worse, because most of the programming people do isn’t designed to be reboot-safe. Lots of data can get lost or corrupted that might otherwise be salvageable with some manual intervention.
Follow-up problems might even go unnoticed (at least for some time).

Only investigating the source of your problems can get you out of the hole.
If you reboot a machine, you’re losing state and data you might need to find the cause.
For example if you’re using the openHABian default, logs will be on ZRAM and get lost on every reboot. But these are the prime source you will need for any sort of analysis towards the root cause of your problems.
There have been various threads, e.g. on how to have a shutdown switch in the OH GUI, that show how many people think. To many, a watchdog reset is just a nice excuse to avoid putting any more work into digging deeper.