I’d be keen to know what you do to monitor your system and restart/reboot or whatever (check logs?) when issues occur.
I’m considering a rule to restart OH2 or the system if the ZWave Thing goes offline.
I’m trying to have a no-touch system once it’s set up.
I have to ask Kris… do you have issues with the zwave Thing going offline? I run zwave and have never seen the Thing go offline… ever.
Not when the stick is directly connected to the OH2 server, but yes, I do when the stick is remote using things like socat/ser2net. It’s rare but it happens.
I guess I would have no issues with a direct connection, but in any case the system needs some level of self-aware/self-healing features, which OH2 does not have.
oh yeah… ok gotcha… I could see that
Honestly, having gone down that route before, developing a no-touch system while also having a system that stays up to date with patches and upgrades is impossible. So I’ve found the effort involved in creating even a partially no-touch system far outweighs the labor savings.
Typically if you want to build a fault tolerant system like this, which is really what you are after, you need something outside the thing being monitored to detect a problem and reset the thing when it occurs. Often this is called a watchdog. I strongly recommend that this monitor be implemented outside of OH itself. The problem is that if the thing you are monitoring experiences a fault, you can’t trust that thing to operate correctly. And if you can’t trust it to operate correctly, you cannot trust a watchdog running inside it to work correctly either.
So it sounds like you are using socat/ser2net to access your zwave controller and periodically that goes offline. So the watchdog in this case is watching the zwave serial Thing and when it goes offline it needs to restart just OH? The whole machine? Is there any way you can see if there is a problem with socat/ser2net without polling OH directly, perhaps checking the logs?
Let’s say the whole machine. But what if the device is legitimately OFFLINE? You don’t want your watchdog to put your computer into a reboot loop. So you need to add logic to only reboot once and if that doesn’t work don’t reboot again until it comes back online.
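To make the "only reboot once" idea concrete, here is a minimal sketch in Python. It assumes something else has already worked out whether the Thing is online; the marker file path and the function names are made up for illustration. The marker lives in /var/tmp so that it survives the reboot it guards against.

```python
from pathlib import Path
from typing import Tuple

# Hypothetical marker file; any location that survives a reboot will do.
MARKER = Path("/var/tmp/zwave_watchdog_rebooted")


def decide(online: bool, already_rebooted: bool) -> Tuple[str, bool]:
    """Return (action, new_already_rebooted) for one watchdog check."""
    if online:
        # Healthy again: clear the flag so a future outage may trigger one reboot.
        return ("none", False)
    if already_rebooted:
        # Still offline after a reboot: the device is probably really down.
        return ("wait", True)
    # First time we see it offline: reboot once and remember that we did.
    return ("reboot", True)


def run_check(online: bool, marker: Path = MARKER) -> str:
    """Apply decide() using a marker file as the persistent flag."""
    action, flag = decide(online, marker.exists())
    if flag:
        marker.touch()  # written before the caller reboots
    else:
        marker.unlink(missing_ok=True)  # Python 3.8+
    return action
```

The caller does the actual reboot when run_check returns "reboot". Because the flag is only cleared once the Thing is seen online again, a device that is legitimately offline causes exactly one reboot and then nothing more.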
Depending on what source tells you that you have a problem (system logs, polling the Thing status through OH’s REST API, etc.) you could probably handle this with a cron job and a shell script. Systemd has a watchdog service built in as well which you can use for this, but I’ve found the tutorials for it to be a bit more difficult to understand.
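As a rough illustration of the REST polling option, here is a sketch in Python rather than shell. The base URL and the Thing UID are placeholders, and it assumes that GET /rest/things/{thingUID} returns JSON with a statusInfo.status field; check that against the REST API of your own OH version before relying on it.

```python
import json
import urllib.request
from urllib.parse import quote

OPENHAB_URL = "http://localhost:8080"          # assumption: default local install
THING_UID = "zwave:serial_zstick:controller"   # hypothetical UID, use your own


def parse_status(body: str) -> str:
    """Extract the Thing status from the response body, or UNKNOWN if absent."""
    try:
        data = json.loads(body)
    except ValueError:
        return "UNKNOWN"
    if not isinstance(data, dict):
        return "UNKNOWN"
    info = data.get("statusInfo")
    if not isinstance(info, dict):
        return "UNKNOWN"
    return info.get("status", "UNKNOWN")


def fetch_status(base_url: str = OPENHAB_URL, uid: str = THING_UID,
                 timeout: float = 10.0) -> str:
    """Ask OH for the Thing status; UNREACHABLE means OH itself did not answer."""
    url = f"{base_url}/rest/things/{quote(uid, safe=':')}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_status(resp.read().decode("utf-8"))
    except OSError:
        # Covers refused connections, timeouts and HTTP errors.
        return "UNREACHABLE"


def main() -> int:
    """Exit code 0 when the Thing is ONLINE, 1 otherwise, for use from cron."""
    status = fetch_status()
    print(status)
    return 0 if status == "ONLINE" else 1
```

Run from cron by calling sys.exit(main()) at the bottom of the script; the exit code lets a wrapper decide whether to restart anything. Keeping UNREACHABLE separate from OFFLINE matters because "OH is down" and "the zwave Thing is down" probably call for different responses.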
I’ve no desire for the system to stay up to date, quite the opposite actually. Once it works, I don’t want to touch it - people touching things is generally what breaks it.
The ser2net/socat setup isn’t ideal. I was hoping OH2 and the ZWave binding would be updated to allow this functionality to work out of the box, but it doesn’t. Far from it. That said, my father’s installation doesn’t need this, so I guess that’s one complexity I don’t need.
I’m talking more generally here though: what are the tools/techniques that people use to self-heal the system if there are errors etc.?
That’s what the whole discussion about watchdogs was about.