Automation Health Monitoring

There has been a lot of discussion on the forum about backups and steps to restore systems when components fail.

I’m wondering how others are monitoring their systems so they are alerted when, for any reason, the automation stops. Much of the work that automation does is invisible, so it could be minutes or even hours before we know the system has crashed.

I’m curious to know what others may be doing.

I am using my KNX installation together with my KNX logic engine from MDT to set up a watchdog for openHAB:

I have a JS Scripting rule (most of my rules are JS Scripting) that sends the command ON to a switch Item every 30 seconds, which is then transmitted to the KNX bus over IP.
The logic engine checks whether the ON command arrives regularly; if it does not receive it, an alert is sent to my KNX push buttons. These start flashing red, and my MDT Glastasters (MDT push buttons with touch buttons and an LCD screen) show the message „openHAB offline“.
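Roughly sketched, the heartbeat side of this could look like the following file-based JS Scripting rule (the Item name and the exact cron expression here are just placeholders):

```javascript
// $OPENHAB_CONF/automation/js/knx-heartbeat.js
const { rules, triggers, items } = require('openhab');

rules.JSRule({
  name: 'KNX watchdog heartbeat',
  description: 'Send ON to the heartbeat Item every 30 seconds',
  triggers: [triggers.GenericCronTrigger('0/30 * * * * ?')],
  execute: () => {
    // The Item is linked to a KNX switch channel, so the command ends up on the bus
    items.getItem('KNX_Watchdog_Heartbeat').sendCommand('ON');
  }
});
```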

This way I can check the most critical parts of my openHAB installation: KNX connection, JS Scripting automation and of course general openHAB availability.

My openHAB also checks the connection to my other devices, e.g. the DoorBird, via rules, and sends a push notification and a message to my KNX push buttons if the DoorBird goes offline.
For the DoorBird, I am using the Network binding with ping; if the ping fails for x seconds, I send a message using a JS Scripting rule.
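Sketched out, such an offline-notification rule could look roughly like this (the Item DoorBird_Online, linked to a Network binding online channel, and the KNX alert Item are placeholders):

```javascript
const { rules, triggers, items, actions } = require('openhab');

rules.JSRule({
  name: 'DoorBird offline alert',
  // DoorBird_Online is a Switch Item linked to a Network binding "online" channel
  triggers: [triggers.ItemStateChangeTrigger('DoorBird_Online', 'ON', 'OFF')],
  execute: () => {
    // Push notification via the openHAB Cloud connector
    actions.NotificationAction.sendBroadcastNotification('DoorBird is offline');
    // Mirror the alert to the KNX push buttons via a KNX-linked Item
    items.getItem('KNX_Alert_DoorBird').sendCommand('ON');
  }
});
```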

I have a lot of stuff besides OH that I want to monitor so my setup is probably more than most average OH users would need to worry about.

I have two tiers of monitoring:

Tier 1: Status that impacts the home automation

There are some cases where OH cares whether or not a device is online, and it might take action, or take a different action, based on that status. For example, I have a rule that sends a notification when someone tries to trigger the garage door opener but the opener is offline.

In that case I add a status Item to the Equipment and set it to OFF when the device goes offline.
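As a rough illustration of that pattern (all Item names here are made up), the garage door rule could look something like this:

```javascript
const { rules, triggers, items, actions } = require('openhab');

rules.JSRule({
  name: 'Garage door trigger with offline check',
  triggers: [triggers.ItemCommandTrigger('GarageDoor_Trigger', 'ON')],
  execute: () => {
    // GarageDoor_Status is the Equipment's status Item, held OFF while the opener is offline
    if (items.getItem('GarageDoor_Status').state === 'OFF') {
      actions.NotificationAction.sendBroadcastNotification(
        'The garage door opener is offline; the command was not delivered');
      return;
    }
    items.getItem('GarageDoor_Opener').sendCommand('ON');
  }
});
```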

How do I detect it’s offline? It depends.

  • Devices integrated with OH over MQTT use the LWT topics to drive the status Item.
  • Devices that don’t use a binding but are otherwise on the network use the Network binding.
  • In cases where a Thing may go OFFLINE when the device is unavailable, I use Thing Status Reporting [4.0.0.0;4.9.9.9] to detect that and update the status Item accordingly (a rough sketch of this case follows the list).
  • Finally, in cases where detecting that it’s offline means the device just stops reporting for a time, I use Threshold Alert and Open Reminder [4.0.0.0;4.9.9.9].
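For the Thing status case, a hand-rolled sketch of the same idea (not the actual rule template; the Thing UID and Item name are placeholders) could be:

```javascript
const { rules, triggers, items, things } = require('openhab');

const THING_UID = 'mqtt:topic:broker:garage';   // placeholder Thing UID
const STATUS_ITEM = 'GarageDoor_Status';        // placeholder status Item

rules.JSRule({
  name: 'Mirror Thing status to a status Item',
  // Fires whenever the Thing changes status (e.g. ONLINE <-> OFFLINE)
  triggers: [triggers.ThingStatusChangeTrigger(THING_UID)],
  execute: () => {
    const online = things.getThing(THING_UID).status === 'ONLINE';
    items.getItem(STATUS_ITEM).postUpdate(online ? 'ON' : 'OFF');
  }
});
```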

I use Service Status Standalone Widget to report the online statuses of the home automation Equipment.

I also have a rule that sends me an email when a device goes offline (also based on the Threshold Alert and Open Reminder rule template) and another that sends me an email in the morning with a list of all the devices that are still offline.
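A rough sketch of the morning summary, assuming the status Items are collected in a Group Item (gDeviceStatus, the Mail Thing UID and the address are all placeholders):

```javascript
const { rules, triggers, items, actions } = require('openhab');

rules.JSRule({
  name: 'Morning report of offline devices',
  triggers: [triggers.GenericCronTrigger('0 0 8 * * ?')],   // every day at 08:00
  execute: () => {
    // gDeviceStatus is a Group containing all the per-Equipment status Items
    const offline = items.getItem('gDeviceStatus').members
      .filter((i) => i.state === 'OFF')
      .map((i) => i.label || i.name);
    if (offline.length === 0) return;
    actions.get('mail', 'mail:smtp:local')
      .sendMail('me@example.com', 'Devices still offline', offline.join('\n'));
  }
});
```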

Tier 2: General service and machine monitoring

To detect when OH itself goes offline, as well as any of the many other services I run (Plex, Nextcloud, Calibre, etc.), I use Zabbix. It runs an agent on my various machines and reports their status. Whenever a Docker container, service, or machine goes offline (or some other system statistic goes out of range) I get an email.

Of course, if the VM where Zabbix itself is running goes offline I won’t get the email, but so far that hasn’t happened. I could set up a watchdog to send me an email if I ever get worried about that. I wonder if Proxmox has something I can set up to do that…


I’m primarily focused on the health of the OH system so reporting to a watchdog machine is what I was thinking.
I was thinking of standing up another machine just for this purpose…maybe another instance of OH, but I’m open to other applications more specifically suited to the job.
I’ll take a look at Zabbix.

If you are on Linux then you can use the systemd process watchdog to make sure that the launcher will detect an eventual hang of the service. It might help with the most basic memory errors.

It won’t help with communication issues.
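For illustration, the systemd side could look something like the drop-in below, assuming openHAB runs as openhab.service and Karaf’s systemd/sd_notify support is enabled (see the config.properties excerpt quoted in the next reply); the unit name and timeout are placeholders:

```ini
# /etc/systemd/system/openhab.service.d/watchdog.conf — illustrative drop-in override
[Service]
# With Karaf's systemd support enabled, it pings systemd via sd_notify;
# if no ping arrives within WatchdogSec, systemd considers the service hung and restarts it
Type=notify
WatchdogSec=120
Restart=on-failure
```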

I was not aware of Karaf having systemd support - that’s great news. However, in config.properties it is mentioned that:

In addition to setting the flag to true, the JNA library needs to be made
available to the main classloader by adding the two following libraries
to the lib/boot directory
mvn:net.java.dev.jna/jna/${jna.version}
mvn:net.java.dev.jna/jna-platform/${jna.version}
or by building a custom distribution and adding the following lines for the
karaf maven plugin configuration:
mvn:net.java.dev.jna/jna/${jna.version};type:=boot;export:=false
mvn:net.java.dev.jna/jna-platform/${jna.version};type:=boot;export:=false

Where does one get those libraries?

Secondly, another way to monitor it externally is to hit /rest/metrics/prometheus even if you are not using Prometheus (which you should, IMHO!), as that endpoint is not likely to respond if openHAB has crashed.
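As a sketch, an external probe run from another machine (e.g. every minute via cron) could be as simple as the Node.js script below; host, port and timeout are placeholders, and the endpoint assumes the metrics service is installed:

```javascript
// check-openhab.js — run from a separate machine, e.g. from cron
const url = 'http://openhab.local:8080/rest/metrics/prometheus';  // placeholder host/port

fetch(url, { signal: AbortSignal.timeout(5000) })
  .then((res) => {
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    console.log('openHAB is responding');
  })
  .catch((err) => {
    console.error(`openHAB health check failed: ${err.message}`);
    process.exit(1); // a non-zero exit code lets cron or a monitoring tool raise the alert
  });
```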

The JNA libraries are part of the openHAB distribution; however, their version might vary between releases. You can find them via find $OH_BASE -name "*jna*.jar". For me (an ancient OH 3.0.x) it returns:

(...)/system/net/java/dev/jna/jna-platform/5.4.0/jna-platform-5.4.0.jar
(...)/system/net/java/dev/jna/jna/5.4.0/jna-5.4.0.jar

You can copy or link these to the $OH_BASE/lib/boot/ directory.

Cool, thanks for that. One question though - the upstream JNA distribution also has a number of OS-specific jars (linux, darwin, windows, etc.) which are not vendored in the OH distribution. Do you know anything about that?

The jna-x.y.z.jar contains native libraries for all supported platforms, so you don’t need to worry about a specific OS. It is all-in-one. :slight_smile:


I have a simple watchdog rule which monitors the updates from a Z-Wave temperature sensor that reports every 10 minutes. When an update is received, the rule sends a reset to a Wi-Fi watchdog relay.
If no updates arrive within 20 minutes, the rule sends an email and stops resetting the watchdog relay, which after another 5 minutes hard-powers off the active OH controller and switches over to the backup controller (there is a separate forum topic on this).
In this way the whole communication path and all critical SW logic is monitored, from the Z-Wave interface through OH/Node-RED to the Wi-Fi interface. (I don’t have other issues/features that require watchdogs; in 8 years I have had only two failed Z-Wave devices out of about 40.) When the backup controller is online it sends an email once a day saying that the system is running on the reserve controller, so I know I need to take action. The whole switchover takes about 30 minutes in the worst case and is fully automatic.
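Roughly sketched in openHAB JS Scripting (my actual setup spans OH and Node-RED; Item names and timings below are just placeholders), the idea is:

```javascript
const { rules, triggers, items, actions, cache, time } = require('openhab');

const SENSOR = 'ZWave_Temperature_Sensor';   // placeholder: heartbeat sensor Item
const RESET  = 'Watchdog_Relay_Reset';       // placeholder: Wi-Fi watchdog relay Item

rules.JSRule({
  name: 'Z-Wave heartbeat watchdog',
  triggers: [triggers.ItemStateUpdateTrigger(SENSOR)],
  execute: () => {
    // The sensor reported in: reset the hardware watchdog relay ...
    items.getItem(RESET).sendCommand('ON');
    // ... and (re)arm a 20-minute timer that alerts if no further update arrives
    const existing = cache.private.get('silenceTimer');
    if (existing !== null) existing.cancel();
    cache.private.put('silenceTimer', actions.ScriptExecution.createTimer(
      time.ZonedDateTime.now().plusMinutes(20),
      () => {
        actions.NotificationAction.sendBroadcastNotification('No Z-Wave updates for 20 minutes');
        // the real setup also blocks further relay resets here, so the hardware
        // watchdog can power-cycle the controller and fail over to the backup
      }
    ));
  }
});
```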

Adding to Rich: I’m also using Zabbix for operational monitoring of the various system components I have in my house. OH is only one of the many VMs, network components, etc. that I’m running, so having another VM for monitoring is a must for me.

What I’ve been using for years alongside it is Splunk, for log collection, analysis and alerts. It is a great tool which helps me work with all those hundreds of logs I have, including the logs from OH. But all these tools come at a cost the moment you want to use more advanced features.

I also use OH for status reporting of the health of other components in my setup. As OH is the visualization I use every day, it is actually the best place for me to see if something happens with my heating, alarm, home control system, etc. It is also a place where it is easy to create alerts and event actions, as Rich described.

After all these years I have to say that OH is not the typical failure point; more common for me is a component or communication failure.
That of course depends on the HW setup you use. The reliability of an RPi will be different from an ESX machine with a redundant disk array and power supply.


Hi @Artyom_Syomushkin,

I would be very interested in that topic; I searched, but found nothing specific.

Do you have a link to the topic? That would be great!

Have a good one!

You can look in this topic: What's your HA reliable architecture ? Mesh + central controller? - #47 by Artyom_Syomushkin


Thank you!
