Monitoring a lot of status information

Hi community!

Problem Statement

My openHAB installation manages roughly 70 hardware switches and controls heating, blinds, lights… So it's a larger setup. Many of the components have some kind of status that is more or less important for the stability of the system.

Some examples:

  • battery level of Zigbee devices
  • network availability of routers, Hue bridges, other network devices
  • availability of the internet
  • system parameters of the RasPi that my openHAB runs on
    • disk-space
    • memory-space
    • cpu-load
    • cpu-temp
  • register values of i2c-things
  • … whatever you want to monitor
    • lowest outside temperature of the last 24h…

It is a good idea to keep all these things in mind, otherwise an outage will catch you by surprise one day. I want to stay on top of all of them and not overlook a single one.

Concept

So here is my solution:

I created a framework for all this information, to bring it all in line and manage it as a whole. My solution simply consists of items and rules. It all begins with a definition of “status”. My status items are of type String and can take one of these four values:

“OK” -> Yes, it's OK.
“INFO” -> Here is something you should notice, but it's not a problem.
“WARN” -> Hey! Look at this, otherwise it will become a problem in the future!
“CRIT” -> Something is really bad!

The central item in my framework is a status item that describes the status of my whole home automation. It's the one and only overall status.

String ST_House_State

Whenever this item changes, I get a message via Telegram. Additionally, you can drive a status light (red/yellow/green) from this item.

So, who sets this main status?

Every item I want to monitor has a second, associated status item of type String. Currently, there are 74 items like that in my home automation. I collected all of these status items in the group GR_Warnings. For every change of an item in this group, I get a Telegram message, too. And if one item in this group changes, a rule (rule “sumarize all warning-states”) is triggered. This rule checks all the status items and writes the “highest” status (CRIT > WARN > INFO > OK) into ST_House_State.

So if all status items are “OK”, the overall status is “OK”. But if even a single status is “INFO” (and none is worse), the overall status is “INFO”.

But who sets these individual status items? There are different ways, depending on the meaning of the base item.
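
The summarizing step boils down to taking the maximum over an ordered set of severities. The following is only a minimal sketch of that logic in plain Python, so it can be tried outside openHAB; it is not the attached rule. The function name overall_status and the decision to ignore unknown values (such as an uninitialized item) are assumptions made for illustration.

```python
# Severity order from the post: CRIT > WARN > INFO > OK.
SEVERITY = {"OK": 0, "INFO": 1, "WARN": 2, "CRIT": 3}


def overall_status(statuses):
    """Return the highest status found in `statuses`.

    Assumption: values outside the four known ones (for example an
    uninitialized item) are ignored; an empty input counts as "OK".
    """
    known = [s for s in statuses if s in SEVERITY]
    return max(known, key=SEVERITY.get, default="OK")


print(overall_status(["OK", "OK", "INFO"]))    # INFO
print(overall_status(["OK", "WARN", "CRIT"]))  # CRIT
```

Keeping the order in one lookup table means a fifth level could be added later without touching the comparison logic.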

Simple example
The simplest case: my Zigbee devices each have an item that notifies me about a low battery. Let's call these items “sensor_1_batteryLow”, “sensor_2_batteryLow”, … All of these items are in the group GR_Monitor_Switch_On_Warning. If one item in this group changes, a rule (rule “check GR_Monitor_Switch_On_Warning”) is triggered. This rule looks for an item named [original_itemname]_State and sets it to “WARN” if the original item is ON, or to “OK” if it is OFF. These *_State items are in the group GR_Warnings, so if the battery of a device runs low, the associated status item goes to “WARN”, which in turn moves the overall status to “WARN” (unless it is already CRIT at that time).

Example items:

Switch SN_H_U_B_HueMotion_BatteryLow (GR_Monitor_Switch_On_Warning) { channel="hue:0106:1:motionsensor1Light:battery_low" }
String SN_H_U_B_HueMotion_BatteryLow_State "SN_H_U_B_HueMotion_BatteryLow_State [%s]" (GR_Warnings)
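
The two steps of that rule, deriving the name of the associated status item and mapping ON/OFF to a status, can be sketched as follows. This is plain Python for illustration only, not the attached rule; the function names and the choice to return None for any state other than ON or OFF are assumptions.

```python
def state_item_name(item_name):
    """Derive the associated status item: [original_itemname]_State."""
    return item_name + "_State"


def switch_to_status(switch_state):
    """ON means WARN, OFF means OK, as described in the post.

    Assumption: any other state (for example NULL) yields None,
    meaning "leave the status item untouched".
    """
    return {"ON": "WARN", "OFF": "OK"}.get(switch_state)


name = state_item_name("SN_H_U_B_HueMotion_BatteryLow")
print(name, switch_to_status("ON"))
# SN_H_U_B_HueMotion_BatteryLow_State WARN
```

Because the link between the two items is purely the naming convention, adding a new monitored switch needs no rule change, only a matching pair of items.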

A more complex example

I want to monitor the ping times of hosts in my local network (my router, my Zigbee bridge…) and on the internet (Google, my provider's homepage). So I have to monitor number values (provided by the network:pingdevice binding) and set a status to INFO, WARN or CRIT based on value limits. This is also managed by groups:

// Check the ping value of necessary internal hosts
// > 500 -> WARN
// > 1000 -> CRIT
Group GR_Monitor_Ping_intern

// Check a ping-value of optional, external hosts
// > 1000 -> WARN
Group GR_Monitor_Ping_extern

Number SN_Fritzbox_Ping "Fritzbox Ping [%d ms]" (GR_Adminsite_Systemhealth, GR_Monitor_Ping_intern) { channel="network:pingdevice:fritzbox:latency" }
String SN_Fritzbox_Ping_State "SN_Fritzbox_Ping_State [%s]" (GR_Warnings)

Number SN_HueBridge_Ping "HueBridge [%d ms]" (GR_Adminsite_Systemhealth, GR_Monitor_Ping_intern) { channel="network:pingdevice:huebridge:latency" }
String SN_HueBridge_Ping_State "SN_HueBridge_Ping_State [%s]" (GR_Warnings)

// External Services
Number SN_Google_Ping "Google Ping [%d ms]" (GR_Adminsite_Systemhealth, GR_Monitor_Ping_extern) { channel="network:pingdevice:google:latency" }
String SN_Google_Ping_State "Google Ping [%s]" (GR_Warnings)

Number SN_Netcologne_Ping "NetCologne Ping [%d ms]" (GR_Adminsite_Systemhealth, GR_Monitor_Ping_extern) { channel="network:pingdevice:netcologne:latency" }
String SN_Netcologne_Ping_State "NetCologne Ping [%s]" (GR_Warnings)
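
The limit check behind these two groups can be sketched like this. It is a plain Python illustration of the thresholds written in the group comments above, not the attached rule. The table PING_LIMITS and the function ping_status are hypothetical names, and the sketch assumes a numeric latency is always available (an unreachable host is not handled here).

```python
# Limits taken from the group comments in the post (milliseconds).
# None means "this level is never raised for the group".
PING_LIMITS = {
    "GR_Monitor_Ping_intern": {"WARN": 500, "CRIT": 1000},
    "GR_Monitor_Ping_extern": {"WARN": 1000, "CRIT": None},
}


def ping_status(group, latency_ms):
    """Map a latency to OK/WARN/CRIT using the limits of its group."""
    limits = PING_LIMITS[group]
    # Check the most severe level first so that it wins.
    for level in ("CRIT", "WARN"):
        limit = limits[level]
        if limit is not None and latency_ms > limit:
            return level
    return "OK"


print(ping_status("GR_Monitor_Ping_intern", 750))   # WARN
print(ping_status("GR_Monitor_Ping_intern", 1500))  # CRIT
print(ping_status("GR_Monitor_Ping_extern", 1500))  # WARN
```

The same latency leads to different results depending on the group, which is the point of splitting necessary internal hosts from optional external ones.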

Most complex example

There are also items I expect to change often, for example the temperature sensors for my heating control. If these values don't change for, say, 30 minutes, something is wrong with the sensor. So I want a check on the last update of my temperature items. For that, I created a group “GR_LastUpdate”. If an item of this group changes, an item called [itemname]_LastUpdate is set to the current timestamp. These *_LastUpdate items are put into groups like

Group GR_Monitor_LastUpdate_10_30_60 // creates an INFO if lastUpdate is older than 10 min, a WARN if older than 30 min and a CRIT if older than 60 min
Group GR_Monitor_LastUpdate_30_90_xx // INFO after 30 min, WARN after 90 min, no CRIT
Group GR_Monitor_LastUpdate_60_120_xx

So all items for one temperature-sensor are:

Number:Temperature SN_H_U_B_Temp "Temp H_U_B [%.1f %unit%]" <temperature> (GR_LastUpdate)          { channel="hue:0302:1:motionsensor1Temp:temperature"}
DateTime SN_H_U_B_Temp_LastUpdate (GR_Monitor_LastUpdate_60_120_xx)
String   SN_H_U_B_Temp_LastUpdate_State "SN_H_U_B_Temp_LastUpdate_State [%s]" (GR_Warnings)
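
The staleness check can be sketched as follows: compare the age of the last update with the three minute limits encoded in the group name. This is a plain Python illustration, not the attached rule. Reading the limits from the group name at runtime is an assumption (the real rules may well hard-code them per group), and parse_limits and last_update_status are hypothetical names.

```python
from datetime import datetime, timedelta

LEVELS = ("INFO", "WARN", "CRIT")


def parse_limits(group_name):
    """Read the three minute limits from the end of a group name.

    "GR_Monitor_LastUpdate_30_90_xx" -> INFO 30, WARN 90, CRIT None
    ("xx" means the level is never raised).
    """
    parts = group_name.split("_")[-3:]
    return {lvl: (None if p == "xx" else int(p)) for lvl, p in zip(LEVELS, parts)}


def last_update_status(group_name, last_update, now):
    """Return OK/INFO/WARN/CRIT depending on how old last_update is."""
    age = now - last_update
    limits = parse_limits(group_name)
    for level in reversed(LEVELS):  # most severe level first
        minutes = limits[level]
        if minutes is not None and age > timedelta(minutes=minutes):
            return level
    return "OK"


now = datetime(2024, 1, 1, 12, 0)  # arbitrary example timestamp
print(last_update_status("GR_Monitor_LastUpdate_60_120_xx",
                         now - timedelta(minutes=90), now))  # INFO
print(last_update_status("GR_Monitor_LastUpdate_10_30_60",
                         now - timedelta(minutes=45), now))  # WARN
```

Note that a check like this has to run on a schedule rather than on item changes, because a sensor that has gone silent does not trigger anything by itself.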

Sitemap
Integrating this into a sitemap is very simple.

Frame label="Haus" {
      [... other items ....]
      Text item=ST_House_State label="Status [%s]" icon="status" valuecolor=[=="OK"="green", =="WARN"="orange", =="CRIT"="red"] {
      		Group item=GR_Warnings label="Warnungen" icon="status"
      		Group item=GR_Adminsite_Systemhealth label="System-Health" icon="status"
       }		
}

This shows me the main status as a colored OK/INFO/WARN/CRIT. It also shows me all the status items as a group. In addition, I have a group (GR_Adminsite_Systemhealth) with selected items that are interesting from time to time: ping values, CPU load, and so on…

Conclusion
The basic idea is to transform everything you want to monitor into a generic OK/INFO/WARN/CRIT and manage all of it together. You can add as many status items as you want to the GR_Warnings group and monitor them all at once. With this system, I have an overview of my systems at all times. So hopefully I always detect problems faster than my wife, which raises the WAF of my openHAB. I have been using this for years and am very happy with it. For me, it is a good feeling to know “everything is OK with my systems”, and to know “I'm informed right away if something goes wrong”.

I hope this is helpful to you. I’m happy to get any kind of feedback!

Happy coding!
igor

PS: Credits: I found the _LastUpdate function here: Design Pattern: Associated Items

monitoring_checkers.rules.txt (3.6 KB) monitoring_lastupdate.rules.txt (3.9 KB) monitoring_main.items.txt (1.2 KB) monitoring_network.items.txt (1.4 KB) monitoring_show_state.rules.txt (1.4 KB) monitoring_systeminfo.rules.txt (1.8 KB)


Thanks, this looks promising. I am going to give this a go.