Monitoring Rules

As the number of sensors and Items grows, the number of rules depending on them will likely also expand. It is fun at first to look at nice UIs and charts, but that only lasts a while. Soon I was no longer staring at my Grafana charts.

I started implementing monitoring rules to let openHAB notify me when something requires my attention. I would like to share a few of them, and hopefully some of you will share what you do.

Context

I am using persistence, mainly InfluxDB.
I am also using Grafana, so in theory I could manage alerting from openHAB and/or Grafana.
The issue with using Grafana is that you rely on the information being in your monitoring DB (i.e. InfluxDB). If the information did not make it in, you cannot tell much besides that a device no longer reports data, and that could be legitimate for a device that was removed from the system.

So in general, I like to do my monitoring in OH directly and send notifications via Pushover.

This thread is NOT about gathering metrics for the fun of gathering metrics.
It is more targeted at monitoring to ensure that OH and the installation keep running as intended.

Batteries

If batteries die… it is game over.

For that I am using:

  • A group called BatteryPercent (named that because some devices return a percentage and others a level in volts)
  • Persistence is set for that group
  • a rule "on change" that tells me if a battery jumps below a threshold
  • a rule "on cron" that tells me if batteries are below a threshold (a sketch of that one follows after the rule below)
rule "Alert low battery on change"
when
    Member of BatteryPercent changed
then
    val item = triggeringItem
    val limit = 20

    if (item.state >= limit) return;

    val pushover = getActions("pushover", "...")
    logInfo("LOW Battery", "Low Battery Level for {}: {}% < {}%", item.name, item.state, limit)
    val msg = String::format("Low battery for %s: %s%% < %s%%", item.name, item.state as Number, limit as Number)
    val title = String::format("Low battery for %s: %s%%", item.name, item.state as Number)
    pushover.sendPriorityMessage(msg, title, 1)
end
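The cron-based rule from the list above is not shown here; what follows is a minimal Rules DSL sketch of what it could look like, assuming the same BatteryPercent group and a placeholder Pushover account Thing UID and schedule (an illustration, not the author's actual rule):

rule "Alert low battery daily"
when
    Time cron "0 0 9 ? * * *" // every day at 09:00
then
    val limit = 20
    val pushover = getActions("pushover", "pushover:pushover-account:account")

    // Skip Items still in NULL/UNDEF and report every member below the threshold
    BatteryPercent.members.filter[ i | i.state instanceof Number && (i.state as Number) < limit ].forEach[ i |
        val msg = String::format("Low battery for %s: %s%% < %s%%", i.name, i.state, limit)
        logInfo("LOW Battery", msg)
        pushover.sendMessage(msg, "Low battery")
    ]
end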

OnlineMonitored

I started running into issues with some devices (like the Sonoff occupancy sensors). They would "lose it" and stop working until I reset them. This is not really an issue as long as I know WHAT device lost it.

I am using:

  • a group OnlineMonitored
  • Persistence is set for that group
  • a cron-based rule sending a notification if a device has not changed in 24 hours
val pushover = ...

rule "Alert on stale Monitored Items"
when
    Time cron "0 0 * ? * * *" // Every hour on the hour
then
    val threshold = now.minusHours(24)

    OnlineMonitored.members.forEach[ item |
        val lastUpdate = item.lastUpdate("influxdb")
        if (lastUpdate === null) {
            logInfo("LagAlert", "No update recorded for item '{}'", item.name)
        } else if (lastUpdate.isBefore(threshold)) {
            val msg = String::format("Item '%s' last updated at %s, which is older than 24h!", item.name, lastUpdate)
            val title = String::format("LAG Alert: %s", item.name)
            logInfo("lag-Alert", msg)
            pushover.sendMessage(msg, title)
        }
    ]
end

Load / Temp

I have had issues in the past where openHAB itself would lose it without me even noticing.
I am now monitoring the CPU temperature and load. I usually get "alerts" when I work heavily on the system, but those are easy to ignore.

rule "Send alert if OH temperature is too high"
when
    Item SysteminfoSensorsCpuTemp changed
then
    val pushover = getActions("pushover", "pushover...")
    val temp = SysteminfoSensorsCpuTemp.state as QuantityType<Number>
    val threshold = 45.0 | °C

    if (temp < threshold) return;

    val title = String::format("OH running hot: %s °C > %s °C", temp  , threshold )
    val msg = String::format("OH temperature is higher than normal: %s > %s. Check your recent rule changes", temp, threshold)

    logInfo("system", msg)
    pushover.sendPriorityMessage(msg, title, 1)
end
rule "Send alert if OH load is too high"
when
    Item SysteminfoCpuLoad5 changed
then
    val pushover = getActions("pushover", "pushover:...")
    val load = SysteminfoCpuLoad5.state as QuantityType<Number>
    val threshold = 0.8

    if (load < threshold) return;

    val title = String::format("OH running high load: %s", load)
    val msg = String::format("OH load is higher than normal: %s > %s. Check your recent rule changes", load, threshold)

    logInfo("system", msg)
    pushover.sendPriorityMessage(msg, title, 1)
end

Obviously, your thresholds may vary, but that gives you an idea.

What are you monitoring? Why?
Are you monitoring other metrics? Are you doing it differently?

Nice write up!

For this functionality I also like using the Item expire feature, catching devices' UNDEF or other expire-defined states, with item/device-specific times set depending on how important a device is :wink: It also doesn't require any persistence if persistence isn't needed for graphing or other purposes.
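A minimal sketch of that approach, with a hypothetical Item, channel, and timing (the expire metadata sends the Item to UNDEF after 30 minutes without an update, and a simple rule catches the change):

// .items -- the Item reverts to UNDEF if no update arrives for 30 minutes
// Number Bedroom_Humidity "Bedroom Humidity [%d %%]" { channel="...", expire="30m,state=UNDEF" }

rule "Humidity sensor stopped reporting"
when
    Item Bedroom_Humidity changed to UNDEF
then
    val pushover = getActions("pushover", "pushover:pushover-account:account")
    pushover.sendMessage("Bedroom_Humidity has not reported for 30 minutes", "Stale sensor")
end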


You asked for it! :wink:

All of my monitoring is based off of Threshold Alert and Open Reminder [4.0.0.0;4.9.9.9]. This is a rule template which you can install from the marketplace and use to instantiate and configure rules. The main thing it does is call another rule if an Item meets a certain criterion for a certain amount of time: for example, if an Item doesn't receive an update for a given amount of time, if it remains above a certain number for a given amount of time, or if it remains in a certain state for a given amount of time.

This is a really common thing one needs to do in openHAB. You can do a lot with this one rule:

  • turn on a humidifier when the humidity remains below 35% for five minutes
  • send an alert if a window remains open for too long
  • control a light based on a motion sensor
  • send an alert if a sensor doesn’t report for too long a time
  • send an alert if a battery is below a certain level

The rule template only does the part that detects the event. It then calls a rule or script you write to respond to the event.

Batteries

I don’t have the same situation you do. All my battery Items report in a percentage. I’ve configured the Threshold Alert rule as follows:

  • Group: AllBatteries
  • Threshold: 10 %
  • Operator: <
  • Alert Delay: PT1S
  • Reminder Period: PT12H
  • Alert Rule: battery_alert
  • Do Not Disturb Start: 22:00
  • Do Not Disturb End: 08:00

All other properties are the defaults. With this configuration, if any member of AllBatteries remains < 10% for at least one second, battery_alert will be called. If battery_alert was previously called and the Item becomes >= 10%, battery_alert will be called again. If either of those occurs between 22:00 and 08:00, the rule will not be called until 08:00 (who wants to be woken up at night to be told a battery is low?). If the Item remains < 10% for another 12 hours after that first alert, the alert rule will be called again, repeating every 12 hours until the Item becomes >= 10%.

My battery_alert rule is as follows:

configuration: {}
triggers: []
conditions: []
actions:
  - inputs: {}
    id: "1"
    configuration:
      type: application/javascript;version=ECMAScript-2021
      script: >
        var {alerting} = require('rlk_personal');

        var logger = log('Low Battery');

        alerting.sendAlert('The following batteries are below 10%: ' +
        threshItemLabels, logger);
    type: script.ScriptAction

or just the JS code:

var {alerting} = require('rlk_personal');
var logger = log('Low Battery');
alerting.sendAlert('The following batteries are below 10%: ' + threshItemLabels, logger);

The rule only gets called when a battery first goes below 10%. threshItemLabels is passed in from Threshold Alert and is a comma-separated String of all the Items that are in the alerting state (i.e. < 10%).

Because I get this alert every 12 hours as long as there is at least one battery < 10%, I don’t need another rule to send me a summary daily or anything like that.

But notice, all this comes from installing the template, setting 8 properties and writing three lines of code.

Online Monitoring

I have two flavors of this use case. A lot of devices can go offline and be detected through the status of their Things, by pinging with the Network binding, because they stop reporting for a time, etc. So I have several different ways to detect that a device has gone offline.

But I have a standard approach to representing the online status of my devices. I use the semantic model and each Equipment that I want to track has a Status Item. This Item is ON when the device is online and OFF when the device is offline.
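For illustration only (the names and tags here are hypothetical, not this poster's actual model), such an Equipment with its Status point could look like this in an .items file:

// Equipment group plus a Status point; ON = online, OFF = offline.
// The Item name follows the <equipment>_Status convention used by the rules below.
Group  NAS        "NAS"        ["NetworkAppliance"]
Switch NAS_Status "NAS Online" (NAS) ["Status"]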

I have a widget on MainUI that shows those status Items that are OFF. It’s not on the marketplace but it works very much like Battery Level Status.

Things

I use Thing Status Reporting [4.0.0.0;4.9.9.9] to call a rule that looks for those Things I care about and finds and updates the relevant status Item for the equipment based on the status of the Thing.
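Purely as an illustration of the idea (this is not the template itself), a plain Rules DSL rule keyed to a hypothetical Thing UID and Status Item could do the same mapping:

rule "Map motion sensor Thing status to its Status Item"
when
    Thing "mqtt:topic:broker:motionsensor" changed
then
    // getThingStatusInfo can return null, e.g. shortly after startup
    val info = getThingStatusInfo("mqtt:topic:broker:motionsensor")
    if (info !== null && info.getStatus().toString == "ONLINE") {
        MotionSensor_Status.postUpdate(ON)
    } else {
        MotionSensor_Status.postUpdate(OFF)
    }
end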

Network Devices

I use the Network binding to ping servers and services that are relevant and link that directly to the status Item.
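A minimal sketch of that wiring (the host name and Thing name are placeholders), reusing a Status Item like the one sketched above:

// .things -- ping the server
Thing network:pingdevice:nas [ hostname="192.168.1.10" ]

// .items -- the same hypothetical NAS_Status Item, now linked to the binding's online channel
Switch NAS_Status "NAS Online" (NAS) ["Status"] { channel="network:pingdevice:nas:online" }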

Sensors

I use Threshold Alert for this too. This is more of an advanced usage and it does require a little bit of extra config. First I pick one Item from the equipment to monitor that reports a new value relatively frequently.

I use the following config for the Threshold Alert rule:

  • Group: Sensors
  • Threshold: UnDefType
  • Operator: ==
  • Invert: true
  • Alert Delay: PT15M
  • Reminder Period: PT12H
  • Reschedule: true
  • Metadata Namespace: sensorStatus
  • Alert Rule: new-sensor-status-proc
  • Initial Alert Rule: new-sensor-status-proc
  • Do Not Disturb Start: 22:00
  • Do Not Disturb End: 08:00
  • Gatekeeper Delay: 500

This configuration will call new-sensor-status-proc immediately when any Item changes from NULL or UNDEF to anything else. It's a little backwards in that we are treating the online state as the alerting state.

As long as the Item updates to something other than UNDEF or NULL the alerting timer gets rescheduled on every update (that’s what reschedule=true does). If the Item doesn’t update for the alert delay amount of time, new-sensor-status-proc gets called.

To avoid hitting the new-sensor-status-proc rule too fast, we add a half second gatekeeper delay which prevents the rule from being called faster than every half second.

Metadata can be applied to individual Items to override any of the properties above for that Item so we can have a different alert delay on a per Item basis.

The alert rule will be called every 12 hours as long as the sensor doesn’t update to a state different from NULL or UNDEF.

And of course, we don’t need alerts while we are sleeping.

The script is called with several values which can be used to tell whether the Item has just entered the alerting state, has been in the alerting state for the alert delay amount of time, or has exited the alerting state. We use those to determine if the sensor has gone offline or come back online. We use the semantic model actions to get to the status Item for the equipment.

configuration: {}
triggers: []
conditions:
  - inputs: {}
    id: "2"
    configuration:
      type: application/javascript
      script: >
        var equipment = actions.Semantics.getEquipment(items[alertItem]);

        var statusItem = (equipment === null) ? null : items[equipment.name+'_Status'];

        if(equipment === null || statusItem === null) {
          console.warn(alertItem + ' does not belong to an equipment or equipment doesn\'t have a Status Item!');
          false;
        }

        else {
          console.debug('Sensor status reporting called for ' + alertItem + ', equipment ' + equipment.label + ', is alerting ' + isAlerting + ', and is initial alert ' + isInitialAlert 
                       + ', current equipment state is ' + statusItem.state);
          // Sensor is offline                         Sensor back online
          (isAlerting && statusItem.state != 'OFF') || (!isAlerting && statusItem.state != 'ON');
        }
    type: script.ScriptCondition
actions:
  - inputs: {}
    id: "1"
    configuration:
      type: application/javascript
      script: >
        var {alerting} = require('rlk_personal');

        var logger = log('Sensor Alert');

        var equipment = actions.Semantics.getEquipment(items[alertItem]);

        var statusItem = items[equipment.name+'_Status'];


        if(isAlerting && statusItem.state != 'OFF') {
          statusItem.postUpdate('OFF');
          alerting.sendAlert('Offline: ' + equipment.label + ' has stopped reporting', logger);
        }

        else if(!isAlerting && statusItem.state != 'ON') {
          statusItem.postUpdate('ON');
          alerting.sendAlert('Online: ' + equipment.label + ' is back', logger);
        }

        else {
          console.info('Sensor status update alerting ' + isAlerting + ' initial ' + isInitialAlert + ' equipment ' + equipment.label + ' status ' + statusItem.state);
        }
    type: script.ScriptAction

The condition of this rule checks to see if the equipment has a status Item. If it does, it checks to ensure that the status Item isn't already in the proper state. Remember, the alerting state is that the equipment is online, so if isAlerting is true, the status should be ON.

var equipment = actions.Semantics.getEquipment(items[alertItem]);
// Only look up the Status Item if an equipment was actually found
var statusItem = (equipment === null) ? null : items[equipment.name+'_Status'];
if(equipment === null || statusItem === null) {
  console.warn(alertItem + ' does not belong to an equipment or equipment doesn\'t have a Status Item!');
  false;
}
else {
  console.debug('Sensor status reporting called for ' + alertItem + ', equipment ' + equipment.label + ', is alerting ' + isAlerting + ', and is initial alert ' + isInitialAlert 
               + ', current equipment state is ' + statusItem.state);
  // Sensor is offline                         Sensor back online
  (isAlerting && statusItem.state != 'OFF') || (!isAlerting && statusItem.state != 'ON');
}

If the condition passes, the action finds the status Item and updates it as required and sends an alert.

var {alerting} = require('rlk_personal');
var logger = log('Sensor Alert');
var equipment = actions.Semantics.getEquipment(items[alertItem]);
var statusItem = items[equipment.name+'_Status'];

if(isAlerting && statusItem.state != 'OFF') {
  statusItem.postUpdate('OFF');
  alerting.sendAlert('Offline: ' + equipment.label + ' has stopped reporting', logger);
}
else if(!isAlerting && statusItem.state != 'ON') {
  statusItem.postUpdate('ON');
  alerting.sendAlert('Online: ' + equipment.label + ' is back', logger);
}
else {
  console.info('Sensor status update alerting ' + isAlerting + ' initial ' + isInitialAlert + ' equipment ' + equipment.label + ' status ' + statusItem.state);
}

Services

I aggregate all the status reporting from Things, Network, and Sensors into a Services Offline pair of rules. Again I use Threshold Alert.

  • Group: ServiceStatuses
  • Threshold: OFF
  • Operator: ==
  • Invert: false
  • Alert Delay: PT5M
  • Reminder Period: PT12H
  • Metadata Namespace: sensorStatus
  • Alert Rule: serviceStatusProc
  • EndAlert Rule: serviceStatusProc
  • Do Not Disturb Start: 22:00
  • Do Not Disturb End: 08:00
  • Gatekeeper Delay: 1000

With this configuration, serviceStatusProc is called if any member of ServiceStatuses remains OFF for five minutes. If the Item then becomes something other than OFF (e.g. ON), the rule is called again to indicate the service is back ON.

The called rule is:

configuration: {}
triggers: []
conditions: []
actions:
  - inputs: {}
    id: "1"
    configuration:
      type: application/javascript
      script: >-
        var {alerting} = require('rlk_personal');

        var logger = log('Service Alert');


        var service = items.getItem(alertItem).label.replace(' Online Status',
        '');

        if(isAlerting) {
          cache.private.put('alerted', true);
          alerting.sendAlert('Service offline: ' + service);
        } 

        else if(cache.private.get('alerted', () => false)) {
          cache.private.put('alerted', false);
          alerting.sendAlert('Service online: ' + service);
        }

        // else don't send any alert if we haven't alerted offline previously
    type: script.ScriptAction

This just sends an alert when a device goes offline or online with some extra code in there I need to remove (I added it while debugging a problem).

I’m going to do cancelable notifications eventually instead of sending separate alerts for offline and online. I haven’t gotten around to it yet.

Load / Temp

I use Zabbix to get alerts on these. But if I were to put this into OH I'd use good old Threshold Alert again. I'd use the following properties for the temperature:

  • Group: doesn’t matter, since there’s only one Item I’d manually change the trigger to just that one Item
  • Threshold: 45 °C
  • Operator: >=
  • Invert: false
  • Alert Delay: PT1M
  • Alert Rule: cpuTooHot
  • End Alert Rule: cpuTooHot

Pretty much the same configuration for the system load too.

The called rule would have some JS code along the lines of the following:

var {alerting} = require('rlk_personal');
if(isAlerting) alerting.sendAlert("The CPU is too hot!");
else alerting.sendAlert("The CPU temp is now normal.");

Same for load.

I also use Threshold Alert to control some dumb humidifiers based on a nearby humidity sensor, to send me an alert if a motion sensor at my dad's house doesn't detect motion for too long a time, and to get an alert if one of the doors remains open for too long a time.

Yes, but I use Zabbix for that. openHAB only monitors those things which are directly related to home automation. All IT/homelab related stuff is monitored outside of OH. And even there, I mainly just get emails when something starts to exceed thresholds. But every now and then it's handy to see exactly when something went wonky and see what the RAM and CPU were doing at that time.


I use metadata to have per-Item thresholds for batteries. The reason is that the value I need is considerably different depending on the device. My locks are in serious trouble if they go below 75 or 80%, I have Zigbee temperature sensors that continue to report quite well even at 8 to 10%, and I have some Zooz Z-Wave door sensors that report 0% yet work for many months (I have no solution for them - I don't want to waste batteries by replacing them too early, yet I need the sensors to work - I will probably toss the sensors and replace them with something else).

I would like to use your approach. Yesterday I played with it, but I have a problem with the threshold.

I have a humidity sensor, and my goal is to get an alert if the humidity does not change for some time (like 30 minutes), which would mean that something is not working as expected.
In your examples the threshold is UNDEF/NULL or a fixed value (ON/OFF/a fixed number etc.).
In my case the threshold is variable (the humidity in %); I want an alert if the value remains the same for 30 minutes (the sensor, even if connected, is not sending data).

Can you give me a hint on how to proceed? I think I am doing something wrong with the template.

It’s actually “UnDefType” which the rule will use to check for both NULL and UNDEF.

Mine too for the sensor status monitoring rule. So I configured the rule to call the alerting rule when the Item doesn’t update for a time. To achieve that you have to invert your thinking a little.

The alerting state is when the Item has a state that is neither NULL nor UNDEF (UnDefType). So the threshold is set to UnDefType, the comparison is == and the invert flag is set to true.

But we want the call when the Item doesn’t update for 30 minutes so we set reschedule to true. This means if a change comes before 30 minutes, the timer is rescheduled. The alert rule will only be called when the Item doesn’t receive a change for 30 minutes.

So the config for your case would be:

  • Group: whatever Group you put your humidity Item(s) into. The rule is designed to work with a Group to trigger it.
  • Threshold: UnDefType. We really care about every other state, but we can't put every number between -infinity and infinity as a threshold, so we put what it's not here.
  • Operator: ==. We want to match UnDefType.
  • Invert: true. We want to invert the operator. I honestly don't remember why I added this invert option to the rule because it seems != as the Operator would do the same thing, but there was some weird edge case I don't remember that required it. With these three properties, the alerting state is any state that isn't NULL or UNDEF.
  • Alert Delay: PT30M. The Item needs to be in the alerting state for 30 minutes to generate the call to the alerting rule.
  • Reschedule: true. If a change comes before the alert timer goes off, reschedule the timer for another 30 minutes.
  • Alert Rule: the UID of the rule you write to process the alert. This rule will be called 30 minutes after the last change to the Item.

If you want to send another alert when the sensor comes back, add the Initial Alert Rule property. This will be called immediately after the Item changes for the first time after the alert rule was called.

You can use the same rule for both. isAlerting will be true when the Item has not responded for a while and false when the Item comes back.

I am sure I am doing something wrong.

It does not work.

I receive this notification:

with this configuration:
configuration:
  dndEnd: 00:00
  rateLimit: ""
  invert: true
  initAlertRule: ""
  dndStart: 00:00
  thresholdState: UnDefType
  alertRule: f09c7addb0
  endAlertRule: ""
  operator: ==
  defaultRemPeriod: ""
  hysteresis: ""
  reschedule: true
  namespace: thresholdAlert
  gkDelay: 0
  defaultAlertDelay: PT5M
  group: GruppoSensoriUmidita

The rule f09c7addb0 is a simple Telegram notification:

telegram.sendTelegram('Sensor status proc called with Item ' + alertItem + ', state ' + alertState + ' initialAlert ' + isInitialAlert + ' and alerting ' + isAlerting);

As you can see, I receive the notification in the same minute, but the value has changed.

I don't understand what I am doing wrong.

My group is:

I have one sensor that is not working, and my expectation is to receive the notification only for that one, "Salone Umidità".

Open a new thread, tag me in it. Put the threshold alert to into debug mode and paste the logs. Paste the text, not screenshots. Include relevant logs from events.log.