Simple sensor alive monitoring solution using rules

Jan221 · October 9, 2020, 9:56pm

Hi,

In this thread I want to share my solution for sensor ‘alive’ or sensor ‘heartbeat’ monitoring.

Preface

The initial trigger for this was a failure of my zwave stick, which stopped working for some unknown reason. Because of that, OH missed some room temperature values, and my room automation did not work properly for several days until I noticed that something was wrong.

After that I looked for a way to monitor my zigbee and zwave thermostats, which report room temperature values, because I want to be informed about such problems promptly.

I started writing some rule code and looked around the forums a bit. I found several ways to meet my requirements but wanted an easy and simple solution (at least it seems simple to me).


Requirement

Monitoring of sensor values (= items) to detect that an item has not been updated within a maximum period of time, e.g. “Item_XY has not been updated for 1 hour”.
I call this a “heartbeat timeout”.
Notification as a Telegram message
“Easy” to implement


Thoughts

My first idea was to create a rule for every item, save a timestamp, and check it periodically. But that seemed far too inefficient to me.

Then I wanted to use the persistence service and check for each item whether its value had not changed for a certain period of time. While trying that, I ran into a known issue: the .changedSince() method does not work with influxdb, which I am using. This approach was not really satisfying either.

Alternatively, I wanted to use the .lastUpdate() method. The problem is that my sensor values are persisted to influxdb every minute, so lastUpdate always returns a timestamp that is at most one minute old, regardless of whether an item update was really received.
But I also use the mapdb persistence service. So far I have used it only to store some items on every change so that they can be restored after a restart.

Decisive idea
I created an extra group and set the persistence strategy to everyUpdate in the mapdb.persist file for all items in this group. That way I can use the .lastUpdate() method for these items to check when they were last updated.
Another advantage of this group: I can loop through all of its items in one rule and don’t have to process each item individually.


Final solution

Now my final solution. It consists mainly of three parts:

An item group
A persistence service
A rule for the group

In groups.items I have a group defined:

//Items to be monitored for heartbeat. See system.rules
Group G_MonitorHeartbeat_A      "Heartbeat monit. category A"

Note: I have “category A” in the name because there are also categories “B” and “C”. I use different heartbeats (different time periods) for different sensors.

The items that should be monitored become members of this group. Example:

Number   I_FlOG_HVAC_SensorTemperature   "Current temperature [%.1f °]"     (G_Hvac, G_IfxEvMin, G_MonitorHeartbeat_A) {channel="zwave:device:e2101bdd:node5:sensor_temperature"}

Note the G_MonitorHeartbeat_A group of the item.

In mapdb.persist I define persistence for all items of this group G_MonitorHeartbeat_A:

Items {
    //All items of this group are collected on every change and restored on startup
    G_Restore* : strategy = everyChange, restoreOnStartup

    //All items of this group are collected on every update and restored on startup
    //This group is for heartbeat monitoring to detect last update of an item
    G_MonitorHeartbeat_A* : strategy = everyUpdate, restoreOnStartup

    //All items of this group are collected on every update and restored on startup
    //This group is for heartbeat monitoring to detect last update of an item
    G_MonitorHeartbeat_B* : strategy = everyUpdate, restoreOnStartup
}

Warning: This only works if the affected items are not also saved to the mapdb persistence service via cron (periodically). A periodic save would refresh the stored timestamp even when no update was received, so they must be saved only on update!

Now the rule:

The rule trigger is a cron statement:

  Time cron "0 0/1 * * * ?"   // every 1 minute

When the rule triggers, we loop through all items of our group:

G_MonitorHeartbeat_A.allMembers.forEach[item | 
  //Job to do here
  ]

Inside this loop, the core of it all is these lines:

    var DateTime LastUpd = item.lastUpdate("mapdb").toDateTime
    if (LastUpd.plusHours(3).isBeforeNow) {
      //Heartbeat timeout exceeded
    }

With these lines we check the timestamp of the last update of this item in the mapdb persistence service. If this timestamp plus 3 hours is before now, then we know that the item has not been refreshed for at least 3 hours.
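
One thing to watch for: lastUpdate returns null when the persistence service has no entry for an item yet (for example a newly added item that has never reported), and calling .toDateTime on null would abort the rule. A minimal sketch of a guarded check, assuming openHAB 2.x where lastUpdate returns a Joda-Time instant as in the lines above; the logWarn text is only an illustration:

```
G_MonitorHeartbeat_A.allMembers.forEach[item |
  // null means mapdb has never stored a value for this item
  val lastUpd = item.lastUpdate("mapdb")
  if (lastUpd === null) {
    logWarn("System_HeartbeatMonitoring3h", "No mapdb entry yet for " + item.name)
  }
  else if (lastUpd.toDateTime.plusHours(3).isBeforeNow) {
    //Heartbeat timeout exceeded
  }
]
```

Whether a missing entry should count as a timeout or just be logged is a design choice; here it is only logged.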

Retriggering
There is still one challenge to overcome:
When the rule detects a heartbeat timeout I want to be informed via Telegram message. But I do not want to get the same message every minute for as long as the timeout stays exceeded.
Conclusion: I need a way to remember the state of my heartbeat check for every item so that the rule can tell whether it is a new or an already known timeout violation.

This is done with a list of the names of the items that have triggered a heartbeat timeout:

import java.util.List

var List<String> Heartbeat_A_TriggerList = newArrayList()

Note: The list has to be defined outside the rule!

Now we can add an item name to the list with Heartbeat_A_TriggerList.add(item.name), remove it with Heartbeat_A_TriggerList.remove(item.name), and check whether it is already in the list with if (Heartbeat_A_TriggerList.contains(item.name)).

Finally my complete rule:

rule "System_HeartbeatMonitoring_A"
when
  Time cron "0 0/1 * * * ?"   // every 1 minute
then
  logInfo("System_HeartbeatMonitoring3h", "Checking all items of group " + G_MonitorHeartbeat_A.label + " for heartbeat timeout...")
  
  G_MonitorHeartbeat_A.allMembers.forEach[item | 
    var DateTime LastUpd = item.lastUpdate("mapdb").toDateTime
    if (LastUpd.plusHours(3).isBeforeNow) {
      //Heartbeat timeout exceeded
      if (Heartbeat_A_TriggerList.contains(item.name)) {
        //Already triggered this alarm
        logInfo("System_HeartbeatMonitoring3h", "Still heartbeat alarm for " + item.label + " (" + item.name + "). Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
      }
      else {
        //New alarm
        logInfo("System_HeartbeatMonitoring3h", "Heartbeat alarm for " + item.label + " (" + item.name + ")! Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
        //Add to trigger list
        Heartbeat_A_TriggerList.add(item.name)
        sendTelegram("JanOnly", "\u203c \ud83d\ude32 Heartbeat alarm!\n" + "Item: " + item.label + " (" + item.name + ")\n" + "Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
      }
    }
    else {
      if (Heartbeat_A_TriggerList.contains(item.name)) {
        //This one has been triggered and is now back here again. :-)
        logInfo("System_HeartbeatMonitoring3h", "Heartbeatalarm terminated :-) for " + item.label + " (" + item.name + "). Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
        sendTelegram("JanOnly", "\ud83d\ude0a Heartbeat alarm ended.\n" + "Item: " + item.label + " (" + item.name + ")\n" + "Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
        //remove from trigger list
        Heartbeat_A_TriggerList.remove(item.name)
      }
      else {
        //Everything is fine
        //logInfo("System_HeartbeatMonitoring3h", "No Heartbeatalarm :-) for " + item.label + " (" + item.name + "). Last update: " + LastUpd.toString("dd.MM.yyyy HH:mm:ss (Z)"))
      }
    }
  ]
end

I have used this whole mechanism with several groups to realize different timeouts like 1 hour, 3 hours, etc.
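
As an alternative to one copy of the rule per group, the timeout could be looked up per group inside a single rule. This is only a sketch, not what I actually run: the rule name, the shared list Heartbeat_TriggerList, and the 1 hour timeout for G_MonitorHeartbeat_B are assumptions for illustration, and an item without any mapdb entry is treated as timed out.

```
import java.util.List

// One shared list is enough because item names are unique across groups
var List<String> Heartbeat_TriggerList = newArrayList()

rule "System_HeartbeatMonitoring_AllCategories"
when
  Time cron "0 0/1 * * * ?"   // every 1 minute
then
  // Assumed timeouts in hours per group; adjust to your own categories
  val timeouts = newLinkedHashMap(G_MonitorHeartbeat_A -> 3, G_MonitorHeartbeat_B -> 1)

  timeouts.forEach[grp, hours |
    grp.allMembers.forEach[item |
      val lastUpd = item.lastUpdate("mapdb")
      // No mapdb entry at all is handled like an exceeded timeout
      val timedOut = (lastUpd === null) || lastUpd.toDateTime.plusHours(hours).isBeforeNow
      val known = Heartbeat_TriggerList.contains(item.name)
      if (timedOut && !known) {
        //New alarm
        Heartbeat_TriggerList.add(item.name)
        sendTelegram("JanOnly", "Heartbeat alarm! Item: " + item.label + " (" + item.name + ")")
      }
      else if (!timedOut && known) {
        //Alarm is over
        Heartbeat_TriggerList.remove(item.name)
        sendTelegram("JanOnly", "Heartbeat alarm ended. Item: " + item.label + " (" + item.name + ")")
      }
    ]
  ]
end
```

The trade-off is that all categories share one cron trigger and one log name, which is simpler to maintain but less flexible than separate rules.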

Finally

I hope that I have not created a duplicate with my thread and I am happy if someone can use it to get some ideas.

rlkoshak · October 10, 2020, 4:26pm

Not a duplicate but an alternative to Design Patterns: Generic Is Alive which doesn’t require persistence and doesn’t require polling (i.e. the every minute Rule).

The absolute simplest approach to this would be to use the Expire binding to reset your Item to UNDEF or NULL when it hasn’t received an update for too long. You can trigger a rule when those Items go to NULL/UNDEF to generate your alert. Only one alert message will be sent. See the “Expire Binding” example in the DP linked above.
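
A minimal sketch of that approach, assuming the openHAB 2.x Expire binding is installed and reusing the item, group, and Telegram bot name from the first post. The 3h period and the rule name are assumptions for illustration and are not copied from the linked DP. In the .items file, expire is added next to the channel link:

```
// With no state or command given, Expire sets the item to UNDEF after 3h without an update
Number   I_FlOG_HVAC_SensorTemperature   "Current temperature [%.1f °]"     (G_Hvac, G_IfxEvMin, G_MonitorHeartbeat_A) {channel="zwave:device:e2101bdd:node5:sensor_temperature", expire="3h"}
```

And one rule for the whole group:

```
rule "System_HeartbeatExpired_A"
when
  Member of G_MonitorHeartbeat_A changed to UNDEF
then
  // Fires once per item when Expire resets it, so no trigger list is needed
  sendTelegram("JanOnly", "Heartbeat alarm! Item: " + triggeringItem.label + " (" + triggeringItem.name + ")")
end
```

A second rule triggered by "Member of G_MonitorHeartbeat_A changed from UNDEF" could send the matching all-clear message.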

Jan221 · October 12, 2020, 6:55am

Thanks for the alternative!

I had seen this thread, but have not had a deeper look. It seemed a bit complicated at first glance. After looking again it is probably not.

However I’m not sure if I would like the idea of changing the value to UNDEF. This has both advantages and disadvantages for the subsequent processing of the values. But it is definitely a possibility.

Like always: There’s more than one way to skin a cat!

rlkoshak · October 12, 2020, 2:54pm

Well, this would be consistent with the behavior of many bindings and the purpose of the UNDEF state. If the sensor has not reported for too long of a time, you have no idea what state the device is actually in. UNDEF is the state that means “I don’t know the state right now”.

Jan221 · October 13, 2020, 11:04am

I understand what you mean. But for me, setting to UNDEF is not the only truth.
Apart from OH, I mainly work with systems that do not support “undef” or “invalid”, or with systems that separate this information into different datapoints.
KNX is an example of that. A measured temperature value, for example, is represented by a 2-byte float value without any valid or invalid labelling. An On/Off value is only one bit, i.e. only two states, etc.

This is a fundamental topic of information processing: you get data and you need information about the reliability of the data. One can fundamentally question whether this reliability should be encoded in the actual value in the form of valid / invalid, or whether one should separate the information and, for example, draw conclusions about the reliability from the age of the data or from a completely separate data object.

rossko57 · October 13, 2020, 3:05pm

Fundamental to home automation, “don’t keep heating a room based on a broken temperature sensor”.
The implementation of UNDEF provides a very simple mechanism for that.
But you don’t have to use this, and can construct your own scheme as fancy as you wish.

Jan221 · October 13, 2020, 6:36pm

Naturally. That was the motivation of the whole thread. But that has nothing to do with what I called fundamental above.

I know I don’t have to. Nor does anyone have to apply my suggestion.
In this thread I only wanted to present my solution, and in my replies to Rich I just explained that changing a value to UNDEF is not “better” by definition. It’s just another way to get there, and the choice depends, for example, on further processing or on connected systems.

Everyone is welcome to decide which path to choose. We live in a free world!