Rule 'Total timeout 10000 ms elapsed'

All

I have an extremly simple rule which sends an alert when one of my devices is not reachable. This rule gets triggered when an item of a group changes to ON or OFF. In the rule I loop through all items and check which item has state OFF.
The rule seems not reliable and very often I get an timeout error:

2021-03-10 07:29:38.800 [ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'CheckDevices' failed: Total timeout 10000 ms elapsed

When I manualy run the rule, it’s always working fine.

triggers:
  - id: "1"
    configuration:
      groupName: Network_DevicesStatus
      state: OFF
      previousState: ON
    type: core.GroupStateChangeTrigger
  - id: "2"
    configuration:
      groupName: Network_DevicesStatus
      state: ON
      previousState: OFF
    type: core.GroupStateChangeTrigger
conditions: []
actions:
  - inputs: {}
    id: "3"
    configuration:
      type: application/vnd.openhab.dsl.rule
      script: >-
        logInfo("Network.Status", "Devices Group: " +
        Network_DevicesStatus.state.toString)
        val StringBuilder message = new StringBuilder
        if (Network_DevicesStatus.state == OFF) {
          Network_DevicesStatus.members.forEach[ 
              status | if (status.state == OFF) { 
                  message.append(status.label + " is offline\n")
              }
          ]
        }
        else {
            message.append("All devices online\n")
        }
        logInfo("Network.Status", message.toString)
        val actions = getActions("pushover", "pushover:pushover-account:default")
        actions.sendMessage(message.toString, "Device status")
        logInfo("Network.Status", "Pushover sent")
    type: script.ScriptAction

You’ve got some helpful logInfo() in your rule. What do they tell you about rule progress when this happens, or about multiple re-runs?

When the rule give the error, the last part I see is displaying the message logInfo("Network.Status", message.toString).
So it seems that it gets stuck from val actions...

But I pressed the Run Now button multiple times in the rules creation windows, and this is working fine all the time.

Maybe it’s the sendMessage() that is delayed and blocking.

For testing I’m gonna replace pushover with mail.
In OH 2.x I was still using the pushover & mail config with (cfg) files. First time I use the new bindings.

This presents some new information. I didn’t know there was a 10 second timeout for a rule. That’s good to know.

It would be helpful to see the logs generated by this rule, in particular the timestamps on the logs statements when it worked and when it didn’t.

It might also be useful to see what the CPU/Memory situation on your machine is when it times out and what’s going on immediately before and during the rule’s timeout in events.log. Maybe the Things went offline and are coming back online or something.

As an aside, and not something to worry about until we figure out this timeout issue, you could use a map/reduce to generate the list of names.

var message = "All devices online\n"
if(Network_Devices_Status.state == OFF){
    message = Network_DevicesStatus
                .members.filter[ s | s.state == OFF ]
                .map[ label ]
                .reduce[ msg, label | msg = msg + label + " is offline\n" ]
}

This doesn’t actually address your problem though.

Also, here is the code I use for offline alerting of my services for comparison.

triggers:
  - id: "1"
    configuration:
      groupName: ServiceStatuses
      state: ON
    type: core.GroupStateChangeTrigger
  - id: "2"
    configuration:
      groupName: ServiceStatuses
      state: OFF
    type: core.GroupStateChangeTrigger
conditions:
  - inputs: {}
    id: "4"
    configuration:
      type: application/javascript
      script: event.oldItemState.class != UnDefType.class
    type: script.ScriptCondition
actions:
  - inputs: {}
    id: "3"
    configuration:
      type: application/javascript
      script: >-
        var logger =
        Java.type("org.slf4j.LoggerFactory").getLogger("org.openhab.model.script.Rules.OfflineAlert");

        scriptExtension.importPreset("default");

        this.OPENHAB_CONF = (this.OPENHAB_CONF === undefined) ? java.lang.System.getenv("OPENHAB_CONF") : this.OPENHAB_CONF;

        load(OPENHAB_CONF + "/automation/lib/javascript/personal/alerting.js");

        load(OPENHAB_CONF + "/automation/lib/javascript/personal/metadata.js");

        load(OPENHAB_CONF + "/automation/lib/javascript/community/timerMgr.js");


        this.tm = (this.tm === undefined) ? new TimerMgr() : this.tm;


        var currState = event.itemState;


        var alertGenerator = function(itemName, itemState, getName, sendAlert) {
          return function() {
            if(items[itemName] == itemState) {
              var status = (itemState == ON) ? "online" : "offline";
              sendAlert(getName(itemName) + " is now " + status + "!", logger);      
            }
            else {
              logger.warn("Offline alert timer expired but Item " + itemName + " has returned to it's previous state!")
            }
          }
        }


        this.tm.check(event.itemName, "5m", alertGenerator(event.itemName, event.itemState, this.getName, this.sendAlert));
    type: script.ScriptAction

You can find timerMgr at GitHub - rkoshak/openhab-rules-tools: Library functions, classes, and examples to reuse in the development of new Rules.. I have a little metadata library that gets the name of the device from metadata but I’m probably going to move to using label instead like you do. sendAlert is a personal library where I centralize my alerting code (sending the email, pushover, broadcast notification type stuff like is at the bottom of your rule). I centralize these in a library so that I can change how notifications work in one place (e.g. move from pushover to email).

Note, the above rule alerts on each service individually. It does not summarize all the offline devices in each alert. I’ve another rule that runs in the morning with a summary.

I think I need to find the reason in this direction. OH2.x is running on a Pi, but because I started testing OH3, I’m running it in a container on my Synology (together with 31 other containers :laughing:). I see now that it’s taking 69% of my CPU and 1,8GB RAM. During the night I also have some backups running and most of the time it was during this timeframe that the rule was/is failing.
So I probably need to go back to a Pi or to find another solution to run OH3. I was thinking on moving al my containers to a NUC, maybe I need to dive into this…

I am also having the same rule timeout issue on OH3 running on a NUC when sending a Pushover message with an image attachement:

[ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'IPC-2' failed: Total timeout 10000 ms elapsed in IPC

The image is an animated gif around 1MB.

When I perform the same action in a terminal, I can see that the upload is nearly instantaneous, but processing on Pushover server is quite slow and it takes around 13-14 seconds to get the HTTP result back. Without attachement the message is processed in 1 second. So this is not a resource issue on the machine running openhab.

It seems so far that the error does not prevent the notification from being sent. Yet, it would be helpful to be able to increase the timeout in rules to avoid this error (showing up in the log) and furthermore allowing the return value to be processed. Is this configurable in any way?

I think it is fixed in 3.1RC1 !

I’m not sure that will help with how long the ‘parent’ rule execution takes.

A possible circumvention would be to spin off the task outside of the rule -inside a createTimer() body.