Unreliable rule triggering (cron) with openhab 2.3


(R.R.) #1

Recently I tried to update from openhab 2.1 to openhab 2.3 .

There were massive problems, because rules were not triggered reliable:

  • CRON trigger didn‘t work (sometimes)
  • rules were not triggered by homematic (but the new values were in events.log)

My configuration was adapted to version 2.3 via vscode and all warnings removed. As a I was using docker, I could easily do a fresh restart (clear userdata, docker rm openhab) to exclude configuration problems.

Sometimes the cron was working, but not always. Mostly after a restart openhab had stopped triggering rules (by cron).

Later I tried 2.2 with similar issues. Now, I‘m back on 2.1 and all is working fine. (No code changes besides some depreciated DateTime functions.)

There seems to be problems with the openhab core!?

My system:

  • Ubuntu 16 LTS, docker: openhab:2.3.0-amd64-debian
  • bindings: homematic (2x: original ccu2 + raspi-ccu), hue, modbus, http, jdbc, mapdb, weather, astro

(Harry) #2

Post the code please.


(R.R.) #3

No chance to send all my code. Of course I cannot expect help that way.

But:

  • The code was working sometimes!
  • I’m not the only one, who suffers from not triggered rules.
rule triggerHmrHeartbeat
when
    Time cron "0 * * * * ?"
    or System started
then
    val String logClass = "rrinfo"
    val String logKey = "triggerHmrHeartbeat - "
    // do some stuff
    logInfo(logClass, logKey + "out: triggered")
end

(Harry) #4

Try

  System started or
  Time cron "0 * * ? * * *"

but, I tested you code and it runs under 2.3 without problems


(Markus Storm) #5

Enable debugging via Karaf console for org.quartz.core and org.eclipse.smarthome.model to find out if the rule really is not triggered. Maybe it is but does not execute properly.
Try duplicating the rule into two with a single trigger each to see if that makes a difference.
Does the “non-trigger” problem happen at specific times only such as only on startup or at midnight ?
Try not to trigger at 0 seconds but 1 or some other value.


(R.R.) #6

Obviously a ReentrantLock causes some kind of deadlock in openhab 2.3, the rules are called but the locked area gets not executed.

At a fresh install/start (empty userdata) all is working fine. After 2 or 3 “docker restart openhub” somewhen this situation happens (docker start/stop seems to work better - more gracefully…?).

After the situation happend one, it does not goes away. Even a restart does not help. (I don’t understand why the ReentrantLock should be persitent?!). I have to delete the docker container and the userdata dir.

(I checked for the unlock. All rules have the same structure and there are no returns which could bypass the finally.)

#openhab.log
... [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing startup rule 'updateLedR07'
... [ERROR] [ntime.internal.engine.RuleEngineImpl] - Error during the execution of startup rule 'updateLedR07': Could not cast NULL to org.eclipse.smarthome.core.library.types.DecimalType; line 621, column 40, length 35
#comment: I don't understand, why this error not get logged by my own catch!?
...
... [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing startup rule 'triggerHmrHeartbeat'
... [INFO ] [clipse.smarthome.model.script.rrinfo] - triggerHmrHeartbeat - in
... [INFO ] [clipse.smarthome.model.script.rrinfo] - triggerHmrHeartbeat - locked
... [INFO ] [clipse.smarthome.model.script.rrinfo] - triggerHmrHeartbeat - out: triggered
#comment: ok (function in and out)
...
... [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'updateLedR05'
... [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'updateLedR04'
...
... [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.led_status.rules#triggerHmrHeartbeat#36 * * * * ?
... [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'triggerHmrHeartbeat'
... [INFO ] [clipse.smarthome.model.script.rrinfo] - triggerHmrHeartbeat - in
#comment: obviously the ReentrantLock stays locked and so it gets not further! 

My code looks like:

var java.util.concurrent.locks.ReentrantLock lockLedStatus  = new java.util.concurrent.locks.ReentrantLock()

// 0 GREEN; 1 ORANGE; 2 nzw. sonst RED
val Functions$Function2<StringItem, Integer, String> setLedStatus =
    [ StringItem itemLed, Integer stateWindows |
        try {
            val String logClass = "rrinfo"
            val String logKey = "setLedStatus - "
            if (itemLed != NULL) {
                //logKey = logKey + itemLed.name

                //! LED-Statusanzeige (16-fach): HM-OU-LED16: 0 aus; 1 rot; 2 grün; 3 orange
                var String stateOut = "RED";
                if ( stateWindows == 0) {
                    stateOut = "GREEN";
                } else if ( stateWindows == 1) {
                    stateOut = "ORANGE";
                }

                // Test: Verzögerung,wegen Anzeige-Problemen (überlagerte Kommunikation im Homematic-System)
                //Thread::sleep(30)

                //var String stateCurrent = itemLed.state.toString
                //if (stateCurrent == stateOut) {
                    //logDebug(logClass, logKey + " - no change == " + stateOut)
                //} else {
                    itemLed.sendCommand(stateOut)
                //    stateOut = stateCurrent + " => " + stateOut
                //}

                return stateOut
            }
            else {
                logDebug(logClass, logKey + " itemLed is NULL!")

                return "ERROR"
            }
        }
        catch (Throwable t) {
            logError(logClass, logKey + t.toString)
            return "ERROR"
        }            
    ]


rule updateLedR07
when
    System started
    or Item hmLightTerrace changed
    or Time cron "23 */5 * * * ?"
then
	val String logClass = "rrinfo"
    val String logKey = "updateLedR07 - "
    try {
        lockLedStatus.lock()

        var Integer stateWindows = 0
        try {
            var Integer valueDimmer = (hmLightTerrace.state as DecimalType).intValue()
            if (valueDimmer > 0) stateWindows = 2

        } catch (Throwable t) {
            logError(logClass, logKey + t.toString)
            stateWindows = 0
        }

        var String result = setLedStatus.apply(hmLedHall15, stateWindows)
        logDebug(logClass, logKey + "out: " + result)
    }
    catch (Throwable t) {
        logError(logClass, logKey + t.toString)
    }
    finally {
        lockLedStatus.unlock()
    }
end


rule triggerHmrHeartbeat
when
    Time cron "36 * * * * ?"
    or System started
then
    val String logClass = "rrinfo"
    val String logKey = "triggerHmrHeartbeat - "
    try {
        logInfo(logClass, logKey + "in")
        lockLedStatus.lock()
        logInfo(logClass, logKey + "locked")

        hmrHeartBeat.sendCommand("ON")
        setLedStatus.apply(hmLedHall16, 0)

        logInfo(logClass, logKey + "out: triggered")
    }
    catch (Throwable t) {
        logError(logClass, logKey + t.toString)
    }
    finally {
        lockLedStatus.unlock()
    }
end

(Markus Storm) #7

Well now sounds like it is a programming problem of yours rather than an OH one.
Haven’t tried to debug your code (too lazy, sorry) but first thing to hit the eye are all those Java-isms.
Remind you the rules DSL isn’t plain Java. Can’t tell for sure (haven’t ever tried catching exceptions), it possibly does not work at all to catch these.
You’re also using intValue() instead of omitting () and use some more Java-isms such as Integers… change them to be of Number type, that might fix your error as well.


(R.R.) #8

So the question is, if I can rely on catch or finally!

Your answer sounds like: No !?

That’s quite interesting…

By the way, the lock with try, catch,finally is a openhab design pattern (https://community.openhab.org/t/design-pattern-rule-latching/32748


(Harry) #9

Your code works without problems and errors in the log

but I think, it’s not your whole code, because line 621.
Check this line.
I would use

or Time cron "23 0/5 * ? * * *"

(R.R.) #10

Thanks Harry, but the cron rule is working. The rules get called, but I could not see that before, because I had my log-calls at the wrong places (after .lock()).

(I would also recommend: or Time cron “23 0/5 * * * ? *” # The last star is for the optional year.)

The problem is, if I can rely on this pattern for synchronization:

import java.util.concurrent.locks.ReentrantLock;
var ReentrantLock lockLedStatus  = new ReentrantLock()

logDebug(logClass, logKey + "in")
lockLedStatus.lock()
try {
	// do some stuff
	    
    logDebug(logClass, logKey + "out")
}
catch (Throwable t) {
    logError(logClass, logKey + t.toString)
}
finally {
	// has to be called always
    lockLedStatus.unlock()
}

Even in the posted logs, there is an error which got not catched by my own catch. This is weird and dangerous.


(Rich Koshak) #11

I can’t say for certain that the finally can be reliably relied upon or not. I have some anecdotal evidence dating back to the 2.0 time frame that you can’t, but I’ve no recent experiences to say whether this is still the case and I didn’t have the time or expertise back when I did see this to fully explore the problem.

Anyway, I agree with Markus. You can program Fortran in any language and this looks like programming Java in Rules DSL. I can’t say if they are causing any problems per se, but that does stand out. And if you are running on an RPi there is some evidence that using primitives (.intValue) greatly increases the amount of time it takes to parse the .rules files for some reason (NOTE: I have more evidence for this then I do for finally not always running).

Anyway, as I documented here (Why have my Rules stopped running? Why Thread::sleep is a bad idea), any time I see Thread::sleep, locks, executeCommandLine, or sendHttp*Request calls I immediately think we have a running out of threads problem.

Another problem could be that you have too many cron triggered rules trying to trigger at the same time. There are only two threads available for cron triggered Rules so that could be a problem.

But why are you using a lock in a cron triggered Rule in the first place? What problem are you trying to solve with this lock? Are these the only two Rules that use this lock?

It isn’t. There really is no way the lock could be persistent across reloads of the same .rules file let alone a reboot of OH. Well, actually there is one way I know of. I’ve seen it reported that sometimes Timers that are created and then the .rules file is reloaded the Timers stick around and still trigger. So if you had a reference to your lock inside the body of such a Timer then it could stick around on the reload of the .rules file, but not a restart of OH.

I suspect what is happening is there is some edge case or unexpected data taking place during the Rules that trigger on System started in those cases that is preventing the lock from becoming unlocked.

Personally, were this my code, I’d focus on writing this in a way that either doesn’t require the locks in the first place or in a way that centralizes the lock in one Rule (see Design Pattern: Gate Keeper for example). The code will be less brittle and less complex overall as a result.

If you do want to debug this problem you can first look to see if you are in fact running out of cron triggered Rules threads (you only get 2 by default). After doing what Markus recommends you should be able to see if Rules are trying to trigger or not. Then you can see how many of your threads are in use using the commands Scott posted here.

Why did this code work before and doesn’t work now? I cannot say. Maybe quartz had more threads on those earlier versions and 2 is just not enough the way this code executes. Maybe the finally isn’t running (in which case I’d be very interested to know that is the case) so the lock isn’t being unlocked. Maybe it is just a change in timing and this has always been a problem but the timing is different when later versions of OH run.


(R.R.) #12

hi guys,

thank you very much. my system is working now with 2.3 and it survives restarts.

I changed:

  • reduced the amount of ReentrantLock signifiantly
  • reduced the remaining locks (ReentrantLock) to the minimal scope
  • removed Thread:sleeps

Nevertheless I’m still suspicious, that this special situation shouldn’t had happened. I could see an error in the logs (RuleEngineImpl - Error during the execution), which should have been catched (and logged) be a catch in my lambda function or by a catch in my rule. Now I cannot reproduce this error any more.

But I see that the error handling is not working properly in lamda functions. The error is catched in the rule, not in the lambda function:

val Functions.Function1<NumberItem, Number> determineFactorX =
    [ NumberItem itemScaleFactor |
        val String logClass = "rrinfo"
        var String logKey = "determineFactorX"
        var Number powerScaleFactor = 0.0  // multiply with

        try {
        	powerScaleFactor = (NULL as Number).doubleValue()   // ERROR  !!!!!!!

        	// ...
        } catch (Throwable t) {
            logError(logClass, logKey + t.toString)
        }
        return powerScaleFactor
    ]


rule testCatches
when
    Time cron "11/4 * * * * ?"
then
    val String logClass = "rrdebug"
    val String logKey = "testCatches - "
    try {
        logInfo(logClass, logKey + "in")
        val Number scaleFactor = determineFactorX.apply(mbRawPvMpptPowerSf)
        // ...
        logInfo(logClass, logKey + "out")
    }
    catch (Throwable t) {
        logError(logClass, logKey + t.toString) 	// ERROR get catch here
    }
end

Log output:

// mostly
... testCatches - org.eclipse.smarthome.model.script.engine.ScriptExecutionException: Could not cast NULL to java.lang.Number ...
// sometimes (but I get this error also at calling other lambdas randomly)
... testCatches - java.lang.IllegalStateException

Filed an issue:
https://github.com/eclipse/smarthome/issues/6218


(R.R.) #13

There is also a problem with lambda functions, which seems not to be thread safe.
https://community.openhab.org/t/lambda-functions-fail-not-thread-safe/51696


(Markus Storm) #14

You’re obviously a Java programmer, but to clearly state this again: the rules DSL is NOT Java.
That implies a lot, namely that you mustn’t apply Java concepts. You expect everything to be threadsafe and capable of handling exceptions, and while it would of course be nice if that was the case, there should be nowhere that is promising you it is.
Sorry to be harsh, but get that Java picture off your head.

PS: yes still file that issue, please. Developers welcome any hints. Ask them if try/catch was supposed to work at all (I don’t think it ever has been, let alone being threadsafe).