(OH 1.x and OH 2.x Rules DSL only] Why have my Rules stopped running? Why Thread::sleep is a bad idea

sjcliffe · August 20, 2018, 5:49am

MySQL persistence was indeed the culprit. The only rule I have that uses it started and never finished. I tried to restart the binding but the bundle:stop hung. I’ll rehash things to use rrd4j instead.

Thanks for all the help!

wars · September 7, 2018, 4:10pm

I replaced my while loop with the timer, but get an error when executing this line

if(NHC_garage.state == ON) timerPort.reschedule(now.plusSeconds(10))

Error:

Caused by: org.eclipse.smarthome.model.script.engine.ScriptExecutionException: 'reschedule' is not a member of 'org.eclipse.smarthome.model.script.internal.actions.TimerImpl'; line 75, column 40, length 41

	at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:125) ~[?:?]

rlkoshak · September 7, 2018, 4:29pm

Do you see that every time the Rule runs or only when OH starts up?

Show the line that defines timerPort.

wars · September 7, 2018, 4:39pm

Every time the rule runs, but offcourse only when NHC_garage.state == ON

var timerPort = null
timerPort = createTimer(now, [ |
    //some code
    if (NHC_garage.state == ON) timerPort.reschedule(now.plusSeconds(10))
    else timerPort = null
])

rlkoshak · September 7, 2018, 4:55pm

OK, I see the problem. Well, actually there might be two problems here.

The main problem is because you are initializing timerPort to null, there is nothing to tell the Rules DSL why type timerPort should be, so it falls back to the lowest common denominator type which is Object and Object doesn’t have a reschedule method.

So you need to give timerPort a type when you declare it:

var Timer timerPort = null

The second potential problem is timerPort must be declared outside of the Rule as a global. Otherwise it goes away once the Rule exits and when the timer body runs timerPort may no longer exist.

I’m not 100% certain about this though. It may work as written, assuming you give timerPort a type.

wars · September 7, 2018, 5:33pm

stupid me… timerPort was indeed not a timer type.

I left the declaration inside my rule, I don’t need this outside this rule.

thanks for helping, now my rule is running every x minutes like I want and not every random minutes when I used Thread::sleep

rlkoshak · September 7, 2018, 5:40pm

Oh but you do. The body of the Timer runs outside of the Rule.

Now it still may work assuming that the Timer is given it’s own reference to that variable in which case it will work fine.

However my understanding is that the Timer gets a copy of the variables from the Rule, not a direct reference. If that is the case, the timerPort isn’t actually assigned the new Timer until after the Timer is created, meaning that timerPort is null when the Timer is created. If the timer body only gets a copy of the variables then timerPort will be null insdie the Timer.

vzorglub · September 8, 2018, 9:23pm

Would it not be “Are two or more long running Rules…”

“At my request…”

You know, just in case you want to publish this in the main docs…

rlkoshak · September 9, 2018, 12:34am

The they I’m referring to is the Prometheus binding folks so any docs would be written by them. It isn’t something that is generally available so doesn’t really belong in the main docs.

Yes, that should have been “Are two…”

Nicholas_Waterton · October 2, 2018, 2:20pm

I have also been fighting this issue for some time (months). My rules can work fine for weeks, then they can stop running after 1/2 hour, or a few days. It’s maddening trying to find the issue.

I’ve increased my thread pool to 50, but it still happens.

Is there no simple way for OH to log when it runs out of threads for the rules engine? or some way to log the number of threads in use?

My cron rules and timers still work, so my work around is to detect when rules stop triggering (via a timer) and restart OH2. It’s a horrible band-aid though. It also means that I can’t troubleshoot what’s going on (as OH has already restarted when the problem occurs).

I travel for work, so I need OH to be reliable when I am away.

I feel like I have a problem rule(s) - either takes too long/never returns, persistence problem or something, and I can’t find it. I have over 7000 lines of rules built up over 3 years, and I can’t find the problem with the current debugging tools OH has.

This thread is very useful, but I’ve tried everything, and still randomly have this problem.

We need a way to tell how many rules are running, other than through the console, that way, I could log what is happening, and maybe catch the problem, before the rules grind to a halt.

dan12345 · October 2, 2018, 2:36pm

why not set up a bash script that checks the number of threads running every 5 minutes (say) and pushes it to an openhab item, which you can then log in openhab? Then if each of your rules logs when it starts and when it finishes, you should be able to spot quite quickly which one(s) are causing the number of threads to explode.

job · October 2, 2018, 2:36pm

You should refactor your rules. Start with your most complex ones.

I have quite soemthing going on and just have 120 rules with about 2000 lines of code (and additional 1000 lines of code of script). But it is mostly simple 3-liners now.

Nicholas_Waterton · October 2, 2018, 2:41pm

I am wondering the same thing, my results are 51, 8, 10, and 6. My thread pool for rules is set to 50, so does this mean all my threads are in use? When I list them, none show as BLOCKED, they all say WAITING or TIMED_WAITING. What does this mean?

Nicholas_Waterton · October 2, 2018, 2:45pm

I have refactored my rules many times, usually to take advantage of a new feature (like the new Trigger function).

I have boiled my rules down to several lambdas that are called by rules that are 3 or 4 lines (mostly logging to try to find this problem).

The problem is lack of information, not lack of coding skills.

job · October 2, 2018, 2:52pm

This was not what i ment to say.

No “coding skills” apply to anything in openHAB rule engine. It is always the “other way round”. (A little bit exaggerated, but i guess most will agree, at least partial.)

rlkoshak · October 2, 2018, 3:24pm

See New Add-on bundle for Prometheus health Metrics. They have included the thread pools as a metric they can report on.

But all of this can only tell you when you start to consume all your Rules threads, not which Rule is the source of the problem. But perhaps a time when the problem starts will b enough.

Yes

Are you using ReentrantLocks? Locks can be troublesome and locks in lambdas can be particularly troublesome because you cannot guarantee that the finally will get executed and that your lock will be unlocked in every possible circumstance.

Lambdas have lots of other problems too (e.g. not thread safe and internal variables appear to be reused on each call so if you have two rules calling the same lambda at the same time they can stomp on each other) so I recommend against their use except in rare circumstances where there is no other viable way. Writing more generic Rules (i.e. all the Rules that need the lambda get merged into one so there is no need for the lambda anymore) or using an alternative approach such as the Separation of Behaviors DP results in fewer problems like these.

Nicholas_Waterton · October 2, 2018, 3:49pm

Thanks,

I saw the Prometheus thing, but that looks like a whole other complicated ecosystem to install/debug. I was looking for something simple.

I am using re entrant locks, but I use them in a try/catch system, and have reduced them to a minimum. I’m not using locks in lambdas, and most of them are doing trivial things (shouldn’t matter if they do stomp on each other).

If

shell:threads --list |grep "RuleEngine" |wc -l

Returns 51, and my thread pool is set for 50, what does this mean? should I try increasing the thread pool until I show less in use than allocated? They are all WAITING or TIMED_WAITING, not BLOCKED, so is this really an issue? what should I see happening to this number?

Thanks.

job · October 2, 2018, 3:56pm

you have a thread pool of 50 threads and one thread is your grep. The threads are created and in the pool, but most likely not used yet.

Nicholas_Waterton · October 2, 2018, 4:02pm

So I just increased my thread pool to 250, now the number climbs to between 217 and 237, fluctuating up and down.

Is this better? does this mean I have some free threads now that I didn’t have before? doesn’t seem to make any difference to rules execution.

Maybe I should monitor the number of threads that are BLOCKED and see if that is an indicator or problems.

currently

shell:threads --list |grep "RuleEngine" | grep "BLOCKED" |wc -l

shows 0

shell:threads --list |grep "RuleEngine" | grep "WAITING" |wc -l

shows 216.
Maybe when/if the problem reoccurs I should run this again, and see how many report BLOCKED. Of course that could take days/weeks…

rlkoshak · October 2, 2018, 4:11pm

If there were something simple someone would have offered it as an option by now. Something simple simply doesn’t exist.

Like I said, unfortunately we cannot guarantee that locks always get unlocked as some errors in a Rule will cause the rule to exit without ensuring that the lock gets unlocked.

I’d focus on the locks as the first thing to review and see if you can figure out a way to do away with.

I’m not sure but I suspect at a minimum you have at one point had 50 Rules running at the same time. When I run that command on my main OH I get 3 (which seems about right since I have no long running Rules and events don’t usually happen all at once). When I run it on my satellite OH I get 1 (which is what I would suspect since my satellite one doesn’t run any Rules).

I think this URL explains the meanings of the states: Thread.State (Java Platform SE 7 )

Mine are all in a TIMED_WAITING state. The link above indicates this could be a problem. It is hard to tell because it does mean that the thread is active and waiting for something to happen in another thread. But that could mean it is stuck or it could mean that it is just in the pool awaiting a job to do. I don’t know how the pool works, but would have expected it to free up any threads that are not being used rather than holding on to them in a waiting state.

This points to a problem. How fast are events occurring in your system and how fast are the Rules. Based on this behavior it is taking ~240 threads running at the same time for OH to keep up.