Why have my Rules stopped running? Why Thread::sleep is a bad idea

Tags: #<Tag:0x00007fc8f727ec10>

why not set up a bash script that checks the number of threads running every 5 minutes (say) and pushes it to an openhab item, which you can then log in openhab? Then if each of your rules logs when it starts and when it finishes, you should be able to spot quite quickly which one(s) are causing the number of threads to explode.

You should refactor your rules. Start with your most complex ones.

I have quite soemthing going on and just have 120 rules with about 2000 lines of code (and additional 1000 lines of code of script). But it is mostly simple 3-liners now.

I am wondering the same thing, my results are 51, 8, 10, and 6. My thread pool for rules is set to 50, so does this mean all my threads are in use? When I list them, none show as BLOCKED, they all say WAITING or TIMED_WAITING. What does this mean?

I have refactored my rules many times, usually to take advantage of a new feature (like the new Trigger function).

I have boiled my rules down to several lambdas that are called by rules that are 3 or 4 lines (mostly logging to try to find this problem).

The problem is lack of information, not lack of coding skills.

This was not what i ment to say.

No “coding skills” apply to anything in openHAB rule engine. It is always the “other way round”. (A little bit exaggerated, but i guess most will agree, at least partial.)

See New Add-on bundle for Prometheus health Metrics. They have included the thread pools as a metric they can report on.

But all of this can only tell you when you start to consume all your Rules threads, not which Rule is the source of the problem. But perhaps a time when the problem starts will b enough.

Yes

Are you using ReentrantLocks? Locks can be troublesome and locks in lambdas can be particularly troublesome because you cannot guarantee that the finally will get executed and that your lock will be unlocked in every possible circumstance.

Lambdas have lots of other problems too (e.g. not thread safe and internal variables appear to be reused on each call so if you have two rules calling the same lambda at the same time they can stomp on each other) so I recommend against their use except in rare circumstances where there is no other viable way. Writing more generic Rules (i.e. all the Rules that need the lambda get merged into one so there is no need for the lambda anymore) or using an alternative approach such as the Separation of Behaviors DP results in fewer problems like these.

Thanks,

I saw the Prometheus thing, but that looks like a whole other complicated ecosystem to install/debug. I was looking for something simple.

I am using re entrant locks, but I use them in a try/catch system, and have reduced them to a minimum. I’m not using locks in lambdas, and most of them are doing trivial things (shouldn’t matter if they do stomp on each other).

If

shell:threads --list |grep "RuleEngine" |wc -l

Returns 51, and my thread pool is set for 50, what does this mean? should I try increasing the thread pool until I show less in use than allocated? They are all WAITING or TIMED_WAITING, not BLOCKED, so is this really an issue? what should I see happening to this number?

Thanks.

you have a thread pool of 50 threads and one thread is your grep. The threads are created and in the pool, but most likely not used yet.

So I just increased my thread pool to 250, now the number climbs to between 217 and 237, fluctuating up and down.

Is this better? does this mean I have some free threads now that I didn’t have before? doesn’t seem to make any difference to rules execution.

Maybe I should monitor the number of threads that are BLOCKED and see if that is an indicator or problems.

currently

shell:threads --list |grep "RuleEngine" | grep "BLOCKED" |wc -l

shows 0

shell:threads --list |grep "RuleEngine" | grep "WAITING" |wc -l

shows 216.
Maybe when/if the problem reoccurs I should run this again, and see how many report BLOCKED. Of course that could take days/weeks…

If there were something simple someone would have offered it as an option by now. Something simple simply doesn’t exist.

Like I said, unfortunately we cannot guarantee that locks always get unlocked as some errors in a Rule will cause the rule to exit without ensuring that the lock gets unlocked.

I’d focus on the locks as the first thing to review and see if you can figure out a way to do away with.

I’m not sure but I suspect at a minimum you have at one point had 50 Rules running at the same time. When I run that command on my main OH I get 3 (which seems about right since I have no long running Rules and events don’t usually happen all at once). When I run it on my satellite OH I get 1 (which is what I would suspect since my satellite one doesn’t run any Rules).

I think this URL explains the meanings of the states: https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.State.html

Mine are all in a TIMED_WAITING state. The link above indicates this could be a problem. It is hard to tell because it does mean that the thread is active and waiting for something to happen in another thread. But that could mean it is stuck or it could mean that it is just in the pool awaiting a job to do. I don’t know how the pool works, but would have expected it to free up any threads that are not being used rather than holding on to them in a waiting state.

This points to a problem. How fast are events occurring in your system and how fast are the Rules. Based on this behavior it is taking ~240 threads running at the same time for OH to keep up.

Ok,

Well using this I have implemented a rule that track Blocked and waiting threads. I’ll see if it gives any useful information.

If so, I’ll share…

Bear in mind what Rich said about multiple lambda calls trampling each other’s variables. I experienced this with OH1, but essentially same rule engine I believe. This of itself should not cause thread usage of course, But unexpected values could break things though.

Countering this with re-entrant locks brings its own problems. Even with try/catch/finally you can definitely error out without finally releasing a lock. I got my system working reliably by building in lots of sense checking, especially for null values.

Just to share a lesson learned here, when upgrading from 2.3.0 to the 2.4.0M releases the settings in /usr/share/openhab2/runtime/etc/quartz.properties are restored back to default. I was getting confused as to why the upgrade reintroduced all of my thread problems until I realized that this was done.

On another note with 2.4.0, I now get this when I start OH:
2018-10-06 10:27:37.530 [WARN ] [e.jetty.util.thread.ThreadPoolBudget] - Low configured threads: (max=8 - required=1)=7 < warnAt=8 for QueuedThreadPool[ServletModel-29]@21e19873{STARTING,8<=0<=8,i=0,q=0}[ReservedThreadExecutor@347e0f6b{s=0/1,p=0}]

Anyone know why this is happening? I’ve upped my threadpool counts much higher than this.

That is true. Technically speaking we are not expected to be making changes to any files in /usr/share/openhab2 so the installer will simply replace it instead of asking if you want to keep it or use the new version.

I know there is a bug in the release right now related to the RuleEngine threads. This will change at some point to ruleEngine to match the pattern set by the names of the other thread pools and it seems to be case sensitive. But That error does not match.

Magical. This one line is 100% what makes my system work properly or completely collapse.

unverified: have you tried to set the same parameters found in /usr/share/openhab2/runtime/etc/quartz.properties inside /etc/openhab2/services/runtime.cfg ?
I don’t know if this would work and/or if they will take precedence over quartz.properties

Edit: found something interesting: https://issues.apache.org/jira/browse/KARAF-5558

(untested) so according to this, you can use runtime.cfg and set org.quartz.threadPool.threadCount=x etc to modify the quartz. Not exactly sure on the syntax… you need to play with this :slight_smile:

1 Like

Tried to set it in runtime.cfg instead of quartz.properties when I upgraded to 2.4.0 release, no luck.

there is a specific format for setting parameters in runtime.cfg

how did you list the config parameters there? (I tried guessing but I ran out of blue pills :wink:)

It may not work… I mentioned in the previous post that this was untested

Hello,

I have the following problem. Due to speed issues I have to send several command to a KNX actor with a pause of 1000ms between every command. Until today, I am using Thread::Sleep(1000).
Today I tried to change my rule to the following:

rule "Betriebsarten Heizung"
when
Item Heizung_Betriebsart changed
then
    
if(Heizung_Betriebsart.state==1) {
        //Heizung in Komfort (=1) schalten und Standby (=2)
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalA_Betriebsart, 1)  //Wohnzimmer in Komfort
        logInfo("rules","1")     
    ])
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalB_Betriebsart, 1)  //Küche in Komfort
        logInfo("rules","2")
    ])
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalC_Betriebsart, 1)  //Flur EG in Komfort
        logInfo("rules","3")
    ])  
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalH_Betriebsart, 1)  //Gäste WC in Komfort
        logInfo("rules","4")
    ])    
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalD_Betriebsart, 1)  //Kinderzimmer in Komfort
        logInfo("rules","5")
    ])    
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalE_Betriebsart, 2)  //Schlafzimmer in Standby
        logInfo("rules","6")
    ])   
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalF_Betriebsart, 2)  //Badezimmer in Standby
        logInfo("rules","7")
    ])   
    createTimer(now.plusSeconds(1), [ |
        sendCommand(Heizung_KanalG_Betriebsart, 2)  //Speicher in Standby
        logInfo("rules","8")
    ])            
        sendTelegram("johannes","Heizung auf Anwesenheitsmodus geschaltet")
        logInfo("rules","Heizung auf Anwesenheitsmodus geschaltet")
}
//.... and so on for the different modes of operation

I have a logInfo in every Timer to check it in the logs, but it seems there is no time delay between the commands. What is my failure?

Thank you for your help.