(OH 1.x and OH 2.x Rules DSL only) Why have my Rules stopped running? Why Thread::sleep is a bad idea

Pedals2Paddles · July 18, 2018, 1:03pm

Organization and troubleshooting. A problem in one rule file will not necessarily affect all the others. For example, all the rules and items for my Z-Wave garage door openers are in their own item and rule files. They are picky and complicated devices and the rule file is quite long, so it’s best to keep them siloed. The Weather Underground item file is huge and that’s in its own file for the same reason.

rlkoshak · July 18, 2018, 4:24pm

At runtime there is no difference between the two. Once OH loads them they are all in memory and behave the same.

So the only place where it matters to OH is when it loads the file. Here the main practical difference is that System started Rules fire when a .rules file loads. If you have everything in one .rules file, all of your System started Rules will fire whenever that file loads, whereas if you have them split into multiple files, only those in the file being loaded will fire.

Beyond that I don’t think there are any practical differences as far as OH loading of files is concerned.

So the biggest practical difference between having multiple .rules files versus one big file is that global val/vars are only global to that one .rules file. So if you have a variable that needs to be used by multiple Rules, they all need to be in the same file.

But for you the human, working with files that are thousands of lines long (not an unheard of size for lots of OH systems) is awkward. So it makes sense to split up the files. There are several strategies people use. I prefer and recommend splitting both your .items and your .rules files up by function (e.g. lighting, weather, hvac, etc). This provides a logical organization for the files and decreases the likelihood that you will need to deal with the scope of global variables issue mentioned previously. But ultimately, you are the human who has to deal with all this stuff. Do what makes the most sense for you.

morph166955 · July 20, 2018, 4:51pm

That all makes sense. I have north of 20 rules files for the exact reason of troubleshooting. I was looking for the runtime answer which you have given.

The one thing I would add here for those reading it in the future is to be very careful of the sendHttpGetRequest function. When the site replies quickly, it works great. When the site lags or just fails to reply at all, it can cause a world of issues. I haven’t quite pinpointed why, but I’ve seen weirdness where an HTTP GET in one rule will cause rules in totally separate files/items/things/etc to not fire at all until it times out (as long as 15 seconds in some cases that I’ve seen). The easiest way to get around it is to just execute curl through the executeCommandLine function and put a reasonable timeout on it. I’ve completely replaced sendHttpGetRequest across my rules and things run much smoother now. This could potentially be resolved by adding a timeout to the function like sendHttpPostRequest has, but for now this seems to work just as well. Again, this is just personal experience and I haven’t gone very far into figuring out why this causes things to lock up; it’s just something to keep an eye out for.

rlkoshak · July 20, 2018, 5:30pm

It sounds a lot like you’ve run out of Rules threads and no Rules can fire until one of the running Rules exits.

This raises a good question though. I always assumed there was a reasonable timeout on the sendHttpRequest Actions, but from what you describe there may not be one, or it is a really large timeout. Indeed there should be a timeout on all of the sendHttpRequest Actions, not just the Post.

morph166955 · July 20, 2018, 6:19pm

I’ve had to increase the number of rules threads several times via runtime.cfg. It’s fixed some things, but others still have issues. I have one website in particular which causes a great deal of pain. It takes as long as 15 seconds to reply (it has to do several queries of remote systems over 4G before it replies to me, so this is expected). I’m peeling that off into a timer as you suggested above to try to alleviate the problems, but it’s only marginally helping. There is a mechanism where you can create a String variable and do .sendHttpGetRequest(timeout) on the string function but it’s not always reliable. Curl also gives me more options for things like authentication headers so I just tend to use it for everything.

Another oddity is that there seems to be some kind of race condition depending on the order in which the rules files are read in. For example, I have two cron jobs in two different files. Job #1 runs every second. It simply increases an idle timer for a device. Job #2 runs on the 15th and 45th second of every minute to query a webpage (the one that takes up to 15 seconds mentioned above). If the rules file with job #1 is loaded first, job #1 runs without a glitch. If the rules file with job #2 is loaded first, job #1 runs from 00-15 and then 30-45 seconds of every minute; it does not run while job #2 is waiting for the page to load. I can replicate this behavior by simply going into the rules file and doing a save to cause it to reload.

rlkoshak · July 20, 2018, 6:49pm

For cron triggered Rules there are only two threads in the pool (it’s a separate pool). So you are probably using up those two threads and the Rules end up having to wait for one of them to free up before getting a thread to run in.
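To make the starvation concrete, here is a minimal sketch in plain Python (not openHAB code) of a fixed pool of two workers. The names `run_demo`, `slow_job` and `fast_job` are made up for illustration; the slow job stands in for a Rule waiting on a slow web page and the fast job for the every-second cron Rule.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_demo(pool_size=2):
    """Show a fast job stuck behind slow jobs in a fixed-size pool."""
    release = threading.Event()      # stands in for the slow page finally replying
    pool = ThreadPoolExecutor(max_workers=pool_size)

    def slow_job():
        release.wait(timeout=5)      # blocks its worker, like a slow HTTP GET
        return "slow done"

    def fast_job():
        return "fast done"

    # Occupy every worker with a slow job, then queue the fast one.
    slow = [pool.submit(slow_job) for _ in range(pool_size)]
    fast = pool.submit(fast_job)

    # No worker is free, so the fast job is still sitting in the queue.
    stuck_while_busy = not fast.done()

    release.set()                    # the slow calls return
    result = fast.result(timeout=5)  # a worker is free again, the fast job runs
    for job in slow:
        job.result(timeout=5)
    pool.shutdown()
    return stuck_while_busy, result
```

Calling `run_demo()` shows the fast job only completing after a slow one has let go of its worker, which matches the gaps described for job #1 above.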

You seem to have an extreme case and sometimes extreme measures may be necessary.

Have you considered offloading this polling of the HTTP pages to a script that runs outside of OH and pushes the results to OH? Then OH won’t have to wait at all or use up any of its threads. Since you are already using curl, you should be able to use sensorReporter with the execSensor and not even have to write any new code. Though there might be some thread timeout problems in my script as well. I’ve not changed the execSensor to spawn a new thread to run the script, so it uses up the main thread.

You could also use the system’s cron and run two curls piped to each other, one to get the data and the other to post the result to an OH Item.
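As a rough sketch of the "poll outside OH and push the result" idea, something like the following Python could be run from the system's cron. It assumes the openHAB REST API accepts a plain-text PUT to `/rest/items/<item>/state`; the function names, the Item name and the timeouts are placeholders for illustration, not anything taken from this thread.

```python
import urllib.request


def fetch_text(url, timeout_s=20):
    """GET a page with a hard timeout so a slow site cannot hang the script."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8").strip()


def build_state_update(openhab_base, item_name, state):
    """Build the REST request that sets an Item's state (assumed endpoint)."""
    return urllib.request.Request(
        url=f"{openhab_base}/rest/items/{item_name}/state",
        data=state.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="PUT",
    )


def poll_once(source_url, openhab_base, item_name):
    """Fetch the slow page, then push the result to openHAB."""
    value = fetch_text(source_url)
    request = build_state_update(openhab_base, item_name, value)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status
```

A call such as `poll_once("http://slow.example/status", "http://localhost:8080", "SlowSite_Result")` (all three values hypothetical) does the waiting in its own process, so no OH Rule thread is tied up while the page loads.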

5iver · July 20, 2018, 7:06pm

Another option is to use JSR223, which TMK does not have the threading issues found in the Rules DSL.

morph166955 · July 26, 2018, 2:19am

For future reference, I ended up resolving the issue by putting the entire offending rule into a timer and greatly increasing the threads available to the system. As a precaution, I created a switch item that I’m using as a de facto ReentrantLock to prevent the rule from running if the last run hasn’t finished yet (I’ve seen this happen already). I have not yet seen it cause the system to jam up since I did that.

rlkoshak · July 26, 2018, 4:42pm

If you start to see problems you might need to use an actual ReentrantLock. The problem with using an Item is that there can be a good deal of time between setting the Switch to ON and the Rule that checks the Switch seeing that it is ON. In that gap you can end up missing that the previous run of the Rule has just finished.

You can use a ReentrantLock without actually locking the Rule.

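// Assumes these two lines at the top of the .rules file:
//   import java.util.concurrent.locks.ReentrantLock
//   val ReentrantLock lock = new ReentrantLock

if(!lock.tryLock) return; // another Rule instance has the lock, exit
try {
    // rule body
}
catch(Exception e){
    logError("Error", "Error in rule blah blah blah: " + e.toString)
}
finally {
    lock.unlock // always release, even if the body threw
}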

In this case the ReentrantLock doesn’t cause a problem because Rules don’t wait around for the lock. They just exit if the previous instance of the Rule is still running. By using the ReentrantLock you avoid that TOC-TOU (time of check, time of use) gap that exists when using a Switch Item.
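For anyone who wants to see the "try, and exit if busy" behaviour outside of OH, here is a small Python analogue. It is only an illustration: `threading.Lock` is not reentrant (unlike Java's ReentrantLock), which is what lets a nested call on one thread stand in for a second trigger arriving while the first run is still active. The names `demo` and `rule` are invented for the example.

```python
import threading


def demo():
    """Simulate a second trigger arriving while the first run is still active."""
    lock = threading.Lock()
    log = []

    def rule(name, body=None):
        # Atomic check-and-take: no gap between testing and claiming the lock.
        if not lock.acquire(blocking=False):
            log.append(f"{name}: skipped, previous run still active")
            return False
        try:
            log.append(f"{name}: running")
            if body:
                body()
            return True
        finally:
            lock.release()           # always release, even if the body raised

    # run-2 is triggered from inside run-1's body, i.e. while run-1 holds the lock.
    rule("run-1", body=lambda: rule("run-2"))
    # Once run-1 has finished, a later trigger gets the lock again.
    rule("run-3")
    return log
```

The overlapping run exits straight away instead of queueing up behind the first one, and the next trigger after the lock is released runs normally.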

But it all depends on how important it is that you don’t miss an opportunity to run the Rule when you should be able to.

morph166955 · July 30, 2018, 3:04pm

Apparently I didn’t completely solve the problem. I’m still having the issue where I’m limited to two simultaneous cron threads (the logs just proved this out for me). I’ve increased thingHandler, discovery, safeCall, and ruleEngine. Is the cron thread pool included in one of those, or is it a separate one that I need to increase?

rlkoshak · July 30, 2018, 3:15pm

It’s a separate config file. On an apt-get install you need to change /usr/share/openhab2/runtime/etc/quartz.properties

Note that this file will almost certainly be overwritten when you upgrade your OH. The maintainers assume that we are not editing any files under runtime.

The pool size is 2 so I’m not surprised that you are limited to only two simultaneous cron Rules.
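Since the file is likely to be overwritten on upgrade, a small helper to re-apply the change might be handy. The sketch below edits the text of a properties file; it assumes the pool size is set through the standard Quartz key `org.quartz.threadPool.threadCount`, which I have not verified against the shipped file, and `set_thread_count` is a made-up name.

```python
import re

# Standard Quartz property name; assumed to be the one used in quartz.properties.
KEY = "org.quartz.threadPool.threadCount"


def set_thread_count(properties_text, count):
    """Return the properties text with the Quartz pool size set to `count`."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(KEY)}[ \t]*[=:][ \t]*)\d+[ \t]*$",
        re.MULTILINE,
    )
    new_text, replaced = pattern.subn(rf"\g<1>{count}", properties_text)
    if replaced == 0:
        # Key not present: append it rather than silently doing nothing.
        new_text = properties_text.rstrip("\n") + f"\n{KEY} = {count}\n"
    return new_text
```

You would read the file, pass its contents through `set_thread_count(text, 5)` (or whatever size you need), write it back and restart OH. Keep a backup of the original file first.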

sjcliffe · August 15, 2018, 5:16am

I’ve been having the problem of rules stopping after a few days ever since I updated to 2.3. Based on the guidance in this thread I’ve now managed to remove all sleeps and locks from my rules, but unfortunately it hasn’t helped.

Any suggestions on how I can track down the problem or what I should try next?

dan12345 · August 15, 2018, 9:30am

I updated to the latest snapshot and that fixed it for me

Dan

rlkoshak · August 15, 2018, 2:10pm

If the problem persists after upgrading to the snapshot as Dan suggests, here are a few things you should look for:

Do you have other long running calls in your Rules like executeCommandLine or sendHttp*Request?
Do you have any potential feedback loops? For example, a Rule that causes an event that eventually causes the Rule to retrigger, building up into an infinite loop that eventually consumes all threads?
Do you see anything unusual with the CPU or RAM usage?
Is it all Rules or do your cron triggered Rules still fire?

sjcliffe · August 15, 2018, 10:43pm

Thanks Rich. I’ll upgrade to the latest snapshot and see how that goes.

I’ve also removed most executeCommandLine calls (replaced with exec binding items) - the ones I have left are rarely used. When my rules stop, my cron triggered rules do continue to work.

rlkoshak · August 16, 2018, 1:58am

That’s an important clue. It means your Rules are still able to run in general so most likely something is happening to cause you to run out of Rules threads.

Scott has some commands in his postings above that you can use to tell how many threads you have active. Next time your Rules stop, run them to see how many Rules threads are in use. If it’s 6 then we will know for sure there is something causing your Rules to run amok and consume all the available threads.

sjcliffe · August 19, 2018, 12:29am

Well, I lasted just over 24 hours on the latest snapshot before it ground to a halt. I’ve compared the threads when it was in this semi-hung state and running normally and noticed a couple of things.

When running normally I have 5 ESH-RuleEngine threads in a TIMED_WAITING state. When it was hung I had only one of these threads and it was in a BLOCKED state. I also had a bunch of other threads in a BLOCKED state:

306   │ ESH-RuleEngine-2                                                                                   │ BLOCKED       │ 182771   │ 178950
12164 │ items-1722                                                                                         │ BLOCKED       │ 1797     │ 1570
12312 │ items-1742                                                                                         │ BLOCKED       │ 1116     │ 1000
12437 │ items-1757                                                                                         │ BLOCKED       │ 1009     │ 910
12593 │ items-1776                                                                                         │ BLOCKED       │ 269      │ 240
12632 │ ESH-persist-2693                                                                                   │ BLOCKED       │ 2846     │ 1140
12638 │ ESH-persist-2695                                                                                   │ BLOCKED       │ 2756     │ 1030
12655 │ items-1788                                                                                         │ BLOCKED       │ 14       │ 10
12666 │ ESH-persist-2701                                                                                   │ BLOCKED       │ 0        │ 0
12671 │ ESH-persist-2702                                                                                   │ BLOCKED       │ 0        │ 0
12672 │ ESH-persist-2703                                                                                   │ BLOCKED       │ 0        │ 0
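To compare a hung listing with a healthy one, a short Python sketch like this can tally threads per pool and state from pasted output. It assumes only what is visible above: rows separated by `│`, with the thread id, name and state in the first three columns. `count_threads` is a made-up helper name.

```python
from collections import Counter


def count_threads(dump_text):
    """Count threads per (pool, state) from a pasted thread listing."""
    counts = Counter()
    for line in dump_text.splitlines():
        cells = [c.strip() for c in line.split("│")]
        if len(cells) < 3 or not cells[0].isdigit():
            continue                   # skip headers, rules and blank lines
        name, state = cells[1], cells[2]
        pool = name.rsplit("-", 1)[0]  # "ESH-persist-2693" -> "ESH-persist"
        counts[(pool, state)] += 1
    return counts
```

Run on the listing above, it groups the rows into the ESH-RuleEngine, items and ESH-persist pools, which makes it easier to spot which pool is piling up BLOCKED threads.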

5iver · August 19, 2018, 2:10am

Try a log at the beginning and end of every rule so that you can identify the culprit. Hopefully, when the rules are hung, you’ll find a log entry for the beginning of a rule without a corresponding end entry.
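Once the begin/end logging is in place, finding the unmatched entry can be automated. This Python sketch assumes a log format of your own choosing in which each Rule logs `START <rule name>` as its first statement and `END <rule name>` as its last; those markers and the `unfinished_rules` name are purely hypothetical.

```python
def unfinished_rules(log_lines, start_tag="START ", end_tag="END "):
    """Return names of rules with more START entries than END entries."""
    open_runs = {}
    for line in log_lines:
        if start_tag in line:
            name = line.split(start_tag, 1)[1].strip()
            open_runs[name] = open_runs.get(name, 0) + 1
        elif end_tag in line:
            name = line.split(end_tag, 1)[1].strip()
            open_runs[name] = open_runs.get(name, 0) - 1
    # Anything still above zero started but never logged its end.
    return sorted(name for name, n in open_runs.items() if n > 0)
```

Feed it the lines of the log from around the time of the hang; any name it returns is a Rule that began and never reached its last line.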

The blocked persist threads are interesting. Maybe your rogue rule is using persistence for something? Random shootin’ from the hip persistence questions come to mind… What persistence services do you have installed, and what strategies are used… maybe too many or too frequent? What hardware is being used to store the persistence data… maybe it’s failing or overloaded?

sjcliffe · August 19, 2018, 3:02am

Thanks for the suggestion Scott. I added MySQL persistence for one item recently and I think it may have been around the time that I started experiencing this problem, so this gives me a clue!

Drew_Brown · August 20, 2018, 1:59am

This super helpful post is awesome. To help jr. coders like myself the “My sleeping rule” example could use the missing ) added in on the 2nd to last line.

Again, a super helpful post that is contributing a great deal to the openHAB community!