(OH 1.x and OH 2.x Rules DSL only) Why have my Rules stopped running? Why Thread::sleep is a bad idea

thanks - what exactly do these show? My results are 1, 6, 11 and 6…

Dan

Look at just the shell:threads --list. It will show you all threads. Adding the grep “thingHandler” will show just the lines containing thingHandler. The wc -l provides a line count (number of threads). The counts change quickly… or should! In theory, the ruleTimer should be the important one for rules.

I have 337 items in shell:threads. Surely that’s not right…

That sounds about right. I have 382. There’s a lot going on back there!

Hi Rich.
I want to get rid of the Thread::sleeps in my rules.
If I have more than one sleep, do I use a timer inside a timer?

Situation:

Rollershutter up takes 20 seconds -> then
TTS notification for 13 seconds -> then
TTS notification for 10 seconds.

rule "My sleeping rule"
when
    // and event
then
    // rollershutter up
    createTimer(now.plusSeconds(20),  [ |
        // TTS notification1
      createTimer(now.plusSeconds(13),  [ |
          // TTS notification2
      ])
    ])
end

The same problem comes up when switching 10 lights on. To prevent an overflow at the lights hub, I sleep for 25 milliseconds after switching each light.

And why is this okay too, without the “|”?

var Timer timer = null

rule "my rule"
when
    Member of MyGroup changes
then
    if(timer === null) {
        timer = createTimer(now.plusSeconds(1), [   // no need of "|" ?
            //Find the lowest and do what ever you do
            timer = null
        ])
    }
end

Greetings,
Markus

That is probably what I would do. There is (maybe) a limited number of threads available to the Timers as well, so you don’t want to use one of them up doing nothing.

Though they don’t necessarily have to be nested.

createTimer(now.plusSeconds(20),  [ |
    // TTS notification1
])
createTimer(now.plusSeconds(33),  [ |
   // TTS notification2
])

See Design Pattern: Gate Keeper and look at the second to last section above. It is not always practical to remove thread sleeps. In those cases you just have to be more careful.
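
For the ten-lights case above, one sleep-free option is to give each command its own Timer and stagger the start times by 25 ms, so the rule itself returns immediately. What follows is only a minimal sketch. It assumes a hypothetical Group gLights that holds the lights and a hypothetical trigger Item AllLights; neither name comes from the posts above.

```
rule "Switch lights on without Thread::sleep"
when
    Item AllLights received command ON   // hypothetical trigger Item
then
    // forEach with an index: i is 0, 1, 2, ... for each member of the Group
    gLights.members.forEach[ light, i |
        // schedule each command roughly 25 ms after the previous one
        createTimer(now.plusMillis(i * 25), [ |
            light.sendCommand(ON)
        ])
    ]
end
```

Timer scheduling is not exact to the millisecond, so treat the spacing as approximate. When strict ordering or spacing matters, the Gate Keeper design pattern is probably the more robust choice.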

Because if you don’t have any arguments to pass to the lambda, the | is optional. I always include it to be consistent. It’s the same reason I always put the lambda inside the parens.

Thanks for this. I had a lot of Thread::sleep calls spread over multiple rule files, including in the startup rules. They ranged from 2 to 60 seconds, and some were inside while loops. I knew it was probably not a great thing, but didn’t realize how harmful it could be until stumbling across this thread. I’ve since removed all the Thread::sleep instances and reworked everything as described here. And the while loops are now timer loops. The only thing I still need to do is put some kind of sanity limit on the loops, otherwise an edge case could make them loop for hours.
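
A sanity limit on a timer loop can be as simple as a counter that is checked before each reschedule. The sketch below shows one possible shape, not the poster's actual rules. The Item names HypotheticalTrigger and HypotheticalDevice, the 2 second interval, and the limit of 60 passes are all assumptions made for illustration.

```
var Timer loopTimer = null
var int loopCount = 0
val int MAX_LOOPS = 60   // assumed limit: 60 passes x 2 s = at most 2 minutes

rule "Timer loop with a sanity limit"
when
    Item HypotheticalTrigger received command ON
then
    if (loopTimer !== null) return;   // a loop is already running

    loopCount = 0
    loopTimer = createTimer(now.plusSeconds(2), [ |
        loopCount = loopCount + 1
        if (HypotheticalDevice.state == ON) {
            // the condition we were waiting for is met, stop looping
            loopTimer = null
        } else if (loopCount >= MAX_LOOPS) {
            logWarn("TimerLoop", "Giving up after " + loopCount + " passes")
            loopTimer = null
        } else {
            // not done yet, run this same lambda again in 2 seconds
            loopTimer.reschedule(now.plusSeconds(2))
        }
    ])
end
```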

Are there any benefits or differences between having rules all in one file versus spread across many?

Organization and troubleshooting. A problem in one rule file will not necessarily affect all the others. For example, all the rules and items for my Z-Wave garage door openers are in their own item and rule files. They are picky and complicated devices and the rule file is quite long, so it’s best to keep them siloed. The Weather Underground item file is huge and that’s in its own file for the same reason.

At runtime there is no difference between the two. Once OH loads them they are all in memory and behave the same.

So the only place where it matters to OH is when it loads the file. Here the main practical difference is that System started Rules fire when a .rules file loads. If you have everything in one .rules file, all of your System started Rules will fire on every reload, whereas if you have them split into multiple files, only those in the reloaded file will fire.

Beyond that I don’t think there are any practical differences as far as OH loading of files is concerned.

So the biggest practical difference between having multiple .rules files versus one big file is that global val/vars are only global to that one .rules file. So if you have a variable that needs to be used by multiple Rules, they all need to be in the same file.

But for you the human, working with files that are thousands of lines long (not an unheard of size for lots of OH systems) is awkward. So it makes sense to split up the files. There are several strategies people use. I prefer and recommend splitting both your .items and your .rules files up by function (e.g. lighting, weather, hvac, etc). This provides a logical organization for the files and decreases the likelihood that you will need to deal with the scope of global variables issue mentioned previously. But ultimately, you are the human who has to deal with all this stuff. Do what makes the most sense for you.

That all makes sense. I have north of 20 rules files for the exact reason of troubleshooting. I was looking for the runtime answer which you have given.

The one thing I would add here for those reading it in the future is to be very careful with the sendHttpGetRequest function. When the site replies quickly, it works great. When the site lags or just fails to reply at all, it can cause a world of issues. I haven’t quite pinpointed why, but I’ve seen weirdness where an HTTP GET in one rule will cause rules in totally separate files/items/things/etc to not fire at all until it times out (as long as 15 seconds in some cases that I’ve seen). The easiest way to get around it is to just execute curl through the executeCommandLine function and put a reasonable timeout on it. I’ve completely replaced sendHttpGetRequest across my rules and things run much smoother now. This could potentially be resolved by adding a timeout to the function like sendHttpPostRequest has, but for now this seems to work just as well. Again, this is just personal experience and I haven’t gone very far into figuring out why this causes things to lock up; it’s just something to keep an eye out for.
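
A minimal sketch of that curl workaround, for use inside a rule body. The URL and the Item name HypotheticalStatus are placeholders, and the timeout values are arbitrary examples rather than recommendations.

```
// Hypothetical URL and Item name; replace with your own.
// curl: -s silences the progress output, -m 10 makes curl give up after 10 seconds.
// The second argument makes executeCommandLine wait at most 12000 ms for the command.
val String result = executeCommandLine("curl -s -m 10 http://example.invalid/status", 12000)

// Guard against a missing result (e.g. the command did not finish in time)
if (result !== null) {
    HypotheticalStatus.postUpdate(result)
}
```

Setting curl's own limit a little below the executeCommandLine timeout lets curl exit cleanly before openHAB stops waiting for it.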

It sounds a lot like you’ve run out of Rules threads and no Rules can fire until one of the running Rules exits.

This raises a good question though. I always assumed there was a reasonable timeout on the sendHttpRequest Actions, but from what you describe there may not be one, or it is a really large timeout. Indeed there should be a timeout on all of the sendHttpRequest Actions, not just the Post.

I’ve had to increase the number of rules threads several times via runtime.cfg. It’s fixed some things, but others still have issues. I have one website in particular which causes a great deal of pain. It takes as long as 15 seconds to reply (it has to do several queries of remote systems over 4G before it replies to me, so this is expected). I’m peeling that off into a timer as you suggested above to try and alleviate the problems, but it’s only marginally helping. There is a mechanism where you can create a String variable and call .sendHttpGetRequest(timeout) on the string, but it’s not always reliable. Curl also gives me more options for things like authentication headers, so I just tend to use it for everything.

Another oddity is that there is some kind of race condition depending on the order in which rules files are read in. For example, I have two cron jobs in two different files. Job #1 runs every second. It simply increases an idle timer for a device. Job #2 runs on the 15th and 45th second of every minute to query a webpage (the one that takes up to 15 seconds mentioned above). If the rules file with job #1 is loaded first, job #1 runs without a glitch. If the rules file with job #2 is loaded first, job #1 runs from 00-15 and then 30-45 seconds of every minute; it does not run while job #2 is waiting for the page to load. I can replicate this behavior by simply going into the rules file and doing a save to cause it to reload.

For cron triggered Rules there are only two threads in the pool (it’s a separate pool). So you are probably using up those two threads and the Rules end up having to wait for one of them to free up before getting a thread to run in.

You seem to have an extreme case and sometimes extreme measures may be necessary.

Have you considered offloading this polling of the HTTP pages to a script that runs outside of OH and pushes the results to OH? Then OH won’t have to wait at all or use up any of its threads. Since you are already using curl, you should be able to use sensorReporter with the execSensor and not even have to write any new code. Though there might be some thread timeout problems in my script as well. I’ve not changed the execSensor to spawn a new thread to run the script, so it uses up the main thread.

You could also use the system’s cron and run two curls piped to each other, one to get the data and the other to post the result to an OH Item.

Another option is to use JSR223, which TMK does not have the threading issues found in the Rules DSL.

For future reference, I ended up resolving the issue by putting the entire offending rule into a timer and greatly increasing the available threads to the system. As a precaution, I created a switch item that I’m using as a de facto ReentrantLock to prevent it from running if the last one hasn’t finished yet (I’ve seen this happen already). I have not yet seen it cause the system to jam up since I did that.
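
Putting a rule "into a timer" might look roughly like the sketch below. It is not the poster's actual rule. The URL, the Item name HypotheticalResult, and the timeout values are placeholders, and the cron expression mirrors the job described earlier that fires on seconds 15 and 45.

```
rule "Poll slow site"
when
    Time cron "15,45 * * * * ?"   // seconds 15 and 45 of every minute
then
    // The rule only schedules the work and exits at once,
    // so the cron thread is freed while the request is still pending.
    createTimer(now, [ |
        val String result = executeCommandLine("curl -s -m 20 http://example.invalid/slow", 25000)
        if (result !== null) {
            HypotheticalResult.postUpdate(result)
        }
    ])
end
```

The slow request still occupies a thread while it runs, just one from the Timer side rather than from the cron pool, so this moves the wait rather than removing it.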

If you start to see problems you might need to use an actual ReentrantLock. The problem with using an Item is that there can be a good deal of time between setting the switch to ON and the Rule that checks the switch seeing that it is ON. In that gap you can end up missing that the last Rule has finished just in time.

You can use a ReentrantLock without actually locking the Rule.

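// Needs, at the top of the .rules file:
//   import java.util.concurrent.locks.ReentrantLock
//   val ReentrantLock lock = new ReentrantLock
if(!lock.tryLock) return; // another Rule instance has the lock, exit
try {
    // rule body
}
catch(Exception e){
    logError("Error", "Error in rule blah blah blah: " + e.toString)
}
finally {
    lock.unlock // always release the lock, even after an exception
}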

In this case the ReentrantLock doesn’t cause a problem because Rules don’t wait around for the lock. They just exit if the previous instance of the Rule is still running. By using the ReentrantLock you avoid that TOC-TOU (time of check, time of use) gap that exists when using a Switch Item.

But it all depends on how important it is that you don’t miss an opportunity to run the Rule when you should be able to.

Apparently I didn’t completely solve the problem. I’m still having the issue where I’m limited to two simultaneous cron threads (logs just proved this out for me). I’ve increased thingHandler, discovery, safeCall, and ruleEngine. Is the cron threadpool included in one of those or is it a separate one that I need to increase?

It’s a separate config file. On an apt-get install you need to change /usr/share/openhab2/runtime/etc/quartz.properties

Note that this file will almost certainly be overwritten when you upgrade your OH. The maintainers assume that we are not editing any files under runtime.

The pool size is 2 so I’m not surprised that you are limited to only two simultaneous cron Rules.
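
For orientation, the relevant line probably looks like the fragment below. This assumes the file uses the standard Quartz key org.quartz.threadPool.threadCount; the value 5 is only an example, so check your own copy of the file before editing.

```
# /usr/share/openhab2/runtime/etc/quartz.properties (excerpt)
# threadCount is a standard Quartz setting; 5 is only an example value
org.quartz.threadPool.threadCount = 5
```

Keep a note of the change somewhere outside the runtime folder so it can be reapplied after an upgrade.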

I’ve been having the problem of rules stopping after a few days ever since I updated to 2.3. Based on the guidance in this thread I’ve now managed to remove all sleeps and locks from my rules, but unfortunately it hasn’t helped.

Any suggestions on how I can track down the problem or what I should try next?