(OH 1.x and OH 2.x Rules DSL only) Why have my Rules stopped running? Why Thread::sleep is a bad idea

This is an interesting approach. I’ve definitely run into this issue as I have rules that take a long time to run (mostly HTTP queries) as well as rules that try to run simultaneously (e.g. audio notifications for things like doorbells that go off in multiple rooms). It’s more noticeable with the simultaneous rules because you can hear the notification play at different times (in some cases as long as 45 seconds later) across the house. I dialed my rules thread pool much higher and it resolved my issues. I would be curious to know if there is a real/noticeable difference between increasing the number of threads that the rules can use versus pushing the work over to the Quartz pool. I’m not overly worried about hardware resources; the computer my OH runs on has plenty to spare. If there is a difference, I’ll retool my rules to push those over.

There is a way to hook up a development tool called a profiler to OH, and it will tell you how many threads are active. Beyond that I don’t think there is anything. Threads aren’t usually something users have to worry about.

There might be a way to get some logs from Quartz that might be informative.

But beyond knowing these things are possible I’m not much help. I haven’t used a profiler in a decade.

Karaf may have something built in.

Hi Rich

Thanks for this, really helpful! One question…

Is there a difference between this rule:

rule "My sleeping rule"
when
    // an event
then
    // do some stuff
    createTimer(now.plusSeconds(1)),  [ |
        // do some more stuff
    ]
end

and this (different round brackets):

rule "My sleeping rule"
when
    // an event
then
    // do some stuff
    createTimer(now.plusSeconds(1),  [ |
        // do some more stuff
    ])
end

I mean for both cases the “// do some more stuff” is executed after one second, right?

Thanks
Michael

The second one is the recommended one in the docs.
See: https://www.openhab.org/docs/configuration/actions.html#timers

I know the first one works, but then we have a lambda declaration floating in the code after a comma. That doesn’t make sense to me.
I prefer the second one, the syntax makes more sense.

Both are technically equivalent. As far as the Rules DSL is concerned they are identical.

The underlying Xtend language provides a little “syntactic sugar”. For method calls that take a lambda as their final argument, you can put the lambda definition outside the parens.

I don’t like this and recommend against it in the OH context because most people don’t realize they are creating a lambda Object and passing it to the createTimer method in the former case, whereas in the latter case that is made just a little more explicit since you are putting the lambda definition inside the parens of the method call.

In short, like Vincent says, the latter example makes more sense to more people as it is more consistent with the rest of the Rules DSL.

You can use jconsole to view the threads, in real-time…

I find this easier to get some quick results…

shell:threads --list |grep "RuleEngine" |wc -l
shell:threads --list |grep "safeCall" |wc -l
shell:threads --list |grep "discovery" |wc -l
shell:threads --list |grep "thingHandler" |wc -l

Just remember the grep will show in the results, so subtract 1.

thanks - what exactly do these show? My results are 1, 6, 11 and 6…

Dan

Look at just the shell:threads --list. It will show you all threads. Adding the grep “thingHandler” will show just the lines containing thingHandler. The wc -l provides a line count (number of threads). The counts change quickly… or should! In theory, the ruleTimer should be the important one for rules.

I have 337 items in shell:threads. Surely that’s not right…

That sounds about right. I have 382. There’s a lot going on back there!

Hi Rich.
I want to get rid of the Thread::sleep calls in my rules.
If I have more than one sleep, do I use a timer inside a timer?

Situation:

Rollershutter up needs 20 seconds -> then
TTS notification for 13 seconds -> then
TTS notification for 10 seconds.

rule "My sleeping rule"
when
    // an event
then
    // rollershutter up
    createTimer(now.plusSeconds(20),  [ |
        // TTS notification1
      createTimer(now.plusSeconds(13),  [ |
          // TTS notification2
      ])
    ])
end

I have the same problem when switching 10 lights on. To prevent an overflow at the lights hub, I sleep for 25 milliseconds after switching each light.

And why is this okay too, without the “|”?

var Timer timer = null

rule "my rule"
when
    Member of MyGroup changed
then
    // only create a new timer when none is already pending
    if(timer === null) {
        timer = createTimer(now.plusSeconds(1), [   // no need of "|" ?
            //Find the lowest and do what ever you do
            timer = null
        ])
    }
end

Greetings,
Markus

That is probably what I would do. There is a limited number of threads available to the Timers as well (maybe) so you don’t want to use one of them up doing nothing.

Though they don’t necessarily have to be nested.

createTimer(now.plusSeconds(20),  [ |
    // TTS notification1
])
createTimer(now.plusSeconds(33),  [ |
   // TTS notification2
])

See Design Pattern: Gate Keeper and look at the second to last section above. It is not always practical to remove thread sleeps. In those cases you just have to be more careful.
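
For the lights case, one possible alternative to sleeping between commands is to schedule each command on its own timer with a growing offset. This is only a sketch and is not the Gate Keeper pattern itself; the Group gLights and the Item AllLights_Trigger are hypothetical names, and the 25 ms spacing is taken from the question above.

```xtend
rule "Switch the lights on with a gap between commands"
when
    Item AllLights_Trigger received command ON   // hypothetical trigger Item
then
    var int i = 0
    // gLights is a hypothetical Group holding the light Items
    for (light : gLights.members) {
        // each light gets its own timer, 25 ms later than the previous one
        createTimer(now.plusMillis(i * 25), [ |
            light.sendCommand(ON)
        ])
        i = i + 1
    }
end
```

The rule itself exits almost immediately, so it does not hold a Rules thread while the commands go out. Timer scheduling is not exact at this granularity, so the gaps will only be approximately 25 ms.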

Because if you don’t have any arguments to pass to the lambda, the | is optional. I always include it to be consistent. It’s the same reason I always put the lambda inside the parens.

Thanks for this. I had a lot of Thread::sleep calls all over multiple rule files, including in the startup rules. They ranged from 2 to 60 seconds, and some were inside while loops. I knew it was probably not a great thing, but didn’t realize how harmful it could be until stumbling across this thread. I’ve since removed all the Thread::sleep instances and reworked everything as described here. The while loops are now timer loops. The only thing I still need to do is put some kind of sanity limit on the loops, otherwise an edge case could make them loop for hours.
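
A minimal sketch of what a timer loop with a sanity limit might look like, assuming the OH 1.x/2.x Rules DSL. The Item MySwitch, the 5 second interval, and the limit of 60 iterations are hypothetical values chosen for illustration.

```xtend
var Timer loopTimer = null
var int loopCount = 0

rule "Timer loop with a sanity limit"
when
    Item MySwitch changed to ON   // hypothetical Item
then
    // start from a clean state if a previous loop is still pending
    loopTimer?.cancel
    loopCount = 0

    loopTimer = createTimer(now.plusSeconds(5), [ |
        loopCount = loopCount + 1
        if (MySwitch.state == ON && loopCount < 60) {
            // do one iteration of the work here, then schedule the next pass
            loopTimer.reschedule(now.plusSeconds(5))
        } else {
            // loop condition ended or the limit was reached: stop looping
            loopTimer = null
        }
    ])
end
```

With these numbers the loop gives up after roughly five minutes even if the condition never clears, and no thread is held between iterations.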

Are there any benefits or differences between having rules all in one file versus spread across many?

Organization and troubleshooting. A problem in one rule file will not necessarily affect all the others. For example, all the rules and items for my Z-Wave garage door openers are in their own item and rule files. They are picky and complicated devices and the rule file is quite long, so it’s best to keep them siloed. The Weather Underground item file is huge and that’s in its own file for the same reason.

At runtime there is no difference between the two. Once OH loads them they are all in memory and behave the same.

So the only place where it matters to OH is when it loads the file. Here the main practical difference is that System started Rules fire when a .rules file loads, so if you have everything in one .rules file then all of your System started Rules will fire whenever that file loads, whereas if you have them split into multiple files only those in the file being loaded will fire.

Beyond that I don’t think there are any practical differences as far as OH loading of files is concerned.

So the biggest practical difference between having multiple .rules files versus one big file is that global val/vars are only global to that one .rules file. So if you have a variable that needs to be used by multiple Rules, they all need to be in the same file.

But for you the human, working with files that are thousands of lines long (not an unheard of size for lots of OH systems) is awkward. So it makes sense to split up the files. There are several strategies people use. I prefer and recommend splitting both your .items and your .rules files up by function (e.g. lighting, weather, hvac, etc). This provides a logical organization for the files and decreases the likelihood that you will need to deal with the scope of global variables issue mentioned previously. But ultimately, you are the human who has to deal with all this stuff. Do what makes the most sense for you.

That all makes sense. I have north of 20 rules files for the exact reason of troubleshooting. I was looking for the runtime answer which you have given.

The one thing I would add here for those reading it in the future is to be very careful with the sendHttpGetRequest function. When the site replies quickly, it works great. When the site lags or just fails to reply at all, it can cause a world of issues. I haven’t quite pinpointed why, but I’ve seen weirdness where an HTTP GET in one rule will cause rules in totally separate files/items/things/etc to not fire at all until it times out (as long as 15 seconds in some cases that I’ve seen). The easiest way to get around it is to just execute curl through the executeCommandLine function and put a reasonable timeout on it. I’ve completely replaced sendHttpGetRequest across my rules and things run much smoother now. This could potentially be resolved by adding a timeout to the function like sendHttpPostRequest has, but for now this seems to work just as well. Again, this is just personal experience and I haven’t gone very far into figuring out why this causes things to lock up; it’s just something to keep an eye out for.
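
A rough sketch of the curl approach described above, assuming the OH 2.x form of executeCommandLine that takes a timeout in milliseconds as its second argument. The URL, the cron schedule, the Item SlowPage_Status, and both timeout values are hypothetical placeholders.

```xtend
rule "Poll a slow web page with curl"
when
    Time cron "0 0/5 * * * ?"   // hypothetical schedule: every 5 minutes
then
    // curl's -m (--max-time) option caps the whole transfer at 5 seconds;
    // the second argument tells OH to stop waiting after 6000 ms
    val String result = executeCommandLine("curl -s -m 5 http://example.invalid/status", 6000)

    if (result === null || result.isEmpty) {
        logWarn("polling", "No response from the web page within the timeout")
    } else {
        SlowPage_Status.postUpdate(result)   // hypothetical String Item
    }
end
```

The rule still occupies a thread until curl returns or the timeout expires, so this only bounds the wait; it does not remove it.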

It sounds a lot like you’ve run out of Rules threads and no Rules can fire until one of the running Rules exits.

This raises a good question though. I always assumed there was a reasonable timeout on the sendHttpRequest Actions, but from what you describe there may not be, or it is a really large timeout. Indeed there should be a timeout on all of the sendHttpRequest Actions, not just the Post.

I’ve had to increase the number of rules threads several times via runtime.cfg. It’s fixed some things, but others still have issues. I have one website in particular which causes a great deal of pain. It takes as long as 15 seconds to reply (it has to do several queries of remote systems over 4G before it replies to me so this is expected). I’m peeling that off into a timer as you suggested above to try to alleviate the problems but it’s only marginally helping. There is a mechanism where you can create a String variable and do .sendHttpGetRequest(timeout) on the string but it’s not always reliable. Curl also gives me more options for things like authentication headers so I just tend to use it for everything.

Another oddity is that there is some kind of race condition happening depending on the order that rules files are read in. For example, I have two cron jobs in two different files. Job #1 runs every second. It simply increases an idle timer for a device. Job #2 runs on the 15th and 45th second of every minute to query a webpage (the one that takes up to 15 seconds mentioned above). If the rules file with job #1 is loaded first, job #1 runs without a glitch. If the rules file with job #2 is loaded first, job #1 runs from 00-15 and then 30-45 seconds of every minute; it does not run while job #2 is waiting for the page to load. I can replicate this behavior by simply going into the rules file and doing a save to cause it to reload.

For cron triggered Rules there are only two threads in the pool (it’s a separate pool). So you are probably using up those two threads and the Rules end up having to wait for one of them to free up before getting a thread to run in.

You seem to have an extreme case and sometimes extreme measures may be necessary.

Have you considered offloading this polling of the HTTP pages to a script that runs outside of OH and pushes the results to OH? Then OH won’t have to wait at all or use up any of its threads. Since you are already using curl, you should be able to use sensorReporter with the execSensor and not even have to write any new code. Though there might be some thread timeout problems in my script as well. I’ve not changed the execSensor to spawn a new thread to run the script, so it uses up the main thread.

You could also use the system’s cron and run two curls piped to each other, one to get the data and the other to post the result to an OH Item.