(OH 1.x and OH 2.x Rules DSL only] Why have my Rules stopped running? Why Thread::sleep is a bad idea

rlkoshak · July 11, 2018, 6:04pm

EDIT: While the general advice in this article still applies (i.e. it’s always worth while to make short running rules), the problem described in this article only applies to OH 1.x and Rules DSL in OH 2.x. In OH 3 and beyond, this is no longer a problem so if your rules stop running in any version of OH released in the last two years (as of this edit in 1/2023) this is not your problem.

By way of explanation, in OH 3 each rule gets it’s own thread. It is no longer possible for a rule to starve out other rules by consuming threads waiting around doing nothing.

This topic comes up over and over again so I’m typing it up here to reference later. Eventually I’d like to refine the text and add it to the official docs.

The tl;dr is avoid long running Rules wherever possible.

What is a long running Rule?

Any Rule that takes longer than half a second is a long running Rule.

What causes a long running Rule?

Thread::sleep
executeCommandLine with a timeout
sendHttp*Request
other third party Actions
loops through very large collections (thousands)
indeterminate while loops (e.g. loops waiting for a condition to be reached)
any of the first four things above inside a loop
any of the first four things above inside a Locked block of code

Why can this cause problems?

By default the Rules DSL only provides five threads for Rules to execute in. This means that the Rules Engine can only support five simultaneously running Rules at a given time. When there are long running Rules, the likelihood that those five threads will be used up increases. When all five threads are used up with long running Rules, that means that no other Rules can execute until one of these long running Rules exits.

In the best case scenario this can cause latency between when events occurs and the Rules that trigger on those events execute. In the worst case scenario more and more Rules get queued up awaiting access to an execution thread to the point that all your Rules simply stop executing in any meaningful manner.

Let’s look at a real world example. Review the following Rule:

rule "Motion sensor"
when
    Item MyMotionSensor changed to ON // receives an ON command every time motion is detected with a 30 second cool down period
then
    MyLight.sendCommand(ON)
    Thread::sleep(300000) // five minutes
end

The intent of the Rule above is to turn on a light for 5 minutes after motion is detected. However, if you are active in a room for six minutes constantly moving and the MyMotionSensor triggers every 30 seconds then the following sequence of events take place:

Time	Num Threads Left	Num Running	Num Queued	Comment
0	4	1	0	One instance of “Motion sensor” running
30	3	2	0	Two instances of “Motion sensor” running
1:00	2	3	0	Three instances of “Motion sensor” running
1:30	1	4	0	Four instances of “Motion sensor” running
2:00	0	5	0	Five instances running, no threads left
2:30	0	5	1	One instance of “Motion sensor” queued to run, five running
3:00	0	5	2	Two instances queued
3:30	0	5	3	Three instances queued
4:00	0	5	4	Four instances queued
4:30	0	5	5	Five instances queued
5:00	0	5	5	Finally, that first instance of “Motion sensor” exited and one of those in the queue can start executing.
5:30	0	5	5	At least the queue isn’t growing, but instances of Rules queued five minutes ago are only now just starting to execute
6:00	0	5	5	This is the last new motion sensor event.
6:30	0	5	4	Finally we are working down the queue
7:00	0	5	3
7:30	0	5	2
8:00	0	5	1
8:30	0	5	0
9:00	0	4	0	For the first time in seven minutes new instances of Rules can run
9:30	0	3	0
10:00	0	2	0
10:30	0	1	0
11:00	0	0	0	Finally, 11 minutes after the first event and five minutes after the last event all the Events have been worked off.

Given the simple Rule above, between times 2:30 and 8:30, six whole minutes, NO other Rules can run. And if motion is continually detected the Rules could be starved out indefinitely.

And the above example assumes just one Rule is trying to run. Where real problems can occur is if there is more than one long running Rule in a system. The longer the Rule takes to run and the more of these Rules have the potential to run at the same time the higher the likelihood that the five threads will be used up and your other Rules will get starved out.

What can I do?

Avoid Thread::sleep longer than 500 msec.

Use the shortest sleep possible. If longer sleeps are required, use a Timer instead of a Thread::sleep. Timers do not use a Rules thread.

For example, change:

rule "My sleeping rule"
when
    // an event
then
    // do some stuff
    Thread::sleep(1000)
    // do some more stuff
end

to

rule "My sleeping rule"
when
    // and event
then
    // do some stuff
    createTimer(now.plusSeconds(1),  [ |
        // do some more stuff
    ])
end

The first version ties up a Rules execution thread for one second and some dozen milliseconds. The second version only ties up a Rule execution thread for a dozen milliseconds.

Put long running Actions in a Timer

When you create a Timer to execute now, the Timer will start to execute immediately. So you can replace the following:

rule "Long running actions"
when
    // some event
then

    val results = executeCommandLine("/long/running/script.sh", 5000)
    // code to process results

    val json = sendHttpGetRequest("http://slow.web.site.com")
    // code to process results
end

with

rule "Long running actions"
when
    // some event
then
    createTimer(now, [ |
        val results = executeCommandLine("/long/running/script.sh", 5000)
        // code to process results        
    ])

    createTimer(now, [ |
        val json = sendHttpGetRequest("http://slow.web.site.com")
        // code to process results
    ])
end

Avoid Locks Entirely

Locks pose a lot of the same risks that long running Rules do. So they should be used sparingly. It is worth the effort to come up with some other approach if Locks start to appear to be a requirement.

If a Lock must be used, do not put long running code inside the locked block. The problem with Lock and long running code is not only is the long running code in that one Rule using up a thread, but all the Rules awaiting access to the Lock are also blocked and using up one of the Rule threads while they await the lock. This only exacerbates the sorts of problems illustrated in the table above.

Never put a Thread::sleep or long running Action call inside a loop

Let’s look at the following example.

while(condition){
    // some code
    Thread::sleep(100)
}

Looks like there is no problem. The Thread::sleep is far below the 500 max discussed above. But the problem is we have no idea how many times this while loop will iterate, and each iteration takes more than 100 msec. So if the loop iterates just five times we are already in danger territory. And we don’t know up front how many times this loop will iterate (otherwise a for loop would have been used).

Luckily it is very easy to implement a while loop using Timers.

var timer = null // global variable

timer = createTimer(now, [ |
    // some code
    if(!condition) timer.reschedule(now.plusMillis(100))
    else timer = null
])

But what if there is no other way?

Rarely you will encounter a situation where there really is no other way but to use a long Thread::sleep or Lock or long running action in a Rule. If this is the case, writing such a Rule safely will require the Rules developer to have a very good understanding of ALL of their Rules and of the events that occur in their system including how often and at what times.

If we look back at the “Motion sensor” Rule above, one of the big problems was that the Rule itself triggers faster than it takes for the Rule to execute. This is a situation that must be avoided. So the Rules developer must understand the maximum rate that their events can occur and make sure the Rule takes less time than that to complete.

Make sure to understand how Rules interact. When one event occurs, are more than one Rule triggered? Are to independent long running Rules likely to trigger at or near the same time? As you make changes to or add new Rules, you need to revisit all of your long running Rules again to make sure they are still safe to run.

The ability to use long running Rules safely requires a deep understanding of your own home automation. Often, even for advanced developers, avoiding the long running Rules in the first place is easier.

The Aspirin Fix

Finally, you can treat the symptom of the problem by increasing the number of Rules execution threads. I call this the Aspirin Fix because it is like treating a broken bone with just Aspirin. It might deaden the pain, but it doesn’t really treat the real problem. Increasing the number of threads takes some of the pressure off so it takes more simultaneous Rules before you run out of threads, but the Rules instances are still piling up and may become a problem in the future.

To increase the number of Rules threads, add the following line to conf/services/runtime.cfg (/etc/openhab2/services/runtime.cfg on installed OH):

org.eclipse.smarthome.threadpool:RuleEngine=5

Note that the larger value you use, the more RAM your OH will require.

On another thread: New Add-on bundle for Prometheus health Metrics - #3 by friesenkiwi @friesenkiwi and team are working on a way to export openHAB metrics to Prometheus, centralized health and status metrics server that uses Graphana to display and explore various metrics. At me request they just added the OH thread pools as one of the metrics which some here may find useful.

NCO · July 12, 2018, 5:35am

Great hints, Rich!

I am using sleep all over the place and did not know the consequences.

I have one questions:

How long does this run and does this mean the rest of the rule will proceed without waiting (script will be finished in the background)?

rlkoshak · July 12, 2018, 2:00pm

The call to createTimer completes almost instantly and the rest of the Rule will run immediately without waiting. That is sort of the point.

Any code you have in your Rule that depends on the results from executeCommandLine, that code must be put inside the timer’s body.

morph166955 · July 12, 2018, 6:39pm

So given that everything ultimately needs to live on a thread somewhere, what is the logic of how the timer gets around this issue? Does it just push it to one of the other pools? If so, which pool? Also, what impact does that have on whatever else was supposed to live in that other pool? The last time I checked, the other 3 pools only have 5 or 10 threads each as well so aren’t we just shifting the problem around?

rlkoshak · July 12, 2018, 7:52pm

I know that Timers get executed by Quartz. But I don’t know if it uses a thread from the quartz thread pool or whether it creates a new thread. I suspect it creates a new thread because the default pool is 2 for quartz. (/usr/share/openhab2/runtime/etc/quartz.properties) and I’m certain I’ve had more than two running at the same time.

The only things that live in that pool are Timers (if it does use threads from that pool) and cron triggered Rules. Sometime this weekend I’ll try to set up an experiment to verify this. I’m just assuming based on the behavior I’ve see in my system so far.

We are not just shifting the problem around, we are shifting the problem to a place where it causes less of an impact. Assuming the Timers do run in a thread from the Quartz thread pool, then if we run out of those threads then the impact is only Timers and cron triggered Rules stop running rather than ALL Rules stop running.

We are also lessening the amount of time that the Threads are tied up because unlike a Thread::sleep that uses up a thread doing nothing, a Timer doesn’t use any Thread until it is actually running. So, if you look at the complex example, the while loop uses a Rule thread non-stop from the first movement detection until morning. With the Timer implementation it is only using a Quartz thread (assuming that it does use a thread from this pool) for a handful of milliseconds every minute, freeing up that thread for use the rest of the time.

So in many cases we are decreasing the likelihood that we will run out of threads because we are not tying one up doing nothing but waiting around, and we are lessening the impact should we run out of threads.

I’d be really interested in hearing if anyone encounters problems with Timers that may be caused by running out of Threads in the Quartz pool. If that does occur then we should file an issue to address that problem.

dan12345 · July 12, 2018, 8:01pm

Rich, this is all really helpful. Is there an easy way to see how many threads are running at any given time?

morph166955 · July 12, 2018, 8:11pm

This is an interesting approach. I’ve definitely run into this issue as I have rules that take a long time to run (mostly HTTP queries) as well as rules that try to run simultaneously (e.g. audio notifications for things like doorbells that go off in multiple rooms). It’s more noticeable to the simultaneous rules because you can hear the notification play at different times (in some cases as long as 45 seconds later) across the hosue. I dialed my rules threadpool much higher and it resolved my issues. I would be curious to know if there is a real/noticable difference between increasing the number of threads that the rules can use versus pushing it over to the Quartz pool. I’m not overly worried about hardware resources, the computer my OH runs on has plenty to spare. If so I’ll retool my rules to push those over.

rlkoshak · July 13, 2018, 4:39am

There is a way to hook up a development too called a profiler to OH and it will tell you how many threads are active. Beyond that I don’t think there is. Worrying about threads isn’t usually something users have to worry about.

There might be a way to get some logs from Quartz that might be informative.

But beyond knowing these things are possible I’m not much help. I haven’t used a profiler in a decade.

Karaf may have something built in.

michaeljoos · July 13, 2018, 7:41am

Hi Rich

Thanks for this, really helpful! One question…

Is it a difference between this rule:

rule "My sleeping rule"
when
    // and event
then
    // do some stuff
    createTimer(now.plusSeconds(1)),  [ |
        // do some more stuff
    ]
end

and this (different round brackets):

rule "My sleeping rule"
when
    // and event
then
    // do some stuff
    createTimer(now.plusSeconds(1),  [ |
        // do some more stuff
    ])
end

I mean for both cases the “// do some more stuff” is executed after one second, right?

Thanks
Michael

vzorglub · July 13, 2018, 8:16am

The second one is the recommended one in the docs.
See: https://www.openhab.org/docs/configuration/actions.html#timers

I know the the first one works but then we have a lambda declaration floating in the code after a comma. Doesn’t make sense to me.
I prefer the second one, the syntax makes more sense.

rlkoshak · July 14, 2018, 3:18am

Both are technically equivalent. As far as the Rules DSL is concerned they are identical.

The underlying Xtend language provides a little “syntatic sugar”. For method calls that take a lambda as it’s final argument, you can put the lambda definition outside the parens.

I don’t like this and recommend against it in the OH context because most people don’t realize they are creating a lambda Object and passing it to the createTimer method in the former case, whereas in the latter case that is made just a little more explicit since you are putting the lambda definition inside the parens of the method call.

In short, like Vincent says, the later example makes more sense to more people as it is more consistent with the rest of the Rules DSL.

5iver · July 14, 2018, 7:08am

You can use jconsole to view the threads, in real-time…

I find this easier to get some quick results…

shell:threads --list |grep "RuleEngine" |wc -l
shell:threads --list |grep "safeCall" |wc -l
shell:threads --list |grep "discovery" |wc -l
shell:threads --list |grep "thingHandler" |wc -l

Just remember the grep will show in the results, so subtract 1.

dan12345 · July 14, 2018, 11:54am

thanks - what exactly do these show? My results are 1, 6, 11 and 6…

Dan

5iver · July 14, 2018, 11:58am

Look at just the shell:threads --list. It will shw you all threads. Adding the grep “thingHandler” will show just the lines containing thingHandler. The wc -l provides a line count (number of threads). The counts change quickly… or should! In theory, the ruleTimer should be the important one for rules.

dan12345 · July 14, 2018, 12:00pm

I have 337 items in shell:threads. Surely that’s not right…

5iver · July 14, 2018, 12:03pm

That sounds about right. I have 382. There’s a lot going on back there!

Master79 · July 15, 2018, 10:08am

Hi Rich.
Want to ban my thread::sleeps in my rules.
If i have more than 1 sleeps, i use a timer in timer?

Situation:

Rollershutter up needs 20 seconds -> than
TTS notification for 13 seconds -> than
TTS notification for 10 seconds.

rule "My sleeping rule"
when
    // and event
then
    // rollershutter up
    createTimer(now.plusSeconds(20),  [ |
        // TTS notification1
      createTimer(now.plusSeconds(13),  [ |
          // TTS notification2
      ])
    ])
end

Same problem by switching 10 lights to on. To prevent an overflow at the lights hub, after every switch of a light, i use a sleep function by 25 mseconds.

And why this is okay too, without the “|”

var Timer timer = null

rule "my rule"
when
    Member of MyGroup changes
then
    if(timer !== null) {
        timer = createTimer(now.plusSeconds(1), [   // no need of "|" ?
            //Find the lowest and do what ever you do
            timer = null
        ])
    }
end

Greetings,
Markus

rlkoshak · July 15, 2018, 10:15pm

That is probably what I would do. There is a limited number of threads available to the Timers as well (maybe) so you don’t want to use one of them up doing nothing.

Though they don’t necessarily have to be nested.

createTimer(now.plusSeconds(20),  [ |
    // TTS notification1
])
createTimer(now.plusSeconds(33),  [ |
   // TTS notification2
])

See Design Pattern: Gate Keeper and look at the second to last section above. It is not always practical to remove thread sleeps. In those cases you just have to be more careful.

Because if you don’t have any arguments to pass to the lambda, the | is optional. I always include it to be consistent. It’s the same reason I always put the lambda inside the parens.

Pedals2Paddles · July 17, 2018, 10:27am

Thanks for this. I had a lot of Thread::sleep all over multiple rule files, to include in the startups. All ranging from 2 to 60 seconds and some within while loops. I knew it was probably not a great thing, but didn’t realize how harmful it could be until stumbling across this thread. I’ve since removed all the thread::sleep instances and reworked everything as described here. And the while loops are now timer loops. The only thing I need to do still is put some kind of sanity limit on the loops otherwise a edge case could make them loop for hours.

morph166955 · July 18, 2018, 12:35pm

Is there any benefits or difference between having rules all in one file versus spread across many?