EDIT: While the general advice in this article still applies (i.e. it’s always worthwhile to keep rules short running), the problem described in this article only applies to OH 1.x and to Rules DSL in OH 2.x. In OH 3 and beyond this is no longer a problem, so if your rules stop running in any version of OH released in the last two years (as of this edit in 1/2023), this is not your problem.
By way of explanation, in OH 3 each rule gets its own thread. It is no longer possible for a rule to starve out other rules by consuming threads while waiting around doing nothing.
This topic comes up over and over again so I’m typing it up here to reference later. Eventually I’d like to refine the text and add it to the official docs.
The tl;dr is avoid long running Rules wherever possible.
What is a long running Rule?
Any Rule that takes longer than half a second is a long running Rule.
What causes a long running Rule?
- Thread::sleep
- executeCommandLine with a timeout
- sendHttp*Request
- other third party Actions
- loops through very large collections (thousands)
- indeterminate while loops (e.g. loops waiting for a condition to be reached)
- any of the first four things above inside a loop
- any of the first four things above inside a Locked block of code
Why can this cause problems?
By default the Rules DSL only provides five threads for Rules to execute in. This means that the Rules Engine can only support five simultaneously running Rules at a given time. When there are long running Rules, the likelihood that those five threads will be used up increases. When all five threads are used up with long running Rules, that means that no other Rules can execute until one of these long running Rules exits.
In the best case scenario this causes latency between when events occur and when the Rules that trigger on those events execute. In the worst case scenario more and more Rules get queued up awaiting access to an execution thread, to the point that all your Rules simply stop executing in any meaningful manner.
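The mechanism is not specific to openHAB; any fixed-size thread pool behaves this way. The following is a minimal plain-Java sketch of it, not openHAB's actual implementation. The class and method names are made up for illustration, and latches stand in for long sleeps so the demo finishes instantly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Plain-Java analogy of a fixed pool of rule threads; this is not openHAB code. */
final class PoolStarvationDemo {

    /** Returns how many tasks sit in the queue while every worker is blocked. */
    static int queuedWhileSaturated(int poolSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch allStarted = new CountDownLatch(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stand-ins for long running rules: each occupies a worker until released.
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    allStarted.countDown();
                    release.await();
                    return null;
                });
            }
            allStarted.await();

            // Stand-in for a short rule: no worker is free, so it can only be queued.
            Future<String> shortRule = pool.submit(() -> "short rule ran");
            int queued = pool.getQueue().size();

            release.countDown();                 // the long running rules finally exit
            shortRule.get(5, TimeUnit.SECONDS);  // only now can the short rule run
            return queued;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued while saturated: " + queuedWhileSaturated(5));
    }
}
```

The short task does nothing slow itself, yet it cannot start until one of the blocked tasks lets go of its thread. That is the latency described above.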
Let’s look at a real world example. Review the following Rule:
rule "Motion sensor"
when
Item MyMotionSensor changed to ON // receives an ON command every time motion is detected with a 30 second cool down period
then
MyLight.sendCommand(ON)
Thread::sleep(300000) // five minutes
MyLight.sendCommand(OFF)
end
The intent of the Rule above is to turn on a light for 5 minutes after motion is detected. However, if you are active in a room for six minutes, constantly moving, and MyMotionSensor triggers every 30 seconds, then the following sequence of events takes place:
Time | Num Threads Left | Num Running | Num Queued | Comment |
---|---|---|---|---|
0:00 | 4 | 1 | 0 | One instance of “Motion sensor” running |
0:30 | 3 | 2 | 0 | Two instances of “Motion sensor” running |
1:00 | 2 | 3 | 0 | Three instances of “Motion sensor” running |
1:30 | 1 | 4 | 0 | Four instances of “Motion sensor” running |
2:00 | 0 | 5 | 0 | Five instances running, no threads left |
2:30 | 0 | 5 | 1 | One instance of “Motion sensor” queued to run, five running |
3:00 | 0 | 5 | 2 | Two instances queued |
3:30 | 0 | 5 | 3 | Three instances queued |
4:00 | 0 | 5 | 4 | Four instances queued |
4:30 | 0 | 5 | 5 | Five instances queued |
5:00 | 0 | 5 | 5 | Finally, that first instance of “Motion sensor” exited and one of those in the queue can start executing, but a new event arrives at the same time. |
5:30 | 0 | 5 | 5 | At least the queue isn’t growing, but instances queued two and a half minutes ago are only now starting to execute |
6:00 | 0 | 5 | 5 | This is the last new motion sensor event. |
6:30 | 0 | 5 | 4 | Finally we are working down the queue |
7:00 | 0 | 5 | 3 | The last of the first five instances has exited |
7:30 to 9:30 | 0 | 5 | 3 | Nothing changes: the five running instances started between 5:00 and 7:00 and each sleeps for five minutes |
10:00 | 0 | 5 | 2 | The instance queued at 5:00 starts, five minutes late |
10:30 | 0 | 5 | 1 | |
11:00 | 0 | 5 | 0 | The instance for the last event (6:00) starts |
11:30 | 1 | 4 | 0 | For the first time in nine and a half minutes a thread is free for a new Rule |
12:00 | 2 | 3 | 0 | |
12:30 to 14:30 | 2 | 3 | 0 | The last three instances are still sleeping |
15:00 | 3 | 2 | 0 | |
15:30 | 4 | 1 | 0 | |
16:00 | 5 | 0 | 0 | Finally, 16 minutes after the first event and ten minutes after the last event, all the events have been worked off. |
Given the simple Rule above, between times 2:00 and 11:30, nine and a half minutes, no other Rule can start right away. Anything that triggers in that window has to wait in the queue behind the “Motion sensor” instances. And if motion is continually detected the other Rules could be starved out indefinitely.
And the above example assumes just one Rule is trying to run. Real problems occur when there is more than one long running Rule in a system. The longer a Rule takes to run, and the more of these Rules can run at the same time, the higher the likelihood that the five threads will be used up and your other Rules will get starved out.
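The scenario can also be replayed mechanically. The sketch below is a hypothetical Java simulation (the class and method names are invented for illustration). It assumes first-come-first-served queueing, that every instance runs for exactly the sleep time, and that the run time is a multiple of the event interval. It uses no real threads or sleeps, and it prints the time, threads left, running, and queued counts at each tick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Discrete replay of the "Motion sensor" scenario; no real threads or sleeps are used. */
final class RuleQueueSimulation {

    /** Returns one row per tick: {timeSeconds, threadsLeft, running, queued}. */
    static List<int[]> simulate(int poolSize, int ruleSeconds, int eventEverySeconds,
                                int lastEventSeconds) {
        List<int[]> rows = new ArrayList<>();
        PriorityQueue<Integer> finishTimes = new PriorityQueue<>(); // one entry per running instance
        int queued = 0;                                              // instances waiting for a thread
        for (int t = 0; ; t += eventEverySeconds) {
            // 1. Instances whose sleep has ended exit and free their thread.
            while (!finishTimes.isEmpty() && finishTimes.peek() <= t) {
                finishTimes.poll();
            }
            // 2. A new motion event adds one more instance to the back of the queue.
            if (t <= lastEventSeconds) {
                queued++;
            }
            // 3. Queued instances start for as long as a thread is free.
            while (queued > 0 && finishTimes.size() < poolSize) {
                finishTimes.add(t + ruleSeconds);
                queued--;
            }
            rows.add(new int[] {t, poolSize - finishTimes.size(), finishTimes.size(), queued});
            if (t > lastEventSeconds && finishTimes.isEmpty() && queued == 0) {
                return rows;
            }
        }
    }

    public static void main(String[] args) {
        // Five threads, a 300 s sleep, one event every 30 s, last event at 6:00.
        for (int[] r : simulate(5, 300, 30, 360)) {
            System.out.printf("%d:%02d | %d | %d | %d%n", r[0] / 60, r[0] % 60, r[1], r[2], r[3]);
        }
    }
}
```

Changing the arguments (a longer sleep, a second event source, a larger pool) gives a quick feel for how long the pool stays saturated in a given setup.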
What can I do?
Avoid Thread::sleep longer than 500 msec.
Use the shortest sleep possible. If longer sleeps are required, use a Timer instead of a Thread::sleep. Timers do not use a Rules thread.
For example, change:
rule "My sleeping rule"
when
// an event
then
// do some stuff
Thread::sleep(1000)
// do some more stuff
end
to
rule "My sleeping rule"
when
// an event
then
// do some stuff
createTimer(now.plusSeconds(1), [ |
// do some more stuff
])
end
The first version ties up a Rules execution thread for one second and some dozen milliseconds. The second version only ties up a Rule execution thread for a dozen milliseconds.
Put long running Actions in a Timer
When you create a Timer to execute now, the Timer will start to execute immediately. So you can replace the following:
rule "Long running actions"
when
// some event
then
val results = executeCommandLine("/long/running/script.sh", 5000)
// code to process results
val json = sendHttpGetRequest("http://slow.web.site.com")
// code to process results
end
with
rule "Long running actions"
when
// some event
then
createTimer(now, [ |
val results = executeCommandLine("/long/running/script.sh", 5000)
// code to process results
])
createTimer(now, [ |
val json = sendHttpGetRequest("http://slow.web.site.com")
// code to process results
])
end
Avoid Locks Entirely
Locks pose a lot of the same risks that long running Rules do. So they should be used sparingly. It is worth the effort to come up with some other approach if Locks start to appear to be a requirement.
If a Lock must be used, do not put long running code inside the locked block. The problem with combining a Lock and long running code is that not only does the long running code in that one Rule use up a thread, but every Rule awaiting access to the Lock is also blocked and uses up one of the Rule threads while it waits. This only exacerbates the sorts of problems illustrated in the table above.
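The pile-up can be shown with a small plain-Java sketch. It is an analogy rather than openHAB code, and the class and method names are hypothetical. One task holds a ReentrantLock while doing something slow, simulated here with a latch. The remaining tasks only need the lock for a moment, yet each one occupies a pool thread for as long as it waits.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Plain-Java analogy of rules sharing a lock; this is not openHAB code. */
final class LockPileUpDemo {

    /** Returns how many pool threads are occupied while one task is slow inside the lock. */
    static int threadsTiedUp(int poolSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // One "rule" takes the lock and then does something slow while holding it.
            pool.execute(() -> {
                lock.lock();
                try {
                    holderHasLock.countDown();
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            });
            holderHasLock.await();

            // The other "rules" only need the lock briefly, but each must wait for it.
            for (int i = 1; i < poolSize; i++) {
                pool.execute(() -> {
                    lock.lock();
                    try {
                        // a quick update of shared state would go here
                    } finally {
                        lock.unlock();
                    }
                });
            }
            // Wait (at most 5 s) until every waiter is parked on the lock.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (lock.getQueueLength() < poolSize - 1 && System.nanoTime() < deadline) {
                Thread.sleep(1);
            }
            int busy = pool.getActiveCount(); // the holder plus all the waiters
            release.countDown();
            return busy;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads tied up: " + threadsTiedUp(5));
    }
}
```

Only one task is actually doing slow work, but the whole pool is occupied until it releases the lock.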
Never put a Thread::sleep or long running Action call inside a loop
Let’s look at the following example.
while(condition){
// some code
Thread::sleep(100)
}
This looks harmless: the Thread::sleep is far below the 500 msec maximum discussed above. The problem is that we have no idea how many times this while loop will iterate, and each iteration takes more than 100 msec. If the loop iterates just five times we are already in danger territory. And we don’t know up front how many times the loop will iterate (otherwise a for loop would have been used).
Luckily it is very easy to implement a while loop using Timers.
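var Timer timer = null // global variable, declared at the top of the .rules file

// inside the Rule that would otherwise contain the while loop:
timer = createTimer(now, [ |
// some code
if(condition) timer.reschedule(now.plusMillis(100)) // keep "looping" while the condition holds
else timer = null // done, the condition no longer holds
])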
But what if there is no other way?
Rarely you will encounter a situation where there really is no other way but to use a long Thread::sleep or Lock or long running action in a Rule. If this is the case, writing such a Rule safely will require the Rules developer to have a very good understanding of ALL of their Rules and of the events that occur in their system including how often and at what times.
If we look back at the “Motion sensor” Rule above, one of the big problems was that the Rule itself triggers faster than it takes for the Rule to execute. This is a situation that must be avoided. So the Rules developer must understand the maximum rate that their events can occur and make sure the Rule takes less time than that to complete.
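A rough back-of-the-envelope check makes this concrete. It assumes events arrive at a steady interval T, every run of the Rule takes the same time D, and the pool has N threads. These symbols are introduced here for illustration only.

```latex
\[
  \text{instances wanted at once} \approx \frac{D}{T},
  \qquad
  \text{the queue stays bounded only if } \frac{D}{T} \le N .
\]
\[
  \text{``Motion sensor'' example: } \frac{D}{T} = \frac{300\ \mathrm{s}}{30\ \mathrm{s}} = 10 > N = 5 .
\]
```

Keeping D below T means a single event source never needs more than one thread at a time, which leaves the rest of the pool for your other Rules.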
Make sure to understand how Rules interact. When one event occurs, is more than one Rule triggered? Are two independent long running Rules likely to trigger at or near the same time? As you make changes to or add new Rules, you need to revisit all of your long running Rules again to make sure they are still safe to run.
The ability to use long running Rules safely requires a deep understanding of your own home automation. Often, even for advanced developers, avoiding the long running Rules in the first place is easier.
The Aspirin Fix
Finally, you can treat the symptom of the problem by increasing the number of Rules execution threads. I call this the Aspirin Fix because it is like treating a broken bone with just Aspirin. It might deaden the pain, but it doesn’t really treat the real problem. Increasing the number of threads takes some of the pressure off so it takes more simultaneous Rules before you run out of threads, but the Rules instances are still piling up and may become a problem in the future.
To increase the number of Rules threads, add the following line to conf/services/runtime.cfg (/etc/openhab2/services/runtime.cfg on installed OH):
org.eclipse.smarthome.threadpool:RuleEngine=5
Note that 5 is the default, so set a larger number to actually increase the pool. The larger the value you use, the more RAM your OH will require.
On another thread, New Add-on bundle for Prometheus health Metrics - #3 by friesenkiwi, @friesenkiwi and team are working on a way to export openHAB metrics to Prometheus, a centralized health and status metrics server that uses Grafana to display and explore various metrics. At my request they just added the OH thread pools as one of the metrics, which some here may find useful.