I’m having problems with delayed rules execution. Here are the details:
- Platform information:
- Hardware: 2 × 2.8 GHz, 6 GB RAM
- OS: Linux Mint 18.3
- Java Runtime Environment: Zulu 8.27
- openHAB version: 2.2 Release
- Issue of the topic: the execution of rules is sometimes delayed. I cannot find a pattern or any reason in the logs. Sometimes a rule is executed instantly and sometimes it is not.
- Bindings: exec1,http1,knx1,ntp,serial1
- Persistence: mysql
Logs working (check the execution time):
20:14:26.822 [INFO ] [smarthome.event.ItemCommandEvent] - Item 'Bewegung_EG_Flur' received command ON
20:14:26.823 [INFO ] [smarthome.event.ItemCommandEvent] - Item 'LED_EG_Flur' received command 60
Logs delayed (check the execution time):
15:06:13.515 [INFO ] [smarthome.event.ItemCommandEvent] - Item 'Bewegung_EG_Flur' received command ON
15:06:19.504 [INFO ] [smarthome.event.ItemCommandEvent] - Item 'LED_EG_Flur' received command 60
Additional info: Over the last three days I have been looking through other topics in this forum and found a lot of ideas, but nothing fixed my issue. At first I thought it was caused by errors in the rules from the changes between OH1 and OH2, but after double-checking everything with Visual Studio Code I have not found any problems. And sometimes the rules execute instantly and everything works correctly. Then I found a topic about slow rule execution after restarting OH, but even after a rule executed rapidly without a restart, the next execution was really late.
Any ideas how I can get more detailed information of the execution of rules in the karaf console?
Is there a command to get all loaded rules?
Do you have ReentrantLocks, Thread::sleeps, or executeCommandLines in your Rules?
What is the CPU and memory usage of the machine when the rule execution is delayed?
Thank you for your answer Rich.
I do not use locks. Thread::sleep is used in some rules, but the problem also occurs in rules without it; the same goes for executeCommandLine.
Can rules be executed in parallel in OH2, or is it possible that one rule execution blocks another?
CPU and memory usage show no abnormalities.
Do you have any advice for additional logging?
Here is what might be happening.
The Rules engine has a small number of execution threads. I think the number is 6 but the exact number doesn’t matter. Just know that it is less than 10.
This means you can have no more than 6 instances of Rules executing at the same time.
When you have Thread::sleeps and/or executeCommandLines running scripts that take more than a few dozen milliseconds to complete, you tie up one of those execution threads until the Rule returns.
When you run out of Rule execution threads, no other Rules can run until one of the running Rules finishes and returns its thread to the pool.
So, if you have long Thread::sleeps or executeCommandLines, and especially if you have more than one, the likelihood that you will run out of execution threads greatly increases.
This is why you will see warnings all over the forum to avoid long Thread::sleeps. Anything longer than 500 milliseconds is too long, and using a Timer or some other approach is necessary. You can’t really do much about the executeCommandLines, but you can about the Thread::sleeps.
I can’t say for sure that is what is happening with you, but it is the first thing I would look at.
The usual alternative to Thread::sleep is to use Timers.
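To illustrate the Timer pattern, here is a sketch in the Rules DSL. The item names are borrowed from the logs above and the one-minute delay is made up; adapt both to your setup:

```
rule "Hallway light off after motion"
when
    Item Bewegung_EG_Flur received command ON
then
    LED_EG_Flur.sendCommand(60)

    // Bad: this would block one of the few rule execution threads
    // for a full minute:
    //   Thread::sleep(60000)
    //   LED_EG_Flur.sendCommand(OFF)

    // Better: the rule returns immediately and the OFF command is
    // scheduled on a separate scheduler thread
    createTimer(now.plusMinutes(1)) [ |
        LED_EG_Flur.sendCommand(OFF)
    ]
end
```

The rule body finishes in milliseconds either way, so the execution thread goes straight back to the pool.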
@rlkoshak’s advice about Thread::sleep and executeCommandLine is spot on.
My understanding is that, out of the box, the thread pool for rules gets set to the default number of threads, which is 5. If you want to test whether that is the cause of your issue, the number of threads allocated to rule execution can be adjusted on the fly.
In conf/services/runtime.cfg, put the following parameter and set NN to something greater than 5. The number you select will depend a lot on your hardware configuration, as well as on what else is running on the box. I actually have this number set to 20 in 3 separate openHAB environments (one test box and 2 prod boxes). I’ve been monitoring for the last several days to see if there are any bad side effects (none so far).
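For reference, the entry looks roughly like this. The property name follows the `org.eclipse.smarthome.threadpool` scheme mentioned later in this thread; note that the capitalization must match the pool name (I have seen it written as `ruleEngine`), so double-check against your openHAB version:

```
# conf/services/runtime.cfg
# NN = number of threads for the rule engine pool, e.g. 10 or 20
org.eclipse.smarthome.threadpool:ruleEngine=NN
```

The change is picked up on the fly, so no restart should be needed to test different values.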
I had no idea this was possible. Hurray! I learned something today!
Thanks for posting that. Another tool added to the toolbox.
I would expect the amount of RAM needed by OH to go up a little, and if you have a deadlock somewhere in your rules you might not discover it as easily or quickly, but other than that I can’t think of anything else negative about upping it.
Thank you for your answers Rich and Mark. I will try this and give feedback. Sounds like a good solution for my problem.
Truth be told, this was a new learning for me as well (hence the reason why I’m currently testing with a higher number of threads).
@Alex2016 Also note that when you are editing rules, whenever you save a rule and for the time it takes for openHAB to reload the rule (sometimes several seconds depending on the size of the rule and the power of the hardware), other rule executions are blocked. For example, I have one rule that takes about 12 seconds to reload, so there are situations where it can take longer than you might think.
It looks great! The rules are now working as expected. Thank you both so much!
Good news! I’m relieved to hear this resolved the issue. I wasn’t so sure this would do it.
It gives us more confidence that tweaking the rule engine threads using org.eclipse.smarthome.threadpool:RuleEngine can be another tool to resolve rule performance issues.
@Alex2016 Can you let us know what value you ended up using for RuleEngine threads? It would be good to get another data point besides the value I’m using.
I ended up with 30, because I think with my rule setup I need that many parallel tasks. I have not noticed any performance issues or any other bad side effects with 30.
Rich’s answer also pointed me to my other mistake: I used Thread::sleep a bit too intensively.
I’ve been at 20 since last night on a pi3. Working well with a similar situation to Alexander. Is there a reason it’s defaulted to 5?
This is a really good question. I would also suggest using a higher number as the default.
But I think the really important point is that this setting needs to be mentioned in runtime.cfg so anyone can find it.
I don’t know anything about GitHub. Is it right that I have to make a suggestion for adding this to the file on the following page:
There’s already an open issue on GitHub related to rule threads. I’ll probably update that issue with the results from this thread.
5 is the default number of the threads when you ask for a thread pool. Unless another number is specified, that’s what you get.
If you want to know more about what RuleEngine etc. means:
The safeCall thread pool mainly controls the concurrency for handling commands and state updates.
Every ThingHandler#handleCommand and ThingHandler#handleUpdate is run through the SafeCaller in its own Thread.
The ruleEngine thread pool controls the concurrency for rule execution, basically how many rules can be executed at the same time.
For tasks which are scheduled from a ThingHandler (like script execution from the Exec binding or periodic polling) there is a separate thread pool thingHandler which can be configured.
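Assuming all three pools are configured through the same mechanism as the rule engine pool above (a sketch; the exact property names and whether every pool honors runtime.cfg may vary by openHAB 2.x version), the corresponding entries would look like this:

```
# conf/services/runtime.cfg -- thread pool sizes (values are examples)

# how many rules can execute at the same time
org.eclipse.smarthome.threadpool:ruleEngine=10

# concurrency for ThingHandler#handleCommand / #handleUpdate via the SafeCaller
org.eclipse.smarthome.threadpool:safeCall=10

# tasks scheduled from ThingHandlers, e.g. Exec binding scripts, periodic polling
org.eclipse.smarthome.threadpool:thingHandler=10
```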
Every day I learn one more thing.
Now I have no more delays on my rules.
This seems to be the most relevant thread about delayed rule execution, even if it is a bit old. I recently updated my openHAB 1.8 to the new version 2.3 and experienced quite a lot of problems. Therefore 1.8 is still the active one, and 2.3 is only running as a shadow until all problems are solved. I configured a really simple system, but with lots of things and items because of the homematic binding. After making many changes, it looks like disabling the rrd4j persistence that writes all values every minute solves the problem (682 files in rrd4j). Any idea why there seems to be a relationship? I still have influxdb on with an everyChange strategy. Changing any of the thread pools did not help.
I had a similar problem with many seconds of delay that ended up being traced back to rrd4j. I don’t have an answer for you. I limited what I wanted rrd4j to persist and started using mapdb for persisting * and restore on start.
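To illustrate the "limit what rrd4j persists" idea, a rrd4j.persist along these lines records only a hand-picked group instead of everything (the group name gHistory is made up; rrd4j needs an everyMinute strategy for its fixed-interval archives):

```
// conf/persistence/rrd4j.persist
Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    // persist only members of one group instead of "* : strategy = ..."
    gHistory* : strategy = everyChange, everyMinute
}
```

With 682 items persisted every minute, cutting this down to a few dozen drastically reduces the disk writes rrd4j generates.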
I found out that my rrd4j files are on a Synology NAS, quite a recent one. On the disk I have high load, and it was caused 99% by rrd4j. After putting it on a single disk (no redundancy) with ext4 instead of btrfs, the load is much lower, and the delay goes down as well. Without rrd4j everything is fine on btrfs too. By the way, influxdb is on the ext4 partition.
Mine was on a Synology as well. I moved it for a few months to a NUC with just a regular HDD, thinking something on the Syno was causing the issue, but the issue moved with it. After a lot of watching processes and logs I narrowed it down to rrd4j. Once I stopped trying to persist everything, the problem went away, and I recently moved it all back to the Syno. The NUC was running Ubuntu with ext4 (where the problem also existed).
Every time my rules engine stalls out I increase the thread pool a little; I’m up to 100 threads now and it’s still happening. This tells me I have a rule or rules that are not releasing their threads properly. I really need a tool that will let me see which rules are holding threads open.