Could you please post your rule and script?
Iām going to guess that your executeCommandLine is taking longer to execute than your Rule is triggering. The timeout in executeCommandLine must be less, perhaps significantly less, than the polling period of your Rule.
So, for example, if your Rule is executing every five seconds, the timeout on executeCommandLine should be no more than 4500 (lets give it half a second to do all the stuff before and after the call to executeCommandLine).
Now, there can be a potential bug here. If executeCommandLine doesnāt actually timeout in 4.5 seconds and return but instead blocks until the script called by executeCommandLine returns no matter how long it takes.
Iād like to see this potential bug ruled out so like Scott I eagerly await the posting of your actual Rule.
But, as a workaround you can export this looping to the script itself and have that script push the result to OH. Then OH will react to the data instead of tying up a thread awaiting the value to be returned. Iāve done similar stuff using https://github.com/rkoshak/sensorReporter. That is what the execSensor.py does.
Maybe an important update. When putting both executeCommandLine in the same rule, then the problem disappears.
It seems that the problem only pops up when
- using multiple executeCommandLine calls from different rules, meaning that we can have multiple executeCommandLine() running at the same time
- using a timeout
- (maybe, too) calling scripts that take several seconds to complete
rule which is running fine
rule "scipts"
when
Time cron "0/12 * * * * ?"
then
executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py",5000)
executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py",5000)
end
before, I had them launched from different rules files:
rule "Spotify scheduled refresh"
when
Time cron "0/12 * * * * ?"
then
executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py")
end
and
rule "Scheduled sync with redis"
when
Time cron "0/15 * * * * ?" // every 15 seconds
then
//logDebug("hvd","sync redis start")
executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py") //5000 ms timeout verwijderd
//logDebug("hvd","sync redis end")
end
scripts:
- spotify.py: the spotify API taken from GitHub - pmpkk/openhab_spotify-webconnect-api: A simple Python script to integrate Spotify's Web Connect API (https://developer.spotify.com/web-api/web-api-connect-endpoint-reference/) to OpenHab (openhab.org)
- tuinhuis-zwembad.py: connect to remote Redis instance, read and write in both directions
from myopenhab import openhab
import redis
oh=openhab()
#open connectie met de redis database in tuinhuis
r = redis.Redis(host='blabla', port=6379)
#lees waarden uit redis en geef ze door aan openHAB
oh.sendCommand('TemperatuurTuinhuis',r.get('TemperatuurTuinhuis'))
oh.sendCommand('WarmtepompLuchtIn',r.get('WarmtepompLuchtIn'))
oh.sendCommand('WarmtepompLuchtUit',r.get('WarmtepompLuchtUit'))
oh.sendCommand('TuinhuisVerwarming',r.get('TuinhuisVerwarming'))
oh.sendCommand('PentairCurrentRpm',r.get('PentairCurrentRpm'))
oh.sendCommand('PentairCurrentPowerConsumption',r.get('PentairCurrentPowerConsumption'))
oh.sendCommand('PentairCurrentClock',r.get('PentairCurrentClock'))
oh.sendCommand('PentairStatus',r.get('PentairStatus'))
oh.sendCommand('WarmtepompStatus',r.get('WarmtepompStatus'))
if r.get('PentairStatus')=='10':
oh.sendCommand('PentairWattPerRpm',float(r.get('PentairCurrentPowerConsumption'))/float(r.get('PentairCurrentRpm')))
else:
oh.sendCommand('PentairWattPerRpm','0.00')
#watertemperaturen alleen loggen als de Pentair pomp draait, rekening houdend met opstart-timeout.
if oh.getState('WaterTemperaturesValid') == 'ON':
oh.sendCommand('FilterWaterUit',r.get('FilterWaterUit'))
oh.sendCommand('WarmtepompWaterIn',r.get('WarmtepompWaterIn'))
oh.sendCommand('WarmtepompWaterUit',r.get('WarmtepompWaterUit'))
else:
oh.sendCommand('FilterWaterUit','NULL')
oh.sendCommand('WarmtepompWaterIn','NULL')
oh.sendCommand('WarmtepompWaterUit','NULL')
#Setpoints vanuit openHAB doorgeven aan redis
r.set('PentairRpmSetpoint', oh.getState('PentairRpmSetpoint'))
r.set('WarmtepompCommando', oh.getState('WarmtepompCommando'))
I had the same thought and tested this hypothesis. Runtime of both scripts is in the 3s-4s range. Timeout setting is 5000ms. Initially repetition rate was 5s. When increasing to 10s or 16s, the problem persisted. And like I said, problem disappeared by leaving out the timeout and not changing the repetition rate of the executeCommandLine() calls.
That was my fallback, but I wanted to contribute as much diagnostics and analysis to the community as possible.
See above. Frankly, I think this is not adding much value. I am afraid that the focus will move to design of my scripts and the way I connect to remote resources, effectively diverting attention from the issue at hand. Keep in mind that my original scripts and rules have been running non stop for 4 months. This problem popped up when upgrading from OH2.2 to OH2.3.
Just to be clear, when you leave out the timeout the call to executeCommandLine immediately returns and any output from the called script gets lost. Since you donāt care about the return value anyway, there is no reason to use the timeout. I think (hope) that executeCommandLine will not tie up a thread somewhere if the script never returns.
Indeed, I am not interested in exit code or return value from the scripts. They use the REST API to update items and then exit. From the behaviour of openHAB I have the impression that they always exit in a clean way, but I have no solid proof.
I could implement logging in the Python scripts to find out whether one of them intermittently exits on an uncaught exception. If the team would be interested, I could give this a try.
Since the problem goes away when the timeout isnāt present I donāt think there is much benefit to getting down to the source of the problem by doing so.
My cron rules stopped working as well (see here: Rules stop execution)
I use quite a bunch of executeCommandline (90), where I sometimes needed longer timeouts like 20000ms because requesting the state of my car for instance took longer.
How can I avoid these?
I could replace some executecommandline("curl...
by using the http binding.
Maybe thatās a start!?
Any āinterrupted exceptionā in log files.
Put quarts into the trace logging and see if it crashes.
Itās still the biggest reason for instability I have at the moment. Iām forced to reboot often and have assign 50 threads to cron to. ābufferā the problem
If you are polling the API, the HTTP binding would indeed help. If not, move the polling out of OH to an external script that polls the API and returns the result without forcing OH to wait. See my sensorReporter for an example. I send a MQTT message, the script does the commands and returns the result as a MQTT message. OH doesnāt have to wait around for the command to execute.
A 20 second executecommandline means you are tieing up 50% of the threads you Cron triggered rules, Astro triggered rules, and Timers user to execute. It wouldnāt take much to use them all up.
In 2.5 ME the pool had been increased to 10.
I will look into that - thanks!
Evenry now and then I get one - but what do you mean with:
I forgot the link.
I get loads of these in the logs.
2019-08-11 18:43:18.689 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 18:41:48.579 [ERROR] [org.quartz.simpl.SimpleThreadPool ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[?:?]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
I get them when I have a particular rule running under cron. It fires every minute and will, on occasion, cause this. Each time one of these appears in the logs I lose a quartz thread. When the count of these exceptions matches the number of threads then no cron rules will run.
Iāve tried for a long time to work around it and seek help but it seems very specific to me and my setup (It has to do with the Xiaomi binding.) I cannot create a working example for others to try and debug
My solution:
Using wget, in a cron job, call the rest API for an item every minute. When that item changes to ON the rule fires. Itās a hack but it works.
Setting logs to debug for quartz:
log:set DEBUG org.quartz.core
Other threads where I have talked about it:
OK, so we know from the error that something is taking more than 5 seconds to run which is the root cause of the problem. What is that something? Is there a way you can move that something outside of OH entirely (e.g. an external script that OH communicates asynchronously like using MQTT messages)?
Nothing external if Iām assuming bindings are considered internal. The binding I suspect is the cause is the xiaomi binding. It communicates with the gateway which then uses, I believe, zwave to talk to the devices. Perhaps thatās taking 5 seconds?
It could take that long but it wouldnāt be a blocking action so it wouldnāt hold on to a Timer body or cron/Astro triggered Rule for all that time. It would immediately return and the binding would process the command in the background.
From Rules your interaction with the binding is done via commands sent to Items. That command gets put on the event bus and sendCommand immediately returns. No waiting.
To solve this, you need to find what could possibly be taking five seconds to complete in Timer bodies, cron triggered Rules or Astro triggered Rules. There may be other bindings that use Quartz to schedule future events as well.
Nothing
No (long) sleeps, no endless loops. Some http calls but they all 2s timeout. All timers either toggle a switch which might be turning on the power etc or it inturn fires another rule. Anything else is done via timers.
This error and associated exception only happen when I execute the rule with cron. Trigger by a switch which is toggled from api via OS cron? Never errors.
Triyger it via switch toggled from the rule triggered by cron? Errors will appear.
Iām now running from OS cron again. Will seen after a week.
May we see the (recent version of) the cron rule that messes up, and the Item triggered version that doesnāt?
Sure, both version below
rule "CalcLastMotionAgo"
when
// Time cron "0 0/1 * 1/1 * ? *" //every minute
// or
Item RecalcMotion changed to ON
then
logWarn("Calc Last Motion ago", "Calculating last-motion")
MotionLastCalced.sendCommand(now.toString)
g_LastMotionAgo?.allMembers.forEach[item |
if (item === null){
logWarn("Calc Last Motion ago", "Item item was null. Very odd")
return;
}
// val iName = item.name
// if (item.state == NULL){
// logWarn("Calc Last Motion ago", "Item " + iName + " was NULL. item: " + item.name)
// //sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), "0")
// item.sendCommand(now.toString)
// return;
// }
val motionAgo = (now.millis - (item.state as DateTimeType).zonedDateTime.toInstant.toEpochMilli) / 1000 / 60
sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), motionAgo.toString)
]
RecalcMotion.sendCommand(OFF)
end
Here are the logs from yesterday during the day
2019-08-11 14:49:29.060 [INFO ] [ome.model.script.rooms.climate.rules] - evohome_Wetroom_Temperature_Warmup is not on as we're not controlling. We're not interfering
2019-08-11 14:54:49.859 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:55:44.878 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:56:44.892 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:57:44.931 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:58:44.915 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:00.039 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityElectricUnitsH
2019-08-11 15:00:00.035 [WARN ] [pse.smarthome.model.script.Heartbeat] - ping
2019-08-11 14:59:44.947 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:33.978 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityGasUnitsH
2019-08-11 15:01:24.961 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:03:25.496 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:05:25.377 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:07:25.670 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:07:05.145 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:08:30.187 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:09:24.442 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was opened
2019-08-11 15:09:30.970 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was closed
2019-08-11 15:09:30.245 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:11:30.219 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:12:30.266 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:13:30.222 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:14:30.288 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
------snip
2019-08-11 21:14:17.540 [WARN ] [.model.script.Blinds Action list Add] - Key FrontRoomRight already in list. Count is now 8
2019-08-11 21:14:17.512 [ERROR] [org.quartz.simpl.SimpleThreadPool ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[?:?]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
2019-08-11 21:14:00.040 [WARN ] [e.model.script.Blind Add Action Item] - FrontRoomRight|BlindsMotorControl_FrontRoomRight|DOWN|100
2019-08-11 21:13:17.625 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
There are a lot of them in the āsnipā section.
Last night, whilst we were talking, I changed it back to the OS-based cron option and not a single issue since.
This runs every minute
wget -q --spider http://homeauto.home:8080/classicui/CMD?RecalcMotion=ON
Now, if I have another cron-based rule and toggle the switch to on I get the same errors in the log and eventually quartz dies leaving OH āpointlessā
note the ping line in the logs. This is a cron rule which simply writes the word ping to the logs. Itās what I use to manually see if quartz has died.
No other cron jobs in OH have this problem. Just this one.
items
DateTime Hallway_MotionSensor_LastMotion "[%1$tY-%1$tm-%1$td %1$tH:%1$tM]" <clock-on> (g_LastMotionAgo) { channel="mihome:sensor_motion_aq2:158d0001e5266c:lastMotion" }
things
Thing mihome:sensor_motion_aq2:158d0001e5266c "Sensor - Motion - Hallway" [itemId="158d0001e5266c"] //backdoor door
So, it loops over all the items from the Xiaomi sensors, gets the datetime value from them, works out the difference to now in minutes and then pushes that into the associated item.
My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item. Iāve read the binding code and can see where it does this but a) donāt understand enough about java and the framework to see if there are any issues and b) that theory still conflicts with my understanding of what framework does and with what @rlkoshak said above. It should not deadlock.
*apologies to OP for the hijack but perhaps this actually helps as it seems I am not alone
I have an every minute rule using
Time cron "0 0/1 * * * ?" // each minute
which of course shouldnāt matter.
It may be that the crucial difference between cron trigger and external periodic trigger is synchronizing with other internal processes.
I would imagine, for example, some of these items are persisted. Iād be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.
Some http calls but they all 2s timeout.
I have a vague recollection that the timeout may not be followed in the sendHttp*Request commands. It might be worth while adding some logging before and after those calls to verify that the timeout is actually working.
Crispin:
Anything else is done via timers.
Itās the anything else done via timers that is relevant right now.
Crispin:
This error and associated exception only happen when I execute the rule with cron.
But the error could be caused by other Timers blocking the cron triggerās Ruleās ability to run within five seconds of being triggered. The error isnāt very clear that itās complaining that a Timer or cron triggered rule took more than five seconds to complete or that it took more than five seconds for a thread to free up so it could execute the Rule.
Crispin:
Sure, both version below
There is nothing in that Rule that should take a long time to run, unless you have thousands of members in g_LastMotionAgo. So perhaps the real problem is the latter of the two described above, itās taking longer than five seconds for the Rule to start executing because all the threads are currently in use for five seconds.
Crispin:
not a single issue since.
This runs every minute
That makes sense because with the OS cron trigger, the Rule is running out of the rules thread pool, not the Quartz thread pool.
The more I look at it the less I think itās this Rule that is causing the problem. Somewhere else you have a Timer body or cron/Astro triggered Rule that is taking longer than expected or crashing in a way that doesnāt return the thread to the Quartz thread pool. Then when this Rule tries to run, it takes more than five seconds to start running and OH sees that delay and throws an error.
Crispin:
My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item.
That all happens in the background using other OH thread pools and would not impact the Quartz thread pool. When you sendCommand to an Item the sendCommand doesnāt block until the binding processes it. It immediately returns once the command is placed on the Event Bus which should take mere microseconds.
rossko57:
I would imagine, for example, some of these items are persisted. Iād be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.
Thatās an interesting theory. You can see if thatās the case by adding logging statements before and after the sendCommand/postUpdates and checking the timestamps. That will also show that anything going on in the binding is also not blocking the Rule from running.