Cron based rules suddenly stop executing

Could you please post your rule and script?

Iā€™m going to guess that your executeCommandLine is taking longer to execute than your Rule is triggering. The timeout in executeCommandLine must be less, perhaps significantly less, than the polling period of your Rule.

So, for example, if your Rule is executing every five seconds, the timeout on executeCommandLine should be no more than 4500 (lets give it half a second to do all the stuff before and after the call to executeCommandLine).

Now, there can be a potential bug here. If executeCommandLine doesnā€™t actually timeout in 4.5 seconds and return but instead blocks until the script called by executeCommandLine returns no matter how long it takes.

Iā€™d like to see this potential bug ruled out so like Scott I eagerly await the posting of your actual Rule.

But, as a workaround you can export this looping to the script itself and have that script push the result to OH. Then OH will react to the data instead of tying up a thread awaiting the value to be returned. Iā€™ve done similar stuff using https://github.com/rkoshak/sensorReporter. That is what the execSensor.py does.

Maybe an important update. When putting both executeCommandLine in the same rule, then the problem disappears.

It seems that the problem only pops up when

  • using multiple executeCommandLine calls from different rules, meaning that we can have multiple executeCommandLine() running at the same time
  • using a timeout
  • (maybe, too) calling scripts that take several seconds to complete

rule which is running fine

rule "scipts"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py",5000)
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py",5000)
end

before, I had them launched from different rules files:

rule "Spotify scheduled refresh"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py")
end

and

rule "Scheduled sync with redis"
when 
    Time cron "0/15 * * * * ?"   // every 15 seconds
then
//logDebug("hvd","sync redis start")
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py") //5000 ms timeout verwijderd
	//logDebug("hvd","sync redis end")
end

scripts:

from myopenhab import openhab
import redis


oh=openhab()

#open connectie met de redis database in tuinhuis
r = redis.Redis(host='blabla', port=6379)

#lees waarden uit redis en geef ze door aan openHAB
oh.sendCommand('TemperatuurTuinhuis',r.get('TemperatuurTuinhuis'))
oh.sendCommand('WarmtepompLuchtIn',r.get('WarmtepompLuchtIn'))
oh.sendCommand('WarmtepompLuchtUit',r.get('WarmtepompLuchtUit'))
oh.sendCommand('TuinhuisVerwarming',r.get('TuinhuisVerwarming'))
oh.sendCommand('PentairCurrentRpm',r.get('PentairCurrentRpm'))
oh.sendCommand('PentairCurrentPowerConsumption',r.get('PentairCurrentPowerConsumption'))
oh.sendCommand('PentairCurrentClock',r.get('PentairCurrentClock'))
oh.sendCommand('PentairStatus',r.get('PentairStatus'))
oh.sendCommand('WarmtepompStatus',r.get('WarmtepompStatus'))

if r.get('PentairStatus')=='10':
	oh.sendCommand('PentairWattPerRpm',float(r.get('PentairCurrentPowerConsumption'))/float(r.get('PentairCurrentRpm')))
else:
	oh.sendCommand('PentairWattPerRpm','0.00')
#watertemperaturen alleen loggen als de Pentair pomp draait, rekening houdend met opstart-timeout.
if oh.getState('WaterTemperaturesValid') == 'ON':
	oh.sendCommand('FilterWaterUit',r.get('FilterWaterUit'))
	oh.sendCommand('WarmtepompWaterIn',r.get('WarmtepompWaterIn'))
	oh.sendCommand('WarmtepompWaterUit',r.get('WarmtepompWaterUit'))

else:
	oh.sendCommand('FilterWaterUit','NULL')
	oh.sendCommand('WarmtepompWaterIn','NULL')
	oh.sendCommand('WarmtepompWaterUit','NULL')

	
#Setpoints vanuit openHAB doorgeven aan redis
r.set('PentairRpmSetpoint', oh.getState('PentairRpmSetpoint'))
r.set('WarmtepompCommando', oh.getState('WarmtepompCommando'))

I had the same thought and tested this hypothesis. Runtime of both scripts is in the 3s-4s range. Timeout setting is 5000ms. Initially repetition rate was 5s. When increasing to 10s or 16s, the problem persisted. And like I said, problem disappeared by leaving out the timeout and not changing the repetition rate of the executeCommandLine() calls.

That was my fallback, but I wanted to contribute as much diagnostics and analysis to the community as possible.

See above. Frankly, I think this is not adding much value. I am afraid that the focus will move to design of my scripts and the way I connect to remote resources, effectively diverting attention from the issue at hand. Keep in mind that my original scripts and rules have been running non stop for 4 months. This problem popped up when upgrading from OH2.2 to OH2.3.

Just to be clear, when you leave out the timeout the call to executeCommandLine immediately returns and any output from the called script gets lost. Since you donā€™t care about the return value anyway, there is no reason to use the timeout. I think (hope) that executeCommandLine will not tie up a thread somewhere if the script never returns.

Indeed, I am not interested in exit code or return value from the scripts. They use the REST API to update items and then exit. From the behaviour of openHAB I have the impression that they always exit in a clean way, but I have no solid proof.

I could implement logging in the Python scripts to find out whether one of them intermittently exits on an uncaught exception. If the team would be interested, I could give this a try.

Since the problem goes away when the timeout isnā€™t present I donā€™t think there is much benefit to getting down to the source of the problem by doing so.

My cron rules stopped working as well (see here: Rules stop execution)

I use quite a bunch of executeCommandline (90), where I sometimes needed longer timeouts like 20000ms because requesting the state of my car for instance took longer.

How can I avoid these?
I could replace some executecommandline("curl... by using the http binding.
Maybe thatā€™s a start!?

Any ā€œinterrupted exceptionā€ in log files.

Put quarts into the trace logging and see if it crashes.

Itā€™s still the biggest reason for instability I have at the moment. Iā€™m forced to reboot often and have assign 50 threads to cron to. ā€œbufferā€ the problem :frowning:

If you are polling the API, the HTTP binding would indeed help. If not, move the polling out of OH to an external script that polls the API and returns the result without forcing OH to wait. See my sensorReporter for an example. I send a MQTT message, the script does the commands and returns the result as a MQTT message. OH doesnā€™t have to wait around for the command to execute.

A 20 second executecommandline means you are tieing up 50% of the threads you Cron triggered rules, Astro triggered rules, and Timers user to execute. It wouldnā€™t take much to use them all up.

In 2.5 ME the pool had been increased to 10.

I will look into that - thanks!

Evenry now and then I get one - but what do you mean with:

I forgot the link.

I get loads of these in the logs.

2019-08-11 18:43:18.689 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 18:41:48.579 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]

I get them when I have a particular rule running under cron. It fires every minute and will, on occasion, cause this. Each time one of these appears in the logs I lose a quartz thread. When the count of these exceptions matches the number of threads then no cron rules will run.

Iā€™ve tried for a long time to work around it and seek help but it seems very specific to me and my setup (It has to do with the Xiaomi binding.) I cannot create a working example for others to try and debug :frowning:

My solution:
Using wget, in a cron job, call the rest API for an item every minute. When that item changes to ON the rule fires. Itā€™s a hack but it works.

Setting logs to debug for quartz:

 log:set DEBUG org.quartz.core

Other threads where I have talked about it:


https://community.openhab.org/search?q=cron%20%40crispin

OK, so we know from the error that something is taking more than 5 seconds to run which is the root cause of the problem. What is that something? Is there a way you can move that something outside of OH entirely (e.g. an external script that OH communicates asynchronously like using MQTT messages)?

Nothing external if Iā€™m assuming bindings are considered internal. The binding I suspect is the cause is the xiaomi binding. It communicates with the gateway which then uses, I believe, zwave to talk to the devices. Perhaps thatā€™s taking 5 seconds?

It could take that long but it wouldnā€™t be a blocking action so it wouldnā€™t hold on to a Timer body or cron/Astro triggered Rule for all that time. It would immediately return and the binding would process the command in the background.

From Rules your interaction with the binding is done via commands sent to Items. That command gets put on the event bus and sendCommand immediately returns. No waiting.

To solve this, you need to find what could possibly be taking five seconds to complete in Timer bodies, cron triggered Rules or Astro triggered Rules. There may be other bindings that use Quartz to schedule future events as well.

Nothing :slight_smile:
No (long) sleeps, no endless loops. Some http calls but they all 2s timeout. All timers either toggle a switch which might be turning on the power etc or it inturn fires another rule. Anything else is done via timers.
This error and associated exception only happen when I execute the rule with cron. Trigger by a switch which is toggled from api via OS cron? Never errors.
Triyger it via switch toggled from the rule triggered by cron? Errors will appear.

Iā€™m now running from OS cron again. Will seen after a week.

May we see the (recent version of) the cron rule that messes up, and the Item triggered version that doesnā€™t?

Sure, both version below :slight_smile:


rule "CalcLastMotionAgo"
when     
//    Time cron "0 0/1 * 1/1 * ? *" //every minute
//    or 
    Item RecalcMotion changed to ON
then
    logWarn("Calc Last Motion ago", "Calculating last-motion")
    MotionLastCalced.sendCommand(now.toString)
    g_LastMotionAgo?.allMembers.forEach[item |

        if (item === null){
                logWarn("Calc Last Motion ago", "Item item was null. Very odd")
                return;
        }

//        val iName = item.name
//        if (item.state == NULL){
//                logWarn("Calc Last Motion ago", "Item " + iName + " was NULL. item: " + item.name)
//                //sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), "0")
//                item.sendCommand(now.toString)
//                return;
//        }

        val motionAgo = (now.millis - (item.state as DateTimeType).zonedDateTime.toInstant.toEpochMilli) / 1000 / 60
        sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), motionAgo.toString)
    ]
    RecalcMotion.sendCommand(OFF)
end 

Here are the logs from yesterday during the day

2019-08-11 14:49:29.060 [INFO ] [ome.model.script.rooms.climate.rules] - evohome_Wetroom_Temperature_Warmup is not on as we're not controlling. We're not interfering
2019-08-11 14:54:49.859 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:55:44.878 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:56:44.892 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:57:44.931 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:58:44.915 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:00.039 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityElectricUnitsH
2019-08-11 15:00:00.035 [WARN ] [pse.smarthome.model.script.Heartbeat] - ping
2019-08-11 14:59:44.947 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:33.978 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityGasUnitsH
2019-08-11 15:01:24.961 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:03:25.496 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:05:25.377 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:07:25.670 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:07:05.145 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:08:30.187 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:09:24.442 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was opened
2019-08-11 15:09:30.970 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was closed
2019-08-11 15:09:30.245 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:11:30.219 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:12:30.266 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:13:30.222 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:14:30.288 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
------snip
2019-08-11 21:14:17.540 [WARN ] [.model.script.Blinds Action list Add] - Key FrontRoomRight already in list. Count is now 8
2019-08-11 21:14:17.512 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
2019-08-11 21:14:00.040 [WARN ] [e.model.script.Blind Add Action Item] - FrontRoomRight|BlindsMotorControl_FrontRoomRight|DOWN|100
2019-08-11 21:13:17.625 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.

There are a lot of them in the ā€œsnipā€ section.
Last night, whilst we were talking, I changed it back to the OS-based cron option and not a single issue since.
This runs every minute

wget -q --spider http://homeauto.home:8080/classicui/CMD?RecalcMotion=ON

Now, if I have another cron-based rule and toggle the switch to on I get the same errors in the log and eventually quartz dies leaving OH ā€œpointlessā€ :frowning:

note the ping line in the logs. This is a cron rule which simply writes the word ping to the logs. Itā€™s what I use to manually see if quartz has died.
No other cron jobs in OH have this problem. Just this one.

items

DateTime    Hallway_MotionSensor_LastMotion     "[%1$tY-%1$tm-%1$td  %1$tH:%1$tM]"  <clock-on>      (g_LastMotionAgo)               { channel="mihome:sensor_motion_aq2:158d0001e5266c:lastMotion" }

things

    Thing mihome:sensor_motion_aq2:158d0001e5266c "Sensor - Motion - Hallway" [itemId="158d0001e5266c"] //backdoor door

So, it loops over all the items from the Xiaomi sensors, gets the datetime value from them, works out the difference to now in minutes and then pushes that into the associated item.

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item. Iā€™ve read the binding code and can see where it does this but a) donā€™t understand enough about java and the framework to see if there are any issues and b) that theory still conflicts with my understanding of what framework does and with what @rlkoshak said above. It should not deadlock.

*apologies to OP for the hijack but perhaps this actually helps as it seems I am not alone :slight_smile:

I have an every minute rule using
Time cron "0 0/1 * * * ?" // each minute
which of course shouldnā€™t matter.

It may be that the crucial difference between cron trigger and external periodic trigger is synchronizing with other internal processes.
I would imagine, for example, some of these items are persisted. Iā€™d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

Some http calls but they all 2s timeout.

I have a vague recollection that the timeout may not be followed in the sendHttp*Request commands. It might be worth while adding some logging before and after those calls to verify that the timeout is actually working.

Crispin:

Anything else is done via timers.

Itā€™s the anything else done via timers that is relevant right now.

Crispin:

This error and associated exception only happen when I execute the rule with cron.

But the error could be caused by other Timers blocking the cron triggerā€™s Ruleā€™s ability to run within five seconds of being triggered. The error isnā€™t very clear that itā€™s complaining that a Timer or cron triggered rule took more than five seconds to complete or that it took more than five seconds for a thread to free up so it could execute the Rule.

Crispin:

Sure, both version below

There is nothing in that Rule that should take a long time to run, unless you have thousands of members in g_LastMotionAgo. So perhaps the real problem is the latter of the two described above, itā€™s taking longer than five seconds for the Rule to start executing because all the threads are currently in use for five seconds.

Crispin:

not a single issue since.
This runs every minute

That makes sense because with the OS cron trigger, the Rule is running out of the rules thread pool, not the Quartz thread pool.

The more I look at it the less I think itā€™s this Rule that is causing the problem. Somewhere else you have a Timer body or cron/Astro triggered Rule that is taking longer than expected or crashing in a way that doesnā€™t return the thread to the Quartz thread pool. Then when this Rule tries to run, it takes more than five seconds to start running and OH sees that delay and throws an error.

Crispin:

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item.

That all happens in the background using other OH thread pools and would not impact the Quartz thread pool. When you sendCommand to an Item the sendCommand doesnā€™t block until the binding processes it. It immediately returns once the command is placed on the Event Bus which should take mere microseconds.

rossko57:

I would imagine, for example, some of these items are persisted. Iā€™d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

Thatā€™s an interesting theory. You can see if thatā€™s the case by adding logging statements before and after the sendCommand/postUpdates and checking the timestamps. That will also show that anything going on in the binding is also not blocking the Rule from running.