Cron based rules suddenly stop executing

I’m going to guess that your executeCommandLine is taking longer to execute than your Rule is triggering. The timeout in executeCommandLine must be less, perhaps significantly less, than the polling period of your Rule.

So, for example, if your Rule is executing every five seconds, the timeout on executeCommandLine should be no more than 4500 (lets give it half a second to do all the stuff before and after the call to executeCommandLine).

Now, there can be a potential bug here. If executeCommandLine doesn’t actually timeout in 4.5 seconds and return but instead blocks until the script called by executeCommandLine returns no matter how long it takes.

I’d like to see this potential bug ruled out so like Scott I eagerly await the posting of your actual Rule.

But, as a workaround you can export this looping to the script itself and have that script push the result to OH. Then OH will react to the data instead of tying up a thread awaiting the value to be returned. I’ve done similar stuff using https://github.com/rkoshak/sensorReporter. That is what the execSensor.py does.

Maybe an important update. When putting both executeCommandLine in the same rule, then the problem disappears.

It seems that the problem only pops up when

  • using multiple executeCommandLine calls from different rules, meaning that we can have multiple executeCommandLine() running at the same time
  • using a timeout
  • (maybe, too) calling scripts that take several seconds to complete

rule which is running fine

rule "scipts"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py",5000)
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py",5000)
end

before, I had them launched from different rules files:

rule "Spotify scheduled refresh"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py")
end

and

rule "Scheduled sync with redis"
when 
    Time cron "0/15 * * * * ?"   // every 15 seconds
then
//logDebug("hvd","sync redis start")
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py") //5000 ms timeout verwijderd
	//logDebug("hvd","sync redis end")
end

scripts:

from myopenhab import openhab
import redis


oh=openhab()

#open connectie met de redis database in tuinhuis
r = redis.Redis(host='blabla', port=6379)

#lees waarden uit redis en geef ze door aan openHAB
oh.sendCommand('TemperatuurTuinhuis',r.get('TemperatuurTuinhuis'))
oh.sendCommand('WarmtepompLuchtIn',r.get('WarmtepompLuchtIn'))
oh.sendCommand('WarmtepompLuchtUit',r.get('WarmtepompLuchtUit'))
oh.sendCommand('TuinhuisVerwarming',r.get('TuinhuisVerwarming'))
oh.sendCommand('PentairCurrentRpm',r.get('PentairCurrentRpm'))
oh.sendCommand('PentairCurrentPowerConsumption',r.get('PentairCurrentPowerConsumption'))
oh.sendCommand('PentairCurrentClock',r.get('PentairCurrentClock'))
oh.sendCommand('PentairStatus',r.get('PentairStatus'))
oh.sendCommand('WarmtepompStatus',r.get('WarmtepompStatus'))

if r.get('PentairStatus')=='10':
	oh.sendCommand('PentairWattPerRpm',float(r.get('PentairCurrentPowerConsumption'))/float(r.get('PentairCurrentRpm')))
else:
	oh.sendCommand('PentairWattPerRpm','0.00')
#watertemperaturen alleen loggen als de Pentair pomp draait, rekening houdend met opstart-timeout.
if oh.getState('WaterTemperaturesValid') == 'ON':
	oh.sendCommand('FilterWaterUit',r.get('FilterWaterUit'))
	oh.sendCommand('WarmtepompWaterIn',r.get('WarmtepompWaterIn'))
	oh.sendCommand('WarmtepompWaterUit',r.get('WarmtepompWaterUit'))

else:
	oh.sendCommand('FilterWaterUit','NULL')
	oh.sendCommand('WarmtepompWaterIn','NULL')
	oh.sendCommand('WarmtepompWaterUit','NULL')

	
#Setpoints vanuit openHAB doorgeven aan redis
r.set('PentairRpmSetpoint', oh.getState('PentairRpmSetpoint'))
r.set('WarmtepompCommando', oh.getState('WarmtepompCommando'))

I had the same thought and tested this hypothesis. Runtime of both scripts is in the 3s-4s range. Timeout setting is 5000ms. Initially repetition rate was 5s. When increasing to 10s or 16s, the problem persisted. And like I said, problem disappeared by leaving out the timeout and not changing the repetition rate of the executeCommandLine() calls.

That was my fallback, but I wanted to contribute as much diagnostics and analysis to the community as possible.

See above. Frankly, I think this is not adding much value. I am afraid that the focus will move to design of my scripts and the way I connect to remote resources, effectively diverting attention from the issue at hand. Keep in mind that my original scripts and rules have been running non stop for 4 months. This problem popped up when upgrading from OH2.2 to OH2.3.

Just to be clear, when you leave out the timeout the call to executeCommandLine immediately returns and any output from the called script gets lost. Since you don’t care about the return value anyway, there is no reason to use the timeout. I think (hope) that executeCommandLine will not tie up a thread somewhere if the script never returns.

Indeed, I am not interested in exit code or return value from the scripts. They use the REST API to update items and then exit. From the behaviour of openHAB I have the impression that they always exit in a clean way, but I have no solid proof.

I could implement logging in the Python scripts to find out whether one of them intermittently exits on an uncaught exception. If the team would be interested, I could give this a try.

Since the problem goes away when the timeout isn’t present I don’t think there is much benefit to getting down to the source of the problem by doing so.

My cron rules stopped working as well (see here: Rules stop execution)

I use quite a bunch of executeCommandline (90), where I sometimes needed longer timeouts like 20000ms because requesting the state of my car for instance took longer.

How can I avoid these?
I could replace some executecommandline("curl... by using the http binding.
Maybe that’s a start!?

Any “interrupted exception” in log files.

Put quarts into the trace logging and see if it crashes.

It’s still the biggest reason for instability I have at the moment. I’m forced to reboot often and have assign 50 threads to cron to. “buffer” the problem :frowning:

If you are polling the API, the HTTP binding would indeed help. If not, move the polling out of OH to an external script that polls the API and returns the result without forcing OH to wait. See my sensorReporter for an example. I send a MQTT message, the script does the commands and returns the result as a MQTT message. OH doesn’t have to wait around for the command to execute.

A 20 second executecommandline means you are tieing up 50% of the threads you Cron triggered rules, Astro triggered rules, and Timers user to execute. It wouldn’t take much to use them all up.

In 2.5 ME the pool had been increased to 10.

I will look into that - thanks!

Evenry now and then I get one - but what do you mean with:

I forgot the link.

I get loads of these in the logs.

2019-08-11 18:43:18.689 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 18:41:48.579 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]

I get them when I have a particular rule running under cron. It fires every minute and will, on occasion, cause this. Each time one of these appears in the logs I lose a quartz thread. When the count of these exceptions matches the number of threads then no cron rules will run.

I’ve tried for a long time to work around it and seek help but it seems very specific to me and my setup (It has to do with the Xiaomi binding.) I cannot create a working example for others to try and debug :frowning:

My solution:
Using wget, in a cron job, call the rest API for an item every minute. When that item changes to ON the rule fires. It’s a hack but it works.

Setting logs to debug for quartz:

 log:set DEBUG org.quartz.core

Other threads where I have talked about it:

https://community.openhab.org/search?q=cron%20%40crispin

OK, so we know from the error that something is taking more than 5 seconds to run which is the root cause of the problem. What is that something? Is there a way you can move that something outside of OH entirely (e.g. an external script that OH communicates asynchronously like using MQTT messages)?

Nothing external if I’m assuming bindings are considered internal. The binding I suspect is the cause is the xiaomi binding. It communicates with the gateway which then uses, I believe, zwave to talk to the devices. Perhaps that’s taking 5 seconds?

It could take that long but it wouldn’t be a blocking action so it wouldn’t hold on to a Timer body or cron/Astro triggered Rule for all that time. It would immediately return and the binding would process the command in the background.

From Rules your interaction with the binding is done via commands sent to Items. That command gets put on the event bus and sendCommand immediately returns. No waiting.

To solve this, you need to find what could possibly be taking five seconds to complete in Timer bodies, cron triggered Rules or Astro triggered Rules. There may be other bindings that use Quartz to schedule future events as well.

Nothing :slight_smile:
No (long) sleeps, no endless loops. Some http calls but they all 2s timeout. All timers either toggle a switch which might be turning on the power etc or it inturn fires another rule. Anything else is done via timers.
This error and associated exception only happen when I execute the rule with cron. Trigger by a switch which is toggled from api via OS cron? Never errors.
Triyger it via switch toggled from the rule triggered by cron? Errors will appear.

I’m now running from OS cron again. Will seen after a week.

May we see the (recent version of) the cron rule that messes up, and the Item triggered version that doesn’t?

Sure, both version below :slight_smile:


rule "CalcLastMotionAgo"
when     
//    Time cron "0 0/1 * 1/1 * ? *" //every minute
//    or 
    Item RecalcMotion changed to ON
then
    logWarn("Calc Last Motion ago", "Calculating last-motion")
    MotionLastCalced.sendCommand(now.toString)
    g_LastMotionAgo?.allMembers.forEach[item |

        if (item === null){
                logWarn("Calc Last Motion ago", "Item item was null. Very odd")
                return;
        }

//        val iName = item.name
//        if (item.state == NULL){
//                logWarn("Calc Last Motion ago", "Item " + iName + " was NULL. item: " + item.name)
//                //sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), "0")
//                item.sendCommand(now.toString)
//                return;
//        }

        val motionAgo = (now.millis - (item.state as DateTimeType).zonedDateTime.toInstant.toEpochMilli) / 1000 / 60
        sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), motionAgo.toString)
    ]
    RecalcMotion.sendCommand(OFF)
end 

Here are the logs from yesterday during the day

2019-08-11 14:49:29.060 [INFO ] [ome.model.script.rooms.climate.rules] - evohome_Wetroom_Temperature_Warmup is not on as we're not controlling. We're not interfering
2019-08-11 14:54:49.859 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:55:44.878 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:56:44.892 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:57:44.931 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:58:44.915 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:00.039 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityElectricUnitsH
2019-08-11 15:00:00.035 [WARN ] [pse.smarthome.model.script.Heartbeat] - ping
2019-08-11 14:59:44.947 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:33.978 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityGasUnitsH
2019-08-11 15:01:24.961 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:03:25.496 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:05:25.377 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:07:25.670 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:07:05.145 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:08:30.187 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:09:24.442 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was opened
2019-08-11 15:09:30.970 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was closed
2019-08-11 15:09:30.245 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:11:30.219 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:12:30.266 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:13:30.222 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:14:30.288 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
------snip
2019-08-11 21:14:17.540 [WARN ] [.model.script.Blinds Action list Add] - Key FrontRoomRight already in list. Count is now 8
2019-08-11 21:14:17.512 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
2019-08-11 21:14:00.040 [WARN ] [e.model.script.Blind Add Action Item] - FrontRoomRight|BlindsMotorControl_FrontRoomRight|DOWN|100
2019-08-11 21:13:17.625 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.

There are a lot of them in the “snip” section.
Last night, whilst we were talking, I changed it back to the OS-based cron option and not a single issue since.
This runs every minute

wget -q --spider http://homeauto.home:8080/classicui/CMD?RecalcMotion=ON

Now, if I have another cron-based rule and toggle the switch to on I get the same errors in the log and eventually quartz dies leaving OH “pointless” :frowning:

note the ping line in the logs. This is a cron rule which simply writes the word ping to the logs. It’s what I use to manually see if quartz has died.
No other cron jobs in OH have this problem. Just this one.

items

DateTime    Hallway_MotionSensor_LastMotion     "[%1$tY-%1$tm-%1$td  %1$tH:%1$tM]"  <clock-on>      (g_LastMotionAgo)               { channel="mihome:sensor_motion_aq2:158d0001e5266c:lastMotion" }

things

    Thing mihome:sensor_motion_aq2:158d0001e5266c "Sensor - Motion - Hallway" [itemId="158d0001e5266c"] //backdoor door

So, it loops over all the items from the Xiaomi sensors, gets the datetime value from them, works out the difference to now in minutes and then pushes that into the associated item.

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item. I’ve read the binding code and can see where it does this but a) don’t understand enough about java and the framework to see if there are any issues and b) that theory still conflicts with my understanding of what framework does and with what @rlkoshak said above. It should not deadlock.

*apologies to OP for the hijack but perhaps this actually helps as it seems I am not alone :slight_smile:

I have an every minute rule using
Time cron "0 0/1 * * * ?" // each minute
which of course shouldn’t matter.

It may be that the crucial difference between cron trigger and external periodic trigger is synchronizing with other internal processes.
I would imagine, for example, some of these items are persisted. I’d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

Some http calls but they all 2s timeout.

I have a vague recollection that the timeout may not be followed in the sendHttp*Request commands. It might be worth while adding some logging before and after those calls to verify that the timeout is actually working.

Crispin:

Anything else is done via timers.

It’s the anything else done via timers that is relevant right now.

Crispin:

This error and associated exception only happen when I execute the rule with cron.

But the error could be caused by other Timers blocking the cron trigger’s Rule’s ability to run within five seconds of being triggered. The error isn’t very clear that it’s complaining that a Timer or cron triggered rule took more than five seconds to complete or that it took more than five seconds for a thread to free up so it could execute the Rule.

Crispin:

Sure, both version below

There is nothing in that Rule that should take a long time to run, unless you have thousands of members in g_LastMotionAgo. So perhaps the real problem is the latter of the two described above, it’s taking longer than five seconds for the Rule to start executing because all the threads are currently in use for five seconds.

Crispin:

not a single issue since.
This runs every minute

That makes sense because with the OS cron trigger, the Rule is running out of the rules thread pool, not the Quartz thread pool.

The more I look at it the less I think it’s this Rule that is causing the problem. Somewhere else you have a Timer body or cron/Astro triggered Rule that is taking longer than expected or crashing in a way that doesn’t return the thread to the Quartz thread pool. Then when this Rule tries to run, it takes more than five seconds to start running and OH sees that delay and throws an error.

Crispin:

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item.

That all happens in the background using other OH thread pools and would not impact the Quartz thread pool. When you sendCommand to an Item the sendCommand doesn’t block until the binding processes it. It immediately returns once the command is placed on the Event Bus which should take mere microseconds.

rossko57:

I would imagine, for example, some of these items are persisted. I’d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

That’s an interesting theory. You can see if that’s the case by adding logging statements before and after the sendCommand/postUpdates and checking the timestamps. That will also show that anything going on in the binding is also not blocking the Rule from running.

Interesting. I have many every-minute cron rules which fire and don’t cause a problem.
I am certain, but checking again tonight, I did have a rule which used cron to trigger the switch and the switch triggered the motion rule. This also failed. Why would it fail?
While I am sure, I am setting this up again tonight and will let it run overnight.

No, the group has 5 or 6 items in it. It’s not large. To @rossko57 point, none of the items are persisted. They’re datetime if that feeds any thought process.

I have loads of cron rules which work without a problem. I have one which runs every 30 minutes and writes out “ping”. I can certainly make it move frequent to try and exacerbate the problem.
I also have a rule, cron based, which runs every 5 minutes to check the position of the blinds. It does not falter until quartz totally gives up the ghost.

This “we’ve” spoken about before. When this error happens the log entries are all out of sequence.
See here: Help understanding quartz crashing - "Worker thread was interrupt()'ed."

As a snippet, look at the logs here:

1	2019-01-05 10:48:14.674 [INFO ] [home.model.script.Motion - resetting] - Motion from - Kitchen_MotionSensor - resetting to zero	
2	2019-01-05 10:48:12.596 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@1e2e736' takes more than 5000ms.	
3	2019-01-05 10:47:37.641 [INFO ] [home.model.script.Motion - resetting] - Motion from - Hallway_MotionSensor - resetting to zero	
4	2019-01-05 10:47:27.061 [WARN ] [me.model.script.Motion - Auto Lights] - Bathroom_MotionSensor caused auto-lights - state:OFF	
5	2019-01-05 10:47:17.012 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.	
6	2019-01-05 10:52:12.890 [WARN ] [me.model.script.Motion - Auto Lights] - Room: Bathroom	
7	2019-01-05 10:49:42.950 [DEBUG] [rg.quartz.core.QuartzSchedulerThread] - batch acquisition of 1 triggers	
8	2019-01-05 10:49:42.949 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.	
9	java.lang.InterruptedException: null	
10		at java.lang.Object.wait(Native Method) ~[?:?]
11		at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
12	2019-01-05 10:49:12.610 [INFO ] [htwaverf.internal.LightwaveRfBinding] - Update from LightwaveRf ItemName[Power_Main], State[726]	
13	2019-01-05 10:49:12.610 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.state.motion.lastmotion.rules#Calc Last Motion ago#0 0/1 * 1/1 * ? *	
14	2019-01-05 10:55:12.954 [INFO ] [htwaverf.internal.LightwaveRfBinding] - Update from LightwaveRf ItemName[Power_Solar_prxy], State[144]	
15	2019-01-05 10:54:42.934 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.	
16	java.lang.InterruptedException: null	
17		at java.lang.Object.wait(Native Method) ~[?:?]
18		at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
19	

See how out of sequence the log entries are?
That “resetting” entry is the binding triggering a motion. It then set a the motionago item for that room back to zero.
So, actually, you have two rules here: 1) counting the minutes up. When it uses an item with that binding it fails and 2) a rule which fires and then resets the counter back to zero.
Two rules and they both fail, binding in the middle.

Things to try tonight -

  1. Set up another minute-based rule in the same rules file to test with.
  2. Disable the OS cron and revert to a cron-based rule which toggles the proxy item and check for failures.
  3. to do tomorrow - remove the binding from the items. You’re theories will be proved either way if I do that. If it no longer fails then it’s confirmed to be a item+binding issue. If it still fails, then it is a rule / item / threading issue.