Cron based rules suddenly stop executing

NCO · August 10, 2019, 6:27am

My cron rules stopped working as well (see here: Rules stop execution)

I use quite a bunch of executeCommandline (90), where I sometimes needed longer timeouts like 20000ms because requesting the state of my car for instance took longer.

How can I avoid these?
I could replace some executecommandline("curl... by using the http binding.
Maybe that’s a start!?

CDriver · August 10, 2019, 6:43am

Any “interrupted exception” in log files.

Put quarts into the trace logging and see if it crashes.

It’s still the biggest reason for instability I have at the moment. I’m forced to reboot often and have assign 50 threads to cron to. “buffer” the problem

rlkoshak · August 10, 2019, 3:27pm

If you are polling the API, the HTTP binding would indeed help. If not, move the polling out of OH to an external script that polls the API and returns the result without forcing OH to wait. See my sensorReporter for an example. I send a MQTT message, the script does the commands and returns the result as a MQTT message. OH doesn’t have to wait around for the command to execute.

A 20 second executecommandline means you are tieing up 50% of the threads you Cron triggered rules, Astro triggered rules, and Timers user to execute. It wouldn’t take much to use them all up.

In 2.5 ME the pool had been increased to 10.

NCO · August 11, 2019, 6:00am

I will look into that - thanks!

Evenry now and then I get one - but what do you mean with:

rlkoshak · August 11, 2019, 4:10pm

I forgot the link.

CDriver · August 11, 2019, 7:32pm

I get loads of these in the logs.

2019-08-11 18:43:18.689 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 18:41:48.579 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]

I get them when I have a particular rule running under cron. It fires every minute and will, on occasion, cause this. Each time one of these appears in the logs I lose a quartz thread. When the count of these exceptions matches the number of threads then no cron rules will run.

I’ve tried for a long time to work around it and seek help but it seems very specific to me and my setup (It has to do with the Xiaomi binding.) I cannot create a working example for others to try and debug

My solution:
Using wget, in a cron job, call the rest API for an item every minute. When that item changes to ON the rule fires. It’s a hack but it works.

Setting logs to debug for quartz:

 log:set DEBUG org.quartz.core

Other threads where I have talked about it:

https://community.openhab.org/search?q=cron%20%40crispin

rlkoshak · August 11, 2019, 10:21pm

OK, so we know from the error that something is taking more than 5 seconds to run which is the root cause of the problem. What is that something? Is there a way you can move that something outside of OH entirely (e.g. an external script that OH communicates asynchronously like using MQTT messages)?

CDriver · August 11, 2019, 10:27pm

Nothing external if I’m assuming bindings are considered internal. The binding I suspect is the cause is the xiaomi binding. It communicates with the gateway which then uses, I believe, zwave to talk to the devices. Perhaps that’s taking 5 seconds?

rlkoshak · August 11, 2019, 10:31pm

It could take that long but it wouldn’t be a blocking action so it wouldn’t hold on to a Timer body or cron/Astro triggered Rule for all that time. It would immediately return and the binding would process the command in the background.

From Rules your interaction with the binding is done via commands sent to Items. That command gets put on the event bus and sendCommand immediately returns. No waiting.

To solve this, you need to find what could possibly be taking five seconds to complete in Timer bodies, cron triggered Rules or Astro triggered Rules. There may be other bindings that use Quartz to schedule future events as well.

CDriver · August 11, 2019, 10:39pm

Nothing
No (long) sleeps, no endless loops. Some http calls but they all 2s timeout. All timers either toggle a switch which might be turning on the power etc or it inturn fires another rule. Anything else is done via timers.
This error and associated exception only happen when I execute the rule with cron. Trigger by a switch which is toggled from api via OS cron? Never errors.
Triyger it via switch toggled from the rule triggered by cron? Errors will appear.

I’m now running from OS cron again. Will seen after a week.

rossko57 · August 12, 2019, 12:33am

May we see the (recent version of) the cron rule that messes up, and the Item triggered version that doesn’t?

CDriver · August 12, 2019, 4:10am

Sure, both version below


rule "CalcLastMotionAgo"
when     
//    Time cron "0 0/1 * 1/1 * ? *" //every minute
//    or 
    Item RecalcMotion changed to ON
then
    logWarn("Calc Last Motion ago", "Calculating last-motion")
    MotionLastCalced.sendCommand(now.toString)
    g_LastMotionAgo?.allMembers.forEach[item |

        if (item === null){
                logWarn("Calc Last Motion ago", "Item item was null. Very odd")
                return;
        }

//        val iName = item.name
//        if (item.state == NULL){
//                logWarn("Calc Last Motion ago", "Item " + iName + " was NULL. item: " + item.name)
//                //sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), "0")
//                item.sendCommand(now.toString)
//                return;
//        }

        val motionAgo = (now.millis - (item.state as DateTimeType).zonedDateTime.toInstant.toEpochMilli) / 1000 / 60
        sendCommand(iName.replace("_LastMotion", "_MotionMinutesAgo"), motionAgo.toString)
    ]
    RecalcMotion.sendCommand(OFF)
end

Here are the logs from yesterday during the day

2019-08-11 14:49:29.060 [INFO ] [ome.model.script.rooms.climate.rules] - evohome_Wetroom_Temperature_Warmup is not on as we're not controlling. We're not interfering
2019-08-11 14:54:49.859 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:55:44.878 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:56:44.892 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:57:44.931 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 14:58:44.915 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:00.039 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityElectricUnitsH
2019-08-11 15:00:00.035 [WARN ] [pse.smarthome.model.script.Heartbeat] - ping
2019-08-11 14:59:44.947 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:00:33.978 [INFO ] [el.script.Utility.Count - Clear Hour] - Clearing UtilityGasUnitsH
2019-08-11 15:01:24.961 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:03:25.496 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:05:25.377 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.
2019-08-11 15:07:25.670 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness back to 10
2019-08-11 15:07:05.145 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:08:30.187 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:09:24.442 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was opened
2019-08-11 15:09:30.970 [INFO ] [eclipse.smarthome.model.script.Entry] - Front door was closed
2019-08-11 15:09:30.245 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:11:30.219 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:12:30.266 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:13:30.222 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
2019-08-11 15:14:30.288 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.
------snip
2019-08-11 21:14:17.540 [WARN ] [.model.script.Blinds Action list Add] - Key FrontRoomRight already in list. Count is now 8
2019-08-11 21:14:17.512 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.
java.lang.InterruptedException: null
	at java.lang.Object.wait(Native Method) ~[?:?]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
2019-08-11 21:14:00.040 [WARN ] [e.model.script.Blind Add Action Item] - FrontRoomRight|BlindsMotorControl_FrontRoomRight|DOWN|100
2019-08-11 21:13:17.625 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@155fd69' takes more than 5000ms.

There are a lot of them in the “snip” section.
Last night, whilst we were talking, I changed it back to the OS-based cron option and not a single issue since.
This runs every minute

wget -q --spider http://homeauto.home:8080/classicui/CMD?RecalcMotion=ON

Now, if I have another cron-based rule and toggle the switch to on I get the same errors in the log and eventually quartz dies leaving OH “pointless”

note the ping line in the logs. This is a cron rule which simply writes the word ping to the logs. It’s what I use to manually see if quartz has died.
No other cron jobs in OH have this problem. Just this one.

items

DateTime    Hallway_MotionSensor_LastMotion     "[%1$tY-%1$tm-%1$td  %1$tH:%1$tM]"  <clock-on>      (g_LastMotionAgo)               { channel="mihome:sensor_motion_aq2:158d0001e5266c:lastMotion" }

things

    Thing mihome:sensor_motion_aq2:158d0001e5266c "Sensor - Motion - Hallway" [itemId="158d0001e5266c"] //backdoor door

So, it loops over all the items from the Xiaomi sensors, gets the datetime value from them, works out the difference to now in minutes and then pushes that into the associated item.

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item. I’ve read the binding code and can see where it does this but a) don’t understand enough about java and the framework to see if there are any issues and b) that theory still conflicts with my understanding of what framework does and with what @rlkoshak said above. It should not deadlock.

*apologies to OP for the hijack but perhaps this actually helps as it seems I am not alone

rossko57 · August 12, 2019, 10:48am

I have an every minute rule using
Time cron "0 0/1 * * * ?" // each minute
which of course shouldn’t matter.

It may be that the crucial difference between cron trigger and external periodic trigger is synchronizing with other internal processes.
I would imagine, for example, some of these items are persisted. I’d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

rlkoshak · August 12, 2019, 5:07pm

Some http calls but they all 2s timeout.

I have a vague recollection that the timeout may not be followed in the sendHttp*Request commands. It might be worth while adding some logging before and after those calls to verify that the timeout is actually working.

Crispin:

Anything else is done via timers.

It’s the anything else done via timers that is relevant right now.

Crispin:

This error and associated exception only happen when I execute the rule with cron.

But the error could be caused by other Timers blocking the cron trigger’s Rule’s ability to run within five seconds of being triggered. The error isn’t very clear that it’s complaining that a Timer or cron triggered rule took more than five seconds to complete or that it took more than five seconds for a thread to free up so it could execute the Rule.

Crispin:

Sure, both version below

There is nothing in that Rule that should take a long time to run, unless you have thousands of members in g_LastMotionAgo. So perhaps the real problem is the latter of the two described above, it’s taking longer than five seconds for the Rule to start executing because all the threads are currently in use for five seconds.

Crispin:

not a single issue since.
This runs every minute

That makes sense because with the OS cron trigger, the Rule is running out of the rules thread pool, not the Quartz thread pool.

The more I look at it the less I think it’s this Rule that is causing the problem. Somewhere else you have a Timer body or cron/Astro triggered Rule that is taking longer than expected or crashing in a way that doesn’t return the thread to the Quartz thread pool. Then when this Rule tries to run, it takes more than five seconds to start running and OH sees that delay and throws an error.

Crispin:

My suspicion is there is some sort of deadlock between this code putting something on the bus and the binding polling or writing to the item.

That all happens in the background using other OH thread pools and would not impact the Quartz thread pool. When you sendCommand to an Item the sendCommand doesn’t block until the binding processes it. It immediately returns once the command is placed on the Event Bus which should take mere microseconds.

rossko57:

I would imagine, for example, some of these items are persisted. I’d be considering if the event bus logjam has an underlying cause of bulk storage I/O logjam.

That’s an interesting theory. You can see if that’s the case by adding logging statements before and after the sendCommand/postUpdates and checking the timestamps. That will also show that anything going on in the binding is also not blocking the Rule from running.

CDriver · August 12, 2019, 8:49pm

Interesting. I have many every-minute cron rules which fire and don’t cause a problem.
I am certain, but checking again tonight, I did have a rule which used cron to trigger the switch and the switch triggered the motion rule. This also failed. Why would it fail?
While I am sure, I am setting this up again tonight and will let it run overnight.

No, the group has 5 or 6 items in it. It’s not large. To @rossko57 point, none of the items are persisted. They’re datetime if that feeds any thought process.

I have loads of cron rules which work without a problem. I have one which runs every 30 minutes and writes out “ping”. I can certainly make it move frequent to try and exacerbate the problem.
I also have a rule, cron based, which runs every 5 minutes to check the position of the blinds. It does not falter until quartz totally gives up the ghost.

This “we’ve” spoken about before. When this error happens the log entries are all out of sequence.
See here: Help understanding quartz crashing - "Worker thread was interrupt()'ed."

As a snippet, look at the logs here:

1	2019-01-05 10:48:14.674 [INFO ] [home.model.script.Motion - resetting] - Motion from - Kitchen_MotionSensor - resetting to zero	
2	2019-01-05 10:48:12.596 [WARN ] [me.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@1e2e736' takes more than 5000ms.	
3	2019-01-05 10:47:37.641 [INFO ] [home.model.script.Motion - resetting] - Motion from - Hallway_MotionSensor - resetting to zero	
4	2019-01-05 10:47:27.061 [WARN ] [me.model.script.Motion - Auto Lights] - Bathroom_MotionSensor caused auto-lights - state:OFF	
5	2019-01-05 10:47:17.012 [WARN ] [l.script.controls.tablet.state.rules] - Brighten Screen - Setting brightness to 100.	
6	2019-01-05 10:52:12.890 [WARN ] [me.model.script.Motion - Auto Lights] - Room: Bathroom	
7	2019-01-05 10:49:42.950 [DEBUG] [rg.quartz.core.QuartzSchedulerThread] - batch acquisition of 1 triggers	
8	2019-01-05 10:49:42.949 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.	
9	java.lang.InterruptedException: null	
10		at java.lang.Object.wait(Native Method) ~[?:?]
11		at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
12	2019-01-05 10:49:12.610 [INFO ] [htwaverf.internal.LightwaveRfBinding] - Update from LightwaveRf ItemName[Power_Main], State[726]	
13	2019-01-05 10:49:12.610 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.state.motion.lastmotion.rules#Calc Last Motion ago#0 0/1 * 1/1 * ? *	
14	2019-01-05 10:55:12.954 [INFO ] [htwaverf.internal.LightwaveRfBinding] - Update from LightwaveRf ItemName[Power_Solar_prxy], State[144]	
15	2019-01-05 10:54:42.934 [ERROR] [org.quartz.simpl.SimpleThreadPool   ] - Worker thread was interrupt()'ed.	
16	java.lang.InterruptedException: null	
17		at java.lang.Object.wait(Native Method) ~[?:?]
18		at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568) [107:org.eclipse.smarthome.core.scheduler:0.10.0.oh240]
19

See how out of sequence the log entries are?
That “resetting” entry is the binding triggering a motion. It then set a the motionago item for that room back to zero.
So, actually, you have two rules here: 1) counting the minutes up. When it uses an item with that binding it fails and 2) a rule which fires and then resets the counter back to zero.
Two rules and they both fail, binding in the middle.

Things to try tonight -

Set up another minute-based rule in the same rules file to test with.
Disable the OS cron and revert to a cron-based rule which toggles the proxy item and check for failures.
to do tomorrow - remove the binding from the items. You’re theories will be proved either way if I do that. If it no longer fails then it’s confirmed to be a item+binding issue. If it still fails, then it is a rule / item / threading issue.

rlkoshak · August 12, 2019, 9:17pm

You have lot of cron triggering Rules.
Let’s assume you have the default number of Quartz threads of 2
Two of those cron triggered Rules take 10 seconds to complete.
While those two Rules are running, let’s say 3 seconds in, another cron triggered Rule tries to execute.
It takes more than five seconds for that third cron triggered Rule to start running and OH throws an exception.

It doesn’t matter what that third Rules actually does. It never got a chance to do it because those first two cron triggered Rules are hogging all the execution threads.

It can be even worse if for some reason those first two cron triggered Rules never return because of some bug or error that occurred.

The fact that you have “many every-minute cron rules” is a code smell. You should not have many. More than two would be too many, and even for the two you would need to make sure they don’t execute at exactly the same time (e.g.one uses 0 * * * * * ? * and the other uses 30 * * * * * ? *) so they don’t execute at the same time.

If you have lots of cron triggered Rules trying to trigger at the same time, I can totally see how you might break Quartz itself which might be the root cause of the problem. I can also see how you might end up starving a Rule or two because it isn’t clear whether Quartz uses a FIFO queue for those Rules that are waiting to get a thread.

Only to further confirm that the problem isn’t in this Rule, it’s before this Rule. The problem is your other cron triggered Rules (and Timers and Astro triggered Rules) are starving this Rule. It takes more than five seconds for the Rule to get an execution thread so it get’s errored out. That error may “break” Quartz preventing it from running any further Rules, or something may have broken Quartz before that point. I don’t think it is well tested error code as almost no one has encountered this problem with cron Rules. Most users don’t have that many and they rarely try to run them at the same time.

I’m convinced now that “loads of cron rules” is the problem.

I don’t know what a proper sequence for those log entries would be. Remember, each event in OH is processed in parallel. If you have two events being processed at the same time then the log entries for those events will be interleaved.

3 I already know will not do anything. The binding doesn’t block Rules. The binding can take half an hour to respond to an event but the sendCommand will return in milliseconds.

rossko57 · August 12, 2019, 9:19pm

Well yes, we might expect jumbled log messages with this going on. I’ll hazard a guess your system is fairly stressed and can be tipped into a logjam in various ways.

My thought about persistence was more about strategies like everyMinute invoking a workload spike, which will coincide with this rules cron trigger. They’re all running at the same whole minute and no seconds, not staggered across the clockface.

Humm, now we got lots of rules and possibly system services all lined up at the top of the minute.
I’ve got to wonder why lots of cron rules are needed in openHAB.

Time cron "30 * * * * ? *" // each minute at second 30
would run your suspect rule every minute but staggered from other minute based happenings

CDriver · August 13, 2019, 7:31am

I’ll go through and catalogue the rules I have running and when they actually execute. It may well be that I have a number of clashes.
I did change the Quartz threads to 50 (I know! ) to absorb the crashes for longer while I worked this out. Would the fact that I have that many threads go against the thought that it’s too many cron jobs running at the same time? At least for Quartz. Yes, concurrent rules might be a problem. I will try and understand those too as I have some which run quite frequently due to power consumption fluctuations in the house.

This, however, is interesting. Last night I did option 3 first (I know I know )
I put the cron back in to the rule trigger. I commented out the switch item change and left the OS cron as-is.
I then also commented out the channel link for the item in the group. i.e. the one it loops over to get the last time motion was detected.
The rule ran all night, every minute as it should, without one crash. The safety “ping” rule is also running happily which suggests quarts is still alive.
I can see the MotionAgo values are increasing because the rule is counting up and the last-motion datetime item is not being updated by the binding.

Does this contradict with your thought of “too many rules executing”?
Without the binding linked to the item which is in a group being enumerated, the rule, and quartz, never fail.

I’ll try option 2 this evening and see if that reverts to failing overnight.

@rossko57
Regarding the persistence etc, I have mysql logging and mapdb. The most frequent timed one is every 5 minutes. I’ll change that to 15 seconds past to spread that around. The others, mysql (and one mapdb) is quite chatty as it’s reading power values of the house and changes every 30 seconds.

Question:
Could it be that the binding is breaking the model by trying to access the variables directly and not put the changes “on the bus” where the normal framework takes over?
I have other rules which enumerate items in groups and change their values and nothing ever faults there. It’s always just this guy…

Thanks for the help

rossko57 · August 13, 2019, 10:52am

It’s not really clear if these
Dispatching event ... EventLogger@155fd69' takes more than 5000ms.
warnings actually go away when you change the “problem rule”.

A closer look at your “problem rule”.

So far as I can see, there’s only circumstantial evidence linking this rule to the InterruptedException: null errors i.e. they only happen sometimes when this rule is triggered a particular way, but no direct link in any message?

Being picky, your OS+Item based external trigger is command based. Rule trigger is state changed.
Should the OH event bus be jammed when your command comes in, the rule will not trigger until after the command gets on the bus, and then autoupdate processes it into an update, and the update gets put on the bus.
In other words, the external trigger will fire the rule after the event bus has got moving again, the rule will effectively be queued.

All I’m doing is looking for the reason why the rule should have different consequences depending on trigger source.

What is the purpose of the ? here, could that be hiding/suppressing a clue.

rlkoshak · August 13, 2019, 3:26pm

Only if the problem is caused by running out of threads. But if the problem is a lack of resources because you have a ton of stuff trying to be done on the first second of a minute than it will only add to the amount of work that is trying to be done all at once.

It is impossible for the binding to access the variable directly.