Cron scheduled rules suddenly stop working

sdda · August 25, 2020, 7:40pm

Hey *,

I have a strange behavior with my openHAB installation.
About 45 minutes after startup all scheduled roles suddenly stop working.

I already enabled debug logging on org.eclipse.smarthome.model and org.quartz.core but there’s no useful information to find.

I use the official Docker Image (openhab/openhab:latest-debian, OHB 2.5.8, based on Debian 10.4).

Any ideas where to look at? I’m kinda helpless

Cheers
Sven

rlkoshak · August 25, 2020, 7:52pm

Start here: Why have my Rules stopped running? Why Thread::sleep is a bad idea

sdda · August 25, 2020, 8:43pm

Sorry, I forgot to mention that I know that article and I don’t have a single thread::sleep in these rules and there are rarely any timers at all. Whenever possible I use the expire binding.

All other rules are still working. Only the scheduled ones suddenly stop.

rossko57 · August 25, 2020, 9:02pm

As you’ll know from the long thread, cron-triggered rules run using a thread from a different pool to regular rules

So that’s a strong hint it’s a scheduled “something” at work.

Various bindings use the same scheduler, so we should not rule those out.

As a crude diagnostic, you could add a logInfo() to the start and exit of each of your cron rules, just to see if any hang up.

Andrew_Rowe · August 25, 2020, 10:03pm

how many cron triggered rules do you run? How often do they trigger? Give us some use case ideas here… what are you doing (with cron triggered rules) Are they triggering every few moments or at some specific time of the day?
Final question: are these rules that have always worked up until now or are they new rules?

if you notice within 45 minutes that they are not working then I’m guess they fire often

sdda · August 26, 2020, 7:43am

I’ve already done that. They simply don’t trigger anymore.

9

Depends… from once a week up to every 60 seconds.

Heating

rule "heating hobby high"
    when
        Time cron "0 0 6 ? * MON,TUE,WED,THU,FRI *"
    then
        sendCommand(HobbyRoomThermostatTemp, 21)
end

Vacation Presence Simulation (only 3 items in the group)

var int presence_days = 7
var String persistence = "influxdb"

rule "PresenceSimulation"
when
	Time cron "0 0/1 * * * ?" // Every Minute
then
	if (VacationMode.state == ON) {
		gPresenceSimulation.members.forEach(light,i |
			if(light.historicState(now.minusDays(presence_days), persistence).state != light.state) {
				logInfo("Pres_Sim",light.name + " switches to " + light.historicState(now.minusDays(presence_days), persistence).state)
				sendTelegram("bot2", light.name + " switches to " + light.historicState(now.minusDays(presence_days), persistence).state)
				light.sendCommand(light.historicState(now.minusDays(presence_days), persistence).state.toString)
			} else {
				logInfo("Pres_Sim",light.name + " is already " + light.historicState(now.minusDays(presence_days), persistence).state)
			}
		)
	}
end

That’s a good question!
I noticed the problem when I implemented the last rule, the presence simulation.
It could be that the problem occurred before, but now it is noticed because I need the rule. We are on vacation at the moment and the lights are off

2020-08-25 21:10:00.002 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'
2020-08-25 21:11:00.001 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'
2020-08-25 21:12:00.002 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'
2020-08-25 21:13:00.002 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'
2020-08-25 21:14:00.002 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'
2020-08-25 21:15:00.003 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'PresenceSimulation'

And it stops without further information. Nothing more in the logs.

The same with this rule

2020-08-25 20:55:00.004 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'AirPurifier pull Data'
2020-08-25 21:00:00.006 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'AirPurifier pull Data'
2020-08-25 21:05:00.004 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'AirPurifier pull Data'
2020-08-25 21:10:00.003 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'AirPurifier pull Data'
2020-08-25 21:15:00.232 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'AirPurifier pull Data'

rossko57 · August 26, 2020, 10:33am

(ref rule logs) The general idea is to check that every preceding call of a rule has an exit. We know you don’t get triggers, but this is looking for rules that hang up before the point of disaster, consuming threads.

It can get terribly confusing to inspect though, especially if you may have multiple starts of a given rule. Although we wouldn’t expect multiple starts of a rule triggered only on cron.

It is possible to add some smarts using global variables as counters, if you have suspects.

var number myRuleStarts = 0

rule "my rule"
...
then
   myRuleStarts = myRuleStarts +1
   if (myRuleStarts >1) { logInfo("myrule", "WARNING - multiple starts of my rule " + myRuleStarts.toString) }
...
// real rule body
...
   myRuleStarts = myRuleStarts - 1
   logInfo("my rule", "Exit with code " + myRuleStarts.toString)
end

Exec and HTTP actions included in rules are prime suspects for hangups. Got any?

I’m eyeing the sendTelegram action in the shown rule with suspicion at the moment. That should I’m sure have a default timeout, but I can’t find any reference for that. I think you are still using v1.x action, not v2.x binding? You may have run into a snag there.

If you are confident this is not the area, then the next likely is a binding with scheduled activities (that’s most of them!). At the moment you’re the only complainant, so it would make sense to look at more unusual bindings or the way you have configured them.

sdda · August 26, 2020, 11:21am

First of all, thanks for the detailed reply! Much appreciated!

I didn’t even notice that it is still the V1 Telegram Binding. I will debug all rules and see if I can find the culprit.

Thanks a lot for the first time, I will get back to you

sdda · August 26, 2020, 12:12pm

I think I have found the error!

var int myRuleStarts = 0

rule "AirPurifier pull Data"
when
    Time cron "0 0/5 * * * ?"
then
    myRuleStarts = myRuleStarts +1
    if (myRuleStarts >1) { logInfo("AirPurifier pull Data", "WARNING - multiple starts of my rule " + myRuleStarts.toString) }
    
    executeCommandLine("/bin/bash /openhab/conf/scripts/airctrl.sh", 5000)

    myRuleStarts = myRuleStarts - 1
    logInfo("AirPurifier pull Data", "Exit with code " + myRuleStarts.toString)
end

This rule runs every five minutes and has a timeout of 5 seconds but nevertheless the processes stack up!

2020-08-26 14:10:00.002 [INFO ] [e.model.script.AirPurifier pull Data] - WARNING - multiple starts of my rule 5

Many thanks to all of you!
Now I’ll change my code to use the V2 Telegram Binding

rossko57 · August 26, 2020, 12:42pm

Bit mysterious. I’m not surprised there is an exec involved, but you seem to have a correct timeout and no reason to hang up.

Thinks: some other process gobbling up schedule threads could stop the timeout working? Surely not - if there were no threads we could not start the rule in the first place.

rlkoshak · August 26, 2020, 2:47pm

If you can verify with a simple rule like the above that the timeout is indeed being ignored please file an issue on openHAB-core (I think that’s where executeCommandLine lives). That would definitely be a bug that needs to be fixed. Though I suspect a fix won’t occur until OH 3.

In the mean time, assuming that it is indeed this rule that is consuming the threads (you have pretty convincing evidence in my opinion) I would add logging (i.e. echo commands) to the airctrl.sh script and log out what executeCommandLine returns so you can figure out which line in the script it’s hanging on and apply a timeout to that command in the script itself. See Put a Timer on Running Commands With Timeout Command in Linux.

It’s good practice to do this but ultimately it’s a workaround and executeCommandLine should be fixed to use the passed in timeout.

If I read the cron expression correctly, this means that the command line is taking longer than five minutes to complete and based on the behavior, it might just be hanging forever.

Yep, if some other process were gobbling up the threads the problem would appear starting up the thread in the first place. executeCommandLine with a timeout argument is blocking so it doesn’t consume a separate thread by itself. It’s running in the same thread as the rule that called it.

sdda · August 26, 2020, 6:44pm

I will try to debug this when I’m back home

sdda · August 27, 2020, 7:08pm

Bug report submitted:

rossko57 · August 27, 2020, 9:16pm

Just for info … I experimented in OH 2.5.0 (Windows environment), and cannot fault the Action.

Using executeCommandLine with a wget and duff URL, so that it loops retrying -
Reliably interrupts the script at the given integer timeout mS.
It works with or without retrieving the results e.g.
var results = executeCommandLine(...
If I fix up the wget to work, it returns when complete before the given timeout expires.

All seems well for me? Is it possible your script is somehow special, un-interruptable, because it is in a shell?
There was a relevant-to-bash looking change in Jan 2020, I don’t know what OH version that would be released in.