Cron based rules suddenly stop executing

I have only two “slow” rules. One is talking to Spotify and the other to a Redis instance on another machine. All other rules I have are merely doing some logic operations like comparing temperatures and posting updates to other items. I assume that these execute in milliseconds.

Anyhow, after I disabled the faulty motion sensor, the system has been running for more than 12 hrs now.

In case my cron jobs stop executing again, is there a way to see which jobs are on the Quartz queue?

I found a file “quartz.properties” in the directory /usr/share/openhab2/runtime/etc (I am using openHABian)

org.quartz.threadPool.threadCount = 2

Is this the one you were referring to?

I don’t think so. By default it should be set to five, not two.

I also use openHABian on a Pi.

/usr/share/openhab2/runtime/etc/quartz.properties

org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.instanceName = openHAB-job-scheduler
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 2
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore

also:

org.quartz.threadPool.threadCount = 2

Looking at the threads via the Karaf console, I found three Quartz-related threads: one scheduler thread and two workers.
When the system is running fine, it looks like this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@d15df1
    at java.lang.Object.wait(Native Method)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:311)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING on lock=java.lang.Object@1368f2f
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING on lock=java.lang.Object@1f4a045
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

A few minutes ago, cron jobs were stopped again, and now I see this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@10e2117
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1368f2f

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1f4a045

Clearly, both Quartz worker threads are stuck in a Thread.sleep() while executing a command line, and the scheduler thread is blocked, waiting for an available worker thread.

Looking at the source code of DefaultExecuteResultHandler, it seems the thread waits “forever” for a result or exit code to come back from the OS.

    /**
     * Has the process exited and a result is available, i.e. exitCode or exception?
     *
     * @return true if a result of the execution is available
     */
    public boolean hasResult() {
        return hasResult;
    }

    /**
     * Causes the current thread to wait, if necessary, until the
     * process has terminated. This method returns immediately if
     * the process has already terminated. If the process has
     * not yet terminated, the calling thread will be blocked until the
     * process exits.
     *
     * @exception  InterruptedException if the current thread is
     *             {@linkplain Thread#interrupt() interrupted} by another
     *             thread while it is waiting, then the wait is ended and
     *             an {@link InterruptedException} is thrown.
     */
    public void waitFor() throws InterruptedException {

        while (!hasResult()) {
            Thread.sleep(SLEEP_TIME_MS);
        }
    }

I am absolutely certain that my Python scripts are not hanging, so there seems to be a problem returning the script’s exit code to the Java thread.
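For comparison, a wait that must never block a worker thread forever can be given a hard deadline. Here is a minimal Python sketch of that idea (the helper name and the deadlines are my own, not openHAB code):

```python
import subprocess
import sys

def run_with_deadline(cmd, timeout_s):
    """Run a command but never wait forever: give up after timeout_s seconds.

    Unlike the sleep-and-poll waitFor() loop above, subprocess.run() enforces
    a hard deadline and raises TimeoutExpired, so the caller is always freed.
    """
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None  # caller decides what a timeout means

# a command that finishes immediately returns its exit code
rc = run_with_deadline([sys.executable, "-c", "pass"], 5)
```

The key point is that the deadline sits on the wait itself, rather than on a polling loop with no exit condition; that is what the timeout parameter of executeCommandLine is supposed to provide on the Java side.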

You can configure thread pools in /conf/services/runtime.cfg. I believe these are the default values…

org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:discovery=5
org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:ruleEngine=5

You can check your current settings with this Karaf command…

config:list "(service.pid=org.eclipse.smarthome.threadpool)"

I recently had cron rules failing due to the CalDav binding failing to connect to our DavMail server. Setting the Quartz scheduler to debug is very helpful. It gives you the time of the last successful job, which is a starting point for investigating what is causing the hang. More details in my comment here…

I can guess what discovery and thingHandler is for. Any idea what safeCall threads do?

This was asked/answered in an ESH issue I had read. I was hoping the documentation PR was merged at this point…


Thanks!

I came across the ruleHandler thread pool sizing in https://github.com/eclipse/smarthome/issues/5040 and added this as well

Thread pool settings in services.cfg:

# Configuration of thread pool sizes
#was  org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:thingHandler=10
org.eclipse.smarthome.threadpool:discovery=5
#was org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:safeCall=20
#default seems to be 5 -- https://github.com/eclipse/smarthome/issues/5040
org.eclipse.smarthome.threadpool:ruleHandler=20

Unfortunately, less than 5 hours later, my cron rules stopped again. It seems that a resource shortage on the ESH side is not the cause of the problem. I also confirmed that no Python process was running while the Quartz scheduler was blocked, which confirms that my scripts had ended.

This should be ruleEngine.

The logs you posted point to executeCommandLine hanging. IIRC, these can be logged through org.eclipse.smarthome.io.net.exec. If you post the lines from your rules where you are using executeCommandLine, maybe we could spot something.

You should keep the timeouts. They should prevent executeCommandLine from hanging.

After several days of experimentation, I conclude that the timeout in executeCommandLine is triggering my problem.

When using a timeout, cron based rules stop executing after some hours. When this happens, I see that the quartz scheduler thread is waiting in blockForAvailableThreads.

Using the “threads” command in the Karaf console, I see:

"openHAB-job-scheduler_QuartzSchedulerThread" Id=83 in TIMED_WAITING on lock=java.lang.Object@1d68a94
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)

and the scheduler thread remains stuck in that state until I restart openHAB.

I tried these workarounds

  • increase the openHAB thread pool sizes: did not help. I have no indication that thread availability on the ESH/OH side has any influence.
  • increase the number of Quartz worker threads: no solution, it just took somewhat longer until the Quartz scheduler locked up again.
  • increase the interval between executeCommandLine cron rules: initially my cron rules were running every 5 seconds, with a 5-second timeout. I thought this could be the cause, but increasing the interval to 10 or 15 seconds did not bring a solution.

I also learned how to predict system failure by observing the delay between the planned time and the actual execution time, using

 log:set DEBUG org.quartz.core

It is hard to see whether the script-based cron rules that run often (every x seconds) are launched on time, but I have other cron rules that run every minute. These latter rules just update a couple of OH items; no scripts involved.

When all is fine, I typically see a delay of milliseconds between schedule and execution:

2018-06-29 13:37:20.005 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:38:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:39:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

If I wait a bit, I observe a delay of about 4 seconds

2018-06-29 13:46:24.423 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:47:24.626 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:48:24.905 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:49:24.540 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

increasing to more than 9 seconds somewhat later

2018-06-29 15:05:29.026 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:06:29.757 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:07:29.445 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
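The delay in these log lines can be computed mechanically. A small sketch (my own helper, not an openHAB tool; the timestamp format is copied from the log above, and the scheduled second, 20, comes from the cron expression `20 0/1 * * * ?`):

```python
from datetime import datetime

def drift_ms(log_ts, scheduled_second=20):
    """Delay in ms between a Quartz fire time and its scheduled second.

    Works for a per-minute cron like '20 0/1 * * * ?' that always fires at
    second 20; log_ts is the timestamp prefix of a JobRunShell log line.
    """
    t = datetime.strptime(log_ts, "%Y-%m-%d %H:%M:%S.%f")
    scheduled = t.replace(second=scheduled_second, microsecond=0)
    return (t - scheduled).total_seconds() * 1000.0

print(drift_ms("2018-06-29 13:37:20.005"))  # healthy: a few ms
print(drift_ms("2018-06-29 15:05:29.026"))  # degraded: about 9 seconds
```

Feeding the log timestamps through something like this turns the DEBUG output into a drift curve, which is what makes the slow degradation visible before the scheduler locks up completely.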


When leaving out the timeout in the executeCommandLine() call, all cron tasks keep executing with less than 170 ms delay versus planned, even after more than 72 hours.

This is what I observe. I have no real explanation/understanding.

In the coming weeks, I will try to reproduce this problem on a clean install on another Raspberry Pi. Until then, this phenomenon might still be due to other fiddling I did on my system while experimenting in OH.

Could you please post your rule and script?

I’m going to guess that your executeCommandLine is taking longer to execute than the interval at which your Rule triggers. The timeout in executeCommandLine must be less, perhaps significantly less, than the polling period of your Rule.

So, for example, if your Rule is executing every five seconds, the timeout on executeCommandLine should be no more than 4500 (let’s give it half a second to do all the stuff before and after the call to executeCommandLine).

Now, there could be a bug here: if executeCommandLine doesn’t actually time out after 4.5 seconds and return, but instead blocks until the script it called returns, no matter how long that takes.
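The arithmetic behind that rule of thumb is simple enough to write down. A sketch (the 500 ms overhead allowance is an assumption, not a measured value):

```python
def max_safe_timeout_ms(period_s, overhead_ms=500):
    """Largest executeCommandLine timeout that still lets one run finish
    before the next cron trigger fires.

    overhead_ms is a rough allowance for the rest of the rule body
    (an assumption here; tune it for your own rules).
    """
    return period_s * 1000 - overhead_ms

print(max_safe_timeout_ms(5))   # a 5-second rule leaves room for 4500 ms
```

If the computed value comes out near zero or negative, the rule period is simply too short for the timeout you want, and runs will pile up on the worker threads.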

I’d like to see this potential bug ruled out so like Scott I eagerly await the posting of your actual Rule.

But, as a workaround you can export this looping to the script itself and have that script push the result to OH. Then OH will react to the data instead of tying up a thread awaiting the value to be returned. I’ve done similar stuff using https://github.com/rkoshak/sensorReporter. That is what the execSensor.py does.
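The push approach can be sketched in a few lines of Python against openHAB’s REST API. This only builds the request, it does not send it; the base URL and Item name are placeholders taken from the scripts in this thread:

```python
import urllib.request

OPENHAB_URL = "http://localhost:8080"  # assumption: default openHAB host/port

def build_item_command(item, value, base=OPENHAB_URL):
    """Build the REST request that posts a command to an openHAB Item.

    Sending this from the external script itself is fire-and-forget, so no
    rule thread sits in waitFor() waiting for the script's exit code.
    """
    return urllib.request.Request(
        url="{}/rest/items/{}".format(base, item),
        data=str(value).encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

req = build_item_command("TemperatuurTuinhuis", 21.5)
# urllib.request.urlopen(req, timeout=5)  # uncomment to actually send
```

With this pattern, openHAB reacts to incoming item updates instead of tying up a Quartz worker waiting for a value to be returned.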

Maybe an important update: when putting both executeCommandLine calls in the same rule, the problem disappears.

It seems that the problem only pops up when

  • using multiple executeCommandLine calls from different rules, meaning that multiple executeCommandLine() invocations can be running at the same time
  • using a timeout
  • (maybe, too) calling scripts that take several seconds to complete

rule which is running fine

rule "scripts"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py",5000)
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py",5000)
end

before, I had them launched from different rules files:

rule "Spotify scheduled refresh"
when 
    Time cron "0/12 * * * * ?" 
then
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py")
end

and

rule "Scheduled sync with redis"
when 
    Time cron "0/15 * * * * ?"   // every 15 seconds
then
//logDebug("hvd","sync redis start")
	executeCommandLine("/usr/bin/python /etc/openhab2/scripts/tuinhuis-zwembad.py") // 5000 ms timeout removed
	//logDebug("hvd","sync redis end")
end

scripts:

from myopenhab import openhab
import redis


oh=openhab()

# open connection to the redis database in the garden house
r = redis.Redis(host='blabla', port=6379)

# read values from redis and pass them on to openHAB
oh.sendCommand('TemperatuurTuinhuis',r.get('TemperatuurTuinhuis'))
oh.sendCommand('WarmtepompLuchtIn',r.get('WarmtepompLuchtIn'))
oh.sendCommand('WarmtepompLuchtUit',r.get('WarmtepompLuchtUit'))
oh.sendCommand('TuinhuisVerwarming',r.get('TuinhuisVerwarming'))
oh.sendCommand('PentairCurrentRpm',r.get('PentairCurrentRpm'))
oh.sendCommand('PentairCurrentPowerConsumption',r.get('PentairCurrentPowerConsumption'))
oh.sendCommand('PentairCurrentClock',r.get('PentairCurrentClock'))
oh.sendCommand('PentairStatus',r.get('PentairStatus'))
oh.sendCommand('WarmtepompStatus',r.get('WarmtepompStatus'))

if r.get('PentairStatus')=='10':
	oh.sendCommand('PentairWattPerRpm',float(r.get('PentairCurrentPowerConsumption'))/float(r.get('PentairCurrentRpm')))
else:
	oh.sendCommand('PentairWattPerRpm','0.00')
# only log water temperatures while the Pentair pump is running, taking the startup timeout into account
if oh.getState('WaterTemperaturesValid') == 'ON':
	oh.sendCommand('FilterWaterUit',r.get('FilterWaterUit'))
	oh.sendCommand('WarmtepompWaterIn',r.get('WarmtepompWaterIn'))
	oh.sendCommand('WarmtepompWaterUit',r.get('WarmtepompWaterUit'))

else:
	oh.sendCommand('FilterWaterUit','NULL')
	oh.sendCommand('WarmtepompWaterIn','NULL')
	oh.sendCommand('WarmtepompWaterUit','NULL')

	
# pass setpoints from openHAB on to redis
r.set('PentairRpmSetpoint', oh.getState('PentairRpmSetpoint'))
r.set('WarmtepompCommando', oh.getState('WarmtepompCommando'))

I had the same thought and tested this hypothesis. Runtime of both scripts is in the 3–4 s range, and the timeout setting is 5000 ms. Initially the repetition rate was 5 s; increasing it to 10 s or 16 s did not help. And like I said, the problem disappeared when leaving out the timeout, without changing the repetition rate of the executeCommandLine() calls.

That was my fallback, but I wanted to contribute as much diagnostics and analysis to the community as possible.

See above. Frankly, I think this is not adding much value. I am afraid that the focus will move to the design of my scripts and the way I connect to remote resources, effectively diverting attention from the issue at hand. Keep in mind that my original scripts and rules had been running non-stop for 4 months; this problem only popped up when upgrading from OH 2.2 to OH 2.3.

Just to be clear, when you leave out the timeout the call to executeCommandLine immediately returns and any output from the called script gets lost. Since you don’t care about the return value anyway, there is no reason to use the timeout. I think (hope) that executeCommandLine will not tie up a thread somewhere if the script never returns.

Indeed, I am not interested in the exit code or return value from the scripts. They use the REST API to update items and then exit. From the behaviour of openHAB I have the impression that they always exit cleanly, but I have no solid proof.

I could implement logging in the Python scripts to find out whether one of them intermittently exits on an uncaught exception. If the team would be interested, I could give this a try.

Since the problem goes away when the timeout isn’t present, I don’t think there is much benefit in getting down to the source of the problem that way.

My cron rules stopped working as well (see here: Rules stop execution)

I use quite a bunch of executeCommandLine calls (about 90), where I sometimes needed longer timeouts, like 20000 ms, because requesting the state of my car, for instance, took longer.

How can I avoid these?
I could replace some of the executeCommandLine("curl… calls by using the HTTP binding.
Maybe that’s a start!?

Any “InterruptedException” in the log files?

Put Quartz into trace logging and see if it crashes.

It’s still the biggest cause of instability I have at the moment. I’m forced to reboot often and have assigned 50 threads to cron to “buffer” the problem :frowning: