Cron-based rules suddenly stop executing

I have quite a large set of rules that have been running without problem for 4 months non-stop.

A new Z-Wave device I bought forced me to upgrade to v2.3.0, and now all rules suddenly stop executing. This morning I enabled Quartz core debug logging:

 log:set DEBUG org.quartz.core

I restarted openHAB and about 4 hours later I stopped seeing log entries like

2018-06-24 18:42:41.004 [DEBUG] [rg.quartz.core.QuartzSchedulerThread] - batch acquisition of 1 triggers

Not a single error message, and nothing useful in the events log. Is there any other logging I can enable to better understand what is happening?

Very frustrating: a simple version update from one stable release to the next, and my whole home automation is dead.

For the record: running on openHABian, with MySQL persistence and the database on a different machine.

What happens if you restart? Any errors occurring?

Some rules need rework due to syntax changes.

Upon restart, all rule files are loaded without any syntax issues. openHAB as a whole starts up properly and runs fine for some unpredictable time. Then the rule engine (or whatever it is called in OH speak) suddenly stops processing any rules. As I said, there is no message in any log.

Two hours ago I rebooted the OS and changed the timing of my cron-based rules to spread them better within the minute. If one of the cron rules is the cause, this should help me isolate it.
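
By "spreading" I just mean giving each rule a different seconds offset in its Quartz cron expression. A rough sketch (rule names and expressions are only illustrative):

rule "Cron rule A"
when
    Time cron "0/20 * * * * ?"    // fires at seconds 0, 20, 40
then
    logDebug("cron", "rule A fired")
end

rule "Cron rule B"
when
    Time cron "10/20 * * * * ?"   // fires at seconds 10, 30, 50, never together with rule A
then
    logDebug("cron", "rule B fired")
end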

I found another thread that reported a similar thing in v2.2.0, but this quickly pointed in the direction of thread.sleep. I don’t use any thread.sleep calls.

Or other long-running commands like:

  • executeCommandLine with a timeout (see the sketch after this list)
  • sendHttp*Request
  • while loops
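
For clarity, "executeCommandLine with a timeout" means the two-argument form, which blocks the rule's thread until the command returns or the timeout expires. A rough sketch (script path and rule name are made up):

rule "Poll external script"
when
    Time cron "0/10 * * * * ?"
then
    // blocks this rule thread for up to 5000 ms while waiting for the script to finish
    val String result = executeCommandLine("/home/openhabian/scripts/poll.sh", 5000)
    logDebug("exec", "poll.sh returned: " + result)
end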

There can also be a problem, even if you don't have long-running rules, if you have A LOT of events firing off lots and lots of rules. Only five can run at a time, so if a barrage of events occurs all at once it can take some time to work them off, and in the meantime the backlog continues to build.

If this is the case, you can increase the number of runtime threads. There is a setting somewhere in the userdata/etc config files.

Are you only using cron-based rules? If yes, could you try an item-based trigger, to get an indication of whether the rule engine or the cron trigger has the problem?
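
Something like the sketch below would do as a quick test (the Item name is made up). If this rule still fires while the cron rules are dead, the cron/Quartz side is the suspect.

rule "Rule engine sanity check"
when
    Item TestSwitch changed    // any Item you can toggle from a sitemap works
then
    logInfo("sanitycheck", "Item-triggered rule still runs at " + now.toString)
end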

Interesting suggestion! Item-based triggers still seem to work.

  • Yes, two executeCommandLine calls with a 5000 ms timeout. I have now removed these timeouts. The scripts complete in less than 3 seconds, so there is no risk that they will hang.
  • No sendHttp* calls, and no while loops.

I don't have that many events. Counting lines in events.log, I typically end up with about 1100 events/hr. I would not call this "a lot". Is there a way to see/measure/track/log the size of the scheduler backlog?

But even then, how come the scheduler stops working without making noise?

I do have quite a few timers with long timeouts (a few minutes up to 1 hr). Do these Timer objects use the Quartz scheduler as well, or is there another mechanism for executing the timer lambdas?
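
For context, by "timers" I mean the usual createTimer construct, roughly like this (names invented):

rule "Hallway light off delay"
when
    Item HallwayMotion changed to OFF
then
    // the lambda is scheduled to run one hour from now
    createTimer(now.plusHours(1)) [ |
        HallwayLight.sendCommand(OFF)
    ]
end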

Update: I just found out that one of my motion sensors is sending alerts every 3 seconds. I have now put it in a drawer to see if this has any effect.

In the scenario I described above, the scheduler isn't stopping; you just run out of threads. If you have rules that don't log and that execute a lot, it looks like the rules have stopped, when in fact they are furiously working off events, just more slowly than new events come in. So it looks like rules stop executing.

I don't know what algorithm is used to select which events get worked off first, but it doesn't appear to be FIFO. I can't say that for certain, though.

1100 events/hr means about one event every three seconds. With five threads you can have five rules running at once, so you should be able to keep up with that rate as long as the rules themselves complete within a few seconds. But keep in mind that while a rule runs, more events keep coming in, so it is very easy to build up a backlog.

It isn't always possible, but most of your rules should take on the order of a second or less to execute, with a rare one that takes longer.

If you have a motion sensor going off every three seconds and the rule that triggers off of it takes three seconds or more, you can easily build up this backlog.

I’m not saying this is your problem, but we need to eliminate it as a possibility before we can say there is a bug.

I have only two "slow" rules. One talks to Spotify and the other to a Redis instance on another machine. All my other rules merely do some logic, like comparing temperatures and posting updates to other Items. I assume these execute in milliseconds.

Anyhow, since I disabled the faulty motion sensor, the system has been running for more than 12 hrs.

In case my cron jobs stop executing again, is there a way to see which jobs are on the Quartz queue?

I found a file "quartz.properties" in the directory /usr/share/openhab2/runtime/etc (using openHABian), containing:

org.quartz.threadPool.threadCount = 2

Is this the one you were referring to?

I don't think so. It should be set to five by default, not 2.

I also use openHABian on a Pi.

/usr/share/openhab2/runtime/etc/quartz.properties

org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.instanceName = openHAB-job-scheduler
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 2
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore

also:

org.quartz.threadPool.threadCount = 2
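
If you want to experiment, raising that line and restarting openHAB should give Quartz more worker threads; treat the value below as a guess to try, not a recommendation:

org.quartz.threadPool.threadCount = 5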

Looking at the threads via the Karaf console, I found three Quartz-related threads: one scheduler thread and two worker threads. When the system is running fine, it looks like this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@d15df1
    at java.lang.Object.wait(Native Method)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:311)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING on lock=java.lang.Object@1368f2f
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING on lock=java.lang.Object@1f4a045
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

A few minutes ago, the cron jobs stopped again, and now I see this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@10e2117
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1368f2f

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1f4a045

Clearly, both Quartz worker threads are stuck in a Thread.sleep() while executing a command line, and the scheduler thread is blocked, waiting for an available worker thread.

Looking at the source code of DefaultExecuteResultHandler, it seems the thread is waiting "forever" for a result or exit code to come back from the OS:

097    /**
098     * Has the process exited and a result is available, i.e. exitCode or exception?
099     *
100     * @return true if a result of the execution is available
101     */
102    public boolean hasResult() {
103        return hasResult;
104    }
105
106    /**
107     * Causes the current thread to wait, if necessary, until the
108     * process has terminated. This method returns immediately if
109     * the process has already terminated. If the process has
110     * not yet terminated, the calling thread will be blocked until the
111     * process exits.
112     *
113     * @exception  InterruptedException if the current thread is
114     *             {@linkplain Thread#interrupt() interrupted} by another
115     *             thread while it is waiting, then the wait is ended and
116     *             an {@link InterruptedException} is thrown.
117     */
118    public void waitFor() throws InterruptedException {
119
120        while (!hasResult()) {
121            Thread.sleep(SLEEP_TIME_MS);
122        }
123    }

I am absolutely certain that my Python scripts are not hanging, so there seems to be a problem getting the script's exit code back to the Java thread.

You can configure thread pools in /conf/services/runtime.cfg. I believe these are the default values…

org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:discovery=5
org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:ruleEngine=5

You can check your current settings with this Karaf command…

config:list "(service.pid=org.eclipse.smarthome.threadpool)"

I recently had cron rules failing because the CalDav binding could not connect to our DavMail server. Setting the Quartz scheduler logging to DEBUG is very helpful: it gives you the time of the last successful job, which is a starting point for investigating what is causing the hang. More details in my comment here…

I can guess what discovery and thingHandler are for. Any idea what the safeCall threads do?

This was asked and answered in an ESH issue I had read. I was hoping the documentation PR would have been merged by this point…

Thanks!

I came across the ruleHandler thread pool sizing in https://github.com/eclipse/smarthome/issues/5040 and added this as well

Thread pool settings in services.cfg:

# Configuration of thread pool sizes
#was  org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:thingHandler=10
org.eclipse.smarthome.threadpool:discovery=5
#was org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:safeCall=20
#default seems to be 5 -- https://github.com/eclipse/smarthome/issues/5040
org.eclipse.smarthome.threadpool:ruleHandler=20

Unfortunately, less than 5 hours later, my cron rules stopped again. It seems a resource shortage on the ESH side is not the cause of the problem. I also confirmed that no Python process was running when the Quartz scheduler was blocked, which confirms that my scripts had ended.

This should be ruleEngine, not ruleHandler.
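
In other words, the last line of your snippet would become:

org.eclipse.smarthome.threadpool:ruleEngine=20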

The logs you posted point to executeCommandLine hanging. IIRC, its calls can be logged through the org.eclipse.smarthome.io.net.exec logger. If you post the lines from your rules where you use executeCommandLine, maybe we could spot something.
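
For reference, that would be:

 log:set DEBUG org.eclipse.smarthome.io.net.exec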

You should keep the timeouts. They should prevent executeCommandLine from hanging.

After several days of experimentation, I conclude that the timeout in executeCommandLine is triggering my problem.

When using a timeout, cron-based rules stop executing after some hours. When this happens, I see that the Quartz scheduler thread is waiting in blockForAvailableThreads.

Using the "threads" command in the Karaf console, I see:

"openHAB-job-scheduler_QuartzSchedulerThread" Id=83 in TIMED_WAITING on lock=java.lang.Object@1d68a94
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)

and the scheduler thread remains stuck in that state until I restart openHAB.

I tried these workarounds:

  • Increase the openHAB thread pool sizes. Did not help; I have no indication that thread availability on the ESH/OH side has any influence.
  • Increase the number of Quartz worker threads: no solution, it just took somewhat longer until the Quartz scheduler locked up again.
  • Increase the interval between the executeCommandLine cron rules. Initially my cron rules were running every 5 seconds with a 5-second timeout, and I thought this could be the cause. Increasing the interval to 10 or 15 s did not solve it.

I also learned how to predict the failure by observing the delay between the scheduled time and the actual execution time, using:

 log:set DEBUG org.quartz.core

It is hard to see whether the script-based cron rules that run frequently (every x seconds) are launched on time. But I have other cron rules that run every minute; these just update a couple of OH Items, with no scripts involved.
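
For anyone trying to reproduce this: such a once-a-minute "canary" rule is trivial. A sketch with invented names (the cron expression matches the job shown in the logs below):

rule "Scheduler canary"
when
    Time cron "20 0/1 * * * ?"    // second 20 of every minute
then
    // just touch an Item so the firing shows up in events.log and in the Quartz DEBUG log
    SchedulerCanary.postUpdate(new DateTimeType())
end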

When all is fine, I typically see a delay of milliseconds between schedule and execution:

2018-06-29 13:37:20.005 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:38:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:39:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

If I wait a bit, I observe a delay of about 4 seconds:

2018-06-29 13:46:24.423 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:47:24.626 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:48:24.905 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:49:24.540 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

increasing to more than 9 seconds somewhat later:

2018-06-29 15:05:29.026 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:06:29.757 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:07:29.445 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

Plot: (image omitted; it showed the execution delay growing over time, as in the log excerpts above)

When I leave out the timeout in the executeCommandLine() call, all cron tasks keep executing with less than 170 ms delay versus the planned time, even after more than 72 hours.

This is what I observe. I have no real explanation/understanding.

In the coming weeks, I will try to reproduce this problem on a clean install on another Raspberry Pi. Until then, this phenomenon might still be due to other fiddling I did on my system while experimenting with OH.