Cron-based rules suddenly stop executing

I have quite a large set of rules that have been running without problem for 4 months non-stop.

A new Z-Wave device I bought forced me to upgrade to v2.3.0, and now all rules suddenly stop executing. This morning I enabled Quartz core debug logging:

 log:set DEBUG org.quartz.core

I restarted openHAB and about 4 hours later I stopped seeing log entries like

2018-06-24 18:42:41.004 [DEBUG] [rg.quartz.core.QuartzSchedulerThread] - batch acquisition of 1 triggers

Not a single error message, and nothing useful in the events log. Is there any other logging I can enable to better understand what is happening?

Very frustrating: a simple version update from one stable release to the next, and my whole home automation is dead.

For the record: running on openHABian, with MySQL persistence and the database on a different machine.

What happens if you restart? Any errors occurring?

Some rules need rework due to syntax changes.

Upon restart, all rule files are loaded without any syntax issues. openHAB as a whole starts up properly and runs fine for some unpredictable time. Then the rule engine (or whatever it is called in OH speak) suddenly stops processing any rules. As I said, there is no message in any log.

Two hours ago I rebooted the OS and changed the timing of my cron-based rules to spread them better within the minute. If one of the cron rules is the cause, this should help me isolate it.
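
By "spreading" I just mean giving each rule a different seconds offset in its Quartz cron expression. A rough sketch (rule names and expressions are only illustrative):

rule "Cron rule A"
when
    Time cron "0/20 * * * * ?"    // fires at seconds 0, 20, 40
then
    logDebug("cron", "rule A fired")
end

rule "Cron rule B"
when
    Time cron "10/20 * * * * ?"   // fires at seconds 10, 30, 50, never together with rule A
then
    logDebug("cron", "rule B fired")
end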

I found another thread that reported a similar thing in v2.2.0, but this quickly pointed in the direction of thread.sleep. I don’t use any thread.sleep calls.

Or other long-running commands like:

  • executeCommandLine with a timeout (see the sketch after this list)
  • sendHttp*Request
  • while loops
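
For clarity, "executeCommandLine with a timeout" means the two-argument form, which blocks the rule's thread until the command returns or the timeout expires. A rough sketch (script path and rule name are made up):

rule "Poll external script"
when
    Time cron "0/10 * * * * ?"
then
    // blocks this rule thread for up to 5000 ms while waiting for the script to finish
    val String result = executeCommandLine("/home/openhabian/scripts/poll.sh", 5000)
    logDebug("exec", "poll.sh returned: " + result)
end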

There can also be a problem, even if you don't have long-running rules, if you have A LOT of events firing off lots and lots of rules. Only five can run at a time, so if a barrage of events occurs all at once it can take some time to work them off, and in the meantime the backlog continues to build.

If this is the case, you can increase the number of runtime threads. There is a setting somewhere in the userdata/etc config files.

Are you only using cron-based rules? If yes, could you try an item-based trigger, to get an indication of whether the rule engine or the cron trigger has the problem?
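
Something like the sketch below would do as a quick test (the Item name is made up). If this rule still fires while the cron rules are dead, the cron/Quartz side is the suspect.

rule "Rule engine sanity check"
when
    Item TestSwitch changed    // any Item you can toggle from a sitemap works
then
    logInfo("sanitycheck", "Item-triggered rule still runs at " + now.toString)
end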

Interesting suggestion! Item-based triggers still seem to work.

  • Yes, two executeCommandLine calls with a 5000 ms timeout. I have now removed these timeouts. The scripts complete in less than 3 seconds, so there is no risk that they will hang.
  • No sendHttp* calls, and no while loops.

I don't have that many events. Counting lines in events.log, I typically end up with about 1100 events/hr. I would not call this "a lot". Is there a way to see/measure/track/log the size of the scheduler backlog?

But even then, how come the scheduler stops working without making noise?

I do have quite a few timers with long timeouts (a few minutes up to 1 hr). Do these Timer objects use the Quartz scheduler as well, or is there another mechanism for executing the timer lambdas?
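
For context, by "timers" I mean the usual createTimer construct, roughly like this (names invented):

rule "Hallway light off delay"
when
    Item HallwayMotion changed to OFF
then
    // the lambda is scheduled to run one hour from now
    createTimer(now.plusHours(1)) [ |
        HallwayLight.sendCommand(OFF)
    ]
end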

Update: I just found out that one of my motion sensors is sending alerts every 3 seconds. I have now put it in a drawer to see if this has any effect.

In the scenario I described above, the scheduler isn't stopping; you just run out of threads. If you have rules that don't log and that execute a lot, it looks like the rules have stopped, when in fact they are furiously working off events, just more slowly than new events come in. So it looks like rules stop executing.

I don't know what algorithm is used to select which events get worked off first, but it doesn't appear to be FIFO. I can't say that for certain, though.

1100 events/hr means about one event every three seconds. With five threads you can have five rules running at once, so you should be able to keep up with that rate as long as the rules themselves complete within a few seconds. But keep in mind that while a rule runs, more events keep coming in, so it is very easy to build up a backlog.

It isn't always possible, but most of your rules should take on the order of a second or less to execute, with a rare one that takes longer.

If you have a motion sensor going off every three seconds and the rule that triggers off of it takes three seconds or more, you can easily build up this backlog.

I’m not saying this is your problem, but we need to eliminate it as a possibility before we can say there is a bug.

I have only two "slow" rules. One talks to Spotify and the other to a Redis instance on another machine. All my other rules merely do some logic, like comparing temperatures and posting updates to other Items. I assume these execute in milliseconds.

Anyhow, since I disabled the faulty motion sensor, the system has been running for more than 12 hrs.

In case my cron jobs stop executing again, is there a way to see which jobs are on the Quartz queue?

I found a file "quartz.properties" in the directory /usr/share/openhab2/runtime/etc (using openHABian), containing:

org.quartz.threadPool.threadCount = 2

Is this the one you were referring to?

I don't think so. It should be set to five by default, not 2.

I also use openHABian on a Pi.

/usr/share/openhab2/runtime/etc/quartz.properties

org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.instanceName = openHAB-job-scheduler
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 2
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore

also:

org.quartz.threadPool.threadCount = 2
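
If you want to experiment, raising that line and restarting openHAB should give Quartz more worker threads; treat the value below as a guess to try, not a recommendation:

org.quartz.threadPool.threadCount = 5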

Looking at the threads via the Karaf console, I found three Quartz-related threads: one scheduler thread and two worker threads. When the system is running fine, it looks like this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@d15df1
    at java.lang.Object.wait(Native Method)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:311)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING on lock=java.lang.Object@1368f2f
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING on lock=java.lang.Object@1f4a045
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

A few minutes ago, the cron jobs stopped again, and now I see this:

<snip>
"openHAB-job-scheduler_QuartzSchedulerThread" Id=85 in TIMED_WAITING on lock=java.lang.Object@10e2117
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)
<snip>
"openHAB-job-scheduler_Worker-1" Id=83 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1368f2f

"openHAB-job-scheduler_Worker-2" Id=84 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1085)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1060)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1046)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:991)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:143)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:901)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:864)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:223)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:446)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:227)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:219)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:203)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:189)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@1f4a045

Clearly, both Quartz worker threads are stuck in a Thread.sleep() while executing a command line, and the scheduler thread is blocked, waiting for an available worker thread.

Looking at the source code of DefaultExecuteResultHandler, it seems the thread is waiting "forever" for a result or exit code to come back from the OS:

097    /**
098     * Has the process exited and a result is available, i.e. exitCode or exception?
099     *
100     * @return true if a result of the execution is available
101     */
102    public boolean hasResult() {
103        return hasResult;
104    }
105
106    /**
107     * Causes the current thread to wait, if necessary, until the
108     * process has terminated. This method returns immediately if
109     * the process has already terminated. If the process has
110     * not yet terminated, the calling thread will be blocked until the
111     * process exits.
112     *
113     * @exception  InterruptedException if the current thread is
114     *             {@linkplain Thread#interrupt() interrupted} by another
115     *             thread while it is waiting, then the wait is ended and
116     *             an {@link InterruptedException} is thrown.
117     */
118    public void waitFor() throws InterruptedException {
119
120        while (!hasResult()) {
121            Thread.sleep(SLEEP_TIME_MS);
122        }
123    }

I am absolutely certain that my Python scripts are not hanging, so there seems to be a problem getting the script's exit code back to the Java thread.

You can configure thread pools in /conf/services/runtime.cfg. I believe these are the default values…

org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:discovery=5
org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:ruleEngine=5

You can check your current settings with this Karaf command…

config:list "(service.pid=org.eclipse.smarthome.threadpool)"

I recently had cron rules failing because the CalDav binding could not connect to our DavMail server. Setting the Quartz scheduler logging to DEBUG is very helpful: it gives you the time of the last successful job, which is a starting point for investigating what is causing the hang. More details in my comment here…

I can guess what discovery and thingHandler are for. Any idea what the safeCall threads do?

This was asked and answered in an ESH issue I had read. I was hoping the documentation PR would have been merged by this point…

Thanks!

I came across the ruleHandler thread pool sizing in https://github.com/eclipse/smarthome/issues/5040 and added this as well

Thread pool settings in services.cfg:

# Configuration of thread pool sizes
#was  org.eclipse.smarthome.threadpool:thingHandler=5
org.eclipse.smarthome.threadpool:thingHandler=10
org.eclipse.smarthome.threadpool:discovery=5
#was org.eclipse.smarthome.threadpool:safeCall=10
org.eclipse.smarthome.threadpool:safeCall=20
#default seems to be 5 -- https://github.com/eclipse/smarthome/issues/5040
org.eclipse.smarthome.threadpool:ruleHandler=20

Unfortunately, less than 5 hours later, my cron rules stopped again. It seems a resource shortage on the ESH side is not the cause of the problem. I also confirmed that no Python process was running when the Quartz scheduler was blocked, which confirms that my scripts had ended.

This should be ruleEngine, not ruleHandler.
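
In other words, the last line of your snippet would become:

org.eclipse.smarthome.threadpool:ruleEngine=20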

The logs you posted point to executeCommandLine hanging. IIRC, its calls can be logged through the org.eclipse.smarthome.io.net.exec logger. If you post the lines from your rules where you use executeCommandLine, maybe we could spot something.
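
For reference, that would be:

 log:set DEBUG org.eclipse.smarthome.io.net.exec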

You should keep the timeouts. They should prevent executeCommandLine from hanging.

After several days of experimentation, I conclude that the timeout in executeCommandLine is triggering my problem.

When using a timeout, cron-based rules stop executing after some hours. When this happens, I see that the Quartz scheduler thread is waiting in blockForAvailableThreads.

Using the "threads" command in the Karaf console, I see:

"openHAB-job-scheduler_QuartzSchedulerThread" Id=83 in TIMED_WAITING on lock=java.lang.Object@1d68a94
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)

and the scheduler thread remains stuck in that state until I restart openHAB.

I tried these workarounds:

  • Increase the openHAB thread pool sizes. Did not help; I have no indication that thread availability on the ESH/OH side has any influence.
  • Increase the number of Quartz worker threads: no solution, it just took somewhat longer until the Quartz scheduler locked up again.
  • Increase the interval between the executeCommandLine cron rules. Initially my cron rules were running every 5 seconds with a 5-second timeout, and I thought this could be the cause. Increasing the interval to 10 or 15 s did not solve it.

I also learned how to predict the failure by observing the delay between the scheduled time and the actual execution time, using:

 log:set DEBUG org.quartz.core

It is hard to see whether the script-based cron rules that run frequently (every x seconds) are launched on time. But I have other cron rules that run every minute; these just update a couple of OH Items, with no scripts involved.
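
For anyone trying to reproduce this: such a once-a-minute "canary" rule is trivial. A sketch with invented names (the cron expression matches the job shown in the logs below):

rule "Scheduler canary"
when
    Time cron "20 0/1 * * * ?"    // second 20 of every minute
then
    // just touch an Item so the firing shows up in events.log and in the Quartz DEBUG log
    SchedulerCanary.postUpdate(new DateTimeType())
end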

When all is fine, I typically see a delay of milliseconds between schedule and execution:

2018-06-29 13:37:20.005 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:38:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:39:20.003 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

If I wait a bit, I observe a delay of about 4 seconds:

2018-06-29 13:46:24.423 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:47:24.626 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:48:24.905 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:49:24.540 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

increasing to more than 9 seconds somewhat later:

2018-06-29 15:05:29.026 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:06:29.757 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:07:29.445 [DEBUG] [org.quartz.core.JobRunShell         ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?

Plot: (image omitted; it showed the execution delay growing over time, as in the log excerpts above)

When I leave out the timeout in the executeCommandLine() call, all cron tasks keep executing with less than 170 ms delay versus the planned time, even after more than 72 hours.

This is what I observe. I have no real explanation/understanding.

In the coming weeks, I will try to reproduce this problem on a clean install on another Raspberry Pi. Until then, this phenomenon might still be due to other fiddling I did on my system while experimenting with OH.