After several days of experimentation, I conclude that the timeout in executeCommandLine is triggering my problem.
When using a timeout, cron based rules stop executing after some hours. When this happens, I see that the quartz scheduler thread is waiting in blockForAvailableThreads.
Using the “threads” command in the karaf console, I see:
"openHAB-job-scheduler_QuartzSchedulerThread" Id=83 in TIMED_WAITING on lock=java.lang.Object@1d68a94
at java.lang.Object.wait(Native Method)
at org.quartz.simpl.SimpleThreadPool.blockForAvailableThreads(SimpleThreadPool.java:452)
at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:263)
and the Scheduler Thread remains stuck in that status until I restart openHAB.
I tried these workarounds
- increase openHAB thread pool size. Did not help. I have no indication that resource (threads) availability at ESH/OH side has any influence.
- increase number of quartz worker threads: no solution, it just took some more time until the quartz scheduler locked up again
- increase interval between executeCommandLine cron rules. Iniitially my cron rules were running every 5 seconds, with a 5 seconds timeout. I thought this could be the cause. Increasing the interval to 10 or 15s did not bring a solution.
I also learned how to predict system failure by observing the delay between the planned time and the actual execution time using
log:set DEBUG org.quartz.core
It is hard to see whether the script-based cron rules that run often (every x seconds) are launched on time. But I have other cron rules that run every minute. These latter rules are just updating a couple of OH items, no scripts involved.
When all is fine, I typically see a delay of milliseconds between schedule and execution:
2018-06-29 13:37:20.005 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:38:20.003 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:39:20.003 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
If I wait a bit, I observe a delay of about 4 seconds
2018-06-29 13:46:24.423 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:47:24.626 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:48:24.905 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 13:49:24.540 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
increasing to more than 9 seconds somewhat later
2018-06-29 15:05:29.026 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:06:29.757 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
2018-06-29 15:07:29.445 [DEBUG] [org.quartz.core.JobRunShell ] - Calling execute on job DEFAULT.thermostaatkraan.rules#countdownZithoekBar#20 0/1 * * * ?
Plot:
When leaving out the timeout in the executeCommandLine() call, all cron tasks keep executing with less than 170msec delay versus planned, even after more than 72 hours.
This is what I observe. I have no real explanation/understanding.
In coming weeks, I will try to reproduce this problem on a clean install, on another Raspberry Pi. Until then, this phenomenon might be due to other fiddling I did on my system while experimenting in OH.