Astro: suddenly stops working after a few days of running

Hi,
I have problems with the Astro binding. First, my environment:

Release:        debian 9.9
Codename:       stretch
openhab2             2.5.0~S1592-1
Astro Binding        2.5.0.201905160403
oracle-java8-installer           8u201-1~webupd8~1                 all          Oracle Java(TM) Development Kit (JDK) 8
oracle-java8-set-default         8u201-1~webupd8~1                 all          Set Oracle JDK 8 as default Java

The Astro binding works for a few days (2-4 days) and then suddenly stops working at exactly midnight. This means that most of my time-based (night/day) or sun-based rules (blinds) stop working then as well.

As you can see in my openhab.log:

2019-06-26 18:58:30.821 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model ‘astro.items’
2019-06-26 18:58:41.430 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model ‘astro.things’
2019-06-26 18:59:04.051 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-26 18:59:04.250 [INFO ] [o.internal.handler.AstroThingHandler] - Scheduled Positional job astro:sun:home every 120 seconds
2019-06-26 18:59:04.605 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home
2019-06-26 18:59:04.639 [INFO ] [o.internal.handler.AstroThingHandler] - Scheduled Positional job astro:moon:home every 120 seconds
2019-06-26 18:59:07.413 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-26 18:59:07.518 [INFO ] [o.internal.handler.AstroThingHandler] - Scheduled Positional job astro:sun:home every 120 seconds
2019-06-26 18:59:22.472 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home
2019-06-26 18:59:22.477 [INFO ] [o.internal.handler.AstroThingHandler] - Scheduled Positional job astro:moon:home every 120 seconds
2019-06-27 00:00:30.118 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-27 00:00:30.406 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home
2019-06-28 00:00:30.173 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-28 00:00:30.425 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home
2019-06-29 00:00:30.211 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-29 00:00:30.443 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home
2019-06-30 00:00:30.240 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:sun:home
2019-06-30 00:00:30.463 [INFO ] [enhab.binding.astro.internal.job.Job] - Scheduled Astro event-jobs for thing astro:moon:home

the binding doesn’t throw any errors, but when you look at the events, e.g. the Azimuth item, it stops at midnight. It seems there is already a problem at around 2019-06-29 22:43:08.184: there are no further updates between that last update and midnight, although there should be one every 2 minutes.

2019-06-29 22:21:08.646 [vent.ItemStateChangedEvent] - Azimuth changed from 319.1551039856148 to 319.56487989521764
2019-06-29 22:23:08.323 [vent.ItemStateChangedEvent] - Azimuth changed from 319.56487989521764 to 319.97585511014574
2019-06-29 22:25:11.338 [vent.ItemStateChangedEvent] - Azimuth changed from 319.97585511014574 to 320.38800898502905
2019-06-29 22:27:09.286 [vent.ItemStateChangedEvent] - Azimuth changed from 320.38800898502905 to 320.80132735484256
2019-06-29 22:29:09.501 [vent.ItemStateChangedEvent] - Azimuth changed from 320.80132735484256 to 321.2158267057504
2019-06-29 22:31:08.253 [vent.ItemStateChangedEvent] - Azimuth changed from 321.2158267057504 to 321.6314955751383
2019-06-29 22:33:10.280 [vent.ItemStateChangedEvent] - Azimuth changed from 321.6314955751383 to 322.0483564374804
2019-06-29 22:35:09.766 [vent.ItemStateChangedEvent] - Azimuth changed from 322.0483564374804 to 322.4663764973794
2019-06-29 22:37:09.271 [vent.ItemStateChangedEvent] - Azimuth changed from 322.4663764973794 to 322.88558809565507
2019-06-29 22:39:11.875 [vent.ItemStateChangedEvent] - Azimuth changed from 322.88558809565507 to 323.3175699223158
2019-06-29 22:41:11.285 [vent.ItemStateChangedEvent] - Azimuth changed from 323.3175699223158 to 323.7378172405539
2019-06-29 22:43:08.184 [vent.ItemStateChangedEvent] - Azimuth changed from 323.7378172405539 to 324.1503175199555
2019-06-30 00:00:31.740 [vent.ItemStateChangedEvent] - Azimuth changed from 324.1503175199555 to 341.3744795387063
2019-06-30 07:53:14.991 [vent.ItemStateChangedEvent] - Azimuth changed from NULL to 80.09736007521043
2019-06-30 07:53:19.140 [vent.ItemStateChangedEvent] - Azimuth changed from 80.09736007521043 to 80.0986899411146
2019-06-30 07:53:31.134 [vent.ItemStateChangedEvent] - Azimuth changed from 80.0986899411146 to 80.11806687085294
2019-06-30 07:53:31.923 [vent.ItemStateChangedEvent] - Azimuth changed from 80.11806687085294 to 80.11913271053953

A restart of openHAB at 07:53 solves this problem (until the next malfunction).
Does anybody have an idea what is happening? Any help is highly appreciated. Thanks in advance and have a nice day.

I increased the log level for this binding to debug:
log:set debug org.openhab.binding.astro
so I can hopefully provide further information about this behaviour in the next days.


Do you have a lot of Timers or cron-triggered Rules? Astro and these all share the same thread pool, a pool of size 2, so you can only have two of them running at the same time. If you have long-running Timers or cron-triggered Rules, the threads will be occupied for an extended period of time, increasing the likelihood of starving out the rest.
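
For illustration only (the schedule below is made up, not taken from your setup), a rule like the following sketch would tie up one of the two scheduler threads for two minutes every time it fires; a second rule doing the same would occupy the other thread, and Astro's jobs could no longer run on time:

rule "Long running cron rule (hypothetical)"
when
    Time cron "0 0 0 * * ?"    // fires at midnight, like Astro's daily recalculation
then
    // Anything that blocks here - Thread::sleep, a slow executeCommandLine,
    // a while loop waiting for an Item - holds one of the two threads
    // for the whole duration.
    Thread::sleep(120000)
end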

There was a bug in Astro many moons ago (pun intended) where the binding would just up and decide not to calculate the next day’s times or events, but that has long since been fixed, I believe. You are running a snapshot, so you should have that fix.

Do you see any other errors? I’m particularly interested in any errors that mention Quartz.


Hi Rich,
first of all, thanks for your time and help.

Yes, I have Timers and crons, and yes, I found some “quartz” messages like these, but the last one is from the 27th of June:

2019-06-26 18:43:40.071 [ERROR] [org.quartz.core.JobRunShell         ] - Job DEFAULT.2019-06-26T18:43:40.068+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
2019-06-26 18:43:40.241 [ERROR] [org.quartz.core.ErrorLogger         ] - Job (DEFAULT.2019-06-26T18:43:40.068+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
org.quartz.SchedulerException: Job threw an unhandled exception.
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) ~[?:?]
2019-06-26 18:51:00.522 [ERROR] [org.quartz.core.JobRunShell         ] - Job DEFAULT.2019-06-26T18:51:00.503+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
2019-06-26 18:51:01.303 [ERROR] [org.quartz.core.ErrorLogger         ] - Job (DEFAULT.2019-06-26T18:51:00.503+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
org.quartz.SchedulerException: Job threw an unhandled exception.
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) ~[?:?]
2019-06-27 18:32:28.153 [ERROR] [org.quartz.core.JobRunShell         ] - Job DEFAULT.2019-06-27T18:32:28.139+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
2019-06-27 18:32:28.199 [ERROR] [org.quartz.core.ErrorLogger         ] - Job (DEFAULT.2019-06-27T18:32:28.139+02:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
org.quartz.SchedulerException: Job threw an unhandled exception.
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [180:org.openhab.core.scheduler:2.5.0.201905160304]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) ~[?:?]

Do you think this can cause my Astro malfunction nonetheless? The time difference between the last Quartz error and my Astro failure is more than 2 days. Is there a chance to find out the root cause of my problem? Can I increase the thread pool to a higher amount?

But besides my above-mentioned problem, can you interpret my Quartz messages - next problem, sorry :wink: What can cause these errors?

They can be very difficult to pinpoint, mostly because the code is running outside of any rule context.
Maybe you can guess the culprit from the time it happens - which timer would you expect to be running around then? Then you can add some tell-tale logInfo calls to see what happens inside. It’s usually an unexpected, often null, value.
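
As a sketch of what that tell-tale logging could look like (Sun_Elevation, Blinds and the "blinds" logger name are placeholders, not your actual items), logging the inputs at the top of the timer body usually reveals the unexpected value:

rule "Blind timer with tell-tale logging (sketch)"
when
    Item Sun_Elevation changed
then
    createTimer(now.plusMinutes(5)) [ |
        // Log everything the timer is about to use, before using it.
        logInfo("blinds", "Timer fired, elevation = " + Sun_Elevation.state)
        if (Sun_Elevation.state == NULL) {
            logWarn("blinds", "Elevation is NULL, doing nothing")
            return;
        }
        Blinds.sendCommand(DOWN)
    ]
end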

The one you can see failing is not the cause of your Astro problem - it has broken and released the thread.
There might be some lessons to learn though about how to write safe timer code. Rich is suggesting that you might have some other timer, perhaps even this faulty one under different circumstances, behaving badly and looping or suchlike to block the thread.
Increasing the thread pool is not the cure.
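
The cure is rather to make sure no rule or Timer holds a scheduler thread for long. As a sketch (Valve and the schedule are made-up examples, and the two rules are alternative versions of the same job), a blocking sleep can be replaced by a Timer that hands the thread back immediately:

// Blocking version - holds a scheduler thread for ten minutes:
rule "Watering (sleep version)"
when
    Time cron "0 0 6 * * ?"
then
    Valve.sendCommand(ON)
    Thread::sleep(600000)
    Valve.sendCommand(OFF)
end

// Timer version - the rule returns immediately and frees the thread:
rule "Watering (timer version)"
when
    Time cron "0 0 6 * * ?"
then
    Valve.sendCommand(ON)
    createTimer(now.plusMinutes(10)) [ |
        Valve.sendCommand(OFF)
    ]
end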

Be aware that it is possible to provoke timer-related errors by editing rules. When you save an edited rules file, any timers those rules previously spawned are still running but get cut off from the rule that created them. These orphans continue on to their appointed time and run their code - which very often fails due to a missing context, typically a lost/null variable.
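
One defensive pattern (again just a sketch with made-up names - myTimer, Presence and Lights are not your items) is to keep the timer handle in a global var, cancel any outstanding timer before creating a new one, and re-check the conditions inside the timer body instead of relying on captured state:

var Timer myTimer = null    // hypothetical global timer handle

rule "Defensive timer (sketch)"
when
    Item Presence changed to OFF
then
    myTimer?.cancel    // don't stack a new timer on top of an old one
    myTimer = createTimer(now.plusMinutes(10)) [ |
        // Re-read the world at execution time; if conditions have changed
        // (or this timer was orphaned by a file edit), fail softly.
        if (Presence.state == OFF) {
            Lights.sendCommand(OFF)
        } else {
            logInfo("presence", "Conditions changed, timer does nothing")
        }
        myTimer = null
    ]
end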

@rlkoshak, @rossko57
I am still fighting this error and by now I have no further idea how to fix it. I changed all cron triggers (19 crons) so that they should not fire at the same second. I converted all long-running “sleeps” to Timers, as you recommended in Rich's design patterns. But the error remains.
There is nothing to see in openhab.log; the Astro binding dies silently without any message. I also noticed that my OneWire binding dies at the same time - no more updates from my temperature sensors. A restart of openHAB solves this for one day :frowning:
How can I find the root of this problem? You mentioned that there is a thread pool which allows only two concurrent timers. Is there a chance to monitor this pool, just to see when and which timers fill the queue? Or do you have any other hint for me? I’m quite desperate by now.

Thanks a lot in advance.

For the hanging bindings see:

The error log looks like a rule that is scheduled and crashes, or maybe has a typo. See if you can add some logging to the rules. It could be that this is unrelated to the Astro binding stopping.