Thread scheduled using scheduleWithFixedDelay() never executing under 2.5 snapshot

BobA · May 17, 2019, 8:28pm

Hi. I recently got the Eclipse IDE working again for 2.5 development, and started doing some work on the Lutron binding. However, I soon noticed a problem with the current binding code running under 2.5. A routine which should be periodically executed by a thread scheduled using scheduleWithFixedDelay() is never being called. However, I can see the same code working on my 2.4 production system.

The problem schedule call is in IPBridgeHandler.java at line 219:
(https://github.com/openhab/openhab2-addons/blob/master/bundles/org.openhab.binding.lutron/src/main/java/org/openhab/binding/lutron/internal/handler/IPBridgeHandler.java#L219)

private ScheduledFuture<?> keepAlive;

[…]

keepAlive = scheduler.scheduleWithFixedDelay(this::sendKeepAlive, heartbeatInterval, heartbeatInterval, TimeUnit.MINUTES);

[…]

private void sendKeepAlive() {
    […]
}

When I add logger calls or use the debugger, I can see that the scheduleWithFixedDelay call succeeds. At least so far that I get back a good java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask. The heartbeatInterval variable is set to 5 (minutes), and I confirmed immediately after this call that keepAlive.getDelay(TimeUnit.SECONDS) was 299 (seconds). But sendKeepAlive() is never called, and I can’t figure out why.

I tried running the binding both in the Eclipse IDE set up for 2.5 development with bnd and under the latest 2.5 snapshot build (or the latest as of a day or two ago, anyway) with the same results. That same bit of code has been working fine under 2.4, though.

I checked to see if the thingHandler thread pool was out of threads, but it looks fine. I set org.eclipse.smarthome.threadpool:thingHandler to 10, but there are only one or two threads running in it when I look.

Maybe I’m missing something obvious, but I can’t really think of what else to look at. Does anyone have any suggestions? Have there been any changes in 2.5 that would cause this sort of problem?

J-N-K · May 18, 2019, 6:28am

I doubt that it is connected the migration, nothing has been changed then. I can‘t provide any more details but only confirm that scheduleWithFixedDelay is working in S1592 in other bindings.

BobA · May 24, 2019, 9:43pm

Thanks. I haven’t had much time to spend lately on looking in to this, but there does appear to be a problem. It looks like when threads in the thread pool time out from being idle, new ones aren’t being started for some reason. At least in my environment.

The Lutron binding’s IPBridgeHandler uses the ScheduledExecutorService “scheduler” from the BaseThingHandler class for task scheduling. This should end up using the “thingHandler” thread pool.

When IPBridgeHandler starts up, it starts one task in initialize() using schedule() to do some async initialization. This soon exits. Then it starts a long-running messageSender task using schedule(), and a periodic heartbeat task called keepAlive using scheduleWithFixedDelay(). There can be some other tasks scheduled in certain circumstances, but for most of the time in normal operation, it will be using just one active thread (messageSender) and have one periodic task queued (keepAlive).

If I set keepAlive to execute every 30 or 60 seconds, the thread pool size will stay at a size of 3 or 4, with one thread active and one task queued, and keepAlive will always be executed as expected. However, if I set it to run at intervals of 66 seconds or more, the pool size will drop to 1 before the first keepAlive execution because of the thread idle timeout (which is hard-coded at 65 seconds in ThreadPoolManager), and the keepAlive task will never be executed. It seems like once the thread pool size shrinks, it can’t grow again for some reason, and since one thread is already active, another task won’t be run.

Adding some logging of the thread pool and task states seems to prove this out. This is what happens when I set keepAlive to execute every 120 seconds:

00:58:05.080 [Thread-23] DEBUG o.o.b.l.i.handler.IPBridgeHandler:345 - Keepalive job status: delay -45 done false cancelled false

00:58:05.095 [Thread-23] DEBUG o.o.b.l.i.handler.IPBridgeHandler:346 - Scheduler stat: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@222fc61c[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false

You can see that the thread pool size has dropped to 1, with one active thread and 1 queued task. At the same time, the value returned by keepAlive.getDelay(TimeUnit.SECONDS) has gone negative, meaning it is overdue to run. It never will.

Setting keepAlive to run every 30 seconds results in successful execution and thread pool stats that look like this:

01:44:49.067 [Thread-23] DEBUG o.o.b.l.i.handler.IPBridgeHandler:346 - Scheduler stat: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@13b6f11d[Running, pool size = 4, active threads = 1, queued tasks = 1, completed tasks = 27] shut: false term: false

This test was running under the Eclipse IDE on Windows 10 using the Oracle 1.8.0_212 JRE. However, I noticed the same problem with a recent 2.5 snapshot build running on Windows.

I don’t see any memory exceptions in the log, or any other indication of a condition that should cause attempts to create new threads to fail.

Does anyone have any ideas?

mhilbush · May 25, 2019, 10:55am

I don’t have any ideas ATM, but I believe I’ve seen similar behavior a couple times on a 2.5 snapshot build. But, I didn’t have any logging in place to be able to confirm what you show above. I’ll monitor a little more closely to see if I can reproduce. I run with a pretty high number of ThingHandler threads (50), so maybe I’m masking the problem.

BobA · July 26, 2019, 2:35pm

I’ve finally been able to do a bit more testing of this issue. I created a test binding that reproduces the problem in a much simpler context. All it does is start a long running task with scheduler.schedule(), and then schedule a periodic task with scheduler.scheduleWithFixedDelay(). The binding has one thing threadtest with one parameter interval which sets the time interval for the periodic task in seconds. It also has one channel poolStats which triggers a dump of task info and scheduler pool stats when any string is sent to it. Since the channel is defined as a <command> channel, this is easily done by clicking on its button under Control in Paper UI.

If interval is set to 66 or more seconds, the periodic task will never be executed. This is because, as I noted above, the thread pool size drops to 1, and that one thread is presumably tied up executing the long running task. More threads are never added, and the scheduled periodic task is never executed.

If interval is less than 65 seconds, idle threads will not time out, and the problem will not occur. Interestingly, it seems that if there is no long running task to take up a thread the problem will also not occur.

The code for the test binding is available on github here

A jar file containing the test binding is available here

Here is sample log output:

10:07:02.049 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:60 - Scheduling periodic job with interval 90
10:07:02.050 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 89 done false cancelled false
10:07:02.050 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 3, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:07:25.586 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 66 done false cancelled false
10:07:25.587 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 3, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:07:40.615 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 51 done false cancelled false
10:07:40.615 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 3, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:07:50.839 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 41 done false cancelled false
10:07:50.839 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 3, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:08:03.557 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 28 done false cancelled false
10:08:03.557 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 3, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:08:17.951 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 14 done false cancelled false
10:08:17.951 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:08:27.246 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay 4 done false cancelled false
10:08:27.246 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:08:34.571 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay -2 done false cancelled false
10:08:34.571 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false
10:08:43.462 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:94 - Periodic job status: delay -11 done false cancelled false
10:08:43.462 [safeCall-2] INFO  o.o.b.t.i.ThreadTestNewHandler:96 - Scheduler status: org.eclipse.smarthome.core.internal.common.WrappedScheduledExecutorService@53c8289d[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1] shut: false term: false

Lolodomo · July 26, 2019, 4:35pm

You should probably fill an issue in Git with all these information.

BobA · July 26, 2019, 7:33pm

Thanks. I just wanted to see if anyone could point out something I was doing wrong before I did that. I’ve opened issue #942 under openhab-core for this.

hilbrand · July 26, 2019, 9:16pm

It’s good you created the issue, because this is something that needs to be investigated (if not know). However the scheduler should not be used for long running tasks. Besides the problem you encountered, it’s a limited thread pool shared by all bindings for short running tasks. So if you use it for long running tasks it blocks other bindings from using the scheduler. For long running tasks the pattern is to start and manage your own thread.

BobA · July 27, 2019, 8:00pm

Thanks. That makes sense. I didn’t originally write that bit of code, but I’ll look at updating it to start a thread for the message sender task rather than using the scheduler. That will have the added benefit of working around the scheduler thread pool issue.

I suspect there are other bindings that use the scheduler for long-running tasks too, thought, since I seem to recall that at one point the development docs stated that bindings should never start their own threads and should instead use the scheduler.