Consistent 100% CPU use of safeCall-queue thread

  • Platform information:
    • Hardware: Intel NUC. Intel(R) Core™ i5-8259U CPU @ 2.30GHz, 16 GB RAM
    • OS: Linux 5.15.0-58-generic #64-Ubuntu SMP x86_64 Ubuntu 22.04.1 LTS
    • Java Runtime Environment: OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing)
    • openHAB version: 3.4.1

I’ve noticed that my openHAB is often above 100% CPU usage. I’ve got a lot going on, so I haven’t been excessively concerned. But then I ran the following to figure out whether any particular thread is using all the CPU, and voila, yes, it most certainly is a single thread completely pegging a single CPU:

top -n 1 -H -p <pid of main openHAB process>

I typically get output such as:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                   
  56222 openhab   20   0 8925436   3.6g  24212 R  99.9  23.2  93:18.10 safeCall-queue   

Sometimes the load is split with upnp-remote-queue or items-queue instead. Okay, so what on earth is that thread doing? Over in the Karaf console, I run threads | grep -A 5 "queue". Every time I execute it, I get output like this (filtering out the grep pipeline’s own thread):

"safeCall-queue" Id=8239 in RUNNABLE
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
    at org.openhab.core.common.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:193)
    at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
--
"upnp-remote-queue" Id=14699 in RUNNABLE
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
    at org.jupnp.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:194)
    at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)

Okay, before going to look at that code, I first ran log:set TRACE org.openhab.core.common.QueueingThreadPoolExecutor to see whether something I’m doing is pushing a ton of work through these threads. The only line I get (and I do get a lot of them) is [TRACE] [re.common.QueueingThreadPoolExecutor] - Executing queued task of thread pool 'items'. Hmm, that’s not the thread that’s routinely causing problems. Looking at the code, the line it always seems to be stuck on is final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);. It appears the thread is spinning in that poll while waiting for a task to become available, for quite a long time relative to the number of tasks actually coming through.

What’s weird to me is that the code implies that if no task arrives within a certain amount of time, the thread will exit until another task is queued. But (a) I never see Executing queued task of thread pool 'safeCall'. in the log, and (b) the thread ID never changes, neither in top nor in Karaf. So that’s weird. Is this possibly an issue because I’m running Java 17? Perhaps it’s throwing InterruptedException a lot, which doesn’t get logged, and it just spins around the loop and tries again?
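For reference, here is roughly the shape of that loop as I read it (my paraphrase, not the exact openHAB source; apart from taskQueue, threadPoolName and the trace message, the details are mine):

// Paraphrased: the "safeCall-queue" thread drains the backlog with a timed poll
// and is supposed to die when nothing shows up for 2 seconds.
while (true) {
    try {
        final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
        if (runnable == null) {
            break; // nothing queued for 2 s: the queue thread should exit
        }
        logger.trace("Executing queued task of thread pool '{}'.", threadPoolName);
        // ... hand the task over to a free pool thread (details elided here) ...
    } catch (InterruptedException e) {
        // swallowed silently, so an interrupt storm would be invisible in the logs
    }
}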

UPDATE

Ok, this has got to be a Java 17 thing, no? I’ve modified the code, and I’ve confirmed my modifications are running because enabling logging shows many hits from the items pool. I can clearly see my upnp-main-queue thread chewing 100% CPU and stuck inside LinkedTransferQueue.poll, yet I get zero logging from that thread, so it has clearly been in that method for longer than 2 seconds. Here is my modified code:

logger.trace("About to start polling in thread pool '{}'", threadPoolName);
 Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
("Polling complete in thread pool '{}'; got {}", threadPoolName, runnable);

...

catch (InterruptedException e) {
  ("Interrupted waiting for pool of thread pool '{}'.", threadPoolName, e);
}

I’ve also played around with LinkedTransferQueue a bit in isolated tests, and it doesn’t normally chew up CPU inside poll.
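By “isolated tests” I mean something along these lines (a minimal standalone sketch; on a healthy JVM each call simply parks for about two seconds and returns null, with essentially no CPU use):

import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

public class PollTest {
    public static void main(String[] args) throws InterruptedException {
        LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<>();
        // Repeatedly poll an empty queue with the same 2 s timeout the executor uses.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            Runnable task = queue.poll(2, TimeUnit.SECONDS);
            System.out.printf("poll %d returned %s after %d ms%n",
                    i, task, (System.nanoTime() - start) / 1_000_000);
        }
    }
}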

UPDATE

:man_facepalming: JUPnP has its own QueueingThreadPoolExecutor class, which I hadn’t modified or enabled logging for. I’ve done so now, and am waiting for the condition to recur. In the meantime, I’ve confirmed that the taskQueue of the affected thread pool executor thinks it is empty, at least as far as its public interface reports. Java reflection wouldn’t let me read the transient volatile head and tail fields (jdk17u-dev/LinkedTransferQueue.java at master · openjdk/jdk17u-dev · GitHub) from my JSR223 script.
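To be concrete, this is the sort of thing I was trying, translated to plain Java (a sketch only; the field names come from the linked JDK source):

import java.lang.reflect.Field;
import java.util.concurrent.LinkedTransferQueue;

public class PeekQueueInternals {
    public static void main(String[] args) throws Exception {
        LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<>();
        Field head = LinkedTransferQueue.class.getDeclaredField("head");
        // On Java 17 this throws InaccessibleObjectException, because java.base
        // does not open java.util.concurrent for deep reflection. It would need
        // the JVM to be started with:
        //   --add-opens java.base/java.util.concurrent=ALL-UNNAMED
        head.setAccessible(true);
        System.out.println("head = " + head.get(queue));
    }
}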

UPDATE

AHA! I was able to break out of this state by running the following (Ruby) script, which occupies every thread in the pool and forces a new task to be queued. It really feels like LinkedTransferQueue has an issue with timing out sometimes.

11.times do
  org.openhab.core.common.ThreadPoolManager.getPool("safeCall").submit do
    logger.info("in thread #{Thread.current.name}")
    sleep 2
    logger.info("and done in thread #{Thread.current.name}")
  end
end
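For anyone not running the Ruby scripting add-on, the same trick looks roughly like this in Java (a hypothetical sketch; it would have to run inside the openHAB runtime, e.g. from a scripting engine, so that ThreadPoolManager resolves the same pool, and the class and method names are mine):

import java.util.concurrent.ExecutorService;

import org.openhab.core.common.ThreadPoolManager;

public class UnstickSafeCall {
    public static void unstick() {
        ExecutorService pool = ThreadPoolManager.getPool("safeCall");
        // Submit more tasks than the pool has worker threads, so at least one
        // task ends up on the backing queue and wakes the stuck queue thread.
        for (int i = 0; i < 11; i++) {
            final int n = i;
            pool.submit(() -> {
                System.out.println("in task " + n + " on " + Thread.currentThread().getName());
                try {
                    Thread.sleep(2000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("done with task " + n);
            });
        }
    }
}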

UPDATE

I changed the logging a bit to log exactly how long poll took if it took longer than 3s. I let it get into this state again, then broke it out with the above code. I have proof now that LinkedTransferQueue is not adhering to its contract. I’ve filed a bug report at bugreport.java.com with the relevant details (internal id 9074727).

long start = System.nanoTime();
final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
long end = System.nanoTime();
long duration = end - start;
if (duration > 3000000000L) {
    logger.warn("Polling the task queue in thread pool '{}' took {}ms",
            threadPoolName, duration / 1000000L);
}

Log:

2023-01-27 08:05:17.338 [WARN ] [re.common.QueueingThreadPoolExecutor] - Polling the task queue in thread pool 'safeCall' took 1377089ms

cc @Kai @J-N-K. I’m not sure how widespread this issue is, or whether it only affects Java 17, or what. QueueingThreadPoolExecutor hasn’t changed in quite some time. But it may be related to some of the “takes a long time to shut down openHAB” issues such as JuPnP thread exhaustion/contention on upnp-async · Issue #1886 · openhab/openhab-core · GitHub. I normally have to wait 2 minutes as well, but right after I had “fixed” my queues so they were no longer stuck, my shutdown was almost immediate. We could work around it more permanently by using a different data structure for the task queue.
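For example, something along these lines (just a sketch, not a tested patch; I’m assuming the taskQueue field can be declared as a plain BlockingQueue):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the idea only: keep the same timed-poll loop, but back it with a
// LinkedBlockingQueue, whose timed poll() waits on a plain lock/Condition
// rather than LinkedTransferQueue's spin-then-park matching loop.
public class QueueSwapSketch {
    private final BlockingQueue<Runnable> taskQueue = new LinkedBlockingQueue<>();

    void queue(Runnable task) {
        taskQueue.offer(task);
    }

    void drainLoop() throws InterruptedException {
        while (true) {
            Runnable task = taskQueue.poll(2, TimeUnit.SECONDS);
            if (task == null) {
                break; // no work for 2 seconds: let the queue thread exit
            }
            task.run();
        }
    }
}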

Have you tried to dial up the pool sizes? There are some really weird things that happen when the pool is overrun on JuPnP.

As a test, I’d let it free spin with -1 on both and see if it helps.

Thanks, that’s a good workaround. It would definitely avoid the problem if the taskQueue never has to be used. And I’m running on a good-sized NUC, so I’m not too worried about resources.

The problem with upnp, if “problem” is even the right word, is that jupnp has to listen to and deal with every upnp message on your network, not just the things you’ve configured. The long shutdown is very telltale of this. What I found a few years ago was that, because of thread exhaustion due to a large number of devices, upnp packets were missed and discarded because they had nowhere to go. It very literally couldn’t create a listener to wait for the replies. My issue was Sonos back then. When OH was shut down, it was still listening for replies from devices. Because some of those were lost, the threads hung and didn’t dispose properly until everything finally just timed out. It sounds like your problem is similar. If you have the resources, I always suggest setting it to -1 to avoid any chance of issues. If that makes your problem go away, then in reality there isn’t anything wrong with OH. OH is tuned for “most users”; you just need to dial that up a bit for your use case.