- Platform information:
- Hardware: Intel NUC. Intel(R) Core™ i5-8259U CPU @ 2.30GHz, 16 GB RAM
- OS: Linux 5.15.0-58-generic #64-Ubuntu SMP x86_64 Ubuntu 22.04.1 LTS
- Java Runtime Environment: OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing)
- openHAB version: 3.4.1
I’ve noticed that my openHAB is often above 100% CPU usage. I’ve got a lot going on, so I haven’t been overly concerned. But then I ran the following to figure out whether any particular thread was using all the CPU, and voilà, yes, it most certainly is a single thread completely pegging a single CPU:
top -n 1 -H -p <pid of main openHAB process>
I typically get output such as:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56222 openhab 20 0 8925436 3.6g 24212 R 99.9 23.2 93:18.10 safeCall-queue
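As a cross-check on the top -H view, the same per-thread CPU accounting is available from inside the JVM via the standard java.lang.management ThreadMXBean API. This is a minimal standalone sketch (my own code, not part of openHAB; class and method names are mine), assuming thread CPU time measurement is supported and enabled, as it is by default on HotSpot:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.lang.management.ThreadInfo;

public class ThreadCpuTop {
    /** Returns the name of the live thread with the highest accumulated CPU time. */
    static String hottestThread() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        String hottest = null;
        long max = -1;
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id); // -1 if the thread died or measurement is unsupported
            ThreadInfo info = mx.getThreadInfo(id);
            if (cpu > max && info != null) {
                max = cpu;
                hottest = info.getThreadName();
            }
        }
        return hottest;
    }

    public static void main(String[] args) {
        System.out.println("Hottest thread: " + hottestThread());
    }
}
```

Run inside the affected JVM (e.g. from a script engine), this would name the hot thread directly without cross-referencing native thread IDs from top.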
Sometimes the load will instead be split with upnp-remote-que or items-queue. Okay, so what on earth is that thread doing? Over in the Karaf console, I run threads | grep -A 5 "queue". Every time I execute it, I get output like this (filtering out the pipe-grep thread itself):
"safeCall-queue" Id=8239 in RUNNABLE
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
at org.openhab.core.common.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:193)
at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
--
"upnp-remote-queue" Id=14699 in RUNNABLE
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
at org.jupnp.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:194)
at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
Okay, so before going to look at that code, I first do log:set TRACE org.openhab.core.common.QueueingThreadPoolExecutor to see if I can figure out whether it’s something I’m doing that’s causing a ton of work to flow through these threads. The only line I get (and I do get a lot of them) is [TRACE] [re.common.QueueingThreadPoolExecutor] - Executing queued task of thread pool 'items'.
Hmm, that’s not the thread routinely causing problems. Going to look at the code, the line it always seems to be stuck on is final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);. It appears that we’re doing a spinning poll waiting for a task to become available, and spending quite a lot of time there relative to the number of tasks coming through. What’s weird to me is that the code implies that if a task doesn’t arrive within a certain amount of time, the thread will exit until another task comes. But (a) I don’t ever see any Executing queued task of thread pool 'safeCall'. in the log, and (b) the thread ID never changes, neither in top nor in Karaf. So that’s weird. Is this possibly an issue that’s happening because I’m running with Java 17? Perhaps it’s just throwing InterruptedException a lot, which doesn’t get logged, and it just spins around in the loop and tries again?
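To make the intended lifecycle concrete, here is a simplified sketch of the drainer loop that the stack trace points at (my own code, not the actual QueueingThreadPoolExecutor; I shortened the timeout for the demo): poll with a timeout, run any task received, and exit when a poll times out so the thread can die and be recreated later. If poll ever returned early without actually parking, or an InterruptedException were silently swallowed and the loop continued, this loop shape would spin at 100% CPU:

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

public class DrainerSketch {
    /** Returns the number of loop iterations before the thread would exit. */
    static int drain(LinkedTransferQueue<Runnable> taskQueue) {
        int iterations = 0;
        while (true) {
            iterations++;
            try {
                // The real code uses a 2-second timeout.
                Runnable runnable = taskQueue.poll(100, TimeUnit.MILLISECONDS);
                if (runnable == null) {
                    return iterations; // timed out: the drainer thread is meant to exit here
                }
                runnable.run();
            } catch (InterruptedException e) {
                // If this were swallowed and the loop continued, the thread would spin.
                Thread.currentThread().interrupt();
                return iterations;
            }
        }
    }

    public static void main(String[] args) {
        LinkedTransferQueue<Runnable> q = new LinkedTransferQueue<>();
        q.add(() -> System.out.println("task ran"));
        System.out.println("iterations: " + drain(q));
    }
}
```

With one queued task, the loop takes exactly two iterations: one that runs the task, and one that times out and exits, which is why the thread ID should change between occurrences of work.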
UPDATE
Ok, this has got to be a Java 17 thing, no? I’ve modified the code, and confirmed my modifications are running because enabling logging shows many hits from the items pool. I can clearly see my upnp-main-queue thread chewing 100% CPU, stuck within LinkedTransferQueue.poll, yet I get zero logging from that thread. So it has clearly been in that method for longer than 2 seconds. Here is my modified code:
logger.trace("About to start polling in thread pool '{}'", threadPoolName);
Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
logger.trace("Polling complete in thread pool '{}'; got {}", threadPoolName, runnable);
...
catch (InterruptedException e) {
    logger.trace("Interrupted waiting for poll of thread pool '{}'.", threadPoolName, e);
}
I’ve also played around with LinkedTransferQueue a bit in isolated tests, and it doesn’t seem to chew up CPU inside of poll normally.
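For reference, this is the kind of isolated check I mean: a timed poll on an empty LinkedTransferQueue should park the thread for roughly the full timeout and then return null, not spin. A standalone probe (my own code, not openHAB’s):

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

public class PollTimingProbe {
    /** Times one timed poll on an empty queue; it should block for ~timeoutMs and return null. */
    static long timedPollMillis(long timeoutMs) {
        LinkedTransferQueue<Runnable> q = new LinkedTransferQueue<>();
        long start = System.nanoTime();
        Runnable r;
        try {
            r = q.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException("unexpected interrupt", e);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (r != null) {
            throw new IllegalStateException("queue was empty, poll must return null");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + timedPollMillis(200));
    }
}
```

On a healthy JVM the elapsed time comes back very close to the requested timeout, with essentially no CPU burned while parked; the bug described here manifests as the real poll never returning at all.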
UPDATE
JUPnP has its own QueueingThreadPoolExecutor class, which I hadn’t modified or enabled logging for. I’ve done so now, and am waiting for the condition to recur. In the meantime, I’ve confirmed that the taskQueue of the affected thread pool executor thinks it is empty, at least as far as its public interface shows. Java reflection wouldn’t allow me to read the transient volatile head and tail fields from jdk17u-dev/LinkedTransferQueue.java at master · openjdk/jdk17u-dev · GitHub with my JSR223 script.
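That reflection failure is expected on Java 17: since JDK 16, strong encapsulation blocks setAccessible on java.base internals unless the JVM was started with --add-opens java.base/java.util.concurrent=ALL-UNNAMED. A standalone probe (my own code) illustrating what happens:

```java
import java.lang.reflect.Field;
import java.util.concurrent.LinkedTransferQueue;

public class ReflectionProbe {
    /**
     * Tries to read LinkedTransferQueue's private 'head' field reflectively.
     * On Java 17 with default module settings, setAccessible(true) throws
     * InaccessibleObjectException, so this returns false unless the JVM was
     * started with --add-opens java.base/java.util.concurrent=ALL-UNNAMED.
     */
    static boolean canReadHead() {
        try {
            Field head = LinkedTransferQueue.class.getDeclaredField("head");
            head.setAccessible(true); // fails under strong encapsulation
            head.get(new LinkedTransferQueue<>());
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("head readable: " + canReadHead());
    }
}
```

So to inspect the queue’s internal node chain on a live system, the JVM would have to be restarted with that --add-opens flag, which loses the stuck state.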
UPDATE
AHA! I was able to break out of this state by running the following (Ruby) script, which forces all threads in the pool to be hogged, and a new task to be queued. It really feels like LinkedTransferQueue has an issue with timing out sometimes.
11.times do
org.openhab.core.common.ThreadPoolManager.getPool("safeCall").submit do
logger.info("in thread #{Thread.current.name}")
sleep 2
logger.info("and done in thread #{Thread.current.name}")
end
end
UPDATE
I changed the logging a bit to log exactly how long poll took if it took longer than 3s. I let it get into this state again, then broke it out with the above code. I have proof now that LinkedTransferQueue is not adhering to its contract. I’ve filed a bug report at bugreport.java.com with the relevant details (internal id 9074727).
long start = System.nanoTime();
final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
long end = System.nanoTime();
long duration = end - start;
if (duration > 3000000000L) {
logger.warn("Polling the task queue in thread pool '{}' took {}ms",
threadPoolName, duration / 1000000L);
}
Log:
2023-01-27 08:05:17.338 [WARN ] [re.common.QueueingThreadPoolExecutor] - Polling the task queue in thread pool 'safeCall' took 1377089ms
cc @Kai @J-N-K. I’m not sure how widespread this issue is, or whether it only affects Java 17. QueueingThreadPoolExecutor has not changed in quite some time. But it may be related to some of the “takes a long time to shut down openHAB” issues such as JuPnP thread exhaustion/contention on upnp-async · Issue #1886 · openhab/openhab-core · GitHub. I normally have to wait 2 minutes at shutdown as well, but when I had just “fixed” my queues so they’re not stuck, my shutdown was almost immediate. We could work around this more permanently by using a different data structure.
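On that last point: the drainer loop appears to only need the BlockingQueue contract (offer plus timed poll), so swapping LinkedTransferQueue for LinkedBlockingQueue looks plausible as a workaround. This is a hypothetical sketch of such a swap (my own code and names, not a vetted openHAB patch), written against the BlockingQueue interface so either implementation can be dropped in:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

public class QueueSwapProbe {
    /** Runs queued tasks until the queue stays empty for a full timeout; returns tasks executed. */
    static int runTasks(BlockingQueue<Runnable> taskQueue, long timeoutMs) {
        int executed = 0;
        while (true) {
            Runnable runnable;
            try {
                runnable = taskQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return executed;
            }
            if (runnable == null) {
                return executed; // idle long enough: let the drainer thread exit
            }
            runnable.run();
            executed++;
        }
    }

    public static void main(String[] args) {
        // The same loop works against either implementation.
        BlockingQueue<Runnable> ltq = new LinkedTransferQueue<>();
        BlockingQueue<Runnable> lbq = new LinkedBlockingQueue<>();
        ltq.add(() -> {});
        lbq.add(() -> {});
        System.out.println(runTasks(ltq, 50) + " / " + runTasks(lbq, 50));
    }
}
```

LinkedBlockingQueue uses a classic lock-and-Condition wait internally rather than LinkedTransferQueue’s lock-free matching algorithm, so if the hang really is in LinkedTransferQueue’s timed await, the swap would sidestep it, at the cost of slightly more contention under heavy load.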