Consistent 100% CPU use of safeCall-queue thread

  • Platform information:
    • Hardware: Intel NUC. Intel(R) Core™ i5-8259U CPU @ 2.30GHz, 16 GB RAM
    • OS: Linux 5.15.0-58-generic #64-Ubuntu SMP x86_64 Ubuntu 22.04.1 LTS
    • Java Runtime Environment: OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing)
    • openHAB version: 3.4.1

I’ve noticed that my openHAB is often above 100% CPU usage. I’ve got a lot going on, so haven’t been excessively concerned. But then I ran the following to figure out if it’s any particular thread that’s using all the CPU, and voila, yes, it most certainly is a single thread completely pegging a single CPU:

top -n 1 -H -p <pid of main openHAB process>

I typically get output such as:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                   
  56222 openhab   20   0 8925436   3.6g  24212 R  99.9  23.2  93:18.10 safeCall-queue   

Sometimes it will be split with upnp-remote-que or items-queue. Okay, so what on earth is that thread doing? Over in Karaf console, I run threads | grep -A 5 "queue". Every time I execute it, I get output like this (filtering out the pipe-grep thread itself):

"safeCall-queue" Id=8239 in RUNNABLE
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
    at org.openhab.core.common.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:193)
    at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
--
"upnp-remote-queue" Id=14699 in RUNNABLE
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
    at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
    at org.jupnp.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:194)
    at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)

Okay, so going to look at that code, I first do log:set TRACE org.openhab.core.common.QueueingThreadPoolExecutor to see if I can figure out whether it's something I'm doing that's causing a ton of work to flow through these threads. The only line I get (and I do get a lot of them) is [TRACE] [re.common.QueueingThreadPoolExecutor] - Executing queued task of thread pool 'items'. Hmm, that's not the thread routinely causing problems. Looking at the code, the line it always seems to be stuck on is final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);. It appears we're doing a spinning poll, waiting for a task to become available, and spending a long time there relative to the number of tasks actually coming through. What's weird to me is that the code implies that if a task doesn't arrive within a certain amount of time, the thread will exit until another task comes. But (a) I don't ever see any Executing queued task of thread pool 'safeCall'. in the log, and (b) the thread ID never changes, neither in top nor in Karaf. So that's weird. Is this possibly an issue that's happening because I'm running with Java 17? Perhaps it's just throwing InterruptedException a lot, which doesn't get logged, and it just spins around in the loop and tries again?
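To make sure I'm reading the pattern right, here's a minimal, self-contained sketch of what I understand the drainer thread to be doing (my own reconstruction, not the actual openhab-core source):

import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

// My own reconstruction of the drainer pattern, not the openhab-core source:
// a single thread repeatedly does a timed poll on the shared task queue and
// is supposed to exit once no work shows up within the timeout.
public class DrainerSketch {
    private static final LinkedTransferQueue<Runnable> taskQueue = new LinkedTransferQueue<>();

    public static void main(String[] args) {
        Thread drainer = new Thread(() -> {
            try {
                while (true) {
                    // Should park for at most ~2 seconds, burning no CPU while waiting.
                    Runnable task = taskQueue.poll(2, TimeUnit.SECONDS);
                    if (task == null) {
                        return; // queue idle: the thread ends until new work arrives
                    }
                    task.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "safeCall-queue");
        drainer.start();
    }
}

If that timed poll ever spins instead of parking, you'd see exactly what top shows: one long-lived thread carrying the pool's name, stuck at 100% CPU.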

UPDATE

Ok, this has got to be a Java 17 thing, no? I've modified the code, and confirmed my modifications are running because enabling logging shows many hits from the items pool. I can clearly see my upnp-main-queue thread chewing 100% CPU and stuck within LinkedTransferQueue.poll, but I get zero logging from that thread. So it's clearly been in that method for longer than 2 seconds. Here is my modified code:

logger.trace("About to start polling in thread pool '{}'", threadPoolName);
 Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
("Polling complete in thread pool '{}'; got {}", threadPoolName, runnable);

...

catch (InterruptedException e) {
  ("Interrupted waiting for pool of thread pool '{}'.", threadPoolName, e);
}

I’ve also played around with LinkedTransferQueue a bit in isolated tests, and it doesn’t seem to chew up CPU inside of poll normally.
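The kind of isolated test I mean looks roughly like this (a minimal standalone check, not the exact code I ran): a timed poll on an empty queue should park the thread and come back after about two seconds, not spin.

import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

// Standalone check: a timed poll on an empty LinkedTransferQueue should park
// the calling thread and return null after roughly 2 seconds, not spin.
public class PollTest {
    public static void main(String[] args) throws InterruptedException {
        LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<>();
        long start = System.nanoTime();
        Runnable task = queue.poll(2, TimeUnit.SECONDS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("poll returned " + task + " after " + elapsedMs + "ms");
    }
}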

UPDATE

:man_facepalming: JUPnP has its own QueueingThreadPoolExecutor class, which I hadn't modified and enabled logging for. I've done so now, and am waiting for the condition to recur. In the meantime, I've confirmed that the taskQueue for the proper thread pool executor thinks it was empty, at least from its public interface. Java reflection wouldn't allow me to read the transient volatile head and tail from jdk17u-dev/LinkedTransferQueue.java at master · openjdk/jdk17u-dev · GitHub with my JSR223 script.
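For completeness, the reflection attempt that gets blocked is roughly this (shown here as plain Java rather than my JSR223 script): on Java 17 the java.base module doesn't open java.util.concurrent for deep reflection, so setAccessible fails unless you run with --add-opens.

import java.lang.reflect.Field;
import java.util.concurrent.LinkedTransferQueue;

// Roughly the reflection attempt that gets blocked on Java 17: without
// --add-opens java.base/java.util.concurrent=ALL-UNNAMED, setAccessible(true)
// on these internal fields throws InaccessibleObjectException.
public class HeadTailPeek {
    public static void main(String[] args) throws Exception {
        LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<>();
        Field head = LinkedTransferQueue.class.getDeclaredField("head");
        head.setAccessible(true); // fails here under the default module settings
        System.out.println(head.get(queue));
    }
}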

UPDATE

AHA! I was able to break out of this state by running the following (Ruby) script, which forces all threads in the pool to be hogged, and a new task to be queued. It really feels like LinkedTransferQueue has an issue with timing out sometimes.

# Submit more tasks than the safeCall pool has threads, so that every worker
# gets tied up and a fresh task has to be handed off through the queue.
11.times do
  org.openhab.core.common.ThreadPoolManager.getPool("safeCall").submit do
    logger.info("in thread #{Thread.current.name}")
    sleep 2
    logger.info("and done in thread #{Thread.current.name}")
  end
end

UPDATE

I changed the logging a bit to log exactly how long poll took if it took longer than 3s. I let it get into this state again, then broke it out with the above code. I have proof now that LinkedTransferQueue is not adhering to its contract. I’ve filed a bug report at bugreport.java.com with the relevant details (internal id 9074727).

long start = System.nanoTime();
final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
long end = System.nanoTime();
long duration = end - start;
if (duration > 3000000000L) {
    logger.warn("Polling the task queue in thread pool '{}' took {}ms",
            threadPoolName, duration / 1000000L);
}

Log:

2023-01-27 08:05:17.338 [WARN ] [re.common.QueueingThreadPoolExecutor] - Polling the task queue in thread pool 'safeCall' took 1377089ms

cc @Kai @J-N-K. I'm not sure how widespread this issue is, or whether it's Java 17 only or what. QueueingThreadPoolExecutor has not changed in quite some time. But it may be related to some of the "takes a long time to shut down openHAB" issues such as JuPnP thread exhuastion/contention on upnp-async · Issue #1886 · openhab/openhab-core · GitHub. I normally have to wait 2 minutes as well, but when I had just "fixed" my queues so they're not stuck, my shutdown was almost immediate. We could work around this more permanently by using a different data structure; a rough sketch of what I mean is below.
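For illustration only (a sketch of the idea, not a proposed patch), swapping the queue implementation would be small, since the timed-poll contract is the same across BlockingQueue implementations:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the idea only, not a proposed patch: LinkedBlockingQueue offers
// the same timed poll() contract, so the surrounding drainer code would not
// need to change.
public class AlternativeQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> taskQueue = new LinkedBlockingQueue<>(); // instead of LinkedTransferQueue
        Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS); // same call as in the executor
        System.out.println("poll returned " + runnable);
    }
}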

Have you tried to dial up the pool sizes? There are some really weird things that happen when the pool is overrun on JuPnP.

As a test, I’d let it free spin with -1 on both and see if it helps.

Thanks, that’s a good workaround. It would definitely avoid the problem if the taskQueue never has to be used. And I’m running on a good sized NUC, so I’m not too worried about resources.

The problem with upnp, if problem is even the right word, is that jupnp has to listen to and deal with every upnp message on your network, not just things you've configured. The long shutdown is very telltale of this. What I found a few years ago was that, because of thread exhaustion due to a large number of devices, upnp packets were missed and discarded because they had nowhere to go. It very literally couldn't create a listener to wait for the replies. My issue was Sonos back then. When OH was shut down it was listening for replies from devices. Because some of those were lost, the threads hung and didn't dispose properly until everything finally just timed out. It sounds like your problem is similar. If you have the resources, I always suggest setting to -1 to avoid any chance of issues. If that makes your problem go away, then in reality there isn't anything wrong with OH. OH is tuned for "most users". You just need to dial that up a bit for your use case.

Here’s the Java bug: https://bugs.java.com/bugdatabase/view_bug?bug_id=8301341

Note that -1 doesn't work. It's not an allowed value, and trying to create or modify the thread pool with it will throw an exception: ThreadPoolExecutor (Java SE 11 & JDK 11). I set it to 100 instead.
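The exception isn't specific to openHAB, by the way: java.util.concurrent.ThreadPoolExecutor itself rejects a non-positive maximumPoolSize, both in its constructor and in setMaximumPoolSize. A quick standalone demo of the JDK behaviour (hypothetical class name, just to illustrate):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Standalone demo: a maximumPoolSize of -1 is rejected with an
// IllegalArgumentException before the pool is ever created.
public class NegativePoolSizeDemo {
    public static void main(String[] args) {
        new ThreadPoolExecutor(1, -1, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }
}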

I’m confused, what are you having issues setting to -1? You should be setting:

org.jupnp:threadPoolSize=-1
org.jupnp:asyncThreadPoolSize=-1

org.jupnp:threadPoolSize=100

Yeah, that. If I have that at -1, this (JRuby; an easy way for me to read internals of objects in a running openHAB instance) returns null:

# Grab the running UpnpService from OSGi and peek at its configuration's
# private mainExecutorService field (field_reader exposes the Java field).
upnp = OpenHAB::OSGi.service("org.jupnp.UpnpService")
cfg = upnp.configuration
cfg.class.field_reader :mainExecutorService
tp = cfg.mainExecutorService

If it’s at 100, it returns a valid value that I can read getMaximumPoolSize from.

Well that’s interesting. It’s definitely in the code…

It was committed in 2021 here…

Ahhh: Introduce remote thread pool and option to deactivate pool limits by kaikreuzer · Pull Request #138 · jupnp/jupnp · GitHub

it just doesn’t even create the thread pool if it’s -1. So perhaps it does work. I don’t think the same applies to ThreadPoolManager from openhab-core, though. I definitely saw an exception logged when I tried to set it to -1 there.

I believe you are correct. It's been a minute since we did those changes. Basically, when -1 is set it doesn't use the shared pool at all; it just hands back a new cached thread pool whenever anything asks for one:

protected ExecutorService getMainExecutorService() {
    if (mainThreadPool) {
        return mainExecutorService;
    } else {
        return Executors.newCachedThreadPool();
    }
}

So is there a way to change the thread pool or something?
I seem to be suffering from this issue as noted in this 4.0 milestone thread

Yes. Open openhab-cli console, and run the following commands:

config:edit org.openhab.threadpool
config:property-set safeCall 100
config:update

You can confirm that it worked in several ways:

In console, you can run these to view all current values:

config:edit org.openhab.threadpool
config:property-list

You can do cat /var/lib/openhab/config/org/openhab/threadpool.config (depending on where your installation is located).

Or you can drop the following JRuby script into OPENHAB_CONF/automation/ruby (when you have the JRuby add-on installed):

org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]
logger.info(tp.maximum_pool_size)

You can even recover from a blocked up thread pool with this script:

org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]

# Submit one more task than the pool's maximum size, so every existing worker
# gets occupied and the stuck queue poll is forced to hand off a task.
def unblock_thread_pool(tp)
  (tp.maximum_pool_size + 1).times do
    tp.submit { sleep 1 }
  end
end

unblock_thread_pool(tp)

OK thanks Cody
I have set

config:property-set safeCall 100

will report back
edit: 24 hours in… CPU usage is in the single digits

So apparently setting the safeCall = 100 does not survive a restart?

I tried your script Cody and it works

Yeah, I've seen it reset, but assumed it was because I had updated openHAB. It can probably be set with a startup script.

There is no need for that. If these changes to the thread pool are still required after the recent fixes for the persistence services, you can add the following to openhab/services/runtime.cfg:

org.openhab.threadpool:safeCall=100

I'm still having problems with 100% CPU usage, and I have uninstalled ALL persistence services. I'm on OH4 snapshot 3460 on an i3 desktop with 8 GB RAM.

This sounds like an older SNAPSHOT; I am on 3485 actually.

OK, I wanted to report that I finally found the problem that was causing the 100% CPU usage. I had a jRuby rule with an error in it, which I was aware of, but the rule still, for the most part, worked.

I had uninstalled all persistence services about a week earlier, when Jan had mentioned a bug in the OH4 snapshot thread, and it made no difference. I had stopped changing safeCall to 100 because it wasn't really helping.

Now, I know some of you are reading this thinking "no wonder, if you knew you had a broken rule, shame on you", but… a rule with an error in it should not bring OH to its knees.
This reminds me a lot of when OH3 came out: some people who had a file-based DSL rule with an error in it would hit the same type of problem. Mark found several bugs and fixed them, but still wasn't sure he had found the smoking gun. I'll dig around and try to find that thread.

For the record, it is a UI-based jRuby rule; I no longer run anything file-based. I am not pointing the finger at jRuby, I think the issue runs deeper.

edit to add: After I stopped resetting safeCall to 100, it was pinning the CPU every 3 or 4 hours. After disabling the rule (and doing nothing else), it has been over 3 days and Java never goes above single-digit CPU usage. I hadn't changed anything else for over a week prior.
