Consistent 100% CPU use of safeCall-queue thread

org.jupnp:threadPoolSize=100

Yeah, that. If I have that at -1, this (JRuby; an easy way for me to read internals of objects in a running openHAB instance) returns null:

upnp = OpenHAB::OSGi.service("org.jupnp.UpnpService")
cfg = upnp.configuration
cfg.class.field_reader :mainExecutorService
tp = cfg.mainExecutorService

If it’s at 100, it returns a valid value that I can read getMaximumPoolSize from.
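
For example, continuing the snippet above (a minimal sketch; it assumes tp is non-null, i.e. threadPoolSize is not -1, and that a logger is available as in the scripts later in this thread):

logger.info(tp.maximum_pool_size)   # JRuby exposes getMaximumPoolSize as maximum_pool_size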

Well that’s interesting. It’s definitely in the code…

It was committed in 2021 here…

Ahhh: Introduce remote thread pool and option to deactivate pool limits by kaikreuzer · Pull Request #138 · jupnp/jupnp · GitHub

It just doesn’t create the thread pool at all if it’s -1, so perhaps it does work. I don’t think the same applies to ThreadPoolManager from openhab-core, though; I definitely saw an exception logged when I tried to set it to -1 there.

I believe you are correct. It’s been a minute since we did those changes. Basically, when -1 is set, it never creates the shared pool at all and just hands out a new cached thread pool whenever anything asks for one:

protected ExecutorService getMainExecutorService() {
    if (mainThreadPool) {
        return mainExecutorService;
    } else {
        return Executors.newCachedThreadPool();
    }
}

So is there a way to change the thread pool size or something?
I seem to be suffering from this issue, as noted in this 4.0 milestone thread.

Yes. Open the openhab-cli console and run the following commands:

config:edit org.openhab.threadpool
config:property-set safeCall 100
config:update

You can confirm that it worked in several ways:

In console, you can run these to view all current values:

config:edit org.openhab.threadpool
config:property-list

You can do cat /var/lib/openhab/config/org/openhab/threadpool.config (depending on where your installation is located).

Or you can drop the following JRuby script into OPENHAB_CONF/automation/ruby (when you have the JRuby add-on installed):

org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]
logger.info(tp.maximum_pool_size)
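
If you want an overview of every pool ThreadPoolManager has created so far, a variation of the same script can walk the pools map (just a sketch; which pools exist depends on what has already run in your instance):

org.openhab.core.common.ThreadPoolManager.field_reader :pools
org.openhab.core.common.ThreadPoolManager.pools.each do |name, pool|
  # entries should all be ThreadPoolExecutor subclasses, so maximum_pool_size is available
  logger.info("#{name}: max size #{pool.maximum_pool_size}")
end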

You can even recover from a blocked up thread pool with this script:

org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]

def unblock_thread_pool(tp)
  (tp.maximum_pool_size + 1).times do
    tp.submit { sleep 1 }
  end
end

unblock_thread_pool(tp)

OK thanks Cody
I have set

config:property-set safeCall 100

Will report back.
edit: 24 hours later… CPU usage in the single digits

So apparently setting safeCall = 100 does not survive a restart?

I tried your script Cody and it works

Yeah, I’ve seen it reset, but assumed it was because I had updated openHAB. It can probably be set with a startup script.

There is no need for that. If these changes to the thread pool are still required after the recent fixes for persistence services, you can add the following to openhab/services/runtime.cfg:

org.openhab.threadpool:safeCall=100

I’m still having problems with 100% CPU usage, and I have uninstalled ALL persistence services. I’m on OH4 snapshot 3460 on an i3 desktop with 8 GB RAM.

This sounds like an older SNAPSHOT; I am actually on 3485.

OK, I wanted to report that I finally found the problem that was causing the 100% CPU usage. I had a jRuby rule which had an error in it, which I was aware of, but the rule still, for the most part, worked.

I had uninstalled all persistence services about a week earlier, when Jan had mentioned a bug in the OH4 snapshot thread, and it made no difference. I had stopped changing safeCall to 100 because it wasn’t really helping.

Now, I know some of you are reading this thinking “no wonder, if you knew you had a broken rule, shame on you”, but… a rule with an error in it should not bring OH to its knees.
This reminds me a lot of when OH3 came out: some people who had a file-based DSL rule with an error in it hit the same type of problem. Mark found several bugs and fixed them, but still wasn’t sure he had found the smoking gun. I’ll dig around and try to find the thread.

For the record, it is a UI-based jRuby rule; I no longer run file-based anything. I am not pointing the finger at jRuby; I think the issue runs deeper.

edit to add: After I stopped resetting safeCall to 100, it was pinning the CPU every 3 or 4 hours. Since disabling the rule (and doing nothing else), it has been over 3 days and Java never goes above single-digit CPU usage. I hadn’t changed anything else for over a week prior.


JSR223 scripts get wrapped in a safeCall, as well as rule executions. I would guess that your “bad” rule caused many, many calls to be sent through the safeCall queue, greatly increasing the probability of hitting the race condition in LinkedTransferQueue. Good to know, though, that if you’re hitting this quite often, it might be because something in your setup is causing it to be triggered more often.
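
If you want to see whether your safeCall pool is actually getting slammed, a small variation of the earlier verification script can log the executor’s live counters (a sketch; it assumes the safeCall pool already exists and is a plain ThreadPoolExecutor subclass, which is what ThreadPoolManager hands out):

org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]
# active_count / queue.size / completed_task_count map to getActiveCount(), getQueue().size(), getCompletedTaskCount()
logger.info("safeCall: active=#{tp.active_count}, queued=#{tp.queue.size}, completed=#{tp.completed_task_count}")

A queue that keeps growing while the active count sits at the maximum would suggest the pool is wedged rather than just busy.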

Thanks for reading, Cody. For the record, this is a rule that runs every 30 seconds for most of the day (from sunrise to sunset), so it executes a lot. But to be fair, the DSL version ran for over two years before I converted the rule to jRuby. The jRuby version was fine until the switch to OH4 and the new jRuby library, which is where the error came from. I’ve been meaning to start another thread to ask for help on that.

I have rules that execute a lot more often than every 30 seconds. The frequency shouldn’t be the issue.

It sounds like you might need to open a new topic and paste the rule so we can take a look in case it’s something that can be mitigated in the helper library or core.

Do you still experience that issue today?

Florian, are you asking me? If so, yes, I’m still experiencing this issue (most recently yesterday morning). Running 4.0.0 snapshot 3460.

I am more or less asking anyone here, just interested what is the state here.

That is a really old snapshot; the last 4.0.0 SNAPSHOT was probably 355x, and the latest SNAPSHOT is 3570.

I’m still experiencing the issue. I created a script from Cody’s post above and run it occasionally when the problem occurs, and it clears the issue.
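
In case it helps anyone else, the unblock snippet can also be wrapped in a scheduled JRuby rule so it runs without manual intervention (just a sketch; the rule name and the 15-minute interval are arbitrary, and it assumes the JRuby helper library’s every trigger):

rule "Unblock safeCall thread pool" do
  every 15.minutes
  run do
    org.openhab.core.common.ThreadPoolManager.field_reader :pools
    tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]
    next if tp.nil?   # the pool may not exist yet right after startup
    # same trick as Cody's script above: submit more tasks than the pool's maximum size
    (tp.maximum_pool_size + 1).times { tp.submit { sleep 1 } }
  end
end

One caveat: rule executions themselves go through safeCall (as Cody noted above), so if the pool is completely blocked a scheduled rule may not fire either; running the script manually remains the fallback in that case.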

Yeah, sorry, I disabled most of my rules and sort of gave up. Since no one else was reporting it, I figured it was something unique to my setup, and I have been meaning to start fresh from a clean install but never got around to it.
