High CPU usage after migration to OH4

For me that is not the case. In my case, when it happens, the CPU pins at 100% and stays there until I do something about it: either I restart openHAB or run a script Cody wrote, found in this post.

Here is the script. I just use the “run now” button to run it:

# Expose the internal ThreadPoolManager.pools field to JRuby
org.openhab.core.common.ThreadPoolManager.field_reader :pools
tp = org.openhab.core.common.ThreadPoolManager.pools["safeCall"]

# Submit more short tasks than the pool's maximum size, so that
# every worker thread (including the stuck ones) gets cycled
def unblock_thread_pool(tp)
  (tp.maximum_pool_size + 1).times do
    tp.submit { sleep 1 }
  end
end

unblock_thread_pool(tp)

Hi,

I have also migrated to the new Hue API v2 binding, but I see the CPU peaks even without that binding installed at all. The high CPU usage is not constant. I have attached my CPU usage over the last 24 hours, shown by the white line.


[chart: CPU usage over the last 24 hours]

With version 3.4.4 I had only 5% CPU usage; now it is over 50%.

best regards René

I probably also have higher CPU usage than before, though I am not sure, and it seems to have zero impact on my home automations.

I have now installed and set up the systeminfo binding (refresh every 3 s for high-priority channels), and I have the strange feeling that the CPU usage is even higher now. In top I can see my java process at around 65%; it was less before installing the systeminfo binding.
But at least, I will have charts now and see during a day if it is stable or not.
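For reference, here is a minimal sketch of how such a systeminfo setup might look in .things/.items files. The thing ID `local` and the item name are my own placeholders, and the `interval_high`/`interval_medium` parameters (in seconds) and the `cpu#load` channel are what I understand the binding to offer; check the binding documentation before copying:

```
// Hypothetical .things entry: poll high-priority channels every 3 s
Thing systeminfo:computer:local [interval_high=3, interval_medium=60]

// Hypothetical .items entry bound to the CPU load channel
Number CpuLoad "CPU load [%.1f %%]" { channel="systeminfo:computer:local:cpu#load" }
```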

Does anyone have a clear understanding of the Linux “load average”? For example, what exactly does 0.20 mean? Is it a good value for an RPi3?
Is it normal to have a load average of 0.16, 0.17, 0.18 while the %CPU of the java process is at 62.2%?

After some reading, I understand that a load average of 4 for an RPi3 with 4 cores would mean the cores are fully busy, i.e. a high load.
So a value between 0.2 and 0.4 is probably considered a low CPU load for such a processor.

Be careful. On Linux, the load average is about process load and includes waiting processes, e.g. those waiting for I/O, which may not impact CPU usage. A high load average is indeed not what you want, but it could legitimately be more than the number of cores, depending on what the processes are doing. CPU usage on Linux is reported per processor/core and adds up. There is a CPU problem if load average and CPU usage evolve in the same way.

Can confirm. On my rpi4/openhabian/OH4.0.1 install I track a number of parameters from the rpi using persistence. The ‘load average’ parameter is generally between 0.2 and 0.5 with occasional spikes up to 2-4 that seem to quickly settle.

No issues with my upgrade, as I’ve been running the pre-prod OH4 versions since they came out, but this is the load I’ve been seeing since OH3 as well.

Cheers

I just read another article saying that:

You can calculate the overall CPU utilization using the idle time using the formula given below:
CPU utilization = 100 - idle time

where the idle time is a value you can find in the third line of top’s output.

I see in top an idle time between 94 and 98, meaning an overall CPU usage between 2% and 6%.
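To make that arithmetic concrete, here is a small Ruby sketch of the same calculation using /proc/stat (where top gets its numbers) instead of reading the idle figure off the screen. The two sample lines are made-up values for illustration:

```ruby
# /proc/stat's aggregate "cpu" line lists cumulative times in this order:
#   user nice system idle iowait irq softirq steal ...
# Utilization over an interval is 100 * (1 - delta_idle / delta_total).
def utilization(sample1, sample2)
  t1 = sample1.split.drop(1).map(&:to_i)
  t2 = sample2.split.drop(1).map(&:to_i)
  idle  = (t2[3] + t2[4]) - (t1[3] + t1[4]) # idle + iowait
  total = t2.sum - t1.sum
  100.0 * (1 - idle.to_f / total)
end

# Two hypothetical samples taken a few seconds apart:
s1 = "cpu 1000 0 500 8000 200 0 0 0"
s2 = "cpu 1030 0 515 8290 205 0 0 0"
puts utilization(s1, s2).round(1) # => 13.2
```

In practice you would read `File.read("/proc/stat").lines.first` twice with a short sleep in between, since the counters are cumulative since boot.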

For those mentioning a high CPU usage, where do you get this information ?

The number I usually look at is the per-process CPU usage in top. This one where it’s currently showing 21%:

That number is relative to a single CPU core. So, for example, with the items-queue problem that number will always be >= 100%, because a single CPU core is completely pegged, and anything above that is the other threads in the process using some CPU.

The percentages in the third line (by default), which starts with %Cpu(s), should sum to 100% and are relative to all available CPU cores. You can press 1 to toggle between the total and per-CPU-core view, though. This can be useful if you have a thread pinned to a specific CPU core and using all of that core’s resources (I’ve seen this in very high I/O environments where IRQs can only be processed by a specific core). If the idle number is consistently near 0%, it’s a good indication that your system as a whole is short on CPU resources.

The load average is defined as the number of processes (over the given period of time) that are running or ready to run. The general rule of thumb is that if this number is greater than the number of CPU cores you have, the CPUs are overworked and accumulating a backlog. But as @Mherwege says, this can be deceptive depending on what exactly the load is. Consistently very high load numbers can mean your system is overworked, but they are not necessarily a strong signal of CPU-only starvation.
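As a small illustration of that rule of thumb, here is a Ruby sketch relating a /proc/loadavg reading to the core count. The sample line and the `load_per_core` helper name are hypothetical:

```ruby
# /proc/loadavg starts with the 1-, 5- and 15-minute load averages.
# Dividing by the core count gives a rough "fraction of capacity":
# values well below 1.0 per core mean no runnable-process backlog.
def load_per_core(loadavg_line, cores)
  loadavg_line.split.first(3).map { |v| v.to_f / cores }
end

# Hypothetical RPi3 (4 cores) reading: 0.20 is only ~5% of capacity
puts load_per_core("0.20 0.17 0.18 1/123 4567", 4).inspect
# => [0.05, 0.0425, 0.045]
```

On a real system you would pass `File.read("/proc/loadavg")` as the first argument.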

On an apt-installed OH running on a Linux Mint host I simply use top, which reads out in percent. When things ran away, the java process would go over 100%.
I would notice this because the cooling fan would throttle up and get very loud.
The load average is listed at the top; it shows the 1-minute, 5-minute and 15-minute intervals. When the CPU runs away, it reads over 4.
Here is what it looks like right now (normal):

load average: 0.58, 0.63, 0.34

I thought I had found my problem, but I wanted to test it for a few days and be sure. I had a shelly device that I don’t really use anymore. When I upgraded to OH4, I didn’t bother to reinitialize it, but I also didn’t remove the Thing from my Things. It occurred to me about a week ago that perhaps this was the problem. I had read another thread a few weeks back where a guy had a shelly device that was at the edge of his wifi coverage and it had caused problems when it was unreachable. So I deleted the shelly device.

I had previously disabled the vast majority of my JRuby rules because it was the only way to keep my system from constantly pinning the CPU at 100%. My system has been crippled by this problem since April. After deleting the Shelly device and waiting a day or two, I figured I’d re-enable my JRuby rules and see what happened. Sure enough, my problem with 100% CPU usage is back.
I hate to point the finger at JRuby rules, but I have a couple dozen DSL rules running which don’t cause a problem, and as soon as I re-enable a few of my JRuby rules, the problem immediately returns. This problem is highly reproducible for me, in case anybody wants me to try anything or can give me a clue as to how to track the problem down myself.
I have a very strong suspicion this has something to do with the safeCall queue problem discussed at length here.

I just want to add that these same JRuby rules ran happily on OH3 with no problems; this problem started with the upgrade to OH4. But just for context, OH4 also introduced a new JRuby helper library, and I don’t know whether that has anything to do with it.

The problem is identical to the safeCall-queue one. And the OH4 trigger isn’t necessarily due to JRuby changes: it’s a bug in Java 17 (which OH4 switched to) in the LinkedTransferQueue that openHAB core uses. Certain workloads are clearly more likely to trigger the bug, even though they’re not buggy in and of themselves. I also wouldn’t be surprised if the bug only happens on certain hardware combinations. I didn’t quite follow the chatter on the upstream bug (https://bugs.java.com/bugdatabase/view_bug?bug_id=8301341), which is now fixed, but only available in a pre-release of Java 22. OH4 did also start using SafeCaller more for executing JSR rules (to help prevent errant rules from running amok with no way to stop them but to restart openHAB), so those are probably more likely to trigger the safeCall-queue incarnation.


Thank you Cody for the link. It is interesting to me that the bug report is authored by an openHAB user (apparently) who uses openHAB in his steps to reproduce the problem.
If I understand what I’ve read, the problem is rare and not easily back-ported to prior versions so there is zero chance of it being fixed in the Java version currently used in openHAB. I would further assume that considering the effort required to move openHAB from Java version 11 to version 17, moving to version 22 isn’t going to happen any time soon.
When you say that openHAB did also start using SafeCaller more for executing JSR rules, I’m guessing you mean JavaScript? I guess this is why some users above who also report suffering from this issue have problems with JavaScript rules?
It seems I am in the minority and this is not anything which is going to be fixed any time soon.

Hi,

I switched back to the latest OH3 release with the same configuration as on OH4 (same bindings, things, items and rules). I installed Java 11 again and now have my 4% CPU usage back; the CPU temperature went down from 68 °C to 49 °C. The chart is attached; the switch to OH3 was completed at 4 PM.

best regards René


Same problems here, on an RPi 3B+. Thank you for the visualisation. I’m interested in whether this will be addressed soon. As mentioned above, a fix will probably take some time, since Java 17 is the newest LTS JDK and 21 will only be released in September ’23. Nevertheless, if @ccutrer is right, the fix is in the pre-release JDK 22 :confused:

I noticed that the CPU load on my Synology ARM has been constantly high for a couple of days. I can confirm @ccutrer’s analysis: htop shows upnp-main-queue as the offending thread.

Thread dump shows:

"upnp-main-queue" #300878 prio=5 os_prio=0 cpu=296024047.59ms elapsed=298256.24s tid=0x566655e0 nid=0x57ab runnable  [0x48fde000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@17.0.8/LinkedTransferQueue.java:652)
        at java.util.concurrent.LinkedTransferQueue.xfer(java.base@17.0.8/LinkedTransferQueue.java:616)
        at java.util.concurrent.LinkedTransferQueue.poll(java.base@17.0.8/LinkedTransferQueue.java:1294)
        at org.jupnp.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:194)
        at java.lang.Thread.run(java.base@17.0.8/Thread.java:833)

I hope they will backport the fix for https://bugs.java.com/bugdatabase/view_bug?bug_id=8301341.
Or is it OK to run openHAB 4 with a Java version higher than 17? The time of breaking changes (removal of Nashorn, JDK encapsulation) seems to be over for now.

openHAB 4.0.2 has a “fix” (just using the LinkedTransferQueue from Java 11) automatically included. So you shouldn’t need to worry about changing Java versions.


Hi! openHAB 4.0.2 is released? No updates are available on my Ubuntu system. Installed from deb.

Installed from a downloaded deb? Then you won’t see any updates. The package needs to be installed from the repository to get updates from that repository.

Yes you can read how to configure APT to get updates here :point_right: APT Based Systems


I have the apt repo installed, but its key has expired. I have tried to add the new one, but I don’t think that is working. The update from 3.4 to 4.0.1 went well a week before. Strange…

blahblah@OpenHAB:~$ sudo apt update
[sudo] password for mdalacu:
Hit:1 Index of /ubuntu focal InRelease
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:3 Index of /ubuntu focal-updates InRelease [114 kB]
Get:5 https://apt.grafana.com stable InRelease [5,984 B]
Get:6 https://repos.influxdata.com/ubuntu focal InRelease [7,037 B]
Hit:7 https://repos.azul.com/zulu/deb stable InRelease
Get:8 Index of /ubuntu focal-backports InRelease [108 kB]
Err:5 https://apt.grafana.com stable InRelease
The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY 963FA27710458545
Err:6 https://repos.influxdata.com/ubuntu focal InRelease
The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY D8FF8E1F7DF8B07E
Ign:4 https://repos.azulsystems.com/ubuntu stable InRelease
Hit:9 https://repos.azulsystems.com/ubuntu stable Release
Get:10 Index of /ubuntu focal-updates/main amd64 Packages [2,783 kB]
Get:11 https://openhab.jfrog.io/artifactory/openhab-linuxpkg stable InRelease [12.8 kB]
Err:11 https://openhab.jfrog.io/artifactory/openhab-linuxpkg stable InRelease
The following signatures were invalid: EXPKEYSIG 075721F6A224060A openHAB Bintray Repositories owner@openhab.org
Fetched 3,126 kB in 2s (1,956 kB/s)
Reading package lists… Done
Building dependency tree
Reading state information… Done
All packages are up to date.
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://apt.grafana.com stable InRelease: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY 963FA27710458545
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repos.influxdata.com/ubuntu focal InRelease: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY D8FF8E1F7DF8B07E
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://openhab.jfrog.io/artifactory/openhab-linuxpkg stable InRelease: The following signatures were invalid: EXPKEYSIG 075721F6A224060A openHAB Bintray Repositories owner@openhab.org
W: Failed to fetch https://apt.grafana.com/dists/stable/InRelease The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY 963FA27710458545
W: Failed to fetch https://repos.influxdata.com/ubuntu/dists/focal/InRelease The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY D8FF8E1F7DF8B07E
W: Failed to fetch https://openhab.jfrog.io/artifactory/openhab-linuxpkg/dists/stable/InRelease The following signatures were invalid: EXPKEYSIG 075721F6A224060A openHAB Bintray Repositories owner@openhab.org
W: Some index files failed to download. They have been ignored, or old ones used instead.