CPU pegged at 100%+ all the time, unable to nail down cause

  • Platform information:
    • Hardware: Rpi 3 B+
    • OS: Linux openHABianPi 4.19.66-v7+
    • Java Runtime Environment: openjdk version "11.0.11" 2021-04-20 LTS
      OpenJDK Runtime Environment Zulu11.48+21-CA (build 11.0.11+9-LTS)
      OpenJDK Client VM Zulu11.48+21-CA (build 11.0.11+9-LTS, mixed mode)
    • openHAB version: openHAB 3.1.0-1 (Release Build)

For weeks I’ve been trying to discover why my OH is so unstable. It started before upgrading to OH3, and got worse, if anything, after upgrading.

The pattern is that if I set up a task in my rules to run every 10 minutes, the task will run once and then stop. htop shows the CPU pegged at 100%+ all the time.

The log shows an ever-increasing number of errors, such as:

Fatal transport error: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2021-07-28 15:10:45.849 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = xxxxxxx, base URL = http://localhost:8080)
2021-07-28 15:11:00.556 [INFO ] [io.openhabcloud.internal.CloudClient] - Connected to the openHAB Cloud service (UUID = xxxxxxxxx, base URL = http://localhost:8080)
2021-07-28 15:11:11.203 [ERROR] [io.openhabcloud.internal.CloudClient] - Error connecting to the openHAB Cloud instance
A connection to https://myopenhab.org/ was leaked. Did you forget to close a response body?
Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@1be2a69' takes more than 5000ms.

and always ends with:

java.lang.OutOfMemoryError: Java heap space

All my rules are file-based JavaScript rules. I’ve tried stripping down the rules to identify something specific, but I can never find the culprit. If I leave only the .js files that don’t contain triggers of any kind, i.e. just functions and classes used by the rules with triggers (that’s most of them), the CPU load stays at normal levels. If I add just one rule containing any kind of trigger (even if the trigger never fires), the CPU load goes up dramatically. Right now I’m doing tests with just two rule files containing triggers, none of which fire at any point (they’re triggers that would only fire if I modified the value of an item). The system is close to or above 100% CPU usage without actually doing anything.

I don’t know what else to do to identify the problem.

I’m at the end of my tether here. Zipato is collapsing and I can’t even get openHAB to run the ventilation reliably.

My first instinct with things like this is, “sounds like a corrupt SD card”. But that’s a knee-jerk response.

I don’t know anything about JavaScript rules, but I’d suggest disabling all of them and creating a single rule through the UI to see if the problem crops up there, too.

I’m pretty sure it’s not the SD card, because there would be other symptoms, and I’ve switched to SD backups more than once.

I could try creating UI rules, but since I have a ton of .js files loaded by all the rules containing triggers, it doesn’t really reproduce the case.

Maybe I could create a trigger in the UI which invokes the other .js files?

Definitely not an SD problem then.

You can add scripts to UI rules, but again, I don’t know much about that. I assume that you can call other scripts. Your previous post suggested to me that the problem was with triggers, so I was really just thinking to make a simple if-this-then-that-rule (no JS) to identify if that’s the case. It’s a quick way to at least confirm that the issue is related to JavaScript.

What does this all mean? Do you have two Javas installed? Which one is openHAB using?

I thought OpenJDK and Zulu were different, hence the questions!

First a bit about htop (or process monitors on Linux in general). That screenshot is showing that one of your CPU cores is indeed pegged at 100%. But your three other CPU cores on that machine are doing nothing (well, htop is using a tiny bit on core 3). So that gives us some information. Whatever is causing OH to run amuck is also causing it to get stuck and not utilize those other cores too. OH is multithreaded and can theoretically do four different things at the same time (unless configured to be pegged to just one CPU).

That’s your real error. Whatever is running amuck is also causing a memory leak. So it’s blocking other parts of OH from running, which is likely to cause timeout errors (like the one shown), and the run-up to the eventual out-of-memory error is likely to generate other errors as well.

So in general the errors themselves in the logs are not all that useful, because you can’t tell what’s a cause and what’s a symptom.

I’ve seen some reports that the Language Server is causing some users some problems in OH 3.1/3.2. It’s very unusual for an openHAB rule in any language to cause a problem like this. There are constraints in rules that make it such that you almost have to really try to cause a problem like this by creating infinite loops that create data structures that just continue to grow until all the memory is used up.

Your best bet is to look at all the many “OutOfMemory” threads on the forum and see some of the things they have done to diagnose which part of OH the problem is coming from.

That’s the standard output when you run java --version for Zulu.

I had the exact same problem - only that all my four cores were pegged at ~100% in htop (and RPi-Monitor that showed core temperatures very close to 80℃) and the system was more or less impossible to use.
I tried to analyze what the different threads and bundles were doing but could not figure out what was going on. I finally realized I had done an upgrade of my system some days earlier, so I took a chance and upgraded the installation to the (not very) bleeding edge version of openHABian. That was the cure. I don’t know why, but it worked, and I saw my coming dentist bills and skull fractures reduced to normal levels :slight_smile:

sudo openhabian-config
01 Select Branch
( ) main (The openHABian version to contain the very latest code for openHAB 3 is called “main”.)

After updating and rebooting I once again had a fully functional and very responsive system, loafing along at some 5-10% CPU load and a reasonable use of memory.

Note that the login header still says:
|_| 3.1.0 - Release Build

Hope this helps.

I disabled the LSP connection to Visual Studio Code (in the VSC config) after seeing reports of it causing problems in the forum, but I still see the Language Server starting up in the log:

Started Language Server Protocol (LSP) service on port 5007

Is there something else I should try disabling?

It does seem that it’s something to do with my rules, because if I take away the trigger rules the CPU calms right down. Weirder still, the rules are not doing anything at all.

I could try creating a test trigger which loads a small subset of .js files and see if the problem occurs. If it doesn’t, I’ll add more until it breaks.

I’ve also seen reports here of problems due to the MQTT Broker and Hue Emulation, so I removed those (which I wasn’t using). No effect though. I also managed to get rid of some old things that were stuck and I was unable to uninstall in 2.5. Again, no effect.

I’ll give it a go, thanks!

I take it that 3.2 is not a stable release yet?

Not sure about 3.2 but I installed 3.1.0 main branch:

openhab-cli info => Version: 3.1.0 (Build)

Right, major progress, I think.

After testing for hours I’ve concluded that the CPU load is directly proportional to the number of .js files in the “tree” loaded by the .js file containing the trigger - in other words, the .js file with the trigger loads some .js files, which load others, which load others. Even if the trigger rule does nothing (all its content is commented out) and even if the trigger never fires, something goes haywire and pegs the CPU at 100%.
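One way to put a number on the size of that "tree" is sketched below. It is only an illustration: the file names and the dependency map are made up, and the second function assumes that every load call re-evaluates its target, which is an assumption and not something confirmed in this thread. Under that assumption a shared helper file is counted once per path that reaches it, so the total can be much larger than the number of distinct files.

```javascript
// Hypothetical dependency map: each key is a rule file, each value lists
// the files it pulls in via load(). All names are invented for illustration.
var exampleTree = {
  "trigger.js": ["ventilation.js", "heating.js"],
  "ventilation.js": ["utils.js"],
  "heating.js": ["utils.js"],
  "utils.js": []
};

// Returns the sorted list of distinct files reachable from root (root included).
function distinctFiles(graph, root) {
  var seen = {};
  var stack = [root];
  while (stack.length > 0) {
    var file = stack.pop();
    if (seen[file]) { continue; }
    seen[file] = true;
    var children = graph[file] || [];
    for (var i = 0; i < children.length; i++) {
      stack.push(children[i]);
    }
  }
  return Object.keys(seen).sort();
}

// Counts script evaluations ASSUMING every load() re-runs its target.
// Only valid for an acyclic graph; a cycle would recurse without end.
function totalLoads(graph, root) {
  var children = graph[root] || [];
  var count = 1; // the root itself
  for (var i = 0; i < children.length; i++) {
    count += totalLoads(graph, children[i]);
  }
  return count;
}
```

Comparing `distinctFiles(exampleTree, "trigger.js").length` with `totalLoads(exampleTree, "trigger.js")` for each top-level trigger file would show how much repeated loading a layered layout could cause if the assumption holds.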

So I wondered: what if I just make one huge rule file? I already have a PowerShell script doing various operations on my rule files before deploying them on the Raspberry Pi, so I added another: all the “load” commands pertaining to my own rule files are stripped out, and then all the .js files are concatenated into one huge file.
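The concatenation step can be sketched roughly as follows. This is not the PowerShell script mentioned above, just a minimal JavaScript illustration of the same idea. The `bundle` function, the `{ name, source }` input shape and the regular expression for spotting load lines are all assumptions; it only strips lines that consist of a single load call mentioning one of the bundled file names, and it expects the files to be passed in dependency order (helpers first).

```javascript
// Joins several rule files into one script, dropping load() lines that
// refer to files which are themselves part of the bundle.
// files: array of { name: "utils.js", source: "...file contents..." },
// listed so that dependencies come before the files that use them.
function bundle(files) {
  // Assumed shape of a load statement: the whole line is one load(...) call.
  var loadLine = /^\s*load\s*\(.*\)\s*;?\s*$/;

  // True if the line is a load() call naming one of the bundled files.
  function loadsBundledFile(line) {
    if (!loadLine.test(line)) { return false; }
    for (var k = 0; k < files.length; k++) {
      if (line.indexOf(files[k].name) !== -1) { return true; }
    }
    return false;
  }

  var parts = [];
  for (var i = 0; i < files.length; i++) {
    var lines = files[i].source.split("\n");
    var kept = [];
    for (var j = 0; j < lines.length; j++) {
      if (!loadsBundledFile(lines[j])) { kept.push(lines[j]); }
    }
    // A comment header marks where each original file starts.
    parts.push("// ---- " + files[i].name + " ----\n" + kept.join("\n"));
  }
  return parts.join("\n");
}
```

Load lines that point at files outside the bundle (third-party helper libraries, for example) are left in place, which matches the description of only removing loads of one’s own rule files.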

Result: everything is working. The rules load in seconds, all the timer-based rules are chugging along happily, the HTTP requests work, and the CPU stays at about 4-12%.

So, I’m going to say this is solved, but there appears to be something amiss with the handling of multiple .js files. I don’t know whose department that is, but they might want to take a look at it. I can supply my files if it helps to find the problem.

Incidentally, this problem seems markedly worse on OH3. The same thing happened on OH2.5, but it seemed to take longer to reach complete collapse. I can’t be very specific though, since I’m not in a position to test it any more.

I should also say that there was a time on OH2.5 when it was all working quite happily. I’m not sure what changed (I may have upgraded to a newer version, for example).

  • Log in to the Karaf console
  • bundle:list | grep lsp

You should get something like

openhab> bundle:list | grep lsp
136 │ Active │  80 │ 0.10.0.v20201105-1103 │ org.eclipse.lsp4j
137 │ Active │  80 │ 0.10.0.v20201105-1103 │ org.eclipse.lsp4j.jsonrpc
  • Execute bundle:stop <bundle number> where <bundle number> is the value in the first column of results from the previous command for both bundles. Listing them again should show they are no longer active.
136 │ Resolved │  80 │ 0.10.0.v20201105-1103 │ org.eclipse.lsp4j
137 │ Resolved │  80 │ 0.10.0.v20201105-1103 │ org.eclipse.lsp4j.jsonrpc

I’m not certain if this is permanent or not but you could restart OH, disable them first thing and see if that addresses the problem. If not then it’s something else. If so we can explore how to disable it permanently.

3.2 is currently at the milestone stage. OH releases work as follows:

  • Release: every six months a new “stable” release is created. When there are no major breaking changes the release is treated as a point release. The current release is 3.1.

  • Milestone Releases/AKA Testing Releases: every month a new milestone release is created. These have no known major problems but problems sometimes come through. The current milestone release is 3.2 M1.

  • Snapshots: this is the bleeding edge and a new snapshot is created with the latest code merged into the baseline every day (unless the build breaks).

However, just because a baseline is called a release or a milestone doesn’t mean it’s completely free of bugs. Stable just means it’s not changing much for six months. Also, despite being on the bleeding edge, the snapshot releases are remarkably free from bugs almost all the time. Many of us run the snapshots in our production systems with only very rare problems. I personally run either a milestone version when my time is short or the snapshots when I have more time, upgrading the snapshot version a couple of times a week. Very rarely do I encounter any problems, and because each upgrade includes fewer changes overall, I end up spending less time dealing with breaking changes than I would had I stuck to the release versions and had to deal with six months of changes all at once.

So if you are encountering a problem with a milestone or release version and a fix exists in a snapshot, I do not hesitate to recommend installing the snapshot version with the fix and then holding off on further upgrades of OH, at least until the next milestone or release that includes the fix too.

File an issue on the openhab-core repo. See How to file an Issue for details.

Do you perhaps have an inadvertent circular reference somewhere in your load statements? I doubt Nashorn is smart enough to prevent that sort of thing. So if a.js loads b.js which loads c.js which loads a.js you’d end up in an infinite loop of loading.

Pay attention to both your code and the Helper Libraries as well.
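A check for that kind of loop can be sketched as a depth-first search over a map of which file loads which. This is only an illustration: the `findLoadCycle` name and the shape of the map are assumptions, and the map itself would have to be built by hand or by scanning the load statements in your own files and the Helper Libraries.

```javascript
// graph: object mapping a file name to the array of files it load()s.
// Returns the first cycle found as an array such as
// ["a.js", "b.js", "c.js", "a.js"], or null if there is none.
function findLoadCycle(graph) {
  var state = {}; // undefined = not visited, 1 = on current path, 2 = finished
  var path = [];  // files on the current depth-first path

  function visit(file) {
    if (state[file] === 2) { return null; }
    if (state[file] === 1) {
      // Reached a file that is already on the current path: that is a cycle.
      return path.slice(path.indexOf(file)).concat(file);
    }
    state[file] = 1;
    path.push(file);
    var children = graph[file] || [];
    for (var i = 0; i < children.length; i++) {
      var cycle = visit(children[i]);
      if (cycle) { return cycle; }
    }
    path.pop();
    state[file] = 2;
    return null;
  }

  var files = Object.keys(graph);
  for (var i = 0; i < files.length; i++) {
    var found = visit(files[i]);
    if (found) { return found; }
  }
  return null;
}
```

For the a.js, b.js, c.js case described above, calling `findLoadCycle({ "a.js": ["b.js"], "b.js": ["c.js"], "c.js": ["a.js"] })` returns the chain starting and ending at a.js, while a strictly layered set of files returns null.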

Thanks for all the info.

It’s possible, although I did go over it pretty carefully to make sure I didn’t fall into that trap (it’s happened to me before). I’ll have another look.

The reason I suspect that it’s not the case is that, as I moved down the tree of includes in my test file, the CPU load went down (and the time taken to load the test file went down too). If it were a circular reference then you’d see the CPU load snap from normal to maxed out as you added the offending load statement.

I just checked the dependencies, in particular one load statement that, when I removed it, made the CPU load go back to normal (it points to a lot of other files). There aren’t any circular references in there. I organised the files in layers to make it pretty clear where each one falls in the hierarchy.

The other reason to suspect a problem with the handling of the files is that, if I just leave out the five .js files at the top of the tree (the ones containing the triggers), the CPU behaves normally. The files load with almost the entire dependency tree with no problem. If there were a circular dependency, this would not be the case. The top-level files can’t introduce a circular dependency because there’s nothing above them.

OK I’ll have a look at it.

My system has been running happily for four and a half hours :grinning:.

Did you create an issue on the openhab-core repo? If you do, please post a link here so others can find it. Also, if there is a problem that needs looking at, nobody will start looking at it until an issue is created.

I tried to create a minimum repro case to avoid posting all my rules, but it didn’t fail in the same way. So I have to have another go at creating one.