The log shows an ever-increasing number of errors, such as:
Fatal transport error: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2021-07-28 15:10:45.849 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = xxxxxxx, base URL = http://localhost:8080)
2021-07-28 15:11:00.556 [INFO ] [io.openhabcloud.internal.CloudClient] - Connected to the openHAB Cloud service (UUID = xxxxxxxxx, base URL = http://localhost:8080)
2021-07-28 15:11:11.203 [ERROR] [io.openhabcloud.internal.CloudClient] - Error connecting to the openHAB Cloud instance
A connection to https://myopenhab.org/ was leaked. Did you forget to close a response body?
Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@1be2a69' takes more than 5000ms.
and always ends with:
java.lang.OutOfMemoryError: Java heap space
I don’t know what else to do to identify the problem.
First, a bit about htop (or process monitors on Linux in general). That screenshot shows that one of your CPU cores is indeed pegged at 100%, but the other three cores on that machine are doing nothing (well, htop is using a tiny bit on core 3). So that gives us some information: whatever is causing OH to run amok is also causing it to get stuck and not utilize those other cores. OH is multithreaded and can theoretically do four different things at the same time (unless configured to be pinned to just one CPU).
That’s your real error. Whatever is running amok is also leaking memory. So it’s both blocking other parts of OH from running, which is likely what causes timeout errors like the one shown, and in the lead-up to the ultimate out-of-memory error it is likely to generate other errors as well.
So in general the errors themselves in the logs are not all that useful, because you can’t tell what’s a cause and what’s a symptom.
I’ve seen some reports that the Language Server is causing problems for some users in OH 3.1/3.2. It’s very unusual for an openHAB rule in any language to cause a problem like this. Rules are constrained enough that you almost have to try to cause it, for example by writing an infinite loop that keeps growing a data structure until all the memory is used up.
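To make that concrete, here is a hypothetical sketch (not taken from anyone’s actual rules) of about the only kind of rule logic that exhausts the heap on its own: a loop that keeps growing a data structure. It’s bounded here so it can run safely; in a runaway rule the loop condition would simply never become false.

```javascript
// Hypothetical example of a heap-exhausting pattern, bounded so it can
// actually terminate. In a real runaway rule the loop would never stop
// and `leak` would grow until the JVM throws OutOfMemoryError.
function growUntil(maxIterations) {
    var leak = [];
    var i = 0;
    while (i < maxIterations) {     // a runaway rule would loop forever here
        leak.push(new Array(1024)); // each iteration retains more memory
        i++;
    }
    return leak.length;
}
```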
Your best bet is to look at all the many “OutOfMemory” threads on the forum and see some of the things they have done to diagnose which part of OH the problem is coming from.
That’s the standard output when you run java --version for Zulu.
I had the exact same problem - except that all four of my cores were pegged at ~100% in htop (and in RPi-Monitor, which showed core temperatures very close to 80℃) and the system was more or less impossible to use.
I tried to analyze what the different threads and bundles were doing but could not figure out what was going on. I finally remembered I had done an upgrade of my system some days earlier, so I took a chance and upgraded the installation to the (not very) bleeding edge version of openHABian. That was the cure - not that I know why, but it worked, and I saw my upcoming dentist bills and skull fractures reduced to normal levels.
01 Select Branch
( ) main (The openHABian version to contain the very latest code for openHAB 3 is called “main”.)
After updating and rebooting I once again had a fully functional and very responsive system, loafing along at some 5-10% CPU load and a reasonable use of memory.
Note that the login header still says:
|_| 3.1.0 - Release Build
I disabled the LSP connection to Visual Studio Code (in the VSC config) after seeing reports of it causing problems in the forum, but I still see the Language Server starting up in the log:
Started Language Server Protocol (LSP) service on port 5007
Is there something else I should try disabling?
It does seem that it’s something to do with my rules, because if I take away the trigger rules the CPU calms right down. Weirder still, the rules are not doing anything at all.
I could try creating a test trigger which loads a small subset of .js files and see if the problem occurs. If it doesn’t, I’ll add more until it breaks.
I’ve also seen reports here of problems due to the MQTT Broker and Hue Emulation, so I removed those (which I wasn’t using). No effect though. I also managed to get rid of some old things that were stuck and I was unable to uninstall in 2.5. Again, no effect.
After testing for hours I’ve concluded that the CPU load is directly proportional to the number of .js files in the “tree” loaded by the .js file containing the trigger - in other words, the .js file with the trigger loads some .js files, which load others, which load others. Even if the trigger rule does nothing (all its content is commented out) and even if the trigger never fires, something goes haywire and pegs the CPU at 100%.
So I wondered: what if I just make one huge rule file? I already have a PowerShell script doing various operations on my rule files before deploying them on the Raspberry Pi, so I added another: all the “load” commands pertaining to my own rule files are stripped out, and then all the .js files are concatenated into one huge file.
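The idea behind that extra step can be sketched in plain JavaScript working on in-memory strings (the file names, the load() pattern, and the function name here are my own illustration, not the actual script): drop every load("...") line that refers to one of my own rule files, then glue all the files together.

```javascript
// Sketch of the concatenation step, operating on in-memory strings so it
// is self-contained. `files` is a list of {name, content} objects and
// `ownFileNames` lists the rule files whose load() lines should be dropped.
function concatenateRules(files, ownFileNames) {
    var pieces = [];
    files.forEach(function (file) {
        var kept = file.content.split("\n").filter(function (line) {
            var m = line.match(/load\(["']([^"']+)["']\)/);
            // drop the line only if it loads one of our own files
            return !(m && ownFileNames.indexOf(m[1]) !== -1);
        });
        pieces.push("// ---- " + file.name + " ----\n" + kept.join("\n"));
    });
    return pieces.join("\n");
}
```

Ordering matters, of course: the files have to be concatenated so that definitions appear before their first use.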
Result: everything is working. The rules load in seconds, all the timer-based rules are chugging along happily, the HTTP requests work, and the CPU stays at about 4-12%.
So, I’m going to say this is solved, but there appears to be something amiss with the handling of multiple .js files. I don’t know whose department that is, but they might want to take a look at it. I can supply my files if it helps to find the problem.
Incidentally, this problem seems markedly worse on OH3. The same thing happened on OH2.5, but it seemed to take longer to reach complete collapse. I can’t be very specific though, since I’m not in a position to test it any more.
Execute bundle:stop <bundle number> where <bundle number> is the value in the first column of results from the previous command for both bundles. Listing them again should show they are no longer active.
I’m not certain if this is permanent or not but you could restart OH, disable them first thing and see if that addresses the problem. If not then it’s something else. If so we can explore how to disable it permanently.
3.2 is a milestone release. OH releases work as follows:
Release: every six months a new “stable” release is created. When there are no major breaking changes the release is treated as a point release. The current release is 3.1.
Milestone releases (a.k.a. testing releases): every month a new milestone release is created. These have no known major problems, but problems sometimes slip through. The current milestone release is 3.2 M1.
Snapshots: this is the bleeding edge and a new snapshot is created with the latest code merged into the baseline every day (unless the build breaks).
However, just because a baseline is called a release or a milestone doesn’t mean it’s completely free of bugs; “stable” just means it isn’t changing much for six months. Conversely, despite being on the bleeding edge, the snapshots are remarkably free of bugs almost all the time. Many of us run the snapshots in our production systems with only very rare problems. I personally run a milestone when my time is short, or the snapshots when I have more time, upgrading a couple of times a week. I very rarely encounter problems, and because each upgrade includes fewer changes overall, I end up spending less time dealing with breaking changes than I would if I stuck to the release versions and had to deal with six months of changes all at once.
So if you are encountering a problem with a milestone or release version and a fix exists in a snapshot, I don’t hesitate to recommend installing the snapshot version with the fix and then pausing OH upgrades at least until the next milestone or release that also includes the fix.
Do you perhaps have an inadvertent circular reference somewhere in your load statements? I doubt Nashorn is smart enough to prevent that sort of thing, so if a.js loads b.js, which loads c.js, which loads a.js, you’d end up in an infinite loop of loading.
Pay attention to both your code and the Helper Libraries as well.
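For what it’s worth, a cycle like that is easy to check for mechanically. This is a generic depth-first search sketch (the helper name and the shape of the deps map are mine, not anything in openHAB or the Helper Libraries): map each file to the files it loads and walk the graph.

```javascript
// Generic cycle detector for a load() dependency graph. `deps` maps a
// file name to the array of files it loads; returns the cycle path if
// one is reachable from `start`, otherwise null.
function findCycle(deps, start) {
    var stack = []; // files on the current load() chain
    function visit(file) {
        var idx = stack.indexOf(file);
        if (idx !== -1) {
            // file is already being loaded further up the chain: cycle
            return stack.slice(idx).concat(file);
        }
        stack.push(file);
        var children = deps[file] || [];
        for (var i = 0; i < children.length; i++) {
            var cycle = visit(children[i]);
            if (cycle) { return cycle; }
        }
        stack.pop();
        return null;
    }
    return visit(start);
}
```

For example, findCycle({"a.js": ["b.js"], "b.js": ["c.js"], "c.js": ["a.js"]}, "a.js") reports the a → b → c → a loop, while an acyclic tree returns null.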
The reason I suspect that it’s not the case is that, as I moved down the tree of includes in my test file, the CPU load went down (and the time taken to load the test file went down too). If it were a circular reference then you’d see the CPU load snap from normal to maxed out as you added the offending load statement.
I just checked the dependencies, in particular one load statement that, when I removed it, made the CPU load go back to normal (it points to a lot of other files). There aren’t any circular references in there. I organised the files in layers to make it pretty clear where each one falls in the hierarchy.
The other reason to suspect a problem with the handling of the files is that, if I just leave out the five .js files at the top of the tree (the ones containing the triggers), the CPU behaves normally. Almost the entire dependency tree loads with no problem. If there were a circular dependency, that would not be the case, and the top-level files can’t introduce a circular dependency because there’s nothing above them.
OK I’ll have a look at it.
My system has been running happily for four and a half hours.
Did you create an issue on the openhab-core repo? If you do, post a link here so others can find it. Also, if there is a problem that needs looking at, it won’t get looked at until an issue is created.