OH4 runs out of memory

What you’re describing is exactly what I had with the remote openHAB binding. I’m running the Shelly binding with about 45 active devices and another 20 or so that are just on the network (test devices).

We need an efficient way of analyzing this; everything else is just guesswork.


Remote openHAB also uses its own HTTP client. The “answer” is for people to take thread dumps while it’s throwing the errors and post the log for analysis.
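If you have shell access to the host, a minimal sketch using the standard JDK tools (assuming jps and jstack from the Java installation running openHAB are on the path) would be:

# find the PID of the openHAB JVM
jps -l

# write a full thread dump (including lock info) to a file while the errors are occurring
jstack -l <openhab-pid> > /tmp/openhab-threads-$(date +%F-%H%M%S).txt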

My 2 cents worth, take it for whatever. However, I would think a heap dump parsed through MAT would yield a lot of insight as to what is holding the memory space. Not having enough memory to spawn a new thread is not the same as thread pool exhaustion. Either something is holding references and not allowing GC to free up space in the heap, which gives you the OOM, or you have an object that is being loaded in a recursive/cyclic loop; that is where I would start. I have not had any of the OOM issues folks are discussing, and I am running 4.1M1 with Z-Wave (20+ items) and the openHAB ZigBee binding (not zigbee2mqtt) with 125 items, as well as some of the other common bindings (Astro, NTP and so on).

I do not use the remote openHAB binding :frowning:
What would you suggest for efficient analysis? What can I do?

If I were chasing the OOM you are asking about, I would start openHAB in debug mode, then when I see it climbing close to the max heap size, acquire a heap dump and analyze it with a tool like Eclipse MAT (linked above), as @wborn mentioned. If you are not good at debugging, perhaps get the heap dump and provide a location on GitHub or some other file share; someone else may be willing to pull it down and comb through it. Otherwise your best bet is to start ripping bindings out until you no longer see the issue. My comment about the openHAB ZigBee binding was to point out that I have some heavy bindings in my 4.1M1 instance and am not seeing the issue, so it is most likely not a core issue but a specific binding, as has been pointed out previously.
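As a rough sketch of what acquiring that heap dump can look like from the command line (assuming the JDK’s jmap is available and you have the openHAB process id, e.g. from jps):

# dump only live (reachable) objects in binary format; Eclipse MAT can open the resulting .hprof
jmap -dump:live,format=b,file=/tmp/openhab-heap.hprof <openhab-pid>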

Thanks for the pointer to Nest. I had the same issue here. It turns out I was still using the Nest WWN API, which was deprecated on Sept 29th, 2023. This caused OH’s memory to fill up with HTTP threads trying to reach the deprecated API, and eventually caused OH to hit an OOM.

Although ultimately of course I should switch to the new API, this is a rather harsh way to find out about it. The Nest binding could have handled WWN API connection errors in a more graceful way…

That timing corresponds with when I started experiencing the issue on 4.0.1, and it continued after I upgraded to 4.0.3.

I was using the WWN API for my Nest Protect smoke alarm but also the newer SDM API for my Nest Doorbell. According to the binding documentation, it looks like the Nest Protect only works with the older WWN API :frowning:

@justaoldman you are correct, this isn’t thread pool exhaustion. At least in my case, something yet to be determined was causing the threads to be abandoned. They never closed, so they were still taking up memory. As they weren’t stopped, garbage collection didn’t eat them either. Mine would go for 2-3 days and then OOM.

An easy check while it’s broken is to run shell:threads from the Karaf console and dump that to a log. We should be able to see what’s causing the issue from there.
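One sketch of getting that into a file, assuming the default console port 8101 and user openhab (adjust both to your install), is to run the command non-interactively over SSH and redirect the output:

ssh -p 8101 openhab@localhost shell:threads > /tmp/threads-$(date +%F-%H%M%S).log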

If you’re feeling adventurous, @wborn just posted a patch to the current snapshot builds that may fix this. Can you try snapshot 3659 or newer to see if the OOM stops?

Just my experience.
Even if it is not advised: since I disabled ZRAM, no out-of-memory problem has appeared in two years and across several OH 3 and 4 upgrades.

I have no proof that it is directly linked, but I suspect it is.

Consider running a test in that direction.

Current situation: all my bindings are enabled (except Shelly) and the system works fine for 4-5 days without errors. I decided to run a test and installed the Shelly binding but disabled all my Shelly things. About 1 hour later I had only errors in the logs, Karaf was no longer accessible, the openHAB web page was also inaccessible, and I had to restart openHAB.

Error in the logs was:

[WARN ] [ab.core.internal.events.EventHandler] - The queue for a subscriber of type ‘class org.openhab.core.internal.items.ItemUpdater’ exceeds 5000 elements. System may be unstable.

That’s why I highly suspect the Shelly binding.

@justaoldman thank you for the hint, but I have no idea how to start in debug mode or how to acquire a heap dump; I will google it :slight_smile:
I tried shell:threads, which gives me a long list of things, but how do I get that into a log?
Is:

dev:dump-create

the command I need? In there I find a file called threads that contains about 281 threads (when the system works normally). Should I try this with the Shelly binding running?
Thank you for your help

You can probably run the dump before the crash and see the problem. I highly doubt it just magically happens. Give it 30-45 minutes and you should still be able to get into Karaf. Otherwise just log the SSH session and you’re good.
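For logging the whole session, a simple sketch (same default port/user assumptions as above) is to pipe the interactive SSH session through tee; note the log will also contain terminal control characters:

ssh -t -p 8101 openhab@localhost | tee /tmp/karaf-session.log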

To note, this was one of the errors I had when Neeo freaked out on me.

Hi @cmachtel,
Sorry for the slow reply; that thing called work got in the way and I just saw your post.
Yes, that command will create something that is often referred to as a “support zip” or “support bundle”.
It should be located in your userdata directory. It will likely contain six files: five txt files with various info (such as what was loaded, the version, and all of those threads you saw using the other command) as well as a “heapdump.hprof” file. You can open that with a tool like Eclipse MAT, referenced previously, which will analyze what objects were using all your memory space and point out the “likely suspects”.
It should tell you pretty much everything that was happening at the time you triggered it.
Also, assuming you launched openHAB using the start_debug script: if you want a second heap dump after the console gets sluggish/unresponsive, you can also trigger one from Eclipse MAT by selecting the related openHAB Java process and letting it acquire (“capture”) a heap dump as well.
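If the console is too sluggish to reach, a heap dump can also be taken from the OS shell with the JDK’s jcmd (a sketch, assuming jcmd is on the path and you know the openHAB PID):

jcmd <openhab-pid> GC.heap_dump /tmp/openhab-heap-2.hprof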


Hi @justaoldman and all the others who helped me.
I managed to get the .zip containing the dump, and I can directly see that the threads grow from 331 without the Shelly binding to 2572 after 30 minutes with the Shelly binding installed without things… 5 minutes later the system broke down.

I will now try to get familiar with Eclipse MAT, but if anybody is interested in helping, here is my last dump file: Dropbox - 2023-10-05_114925.zip - Simplify your life

Thank you


Please see if the newest snapshot (3659 or above) resolves this OOM issue. There was a patch pushed to fix a bug upstream.

@morph166955 You mean a snapshot of the Shelly binding?
How can I update to that?
I tried downloading here: GitHub - openhab/openhab-addons: Add-ons for openHAB
but it does not give me a JAR that I can install in OH

No, I’m referring to the OH 4.1 SNAPSHOT.

Looking at your heap dump, it doesn’t seem to have memory issues; it’s fairly small (300 MB). From the thread list, however, I can see a lot of WebSocketClient instances, way too many for a regular launch. Looking at the references, it seems to be related to the lgthinq binding. I can’t see anything wrong in its code for now. I could be blind or wrong (I prefer the latter!).

Edit: The second suspect is Shelly and its Shelly2RpcSocket. Looking at its code, there are multiple calls to close, which does a graceful shutdown of the connection.
@cmachtel The heap dump contains 261 instances of websocket sessions. Do you have that many Shelly devices?
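As a cross-check without MAT, you can also grep the threads file from the support zip; a rough sketch (assuming the websocket threads actually carry “WebSocket” in their names, which may vary with the bundled Jetty version):

grep -ci websocket threads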

Cheers,
Łukasz

Hi @splatch,
Thank you for taking the time to help me.
I have around 60 Shelly devices and 4 LG devices. The LG binding works well, and if Shelly is not installed there is no issue; the thread count stays constant at around 330.

I am currently upgrading to the latest snapshot (which I would have preferred not to do on a daily-used system), but let’s see…
@markus7017 sorry for disturbing you, but you seem to be the maintainer of the Shelly binding; could you help me with this?