OH3.1/3.2 out of memory error

Here is an update on the current situation.

Unfortunately there is still a slight upward slope in the used heap, but a very small one. You don’t see it in the nice graphs from @matt1 when showing only 2 days, but over 7 days it is clear.
I created several heap dumps and used the compare function between two snapshots, but no leak was detected.
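For reference, a dump can be taken with jcmd and then opened/compared in a tool like Eclipse MAT. Roughly like this, where the pid and output path are placeholders and the `openhab` user name assumes a standard openHABian/package install:

```
# Run as the same user as the openHAB JVM, otherwise jcmd cannot attach.
# Find the pid first, e.g. with: pgrep -f openhab
sudo -u openhab jcmd <openhab-pid> GC.heap_dump /tmp/openhab-heap.hprof
```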

I didn’t change anything in the last week; I only started creating a group so that I can later modify the persistence strategy from the default to storing only what is needed.

Next steps are:

  • change persistence (see the sketch after this list)
  • if no change, start disabling bindings one by one. But this will take some time, especially due to the very small slope.
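For the persistence change, the idea is roughly the following. This is only a minimal sketch; the group name `gPersist` and the exact strategies are assumptions, not my final configuration:

```
// conf/persistence/rrd4j.persist -- persist only members of a dedicated group
Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    // gPersist is the group mentioned above; add only the items that really need history
    gPersist* : strategy = everyChange, everyMinute, restoreOnStartup
}
```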


What is the number of running ‘threads’ doing, is it also climbing?

This morning I enabled monitoring of the number of threads via the Systeminfo binding; this is the result so far.
Given that the heap slope climbs only very slowly, perhaps the elapsed time is too short to say anything about this?
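For anyone wanting to monitor the same, the item looks roughly like the sketch below. The Thing UID must match your own systeminfo Thing, and if I recall correctly the channel id is `currentProcess#threads`:

```
// example .items entry; replace systeminfo:computer:local with your own Thing UID
Number SysInfo_Threads "openHAB threads [%d]" { channel="systeminfo:computer:local:currentProcess#threads" }
```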

Below are the graphs from 3 days ago. I guess one can slowly start to see a slight incline in the number of threads.
This coming weekend I have no time to work on openHAB, so I will let the system run without making any changes.

Small update from my side: unfortunately the issue isn’t solved yet. So I am following the advice of @matt1 to disable the bindings one by one and see the result.

Can exclude now:

  • custom binding for the Yamaha musiccast
  • modbus
  • solaredge
  • shelly
  • chromecast

To go:

  • zwave
  • RFlink
  • network
  • network UPS
  • Deconz
  • mqtt
  • systeminfo (hope it isn’t this one, hard to find the issue then :sweat_smile:)

I have been doing the same for the last two weeks :slightly_frowning_face: I think my issue was Homekit.

Actually it is not too bad to do. Of course ‘nice’ would be something else, but with the great help here and the really useful feature of showing the heap in a graph, it is pretty quick to determine whether there is a change.
After about 12 h you can already see the direction a bit, and after ~24 h you know fairly surely whether there is a change. And given the small slope, I have more than 2 weeks before the free space is gone.

Edit: I don’t have Homekit, but thanks for the tip!

All bindings have been checked except systeminfo. No change in the climbing slope.
After the weekend I will try removing all bindings at once, but this is quite strange…
Could the issue be connected to the API used by Node-RED, or to something else?
I only have 1 rule forwarding notifications; I can also try disabling that one.

Then I am out of ideas, so if somebody has additional things to look at, that would be helpful.

Another update on the current status. I have now tried deleting all bindings to see the result.
Below are two screenshots with the outcome; from 16/11 onward the system was running with all bindings removed and was restarted. 14-16/11 was with all bindings active.
Conclusion: there is a slight difference, but not a lot. I will activate the bindings again and continue the search in other directions. Maybe the persistence services or Node-RED are causing issues.


For background, here is also the monitoring of the number of threads; this is quite stable.

Are you graphing item state updates per hour?
I have a Resol binding that runs out of memory. The binding pushes the updates up to 150,000 per hour, and I wonder whether persistence can keep up.
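One rough way to get such a number is a counting rule. This is only a sketch in the rules DSL; `gAll` is a placeholder group containing the items of interest:

```
var Number updateCount = 0

rule "Count item updates"
when
    Member of gAll received update
then
    updateCount = updateCount + 1
end

rule "Log updates per hour"
when
    Time cron "0 0 * * * ?"
then
    logInfo("updates", "Item updates in the last hour: " + updateCount)
    updateCount = 0
end
```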

I also have a mem leak with Homekit but you don’t run that.

Thanks for your reaction!
I currently use 2 kinds of persistence:

  • the built-in strategy in RRD4j → by removing all bindings there were no item updates anymore (I also disabled all bridges, so no Things keep trying to poll for updates)
  • some dedicated items towards InfluxDB → reduced to almost no updates at all, only a few. I offloaded quite some work to the NAS, which hosts the database anyway, so OH isn’t busy with this (see the sketch after this list).
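For illustration, a reduced InfluxDB setup of this kind looks roughly like the sketch below; the item names are placeholders, not my actual items:

```
// conf/persistence/influxdb.persist -- only a handful of items, stored on change
Strategies {
    default = everyChange
}

Items {
    EnergyMeter_Total, Temperature_Living : strategy = everyChange
}
```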

I have now removed the RRD4j service; let’s see if this brings any reduction.
I am also preparing to move away from the default persistence configuration, but unfortunately I haven’t finished this work yet.

Small update from my side, also to help others. After having removed all bindings, removed persistence and disabled all Things, I scaled back the searching and increased the heap size. I didn’t have much time and was out of ideas.
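For anyone wondering how: on a package-based install the heap limit can be raised via the JVM options, for example along these lines (the values are examples only):

```
# /etc/default/openhab  (Debian/openHABian package install; restart openHAB afterwards)
EXTRA_JAVA_OPTS="-Xms192m -Xmx1024m"
```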

But I didn’t stop thinking, and when new ideas popped up I tried them. Also, after updates I kept an eye on the heap level to see if it improved.

Now I’ve found something concrete: apparently I still had one .items file which contained a link to a non-existing Thing. It didn’t cause any errors, but it did increase the heap growth!
You can see it in the graph below; I fixed it at #1.
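For clarity, the leftover entry was of this kind: an .items line whose channel points to a Thing that no longer exists. The line below is made up for illustration, not the actual one:

```
// stale link: the Z-Wave node behind this channel was removed long ago
Switch Old_Wallplug "Old wall plug" { channel="zwave:device:controller:node99:switch_binary" }
```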

But then I also found the scene control in the marketplace and decided to install it. I also installed the JS Scripting add-on when I wasn’t succeeding at first. This event is #2 in the graph.
After seeing the heap slope increasing again, I removed only JS Scripting (#3) (it actually wasn’t necessary to have it installed at all) and the heap seems to stabilize a bit again.

I will wait for 1 or 2 days; if it stays stable, I will install it again to see whether it is reproducible.

After letting the system run for some more days without further changes, it seems stable.
Thanks all (special thanks to @matt1, @rossko57, @wborn, @Andrew_Rowe) for the good assistance and for the work on the new used-heap channel of the Systeminfo binding, which gives quick insight.

Summary:
List of changes made which might or might not have helped; for sure some of them made the heap leak less steep:

  • removed dead nodes from the Z-Wave system
  • reduced the persistence level to lower the workload
  • removed unused items/channels, especially items that were updated every second, to reduce the workload
  • improved Modbus following the instructions of rossko57 (link)

Finally, what I think is the most important change:

  • removed 1 ‘dead’ link in an old .items file. I had one .items file left over and had completely forgotten about it.

I also took the opportunity to remove one .rules file and move it to the UI, so I don’t make the same mistake again.

I saw that JS Scripting was already noticed in other threads, so I will leave that discussion there.


Can you reproduce that problem?
If so, it probably is a bug. Please open a GitHub issue with openhab-core.

Actually I didn’t try to put the file back; I was happy that the memory growth finally stopped. :sweat_smile:
I will place the old file back from a backup, and if the growth starts again I will open an issue.


Please do that, as we all want any bugs to get fixed so that other people don’t go through what you have just done. Please narrow down which item it is and post the line here, or a link to the GitHub issue. Also, from your last graph it cannot be 3.2 Stable; I suggest you back up and then upgrade when you have time to fault-find any issues.

Will continue the search, no problem. I have placed the file back, so I should see it within a day or two.
According to openHAB itself I am on 3.2 stable, and I also have the new 3.2 features (e.g. the marketplace).
Why do you think I am not?
I didn’t update the naming of the item when updating from your custom-built jar though, if that is what you are referring to.

Edit: actually I didn’t think this would be a bug, just me making an error (if it really is reproducible).

It is a bug if openHAB cannot handle the user error gracefully. It is not acceptable for a program to run out of memory and crash, so if you can narrow it down so that others can reproduce the fault, then someone can take a look and fix it.

I am wrong then about you being on an old build; it must have been a major change in the core that caused the sudden changes in memory, rather than heap resizing showing up in the graph. Ignore my comments on that.

Ok, well, maybe the problem is not solved…
After placing the file back I did manage to get the heap to increase again, but after removing it the growth doesn’t stop anymore.
This might mean the root cause wasn’t the invalid channel link but something else. Perhaps I was too fast in drawing conclusions.

I still have an RPi 3B doing nothing, so I will install it with a fresh openHABian image and import my configuration there. This is to rule out that the problem is caused by a configuration issue/bug. I think I can exclude the bindings, since I removed them all one by one, including disabling all Things.
Maybe my system is damaged ‘somewhere’ in a way that isn’t overwritten by updating or by switching from milestone to stable.


That is possible, as corruption can do anything. Make sure you have a UPS to prevent unplanned power resets while the SD card is in the middle of a write. Also, openHABian has zram features that will help; make sure they are set up.
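If I remember correctly, zram can be checked and enabled from the openHABian configuration tool (menu layout may differ between versions):

```
# opens the openHABian menu; the zram option is under the system settings
sudo openhabian-config
```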