OH3.1/3.2 out of memory error

Thnx, the additional entries are now showing up in the console. Will post the result again when available.

Edit: now I am getting around 100 entries per second in the log file; isn't this creating other issues?

Used the new binding version, let’s see how the full graph is developing now.

No other ERRORS in the log around the time the graph stopped. I have some regular errors from Modbus not being able to read the SDM120 electricity meter, but this has been ‘normal’ for some years now; I’ve seen more people here having issues with that.
Indeed I needed to check the log rotation files already, since a couple of hours passed between the issue occurring and me being back at home.

I am using openHABian as my setup and would like to stick with the recommended setup as they developed it. I assume they use the recommended Java engine?

Azul = Zulu
You’re good, but yes, it is confusing.

shame on me…

Sorry, I didn’t realize that you are already using Zulu :see_no_evil:

System crashed (it seems to happen more quickly now than before).
Attached are the latest two log files; I didn’t filter anything, so you are able to see every entry in the logs. The rotation file was way too big to upload, so it is zipped; after downloading, change the extension back to .zip.
openhab.log (88.5 KB)
openhab-rotation-zip.txt (736.7 KB)

Around 10:15, when the issue occurred, I arrived home, so actually everything happened at once: the radio started and the light scenes were activated, triggering both Z-Wave and RFLink.

Commentary:

For non-Modbus users: the apparently frequent Modbus errors in the log are not especially alarming. Modbus devices are allowed to be too busy to immediately service requests, etc., resulting in a timeout at the openHAB end.
That’s why we have retries built into the binding. A consistent “try 1/3 failed” means try 2 worked okay, so it’s just nuisance reporting and overhead.
It is sometimes possible to tune the binding parameters to reduce or eliminate the nuisance, but that’s another topic really.

Setting that aside, how extensive is your Modbus configuration? With default settings in place, the Modbus binding is capable of pouring hundreds of updates per second into openHAB Items. This in turn can wallop persistence services running with default settings, and has serious performance costs.

Yes, it seems quite busy with events. Based on your logging it is handling ~43 events per second. It could be that it simply cannot keep up with this, and eventually the queues fill up with work and it runs out of memory. Adding more system info events and increasing the log level also increases the load, which may explain why the issue got worse.

Thnx Rossko.

Actually only 3 Modbus SDM120 electricity readers. At first I had only one reader, and it showed the error only a couple of times per day. Now it happens a bit more often, but only with that specific reader.
Endpoint configuration:

UID: modbus:tcp:endpointSDM120-mains
label: Modbus endpont SDM120 Mains
thingTypeUID: modbus:tcp
configuration:
  rtuEncoded: false
  timeBetweenTransactionsMillis: 60
  connectMaxTries: 1
  reconnectAfterMillis: 0
  port: 502
  timeBetweenReconnectMillis: 0
  host: 192.168.0.158
  connectTimeoutMillis: 10000
  id: 4
  enableDiscovery: false

Poller configuration:

UID: modbus:poller:endpointSDM120-mains:Mains_inputRegisters
label: Poller SDM120 Mains
thingTypeUID: modbus:poller
configuration:
  start: 0
  length: 80
  refresh: 1200
  maxTries: 3
  cacheMillis: 50
  type: input
bridgeUID: modbus:tcp:endpointSDM120-mains

I wouldn’t expect this amount of updates to be an issue?
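A rough back-of-the-envelope (Item counts assumed just for illustration): with 3 pollers each refreshing every 1200 ms and, say, 10-15 linked data Things per poller, that is roughly 3 × 12 ÷ 1.2 ≈ 30 Item updates per second, more if several Items are linked per data Thing, which is in the same ballpark as the ~43 events/second @wborn measured from the logs.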

Hmm, my setup isn’t even that big if I compare it with others here: just 102 Things, of which 40 are already Modbus.
The system runs on an RPi with 4 GB RAM, which isn’t even close to being fully used. Would this then be more a limitation of openHAB?
I can check by reducing the number of Items that create all the events, to confirm whether the load drops.

Below is a graph of how the heap level and memory usage developed over the last couple of days. You can see that today the available heap reduces, but around 21:30 it was able to recover the used heap again. The used memory keeps growing though.

If you link Items to every Thing channel, even ones you don’t actually use, it quickly adds up. It also depends on what is done with the events: if you store each and every Item change with persistence, there is a lot more work to do for each event. Some parts of the event handling are done single-threaded, so it may seem like there is not much CPU load and some CPU cores will remain idle, but if the single-threaded code cannot keep up, the queues of remaining work will grow.
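As a toy illustration of that last point (plain Java, nothing openHAB-specific, and the rates are made up): a single consumer thread backed by an unbounded queue will happily accept work faster than it can process it, so the backlog, and the memory holding it, only ever grows.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        // Single consumer thread, backed by an unbounded work queue.
        ExecutorService single = Executors.newSingleThreadExecutor();

        // Producer: ~40 "events" per second, like a busy event bus.
        // Consumer: each "event" takes 50 ms, so only ~20/s get processed
        // and roughly 20 tasks per second pile up in the queue.
        for (int i = 0; i < 2_000; i++) {
            single.submit(() -> {
                try {
                    Thread.sleep(50); // pretend this is persistence/rule work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread.sleep(25); // 40 submissions per second
        }
        single.shutdown();
    }
}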

OK clear, thnx.
For now I’ve reduced the load by quickly disabling about 30 channels, leaving the most important ones in.
Indeed, by default a lot of information is also sent to persistence. I don’t need all that information to be kept; will check later if I can remove the persistence from those Items.

No, not excessive at all.

I’d still recommend reviewing the performance tweaks as in the linked thread.
TCP tweaks can reduce overhead and may improve your nuisance error rate.
You’re configured to produce 40 openHAB event bus Item updates per second as it stands.

Knock-on impact on persistence services unknown - potentially a much bigger overhead.

The above shows the heap being 63% full at that moment in time.

Your recent graphs: did they start just after a restart of openHAB? They seem to show the heap does not go below 87% even just after a GC. Have you made the heap smaller or changed something?

A heap of 890,880 kbytes is tons for 99% of people. I don’t know what your setup consists of, but I would expect it to run more like 30%.
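(For context on where that maximum comes from: it is the JVM’s -Xmx limit, or roughly a quarter of system RAM by default if none is set. On a Debian-based install it can be overridden via EXTRA_JAVA_OPTS, e.g. a line like EXTRA_JAVA_OPTS="-Xmx768m" in /etc/default/openhab; the file path and value here are purely illustrative, I don’t know the details of your install.)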

It would be interesting to uninstall each binding one by one and do a restart of openHAB, to see which one causes the heap to make a dramatic drop in usage. Installing the binding again should make everything go back to normal.
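If your add-ons happen to be managed by file rather than through the UI, a rough sketch of that elimination round (the binding list below is just an example drawn from your list; remove one entry per test run and restart):

# conf/services/addons.cfg (sketch): drop one binding per test run, restart, watch the heap
binding = modbus,mqtt,zwave,rflink,shelly,systeminfo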

Another point: many channels are marked as advanced because the binding developer considered the channel less useful and wanted to discourage users from ticking the “show advanced” button and then linking 100% of the channels to Items. Not all bindings have had someone make this choice, so if a binding has heaps of channels and none are marked as advanced, it is still not a good idea to add them all.

With the heap graphing, the two channels show the same thing, but the % channel is easier to understand without knowing what committed and max mean. The available-heap channel will show when the heap resizes, so it is not as easy to interpret; that is why it is marked advanced, and it is effectively a duplicate of the % channel. I recommend just going with the % channel, which always compares against the maximum size the heap can grow to.
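Purely as a sketch of how such a channel would be linked if you use .items files (the channel id below is a placeholder I made up, check the binding docs for the real one):

// placeholder channel id, for illustration only
Number JVM_Heap_Used_Percent "JVM heap used [%.0f %%]" { channel="systeminfo:computer:local:memory#usedHeapPercent" }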

Lastly, it does appear that the heap is steadily growing after 10-12 hours have passed; you can see it growing on a slight diagonal. If you disable a binding, look for that trend to disappear, but since it only seems to start after 10+ hours it may take some time.

Hopefully it is just a heap of Items and, once you disable some, the system can handle it.

I believe I have seen this happen. If someone is using an Image Item and filling it with a picture from an IP camera, the state/contents of the Image Item passes through the event bus. A few cameras each placing a single 1 MB picture every second is enough to cause an issue. If you have any Image Items, try stopping them for a while.

It would be interesting to know:

  1. Why your heap is staying so high right from a reboot/restart. Can you narrow it down to a single binding? This should be quick to find an answer to.
  2. What is causing the steady increase in the heap after 10+ hours.

If the same issue is causing both, then trying to answer question 1 first would be my plan of attack.

EDIT: Your system RAM/memory looks fine. It is normal for it to keep going up, as Linux keeps everything that goes into RAM as cache and only cleans it up when you actually need the RAM. If you want to see a breakdown of this, type free -h in the Linux terminal; the “available” column is the figure that matters, while “buff/cache” is memory Linux will hand back when applications need it.

You have basically described the only real method of finding a memory leak in openHAB that I have ever seen work, after a couple of years of hanging out in this forum. Most folks don’t have the time or patience to do so, but this method does work for those who really try.

Look at this: depending on the DB service doing the ‘persisting’, there can be a big performance hit. Some are native and quick; others are dogs and unsuitable for a Pi.

this

Thnx Rossko, I have no clue whether 40 events/second is a lot, but based on your and @wborn’s comments I expect it is. I’ve been able to reduce it to a maximum of about 20 Item updates/second.
Checked your thread, and indeed there are some good points I can certainly apply. At first glance the number of errors has already dropped heavily.

Persistence now defaults to rrd4j for many Items. Will change this to a manual configuration.
Just to be sure: there is no possibility to remove individual Items from the default OH3 persistence, is there? To modify the persistence, I need to create a manual configuration file and list the individual Items?

No, I didn’t change any setting. Indeed the crashing is strange while there is plenty of room available.

bindings:

  • chromecast
  • deconz
  • modbus
  • mqtt
  • network
  • networkupstools
  • rflink
  • serial
  • shelly
  • solaredge
  • systeminfo
  • yamahamusiccast
  • zwave

102 Things, of which about 30-35 are now deactivated, 364 Items, 1 rule (Node-RED for automation)

Most of the time I only add the channels which bring added value. Nevertheless I think a good review will still reduce the number of channels used.

Removed the available heap.

Will indeed start first with the thread which Rossko linked and look for the number of events to drop. I haven’t changed a lot on the system lately, and there are not many people complaining about out-of-memory errors, so with a bit of luck we’re good again after the cleanup.

No cameras or image files are used in the system, though while I am writing this I realize the Yamaha MusicCast binding is sending some pictures of the current album. It’s just 103 kB, but I will disable it for now.

I am using the new version of your binding, which is showing the available heap, right? In that case the item description isn’t correct. If it is really showing the used heap, then I will start disabling the bindings one by one.

Thnx for confirming.

I use the default rrd4j and InfluxDB, the latter writing to a NAS.

Now I’ve implemented the following changes:

  • stopped the trace logging on threads
  • improved the Modbus configuration using @rossko57’s tips
  • cleaned up the unused channels
    - now 321 Items, 78 Things
  • stopped using the image Items (just ~100 kB, 2 Items)
  • restarted the openHAB service

Now I will leave the system running for a day and see if the graphs already show an improvement.
The next step is disabling one binding at a time, depending on Matt’s feedback on whether the channel is showing used or free heap.

The channel shows how much is used as a % of your max heap size. Hitting 100% is bad, but what matters is the value it drops to after garbage collection is done, i.e. the minimum value. Do you have a mean calculation that changes it? I would graph the raw values and not put a mean filter on them, so you can see the dips and the peaks.

Btw, if you upgrade to milestone 3, which was just released, you can get rid of the JARs and install systeminfo normally, as it’s now merged.

Yes. Creating a manual config will stop the default behaviour. There is no syntax for “everything except X”, so you have to specify everything you want.

A convenient way to manage this is using Groups: put the Items you want to persist in a group like ‘gCharting’ and specify that group in your xx.persist file.
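A minimal sketch of what that might look like in, say, rrd4j.persist (group name and strategies are just the usual example, adjust to your own naming); note that rrd4j needs an everyMinute strategy to keep its fixed-interval archives filled:

Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    gCharting* : strategy = everyChange, everyMinute, restoreOnStartup
}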

So that would have been 60-70 odd persists a second? A chunk of them going off to a remote service over HTTP? Sounds like a bottleneck.

First of all, thanks for all the support so far. Indeed the system seems a bit overloaded. Since yesterday’s changes the used heap has turned into a horizontal line. Will continue with some improvements.

There was/is indeed some grouping, though zooming in doesn’t really change the min/max values.
Below is a screenshot of OH graphing for the last 4 h; if I compare this with the current shell:info output, it doesn’t seem to match the graph values?

  Current heap size           117,349 kbytes
  Maximum heap size           890,880 kbytes
  Committed heap size         191,728 kbytes

Shouldn’t it be lower than is currently the case? In that case I will upgrade the system to 3.2 M3 and start investigating the individual bindings as well.

Thanks, will follow this.

Correct. Would it be better / use fewer resources to move the database to the OH server, or to use e.g. MQTT to share the values with the NAS? I actually moved it to the NAS to lower the load on the RPi, but if this is increasing the load, my goal isn’t reached.

I have no idea, not my field. You’ve traded local file I/O traffic for HTTP traffic. There’ll be competition for both, but which is the bottleneck? Relative speed depends on the medium for each. Is ZRAM involved, for instance?

Bug found and PR made. The graph will be accurate, just much higher up the axis than it should be. Thanks for confirming. The sharp drop in the graph would have been the heap resizing bigger; it won’t do that in future.

I would not go uninstalling bindings now; wait to see if it is fixed.

It is looking much better now, but to be sure you need to watch it over a week to see if there is a tiny upward trend. If it stays horizontal and doesn’t head up, then you’re all good.