OH3.1/3.2 out of memory error

Thanks Rossko.

Actually it's only 3 Modbus SDM120 electricity meters. At first I had only one, and the error showed up just a couple of times per day. Now it appears a bit more often, but only for that specific meter.
Endpoint configuration:

UID: modbus:tcp:endpointSDM120-mains
label: Modbus endpont SDM120 Mains
thingTypeUID: modbus:tcp
configuration:
  rtuEncoded: false
  timeBetweenTransactionsMillis: 60
  connectMaxTries: 1
  reconnectAfterMillis: 0
  port: 502
  timeBetweenReconnectMillis: 0
  host: 192.168.0.158
  connectTimeoutMillis: 10000
  id: 4
  enableDiscovery: false

Poller configuration:

UID: modbus:poller:endpointSDM120-mains:Mains_inputRegisters
label: Poller SDM120 Mains
thingTypeUID: modbus:poller
configuration:
  start: 0
  length: 80
  refresh: 1200
  maxTries: 3
  cacheMillis: 50
  type: input
bridgeUID: modbus:tcp:endpointSDM120-mains
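
For reference, a data Thing hanging off this poller looks roughly like the sketch below; the register, value type and UID are only illustrative, not copied from the real configuration. The updateUnchangedValuesEveryMillis parameter is worth noting, since raising it limits how often unchanged values are re-published as Item updates:

UID: modbus:data:endpointSDM120-mains:Mains_inputRegisters:voltage
label: Modbus data SDM120 Mains voltage
thingTypeUID: modbus:data
configuration:
  readStart: "0"
  readValueType: float32
  updateUnchangedValuesEveryMillis: 60000   # example value; limits repeats of unchanged readings
bridgeUID: modbus:poller:endpointSDM120-mains:Mains_inputRegisters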

I wouldn't expect this number of updates to be an issue?

Hmm, my setup isn't even that big if I compare it with others here. Just 102 Things, of which 40 are already Modbus.
The system runs on an RPi with 4 GB RAM, which isn't even close to being fully used. Would this then be more a limitation of openHAB?
I can check whether reducing the number of Items that create all the events lowers the load.

Below is a graph of the heap level and memory usage over the last couple of days. You can see that today the available heap shrinks, but around 21:30 it managed to recover the used heap again. The used memory, however, keeps growing.

If you have Items linked to every Thing channel, including unused ones, it quickly adds up. It also depends on what is done with the events. If you store each and every Item change with persistence, there is a lot more work to do for each event. Some parts of the event handling are single threaded, so it may seem like there is not much CPU load and some CPU cores will remain idle, but if the single-threaded code cannot keep up, the queues of remaining work will grow.

OK, clear, thanks.
For now I've reduced the load by quickly disabling about 30 channels, leaving the most important ones in.
Indeed, by default a lot of information is also sent to persistence. I don't need all of that to be kept; I will check later whether I can remove persistence for those Items.

No, not excessive at all.

I'd still recommend reviewing the performance tweaks in the linked thread.
TCP tweaks can reduce overhead and may improve your nuisance error rate.
You’re configured to produce 40 openHAB event bus Item updates per second as it stands.
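
As a rough sketch of where that figure comes from (the Item count here is an assumption, not read from your config): with a 1200 ms poller refresh, every linked Modbus data Item gets updated roughly 0.8 times per second, so something like 48 linked Items already gives about 48 / 1.2 s ≈ 40 updates per second on the event bus, before any other bindings are counted.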

Knock-on impact on persistence services unknown - potentially a much bigger overhead.

The above shows the heap being 63% full at that moment in time.

Did your recent graphs start just after a restart of openHAB? They seem to show the heap does not go below 87% even just after a GC. Have you made the heap smaller or changed something?

A heap of 890,880 kbytes is plenty for 99% of people. I don't know what your setup consists of, but I would expect it to run at more like 30%.
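
If you have never touched it, the maximum heap is whatever -Xmx resolves to; on a Debian-style install it can be overridden in /etc/default/openhab (the values below are only an illustration, not your settings). If nothing is set there, the JVM picks its own default, roughly a quarter of the machine's physical RAM.

# /etc/default/openhab  (illustrative values only)
EXTRA_JAVA_OPTS="-Xms192m -Xmx870m"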

It would be interesting to uninstall each binding one by one and restart openHAB: which one causes the heap usage to drop dramatically? Installing the binding again should make everything go back to normal.
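
A lighter-weight variant, if you are comfortable with the Karaf console, is to just stop and later restart a binding's bundle instead of fully uninstalling it (the bundle ID below is only an example):

openhab-cli console                # open the openHAB/Karaf console
bundle:list | grep -i zwave        # find the binding's bundle ID
bundle:stop 245                    # stop it temporarily (245 is an example ID)
bundle:start 245                   # start it again later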

Another point: many channels are marked as advanced because the binding developer considered them less useful and wanted to stop users from ticking the 'show advanced' button and then linking 100% of the channels to Items. Not every binding has had someone make this choice, so if a binding has heaps of channels and none are marked as advanced, it is still not a good idea to add them all.

With the heap graphing, the two channels show the same thing, but the % channel is easier to understand without a deeper knowledge of what 'committed' and 'max' mean. The available-heap channel will also show when the heap resizes, so it is not as easy to interpret; that is why it is marked advanced and is effectively a duplicate of the % channel. I recommend just going with the % channel, which always compares against the maximum size the heap can grow to.

Lastly, it does appear that the heap is steadily growing after 10-12 hours have passed; you can see it rising on a slight diagonal. If you disable a binding, look for that trend to disappear, but since it only seems to start after 10+ hours it may take some time.

Hopefully it is just that you have a heap of Items, and once you disable some, the system can handle it.

I believe I have seen this happen. If someone is using an Image Item and filling it with a picture from an IP camera, the state/contents of the Image Item passes through the event bus. A few cameras placing a single 1 MB picture every second is enough to cause an issue. If you have any Image Items, try stopping them for a while.

It would be interesting to know why:

  1. Why is your heap staying so high right from a reboot/restart? Can you narrow it down to a specific binding? This should be quick to find an answer to.
  2. What is causing the steady increase in the heap after 10+ hours?

If the same issue is causing both, then trying to find an answer to question 1 first would be my plan to attack this.

EDIT: Your system RAM/memory looks fine. It is normal for it to keep going up, as Linux keeps everything that goes into RAM as cache and will only clean up when you actually need the RAM. If you want to see a breakdown of this, type free -h in the Linux terminal.
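
An illustrative example of what that looks like (the numbers are made up):

              total        used        free      shared  buff/cache   available
Mem:          3.7Gi       1.2Gi       0.3Gi       0.1Gi       2.2Gi       2.3Gi
Swap:         2.0Gi          0B       2.0Gi

The 'available' column is the one that matters; 'buff/cache' is memory Linux will hand back as soon as something actually needs it.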

You have basically described the only real method of finding a memory leak in openHAB that I have ever seen work, after a couple of years of hanging out in this forum. Most folks don't have the time or patience to do so, but this method does work for those who really try.

Look at this: depending on the DB service doing the 'persisting', there can be a big performance hit. Some are native and quick, others are dogs and unsuitable for a Pi.

Thanks Rossko. I have no clue whether 40 events/second is a lot, but based on your and @wborn's comments I expect it is. I've been able to reduce it to a maximum of about 20 Item updates/second.
I checked your thread and there are indeed some good suggestions I can apply. At first glance the number of errors has already dropped heavily.

Persistence currently uses the default rrd4j for many Items. I will change this to a manual configuration.
Just to be sure: there is no way to exclude individual Items from the default OH3 persistence, is there? To modify the persistence, I need to create a manual configuration file and list the individual Items?

No, I didn't change any settings. The crashing is indeed strange while there is plenty of room available.

bindings:

  • chromecast
  • deconz
  • modbus
  • mqtt
  • network
  • networkupstools
  • rflink
  • serial
  • shelly
  • solaredge
  • systeminfo
  • yamahamusiccast
  • zwave

102 Things of which about 30-35 are now deactivated, 364 Items, 1 rule (Node-RED is used for automation).

Most of the time I only add the channels which bring added value. Nevertheless I think a good review will still reduce the number of channels used.

Removed the available-heap channel.

I will indeed start with the thread Rossko linked and watch for the number of events to drop. I haven't changed much on the system lately, and there aren't many people complaining about out-of-memory errors, so with a bit of luck we're good again after the cleanup.

No cameras or Image Items are used in the system, although while writing this I realise the Yamaha MusicCast binding is sending some pictures of the current album. It's just 103 kB, but I will disable it for now.

I'm using the new version of your binding; it is showing the available heap, right? In that case the Item description isn't correct. If it is really showing the used heap, then I will start disabling the bindings one by one.

Thanks for confirming.

I use the default rrd4j and InfluxDB, the latter writing to a NAS.

Now I've implemented the following changes:

  • stopped the trace logging on threads
  • improved the Modbus configuration with @rossko57's tips
  • cleaned up the unused channels
    - now 321 Items, 78 Things
  • stopped using the image Items (just ~100 kB, 2 Items)
  • restarted the openHAB service

Now I will leave the system running for a day and see if the graphs already show an improvement.
The next step is disabling one binding at a time, depending on Matt's feedback on whether the channel shows the used or the free heap.

The channel shows how much is used as a % of your max heap size. Hitting 100% is bad, but what matters is the value it drops back to after garbage collection is done: the minimum value. Do you have a mean calculation that changes it? I would graph the raw values and not put a mean filter on them, so you can see the dips and the peaks.

By the way, if you upgrade to the milestone 3 that was just released, you can get rid of the JARs and install systeminfo, as it is now merged.

Yes. Creating a manual config will stop the default actions. There is no syntax for “everything except X”, so you have to specify everything you want.

A convenient way to manage this is using Groups - put Items you want to persist in a group like ‘gCharting’ and specify that in your xx.persist file.
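
A minimal sketch of such a file, e.g. $OPENHAB_CONF/persistence/rrd4j.persist (the strategies shown are just one common choice; rrd4j needs an everyMinute strategy for its fixed-interval archives):

Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    // only members of gCharting get persisted
    gCharting* : strategy = everyChange, everyMinute, restoreOnStartup
}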

So that would have been 60-70 odd persists a second? A chunk of them going off to a remote service over HTTP? Sounds like a bottleneck.

First of all, thanks for all the support so far. The system does indeed seem a bit overloaded. Since yesterday's changes the used heap has turned into a horizontal line. I will continue with some improvements.

There was/is indeed some grouping, though zooming in doesn't really change the min/max values.
Below is a screenshot from the OH graphing of the last 4 h. If I compare this with the current shell:info output, it doesn't seem to match the graph values:

  Current heap size           117,349 kbytes
  Maximum heap size           890,880 kbytes
  Committed heap size         191,728 kbytes
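
(Those numbers work out to 117,349 / 890,880 ≈ 13% of the max heap in use and 191,728 / 890,880 ≈ 22% committed.)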

Shouldn't the graph be lower than it currently is? If so, I will upgrade the system to 3.2 M3 and start investigating the individual bindings as well.

Thanks, will follow this.

Correct. Would it be better / use fewer resources to move the database to the OH server, or to use e.g. MQTT to share the values with the NAS? I actually moved it to the NAS to lower the load on the RPi, but if this is increasing the load, my goal isn't being reached.

I have no idea, not my field. You've traded local file I/O traffic for HTTP traffic. There'll be competition for both, but which is the bottleneck? The relative speed depends on the medium for each. Is ZRAM involved, for instance?

Bug found and PR made. The graph will be accurate, just much higher up the axis than it should be. Thanks for confirming. The sharp drop in the graph would have been the heap resizing to a larger size; it won't do that in future.

I would not go uninstalling bindings now, wait to see if it is fixed.

It is looking much better now, but to be sure you need to watch it over a week to see whether there is a tiny upward trend. If it stays horizontal and doesn't head upwards then you're all good.

I will think about it; internal I/O to an SSD via USB 3 is always faster than HTTP traffic to an HDD. Since there is an SSD, no ZRAM is involved.

I will think about some changes to the setup. With a freshly installed heat pump I am storing some measurement data to understand whether the heating is working properly, but this causes a lot of additional Item updates, creating extra load on the event bus. Part of the data is sent to openHAB via MQTT; perhaps it is better to use e.g. Telegraf to put the data directly into InfluxDB and visualise it via Grafana instead of OH.
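
A minimal Telegraf sketch of that idea (broker address, topic, URL and database name are placeholders, not taken from this setup):

# telegraf.conf - read heat pump values from MQTT and write them straight to InfluxDB
[[inputs.mqtt_consumer]]
  servers = ["tcp://192.168.0.10:1883"]   # placeholder MQTT broker
  topics = ["heatpump/#"]                 # placeholder topic
  data_format = "value"
  data_type = "float"

[[outputs.influxdb]]
  urls = ["http://nas.local:8086"]        # placeholder InfluxDB on the NAS
  database = "heatpump"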

Nice!

Great. I just updated to 3.2 M3 to get rid of the additional JAR files and will keep the system running to see if the trend stays horizontal.

Mm, but working remotely also shifts "the database" management processing load off the openHAB box. It's all fumbling in the dark when there is no clear view of where your bottleneck is/was.
I'd err towards simplicity until a problem arises.

@lampy How is it going now after 3 days?

The system is running without any issues currently, so that's nice!
Looking at the last couple of days, though, the used heap seems to increase slightly.
The first couple of hours look horizontal, but then it starts to climb a bit.
I'm not 100% sure it is connected, but I made a couple of small changes which I have just reverted. Let's see if they were the cause.

With such a large heap you could probably run for 2 weeks at that rate before the OOME occurs, but since it is heading upwards it will happen. Being able to see it and make changes to narrow it down should speed things up.

An update on the memory issue. I reverted the changes I made, but the heap usage unfortunately keeps climbing:

So I actually still have an issue somewhere.
Since the issue no longer seems very big, I thought I'd make a heap dump.
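
For reference, the dump itself can be taken with the stock JDK tooling and opened in an analyser such as Eclipse MAT; the file path and the way the PID is found below are just examples:

# run as the user openHAB runs under
jmap -dump:live,format=b,file=/tmp/openhab-heap.hprof $(pidof java)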

Two possible leaks were detected:


Checking the first:

Checking the second:

So, Z-Wave is on the radar. Currently there is one actuator (a dimmer) that decided to stop reacting to commands and go offline, so I suspect it of causing the issue. But it is quite some work to take it out of the wall, reset it and re-add it to the network.
I think I can make some time to do this later this evening; hopefully this is the solution (at least for the current heap increase).

I reviewed the Z-Wave nodes. I removed some unnecessary reporting from one busy node (it was sending status updates every 60 seconds) and re-added the dead node (after a power failure it wasn't connecting anymore).

It is too early to say anything about the slope, but the amount of used heap did drop quite a lot.

I will leave the system running again for a day or two and share the results.