OH3.1 out of memory error

  • Platform information:

    • Hardware: Raspberry Pi 4 Model B Rev 1.1
    • OS: openhabian [main]{2021-08-28T12:47:26+02:00}(a7f9438)
    • Java Runtime Environment: OpenJDK Client VM version 11.0.12+7-LTS
    • openHAB version: 3.1
    • kernel: Linux 5.10.52-v7l+
  • Issue of the topic: Out of memory error and OH freezes

Dear all, I’ve had some issues with out of memory errors in the past, after which I added the rule to create a heap dump on out of memory errors. Luckily the error stayed away for quite some time, but 2 weeks ago during my holiday the out of memory error occurred again.
I was able to restart OH remotely, so the system continued with its presence simulation and has been running for about a week again.

Unfortunately I don’t have the logs anymore, though the heap dump was created. I tried to understand a bit from it using Memory Analyzer, but unfortunately I don’t have any clue where to start looking.
I barely have any rules active, since I use NodeRed for my automation.
I don’t see any hints pointing to a certain binding or similar in the analysis.

Is somebody able to help me take a step further in my investigation?
Some screenshots:

Below is a screenshot of the event in the monitoring; it took me a few days before recognizing the system didn’t work anymore.

Java does not use your RAM like other programs do; it uses a dedicated piece of your RAM called the heap. What do you have your heap size set to? What is the output of ‘shell:info’ in the openHAB console after a restart, and also after it has been running for a few days?
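As an aside, the maximum heap the JVM was started with can also be read from inside Java itself; a minimal sketch (the class name is just illustrative) that prints the same figure ‘shell:info’ reports as “Maximum heap size”:

```java
// Minimal sketch: read the JVM's configured maximum heap size
// programmatically, matching the "Maximum heap size" line of shell:info.
public class HeapInfo {
    static long maxHeapMb() {
        // Runtime.maxMemory() returns the -Xmx limit in bytes.
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```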


Most likely your openHAB runs out of memory because the OSGiEventManager is no longer processing events. If events are no longer processed, the event queue in the OSGiEventManager will grow endlessly. So running out of memory is merely a side effect of the thread that would process the events being dead or blocked.
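This failure mode can be modeled in a few lines. The following is a toy sketch, not openHAB’s actual code: an unbounded queue whose single consumer thread has died, so every produced event stays in memory:

```java
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of the failure mode: an unbounded queue whose consumer
// thread has died. Producers keep enqueueing, so the queue (and heap
// usage) grows without bound until the JVM runs out of memory.
public class DeadConsumerDemo {

    static int enqueueWithDeadConsumer(int events) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // The "handler" thread exits immediately, simulating a dead consumer.
        Thread consumer = new Thread(() -> { /* died before polling anything */ },
                "OH-OSGiEventManager");
        consumer.start();
        consumer.join();

        // With no live consumer, every offered event stays in the queue.
        for (int i = 0; i < events; i++) {
            queue.offer("event-" + i);
        }
        return queue.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Events stuck in queue: " + enqueueWithDeadConsumer(10_000));
    }
}
```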

Next time when the issue occurs, you can see what the thread is doing by creating a thread dump on the Console with: threads --locks --monitors

The thread that does the processing has OH-OSGiEventManager as name.

When it is idle it will show up like:

"OH-OSGiEventManager" Id=546 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7de01538
    at java.base@11.0.12/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@11.0.12/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
    at java.base@11.0.12/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
    at java.base@11.0.12/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:458)
    at org.openhab.core.internal.events.ThreadedEventHandler.lambda$0(ThreadedEventHandler.java:60)
    at org.openhab.core.internal.events.ThreadedEventHandler$$Lambda$1025/0x00000007c0bd6440.run(Unknown Source)
    at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)

That way you can see if the thread is still there and what it is doing instead of processing events. :slight_smile:
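The same check can be done programmatically via JMX, which is what the console command uses under the hood. A sketch (class and method names are illustrative) that looks up a thread by name in a full dump:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Sketch: find a named thread in a full thread dump via ThreadMXBean,
// similar to grepping the output of `threads --locks --monitors`.
public class FindEventManagerThread {

    static ThreadInfo find(String name) {
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true)) {
            if (name.equals(info.getThreadName())) {
                return info;
            }
        }
        return null; // thread is gone, as in the reported failure
    }

    public static void main(String[] args) {
        ThreadInfo info = find("OH-OSGiEventManager");
        System.out.println(info == null ? "thread missing" : info.getThreadState());
    }
}
```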


I created a special build of the system info binding that will allow you to graph the heap size.

1. Uninstall the built-in binding.
2. Unzip the jar files to your addons folder.
3. Delete and re-add the Thing so it can see the new channels.
4. Add the channels and create Items.
5. Use metadata on the Item to set a state description, and for the pattern use %.1f MB for the channel called availableHeap so it is not in bytes.

The other channel is probably more useful, as it displays how much of the MAX heap size you are using in %.

Download the jar here:
Index of /openhab/sysinfo/ (pcmus.com)


Thnx both for the replies!

This is the current shell:info; I didn’t restart the system yet.

Karaf
  Karaf version               4.3.2
  Karaf home                  /usr/share/openhab/runtime
  Karaf base                  /var/lib/openhab
  OSGi Framework              org.eclipse.osgi-3.16.200.v20210226-1447

JVM
  Java Virtual Machine        OpenJDK Client VM version 11.0.12+7-LTS
  Version                     11.0.12
  Vendor                      Azul Systems, Inc.
  Pid                         11358
  Uptime                      9 days 5 hours
  Process CPU time            1 day 8 hours
  Process CPU load            0.04
  System CPU load             0.09
  Open file descriptors       362
  Max file descriptors        102,642
  Total compile time          21.035 seconds
Threads
  Live threads                228
  Daemon threads              150
  Peak                        273
  Total started               580311
Memory
  Current heap size           131,421 kbytes
  Maximum heap size           890,880 kbytes
  Committed heap size         230,560 kbytes
  Pending objects             0
  Garbage collector           Name = 'Copy', Collections = 32360, Time = 12 minutes
  Garbage collector           Name = 'MarkSweepCompact', Collections = 11, Time = 4.467 seconds
Classes
  Current classes loaded      23,347
  Total classes loaded        28,773
  Total classes unloaded      5,426
Operating system
  Name                        Linux version 5.10.52-v7l+
  Architecture                arm
  Processors                  4

Ok, thanks! Will keep the system running and as soon as the system stalls again will follow these steps and share the result.

Thanks! Added both channels, current heap level is between 2-8% [25-75MB] in the first 15 minutes. Will let it run a couple of days.

I have looked at dumps before, and sadly they are limited in usefulness because everything that runs in openHAB shows up as the same process. You cannot tell that ‘binding X’ or a particular rule is the cause. You can look at how many byte[] vs Strings and other variable types there are and how much data they hold, but a static look is not much help unless you compare it to a system just booted and run for a few minutes. Knowing that the total amount of byte[] is growing can be useful, but if you don’t know which binding to look at or make a report against, it is of limited value…

It is best to disable rules and bindings one by one until the issue stops, so the cause is known. If you can see the issue in real time, it speeds up the process and reduces the downtime spent looking.

This is what the heap should look like; note that each time garbage collection is done, the heap returns to the same value.

This is what to look out for if there is a memory leak. Note how the leak means the heap is not cleaned and it keeps increasing over time. You may find that the heap only fails to return to its normal average amount when a certain feature is used.

This is a serious leak; most likely you won’t notice it growing until you look at the graphs over a few days, or it may only grow when a feature is used.
Hopefully it shows something.
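For reference, the numbers behind such a heap graph can be sampled from inside the JVM without any binding; a minimal sketch using the standard MemoryMXBean (class and method names are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: sample current heap usage as a percentage of the
// maximum, the same quantity graphed by the sysinfo heap channel.
public class HeapSampler {

    static double usedHeapPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() is the -Xmx limit; getUsed() is the currently live + garbage data.
        return 100.0 * heap.getUsed() / heap.getMax();
    }

    public static void main(String[] args) {
        System.out.printf("Heap in use: %.1f%%%n", usedHeapPercent());
    }
}
```

Logging this value once a minute and plotting it should reproduce the sawtooth (healthy) or the rising baseline (leak) described above.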


It will also help to see what all the other threads are doing. Then, if the thread processing events is waiting for a lock held by another thread, we can see what that other thread is doing and why it is not releasing the lock.


OK, thanks for the explanation. I thought there could be some indication (once I had issues with the zwave binding caused by a dead node, and it indicated zwave).
After 24h it seems normal, but I’ll just keep an eye on it. For sure the graphs will help connecting the issue to a certain event or occurrence.

The system stopped working today, so below I will post all the information gathered. For some reason the dashboard is still available, but there is no physical response to any command.


These two entries show up with the command dmesg:

[1195137.775403] cgroup: fork rejected by pids controller in /system.slice/openhab.service
[2920791.082324] cgroup: fork rejected by pids controller in /system.slice/openhab.service

Output of shell:info

openhab> shell:info
Karaf
  Karaf version               4.3.2
  Karaf home                  /usr/share/openhab/runtime
  Karaf base                  /var/lib/openhab
  OSGi Framework              org.eclipse.osgi-3.16.200.v20210226-1447

JVM
  Java Virtual Machine        OpenJDK Client VM version 11.0.12+7-LTS
  Version                     11.0.12
  Vendor                      Azul Systems, Inc.
  Pid                         11358
  Uptime                      12 days 1 hour
  Process CPU time            1 day 18 hours
  Process CPU load            0.04
  System CPU load             0.11
  Open file descriptors       385
  Max file descriptors        102,642
  Total compile time          23.876 seconds
Threads
  Live threads                207
  Daemon threads              145
  Peak                        4909
  Total started               2399084
Memory
  Current heap size           560,750 kbytes
  Maximum heap size           890,880 kbytes
  Committed heap size         890,880 kbytes
  Pending objects             0
  Garbage collector           Name = 'Copy', Collections = 43217, Time = 16 minutes
  Garbage collector           Name = 'MarkSweepCompact', Collections = 15, Time = 10.534 seconds
Classes
  Current classes loaded      23,605
  Total classes loaded        31,594
  Total classes unloaded      7,989

@wborn
The entire output of threads --locks --monitors can be found in putty.log (176.6 KB).
I looked for the OH-OSGiEventManager thread, but for some reason it is not in the log.

I also checked events.log; nothing special to see there, just the regular updates of measured data as usual.

In openhab.log the following error is present:

2021-10-01 11:08:18.698 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception: 
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.lang.Thread.start0(Native Method) ~[?:?]
	at java.lang.Thread.start(Thread.java:798) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor.reExecutePeriodic(ScheduledThreadPoolExecutor.java:360) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:307) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
2021-10-01 11:08:35.836 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception: 
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.lang.Thread.start0(Native Method) ~[?:?]
	at java.lang.Thread.start(Thread.java:798) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor.reExecutePeriodic(ScheduledThreadPoolExecutor.java:360) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:307) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

Also checked the status of the openhab.service:

openhabian@openhabian:/var/lib/openhab $ sudo systemctl status openhab
[sudo] password for openhabian:
● openhab.service - openHAB instance, reachable at http://openhabian:8080
   Loaded: loaded (/lib/systemd/system/openhab.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/openhab.service.d
           └─override.conf
   Active: active (running) since Sun 2021-09-19 16:48:37 CEST; 1 weeks 5 days ago
     Docs: https://www.openhab.org/docs/
           https://community.openhab.org
 Main PID: 11358 (java)
    Tasks: 206 (limit: 4915)
   CGroup: /system.slice/openhab.service
           └─11358 /usr/bin/java -XX:-UsePerfData -Dopenhab.home=/usr/share/openhab -Dopenhab.conf=/etc/openhab -Dopenhab.runtime=/usr/share/openhab/runtime -Dopenhab.userdata=/var/lib/open

Oct 01 11:17:49 openhabian karaf[11358]: java.lang.NullPointerException
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.lang.Thread.run(Thread.java:829)
Oct 01 11:17:49 openhabian karaf[11358]: java.lang.NullPointerException
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Oct 01 11:17:49 openhabian karaf[11358]:         at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)

edit: I wasn't at home during the time this happened. Besides regular monitoring, OH shouldn't have performed any activities.
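One observation on the output above: the dmesg lines say “fork rejected by pids controller”, the service status shows “Tasks: 206 (limit: 4915)”, and shell:info reports a thread peak of 4909, right at that limit. Together with the “unable to create native thread” errors, this suggests the systemd task (thread) limit was hit, not just the Java heap. A sketch of checking and raising it; the limit value is purely illustrative, and the override path is the one shown in the systemctl output:

```shell
# Inspect the current task limit for the service.
systemctl show openhab -p TasksMax

# Open /etc/systemd/system/openhab.service.d/override.conf for editing
# and add (value is illustrative, not a recommendation):
#   [Service]
#   TasksMax=8192
sudo systemctl edit openhab

# Apply the change.
sudo systemctl daemon-reload
sudo systemctl restart openhab
```

Note that raising the limit only buys time if threads are genuinely leaking; the “Total started 2399084” figure still points at something creating threads at a high rate.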

Try changing your OpenJDK to the recommended Zulu JVM!

https://www.azul.com/downloads/?version=java-11-lts&package=jdk#download-openjdk

The Zulu JVM has even better RAM garbage collection.

Yes, it seems to be missing for some reason, so events will pile up and OH will run out of memory. According to the code it should be able to recover from exceptions, so it is strange the thread is no longer there.

Maybe if you enable trace logging you can see what the thread is doing and what the last event is which may cause this. E.g. enable this trace logging:

log:set trace org.openhab.core.internal.events.ThreadedEventHandler
log:set trace org.openhab.core.internal.events.EventHandler

The heap graph may be showing that your system has a gradual leak over the first half; you would need to watch over a greater time frame, but forget this for now and come back to it later… Then you notice the way you were using the system changed, maybe a new feature was used at that point; this is normal. Then there is the big sharp spike from something major failing, where garbage collection grabbed back a lot of memory. This will be the thread that processes the events getting cleaned up. I am guessing the graph stopped working because it needs the events to be processed in order to work. When that occurred I would expect to see an ERROR in the logs. Did nothing show up? Did the WARN about OutOfMemoryError fill the log and push all ERROR messages away? There can be log rotation, where logs that reach a certain size are saved to a separate file with an incrementing number added.

Thnx, additional entries are created in the console. I will post the result again when available.

edit: Now I am getting about 100 entries per second in the log file; is this not creating other issues?

I used the new binding version; let’s see how the full graph develops now.

No other ERRORs in the log around the time the graph stopped. I have some regular errors from the Modbus binding not being able to read the SDM120 electricity meter, but this has been ‘normal’ for some years now; I’ve seen more people here having issues with that.
Indeed, I already needed to check the log rotation files, since a couple of hours passed between the issue occurring and me being back at home.

I am using openHABian as my setup and would like to stick with the recommended setup as they developed it. I would assume they use the recommended Java engine?

Azul = Zulu.
You’re good, but yes, it is confusing.


shame on me…

sorry, I didn’t realize that you’re already using Zulu :see_no_evil:

The system crashed (it seems to happen more quickly now than before).
I attached the latest two log files and didn’t filter anything, so you can see every entry in the logs. The rotation file was way too big to upload, so it is zipped; after downloading, change the extension back to .zip.
openhab.log (88.5 KB)
openhab-rotation-zip.txt (736.7 KB)

Around 10:15, when the issue occurred, I arrived at home, so actually everything happened at once. The radio started and the light scenes were activated, triggering both zwave and RfLink to work.

Commentary:

For non-Modbus users: the apparently frequent Modbus errors in the log are not especially alarming. Modbus devices are allowed to be too busy to immediately service requests, etc., resulting in a timeout at the openHAB end.
That’s why we have retries built into the binding. A consistent “try 1/3 failed” means try 2 worked okay, so it’s just nuisance reporting and overhead.
It is sometimes possible to tune the binding parameters to reduce or eliminate the nuisance, but that’s another topic really.

Setting that aside, how extensive is your Modbus configuration? With default settings in place, Modbus binding is capable of pouring hundreds of updates per second into openHAB Items. This in turn can wallop persistence systems with default settings, and have serious performance costs.


Yes, it seems quite busy with events. Based on your logging it is handling ~43 events per second. It could be that it simply cannot keep up with this, and eventually the queues fill up with work and it runs out of memory. Adding more system info events and increasing the log level also increases the load, which may explain why the issue got worse.
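The general mitigation for a consumer that cannot keep up is a bounded queue that applies backpressure instead of growing until OOM. A generic sketch of the idea (this is not openHAB’s actual implementation; names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Generic backpressure sketch: a bounded queue rejects further events
// when full, instead of growing without limit until OutOfMemoryError.
public class BoundedEvents {

    static int fill(int capacity, int offered) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < offered; i++) {
            // offer() returns false when the queue is full, so the
            // producer learns about overload instead of exhausting memory.
            if (queue.offer("event-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("Accepted " + fill(1000, 1500) + " of 1500 events");
    }
}
```

Dropping events is of course its own problem; the point is only that a bounded queue turns a silent memory leak into a visible overload signal.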