openHAB 3 runs out of memory / java heap space errors, CPU 100%+ after a few hours

Can someone run a stack trace of the system in a broken state and upload it, please? I'm curious which specific threads are causing the issue. There were some major changes to the rules engine between 2.5 and 3.0 (and in the more recent snapshots).

You may want to add instructions or a link explaining how to do that, because some of the folks experiencing this issue may not know how.
Thanks

Valid point!

jstack -l PID > my-dump.txt

PID is the Java process ID
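
If you're not sure which PID to use, one way to find it on a typical Linux package install (an assumption; on other setups the user or process name may differ) is:

# Find the PID of the Java process owned by the "openhab" user
PID=$(pgrep -u openhab java)
# Run jstack as the same user so it can attach to the JVM
sudo -u openhab jstack -l "$PID" > my-dump.txt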

This is for Linux, I'm guessing?

Yes. Several people above mentioned Raspberry Pi, so this should work without issue there.

I'll do this as soon as it happens again.

In my case, though, the system can run stably for days, even weeks, before it suddenly gets into this state. It would be very interesting to know what triggers it…

Here's the stack trace of one process that was running during the freeze:
my-dump301.txt (146.0 KB)

I had the same problem in the past, even with snapshot 3.1 (builds before #2165).
I upgraded to #2165 a week ago. Before the upgrade I changed this:

File: /etc/default/openhab
EXTRA_JAVA_OPTS="-Xms1024m -Xmx2048m"
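
To check that the new limits are actually picked up, a quick way (a sketch, assuming a systemd-based install with the standard service name) is:

# Restart openHAB and print the -Xms/-Xmx flags of the running Java process
sudo systemctl restart openhab.service
ps -o args= -C java | tr ' ' '\n' | grep -E '^-Xm[sx]'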

I am running on a Pi 4 with 8 GB and still had the problem… I upgraded a week ago, did not do any further upgrades, just let the system run and did normal setup work on rules/widgets/items, so the system was not in "pause mode".
It has now run for a week without any problems. Current status: (screenshot)

Maybe it works for others as well (the setting is from an older entry here in the forum).

Hello,
I have similar problems, but I don't see anything strange in the openHAB logs.
The behaviour is that after some hours or days (randomly) openHAB stops working, the free RAM drops below 30% (at startup more than 55% is free) and SSH is not available anymore.
The Raspberry Pi needs to be hard-restarted.

Here I copy my system configuration. Can I perform some tests to better understand whether the problem is linked to Java or to the textual .rules files?

runtimeInfo:
  version: 3.0.1
  buildString: Release Build
locale: it_IT
systemInfo:
  configFolder: /etc/openhab
  userdataFolder: /var/lib/openhab
  logFolder: /var/log/openhab
  javaVersion: 11.0.9
  javaVendor: Azul Systems, Inc.
  javaVendorVersion: Zulu11.43+88-CA
  osName: Linux
  osVersion: 4.9.35-v7+
  osArchitecture: arm
  availableProcessors: 4
  freeMemory: 40478264
  totalMemory: 153206784
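
One low-effort way to narrow this down (a sketch, not an official procedure; adjust the log path to your setup) is to log a memory snapshot periodically so you can see whether the free RAM shrinks gradually or collapses suddenly:

# Hypothetical cron entry (crontab -e): record memory usage every 5 minutes
*/5 * * * * (date; free -m) >> /var/log/mem-watch.log 2>&1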

I'm not sure if this is related or not, but the RRD4j service seems to be using an enormous amount of CPU. I compared this to my own OH3 system (which is running RRD4j) and I don't see any of these threads under load there. I've also tuned my RRD4j down to provide persistence for only a small number of Items instead of everything, as the default currently does. As a test, can you disable or scale back the RRD4j service (see the sketch after the thread dump below) and see if this problem calms down?

"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-1]" #250 daemon prio=5 os_prio=0 cpu=26610.80ms elapsed=141491.92s tid=0xe109b400 nid=0x265 waiting on condition  [0xd7ebd000]
"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-2]" #251 daemon prio=5 os_prio=0 cpu=28106.98ms elapsed=141491.89s tid=0xe35af400 nid=0x266 waiting on condition  [0xd7e6d000]
"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-3]" #252 daemon prio=5 os_prio=0 cpu=27348.65ms elapsed=141491.88s tid=0xe229c800 nid=0x267 waiting on condition  [0xd7e1d000]
"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-4]" #253 daemon prio=5 os_prio=0 cpu=26975.74ms elapsed=141491.88s tid=0xe229e400 nid=0x268 waiting on condition  [0xd7dcd000]
"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-5]" #254 daemon prio=5 os_prio=0 cpu=27351.98ms elapsed=141491.87s tid=0xe35bf000 nid=0x26a waiting on condition  [0xd7d7d000]
"RRD4J Sync-ThreadPool for org.rrd4j.core.RrdSyncThreadPool@1357647 [Thread-6]" #255 daemon prio=5 os_prio=0 cpu=27822.25ms elapsed=141491.86s tid=0xe0e2e800 nid=0x26b waiting on condition  [0xd7d2d000]
"OH-RRD4j-1" #259 prio=5 os_prio=0 cpu=90183.52ms elapsed=141490.48s tid=0x002c8800 nid=0x26f waiting on condition  [0xd7c24000]
"OH-RRD4j-2" #261 prio=5 os_prio=0 cpu=90188.39ms elapsed=141490.42s tid=0xda74a400 nid=0x272 waiting on condition  [0xd99be000]
"OH-RRD4j-3" #263 prio=5 os_prio=0 cpu=90222.19ms elapsed=141488.81s tid=0xe45be000 nid=0x275 waiting on condition  [0xd7bd4000]

Just for what it's worth: I am not (and have never been) using RRD4j, I'm using JDBC persistence.

Also, for me Java is currently still using less than 1% CPU two days after the last restart of OH. So for me this is clearly something that happens very suddenly when it happens, not something that builds up over time.

Now I'm experiencing the heap space exception too… it seems related to the scheduler:

2021-02-04 05:04:37.680 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.io.monitor.internal.EventLogger@166d93e' takes more than 5000ms.
2021-02-04 05:05:04.064 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space

My openHAB is the stable version 3.0.1-2 on a Raspberry Pi 4 -_-

RRD4j is on by default in OH3. It must be disabled if you don't want it.

I'm not entirely sure what happens during an upgrade, but at least in my case the JDBC add-on that was installed in 2.5 was still installed after upgrading to 3.0, and no other persistence add-ons have ever been installed.

This is not necessarily the case; this is just where the memory happened to run out.
Java will spend more and more time garbage collecting as it runs out of memory, which explains the high load on the system.
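
One way to confirm that interpretation (an assumption on my side, not an openHAB default) is to enable GC logging, which Java 11 supports via unified logging. Long, back-to-back GC pauses right before the lock-up point to heap exhaustion:

# /etc/default/openhab -- keep whatever -Xms/-Xmx you already use, just append the -Xlog part
EXTRA_JAVA_OPTS="-Xms1024m -Xmx2048m -Xlog:gc*:file=/var/log/openhab/gc.log:time,uptime"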

I had the same problems, even with OH 2.5, and this was the reason I migrated to Groovy rules and started to develop a metrics add-on.
Using metrics I could see the memory build up over a couple of days.
On an RPi 4 (at least the 4 GB+ versions) the problem isn't as present any more, but with less memory I saw this a lot.

Different causes can lead to the same symptoms here, so we should not mix everything together. I think removing all DSL rules and seeing whether this still occurs is the first important step.
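
If the rules live in text files, a quick way to test this (a sketch, assuming the default /etc/openhab layout) is to park them outside the folder openHAB watches and observe whether memory still climbs:

# Temporarily move all DSL rules out of the watched folder
mkdir -p /etc/openhab/rules-disabled
mv /etc/openhab/rules/*.rules /etc/openhab/rules-disabled/
# Move them back later:
# mv /etc/openhab/rules-disabled/*.rules /etc/openhab/rules/

Rules created in the Main UI are not affected by this and would need to be disabled there.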

In openHAB 3 the default strategy is to store the state of all Items in an rrd4j database. As this is done by a scheduled process, it can explain the high number of scheduler threads. I don't know whether this causes the out-of-memory problem.
For more on the default persistence strategy see: rrd4j - Persistence Services | openHAB

I think you are on to something. I have a test cluster with one RPi 3 that is not running zram. It runs out of memory and I can see a lot of rrd4j activity. I bet disabling rrd4j will make things a lot better.

Same here. I was running 3.0 with file-based rules without problems. Now with 3.0.1 stable I am running into Java heap space exceptions every one or two days. My first suspicion was the Modbus binding, since the problem occurred after I added it and set up some pollers for my Pluggit ventilation. But now I have disabled all Modbus Things, and it still happens.

From openhab.log

2021-02-04 17:15:48.748 [INFO ] [era.internal.handler.IpCameraHandler] - The alarm stream was not running for camera 192.168.75.32, re-starting it now
2021-02-04 17:16:05.475 [INFO ] [era.internal.handler.IpCameraHandler] - The alarm stream was not running for camera 192.168.75.32, re-starting it now
2021-02-04 17:16:39.171 [INFO ] [era.internal.handler.IpCameraHandler] - The alarm stream was not running for camera 192.168.75.32, re-starting it now
2021-02-04 17:17:57.072 [INFO ] [era.internal.handler.IpCameraHandler] - The alarm stream was not running for camera 192.168.75.32, re-starting it now
2021-02-04 17:18:32.636 [INFO ] [era.internal.handler.IpCameraHandler] - The alarm stream was not running for camera 192.168.75.32, re-starting it now
2021-02-04 17:20:49.560 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.io.monitor.internal.EventLogger@85831e' takes more than 5000ms.
2021-02-04 17:21:14.167 [WARN ] [com.zaxxer.hikari.pool.HikariPool   ] - 1m51s816ms328Ī¼s101ns - Thread starvation or clock leap detected (housekeeper delta=yank-default).
2021-02-04 17:23:28.837 [WARN ] [hannel.AbstractChannelHandlerContext] - An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
2021-02-04 17:23:29.785 [WARN ] [netty.util.concurrent.DefaultPromise] - An exception was thrown by org.openhab.binding.ipcamera.internal.handler.IpCameraHandler$2.operationComplete()
java.lang.OutOfMemoryError: Java heap space
2021-02-04 17:23:29.809 [WARN ] [org.eclipse.jetty.io.ManagedSelector] - java.lang.OutOfMemoryError: Java heap space
2021-02-04 17:23:14.074 [ERROR] [se.californium.elements.UDPConnector] - Exception in network stage thread [UDP-Receiver-0.0.0.0/0.0.0.0:5683[0]]:
java.lang.OutOfMemoryError: Java heap space
2021-02-04 17:22:45.863 [ERROR] [org.apache.felix.fileinstall        ] - In main loop, we have serious trouble
java.lang.OutOfMemoryError: Java heap space
2021-02-04 17:23:30.836 [ERROR] [io.openhabcloud.internal.CloudClient] - Error connecting to the openHAB Cloud instance
2021-02-04 17:23:32.098 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = 710d4355-00d2-412a-a507-adca5dec9cce, base URL = http://localhost:8080)
2021-02-04 17:23:32.112 [INFO ] [y.internal.handler.ShellyBaseHandler] - shellyswitch25-10a5fa: Thing goes OFFLINE: message.offline.status-error-watchdog
2021-02-04 17:23:33.210 [INFO ] [y.internal.handler.ShellyBaseHandler] - shellyswitch25-f3fb9b: Thing goes OFFLINE: message.offline.status-error-watchdog
2021-02-04 17:23:44.767 [INFO ] [.reconnect.PeriodicReconnectStrategy] - Try to restore connection to '127.0.0.1'. Next attempt in 60000ms
2021-02-04 17:23:47.925 [INFO ] [.transport.mqtt.MqttBrokerConnection] - Starting MQTT broker connection to '127.0.0.1' with clientid 2e54a7bd-7dc3-44f0-9075-1fc7940ef0d5
2021-02-04 17:23:59.092 [WARN ] [era.internal.handler.IpCameraHandler] - !!!! Camera possibly closed the channel on the binding, cause reported is: Java heap space
2021-02-04 17:24:02.184 [WARN ] [com.zaxxer.hikari.pool.HikariPool   ] - 3m41s182ms62Ī¼s223ns - Thread starvation or clock leap detected (housekeeper delta=yank-default).
2021-02-04 17:26:26.303 [WARN ] [com.zaxxer.hikari.pool.HikariPool   ] - 1m37s459ms62Ī¼s834ns - Thread starvation or clock leap detected (housekeeper delta=yank-default).
2021-02-04 17:25:19.174 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = 710d4355-00d2-412a-a507-adca5dec9cce, base URL = http://localhost:8080)
2021-02-04 17:29:26.470 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception: 
java.lang.OutOfMemoryError: Java heap space

And so on…
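
For crashes like these, one option (again an assumption, not an official recommendation) is to let the JVM write a heap dump at the moment the OutOfMemoryError happens; the .hprof file can then be analysed with a tool such as Eclipse MAT to see what is filling the heap:

# Append to EXTRA_JAVA_OPTS in /etc/default/openhab; make sure the dump path has enough free space
EXTRA_JAVA_OPTS="-Xms1024m -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/openhab/heapdump.hprof"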

Quick poll. Of the people having this issue, are you on Pi2/Pi3 type systems or systems with more memory/cpu?

My system is a Proxmox VM running Buster. I gave it 2 CPUs (Celeron) and 2 GB of memory.