openHAB out of memory, looks to have a lot of threads waiting on web sockets

I’m on a Raspberry Pi 3 with 1 GB RAM, running openHAB 4.1.0.M1. The main add-ons I have are Homematic, MQTT, BigAssFans, TPLink, GPIO, and Shelly.

My install has been pretty stable for years. In the last few months I’ve added the Shelly binding, but not a lot else has changed.

In the last couple of weeks I’ve started getting out of memory errors. These broadly look like:

2023-10-06 02:24:22.775 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@ddaba4' takes more than 5000ms.
2023-10-06 02:24:32.708 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@ddaba4' takes more than 5000ms.
2023-10-06 02:24:52.968 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@ddaba4' takes more than 5000ms.
2023-10-06 02:25:21.839 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception: 
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.lang.Thread.start0(Native Method) ~[?:?]
        at java.lang.Thread.start(Thread.java:802) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:945) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1593) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor.reExecutePeriodic(ScheduledThreadPoolExecutor.java:360) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:307) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

I’ve used jstack and jmap to see what’s going on. In broad terms it looks to me like my heap use is pretty consistent through the day - I’ve looked a few times and it seems to be using around 40-60MB, more at startup, a bit less once the garbage collector has been running.

However, when these errors started today I had an awful lot of threads: 2,228 in total. A lot of them looked to be WebSocket client threads waiting on a java.util.concurrent lock, 1,721 in total. Further below is a grep of the threads from the dump, and some examples of what those threads appear to be doing.
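(The counts above are just from grepping the thread header lines of the jstack dump. For anyone who would rather do the grouping in Java, a rough equivalent would be something like the throwaway sketch below; it is not anything from openHAB itself, it just counts the header lines of the dump file by thread name prefix, with threaddump.txt being the dump shown further down.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Throwaway sketch: count the threads in a jstack dump, grouped by name prefix,
// so the dominant pools (e.g. WebSocketClient) stand out.
public class DumpCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> byPrefix = new TreeMap<>();
        int total = 0;
        for (String line : Files.readAllLines(Path.of("threaddump.txt"))) {
            if (!line.startsWith("\"") || !line.contains(" prio=")) {
                continue; // only the thread header lines, same as `fgrep prio=`
            }
            String name = line.substring(1, line.indexOf('"', 1)); // text between the quotes
            String prefix = name.replaceAll("[@\\-].*", "");       // "WebSocketClient@123-456" -> "WebSocketClient"
            byPrefix.merge(prefix, 1, Integer::sum);
            total++;
        }
        byPrefix.forEach((prefix, count) -> System.out.println(count + "\t" + prefix));
        System.out.println(total + "\ttotal");
    }
}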

I’m interested in any hints as to what might be going on (has anyone seen something like this before?), or what my next diagnostic step should be. None of my rules use explicit locks, so this must be something in the code itself. Given Shelly is the latest thing I added, that might suggest a problem with the Shelly binding, but it could be coincidental.

The full list of threads is too long to include (I can attach it if it’s useful, but I suspect it’s not); most of the volume is threads like the following:

pi@raspberrypi2:~ $ fgrep prio= threaddump.txt 
...
"HttpClient@18ab912-scheduler-1" #23913 prio=5 os_prio=0 cpu=0.41ms elapsed=15018.15s tid=0x58244098 nid=0x28bc waiting on condition  [0x2ed1e000]
"WebSocketClient@29182530-23942" #23942 daemon prio=5 os_prio=0 cpu=0.59ms elapsed=14958.41s tid=0x2ef10650 nid=0x28de runnable  [0x2eaee000]
"WebSocketClient@29182530-23943" #23943 daemon prio=5 os_prio=0 cpu=0.43ms elapsed=14958.40s tid=0x31ef6d10 nid=0x28df runnable  [0x2eb3e000]
"WebSocketClient@29182530-23944" #23944 daemon prio=5 os_prio=0 cpu=17.28ms elapsed=14958.40s tid=0x31ef5d58 nid=0x28e0 waiting on condition  [0x2ea4e000]
"WebSocketClient@29182530-23945" #23945 daemon prio=5 os_prio=0 cpu=15.95ms elapsed=14958.40s tid=0x2ef044a8 nid=0x28e1 waiting on condition  [0x2e9fe000]
"WebSocketClient@29182530-23946" #23946 daemon prio=5 os_prio=0 cpu=12.41ms elapsed=14958.40s tid=0x2ef0e158 nid=0x28e2 waiting on condition  [0x2e9ae000]
"WebSocketClient@29182530-23947" #23947 daemon prio=5 os_prio=0 cpu=13.30ms elapsed=14958.40s tid=0x31ef47e8 nid=0x28e3 waiting on condition  [0x2e95e000]
"WebSocketClient@29182530-23948" #23948 daemon prio=5 os_prio=0 cpu=14.86ms elapsed=14958.40s tid=0x2ef126c8 nid=0x28e4 waiting on condition  [0x2e90e000]
"WebSocketClient@29182530-23949" #23949 daemon prio=5 os_prio=0 cpu=13.74ms elapsed=14958.40s tid=0x2ef15648 nid=0x28e5 waiting on condition  [0x2e8be000]
"HttpClient@1bd4a42-scheduler-1" #23950 prio=5 os_prio=0 cpu=0.40ms elapsed=14958.39s tid=0x00862bf0 nid=0x28e6 waiting on condition  [0x2e86e000]
"WebSocketClient@7372344-23960" #23960 daemon prio=5 os_prio=0 cpu=0.34ms elapsed=14958.09s tid=0x2ef22a58 nid=0x28f0 runnable  [0x2e77e000]
"WebSocketClient@7372344-23961" #23961 daemon prio=5 os_prio=0 cpu=0.24ms elapsed=14958.09s tid=0x2ef207a8 nid=0x28f1 runnable  [0x2e68e000]
"WebSocketClient@7372344-23962" #23962 daemon prio=5 os_prio=0 cpu=15.79ms elapsed=14958.09s tid=0x2ef21760 nid=0x28f2 waiting on condition  [0x2e6de000]
"WebSocketClient@7372344-23963" #23963 daemon prio=5 os_prio=0 cpu=13.96ms elapsed=14958.09s tid=0x2ef13880 nid=0x28f3 waiting on condition  [0x2e81e000]
"WebSocketClient@7372344-23964" #23964 daemon prio=5 os_prio=0 cpu=13.03ms elapsed=14958.09s tid=0x2ef17a60 nid=0x28f4 waiting on condition  [0x2e7ce000]
...

Full stack traces for a few of them look like this:

"WebSocketClient@24187389-829" #829 daemon prio=5 os_prio=0 cpu=303.16ms elapsed=76894.48s tid=0x59256148 nid=0x254e waiting on condition  [0x54c64000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
        - parking to wait for  <0x6f2f0488> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/LockSupport.java:252)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@17.0.7/SynchronousQueue.java:401)
        at java.util.concurrent.SynchronousQueue.poll(java.base@17.0.7/SynchronousQueue.java:903)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait(ReservedThreadExecutor.java:324)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:399)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
        at java.lang.Thread.run(java.base@17.0.7/Thread.java:833)

"WebSocketClient@24187389-830" #830 daemon prio=5 os_prio=0 cpu=378.20ms elapsed=76894.48s tid=0x59256d80 nid=0x254f waiting on condition  [0x54c14000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
        - parking to wait for  <0x6f2f0698> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.7/AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.lang.Thread.run(java.base@17.0.7/Thread.java:833)

"WebSocketClient@24187389-831" #831 daemon prio=5 os_prio=0 cpu=132.94ms elapsed=76894.48s tid=0x59257db8 nid=0x2550 waiting on condition  [0x54bc4000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
        - parking to wait for  <0x6f2f0698> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.7/AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.lang.Thread.run(java.base@17.0.7/Thread.java:833)

"WebSocketClient@24187389-832" #832 daemon prio=5 os_prio=0 cpu=253.63ms elapsed=76894.48s tid=0x59259050 nid=0x2551 waiting on condition  [0x54b74000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
        - parking to wait for  <0x6f2f0698> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.7/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.7/AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.lang.Thread.run(java.base@17.0.7/Thread.java:833)

The Shelly binding was identified last week as having problematic web socket handling.
A fix was merged this week, but the binding author is investigating further as he is not convinced the root cause has been fixed.


Ah, excellent. I can run a regular reboot to keep it under control in the meantime. In the past I was getting a lot of latency from the binding; the two things could potentially be related.

It’d be nice if the two things got fixed together (assuming they’re related).

The Shellys are capable of driving MQTT natively anyway, so another option is to not use the binding at all. Most of what I do with it could probably be done with native MQTT; I’m just not sure whether it will send the double-click events.

If there’s anything I can do to help diagnose, let me know. It seems to be hanging every couple of days, so I can easily grab some dumps or whatever, or alternatively run the beta code for a while to see if the problem goes away (never a guaranteed test with this sort of intermittent issue, but if it runs for a week that’d be different from what I’m otherwise seeing).

Here is the topic where it is discussed.

Ideally you should have replied in that other topic rather than creating a new one.

I agree that would have been ideal. I did a search without finding it though, so here we are.

Hi,
So, in the other post the user who reported the high thread count was able to identify the specific thing it was hanging on: it sat waiting for a reply while polling that device’s web socket, and as a result the next polling cycle started an entirely new batch of web sockets while the previous ones sat waiting indefinitely, as could be seen in the support bundle and heap dump he provided. There is clearly some condition in that binding that needs to be addressed. Were you able to figure out which of your things was failing to respond in a timely manner?

I don’t have an easy way to work that out. I’d hypothesise a few things:

  1. I have 5 Shelly Motions. These are battery-powered devices that sleep for periods of time. The other poster had a wall display, which I think is also battery powered. That could be a pattern.
  2. To test that I’d need to remove the 5 Shelly Motions. I can do that if it’s useful for diagnosis, though it means I lose the motion sensors on my lights.
  3. If it is the Shelly Motions, it’s obviously not every time, only sometimes. I’d hypothesise (with no evidence) that the binding attempts to contact a device, the device is asleep so the attempt fails, and the next attempt comes along before it wakes, and/or the thread stalls somehow and can’t pick up the update when the device next wakes.
  4. There’s some sort of concurrency locking going on; presumably it won’t let another thread talk to that device until the first thread releases its lock. So once one thread is blocked we’re screwed: every subsequent attempt to communicate with the device creates a new thread that waits on the first one, and they stack up.

If that hypothesis is true (based on no code review and no real knowledge of the binding), then the place to look would be either some sort of timeout on the web socket that gives up waiting without releasing the lock or destroying the thread, or some logic path that causes it to create a new web socket for a device that already has one. In my mental model you’d have a single web socket per device and all communication would go via that socket. Creating new sockets for a device seems unusual, and exactly the kind of thing that would lead to trouble of this nature. Perhaps there’s an obscure logic path that creates a new web socket because the existing one doesn’t respond or times out? A rough sketch of that mental model is below.
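Purely to illustrate the idea (this is a hypothetical sketch, not the Shelly binding’s actual code; it uses the JDK WebSocket client for brevity rather than the Jetty one the binding is built on, and the 30-second poll interval and request payload are made up): one long-lived socket per device, reused for every poll, plus an in-flight guard so a device that never answers can’t make requests, and the threads behind them, pile up.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class DevicePoller implements WebSocket.Listener {
    private final ScheduledExecutorService scheduler;
    private final AtomicBoolean awaitingReply = new AtomicBoolean(false);
    private volatile WebSocket socket; // the single socket for this device, reused across polls

    DevicePoller(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    void start(URI deviceUri) {
        // JDK client used purely for illustration; the binding itself is built on Jetty.
        HttpClient.newHttpClient().newWebSocketBuilder()
                .buildAsync(deviceUri, this)
                .thenAccept(ws -> socket = ws);
        scheduler.scheduleWithFixedDelay(this::poll, 30, 30, TimeUnit.SECONDS);
    }

    private void poll() {
        WebSocket ws = socket;
        if (ws == null || !awaitingReply.compareAndSet(false, true)) {
            return; // not connected yet, or previous poll still unanswered: skip, don't pile up
        }
        ws.sendText("{\"method\":\"status\"}", true); // hypothetical request payload
        // Safety valve: if the device sleeps through the request, give up before the
        // next cycle instead of holding the guard (and a waiting thread) forever.
        scheduler.schedule(() -> awaitingReply.set(false), 25, TimeUnit.SECONDS);
    }

    @Override
    public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
        awaitingReply.set(false); // reply arrived, the next poll may go out
        ws.request(1);            // ask for the next incoming frame
        return null;
    }
}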

Hi, @PaulL1
I was the guy with the Wall Display. It is not battery powered, and it is definitely the problem: when powered on it creates around 5,000 threads in roughly 20 minutes. I also have some battery-powered devices (1 Motion and 1 H&T); none of them cause any issue. They work fine and the system is stable without the thread count growing.
I have sent @markus7017 a private message with a TRACE log of the Wall Display. I’m not advanced enough to figure out the problem myself, but hopefully he can find some hints in it.

As the Wall Display is quite a new product, there are a lot of firmware updates, and they may well solve this issue in a few weeks. But firmware in other products could also lead to the same issue in future, which is why thread-count control secured by the binding itself would be great.
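As a rough sketch of what such a cap could look like inside a binding (hypothetical code, not anything that exists in openHAB or the Shelly binding today, and the limits are arbitrary): one shared executor with a hard upper bound, so a misbehaving device exhausts a queue instead of the JVM’s native threads.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedBindingExecutor {
    // At most 10 threads and 100 queued tasks for the whole binding; anything beyond
    // that is rejected with a RejectedExecutionException (which the binding could log)
    // rather than creating yet another thread.
    public static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            2, 10, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy());

    static {
        POOL.allowCoreThreadTimeOut(true); // let idle threads die off on a quiet system
    }

    private BoundedBindingExecutor() {
    }
}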

I’ll look into that at the weekend; the TRACE log will help to find the root cause.

Please see this post:

and continue there.