Out of threads after update to 4.0.4

Hello, I’m trying to debug a problem that I see after migrating from an OH 3.x setup to 4.0.4.
Some information:

  • The original install, and the recent update, were done via apt-get on Debian on an x86 NUC
  • System information:
openhab> shell:info
Karaf
  Karaf version               4.4.3
  Karaf home                  /usr/share/openhab/runtime
  Karaf base                  /var/lib/openhab
  OSGi Framework              org.eclipse.osgi-3.18.0.v20220516-2155

JVM
  Java Virtual Machine        OpenJDK 64-Bit Server VM version 17.0.9+8-LTS
  Version                     17.0.9
  Vendor                      Azul Systems, Inc.
  Pid                         8080
  Uptime                      22 hours 31 minutes
  Process CPU time            2 hours 7 minutes
  Process CPU load            0.03
  System CPU load             0.43
  Open file descriptors       2,291
  Max file descriptors        102,642
  Total compile time          8 minutes
Threads
  Live threads                4887
  Daemon threads              123
  Peak                        4897
  Total started               45495
Memory
  Current heap size           461,405 kbytes
  Maximum heap size           2,097,152 kbytes
  Committed heap size         1,048,576 kbytes
  Pending objects             0
  Garbage collector           Name = 'G1 Young Generation', Collections = 1547, Time = 37.442 seconds
  Garbage collector           Name = 'G1 Old Generation', Collections = 0, Time = 0.000 seconds
Classes
  Current classes loaded      32,847
  Total classes loaded        44,225
  Total classes unloaded      11,378
Operating system
  Name                        Linux version 4.19.0-25-amd64
  Architecture                amd64
  Processors                  4
  Total physical memory       8,052,296 kbytes
  Free physical memory        278,068 kbytes

The system starts normally, but at some point (after hours) one of my two ZWave sticks stops working. Disabling/enabling the binding, via the UI or the console, does not restore functionality. Stopping/starting the service seems to work.
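
(By "stopping/starting the service" I just mean the usual systemd restart for the apt package:

$ sudo systemctl restart openhab.service

and the stick works again for a while after that.)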

2023-11-13 09:18:40.204 [DEBUG] [wave.discovery.ZWaveDiscoveryService] - Creating ZWave discovery service for zwave:serial_zstick:zwavesecure with scan time of 60
2023-11-13 09:18:40.204 [DEBUG] [wave.discovery.ZWaveDiscoveryService] - ZWave discovery: Active zwave:serial_zstick:zwavesecure
2023-11-13 09:18:40.205 [INFO ] [zwave.handler.ZWaveControllerHandler] - Attempting to add listener when controller is null
2023-11-13 09:18:40.355 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - Initializing ZWave serial controller.
2023-11-13 09:18:40.355 [DEBUG] [zwave.handler.ZWaveControllerHandler] - Initializing ZWave Controller zwave:serial_zstick:zwavesecure.
2023-11-13 09:18:40.368 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - Initializing ZWave thing handler zwave:device:zwavesecure:node2.
2023-11-13 09:18:40.369 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - Initializing ZWave thing handler zwave:device:zwavesecure:node6.
2023-11-13 09:18:45.390 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - Connecting to serial port '/dev/ttyACM0'
2023-11-13 09:18:45.399 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - Starting receive thread
2023-11-13 09:18:45.400 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - Serial port is initialized
2023-11-13 09:18:45.400 [DEBUG] [WaveSerialHandler$ZWaveReceiveThread] - Starting ZWave thread: Receive
2023-11-13 09:18:45.400 [DEBUG] [zwave.handler.ZWaveControllerHandler] - Initialising ZWave controller
2023-11-13 09:18:45.402 [INFO ] [ve.internal.protocol.ZWaveController] - Starting ZWave controller
2023-11-13 09:18:45.405 [INFO ] [ve.internal.protocol.ZWaveController] - ZWave timeout is set to 5000ms. Soft reset is false.
2023-11-13 09:18:45.406 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.lang.Thread.start0(Native Method) ~[?:?]
        at java.lang.Thread.start(Thread.java:809) ~[?:?]
        at java.util.Timer.<init>(Timer.java:188) ~[?:?]
        at java.util.Timer.<init>(Timer.java:170) ~[?:?]
        at java.util.Timer.<init>(Timer.java:143) ~[?:?]
        at org.openhab.binding.zwave.internal.protocol.ZWaveController.<init>(ZWaveController.java:168) ~[?:?]
        at org.openhab.binding.zwave.handler.ZWaveControllerHandler.initializeNetwork(ZWaveControllerHandler.java:190) ~[?:?]
        at org.openhab.binding.zwave.handler.ZWaveSerialHandler.watchSerialPort(ZWaveSerialHandler.java:142) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-11-13 09:18:45.408 [DEBUG] [nal.protocol.ZWaveTransactionManager] - ZWaveReceiveThread queue empty
2023-11-13 09:18:45.409 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Transaction SendNextMessage 0 out at start. Holdoff false.
2023-11-13 09:18:50.401 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@3b415270' takes more than 5000ms.
2023-11-13 09:18:50.403 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@3b415270' takes more than 5000ms.
2023-11-13 09:18:50.404 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@3b415270' takes more than 5000ms.
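
For reference, since "unable to create native thread" usually means an OS or process limit was hit rather than the Java heap being full, these are the standard limits to check on the host (8080 is the openHAB PID from shell:info above; plain Linux/systemd commands, nothing openHAB-specific):

$ ulimit -u                                    # per-user process/thread limit (run as the openhab user)
$ cat /proc/sys/kernel/threads-max             # system-wide thread limit
$ systemctl show -p TasksMax openhab.service   # systemd task limit for the service
$ grep Threads /proc/8080/status               # threads the openHAB process is using right now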

I graphed my heap usage over time, and it stays within expected (and generous) limits.

What I did see increasing was the thread count. In the shell:info output I pasted above, you can see the live/peak thread counts are in the 4887-4897 range; once it gets that high, OH starts misbehaving.
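
An easy way to watch the count climb between restarts is to sample the process directly (8080 is the PID; plain Linux, nothing openHAB-specific):

$ ps -o nlwp= -p 8080                            # current number of threads
$ watch -n 60 'grep Threads /proc/8080/status'   # same number, sampled every minute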

Reading related forum threads about running out of memory/threads, I found one where dev:dump-create is used. Looking into the resulting threads.txt, I see literally thousands of entries like this:



"HttpClient@66c1ab89-6296" Id=6296 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)



"HttpClient@66c1ab89-6297" Id=6297 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)



"HttpClient@66c1ab89-6298" Id=6298 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@29589826
        at java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
        at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
        at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)



"HttpClient@66c1ab89-scheduler-1" Id=6299 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5275657a
        at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5275657a
        at java.base@17.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
        at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:506)
        at java.base@17.0.9/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
        at java.base@17.0.9/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
        at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1623)
        at java.base@17.0.9/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170)
        at java.base@17.0.9/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
        at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1062)
        at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
        at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)
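
For anyone following along, a quick way to tally these entries in the threads.txt from the dump is plain shell (nothing openHAB-specific):

$ grep -c '"HttpClient@' threads.txt                                        # total HttpClient pool threads
$ grep -o '"HttpClient@[0-9a-f]*' threads.txt | sort | uniq -c | sort -rn   # grouped by client instance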

Now I’m stuck… I can’t figure out what is creating all these threads that are waiting for some network “something” to happen.

Any suggestions on how to proceed with debugging from here?

[Edited to give the openHAB version, 4.0.4, in the text rather than the Karaf version]

What openHAB version?
4.4.3 is the Karaf version.
openHAB 4.1 will be the next release…

Sorry, forgot to include that!

4.0.4 (the stable release, installed via apt-get as of two-ish days ago)

Thanks!

I edited the title😉

What bindings do you have installed?

Quite a few… I’m currently doing a binary search, enabling and disabling bindings, but given how many bindings I use and that it takes a few hours to see the effect, it’s slow going; I have not found the culprit, or even narrowed it down, yet.

Bindings:
Amazon Echo, aqi, astro, bluetooth, insteon, minecraft, mqtt, nest, roku, systeminfo, tesla, Tradfri, UPnP, http, samsungTV, unifi, zwave, ecobee, opensprinkler, openweather, TpLinkRouter, Zoneminder

Some of those are more suspect by their nature than others, such as http, UPnP, and Nest, but the experiment is still running.

It would be really nice to have an easier way of relating threads to bindings. Or, since they are HttpClient threads, even knowing which server they are waiting on would be immensely useful.
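
For what it’s worth, listing the process’s established TCP connections (8080 is the PID) at least shows which remote endpoints are involved, although an idle pool thread does not necessarily hold a connection open, so it only narrows things down so far:

$ sudo ss -tnp | grep 'pid=8080'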

There was another thread a few weeks back where some users seemed to have a similar condition, but they were using the Shelly binding. You may want to give it a read; it offers some tips as well.

Since you did not mention that binding in your list, we will assume you are not using it.
As for how to tie a thread to a binding: you could trigger a support bundle from the console with dev:dump-create.
Then install something like the Eclipse Memory Analyzer, open the dump file, run the analysis, and poke around; you should see which binding is making all those requests, or at least get more insight into what is occurring.
You do have a memory constraint, as you can see from your resource printout: free physical memory is now low, very likely because of all the threads in a TIMED_WAITING state.
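
If it helps, roughly the sequence (the dump zip should land in the userdata directory, /var/lib/openhab on an apt install, with threads.txt inside; 8080 is your PID from shell:info, and the .hprof path is only an example):

$ openhab-cli console
openhab> dev:dump-create
$ sudo -u openhab jcmd 8080 GC.heap_dump /tmp/openhab.hprof

The .hprof heap dump is what the Eclipse Memory Analyzer opens; jcmd ships with the JDK and must run as the same user as the openHAB JVM.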

For Nest, do you have any WWN Things left?

I did see that thread, and it started my investigation on the heap side of things, but I did not see any unexpected heap behavior. While the server’s free memory is low, the OH Java process has plenty of memory…
Current heap size 461,405 kbytes
Maximum heap size 2,097,152 kbytes
Committed heap size 1,048,576 kbytes

I see the heap grow to ~1 GB (the committed size), then the GC runs and it drops back down to about 300 MB, and the cycle repeats… it never goes above 1 GB, and the long-term average doesn’t seem to be trending up. So this is why I’m thinking it’s a thread-related problem, and not Java heap or server RAM…
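
In case anyone wants to reproduce the graphing: the JDK’s own jstat gives numbers that are easy to plot, e.g. one GC/heap utilisation sample every 10 seconds for PID 8080 (run it as the same user as the JVM; nothing openHAB-specific here):

$ jstat -gcutil 8080 10000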

I have a dump, but the Eclipse Memory Analyzer is getting a bit far from my areas of expertise. If that’s the best (or only) way to see what’s spawning all the threads, I guess I’ll have to dig into it, but I haven’t really played with it before…

Thank you for your response!

Yes… um… why do you ask…? 🙂
That binding was on the “suspicious” list as it’s been giving me problems. It was disabled last night (along with many others) and I did not see the problem… so I will look more closely there today.

Thanks!

See OH4 runs out of memory - #29 by robbert

It seems there was a bug in the binding that was triggered when Google shut down the WWN service. Since the service can’t be used anymore, the solution is simple: delete your WWN Things. WWN support has been removed entirely in 4.1, and with it the bug.
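
If the Things were created through the UI you can simply delete them there; from the Karaf console something like this should also work (the thing UID below is only an example; use whatever the list shows for your WWN Things):

openhab> openhab:things list | grep nest
openhab> openhab:things remove nest:wwn_thermostat:home_livingroom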

Yeah, it was the Nest binding. I had looked at the thread you pointed me to, but among those 113 replies, largely about out-of-heap debugging, I had missed the three that mention the Nest binding and threads, which you then pointed me directly to, so thank you very much. I’ve disabled that binding and now everything is fine.

To be clear, in language that I am trying to optimize for future searching purposes…

The Nest binding causes openHAB 4.0.4 to fill up with threads until the JVM is unable to create new native threads. On heap-limited systems the bad binding behavior seems to show up as out-of-memory issues (see the links above). On systems with plenty of memory allocated, the thread limit is hit first, with many HttpClient threads left in a WAITING/TIMED_WAITING state. On my system this showed up as the ZWave binding giving errors because it could not create new native threads.
