Karaf thread count comparison while resolving "OOM : unable to create new native Thread"

gitMiguel · March 8, 2020, 10:26am

Hi,

My system has gone unresponsive (not crashed) a few times, and the logs showed multiple lines of "java.lang.OutOfMemoryError : unable to create new native Thread". So the problem is not the memory but the overall thread count. It seems to me that there are quite a lot of HttpClient threads running, even if they are in the TIMED_WAITING state. To debug the problem further, I have disabled most of my rules and bindings and added EXTRA_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError". I’ll enable bindings and rules one by one over time and wait for the OOM.

While I wait, could you post your thread counts for comparison? If somebody has an idea for a better comparison, I’ll happily use it. I’m not sure this is the best method.

Here’s mine just after restart of openHAB:

openhab> shell:threads | grep '\".*\"' | wc -l
94
openhab> shell:threads | grep HttpClient | wc -l
1
openhab> shell:threads | grep 'HttpClient.*TIMED_WAITING' | wc -l
1
openhab> shell:threads | grep 'HttpClient.*RUNNABLE' | wc -l
1
openhab> shell:threads | grep 'HttpClient.*BLOCKED' | wc -l
1

And after two days with only a few rules disabled:

openhab> shell:threads | grep '\".*\"' | wc -l
2650
openhab> shell:threads | grep HttpClient | wc -l
2561
openhab> shell:threads | grep 'HttpClient.*TIMED_WAITING' | wc -l
1921
openhab> shell:threads | grep 'HttpClient.*RUNNABLE' | wc -l
641
openhab> shell:threads | grep 'HttpClient.*BLOCKED' | wc -l
1

Here are my system details:
ProxmoxVE 6.1-7 with a Debian 10 container running openHAB.
4-core Xeon with 8GB RAM. All cores and 4GB of RAM are assigned to the openHAB container.
openHAB 2.5.2 with rules written in Jython.

root@openHAB:/# java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (Zulu 8.44.0.11-CA-linux64) (build 1.8.0_242-b20)
OpenJDK 64-Bit Server VM (Zulu 8.44.0.11-CA-linux64) (build 25.242-b20, mixed mode)
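
The separate grep counts above can also be produced in one pass. The following is a minimal sketch, not a tested openHAB tool: it assumes the shell:threads output has been saved to a file (the name threads.txt is hypothetical) and that each thread's header line contains the quoted thread name and its state on the same line, as the grep patterns above imply. summarize_threads is a made-up helper name.

```shell
#!/usr/bin/env bash
# Summarise a saved thread dump by thread-name group and state.
# Assumption: a thread header line looks roughly like
#   "HttpClient@1a2b3c-42" Id=42 in TIMED_WAITING on lock=...
# (format inferred from the grep patterns in this thread, not verified).
summarize_threads() {
  awk '
    match($0, /"[^"]+"/) {
      # Thread name without the surrounding quotes.
      name = substr($0, RSTART + 1, RLENGTH - 2)
      # Group related threads: drop "@hash..." and a trailing "-<number>".
      sub(/@.*$/, "", name)
      sub(/-[0-9]+$/, "", name)
      state = "UNKNOWN"
      if (match($0, /TIMED_WAITING|WAITING|RUNNABLE|BLOCKED/)) state = substr($0, RSTART, RLENGTH)
      count[name " " state]++
    }
    END { for (key in count) print count[key], key }
  ' | sort -k1,1nr -k2
}

# Usage (threads.txt is a hypothetical file holding saved shell:threads output):
#   summarize_threads < threads.txt
```

Each output line is a count followed by the thread-name group and state, largest group first, which makes it easier to compare two snapshots taken some hours apart.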

splatch · March 8, 2020, 10:53am

The "unable to create new native thread" error indicates that you have a lot of threads. Without information about the kind of operations you run, it's difficult to determine the exact reason. It might be a thread leak in one of the bindings, where each call ends up creating a new thread instead of reusing threads from a thread pool.

An interim solution is to increase the thread stack size via the -Xss JVM option: https://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom

A thread dump might help to identify the kind of threads being created. Be aware that if the code creating a thread does not assign it a proper name, it will be difficult to pin down the cause this way.

Best,
Łukasz

gitMiguel · March 8, 2020, 5:02pm

Agreed. That’s why I’m requesting thread counts from random users, to make rough comparisons along the way while enabling bindings. Hopefully that gives me a clue where to dig deeper. To me the HttpClient thread count seems too high and is the first suspect. I know this doesn’t tell me anything specific about the bindings, but at least it’s a number to track, and when it goes higher than normal, I know the last binding enabled might be the source of my problem.

What information do you need?

How does increasing the stack size help to limit the thread count? Doesn’t it reserve more and more memory as new threads are created and so do the opposite, making this error happen earlier? Conversely, decreasing the value would let more threads fit into memory, but might that then lead to stack overflow errors? Please correct me if I’m wrong.
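
The arithmetic behind these questions can be sketched as below. This is only a rough illustration: it assumes every thread reserves a full stack of the configured size as virtual address space, and it uses 1024 KB as an assumed default for a 64-bit Linux JVM (the actual value can be checked with java -XX:+PrintFlagsFinal -version | grep ThreadStackSize). stack_reservation_mb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough upper bound on the address space reserved for thread stacks, in MB.
# Arguments: number of threads, stack size per thread in KB (the -Xss value).
stack_reservation_mb() {
  local threads=$1 stack_kb=$2
  echo $(( threads * stack_kb / 1024 ))
}

# Thread count reported earlier in this thread, with an assumed 1024 KB stack:
stack_reservation_mb 2650 1024
# Same thread count with a smaller, hypothetical -Xss256k:
stack_reservation_mb 2650 256
```

Under these assumptions a larger -Xss makes each thread more expensive, so the limit is reached with fewer threads, while a smaller value allows more threads but leaves less room for deep call stacks. Neither setting stops a thread leak from eventually hitting whatever limit applies.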

I already took one heap dump and used Eclipse Memory Analyzer. Most of the threads were named HttpClient@blahblah1245, as you guessed. It was my first time using EMA, so I might have to take another look to see if I can follow the stack traces to the root cause.

Has anybody used https://fastthread.io/ for analyzing heap dumps? It looks promising, but as an online tool it is not the most secure option. If I use it, am I in danger of leaking passwords etc.?

gitMiguel · March 10, 2020, 8:33am

For now the problem seems to be solved. It was one of my own bindings, which was not finishing HTTP requests properly or did not consume the response content. I fixed everything and the thread counts haven’t increased overnight, so I take this as a victory.

As a reminder for fellow debuggers: disabling everything suspicious in openHAB, rules and/or bindings, and enabling them again one at a time once more led to the root cause.

Martin_Zobel-Helas · May 1, 2020, 10:29pm

I would be very interested in what exactly this was, because I am currently running into the exact same problem. What do you mean by “my own bindings”? Have you written your own binding?

Andrew_Rowe · May 2, 2020, 2:13am

Yes, Miika was using a self-authored binding.
Consider taking his advice.

gitMiguel · May 2, 2020, 8:34am

As Andrew stated above, it is a binding that I made for myself to push IP address changes to various services. You could say it’s like a DynDNS client.

Part of it still uses Apache’s HttpClient. The plan is to do a full swap to Jetty. A somewhat sloppy implementation meant that not all connections were closed. I then changed to CloseableHttpClient with try-with-resources and used EntityUtils.consume().

BigGeorgeTx · June 17, 2020, 2:09am

Thanks for sharing your experience.

I am relatively new to openHAB and to Linux. I am running openHABian on a Raspberry Pi 4 B.

I was receiving a lot of the “unable to create new native Thread” warnings, which led me to your post.

I can see the number of threads by typing “ps -eLf”, but I would like to see it summarized as in your post. But when I enter “shell:threads | grep '\".*\"' | wc -1” at the openHAB prompt (using SSH from my Windows 10 laptop), I get a message that the “shell:threads” command was not found.

Any suggestions?

Thanks for your assistance.

gitMiguel · June 17, 2020, 4:48am

You need to be in the Karaf console for that command to work. The easiest way would be:

openhab-cli console

with password habopen
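
For a summarised count from the regular Linux shell (outside the Karaf console), something along these lines may work. It is a sketch that relies on Linux exposing one directory per thread under /proc/<pid>/task; thread_count is a hypothetical helper name, and finding the openHAB process with pgrep is an assumption about how the process appears on a given system.

```shell
#!/usr/bin/env bash
# Count the threads of one process by listing its per-thread
# directories under /proc (Linux only).
thread_count() {
  local pid=$1
  ls "/proc/${pid}/task" | wc -l
}

# Thread count of the current shell, as a quick sanity check:
thread_count $$

# Hypothetical usage for openHAB, assuming a single matching Java process:
#   thread_count "$(pgrep -f openhab | head -n 1)"
```

This gives only the total per process; the per-name breakdown still needs the Karaf console, because the thread names used in the grep patterns come from the JVM.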

BigGeorgeTx · June 17, 2020, 3:58pm

Thanks! That worked. (And I got the thread information I was looking for once I realized the “l” was a lowercase “L” and not the number one.)

And how do I return to openHABian without closing something I don’t mean to? I see there is an “exit” command, but I am not sure whether it will simply take me back to openHAB or exit completely.

gitMiguel · June 19, 2020, 4:32pm

I might be late here but:

https://karaf.apache.org/manual/latest/remote

so CTRL-D or just type logout.

rossko57 · June 19, 2020, 4:36pm

You may find the recent ‘amazon echo’ posts relevant.

BigGeorgeTx · June 19, 2020, 8:50pm

Thanks for the note. I had found those posts. I am now trying to figure out how to load the new .jar onto my Pi 4 setup to fix the problem.

I was able to load it using this procedure: [SOLVED] Openhabian: How do I update one binding to latest snapshot only

It appears to have solved the OOM error with the amazonecho control.