openHAB stops sending commands, but can still receive them

Hi,
I’ve run into a very strange problem.
I’m running the openHAB 3.3 release build in a Linux container with Debian Buster. The container runs on Proxmox 7.1. For Java I use Zulu 11.0.16.

My problem is that (as far as I can see) openHAB randomly stops being able to send commands to any devices. Updates from the devices are received and their Items are updated as they should, but no commands are sent out. It doesn’t matter which binding/protocol I’m using (Zigbee, MQTT, Z-Wave, WiFi…) or whether the Item is triggered via a rule (all rules are running as they should) or manually; the problem is the same.

I don’t get any errors in the openHAB log, and in the event log the commands are visible and look normal. But if we take MQTT for instance, no messages appear on the topics.
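One way to verify this (a sketch, assuming the mosquitto clients are installed on the broker host):

# Subscribe to all topics and print them verbosely; nothing shows up
# here when openHAB sends a command
mosquitto_sub -h localhost -t '#' -v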

Restarting the openHAB service (systemctl restart openhab.service) fixes the issue, even though I think the restart takes longer than usual. After a restart it runs fine for a few days and then the problem appears again.

The problem started a month or two ago, and it feels like the time until it reappears gets shorter after each restart. The first time it took more than a week; now I restarted yesterday morning and 25 h later I had the same problem again.

I don’t even know where to start troubleshooting.

What about the RAM usage? I’ve seen some weird behavior from openHAB when it reaches the memory limit. Giving the LXC a bit more RAM did the trick for me.
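If you want to check, a minimal sketch (the container ID 101 is just an example, adjust to yours):

# Inside the container: current memory usage and limit
free -h

# On the Proxmox host: raise the LXC memory limit to 4 GB
pct set 101 -memory 4096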

By the way: I would recommend Debian Bullseye + OpenJDK 11, as well as an upgrade to Proxmox 7.3-3 :wink:

Thanks @Udo_Hartmann Udo, I have 4 GB of RAM allocated to the container and it runs stably at around 2 GB of usage (I seldom exceed 2.5 GB). But I’ll keep an eye on it.

Regarding the upgrades

  • I just upgraded to Proxmox 7.3-3. Thanks for the heads up
  • I was too quick when writing the Debian dist name; I am running Bullseye
  • Maybe I don’t follow the “lingo”. This is what I get when I run the command “java -version”:
openjdk version "11.0.16" 2022-07-19 LTS
OpenJDK Runtime Environment Zulu11.58+15-CA (build 11.0.16+8-LTS)
OpenJDK 64-Bit Server VM Zulu11.58+15-CA (build 11.0.16+8-LTS, mixed mode)

Is this what you mean by OpenJDK?

Look at CPU usage. If you have a memory leak, the garbage collector will eventually run all the time, in some cases causing 100% CPU usage.
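A quick way to watch for this (assuming the JDK tools are installed; replace <openhab-pid> with the actual process id):

# Print GC utilization every second; rapidly growing FGC/FGCT columns
# indicate the JVM is stuck in full garbage collections
jstat -gcutil <openhab-pid> 1000

# Show per-thread CPU usage of the openHAB process
top -H -p <openhab-pid>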

What about rule threads? Previously you could run out of rule threads. You can check this in the Karaf console with shell:threads. You can also take a thread dump of the openHAB Java process to see what it is doing.
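For example (a sketch, assuming the default console credentials, user openhab with password habopen):

# Connect to the Karaf console on the openHAB host
ssh -p 8101 openhab@localhost

# At the openhab> prompt: list all threads and filter for rule threads
shell:threads --list | grep -i rule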

The CPU is not a problem, the container is running at around 1% CPU load with some peaks up to 40%.

However, the rule threads are something I’ve come to suspect, since I’m using Thread::sleep even though I know I shouldn’t. I was just not sure how to check that. I will keep an eye on it and definitely check the log next time it freezes. What do you mean by thread dump and how do I do it?

Well, at least in openHAB 3.3 that should not do any harm anymore (although it’s bad style).

In openHAB 1 and openHAB 2 there were 5 + 2 threads for rules: five for normal event triggers and two for scheduled triggers like Time cron or createTimer(). A rule could also be executed in parallel with itself, so when using Thread::sleep() with big values there was a good chance that one rule would occupy all the threads, leaving none for the rest.

In openHAB 3 each rule has its own thread and can’t be executed twice at the same time, but the event should be saved and the execution should be caught up. (Uhh, is this correct grammar?)

Don’t know about the grammar, but I understand what you mean :smiley: Thanks for the explanation!

@Seaside Now the same thing happened again. When I check the threads I get the following results:

shell:threads --list |grep "RuleEngine" |wc -l
1
shell:threads --list |grep "thingHandler" |wc -l
6

I don’t think there is anything wrong with the rules or the Things. However, when I check for BLOCKED threads I see that the Chromecast binding has caused a few. Could this be the issue, and how do I troubleshoot it? I know that I get some warnings about the Chromecast binding in my OH log, but I think I had those before this problem started.

shell:threads | grep BLOCKED
"OH-safeCall-283" Id=39574 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-284" Id=39587 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-285" Id=39647 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-295" Id=40043 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-296" Id=40050 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-317" Id=42821 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-323" Id=43934 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-324" Id=43938 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-330" Id=44136 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d
"OH-safeCall-331" Id=44234 in BLOCKED on lock=su.litvak.chromecast.api.v2.ChromeCast@5ab7889d

The next step would be a thread dump to get an idea of why the threads are being blocked.
There are a number of ways to take a thread dump. See https://www.baeldung.com/java-thread-dump
I usually use jstack.
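A minimal sketch (run it as the same user the JVM runs under, e.g. via sudo -u openhab):

# Get the pid of the openHAB service
PID=$(systemctl show -p MainPID --value openhab.service)

# Dump all thread stacks, including information about held locks (-l)
jstack -l "$PID" > /tmp/threaddump.txt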

So thread dumping will give you stack traces for all currently running threads and can give an indication of where the threads are stuck. I would report a bug on the Chromecast binding if you think that is the issue. Then you have to describe how you are using it and provide the thread dump as well.
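Once you have the dump, you can pull out the stack of one of the blocked threads to see where it is waiting and which thread holds the lock, e.g. (the file name is just the example from above):

# Show the stack trace of one blocked safeCall thread
grep -A 25 '"OH-safeCall-283"' /tmp/threaddump.txt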

One easy way to see if it is the Chromecast binding is to remove the binding and the related rules and see if you still have the issue or not.
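If it is installed as an add-on, you could also just stop its bundle temporarily from the Karaf console (a sketch; the bundle id will differ on your system):

# Find the Chromecast binding bundle and stop it
bundle:list | grep -i chromecast
bundle:stop <bundle-id>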

/S