Docker - Server stops working properly until restart

Hello everyone,

I have a problem with my openHAB installation. It's an old installation (2.0, if not 1.x) that I've upgraded many times. In the past it was on dedicated hardware, which I later moved onto the official Docker image.

The underlying OS is Unraid, which is a hypervisor like Proxmox and such, based on Slackware.

The problem I have is that after my Docker container has been running for a while, I start having issues. Bindings stop receiving external updates (for example, a Tapo light switch that is turned on still shows as off in openHAB, while triggering it from openHAB still switches it properly), notifications stop working, etc. Once I restart the openHAB container, everything works properly again.

For example, right now, notifications stopped working. Looking at the log, I'm getting this kind of error.

2026-02-17 08:47:06.251 [ERROR] [.handler.AbstractScriptModuleHandler] - Script execution of rule with UID 'nodered-1' failed: The name 'sendBroadcastNotification' cannot be resolved to an item or type; line 8, column 3, length 89 in nodered

Once I restart, it starts working again.

This is a simple rule, used all the time by Node-RED to send broadcast notifications, nothing crazy.

rule "Send Broadcast"
when
    Item vBroadcast received update
then
    if (sBroadcastIcon.state.toString == ""){
        sendBroadcastNotification(vBroadcast.state.toString,"","info")
    }else{
        sendBroadcastNotification(vBroadcast.state.toString,sBroadcastIcon.state.toString,"info")
    }
end

I had the same problem with the simple notification rule:

2026-02-17 08:50:39.115 [ERROR] [.handler.AbstractScriptModuleHandler] - Script execution of rule with UID 'nodered-3' failed: The name 'sendNotification' cannot be resolved to an item or type; line 30, column 3, length 98 in nodered

rule "Send Notification to "
when
    Item vNotifyMe received update
then
    if (sNotifyIcon.state.toString == ""){
        sendNotification("####@####", vNotifyMe.state.toString,"","info")  
    }else{
        sendNotification("####@####", vNotifyMe.state.toString,sNotifyIcon.state.toString,"info")
    }
end

Out of nowhere, this stopped working. I just restarted the openHAB container and boom, it works.

How can I find what is causing this?

First you need to find out what version of OZh you are actually running.

The Docker part is probably irrelevant. The issue is most likely either overall system load is too high, or one or more bindings is having problems.

When it is having problems, getting a listing of the threads or a thread dump might be able to identify which bindings are at fault. Or you can systematically remove add-ons until the problem goes away.

There are instructions in the docs describing how to do this from the Karaf console.

This seems like a specific problem related to the "OH cloud functionality". I know next to nothing about that, or about Xtext, but it seems like the error comes from here:

I can only speculate as to how this works, but I assume that these commands are “registered” by some component, perhaps the “OH cloud binding”. The registration then seems to fail after a while, making it basically “an unknown command”. Those that know how this part of the system works should be able to figure out how this might happen.

I’ve looked a bit at the code in the “cloud connector”, and I see a general lack of thread safety and some other potential issues there (like the fact that if no cloud connector has been started, notification attempts will just be dropped silently, without logging anything).

I also see that NotificationAction.cloudService is never set to null in the `@Deactivate` method, which means that if the "cloud service" has been stopped for some reason, scripts will still try to invoke send on the now shut down/non-operational "service". This might produce behavior similar to what we see here.

Making the code thread-safe, actually logging when notifications can’t be sent, and nulling the “service” when it’s no longer available might all help improve the “user experience” of using the cloud connector.
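To make the described pattern concrete, here is a minimal, self-contained sketch of the fix being suggested. The class and method names (CloudService, activate, deactivate) are simplified stand-ins, not the actual openhab-cloud connector code; the point is nulling the static reference on deactivation and logging instead of silently dropping notifications.

```java
// Hypothetical sketch of the lifecycle pattern discussed above.
// Names are stand-ins for the real cloud connector classes.
public class NotificationAction {

    // In the real connector, OSGi sets this when the cloud service
    // component is activated. 'volatile' so all threads see updates.
    static volatile CloudService cloudService;

    // Called on activation (in OSGi, via an @Activate method).
    static void activate(CloudService service) {
        cloudService = service;
    }

    // Called on deactivation. Nulling the reference is the proposed fix:
    // otherwise scripts keep invoking a dead service instance.
    static void deactivate() {
        cloudService = null;
    }

    static boolean sendNotification(String user, String message) {
        CloudService service = cloudService; // read the field once
        if (service == null) {
            // Log instead of dropping the notification silently.
            System.err.println("Cloud service not available, dropping: " + message);
            return false;
        }
        return service.send(user, message);
    }

    // Simplified stand-in for the real cloud service interface.
    interface CloudService {
        boolean send(String user, String message);
    }

    public static void main(String[] args) {
        System.out.println("before activate: " + sendNotification("user", "hello"));
        activate((u, m) -> true);
        System.out.println("after activate: " + sendNotification("user", "hello"));
        deactivate();
        System.out.println("after deactivate: " + sendNotification("user", "hello"));
    }
}
```

With this shape, a stopped service results in a logged failure rather than a call into a dead object, which matches the "user experience" improvements listed above.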

Thanks Nadahar for looking at the example I posted and not just saying it's a "load issue". Clearly, the problem here is not a load issue, since we are talking about a command that disappears. In this case the cloud was working, but one day I did receive a notification when I opened the Android app saying the notification service was unavailable. Nothing new; I've often seen that in the past when the openHAB cloud has issues. But it seems that it's creating the problem you describe. Is there something to do to get it fixed?

OZh?

With the issue as described, load is not the problem. I'm talking here about a command that stopped working, coming from the openHAB Cloud Connector. I think Nadahar found the problem.

The server's own load, if it were the issue, wouldn't be fixed by just a restart. So a load problem within the Docker container itself? That could be it, but not the server overall. The server itself never even reaches 25% CPU load. Next time I see the problem, I'll do the dump as you suggested.

The problem should be described and “formalized” by creating an issue on GitHub:

I’m not really sure who’s “responsible” for the cloud connector; it looks like many different developers have done bits and pieces. What really needs to happen is that those who know and maintain the code are made aware of the issues. Maybe @lsiepel or @laursen knows who should deal with it?

Ok. In the meantime, I'll wait for my other problem, the slow binding replies. I'm fairly sure it's something with the TP-Link Binding, which is not maintained anymore, since that's where I saw the problem first. Some say it's a load problem; I'll have the dump to see if that's it. It's not an overload of the hardware's capability for sure, but maybe the Docker OS itself.


OH. The version of OH. Was typing from my phone and missed the typo.

No? If you are out of RAM and using swap, which slows things down, a reboot definitely would solve the problem, at least until you run out of RAM again. A simple htop can answer whether this is the problem or not.

It can also be a problem within a single program if there are threads that are misbehaving and not returning resources to the pool. In that case, dumping the threads can tell us if that's the problem.

Load has nothing to do with the CPU. The load is the number of processes that are stuck waiting for some resource (e.g. parts of itself to be fetched from SWAP). When a machine has a high load, the CPU is usually doing almost nothing because everything is stuck waiting for some resource.

It’s a super simple first thing to look at when trying to figure out the problem. My reply was the first response. Excuse me for suggesting doing the absolutely simplest first steps to investigate the problem and eliminate a whole realm of possible causes. I’m sorry my help wasn’t “technical” enough for you.

I didn’t say it was a “load issue”. But a load issue could explain the behavior, or a thread deadlock could explain the behavior. And these are about the easiest things to eliminate as possibilities.

I’m not sure that you were the intended target here - there have been other comments (in the other thread) that were more in that direction.

Actually, that’s ambiguous - “load” is often used about CPU load, GPU load, etc. There are situations where these can’t keep up with the tasks being thrown at them, but in an OH context this seems rare. Instead, it looks like “I/O load” is the most common bottleneck, which is what you describe. It’s however not the “only type of load”, so maybe we should find a more precise term to describe it than just “load”?

When I say “load” I literally mean the technical definition for “load” on Linux.

Number of running and waiting processes / number of CPUs = instantaneous load
exponentially-damped moving averages over 1, 5, and 15 minutes = average load

The three average loads are what is reported by top, htop, etc. as the “Load average”.

When the load is high it usually means one has a bunch of waiting processes on a typical OH installation.
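As a concrete illustration of the numbers being discussed, here is a small sketch that reads the three load averages from Linux's `/proc/loadavg` (the same figures top and htop report) and compares them to the CPU count. It's Linux-only and just an illustration, not an openHAB tool.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Read the Linux load averages and compare them to the CPU count.
// /proc/loadavg holds the 1, 5, and 15 minute averages as its first
// three fields; a sustained value well above the number of CPUs means
// processes are piling up waiting for some resource.
public class LoadCheck {
    public static void main(String[] args) throws Exception {
        String[] fields = Files.readString(Path.of("/proc/loadavg")).trim().split("\\s+");
        double load1 = Double.parseDouble(fields[0]);   // 1-minute average
        double load5 = Double.parseDouble(fields[1]);   // 5-minute average
        double load15 = Double.parseDouble(fields[2]);  // 15-minute average
        int cpus = Runtime.getRuntime().availableProcessors();

        System.out.printf("load averages: %.2f %.2f %.2f on %d CPUs%n",
                load1, load5, load15, cpus);
        if (load5 > cpus) {
            System.out.println("load is high relative to CPU count");
        }
    }
}
```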

It could also be caused by a busy wait in a rule somewhere, but that usually is apparent in other ways (e.g. one CPU pegged at 100%). Actually, now that I think of it, that wouldn’t cause a high load. It could also be caused by having rules that kick off a bunch of calls in the exec binding and/or executeCommandLine and do so really rapidly. That should be pretty apparent in other ways.

The load doesn’t tell us anything about why the processes are waiting around or even whether there are blocked processes in the first place. But all the other information presented by top or htop can usually tell us the answer to that, or at least point towards next steps/commands to run to find out.

One almost never sees the CPU as the limiting resource on an openHAB installation because OH is largely monolithic (i.e. it’s not running as a swarm of hundreds of little processes) and it’s not CPU heavy. And even if it were CPU heavy, it would only add 1 to the overall load calculation. So when the load is high, usually the CPU is doing almost nothing, because all the processes are waiting around instead of running.

If the load is ever high and the CPU utilization is also high, it’s likely that load isn’t a problem at all. A high load with high CPU utilization means processes are not waiting around to run. You just have a whole lot running at the same time.

80% of the time, the load isn’t the problem. But when it is, nothing else we do will make any difference until the root cause of the load problem is solved. And it’s so easy to eliminate as a cause that it’s an obvious first step whenever the symptoms are “it works for a while but then stops responding”, timeout errors, gradual slowing of responses, certain actions disappearing (as things time out because they can’t run, the actions can get reset or stuck in a zombie state), etc.

To make things a bit smoother, I created a fix for the issues I found by looking at the code. I can’t say whether it will fix the DSL failure or not, but it should at least make sure that notifications don’t just “vanish” if the cloud connector isn’t running.

Ok, that explains it. I don’t interpret the word by how Linux has “defined load”, and I wouldn’t expect everybody else to do so either. While CPU load is unlikely to be a problem on a dedicated OH installation, it could be a problem on shared servers (virtualized, containerized, or with everything running in the host OS). So it’s not that “far fetched” to think that at least some people, when they talk about “load”, actually mean CPU load. OH would suffer from that as well.


OH: 5.1.2

I've got 128 GB of RAM, 76 GB free. Unless the container limits itself at runtime via something set by whoever created the image, I don't set a specific limit in my Docker file. I'll still check htop in the container next time (not on the host; the host isn't running out of RAM) and run the Karaf command to dump.

I'm waiting for the slowness to come back. If there were a deadlock on the host's resources, it would normally affect other services as well. Maybe the OS in the container is deadlocked on itself, though. I have to wait to get the data. Since the start of this thread the problem hasn't appeared, which is why no more data has been provided; I cannot get it yet.

Please note that, as explained in the links I provided, the thread dump made from Karaf isn’t suitable to detect thread deadlocks. So, if you’re talking about a thread dump, please do that from the command line instead (I guess that means to “exec” into the container first).

Note, htop won’t be available inside the container, but top will be.

I don’t really think this is the problem, but if it is, nothing else we do will do any good. Eliminating it as a possibility is useful.

ttop from the Karaf console should also tell us if a thread pool is being exhausted, right? Runtime Commands | openHAB. If there are a bunch of threads in a WAITING or TIMED_WAITING state, that could point to a binding run amok and not returning threads to the pool in a timely manner.
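What such a pool-exhaustion check boils down to can be sketched with the JVM's own management API: count threads by state. This is just an illustration of the idea, not something ttop itself does internally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Count JVM threads by state. A large and growing number of
// WAITING/TIMED_WAITING threads can point at a component that is
// not returning threads to its pool in a timely manner.
public class ThreadStates {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null) { // a thread may have died since the ID snapshot
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```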

OK, so to sum up, everything I need to do next time:

  • Run top from the Docker shell
  • Run ttop from the Karaf console

I'll run them and post screenshots, I imagine.

Anything else?

Just for fun, I ran the ttop command per the link you provided. Here's the current output.

Grab a thread dump as well, though that may not be possible…

The container ships with a JRE, not a full JDK, so neither jstack nor jcmd is available inside the container.

It might be we need to make do with ttop and a thread dump from inside the Karaf console. That command is

dev:dump-create

I assume there is nothing in the right two columns. That is where we would see threads being locked. I don't see anything remarkable in this screenshot at the moment. It will potentially be informative when things start to go wrong.

I never really dug into what ttop shows, to be sure that I get everything I can from it. But my goal is thread deadlocks or "serious thread slowdowns", as the most "usual" cause of thread exhaustion, and to see those I need a thread dump where the locks/monitors/mutexes are shown. They are what reveals which thread is waiting for what, and that lets me track down where the "knot" that starts the whole pileup is. The Karaf thread dump, unfortunately, doesn't include this information, and I've found no way to make it do so. With a "normal" thread dump, this is the -l (for "locks", I assume) argument, but it's not implemented in the Karaf dump.

This is why I documented other ways to make thread dumps: with a Karaf dump I can see that there is an issue, but I can't trace its origins.
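For what it's worth, the lock information missing from the Karaf dump is available from the JVM itself via `ThreadMXBean.dumpAllThreads(true, true)`, which includes locked monitors and ownable synchronizers much like `jstack -l`. A minimal sketch (standalone, not a Karaf command):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Print a thread dump that includes lock information, and ask the
// JVM directly for deadlocked threads (findDeadlockedThreads reports
// threads stuck waiting on object monitors or ownable synchronizers).
public class LockAwareDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // true, true => include locked monitors and locked synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() includes lock details
        }

        long[] deadlocked = mx.findDeadlockedThreads();
        System.out.println(deadlocked == null
                ? "No deadlocked threads found"
                : deadlocked.length + " deadlocked thread(s) found");
    }
}
```

Something along these lines could in principle run inside any JRE, so it doesn't require the JDK tools that are missing from the container.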

Yes, you’ve somehow managed to miss what I consider the most important thing, the thread dump:

The documentation even shows how to get it from within Docker, once you have found the OH process PID:

docker exec -it <container> jcmd <PID> Thread.print -l

That's actually a major problem, I'd say. I certainly don't know how to figure out those thread dumps without the lock information. I don't know how the Docker container is maintained, but could one of those tools be "included manually" in some way?