Some bindings get into a non-responsive state every few hours (in v2.5.6 - v2.5.8)

It’s been quite a while, but I do have a new update…

My system had been running fine (as described above) on 2.5.5 ever since the downgrade.
With 2.5.8 I wanted to give it another try. Unfortunately that failed: after the upgrade, the hanging-bindings problem occurred again.

I dug into my Grafana charts of persisted items as well as the events.log and found that some bindings (incl. Hue) stopped working while others kept working.

Still working:

  • Novelan
  • ComfoAir
  • MQTT
  • Z-Wave
  • Homematic

Not working:

  • Astro
  • OpenWeatherMap
  • Hue
  • System Info

Those of you who encountered the same issue, did you find the reason, or does someone else have an idea what this could be?

Comment only - those look like bindings with scheduled activities (polling).

If bindings stop updating it’s very likely a deadlock on the scheduled tasks. There is a limited set of threads available that is shared between bindings. If bindings don’t play nice or even lock those threads, all other bindings will also stop. To see if there are any deadlocks you can use the following command in the karaf console:

threads --monitors --locks

It’s a bit verbose, but it will give information about locks and should show where it’s blocked.
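To make the shared-pool problem concrete, here is a minimal sketch (plain Java, not actual openHAB code) of how a couple of blocking tasks starve every other job scheduled on a small shared pool:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedSchedulerStarvation {
    public static void main(String[] args) {
        // Small shared pool, similar in spirit to a common scheduler shared by bindings.
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

        // Two tasks that never return (e.g. stuck on a lock or a socket).
        for (int i = 0; i < 2; i++) {
            pool.scheduleWithFixedDelay(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE); // simulates a task blocked forever
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, 0, 1, TimeUnit.SECONDS);
        }

        // A well-behaved polling task: it never runs because both workers are occupied.
        pool.scheduleWithFixedDelay(
                () -> System.out.println("polling..."), 0, 1, TimeUnit.SECONDS);
    }
}

Run as-is this never prints anything, because both worker threads are stuck; that is roughly what a blocked binding does to every other binding sharing the pool.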

What is the meaning of this output? I’ve been having some seemingly odd issues of late, and I know that my MQTT Homie binding is probably one that ties up a lot of resources. Would love some info on deciphering this. Thanks!

Really? Doesn’t that just lie there passively until someone publishes a message for it?

True - I just know I have a lot of devices here (over 50) that are fairly active, and I’ve had a few issues where they stopped updating from MQTT (commands still work). There is a separate thread where that is being looked into, though.

So your broker might be pants.

It isn’t the broker - everything in MQTT works fine. There is a separate discussion about this on GitHub, though.

Hi hilbrand,

thanks a lot for that excellent hint, a deadlock sounds very plausible!
I wouldn’t have expected a scheduled task to be the cause, as the lock situation always occurs at a different time.
But I found something that looks like the right way to the root of the problem.

This is the result of my threads --monitors --locks:

openhab> threads --monitors --locks|grep BLOCKED
"OH-thingHandler-1" Id=222 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-thingHandler-3" Id=229 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-thingHandler-4" Id=232 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-thingHandler-5" Id=233 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"upnp-main-243" Id=2133 in BLOCKED on lock=java.lang.Object@197628b
"OH-discovery-75" Id=2193 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"upnp-main-258" Id=2242 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-discovery-77" Id=2244 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"HarmonyDiscoveryServer(tcp/39185)" Id=2245 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"upnp-main-262" Id=2261 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-discovery-83" Id=2550 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-discovery-84" Id=2590 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"OH-discovery-85" Id=2642 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"HarmonyDiscoveryServer(tcp/40065)" Id=2690 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
"HarmonyDiscoveryServer(tcp/35275)" Id=3111 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9

That shows that org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9 is the one blocking other tasks.

Regarding the Hue binding I can’t find anything at all in the threads information, so the OH-thingHandler threads seem to be blocking the thing actions.

After a service restart there are no locks anymore.

Do you know how to dig deeper into DiscoveryServiceRegistryImpl and what @1a7e1c9 means?
Meanwhile I’ll try to stop the discovery services in Karaf and see if the thing actions hang again.

There are blocks on thing handler threads (OH-thingHandler-*) and discovery threads (OH-discovery-*)

and the blocks mention the HarmonyDiscoveryServer. It’s very likely the discovery process and handler of that binding don’t work well together. In the parent class of the handler a number of methods are synchronized. It’s possible the handler and the discovery access different methods at the same time, causing them to wait for each other.
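As a rough illustration of that pattern (again plain Java, not the actual harmonyhub code), two synchronized methods on the same object share one monitor, so a second caller blocks exactly the way the DiscoveryServiceRegistryImpl@1a7e1c9 entries above do:

public class MonitorContention {

    static class Registry {
        // Both methods lock the same monitor: the Registry instance itself.
        synchronized void startScan() {
            try {
                Thread.sleep(60_000); // stands in for a scan that never finishes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        synchronized void thingDiscovered() {
            // Cannot be entered while startScan() holds the monitor.
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        new Thread(registry::startScan, "OH-discovery-sim").start();
        // This thread will typically show up as BLOCKED on lock=Registry@<hash> in a thread dump.
        new Thread(registry::thingDiscovered, "OH-thingHandler-sim").start();
    }
}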

The binding also does background discovery. If you disable background discovery on that binding the problem might not happen (it could still be triggered if you run a manual discovery, but that’s less likely). But that is only a temporary solution.
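If I recall the runtime configuration correctly, background discovery can also be switched off per binding in conf/services/runtime.cfg; the key format below is from memory, so please double-check it against the docs:

# conf/services/runtime.cfg
# Disable background (automatic) discovery for a single binding (key format from memory).
discovery.harmonyhub:background=false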

I also suggest reporting this issue on the openhab-addons GitHub.

This is an internal runtime hash of the object.
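For what it’s worth, that suffix comes from Java’s default Object.toString(), which appends the identity hash code in hex, so every thread listed as blocked on …@1a7e1c9 is waiting for the very same registry instance:

public class IdentityHashDemo {
    public static void main(String[] args) {
        Object lock = new Object();
        // Default toString(): class name + "@" + hex identity hash, e.g. java.lang.Object@1a7e1c9
        System.out.println(lock);
        System.out.println(Integer.toHexString(System.identityHashCode(lock)));
    }
}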

I configured the Harmony binding in PaperUI, which does not offer any discovery settings in the binding or the hub thing configuration.
To my understanding, disabling general discovery in OH should disable auto discovery for all bindings.

Anyway, I will monitor the behaviour with the harmony binding and the OH discovery services completely disabled in a first run (before disabling discovery via .cfg):

openhab> bundle:list|grep Resolved
139 │ Resolved │ 80 │ 2.5.0 │ openHAB Core :: Bundles :: Configuration Discovery
140 │ Resolved │ 80 │ 2.5.0 │ openHAB Core :: Bundles :: Configuration mDNS Discovery
247 │ Resolved │ 80 │ 2.5.8 │ openHAB Add-ons :: Bundles :: HarmonyHub Binding
264 │ Resolved │ 80 │ 2.5.0 │ openHAB Core :: Bundles :: Configuration UPnP Discovery
265 │ Resolved │ 80 │ 2.5.0 │ openHAB Core :: Bundles :: Configuration USB-Serial Discovery
266 │ Resolved │ 80 │ 2.5.0 │ openHAB Core :: Bundles :: Configuration USB-Serial Discovery for Linux using sysfs scanning
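(For reference, bundles can be put into that Resolved state from the karaf console with bundle:stop and the ids shown in the listing, for example:)

openhab> bundle:stop 139 140 264 265 266 247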

I will bring this to the openhab-addons GitHub once I have a more precise description based on these results.
Again, thanks for pointing me in this direction!

Do you know if the behaviour in the parent class changed since 2.5.6 (I can’t find anything in the release notes)?
On 2.5.5 the exact same environment ran, and still runs when downgrading, flawlessly.

You should be able to disable discovery for a specific binding, at least via the karaf console: Runtime Commands | openHAB

I also see upnp in the block list, so it might also be caused by that binding. The harmonyhub binding hasn’t changed since that version.
The threads command also gives the full stack trace. That should show at what point the block actually happens and give some insight into where to look.
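For example (assuming the console’s grep supports context options the way GNU grep does; otherwise just scroll through the full output), something like this should print a blocked thread together with its stack frames:

openhab> threads --monitors --locks | grep -A 30 "OH-thingHandler-1"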

That’s not related to a specific binding, but to any binding that uses upnp for discovery. That would point to the Hue binding in your list of bindings.

Just a first wrap-up:
With the discovery processes and the harmony service disabled, the system has kept running properly for >24h now :slight_smile:

I enabled the harmony service again and will keep analyzing whether the lock is due to the upnp discovery service in conjunction with the harmony binding. Will report back soon.


After some more testing I found the combination of the upnp discovery and the harmony binding to be causing the deadlock for other bindings.

I opened up an issue in the GitHub repo here.

Thanks for all your help!
