[SOLVED] Z-Wave unreliable in 2.5.0.M4

I have not finished my investigations because of time constraints. I’ll post the results here as soon as I find the exact reason.

The issues (#1195, #1178) you reported on GitHub are still open.
If it turns out that they still exist in M2, M3, M4 and the snapshots, why have they not been resolved so far, or why was it decided to publish the milestones anyway?

I did not report the first one, and it was a known issue. The second one was mine. My workaround was to disable the daily network heal and restart OH. I can still heal individual devices, if needed.

It’s a rare issue, possibly related to large networks (>120 devices). Chris is extremely busy, and with all the changes this year, the IDE is not what it was, especially when it comes to debugging.

I’m not the right person to ask, but the 2.5M2 preparation topic may shed some light on that.

I only have 10 nodes (1 controller, 1 repeater, 5 FLiRS and 3 “non-listening” nodes, i.e. 3 motion sensors). Only the last 3 cause trouble after the “daily healing”.

The workaround for both issues is easy. Just disable the nightly heal. My network of about 100 nodes has been running fine without the nightly heal for a couple of months.

Did you do a manual heal for each node? How often? How do you know if it’s necessary? Do you restart the OH server each time?

I’d do that if necessary, but I haven’t needed to do it yet. My network is pretty stable. If I add some new nodes I might need to do it. If I see a node starting to become flaky, I might try running a heal on that node. But I haven’t needed to do that either.

I should note that after disabling the nightly heal, a restart of openHAB is advised just in case the binding is in a weird state due to a previously run heal.

@mhilbush

One last question.

Suppose you have a nightly heal at 2:00 A.M. and all nodes except one or two have been healed within the next 24 hours. Should you then do a manual heal of the missing nodes until it is complete, then “disable heal” and restart OH afterwards?

In my analysis, if a heal doesn’t complete in 24 hours, it will never complete. In this scenario, I’ve seen the binding get stuck in the “initialize/heal” state for that node. Once in this state, it will never get out of that state until the binding is restarted. Once stuck in this state, the device will never be polled again until you do a restart. That’s why I disable the nightly heal, then do an openHAB restart.

This is just my experience, of course. OTOH, it’s backed by log files that show this behavior.
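To make the 24-hour rule of thumb concrete, here is a minimal sketch. It assumes you have already pulled a heal start time per node out of your own log files; the class name, method name and input map are hypothetical and are not part of the binding.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StuckHealCheck {
    // Rule of thumb from this thread: a heal still running after 24 hours will not finish.
    static final Duration HEAL_LIMIT = Duration.ofHours(24);

    // healStarted maps a node id to the time its heal began (hypothetical input,
    // e.g. collected by hand from the log). Returns the ids of nodes whose heal
    // has been running for longer than HEAL_LIMIT, in ascending order.
    public static List<Integer> stuckNodes(Map<Integer, Instant> healStarted, Instant now) {
        return healStarted.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(HEAL_LIMIT) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Any node this returns is a candidate for the "disable nightly heal, then restart openHAB" workaround described above.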

With a battery-operated sensor, I found that removing and reinserting the battery and making sure it was awake brought it fully back online.

@mhilbush @Bruce_Osborne

In my case, I can do a “manual heal” by waking the node up many times (triple-pressing the button), without restarting OH!

EDIT: I have to check whether pressing once has the same effect (-> heal successful).

If a node is hung trying to heal, that may not work.
Verify how to wake up your device. On some devices, triple-clicking is used for inclusion/exclusion mode.

But then I should have ended up with many ghost nodes, and I have not. The controller is NOT in inclusion/exclusion mode during a heal, is it?

But you are right. 3x press (include/exclude); 1x press “wake up” :open_mouth: :open_mouth:

The device’s inclusion/exclusion mode likely just times out if the controller is not in inclusion or exclusion mode.
Some of my devices take a single press & I have one where you are supposed to press & hold for at least 5 seconds to wake up.

Hmm. I’m not sure why cycling the power and waking up the device fixes the issue completely. The primary symptom of the issue is that the initialization thread (which is used for device initialization, as well as heal), never completes. While it might’ve shown the node with a thing status of ONLINE, how do you know the binding killed the node initialization thread?

In the issue I opened, and as documented in the log file, once in this state, wake-ups have no effect on resolving the issue. I’ve found nothing short of an OH restart will fix the issue. Even a binding restart won’t fix it, because the binding never kills the orphaned initialization threads. This behavior can be demonstrated by looking at the zwave threads using the Karaf console. I’ve observed numerous node initialization threads even after a binding restart.

Edit: To be clearer: after stopping the binding, the init threads are still there.
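For anyone who wants to check for leftover threads programmatically rather than by eye, here is a minimal sketch. It is not the binding’s own code. It assumes it runs inside the same JVM as openHAB (a standalone run only sees its own threads), and the default name fragment "zwave" is an assumption about how the threads are named, not a documented scheme. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class ThreadNameFilter {
    // Returns the names of all live threads in the current JVM whose name
    // contains the given fragment (case-insensitive), sorted alphabetically.
    public static List<String> liveThreadNames(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .map(Thread::getName)
                .filter(name -> name.toLowerCase(Locale.ROOT).contains(needle))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "zwave" is only a guess at the naming; pass another fragment as the first argument.
        String fragment = args.length > 0 ? args[0] : "zwave";
        liveThreadNames(fragment).forEach(System.out::println);
    }
}
```

Comparing the list before and after stopping the binding would show whether any initialization threads survive the stop.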

See discussion here.

But I can see in HABmin that “Last Heal Time” gets updated with the current date/time if the heal was successful. Is this info wrong?

I don’t know. It’s unclear to me when the binding updates Last Heal Time – at the beginning of the heal, or at the end of the heal.

Here is why I thought it was “triple-pressing”: it was not translated correctly!

English: [image]

German: [image]

3)… dreimal = 3x
6) …“drücken” = press (but not how often!) …“aufzuwecken” = “to wake it up”

“drücken Sie einmal” = “press once” would be better :smiley: