[SOLVED] Z-Wave unreliable in 2.5.0.M4

ariela · October 24, 2019, 9:40pm

I have the same USB key … and healing at 2 AM … updated yesterday so I’ve already passed this point

Bruce_Osborne · October 24, 2019, 10:48pm

There are three current release tracks of openHAB.

The stable release, normally released every 6 months, is currently 2.4 released late last year.

The Testing releases, also known as Milestone releases, are released every month. 2.5M4, or 2.5 Milestone 4, is the current version.

The Unstable, or Snapshot releases are built & released every night and are only designed for developers to test all pieces working together. Snapshot releases are not expected to totally work or be stable.

Andrew_Rowe · October 25, 2019, 2:49am

OK folks… before we all grab our pitchforks and start calling this a bug… these are common traits of a failed node. Sometimes when things have been working well for a long time and we upgrade and things suddenly break, it is easy to blame the upgrade itself, when in fact it is a low battery, a node slightly out of range, or some other edge case. In a system with a lot of nodes, it could have gone unnoticed previously.

Celaeno1 · October 25, 2019, 3:21am

I have not finished my investigations because of time constraints. I’ll post the results here as soon as I find the exact reason.

Celaeno1 · October 25, 2019, 3:51am

The issues (#1195, #1178) you reported on Github are still open.
If it turns out that they still exist in M2, M3, M4 and the snapshots, why have they not been resolved so far, or why was it decided to publish the milestones anyway?

5iver · October 25, 2019, 5:28am

I did not report the first, and it was a known issue. The second one was mine. My workaround was to disable the daily network heal and restart OH. I can still heal individual devices, if needed.

It’s a rare issue, possibly related to large networks (>120 devices). Chris is extremely busy, and with all the changes this year, the IDE is not what it was, especially when it comes to debugging.

I’m not the right person to ask, but the 2.5M2 preparation topic may shed some light on that.

Celaeno1 · October 25, 2019, 11:07am

I only have 10 nodes (1 controller, 1 repeater, 5 FLiRS and 3 “non-listening” nodes, i.e. 3 motion sensors). Only the last 3 cause trouble after the “daily healing”.

mhilbush · October 25, 2019, 11:36am

The workaround to both issues is easy. Just disable the nightly heal. My network of about 100 nodes has been running fine without the nightly heal for a couple months.

Celaeno1 · October 25, 2019, 11:42am

Did you do a manual heal for each node? How often? How do you know if it’s necessary? Do you restart the OH server each time?

mhilbush · October 25, 2019, 11:46am

I’d do that if necessary, but I haven’t needed to do it yet. My network is pretty stable. If I add some new nodes I might need to do it. If I see a node starting to become flaky, I might try running a heal on that node. But I haven’t needed to do that either.

mhilbush · October 25, 2019, 12:30pm

I should note that after disabling the nightly heal, a restart of openHAB is advised just in case the binding is in a weird state due to a previously run heal.

Celaeno1 · October 25, 2019, 1:35pm

@mhilbush

One last question.

Suppose you have a nightly heal at 2:00 A.M. and all the nodes have been healed within the next 24 hours, except for one or two. Should you then do a manual heal of the missing nodes until it is complete, then disable the heal and restart OH?

mhilbush · October 25, 2019, 1:40pm

In my analysis, if a heal doesn’t complete in 24 hours, it will never complete. In this scenario, I’ve seen the binding get stuck in the “initialize/heal” state for that node. Once in this state, it will never get out of that state until the binding is restarted. Once stuck in this state, the device will never be polled again until you do a restart. That’s why I disable the nightly heal, then do an openHAB restart.

This is just my experience, of course. OTOH, it’s backed by log files that show this behavior.

Bruce_Osborne · October 25, 2019, 1:44pm

In a battery operated sensor, I found removing & replacing the battery & making sure it was awake brought it back online fully.

Celaeno1 · October 25, 2019, 1:48pm

@mhilbush @Bruce_Osborne

In my case, I can do a “manual heal” by waking the node up many times (triple-pressing the button)! Without restarting OH.

EDIT: I have to check if 1x pressing has the same effect. (–>heal successful)

Bruce_Osborne · October 25, 2019, 1:51pm

If a node is hung trying to heal, it may not work.
Verify how to wake up your device. Triple-clicking on some devices is used for inclusion/exclusion mode.

Celaeno1 · October 25, 2019, 1:53pm

But then I must have had many ghost nodes? I have not. And the controller is NOT in inclusion/exclusion mode during a heal, is it?

But you are right. 3x press (include/exclude); 1x press “wake up”

Bruce_Osborne · October 25, 2019, 2:05pm

The mode likely times out if the controller is not in inclusion or exclusion mode.
Some of my devices take a single press & I have one where you are supposed to press & hold for at least 5 seconds to wake up.

mhilbush · October 25, 2019, 2:16pm

Hmm. I’m not sure why cycling the power and waking up the device fixes the issue completely. The primary symptom of the issue is that the initialization thread (which is used for device initialization, as well as heal), never completes. While it might’ve shown the node with a thing status of ONLINE, how do you know the binding killed the node initialization thread?

In the issue I opened, and as documented in the log file, once in this state, wake ups have no effect on resolving the issue. I’ve found nothing short of an OH restart will fix the issue. Even a binding restart won’t fix it, because the binding never kills the orphaned initialization threads. This behavior can be demonstrated by looking at the zwave threads using the karaf console. I’ve observed numerous node initialization threads even after a binding restart.

Edit: I should be more clear, after stopping the binding, the init threads are still there.
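
For anyone who wants to compare thread lists before and after stopping the binding without eyeballing them, here is a minimal sketch. It assumes you have saved a JVM thread dump to a text file (for example with the JDK's `jstack <pid>`), and that the threads of interest have "zwave" somewhere in their name. That substring is an assumption for illustration; check a real dump for the names the binding actually uses. `matching_threads` is a hypothetical helper, not part of openHAB.

```python
import re

# Each thread entry in a HotSpot thread dump starts with a header line
# whose first token is the quoted thread name, e.g.
#   "some-thread-name" #42 daemon prio=5 ...
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def matching_threads(dump_text, needle="zwave"):
    """Return the names of threads whose name contains `needle`.

    The match is case-insensitive. The default needle "zwave" is an
    assumption; adjust it to whatever the real thread names look like.
    """
    needle = needle.lower()
    names = []
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header and needle in header.group(1).lower():
            names.append(header.group(1))
    return names
```

Take one dump while the binding is running and another after stopping it, then run `matching_threads(open("dump.txt").read())` on each. If the second list is not empty, the threads have survived the binding stop.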

mhilbush · October 25, 2019, 2:18pm

See discussion here.