[SOLVED] Z-Wave unreliable in 2.5.0.M4

YES. A Fibaro FGMS-001.


I’m going to check this later.


It updates the time at the end.


Was it successful then? Or is it only a timestamp?

The heal completed, but I have no way to know if it was successful.

I’m sorry, but I completely disagree. If a faulty node causes a huge section of the network to be knocked offline during a network-wide heal, that is definitely a bug, especially since before M3 (or M2, as @5iver has reported) the network-wide heal was able to continue past a faulty node without issue.

However, since I managed to solve my issues by discovering, removing, and re-adding a faulty node, and since there appears to be consensus on an effective workaround, I am going to mark this issue as solved. The solution/workaround is:

  1. Identify and locate any misbehaving nodes and remove them from the network (Thanks, @Bruce_Osborne), and
  2. Disable the daily network heal (Thanks, @5iver).

Do you know this for sure? I’m just asking: did you ever have a known-faulty node while the nightly heal still ran regularly without issue?
If so, it would be easier to pinpoint the regression, but Chris has said previously that very little has changed in the binding in months because of the troubles with the IDE that Scott mentioned. The device database has been updated, but that is about it, if I’m not mistaken.

I’m about 90% certain, because I was having this issue on M3 but didn’t have time to analyze it, so I rolled back to M1 and ran for weeks without issue. Then, right after updating to M4, it started happening again. @5iver appears to have had the same experience, and there is an open bug, #1195, so Chris is aware of it.

I wish I had enabled debug logging beforehand so I could contribute some useful logs to this bug. But since removing the faulty node seems to have fixed it, I have nothing useful to contribute.

File an issue on GitHub so it can be dealt with by a developer. Chris is busy working on Zigbee but there are other devs there too.

There is already an open issue for this.


No, Robert, you are contributing, thanks. This is how the system works: bringing up the issue and having folks look into it helps nail down the problems. Alex pretty much single-handedly figured out the problem with the REST documentation that led to a fix. Mark is actively working with Chris on the Z-Wave stuff.
It does sound like this may be a recent regression, since we now know you had a bad node and rolling back to M1 cured the issue (even though the bad node had not yet been discovered).


@mhilbush
@Bruce_Osborne

OK. I got it healed manually (one node that had not been healed since the last nightly heal), and HABmin is telling me LAST HEAL TIME has been updated to the current date/time.

But you won’t believe what I did.

I set “Heal device” in HABmin and triple-clicked (= inclusion/exclusion) the device’s button many times.

Before that, I had tried a single click (= wake up) many times without success.

But whether it’s really healed, I can tell you within the next few days…


@mhilbush
@Bruce_Osborne
@ariela
@Andrew_Rowe
@rrgeorge
@5iver

Here are my findings:

NETWORK HEAL

[Screenshots of the network heal log entries for node1, node2, node3, node4, node5, node6, and node7]

As you can see, all FLiRS nodes (node3, node4, node5, node6, node7) were healed. OK!

@mhilbush
@Bruce_Osborne
@ariela
@Andrew_Rowe
@rrgeorge
@5iver

Now the interesting part: all “non-listening” nodes (node8, node9, node10):

[Screenshots of the heal log entries for node8, node9, and node10]

As you can see, they were not healed automatically.

I had to wake them up manually.

node10: Here you can see that it first woke up on its own, but healing FAILED.

Later I woke it up manually, and then it was healed.
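The pattern above, where the FLiRS nodes were healed during the network heal while the battery-powered non-listening nodes only got healed once they were awake, can be pictured with a tiny model. This is a hypothetical Python sketch for illustration only, not the openHAB Z-Wave binding’s actual code; the names `NodeType`, `HealScheduler`, and `on_wake_up` are invented here. It also does not model the failed heal on node10’s own wake-up, which is the part that looks like the bug.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeType(Enum):
    LISTENING = auto()      # mains-powered, always awake
    FLIRS = auto()          # frequently listening, reachable without a wake-up
    NON_LISTENING = auto()  # battery device, only reachable while awake


@dataclass
class HealScheduler:
    """Hypothetical model: heal reachable nodes now, defer sleeping ones."""
    healed: list = field(default_factory=list)
    deferred: deque = field(default_factory=deque)

    def start_network_heal(self, nodes: dict) -> None:
        for node_id, node_type in nodes.items():
            if node_type is NodeType.NON_LISTENING:
                # A sleeping node cannot be reached; wait for its wake-up.
                self.deferred.append(node_id)
            else:
                self.healed.append(node_id)

    def on_wake_up(self, node_id: str) -> bool:
        """Heal a deferred node once it reports that it is awake."""
        if node_id in self.deferred:
            self.deferred.remove(node_id)
            self.healed.append(node_id)
            return True
        return False
```

For example, passing node3 to node7 as `NodeType.FLIRS` and node8 to node10 as `NodeType.NON_LISTENING` leaves the three battery nodes in `deferred` until `on_wake_up` is called for each of them, which matches the manual wake-ups described above.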

Here is the complete debug log: DEBUG.log (654.2 KB)

You can load and filter it yourself here:

https://www.cd-jackson.com/index.php/openhab/zwave-log-viewer

PS: All images were readable (large enough) on my PC, but they were compressed during upload. Sorry!

Explanation:
Healing started at 21:20. The controller tried 5 times with 3 requests, and they all FAILED. The 6th try was then successful. node1, node2, node3, node4, node5, node6, and node7 were healed by 21:23.

node8, node9, and node10 were not healed within the next 30 minutes. Then (21:50) node10 woke up on its own (= wake-up interval), but healing FAILED and it went OFFLINE.

After this event, I manually woke up all three remaining nodes (8, 9, 10), and they were healed.

node9: start: 21:57; end: 21:58
node8: start: 22:05; end: 22:05

node10: I woke it up twice, at 22:02 and 22:11 (in the meantime, a lot of motion/tamper events were detected, because I was in front of the sensor or moved it a little to press the button).
node10: start: 22:02; end: 22:11
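To put the timeline above into numbers, this small Python snippet computes how long each heal took from the start/end times reported in the post. The times only have minute resolution, so 0 means under a minute; the `HEAL_WINDOWS` dictionary and `minutes_between` function are just illustrative names.

```python
from datetime import datetime

# Start/end times (HH:MM) as reported above; all on the same evening.
HEAL_WINDOWS = {
    "node1-node7": ("21:20", "21:23"),
    "node9": ("21:57", "21:58"),
    "node8": ("22:05", "22:05"),
    "node10": ("22:02", "22:11"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, assuming both fall on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


for label, (start, end) in HEAL_WINDOWS.items():
    print(f"{label}: {minutes_between(start, end)} min")
```

The same helper also shows the 30-minute gap between the start of the heal (21:20) and node10’s own wake-up (21:50).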

Apologies, I haven’t had the chance to look at this. I’m completely overloaded with other work for the next couple of days and will then be away from home until mid-November, so I just won’t have time until then.


[Higher-resolution screenshots for node1 and node10]

If there is something you want Chris or the other devs to address, open an issue on GitHub.

There is already an issue reported on GitHub. I just wanted to show that I have the same problem and that manual healing (by waking up the node) works. Maybe someone can discover something in the log file that leads to a solution.
