[SOLVED] Z-Wave unreliable in 2.5.0.M4

Celaeno1 · October 25, 2019, 2:31pm

YES. A Fibaro FGMS-001.

Celaeno1 · October 25, 2019, 2:36pm

I’m going to check this later.

mhilbush · October 25, 2019, 2:47pm

It updates the time at the end.

openhab/org.openhab.binding.zwave/blob/b4a9a6145c0c22ac48e49b82be696c83d1f602cd/src/main/java/org/openhab/binding/zwave/handler/ZWaveThingHandler.java#L1464

              if (finalTypeSet) {
                  // Now that this node is initialised, we want to re-process all channels
                  initialiseNode();
              }
              break;
          case HEAL_START:
              break;
          case HEAL_END:
              Map<String, String> properties = editProperties();
              properties.put(ZWaveBindingConstants.PROPERTY_LASTHEAL, getISO8601StringForCurrentDate());
              updateProperties(properties);
              break;
          
          // Don't update the thing state for dynamic updates - this is just polling
          case DYNAMIC_VALUES:
          case DYNAMIC_END:
              break;
          // Don't update the thing state when doing a heal
          case UPDATE_NEIGHBORS:
          case GET_NEIGHBORS:

Celaeno1 · October 25, 2019, 2:54pm

Was it successful then? Or is it only a timestamp?

mhilbush · October 25, 2019, 3:08pm

The heal completed, but I have no way to know if it was successful.
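
To illustrate why the property says nothing about success: the HEAL_END branch quoted above only stores the current time. Below is a minimal sketch of that pattern, assuming a plain map in place of the thing's properties. The class name, the `zwave_lastheal` key string and the choice of formatter are assumptions for illustration, not the binding's actual code.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

class LastHealSketch {
    // Hypothetical key string; the binding refers to it as
    // ZWaveBindingConstants.PROPERTY_LASTHEAL.
    static final String PROPERTY_LASTHEAL = "zwave_lastheal";

    // Formats an instant as an ISO 8601 UTC string with second precision.
    static String iso8601(Instant when) {
        return DateTimeFormatter.ISO_INSTANT.format(when.truncatedTo(ChronoUnit.SECONDS));
    }

    // Mirrors the HEAL_END branch: copy the properties and record the time.
    // Note that no success or failure flag is written anywhere.
    static Map<String, String> onHealEnd(Map<String, String> properties, Instant now) {
        Map<String, String> updated = new HashMap<>(properties);
        updated.put(PROPERTY_LASTHEAL, iso8601(now));
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> props = onHealEnd(new HashMap<>(), Instant.now());
        System.out.println(props);
    }
}
```

So a fresh "last heal" value only means the heal sequence reached its end for that node. Whether the routes are actually good has to be judged from the debug log.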

rrgeorge · October 25, 2019, 4:11pm

I’m sorry, but I completely disagree. If a faulty node causes a huge section of the network to get knocked offline during a network-wide heal, that is definitely a bug. Especially since prior to M3 (or M2 as @5iver has reported), the network-wide heal was able to continue past a faulty node without issue.

However, since I managed to solve my issues by discovering, removing, and re-adding a faulty node, and since there appears to be consensus on an effective workaround, I am going to mark this issue as solved. The solution/workaround is:

Identify and locate any misbehaving nodes and remove them from the network (Thanks, @Bruce_Osborne)
and
Disable daily network heal (Thanks, @5iver)

Andrew_Rowe · October 25, 2019, 4:21pm

Do you know this for sure? I’m just asking: did you ever have a node that was known to be faulty while the nightly heal regularly ran without issue?
If so, it would be easier to pinpoint the regression, but Chris has said previously that very little has changed in the binding in months because of the troubles with the IDE that Scott mentioned. The device database has been updated, but that is about it, if I’m not mistaken.

rrgeorge · October 25, 2019, 4:26pm

I do, with 90% certainty, because I was having this issue on M3 but didn’t have the time to analyze it, so I rolled back to M1 and operated for weeks without issue. Then, after updating to M4, it immediately started happening again. @5iver appears to have had the same experience, and there is an active bug open, #1195, so Chris is aware of it.

rrgeorge · October 25, 2019, 4:31pm

I wish I had the debug logging set before so I could contribute some useful logs to this bug too. But since removing the faulty node seems to have fixed it, I have nothing useful to contribute about it.

Bruce_Osborne · October 25, 2019, 4:37pm

File an issue on GitHub so it can be dealt with by a developer. Chris is busy working on Zigbee but there are other devs there too.

rrgeorge · October 25, 2019, 4:39pm

There is already an open issue for this.

Andrew_Rowe · October 25, 2019, 4:48pm

No Robert, you are contributing, thanks. This is how the system works: bringing up the issue and having folks look into it helps nail down the problems. Alex pretty much single-handedly figured out the problem with the REST documentation that led to a fix. Mark is actively working with Chris on the zwave stuff.
Since we now know you had a bad node and that rolling back to M1 cured the issue (even though the bad node had not yet been discovered), it does sound like this may be a recent regression.

Celaeno1 · October 25, 2019, 4:53pm

@mhilbush
@Bruce_Osborne

OK, I got it healed manually (one node that had not been healed since the last nightly heal), and HABmin is telling me LAST HEAL TIME is updated to the current date/time.

But you won’t believe what I did.

Set “Heal device” in HABmin and triple click (= inclusion/exclusion) the device many times.

Before that I had tested it with a single click (= wake up) many times, with no success.

Whether it’s really healed I can tell you within the next few days…

Celaeno1 · October 26, 2019, 10:13am

@mhilbush
@Bruce_Osborne
@ariela
@Andrew_Rowe
@rrgeorge
@5iver

Here are my findings:

NETWORK HEAL

[Log viewer screenshots for node1 through node7, with a link to a higher-resolution version of the node1 screenshot]

As you can see all FLiRS (node3, node4, node5, node6, node7) were healed! OK!

Celaeno1 · October 26, 2019, 10:22am

@mhilbush
@Bruce_Osborne
@ariela
@Andrew_Rowe
@rrgeorge
@5iver

Now the interesting part: all “non listening nodes” (node8, node9, node10)

[Log viewer screenshots for node8, node9 and node10, with a link to a higher-resolution version of the node10 screenshot]

As you can see, they were not healed automatically.

I had to wake them up manually.

node10: There you can see that it first woke up on its own, but healing FAILED.

Later I manually woke it up, then it was healed.

Here is the complete DEBUG.log : DEBUG.log (654.2 KB)

So you can load and filter it on your own, here:

https://www.cd-jackson.com/index.php/openhab/zwave-log-viewer
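
If you would rather filter the file locally than in the viewer, here is a rough sketch in Java. It assumes that the binding's debug lines carry a `NODE <id>:` marker; check that against your own DEBUG.log before relying on it. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NodeLogFilter {
    // Assumed marker format: "NODE 10:" somewhere in the log line.
    private static final Pattern NODE = Pattern.compile("\\bNODE (\\d+):");

    // Keeps only the lines whose marker matches the given node id exactly,
    // so asking for node 1 does not also return node 10.
    static List<String> linesForNode(List<String> lines, int nodeId) {
        String wanted = String.valueOf(nodeId);
        return lines.stream().filter(line -> {
            Matcher m = NODE.matcher(line);
            return m.find() && m.group(1).equals(wanted);
        }).collect(Collectors.toList());
    }

    // Usage: java NodeLogFilter.java DEBUG.log 10
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NodeLogFilter <logfile> <nodeId>");
            return;
        }
        List<String> all = Files.readAllLines(Paths.get(args[0]));
        linesForNode(all, Integer.parseInt(args[1])).forEach(System.out::println);
    }
}
```

Matching the captured id as a whole string is what keeps the node ids apart. A plain substring search for "NODE 1" would mix node1 with node10.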

PS: ALL images were readable (big enough) on my PC, but they were compressed during upload, SORRY!

Explanation:
Healing started at 21:20. The controller tried 5 times with 3 requests; they all FAILED. The 6th try was then successful. Node1 through node7 were healed by 21:23.

Node8, node9 and node10 were not healed within the next 30 minutes. Then, at 21:50, node10 woke up on its own (= wake-up interval), but healing FAILED and the node went OFFLINE.

After this event, I woke up the three remaining nodes (8, 9, 10) manually and they were healed.

node9: start: 21:57 ; end: 21:58
node8: start: 22:05 ; end: 22:05

node10: I woke it up twice, at 22:02 and 22:11 (in the meantime a lot of motion/tamper events were detected, because I was in front of the sensor or moved it a little to press the button)
node10: start: 22:02; end: 22:11
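
To put numbers on that timeline, here is a small sketch that computes the elapsed minutes from the start/end times listed above. The class and method names are made up for this example, and it assumes all times fall on the same evening (no midnight rollover).

```java
import java.time.Duration;
import java.time.LocalTime;

class HealTimeline {
    // Minutes between two same-day "HH:mm" times.
    static long minutesBetween(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMinutes();
    }

    public static void main(String[] args) {
        // Times taken from the explanation above.
        System.out.println("node1-7 heal:              " + minutesBetween("21:20", "21:23") + " min");
        System.out.println("wait until node10 woke up: " + minutesBetween("21:20", "21:50") + " min");
        System.out.println("node9 manual heal:         " + minutesBetween("21:57", "21:58") + " min");
        System.out.println("node8 manual heal:         " + minutesBetween("22:05", "22:05") + " min");
        System.out.println("node10 manual heal:        " + minutesBetween("22:02", "22:11") + " min");
    }
}
```

The times in the post only have minute resolution, so a result of 0 for node8 just means the heal started and ended within the same minute.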

chris · October 26, 2019, 10:45am

Apologies, I’ve not had the chance to look at this. I’m completely overloaded with other work for the next couple of days and am then away from home until mid-November, so I just don’t have time until then.

Celaeno1 · October 26, 2019, 2:00pm

node1: (better resolution)

Celaeno1 · October 26, 2019, 2:03pm

node10: (better resolution)

Bruce_Osborne · October 26, 2019, 2:30pm

If there is something you want Chris or the other devs to address, open an issue on GitHub.

Celaeno1 · October 26, 2019, 9:25pm

There is already an issue reported on GitHub. I just wanted to show that I have the same problem and that manual healing (by waking up the node) works. Maybe someone can discover something in the logfile that leads to the solution.