Node gets into weird state after heal

Hi @chris This issue fell off my radar screen for a while because I got busy with some other things. I updated my ticket with another example of the behavior. It would be great if you can take a look.

https://www.cd-jackson.com/index.php/tickets/viewticket/ticketid-858/

To summarize, I have several battery-powered nodes that get into a weird state after the nightly heal. Node 109 is the specific example in the log file linked to below. The DeleteSUCReturnRoute fails and never seems to recover. The device also is never polled after that.

The log file is posted here. Node 109 is the one to look at. It seems to behave normally until the heal, then gets into a weird state. The log file starts at binding startup.

Edit: I should add that I’m not using the latest binding. I don’t think anything has changed in this area of the binding, but I can upgrade to the latest if you want. I’ve been hesitant to do this because it’s my production system…

@chris Hate to bug you but did you have a chance to look at this?

Sorry Mark - I’ve been super busy over the past couple of weeks. Next week I’m out in the US and I hope I’ll have a bit more time after that.

Chris

I figured as much. Didn’t mean to be a bother.

What part of the US. It’s a big place. LOL. DC or NYC area by any chance?

Sorry - missed this (just seen it as I’m gonna take a look at your logs if my battery doesn’t go flat).

I’m in Charlotte (suffering from a large plate of BBQ!).

The last few weeks have been super busy preparing for this trip, but will hopefully have a few minutes to spare now :slight_smile:

We did fly straight over DC and NYC on the way down on Sunday, but I didn’t spot you :wink:

What is a “weird state”? It looks like the device is reporting correctly (?) but there are some issues with the binding I think.

I see lots of reporting - probably normally - in the morning (0635 and after), but the binding is still trying to heal the device, so there is no polling.

I don’t see anything that is otherwise “weird” though?

I do see something strange with the transactions in the heal (see the PR below), and maybe the heal should time out quicker on battery devices than it does, but I’d like to understand if there is anything other than this that you see that’s wrong?

Yes, node 109 is reporting correctly.

However, starting at 03:24:07.870, when the first failure occurs, the device is never polled again. Whenever the device wakes up, the binding immediately sends a WAKE_UP_NO_MORE_INFORMATION, and the node goes back to sleep. Then after the binding sends the WAKE_UP_NO_MORE_INFORMATION, the binding sends the AssignSucReturnRoute, which fails because the device is asleep. It does this over and over, and never recovers.

Here’s an example. It seems to do this repeatedly whenever the device wakes up.

ZWave%20Log%20Viewer%202019-05-23%2020-36-47

Polling is disabled during initialisation and heal to try and reduce traffic to allow the heal to word without interference. This is why I said that maybe we could timeout or cancel the heal quicker in such cases.

Or you can try an convince me it’s a good idea to poll during a heal - I can see some benefit, but it is likely to cause more delays to the heal.

This makes sense. I won’t try to convince you otherwise. :wink:

The issue is that the heal never finishes. It just does this every time the device wakes up.

Yep - understood and I will take a look at how to time this out better so it doesn’t continue indefinitely.

1 Like