[SOLVED] Z-Wave Network Heal.. when is it needed? does it always need to happen?

leif · July 20, 2019, 11:52am

OH S1628 on RPi with UZB

So, my Z-Wave network works quite well!
The only time it’s sluggish is with battery driven devices after a network heal.
And that network heal happens every day at 2 AM.

I understand the need for network heal as your network grows.
But, once your configuration is stable… if you haven’t added any new devices, does network heal really need to happen?
If it does sometimes need to happen even on an unchanging network, certainly it doesn’t have to happen every day?

It would be great to be able to configure it to automatically only happen if devices have been included or excluded, or better yet longer interval options with separate dropdowns for unchanged and changed networks… so you could get it to happen at 2 AM the night after you’ve added a device, and then once a week or even month if nothing has changed? What do you think?
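To make the idea concrete, here is a rough sketch of the scheduling rule described above. It is not the binding's code or configuration; `heal_due`, its parameters and the one-week default are all hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def heal_due(now: datetime,
             last_heal: Optional[datetime],
             last_change: Optional[datetime],
             unchanged_interval: timedelta = timedelta(days=7)) -> bool:
    """Decide whether the nightly 2 AM slot should actually run a heal.

    last_change is the time of the most recent inclusion or exclusion
    (None if the network has never changed since tracking began).
    """
    if last_heal is None:
        return True  # never healed yet
    if last_change is not None and last_change > last_heal:
        return True  # a device was included/excluded since the last heal
    # Unchanged network: only heal once the long interval has elapsed.
    return now - last_heal >= unchanged_interval
```

With this rule the heal would run the night after a device is added, and otherwise only once per `unchanged_interval`.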

chris · July 20, 2019, 12:17pm

No - so long as the network really is stable then you don’t need the heal.

The heal will restore routes, so if there are any dodgy nodes (eg those that don’t have really good links into the mesh) then things may not work so well - hence what I mean by “really stable” - not just that you haven’t added any new nodes.

Yes, this could be added and is probably a good feature. Feel free to open an issue, but note that I’m currently unable to work on ZWave or ZigBee issues until the IDE is resolved (which has not happened in the past month!).

mhilbush · July 20, 2019, 12:17pm

See this issue here.

github.com/openhab/org.openhab.binding.zwave

Heal never completes resulting in device not polled until openHAB is restarted

opened 02:51PM - 05 Jul 19 UTC

mhilbush

Logging this as an issue so we can keep track of it. For some battery devices…, the heal process never completes. Symptoms are described in the forum post below. Because the heal process never completes, the initialization thread stays around until the system is restarted. The net result is that the device will not be polled until openHAB is restarted. One major downside of this is that the device battery level will not be updated. Workaround is to disable the nightly heal. Forum discussion is here. https://community.openhab.org/t/node-gets-into-weird-state-after-heal/74339/9

For now, I would suggest you disable the nightly heal, then restart openHAB to clear out any stuck initialization/heal threads. If needed, you can run the heal on specific nodes that might be having issues.

Bruce_Osborne · July 20, 2019, 1:42pm

Will that trigger neighboring nodes to recalculate the routes too?

chris · July 20, 2019, 2:47pm

No - just the single node.

mstormi · July 20, 2019, 3:45pm

The problem may be that if you move devices or the environment changes (affecting radio range), routes can become invalid without you including or excluding any device. The binding can’t know about that.

robmac · July 20, 2019, 9:34pm

It should self-heal with a bit of traffic unless it is a dramatic change. Given a bit of time, unless there is a very poor node and poor configuration, the routes will settle. Some older devices may not, particularly if they have dodgy firmware.

I have some Fibaro v1.9 switch devices in a drawer that anyone is welcome to if they want to experience what I am talking about.

chris · July 20, 2019, 10:57pm

Why is that? Has this changed in the latest SDK? Previously the controller was in charge of setting the routes, but are you saying this has changed?

robmac · July 21, 2019, 7:06am

Because the LWR will correct based on the PWR or one of the neighbours, unless all the neighbours tried are now not working. Yes, if all the neighbours tried are now not routable they need clearing, and if there are a lot of neighbours it only tries a few of them, so things can fail to self-heal. If the new LWR is reliable again then it does indeed self-heal.

I have never been clear from the docs whether, when an explorer frame finds a route, this also becomes the LWR. From experimentation, I think it sometimes does. When things settle, hopefully it settles to LWR, PWR.

chris · July 21, 2019, 7:36am

So your point is that routing slaves are now able to create their own routes rather than these being defined by the controller software as has been the case? Do you have a reference for this?

From your post, you are saying that the controller no longer needs to define the routes between different devices? It would definitely be good to get a reference for this in the docs.

Thanks.

robmac · July 21, 2019, 8:18am

Hi,

So my understanding is that a controller also updates on inbound frames. Also, all devices have to have a copy of the routes since, as you say, it is source routing and devices talk to other devices when there is an association.

Here is the explanation from the 500 series developer guide for all to try to unravel. Having refreshed my reading of it, explorers do update the routes, so that is good.

3.4 Z-Wave Routing Principles
The Z-Wave protocol use source routing, which is a technique whereby the sender of a frame specifies the exact route the frame must take to reach the destination node. Source routing assumes that the sender knows the topology of the network, and can therefore determine a route having a minimum number of hops. The Z-Wave protocol supports up to four repeaters between sender and destination node. Routing can also be used to reach FLiRS destination nodes. Source routing allows implementation of a leightweight protocol by avoiding distributed topologies in all repeaters. Nodes containing the topology can also assign routes to a topology-less node enabling it to communicate with a number of destination nodes using routes.
In case sender fails to reach destination node using routes an explorer mechanism can be launched on demand to discover a working route to the destination node in question. The explorer mechanism builds on AODV routing with adjustments for source routing and memory footprint. Explorer frames implement managed multi-hop broadcast forwarding and returns a working route to sender as result. The application payload piggybacks on explorer frame to reduce latency.
(Source: INS13954-7 Z-Wave 500 Series Appl. Programmers Guide v6.81.0x, 2018-06-14, silabs.com, page 6 of 444)
The routing algorithm in controllers store information about successful attempts to reach a destination node avoiding repetition of previously failed attempts. The last successful route used between sender and destination node are stored in NVM and is called Last Working Route (LWR). The LWR list comprises of 232 destination nodes having up to two routes each, called the LWR and the Next to Last Working Route (NLWR). A LWR/NLWR can contain either a direct or a routed route. Updating LWR and NLWR happens in the following situations (if new route != current LWR):
- When receiving a successful explorer frame route, new route becomes LWR and old LWR becomes NLWR.
- When receiving a successful routed/direct request from another node, new route becomes LWR and old LWR becomes NLWR.
- When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR and old LWR becomes NLWR.
- When receiving a successful acknowledge for a transmitted NLWR, NLWR becomes LWR.
- When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR and old LWR becomes NLWR.

Incase a LWR/NLWR fails the LWR and NLWR are updated as follows:
- If a LWR fails; it is ‘exiled’ to become the NLWR and the current NLWR (if present) is tried as the next route.
- If a NLWR fails it is removed.

The Application can, instead of having both the LWR and the NLWR entries for a destination node, overload the LWR with an Application Priority Route (APR), which then upgrades the NLWR to LWR status as the protocol now is down to one dynamic LWR for the destination node in question. The APR will after being set always be tried prior to any possible present LWR. The APR can only be removed by the Application. If destination node has an APR defined the updating of the LWR happens in the following situations (if new route != APR AND new route != LWR):
- When receiving a successful explorer frame route, new route becomes LWR.
- When receiving a successful routed/direct request from another node, new route becomes LWR.
- When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR.
- When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR.
- Incase LWR fails it is removed.

The routing algorithm in slaves store information about successful attempts to reach a destination node in response routes after the following principles:
- When receiving a successful explorer frame route, new route becomes a new Response Route.
- When receiving a successful routed/direct request from another node, new route becomes a new Response Route.
- When receiving a successful acknowledge for a transmitted explorer frame, new route becomes a new Response Route.
- When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes a new Response Route.
- Incase Response Route fails it is removed.

However, the response routes only contains up to two routes to different destination nodes. A response route for a new destination node overwrites the oldest of the two buffered response routes. A new response route for an existing destination overwrites the old response route for that specific destination.
The routing attempts depend on the Z-Wave library and transmit options used in the node, for details refer to section 3.10.
The source routing algorithm does not alter the topology due to failed attempts or store any statistics regarding link quality.
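
For anyone trying to unravel the controller-side rules quoted above, here is one possible reading of the LWR/NLWR bookkeeping as a small Python model. It is only a sketch of my interpretation, not Silicon Labs code: the class and method names are hypothetical, a route is modelled as a tuple of repeater node IDs (an empty tuple meaning direct), and the APR case is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs; () means a direct route


@dataclass
class DestinationRoutes:
    """LWR/NLWR pair kept by a controller for one destination (no APR)."""
    lwr: Optional[Route] = None
    nlwr: Optional[Route] = None

    def record_success(self, route: Route) -> None:
        """A route worked (explorer result, received frame, or ack)."""
        if route == self.lwr:
            return  # rules only apply when new route != current LWR
        if self.lwr is not None:
            self.nlwr = self.lwr  # old LWR becomes NLWR
        elif route == self.nlwr:
            self.nlwr = None  # avoid storing the same route twice
        self.lwr = route

    def transmit(self, try_route: Callable[[Route], bool]) -> Optional[Route]:
        """Try LWR, then NLWR. Returns the route that worked, or None."""
        if self.lwr is not None and try_route(self.lwr):
            return self.lwr
        # LWR failed (or is absent): exile it to the NLWR slot and
        # try the previous NLWR as the next route.
        candidate, self.lwr, self.nlwr = self.nlwr, None, self.lwr
        if candidate is not None and try_route(candidate):
            self.lwr = candidate  # NLWR becomes LWR
            return candidate
        # A failed NLWR is removed: it is simply not stored again.
        return None
```

When `transmit` returns `None`, the caller would fall back to whatever comes next (neighbours from the routing table, then explorer frames), and a successful explorer result would be fed back in through `record_success`.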

chris · July 21, 2019, 8:30am

I’m not sure if slaves are generally using Explorer frames. I know it was not popular a couple of years ago and there was a discussion on ZTS about it, as it tends to flood the network. If devices are having trouble with their routing and are sending a lot of explorer frames to try to discover the routes, it can cause significant congestion. Therefore, the system uses routes that are configured by the controller, and these are set during the heal.

Maybe this isn’t required now if the latest SDK is using explorer frames in routing slaves, but older devices probably still require the static routes to be set during the heal, and certainly anything from more than a few years ago (before Explorer frames were introduced - although I forget exactly which SDK introduced them) will definitely need the controller-defined routes.

This is one of the many messy areas in ZWave unfortunately and I’m not sure there’s a one size fits all answer due to the ever evolving “standard”.

robmac · July 21, 2019, 8:36am

Agreed, which is why I like how Z/IP makes use of the retry option. The first application attempt uses LWR/NLWR.

Then it goes into the long queue and tries neighbours, and if that still fails it tries explorer. The first attempt never has any retries configured.
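
Roughly, the escalation order I am describing looks like the sketch below. This is only an illustration of the sequence, not Z/IP gateway code: `send_with_escalation`, its callbacks and the `max_neighbours` default are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs; () means a direct route


def send_with_escalation(
        try_route: Callable[[Route], bool],
        last_working: Iterable[Route],
        neighbour_routes: Iterable[Route],
        explore: Callable[[], Optional[Route]],
        max_neighbours: int = 4) -> Optional[Tuple[str, Route]]:
    """Return (stage, route) for the first thing that works, else None."""
    # Stage 1: first application attempt, LWR then NLWR, no retries.
    for route in last_working:
        if try_route(route):
            return ("last-working", route)
    # Stage 2: long queue, try a limited number of neighbour routes.
    for route in list(neighbour_routes)[:max_neighbours]:
        if try_route(route):
            return ("neighbour", route)
    # Stage 3: explorer frame; it either finds a working route or not.
    route = explore()
    if route is not None:
        return ("explorer", route)
    return None
```

The point of the ordering is that the cheap attempt goes first and the network-flooding explorer only happens when everything else has failed.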

And even reading the guide leaves your mind in twists.

chris · July 21, 2019, 8:43am

Yes, but again, this is the controller side - not the slave. The big question is what SLAVES do and this is one of the main things that is addressed in the heal. If slaves are unable to find their own routes (which is not uncommon) then they must have routes set for them by the controller using the ZW_AssignReturnRoute command. This is why you need to be careful with statements like the system should self heal.

Yes, ZWave documentation has never been great unfortunately, and I fear that the move toward Z/IP will not help these sorts of issues. Forcing users to go down the Z/IP path was Sigma’s way to recapture the market, but I fear it’s backfiring and I’m seeing a number of commercial customers move to ZigBee to avoid this.

robmac · July 21, 2019, 9:43am

Last para refers to slaves

“However, the response routes only contains up to two routes to different destination nodes”

It is cut down, but as they source route to destinations I think this is what they need to make that possible.

The last para is the crux though as it does make it clear that if you make big changes you are going to have to heal.
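
As a sketch of how small that slave-side store is, here is the response route behaviour from the quoted guide modelled in Python. The names are hypothetical, and one detail is my assumption rather than something the guide states: updating the route for an existing destination is treated as refreshing its age.

```python
from collections import OrderedDict
from typing import Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs; () means a direct route


class ResponseRoutes:
    """Response routes held by a slave: at most two destinations."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        self._routes: "OrderedDict[int, Route]" = OrderedDict()  # oldest first

    def record_success(self, dest: int, route: Route) -> None:
        if dest in self._routes:
            # New route for an existing destination overwrites the old one.
            self._routes[dest] = route
            self._routes.move_to_end(dest)  # assumption: counts as newest
            return
        if len(self._routes) >= self.capacity:
            self._routes.popitem(last=False)  # overwrite the oldest entry
        self._routes[dest] = route

    def record_failure(self, dest: int) -> None:
        self._routes.pop(dest, None)  # a failed response route is removed

    def lookup(self, dest: int) -> Optional[Route]:
        return self._routes.get(dest)
```

With only two slots, a slave talking to three or more destinations keeps losing response routes, which is one reason controller-assigned return routes still matter.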

chris · July 21, 2019, 10:10am

FYI -:

This comes from one of the Silabs FAEs, so again, it seems to indicate that routing slaves do not derive their own routes, and they must be set by the controller.

robmac · July 21, 2019, 10:48am

Agreed but once set they are managed.

Also, unless the controller is explicitly configured for more or fewer, it will only try a maximum of 4 neighbours on a retry if LWR and NLWR fail.

It is not clear how it chooses which neighbours if there are more, but if all of those are no longer valid then they only cause delays and a heal would be the only way to sort it out. Even in this case, if an explorer returns then LWR and NLWR should sort themselves out, so neighbour routing should never happen, but your network is less resilient until you heal and fix the routing tables in the controller and any return routes in the slaves.

chris · July 21, 2019, 10:56am

What do you mean by “managed”? I do not think the controller is “managing” them - the controller was, until recently, pretty dumb (and even now with the M4 it’s not a lot better!). The controller doesn’t do anything by itself as far as I know (I looked through the source a couple of years back and I’d be surprised if this has changed).

robmac · July 21, 2019, 11:02am

LWR, NLWR and response routes.

The routing table is, let’s face it, a set of hints for resilience. If you do not stress your network and it is all stable and healthy, most of the time you would hope it uses the LWR and never tries neighbours and other methods of routing.

The first thing tried is always the LWR, and if that fails the NLWR. If these work, then the routing topology in the table can be wrong without it mattering. As the LWR can be updated by explorer frames, there is no reason a node cannot communicate reliably by a route that is not in the topology.

So even on a slave it should fix itself to some extent, but the routing tables on the controller will still be bad. Whether you care depends on whether it settles down. If the table is bad, what you lose is resilience, not communication: the first thing a node tries, the LWR or response route, can work even if the topology in the controller is pants.

robmac · July 21, 2019, 12:27pm

Back inside as there is a bit of rain.

And this is all talking about this version of the SDK and your controller and nodes supporting explorer.

Explorer can be supported on non-500 chips as it depends on the SDK. The Fibaro HC2 still uses an old controller chip and is not Z-Wave+, but it supports explorer as it has firmware with a more recent SDK than you would think from looking at the chip.

If you are using a lot of devices and a controller that have firmwares prior to a certain SDK you lose some of this.

Anyway, that is what the docs show as theory, and in a network not under stress I believe it is true. Stress your network and things can get nasty, particularly if the routing table is bad. If I am having issues, the first things I go looking for are bad nodes or too much traffic, not routing table issues.
If I have not shifted any nodes or removed a load of copper pipe from my house, the routing probably stays the same, and that is always my assumption.

In my experience, routing issues are a symptom of more than just bad routing tables. Healing without fixing the other causes first will cause more trouble than it solves, as the topology in the routing table is never the first thing tried.