Z Wave Routing Basics: Retry Strategies


Read Z Wave Routing Basics first

Z-wave has a set of strategies that are used to get a message through. These have changed over time and to try to cover all past versions in a single thread would be hard.

For this reason I will concentrate on the mechanisms used in z-wave plus. Not all of these are relevant to older generations.

This is what the silicon labs documents say about the risk of broken routes and how z-wave tries to solve the issue.

3.9.1 Routing
A Z-Wave Plus node MUST use by default the last working route to communicate with a target node.
Over time, there is a risk that nodes are moved or stop working. To ensure that nodes adapt to
changing network topology and failing repeaters, a Z-Wave Plus node MUST enable dynamic route
resolution.
Dynamic route resolution consists of trying the following routes:
  • Last working routes
  • Calculated routes
  • Explorer Frame

So what are these three?

  • Last working routes
  • Calculated routes
  • Explorer Frame

In z-wave the first route that is tried is the Last Working Route (LWR) with one caveat explained later. It is exactly what it sounds like. It is the last route that was used to communicate with the node. Clearly if something has changed since the last time a command succeeded there has to be a mechanism to get the message through.

So if we consider the diagram below where one of the repeaters that was in the LWR is now no longer available how does z-wave cope?

This is best split into two parts.

  1. controller to nodes
  2. slave to other node or controller

Controller

The controller holds the complete routing table, but for each node in the network it only holds a couple of dynamic routes to that destination.

This is either

  • a last working route and a next to last working route or
  • an application priority route and another route that can be used.

These routes are managed dynamically. This is how it is described in the documentation:

The routing algorithm in controllers store information about successful attempts to reach a destination node avoiding repetition of previously failed attempts.

The last successful route used between sender and destination node are stored in NVM and is called Last Working Route(LWR).

The LWR list comprises of 232 destination nodes having up to two routes each, called the LWR and the Next to Last Working Route (NLWR).

A LWR/NLWR can contain either a direct or a routed route. Updating LWR and NLWR happens in the following situations (if new route != current LWR):

When receiving a successful explorer frame route, new route becomes LWR and old LWR becomes NLWR.
When receiving a successful routed/direct request from another node, new route becomes LWR and old LWR becomes NLWR
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR and old LWR becomes NLWR.
When receiving a successful acknowledge for a transmitted NLWR, NLWR becomes LWR.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR and old LWR becomes NLWR.

In case a LWR/NLWR fails the LWR and NLWR are updated as follows:

If a LWR fails; it is ‘exiled’ to become the NLWR and the current NLWR (if present) is tried as the next route.
If a NLWR fails it is removed.
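The update rules quoted above can be sketched in a few lines of Python. This is only my reading of the documentation, not the protocol's actual code. The `RouteEntry` class and its method names are made up for illustration, and two details are assumptions the quoted text does not spell out: that a successful NLWR swaps places with the LWR, and that an exiled LWR stays in the NLWR slot when the old NLWR then succeeds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs in order; () means a direct route


@dataclass
class RouteEntry:
    """Hypothetical per-destination entry holding the LWR and the NLWR."""
    lwr: Optional[Route] = None
    nlwr: Optional[Route] = None

    def on_success(self, route: Route) -> None:
        """A frame was acknowledged (or received) over `route`."""
        if route == self.lwr:
            return  # the rules only apply when new route != current LWR
        if route == self.nlwr:
            # NLWR becomes LWR. Assumption: the old LWR takes the NLWR slot.
            self.lwr, self.nlwr = self.nlwr, self.lwr
            return
        # New route becomes LWR and the old LWR (if any) becomes NLWR.
        if self.lwr is not None:
            self.nlwr = self.lwr
        self.lwr = route

    def on_lwr_failure(self) -> Optional[Route]:
        """LWR failed: exile it to NLWR and return the old NLWR to try next."""
        next_route = self.nlwr
        self.lwr, self.nlwr = None, self.lwr
        return next_route

    def on_nlwr_failure(self) -> None:
        """NLWR failed: it is removed."""
        self.nlwr = None
```

For example, after successes over `(5,)` and then `(7, 9)`, the entry holds `(7, 9)` as LWR and `(5,)` as NLWR. If `(7, 9)` then fails, it is exiled to NLWR and `(5,)` is handed back as the next route to try.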

Now the possible override to this behaviour:

The Application can, instead of having both the LWR and the NLWR entries for a destination node, overload the LWR with an Application Priority Route (APR), which then upgrades the NLWR to LWR status as the protocol now is down to one dynamic LWR for the destination node in question.
The APR will after being set always be tried prior to any possible present LWR. The APR can only be removed by the Application. If destination node has an APR defined the updating of the LWR happens in the following situations (if new route != APR AND new route != LWR)
When receiving a successful explorer frame route, new route becomes LWR.
When receiving a successful routed/direct request from another node, new route becomes LWR.
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR.
In case LWR fails it is removed.

So if a priority route is set it is always tried first, regardless of whether it worked or failed last time.

That is a lot to take in. I have copied it straight from the docs because, complex as it sounds, it is hard to improve on.

But there is a little more:

The routing attempts done by a static controller to reach the destination node are as follows:

If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails then remove the LWR and try direct if neighbor.
If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails, ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR. In case the NLWR fails remove it and try direct if neighbor.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then calculate up to two routing attempts per entry/repeater node. If enough routes exist try a settable maximum number of tries.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_NO_ROUTE are set, then direct with retries.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_EXPLORE are set then issue an explore frame as last resort.

So this explains that if the primary routing mechanism (APR, LWR, NLWR) fails then a few other mechanisms are tried.

In openHAB all commands are sent to the controller across the serial interface with the flags TRANSMIT_OPTION_ACK, TRANSMIT_OPTION_AUTO_ROUTE and TRANSMIT_OPTION_EXPLORE set.

So assuming your controller has at some point managed to talk to the node:

  1. The controller will try the primary routing mechanism as described earlier. Max two routes.
  2. The controller will try direct if the node is a neighbor as found in the routing table.
  3. A number of new routes will be calculated from the routing table and tried. This is normally set to a maximum of 3 from observation.
  4. The controller will issue an explorer frame.

So if neither of the two primary routes work the controller will try direct if the node is listed as a neighbour in the routing table.

Next it uses the routing table to calculate some new possible routes to try. Lastly it uses a special thing called an explorer.
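To make the order concrete, here is a rough Python sketch of the attempt list a controller would work through, following the quoted rules and my summary. None of this is real firmware or binding code. `TxOption`, `controller_attempts` and the default of 3 calculated routes (from my own observation) are illustrative, and treating TRANSMIT_OPTION_NO_ROUTE as "direct only" is my assumption.

```python
from enum import Flag, auto
from typing import List, Optional, Sequence, Tuple

Route = Tuple[int, ...]              # repeater node IDs; () means direct
Attempt = Tuple[str, Optional[Route]]


class TxOption(Flag):
    """Stand-ins for the TRANSMIT_OPTION_* flags (values are illustrative)."""
    ACK = auto()
    AUTO_ROUTE = auto()
    NO_ROUTE = auto()
    EXPLORE = auto()


def controller_attempts(
    options: TxOption,
    apr: Optional[Route],
    lwr: Optional[Route],
    nlwr: Optional[Route],
    is_neighbor: bool,
    calculated: Sequence[Route],
    max_calculated: int = 3,         # observed value, not a documented one
) -> List[Attempt]:
    """Return the ordered list of attempts for one destination."""
    plan: List[Attempt] = []
    if TxOption.ACK not in options:
        return plan                  # the quoted rules only cover ACKed frames
    if TxOption.NO_ROUTE in options:
        return [("direct", ())]      # assumption: NO_ROUTE means direct only
    # Primary mechanism: at most two stored routes.
    if apr is not None:
        plan.append(("APR", apr))
        if lwr is not None:
            plan.append(("LWR", lwr))
    elif lwr is not None:
        plan.append(("LWR", lwr))
        if nlwr is not None:
            plan.append(("NLWR", nlwr))
    elif nlwr is not None:
        plan.append(("NLWR", nlwr))
    # Direct is only tried when the routing table lists a neighbour.
    if is_neighbor:
        plan.append(("direct", ()))
    if TxOption.AUTO_ROUTE in options:
        plan.extend(("calculated", r) for r in calculated[:max_calculated])
    if TxOption.EXPLORE in options:
        plan.append(("explorer", None))
    return plan
```

With the three flags openHAB sets, an LWR, an NLWR, a neighbour entry and four calculated candidates, the plan comes out as LWR, NLWR, direct, three calculated routes, then the explorer.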

So the new concept here is the explorer frame. An explorer is a broadcast mechanism that goes out to all nodes to try to find the destination. If a node receives a response from the destination node it passes this back to the controller, route relearning takes place, and the route the explorer discovered becomes the LWR (or, if an Application Priority Route is set, the one route that is not the APR). The diagram sort of implies the new route has fewer hops, but it could have any number of hops, including none. Remember the controller only tries direct when the routing table lists the node as a neighbour, so if the node was not in range before but is now, the explorer could even switch it to a direct route.
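One way to picture why an explorer can find a route the routing table knows nothing about is to treat it as a flood over whatever nodes can actually hear each other right now. The sketch below is only an illustration of that idea using a breadth-first search. It is not how the protocol is implemented, the `explore` function and the `in_range` map are hypothetical, and a real explorer settles on whichever copy gets through rather than strictly the fewest hops. The limit of four repeaters reflects the usual Z-Wave maximum route length.

```python
from collections import deque
from typing import Dict, Optional, Set, Tuple


def explore(
    in_range: Dict[int, Set[int]],
    source: int,
    dest: int,
    max_repeaters: int = 4,
) -> Optional[Tuple[int, ...]]:
    """Flood outwards from `source`; return the repeaters used to reach `dest`.

    `in_range` maps each node to the nodes that can currently hear it,
    which may differ from what the controller's routing table says.
    Returns () for a direct route and None if nothing reaches `dest`.
    """
    queue = deque([(source, ())])
    seen = {source}
    while queue:
        node, repeaters = queue.popleft()
        for nxt in sorted(in_range.get(node, ())):
            if nxt == dest:
                return repeaters
            if nxt not in seen and len(repeaters) < max_repeaters:
                seen.add(nxt)
                queue.append((nxt, repeaters + (nxt,)))
    return None
```

If node 5 is only reachable through nodes 2 and 4, `explore` returns `(2, 4)`. Move node 5 into range of the controller and the same call returns `()`, a direct route, even though the old routing table never listed it as a neighbour.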

Remember that, apart from setting these flags, the binding does nothing to define the routes, and any information held in the binding is not used in the real routing behaviour. You could move your controller to a different application and all of the routing data moves with it. None of it is in the binding, passed to the binding or changed by the binding.

Very new nodes can report the last route used and some other information back to the application/binding, but the current binding does not support this extension to the controller serial API payload.

So this is the mechanism used by more recent devices and all zwave plus devices will use this mechanism.

So the explorer is the clever thing here as it finds a working route regardless of what the routing table says. I will discuss the advantage of letting explorer correct your network over healing later.

Slave

Now to deal with the slave nodes. These hold a very limited set of return routes that allow them to communicate with other nodes. While in principle a modern slave node could hold return routes to all 232 other possible nodes in a network, in reality they will only hold routes to the controller and to other nodes that are associated.

The big difference from the controller is that there is no full routing table, so the slave cannot calculate new possible routes. It does however store up to 4 routes to each node in the table. These are set initially at inclusion and then modified when associations change, when heal is used or when an explorer finds a new working route after the other mechanisms have failed.

These are actively managed each time a route to that node succeeds or fails, in a similar way to the controller, but the list is only up to 4 routes per destination node, not the full routing table, so there are no recalculated routes.

This is how the documents describe this dynamic management:

The routing algorithm in slaves store information about successful attempts to reach a destination node in response routes after the following principles:
When receiving a successful explorer frame route, new route becomes a new Response Route.
When receiving a successful routed/direct request from another node, new route becomes a new Response Route.
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes a new Response Route.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes a new Response Route.
In case Response Route fails it is removed.
However, the response routes only contains up to two routes to different destination nodes. A response route for a new destination node overwrites the oldest of the two buffered response routes. A new response route for an existing destination overwrites the old response route for that specific destination.
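The two-slot response route buffer described in that last quoted paragraph behaves a bit like a tiny cache. Here is a minimal Python sketch of my reading of it. The `ResponseRoutes` class is made up for illustration, and I am assuming that overwriting the route for an existing destination also makes it the newest entry, which the documentation does not say either way.

```python
from typing import Dict, Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs; () means direct


class ResponseRoutes:
    """Hypothetical buffer of response routes, one per destination."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        # Dicts keep insertion order, so the first key is the oldest entry.
        self._routes: Dict[int, Route] = {}

    def store(self, dest: int, route: Route) -> None:
        """Record a route that just worked for `dest`."""
        # An existing destination is overwritten (and, by assumption,
        # becomes the newest entry).
        self._routes.pop(dest, None)
        if len(self._routes) >= self.capacity:
            # A new destination overwrites the oldest buffered route.
            oldest = next(iter(self._routes))
            del self._routes[oldest]
        self._routes[dest] = route

    def remove(self, dest: int) -> None:
        """The response route for `dest` failed, so it is removed."""
        self._routes.pop(dest, None)

    def get(self, dest: int) -> Optional[Route]:
        return self._routes.get(dest)
```

Storing routes for destinations 1 and 7 and then for destination 9 pushes out whichever of the first two was stored longest ago.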

and then

The routing attempts done by an enhanced 232 slave to reach the destination node are as follows:
If TRANSMIT_OPTION_ACK is set and destination is available in response routes, try response route.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then try return routes if any exists for specified destination.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then try direct.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_EXPLORE are set, issue an explore frame as last resort.

Now this is an interesting section because, at the time of writing, no z-wave plus devices use any options other than TRANSMIT_OPTION_ACK, TRANSMIT_OPTION_AUTO_ROUTE and TRANSMIT_OPTION_EXPLORE.

This is largely due to the limitations in pre 700 series chips. Even the sample code for 700 series device development has not been changed to do anything different, so it may be some time before these options are set differently.

I have a development device that changes how these options are set and is better behaved in a busy network, so I can confirm that using them differently has advantages.

So in summary, assuming your slave node has at some point managed to talk to the destination node:

  1. The slave node will try the primary routing mechanism as described earlier, i.e. up to 2 response routes.
  2. The slave node will try the other routes stored in the table for that destination, so a maximum of 2 more.
  3. The slave node will try direct (as it has no full routing table so it has no concept of neighbour)
  4. The slave node will issue an explorer.

So there is a slight difference here: the node only chooses from a set of 4 precalculated routes, initially the 2 routes that are held as response routes and then the other 2 held as return routes.
It has no concept of neighbours, so it just tries direct and then an explorer is used. If the explorer succeeds, one of the return routes is thrown away and the route discovered by the explorer becomes a response route.
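The slave side can be sketched the same way as an ordered list of attempts. Again this is only an illustration of the quoted attempt order, not real device code. The `slave_attempts` function and its parameters are hypothetical, and the three booleans stand in for TRANSMIT_OPTION_ACK, TRANSMIT_OPTION_AUTO_ROUTE and TRANSMIT_OPTION_EXPLORE.

```python
from typing import List, Optional, Sequence, Tuple

Route = Tuple[int, ...]              # repeater node IDs; () means direct
Attempt = Tuple[str, Optional[Route]]


def slave_attempts(
    ack: bool,
    auto_route: bool,
    explore: bool,
    response_routes: Sequence[Route],
    return_routes: Sequence[Route],
) -> List[Attempt]:
    """Return the ordered attempts a slave makes for one destination."""
    plan: List[Attempt] = []
    if not ack:
        return plan                  # the quoted rules only cover ACKed frames
    # Response routes learned from recent successful traffic come first.
    plan.extend(("response", r) for r in response_routes)
    if auto_route:
        # Return routes handed to the slave by the controller.
        plan.extend(("return", r) for r in return_routes)
        # No neighbour list on a slave, so direct is simply tried.
        plan.append(("direct", ()))
    if explore:
        plan.append(("explorer", None))
    return plan
```

Note there is no "calculated" step anywhere in this list, which is the whole difference from the controller: once the stored routes and direct have failed, the explorer is the only thing left.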

Summary

One last thing about routing in z-wave. There is no statistical improvement of the routes, so the mechanisms described above are all that is used. The algorithm the controller uses to calculate routes is not published, so it is hard to know what it considers, but it certainly does not take past success into account beyond the simple mechanisms described here.

The initial return routes are calculated by the controller and sent to the slave devices so the underlying mechanism is the same but there is no option for a slave to calculate more routes on the fly.

If you read the above a few times you can see that it is perfectly possible for zwave to recycle poor routes, as nothing ever fully removes them from the pool of possible routes.

Explorer discovered routes are the only routes that may be improvements on what was already in the table, so the explorer is the only thing that can bring fresh routes into the process.

If you start looking at the routes in your network in zniffer, expect to be surprised and bemused. When I first started looking it amazed me how the signal bounced around the house, sometimes taking seemingly strange routes. While you can set preferred routes with some success, remember that if you do you are reducing the self healing capabilities, so try it with care.

I have become more circumspect with my attitude to these strange routes following a conversation with Peter @petergebruers over on the Fibaro forum. He pointed out to me that all sorts of things in your home block and amplify the signal in strange ways.

The example that drove it home for me was how a module can get a signal in a metal back box behind a switch in a solid wall. His answer was that the mains cables can act as passive repeaters and probably bring the signal inside the box. He is a smart fellow and I am sure he is correct. He tells me RF is strange, and I now just accept it and do not try to fight the tide.

There is a lot here to digest. Feel free to post questions. I will try to add some zniffer logs showing some of these strategies, and some diagrams that may help, as the words are dense. I will need to use my dev network for that as I do not want to break my production network. My wife does not like that.

Next Z Wave Routing Basics: Application/Binding Retries, Heal and Explorer
