Z Wave Routing Basics: Retry Strategies

robmac · February 9, 2020, 7:55am

Z-wave has a set of strategies that are used to get a message through. These have changed over time and to try to cover all past versions in a single thread would be hard.

For this reason I will concentrate on the mechanisms used in z-wave plus. Not all of these are relevant to older generations.

This is what the Silicon Labs documents say about the risk of broken routes and how Z-Wave tries to solve the issue.

3.9.1 Routing
A Z-Wave Plus node MUST use by default the last working route to communicate with a target node.
Over time, there is a risk that nodes are moved or stop working. To ensure that nodes adapt to
changing network topology and failing repeaters, a Z-Wave Plus node MUST enable dynamic route
resolution.
Dynamic route resolution consists of trying the following routes:
• Last working routes
• Calculated routes
• Explorer Frame

So what are these three?

• Last working routes
• Calculated routes
• Explorer Frame

In z-wave the first route that is tried is the Last Working Route (LWR) with one caveat explained later. It is exactly what it sounds like. It is the last route that was used to communicate with the node. Clearly if something has changed since the last time a command succeeded there has to be a mechanism to get the message through.

So if we consider the diagram below where one of the repeaters that was in the LWR is now no longer available how does z-wave cope?

This is best split into two parts.

controller to nodes
slave to other node or controller

Controller

The controller holds the complete routing table, but for each node in the network it only holds a couple of dynamic routes to that destination.

This is either

a last working route and a next to last working route, or
an application priority route and another route that can be used.

These routes are managed dynamically. This is how it is described in the documentation:

The routing algorithm in controllers store information about successful attempts to reach a destination node avoiding repetition of previously failed attempts.

The last successful route used between sender and destination node are stored in NVM and is called Last Working Route(LWR).

The LWR list comprises of 232 destination nodes having up to two routes each, called the LWR and the Next to Last Working Route (NLWR).

A LWR/NLWR can contain either a direct or a routed route. Updating LWR and NLWR happens in the following situations (if new route != current LWR):

When receiving a successful explorer frame route, new route becomes LWR and old LWR becomes NLWR.
When receiving a successful routed/direct request from another node, new route becomes LWR and old LWR becomes NLWR
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR and old LWR becomes NLWR.
When receiving a successful acknowledge for a transmitted NLWR, NLWR becomes LWR.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR and old LWR becomes NLWR.

In case a LWR/NLWR fails, the LWR and NLWR are updated as follows:

If a LWR fails; it is ‘exiled’ to become the NLWR and the current NLWR (if present) is tried as the next route.
If a NLWR fails it is removed.
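To make the LWR/NLWR bookkeeping quoted above easier to follow, here is a minimal Python sketch of it. This is not the Silicon Labs implementation: the class `RouteCache`, its method names and the tuple representation of a route are all made up for illustration, and the handling of an empty LWR slot is my reading of the rules rather than something the documentation spells out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs in order; () means a direct route


@dataclass
class RouteCache:
    """Hypothetical per-destination LWR/NLWR pair (not the real NVM layout)."""
    lwr: Optional[Route] = None
    nlwr: Optional[Route] = None

    def on_success(self, route: Route) -> None:
        """A frame got through on `route`."""
        if route == self.lwr:
            return  # the rules only apply when new route != current LWR
        if self.lwr is not None:
            self.nlwr = self.lwr  # old LWR becomes NLWR
        self.lwr = route  # new route becomes LWR
        if self.nlwr == self.lwr:
            self.nlwr = None  # never keep the same route in both slots

    def on_lwr_failure(self) -> Optional[Route]:
        """The LWR failed: exile it to NLWR and return the old NLWR to try next."""
        if self.lwr is None:
            return None
        next_route = self.nlwr
        self.nlwr, self.lwr = self.lwr, None
        return next_route

    def on_nlwr_failure(self) -> None:
        """The stored NLWR failed: remove it."""
        self.nlwr = None


cache = RouteCache()
cache.on_success((5,))          # first working route, via repeater 5
cache.on_success((7, 3))        # newer route via 7 then 3; (5,) becomes NLWR
retry = cache.on_lwr_failure()  # (7, 3) fails, so (5,) is tried next
cache.on_success(retry)         # (5,) works again and is promoted to LWR
print(cache)
```

Note how the failed route `(7, 3)` is only demoted, not deleted, which is why a poor route can come back later.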

Now the possible override to this behaviour:

The Application can, instead of having both the LWR and the NLWR entries for a destination node, overload the LWR with an Application Priority Route (APR), which then upgrades the NLWR to LWR status as the protocol now is down to one dynamic LWR for the destination node in question.
The APR will after being set always be tried prior to any possible present LWR. The APR can only be removed by the Application. If destination node has an APR defined the updating of the LWR happens in the following situations (if new route != APR AND new route != LWR)
When receiving a successful explorer frame route, new route becomes LWR.
When receiving a successful routed/direct request from another node, new route becomes LWR.
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes LWR.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes LWR.
In case LWR fails it is removed.

So if a priority route is set it is always used first regardless if it worked or failed last time.
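The APR variant can be sketched the same way. Again this is only an illustration of the quoted rules under my own naming (`PriorityRouteCache`, `candidates`, `on_success`, `on_failure` are hypothetical), not code from the protocol.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs in order; () means a direct route


@dataclass
class PriorityRouteCache:
    """Hypothetical cache for a destination that has an APR set."""
    apr: Route
    lwr: Optional[Route] = None  # only one dynamic route is left

    def candidates(self) -> List[Route]:
        """The APR is always tried first, then the LWR if there is one."""
        return [self.apr] + ([self.lwr] if self.lwr is not None else [])

    def on_success(self, route: Route) -> None:
        # Update only if new route != APR and new route != LWR.
        if route != self.apr and route != self.lwr:
            self.lwr = route

    def on_failure(self, route: Route) -> None:
        # A failed LWR is removed; the APR stays until the application removes it.
        if route == self.lwr:
            self.lwr = None


routes = PriorityRouteCache(apr=(2,))
routes.on_success((4, 6))   # learned via explorer or another frame
routes.on_failure((2,))     # the APR failing does not demote it
print(routes.candidates())
```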

That is a lot to take in. I have just copied it from the docs because, complex as it sounds, it is hard to improve on what it says.

But there is a little more

The routing attempts done by a static controller to reach the destination node are as follows:

If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails then remove the LWR and try direct if neighbor.
If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails, ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR. In case the NLWR fails remove it and try direct if neighbor.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then calculate up to two routing attempts per entry/repeater node. If enough routes exist try a settable maximum number of tries.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_NO_ROUTE are set, then direct with retries.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_EXPLORE are set then issue an explore frame as last resort.
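The order of attempts in the list above can be written as a small planning function. This is a rough sketch that assumes TRANSMIT_OPTION_ACK is always set and leaves out the TRANSMIT_OPTION_NO_ROUTE case; the function name, its parameters and the string labels are hypothetical and only there to show the ordering.

```python
from typing import List, Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs in order; () means a direct route


def plan_controller_attempts(
    apr: Optional[Route],
    lwr: Optional[Route],
    nlwr: Optional[Route],
    is_neighbor: bool,
    auto_route: bool = True,
    explore: bool = True,
) -> List[str]:
    """Return the order in which a controller would try to reach a node."""
    attempts: List[str] = []
    if apr is not None:
        attempts.append("APR")
        if lwr is not None:
            attempts.append("LWR")
    elif lwr is not None:
        attempts.append("LWR")
        if nlwr is not None:
            attempts.append("NLWR")
    elif nlwr is not None:
        attempts.append("NLWR")
    if is_neighbor:
        attempts.append("direct")  # only when the routing table lists a neighbour
    if auto_route:
        attempts.append("calculated routes")
    if explore:
        attempts.append("explorer frame")  # last resort
    return attempts


print(plan_controller_attempts(None, (3,), (4,), is_neighbor=True))
```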

So this explains that if the primary routing mechanism (APR, LWR, NLWR) fails then a few other mechanisms are tried.

In openHAB all commands are sent to the controller across the serial interface using these flags: TRANSMIT_OPTION_ACK, TRANSMIT_OPTION_AUTO_ROUTE and TRANSMIT_OPTION_EXPLORE.
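The transmit options are bit flags that get OR-ed together into one byte. The sketch below shows the idea in Python. The numeric values are the ones I believe the Z-Wave SDK headers use, so treat them as an assumption and check them against your own SDK; the names `TransmitOption`, `OPENHAB_OPTIONS` and `describe` are made up here.

```python
from enum import IntFlag
from typing import List


class TransmitOption(IntFlag):
    """Transmit option bits (values assumed, verify against the SDK headers)."""
    ACK = 0x01
    AUTO_ROUTE = 0x04
    NO_ROUTE = 0x10
    EXPLORE = 0x20


# The three flags this thread says openHAB sets on every command.
OPENHAB_OPTIONS = (
    TransmitOption.ACK | TransmitOption.AUTO_ROUTE | TransmitOption.EXPLORE
)


def describe(options: int) -> List[str]:
    """List the names of the flags set in an options byte."""
    return [flag.name for flag in TransmitOption if options & flag]


print(hex(OPENHAB_OPTIONS), describe(OPENHAB_OPTIONS))
```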

So assuming your controller has at some point managed to talk to the node:

The controller will try the primary routing mechanism as described earlier. Max two routes.
[diagram]

The controller will try direct if the node is a neighbor as found in the routing table.
[diagram]

A number of new routes will be calculated from the routing table and tried. This is normally set to a maximum of 3 from observation.
[diagram]

The controller will issue an explorer.
[diagram]

So if neither of the two primary routes work the controller will try direct if the node is listed as a neighbour in the routing table.

Next it uses the routing table to calculate some new possible routes to try. Lastly it sends a special frame called an explorer.

So the new concept here is the explorer. An explorer is a broadcast mechanism that goes to all nodes to try to find a destination. If a node receives a response from the destination node it passes this back to the controller, a route relearning takes place, and the explorer-discovered route becomes the LWR (or, where an application priority route is set, the route that is not the application priority route). The diagram sort of implies it might be a route with fewer hops, but it could be a route with any number of hops including none. Remember the controller only tries direct when the node is listed as a neighbour, so if the node was not in range before but is now, the explorer could even change the route to no hops.
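As a toy model only, the explorer can be pictured as a breadth-first flood over whatever radio links actually work right now, ignoring the stored routing table. The sketch below is not how the Z-Wave stack implements explorer frames; the `links` map, the `explore` function and the limit of four repeaters per route are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def explore(
    links: Dict[int, Set[int]],
    source: int,
    dest: int,
    max_repeaters: int = 4,  # assumed limit on repeaters in one route
) -> Optional[List[int]]:
    """Flood outwards from `source`; return the repeaters of the first path found.

    An empty list means the destination was reached directly.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, repeaters = queue.popleft()
        for neighbor in sorted(links.get(node, ())):
            if neighbor == dest:
                return repeaters
            if neighbor not in seen and len(repeaters) < max_repeaters:
                seen.add(neighbor)
                queue.append((neighbor, repeaters + [neighbor]))
    return None


# Node 1 is the controller, node 9 the destination; 1 cannot hear 9 directly.
links = {1: {2}, 2: {1, 3, 9}, 3: {2, 9}, 9: {2, 3}}
print(explore(links, 1, 9))
```

If node 9 later moves within range of node 1, the same flood returns an empty repeater list, which matches the point above that an explorer can end up on a route with no hops.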

Remember that, apart from setting these flags, the binding does nothing to define the routes, and any information held in the binding is not used in the real routing behaviour. You could move your controller to a different application and all of the routing data moves with it. None of it is in the binding, passed to the binding or changed by the binding.

Very new nodes can report what the last route used was and some other information back to the application/binding but the current binding does not support this extension to the controller serial API payload.

So this is the mechanism used by more recent devices and all zwave plus devices will use this mechanism.

So the explorer is the clever thing here as it finds a working route regardless of what the routing table says. I will discuss the advantage of letting explorer correct your network over healing later.

Slave

Now for the slave nodes. These hold a very limited set of return routes that allow them to communicate with other nodes. While in principle a modern slave node could hold return routes to all 232 other possible nodes in a network, in reality they will only hold routes to the controller and other nodes that are associated.

The big difference from the controller is that there is no full routing table so the slave cannot calculate new possible routes. It does however store up to 4 routes to each node in the table. These are set initially at inclusion and then modified when associations change, when heal is used or when an explorer finds a new working route when the other mechanisms have failed.

These are actively managed each time a route to that node succeeds or fails, in a similar way to the controller, but the list is only up to 4 routes per destination node, not the full routing table, so there are no recalculated routes.

This is how the documents describe this dynamic management:

The routing algorithm in slaves store information about successful attempts to reach a destination node in response routes after the following principles:
When receiving a successful explorer frame route, new route becomes a new Response Route.
When receiving a successful routed/direct request from another node, new route becomes a new Response Route.
When receiving a successful acknowledge for a transmitted explorer frame, new route becomes a new Response Route.
When receiving a successful acknowledge for a transmitted routed/direct frame, new route becomes a new Response Route.
In case Response Route fails it is removed.
However, the response routes only contains up to two routes to different destination nodes. A response route for a new destination node overwrites the oldest of the two buffered response routes. A new response route for an existing destination overwrites the old response route for that specific destination.
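The response route buffer described in that quote behaves like a tiny two-entry cache keyed by destination. A minimal sketch, assuming that overwriting the route for an existing destination does not change which entry counts as the oldest (the documentation does not say); `ResponseRoutes` and its methods are hypothetical names.

```python
from collections import OrderedDict
from typing import Optional, Tuple

Route = Tuple[int, ...]  # repeater node IDs in order; () means a direct route


class ResponseRoutes:
    """Hypothetical model of a slave's response route buffer."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        self._routes: "OrderedDict[int, Route]" = OrderedDict()

    def on_success(self, dest: int, route: Route) -> None:
        if dest in self._routes:
            # New route for a known destination replaces the old one in place.
            self._routes[dest] = route
            return
        if len(self._routes) >= self.capacity:
            self._routes.popitem(last=False)  # overwrite the oldest entry
        self._routes[dest] = route

    def on_failure(self, dest: int) -> None:
        """A failed response route is removed."""
        self._routes.pop(dest, None)

    def get(self, dest: int) -> Optional[Route]:
        return self._routes.get(dest)


buffer = ResponseRoutes()
buffer.on_success(1, (4,))    # route to the controller via node 4
buffer.on_success(10, ())     # direct route to an associated node
buffer.on_success(20, (6,))   # third destination pushes out the oldest (node 1)
print(buffer.get(1), buffer.get(10), buffer.get(20))
```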

and then

The routing attempts done by an enhanced 232 slave to reach the destination node are as follows:
If TRANSMIT_OPTION_ACK is set and destination is available in response routes, try response route.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then try return routes if any exists for specified destination.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then try direct.
If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_EXPLORE are set, issue an explore frame as last resort.
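The same kind of planning function can be written for the slave side. This sketch assumes TRANSMIT_OPTION_ACK is set and only shows the ordering from the quoted list; the function name, parameters and labels are invented for the example.

```python
from typing import List


def plan_slave_attempts(
    has_response_route: bool,
    return_route_count: int,
    auto_route: bool = True,
    explore: bool = True,
) -> List[str]:
    """Return the order in which a slave would try to reach a destination."""
    attempts: List[str] = []
    if has_response_route:
        attempts.append("response route")
    if auto_route:
        # Stored return routes first, then a blind direct attempt.
        attempts += [f"return route {i + 1}" for i in range(return_route_count)]
        attempts.append("direct")
    if explore:
        attempts.append("explorer frame")  # last resort
    return attempts


print(plan_slave_attempts(has_response_route=True, return_route_count=2))
```

Unlike the controller sketch there is no "calculated routes" step, because the slave has no routing table to calculate from.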

Now this is an interesting section as at the time of writing no z-wave plus devices use any option other than TRANSMIT_OPTION_ACK, TRANSMIT_OPTION_AUTO_ROUTE and TRANSMIT_OPTION_EXPLORE.

This is largely due to the limitations in pre 700 series chips. Even the sample code for 700 series device development has not been changed to do anything different so it may be some time before these options are set differently.

I have a development device that changes how these options are set and is better behaved in a busy network so I can confirm using them differently has advantages.

So in summary, assuming your slave node has at some point managed to talk to the destination node:

The slave node will try the primary routing mechanism as described earlier (2 response routes).
[diagram]

The slave node will try the other routes stored in the table for that destination. So a max of 2 more.
[diagram]

The slave node will try direct (as it has no full routing table so it has no concept of neighbour).
[diagram]

The slave node will issue an explorer.
[diagram]

So a slight difference there: the node only chooses from a set of 4 precalculated routes. Initially the 2 routes that are held as response routes, then the other 2 held as return routes.
It has no concept of neighbours so it just tries direct, then an explorer is used. If the explorer succeeds one of the return routes is thrown away and the route discovered by the explorer becomes a response route.

Summary

One last thing about routing in z-wave. There is no statistical improvement of the routes so the mechanisms described above are all that is used. The algorithm in the controller to calculate routes is not published so it is hard to know what is considered, but it certainly does not use past success beyond the simple mechanism described here.

The initial return routes are calculated by the controller and sent to the slave devices so the underlying mechanism is the same but there is no option for a slave to calculate more routes on the fly.

If you read the above a few times you can see that it is perfectly possible for zwave to recycle poor routes as nothing ever fully removes them from the pool of possible routes.

Explorer discovered routes are the only routes that may be improvements to what was already in the table so these are the only routes that can bring fresh routes into the process.

If you start looking at the routes in your network in zniffer expect to be surprised and bemused. When I first started looking it amazed me how the signal bounced around the house, sometimes taking seemingly strange routes. While you can set preferred routes with some success, remember that if you do you are reducing the self healing capabilities, so try with care.

I have become more circumspect with my attitude to these strange routes following a conversation with Peter @petergebruers over on the Fibaro forum. He pointed out to me that all sorts of things in your home block and amplify the signal in strange ways.

The example he gave that drove it home for me was how a module can get a signal inside a metal back box behind a switch in a solid wall. His answer was that the mains cables can act as passive repeaters and probably bring the signal inside the box. He is a smart fellow and I am sure he is correct. He tells me RF is strange, and I now just accept it and do not try to fight the tide.

There is a lot here to digest. Feel free to post questions. I will try to add some zniffer logs showing some of these strategies and some diagrams that may help as the words are dense but I will need to use my dev network as I do not want to break my production network. My wife does not like that.

Next Z Wave Routing Basics: Application/Binding Retries, Heal and Explorer

robmac · February 22, 2020, 2:48pm

When healing a slave device, if the controller can calculate 4 good routes, these are set.

If you watch this with a zniffer you will see the 4 routes set.

Even if there are more than 4 routes the slave can only hold 4. If there are fewer than 4 then there is less resilience.

Explorer frames can discover more routes if the 4 initially set by the controller are not reliable. It is possible for these explorer routes to be more reliable than routes found by the controller’s algorithm.

apella12 · February 26, 2020, 11:29pm

I have been wondering about this and it looks like you are very knowledgeable. Are Z-wave frame cancellations at this level anything to be concerned about? As far as I can tell everything is working.
Bob
Zwave stats

chris · February 26, 2020, 11:43pm

The CAN messages are not related to the network - it’s related to the communications between the host and the controller. At this level I would not be concerned - it’s a rate of 0.4%.

apella12 · February 27, 2020, 12:37am

Thanks. Good to know.

Bob

robmac · February 27, 2020, 12:28pm

Yes that is high. I occasionally get 1 at a system restart when there is a lot of traffic but I have a large network. With a small network I would not expect to get any.

As Chris says these stats are for the serial interface between the binding and the controller. Chris hinted he may be able to improve this a little but it is an area vulnerable to high volumes of commands and reports.

If you read on further down my threads there are some tests I run in PC controller that intentionally overload the serial interface so this is not a binding specific issue.

The most common cause is too much reporting either due to configuration or a fault in a device. I know it is a further investment but a zniffer would allow you to find the cause more easily.

To give you some idea of how low a % this can be: my network has been up for just over a week since I last backed up my USB controller after adding some nodes.

The % CAN is 0.00035 % and there has been no new CAN since the startup, when all of the 150+ devices were pinged, so this is falling each day I do not restart.

You should very rarely see a CAN if ever on a small to medium network.
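The percentages quoted in this thread are simple ratios. Here is a minimal sketch, assuming the statistic is the CAN count divided by the total number of serial frames (the thread does not say exactly what the binding divides by). The counts are made up and chosen only to land near the two figures mentioned.

```python
def can_rate_percent(can_count: int, total_frames: int) -> float:
    """CAN frames as a percentage of all serial frames (assumed definition)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return 100.0 * can_count / total_frames


# Hypothetical counts, not taken from anyone's real statistics.
print(can_rate_percent(4, 1_000))    # about 0.4 %
print(can_rate_percent(1, 285_714))  # about 0.00035 %
```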

Bruce_Osborne · February 27, 2020, 12:33pm

I just got a UZB-3 yesterday so I can investigate my network with a zniffer.

robmac · February 27, 2020, 12:36pm

Enjoy

chris · February 27, 2020, 12:37pm

Sorry, but this is just caused by a race condition with communication between the device and the binding. You’re right that it is more likely when there is more data on the network, since the chances that the binding and the device will send at the same time is higher, but it’s not really an issue since the binding will resend the data shortly afterwards.

Nothing you see on a sniffer will be related to this issue.

I would respectfully disagree. This is not related to the network - it’s related to communication between the device and the controller. It’s a simple race condition, and it will occur every now and then.

robmac · February 27, 2020, 12:40pm

Agreed other than the probability of a race condition increases in busy networks.

Particularly with high volumes of reports and busy repeaters that allow timeouts and late arrival of responses.

chris · February 27, 2020, 12:43pm

That’s what I said. It’s a race condition, so the more traffic, the higher the likelihood. However it doesn’t really matter - the CAN simply tells the other end that the race was hit, and it will resend.

No - it’s not related to this at all. It is only related to communications between the binding and the controller. It occurs when the binding and the controller both try to send a message to each other at the same time. It’s not directly correlated to anything in the mesh, and certainly not delays like this since it is only correlated with binding and controller sending data at the same time, and this will always be random.

robmac · February 27, 2020, 12:49pm

Chris I think we are in violent agreement without managing to put it in words.

Definitely a race condition.
Definitely inherent in the serial interface.

Absolutely

And the last bit we should agree on is that the volume of traffic can only be seen with a zniffer.

I did say the most common cause. So other things could also be the cause, but the one that is in the user's control and worth checking first is whether they have so much traffic that they are seeing too many CAN.

chris · February 27, 2020, 12:53pm

While that is true, it is not relevant here since this issue is only the communication layer between the binding and the controller. It’s a clash of the two messages.

It doesn’t matter how much traffic is on the network - only how much is sent to the binding. You will see that in the logs. A sniffer will not provide any useful information for this issue - you will not be able to correlate what the binding sends and what the controller sends over the USB by looking at the sniffer.

robmac · February 27, 2020, 1:06pm

I will disagree on that.

There is a relationship, admittedly complex as the binding does not see network retries and the network does not see serial retries but all underlying traffic to and from the controller bound for a network device should correlate. Also a very low level of retries in the serial interface causes significant issues on the network.

Any level of retry in the network or serial is undesirable. It happens and is inevitable at a level. Above a level it impacts the utility of the system.

I typically get a max of 1 CAN on system start and 1 every two to four weeks on a 150 node network.

If I turn up my reporting and send commands in big fast streams this increases and my network is slow and has pauses when I would rather not.

I am sure you have similarly low levels of CAN and also rarely have pauses in your network.

We always end up at loggerheads and I never understand why so I will withdraw gracefully

chris · February 27, 2020, 1:09pm

Ok - fine. I will again step down, but I would strongly suggest to look at the documentation, and / or the binding code to understand what this issue is. It is simply a USB communication issue, and while I agree that more traffic on the network will result in a higher likelihood of a clash (as already stated) there is no direct correlation since it depends also on what the binding is doing (it takes both sides to send at exactly the same time - not just the network).

That is not related to the CAN. IIRC the messages are resent immediately if there is a CAN since all that needs to happen is the binding has to avoid the race - which it will by resending.

apella12 · February 27, 2020, 4:29pm

So I’m sorry to have caused all this. I was just trying to understand Z-wave better. Hopefully without starting this up again, (in layman’s terms) what I get from the discussion is that there is a “collision” between the USB stick and openHAB. For instance an openHAB rule may trigger a frame being sent to the USB stick (for a specific node) at the same time that the USB stick is trying to send a frame (say of a change in status) of that same node back to openHAB. However the frame will be resent, so nothing really is lost, although there will be extra traffic from these events that may slow down response times if there are a lot of them.

Close enough?

Bob

Bruce_Osborne · February 27, 2020, 4:35pm

Not your fault at all. Just 2 strong-willed people.

chris · February 27, 2020, 4:35pm

Partly right - it’s nothing to do with the node - it is simply a race condition when the binding and the controller start to send a command to each other at “exactly” the same time. To resynchronise, a CAN is sent, and then the command will be resent shortly after.

It’s not related to any specific node - the command may not even be for a node - it could be for the controller. It is only related to the communication between the binding and the controller. If there is more communication on the mesh, then there will be more communication between the controller and the binding so the possibility of this is higher, but the collisions are otherwise not correlated.

No - there is no extra traffic on the mesh. There will be a slight delay with communications between the binding and the controller, but it will likely not be noticeable.

The thing to remember is that this is only related to traffic in the serial API - that’s the layer between the binding and the controller.
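The resend behaviour described here can be sketched as a small retry loop on the host side. The control byte values (ACK 0x06, NAK 0x15, CAN 0x18) are the usual serial API ones, but the function, the `write_frame` callback and the limit of three attempts are assumptions for illustration and are not taken from the binding code. A real host would also wait briefly before resending.

```python
from typing import Callable

# Serial API control bytes.
ACK, NAK, CAN = 0x06, 0x15, 0x18


def send_with_retry(
    write_frame: Callable[[bytes], int],
    frame: bytes,
    max_attempts: int = 3,  # hypothetical limit, not taken from the binding
) -> int:
    """Send `frame`, resending after a CAN or NAK. Returns the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if write_frame(frame) == ACK:
            return attempt
        # CAN or NAK: the frame was not accepted, so simply send it again.
    raise TimeoutError("frame was never acknowledged")


# Fake link that collides once and then accepts the frame.
replies = iter([CAN, ACK])
attempts = send_with_retry(lambda _frame: next(replies), b"example")
print(attempts)
```

Nothing in this loop touches the mesh: the retry happens entirely on the serial link between host and controller.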

apella12 · February 27, 2020, 8:05pm

Ok. 1) So the CAN is because any 2 frames have collided, the two frames don’t have to be about the same node and critical to my original concern, they get re-sent, so everything should work. 2) Also, no noticeable delay as it is only serial API traffic. Got it.

Thanks again to everyone
Bob