Z Wave Routing Basics: Application/Binding Retries, Heal and Explorer

robmac · February 10, 2020, 6:50pm

Read Z Wave Routing Basics: Retry Strategies First

So the first article explained how the routing works.

The second explained the measures the controller stick and your devices use to get messages through, if your controller and devices are relatively recent models.

The second is relevant to all z-wave plus devices and controllers. If they are certified z-wave plus then they must support what is described.

If they are older, or a mix, you will need to use your zniffer to work out whether each device supports some or all of the mechanisms. Explorer is not tied to a particular chipset, so it is impossible to tell you which devices will and will not support it. I am assuming you now have a zniffer; if you have not, there is little point continuing to read, as you will not be able to make use of the information.

So we have got to the point where we get to the more difficult subjects. Before we continue, I will start by making a statement regarding the target state for your z-wave network if you want to be able to leave it alone and let it run. After all, you have HA to make your life easier, not to waste even more time.

Now this thread is about routing but sometimes what looks like routing issues are caused by more mundane things. You are probably reading this because you think you have routing issues but you may not so it is worth a few paragraphs on other things that people mistake for routing issues.

So what does a good resilient z-wave network look like?

it has few regular retries and very few explorer frames
when there are transient issues there are enough good routes between your nodes so that one of the “less expensive” mechanisms gets the message through. Explorers are expensive.
it does not become clogged with traffic so that it gets to the point where even an explorer does not work or you experience long pauses
it never suffers from dropouts or nodes going offline

So let’s understand the parameters.

As the z-wave documents warn, "Z-Wave is a radio technology with limited bandwidth", so we have to live inside that limit.
RF is weird and while long distances are possible in open spaces, inside or between structures this is unlikely.
Your network is the sum of all of the devices. One bad ingredient and it can ruin the complete recipe.
Resilience in a grid relies on enough alternate routing options.
RF is even weirder. Change your home environment and the pattern of the RF signal will change. This can be a material change like running new power cables or copper pipes or just having a lot of people visit.

If you take all that into account z-wave is cool and works. Ignore any of that and the experience moves from patchy to unlivable remarkably easily.

So first new subject is application/binding retries.

Application/binding retries

The older documentation for z-wave specified that an application would resend a command across the serial bus if the original command was not completed within a time frame. The guidance has now changed slightly but you are an openHAB binding user and the binding WILL retry for you.

That is fine but this is now getting very expensive and you REALLY REALLY do not want this to be happening regularly. If you are getting application retries regularly your network is on its knees and will not be nice to live with. Yes, occasionally it saves the day and you will get the message through, so it is great it is there as a backstop, but not if it is every second transmission.

So it is fine for a transient situation, like having all of the family around and your obese aunty Flora sitting by your controller and blocking the RF, but not for every day. BTW this never realistically happens. However dense Flora is, z-wave uses a frequency that travels relatively well through walls, doors and even very large people and pets.

How will you know these application retries are happening?

Well from the logs you will get an indication but the easiest is the zniffer you now own.

You will see all the controller retries as described in the previous thread; then all of the controller retries again; and then, because the binding can not take the hint that the network is not going to fix itself, all of them again.
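If you export a capture and want to count these repeated rounds rather than eyeball them, something like the following may help. It is only a sketch: the `Frame` record is a made-up, simplified view of a captured frame (not a real Zniffer export format), and the `burst_gap_s` value used to separate one round of controller retries from the next is an assumption you would tune against your own traces.

```python
from collections import defaultdict
from typing import NamedTuple


class Frame(NamedTuple):
    """Hypothetical, simplified captured frame (not a real Zniffer export format)."""
    time_s: float   # capture time in seconds
    dest_node: int  # destination node id
    payload: str    # command payload as a hex string


def count_retry_rounds(frames, burst_gap_s=1.0):
    """Count bursts of the same command sent to the same node.

    Transmissions closer together than burst_gap_s are treated as one round
    (the controller's own retries). A longer silence followed by the same
    command again is treated as a new round, i.e. a likely application retry.
    burst_gap_s is an assumed value, not something taken from the binding.
    """
    times = defaultdict(list)
    for frame in frames:
        times[(frame.dest_node, frame.payload)].append(frame.time_s)

    rounds = {}
    for key, stamps in times.items():
        stamps.sort()
        count = 1
        for previous, current in zip(stamps, stamps[1:]):
            if current - previous > burst_gap_s:
                count += 1
        rounds[key] = count
    return rounds
```

Any entry with a count above 1 is a command that went round the loop more than once. If you see that regularly for the same node, that node is where to start looking.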

The binding can not change the routes so this is probably a definition of “something” to go round this loop so many times asking the same question. You may be lucky and the issue was a 2s dip in power, or a transient mountain of reports, or a new wall sprung up in the middle of your house then disappeared again, but unless that is a regular issue for you these are probably not adding a lot of value. We will talk about too many reports later because that may be a big issue for you.

As you don’t want to see these that is all there is to say about them so I will move on to something positive and set a scene before we talk about heal and explorer.

Over the last ten years I have helped many people who think they have routing problems. They have lots of pauses and nasty behaviour and they are convinced they have routing issues.

Only a handful have had serious routing issues. So before I talk about explorer and heal, and you come back and say it can not be correct because you still have pauses and retries, let's look at all the reasons you could have pauses and retries.

How do you avoid getting too many retries?

The answers are remarkably simple:

make sure the traffic on your network is within the limits of the technology. (Traffic)
make sure every node has adequate RF to at least two but preferably four nodes so there are good alternate routes. (Routing)
use good quality devices and if you see a device behaving badly get rid of it. Send it back to the manufacturer. Many many devices have buggy firmware or are made/designed poorly. (Bad devices)
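The routing guideline in that list is easy to check once you have a neighbour table for your network. A minimal sketch, assuming you have already collected, by whatever means your tools offer, a mapping from each node to the nodes it can reach with adequate RF. The `neighbours` structure and the labels are mine; the thresholds of two and four come from the guideline above.

```python
def classify_nodes(neighbours, minimum=2, preferred=4):
    """Label each node by how many usable neighbours it has.

    neighbours: dict mapping node id -> iterable of node ids with adequate RF
    (a hypothetical structure; fill it from your own neighbour data).
    """
    result = {}
    for node, peers in neighbours.items():
        # Ignore duplicates and any accidental self-reference.
        usable = len(set(peers) - {node})
        if usable < minimum:
            result[node] = "at risk"    # fewer than two alternate paths
        elif usable < preferred:
            result[node] = "adequate"   # meets the minimum, not the preference
        else:
            result[node] = "good"
    return result
```

Nodes that come back "at risk" are the candidates for moving closer or for adding repeaters near them.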

How much traffic is too much?

If you have read the previous articles you now know that z-wave lives in the controller and devices.

The binding or application that communicates with the controller has little influence on how z-wave works. They all do fundamentally the same thing. They provide a wrapper around an API accessed through a serial interface to the controller firmware.

openHAB, HAS, Fibaro, HS3, ZIPATO, HOMEY etc. all have the same limits when it comes to z-wave if they run the same series of chips and a similar SDK.

So before starting to try to give some guidelines on how much traffic I will point you to one person’s story and how he resolved it as it is relevant to all z-wave systems.

Now you have read his story, you now know how much traffic is too much. It is when your network no longer does what you want it to do.

Now it is worth going through more details and setting some targets and explaining why those targets may work for you. This is all about compromises.

This article is squarely aimed at people who do not have a lot of old z-wave devices. I assume if you do have old devices you have been around z-wave for a few years so do not need this article.

Let’s start with a simple guideline.

If you want to avoid occasional pauses when you do not want them, target no more than 10 commands a second. @petergebruers

If you can keep it to this figure or lower then your network will always have capacity to turn on your lights within 1 second, not after 5 or more seconds. That is a good starting target and, if all else is good, you will get no timeouts or retries.
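To compare your own traffic with that target you can take the frame timestamps from a capture and look at the busiest one-second window. This is a rough sketch that assumes you have the timestamps as a plain list of seconds; how you extract them from your capture is up to you. It also counts every captured frame as one command, which is a simplification.

```python
from collections import deque

TARGET_COMMANDS_PER_SECOND = 10  # the guideline figure quoted above


def peak_commands_per_second(timestamps, window_s=1.0):
    """Return the largest number of frames seen in any sliding window of window_s seconds."""
    window = deque()
    peak = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop frames that are a full window or more older than the current one.
        while t - window[0] >= window_s:
            window.popleft()
        peak = max(peak, len(window))
    return peak


def exceeds_target(timestamps):
    """True if the busiest second in the capture is above the target."""
    return peak_commands_per_second(timestamps) > TARGET_COMMANDS_PER_SECOND
```

The peak is more useful than a daily average here, because the pauses you notice come from the busy moments, not from the quiet hours in between.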

So if your network traffic is above this and you have issues, how do you reduce the volume?

All modern devices and all z-wave plus devices support association. Association means that the devices push reports to the controller and other devices based on changes they detect. There is no need in normal operation to ask a device for its state or values for its sensors.

This changes how you can configure your network.

There is no need for polling
You could turn off all time based wake up of sleeping devices, if any device supported this, and only wake them when you need to change configuration.
Association has allowed devices to send a lot of reports.

Polling

Reducing polling down to virtually nothing rather than no polling has one small advantage. It provides the network with a self tuning capability. The polling tests the routes and if they are not good and stable the routes slowly improve to the most stable route available. If there is no stable route you will also have regular traffic that shows up in your zniffer with retries/explorer and even application retries.

How regularly you poll or how many nodes you poll is your decision. In time I have reduced all of the nodes I poll to a period of 10 days. I have also tweaked the binding in the past so a no polling option is possible and 80% of my nodes were no longer polled with no adverse effects. It was just not required.
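To get a feel for what a polling period means in traffic, the arithmetic is simple enough to put in a few lines. The node count and the number of commands per poll in the usage note are made-up figures for illustration, not measurements from my network or from the binding.

```python
SECONDS_PER_DAY = 86_400


def polling_commands_per_day(node_count, poll_period_s, commands_per_poll=1):
    """Average number of polling commands sent per day.

    node_count:        how many nodes are polled
    poll_period_s:     seconds between polls of the same node
    commands_per_poll: requests sent to a node on each poll (assumed figure)
    """
    return node_count * commands_per_poll * SECONDS_PER_DAY / poll_period_s
```

With 50 nodes and 3 commands per poll, a 30 minute period gives `polling_commands_per_day(50, 1800, 3)`, which is 7200 commands a day, while a 10 day period gives `polling_commands_per_day(50, 10 * 86_400, 3)`, which is 15. Each of those commands also draws a report back from the device, so the saving on air is larger than the figure suggests.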

Poll after. I am new to openHAB and this one is possibly of interest to a few people with some special requirements, but it is not necessary for standard z-wave behaviour. I have messaged a few openHAB users and they, like me, do not use it. If you feel you need it, just remember it is increasing traffic, and if you are seeing issues then it may be worth only doing it for a few of your nodes.

Wake up

Now this is a fun one for openHAB users. Wakeup is something sleeping nodes do but if you have configured them and you have no need to change any parameters or setting as they are pure sensors why do you need to waste battery on a regular wake up?

Well, in the case of openHAB there is one reason: if you turned regular wake up off, you would have to wake the devices yourself whenever you restarted openHAB. If you never restart openHAB you could set wake up to many days or even manual.

Sadly you will not be able to, as the devices only reserve a small value for the wake up configuration, so the maximum is generally less than a day and none allow you to disable it. You can certainly set them to the maximum wake value they support; you will still get the updates to sensor values and save a load of battery, so you may think this is worth doing. But you can not set the wake to days or manual yet.

Reporting

Most modern power control devices and all sensors of any type send reports. That is good as that is what you probably wanted so that you could automate your house. The problem is that the default settings tend to send lots of reports.

The other thing about devices is that many sensors tend to send the same or similar reports. Possibly you want the temperature from your smoke sensor, from the thermostatic valve on each radiator in your room, from your CO sensor and from your motion sensor. It was not a problem when you only had 20 nodes but possibly it is now causing issues on your network.

If you have any more than a few nodes then your network can rapidly start to cause issues and you will get pauses, timeouts and retries.

I am not going to tell you what you do and do not need but these are the biggest causes of issues and lots of people think z-wave does not work and their network routing is broken because the network is swamped in reports containing data they never look at.

Other network packet creators

Regular Waking FLiRS

These are really cool and if you want 5 locks and 15 radiator valves I totally get it. Possibly you need to make some other compromises though as if you get your zniffer out you will see they do cause more traffic than a standard device in return for that low battery use and low latency.

Security

We all need some of this but it again does use more network. By all means secure all of your devices but again you may have to make some other compromises. This is not a high bandwidth system with massive amounts of compute. It is a low energy system with very good range for that power.

Making sure nodes have two or more good neighbours

Now that you have reduced your network traffic, if you still have issues it may be routing, so it is time to talk about Heal and Explorer.

Heal

Sounds great, doesn’t it? Run this regularly and it will fix all issues and make routing issues go away.

Heal is the old and sometimes temperamental method to try to repair routes before explorer frames existed.

What it will do:

clear all stored routes to a node and try to refresh with good ones on both the controller and target device
update the topology in the controller

using a standard algorithm that does not take into account history, the type of device or traffic during normal use.

What it will not do:

Fix poor RF. z-wave is good at getting through things at a low power because of the low frequency bands allocated. But the laws of physics will not bend. If nodes are too far apart then the only thing that will fix them is moving them closer or adding 2 or more repeaters that are in range.
Fix bad nodes. If a node has failed heal will not remove it
Remove phantom associations
Make badly designed or malfunctioning nodes good

You can heal a complete network or just a single node and it will do a good job to get you started.

Explorer

Sounds a bit weird but if your controller and devices are modern then this is running every time a route fails and there are no working ones in the tables.

What it will do:

If a route fails and there is no good working alternative in the stored tables, find the fastest working route during your normal operation that exists.
clear the route at the bottom of the list, i.e. the one that has not worked for a while
put this new one at the top of the list
adjust the topology
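The list handling in those steps can be pictured as a small fixed-size list where the newest working route goes on top and the stalest one falls off the bottom. The sketch below is only my illustration of that idea. It is not the controller firmware's data structure, and the `capacity` is a parameter you supply because I am not stating here how many routes a given chip stores.

```python
from collections import deque


class RouteList:
    """Illustrative most-recently-working route list (not the real firmware structure)."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the far end when a new one is added.
        self._routes = deque(maxlen=capacity)

    def record_explorer_result(self, route):
        """Put a newly discovered working route at the top of the list.

        route: sequence of repeater node ids between controller and target.
        If the list is full, the route at the bottom (the one that has not
        worked for the longest time) is discarded.
        """
        route = tuple(route)
        if route in self._routes:
            # Already known: move it to the top instead of storing it twice.
            self._routes.remove(route)
        self._routes.appendleft(route)

    def routes(self):
        """Routes in order of preference, most recently working first."""
        return list(self._routes)
```

For example, with `capacity=3`, recording routes `(2,)`, `(3, 4)`, `(5,)` and then `(6, 7)` leaves `(6, 7)` at the top and drops `(2,)` from the bottom.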

What it will not do:

Fix poor RF. z-wave is good at getting through things at a low power because of the low frequency bands allocated. But the laws of physics will not bend. If nodes are too far apart then the only thing that will fix them is moving them closer or adding 2 or more repeaters that are in range.
Fix bad nodes. If a node has failed explorer will not remove it
Remove phantom associations
Make badly designed or malfunctioning nodes good
Reduce excess traffic

So there is a difference which is subtle.

One runs when triggered by a schedule or manually.
The other runs the instant the route fails.

One uses an algorithm that gives a route based on power and number of hops.
One finds the fastest route at that instant when things have not worked.

So to summarise one sets a baseline and one optimises. Both fail to fix the same list of other possible issues. So if explorer has not fixed your issue then it is unlikely heal will and it is not going to improve over what explorer finds.

Again I will not tell you what to do but read that a few times and make up your own mind.

So the last routing issues are all things that require a different solution but as you now have a zniffer you will find them easily.

The solutions:

Poor RF. Add 2 or more repeaters that are in range.
Bad nodes. Mark them failed and remove them and add new nodes to fill the hole.
Remove phantom associations. Get a new firmware or remove device and replace with a new type.
Make badly designed or malfunctioning nodes good. Send them back to the manufacturer.
Reduce excess traffic. See section above.

Good luck to you and enjoy your HA.

robmac · February 20, 2020, 8:36am

Hi,

I was just reading a SDK document that @petergebruers pointed me to relating to an issue.

ern12716-20_-_known_test_observations_sdk_v6.51.10.pdf (795.6 KB)

Reading it again I noticed this little chestnut

TO/Headline: TO #05919 – After requesting Node Neighbor Update to a node the controller clear LWR to that node.
Library: All controllers
ASIC: 500 series
Consequence: Controller must calculate a route based on the topology map alternatively issue an explore frame.
Workaround: None.

So if you do a nightly heal of your network it will have no LWR if you have a 500 series and a SDK without this changed. This will make your network sluggish first time you try to control a node after a heal.

chris · February 20, 2020, 8:48am

This is not new and as part of the heal the system sets the preferred route anyway since older SDKs that don’t support explorer frame would otherwise have no route. It should also not be observable in use since the heal should be done at a quiet time and the explorer frames should only take a very short time to discover the route.

robmac · February 20, 2020, 8:56am

Older non explorer would recalculate from the topology table or by managing a direct.

I am not sure older chips/sdk that do not support explorer would support application preferred routes but my memory of the SDK is not encyclopedic.

I don’t much like application preferred routes unless they are really glueing the node to a route that I know is the best from testing. It encourages the escalation to more expensive retries by not allowing the LWR and NLWR to be dynamic to the latest two working but each to their own.

Each time we clash on this you put a little smiley. Possibly best to agree to disagree.

chris · February 20, 2020, 9:58am

No - the controller defines the routes for the devices. The controller will set the route between two nodes so the controller uses the topology table and sets the route in the end device.

Older SDK will ONLY support the controller defined routes. Otherwise they have no way to route.

Sorry, but after we discussed this last time I confirmed this with Silabs. Older devices have no way to perform routing without the routes that the controller defines. If the device doesn’t support explorer frames, then it won’t route unless the controller defines the routes.

robmac · February 20, 2020, 1:20pm

So if you do a nightly heal of your network it will have no LWR if you have a 500 series and a SDK without this changed. This will make your network sluggish first time you try to control a node after a heal.

So the issue is 500 series so why are we talking about 200 series pre explorer nodes. The issue does not even impact them.

chris · February 20, 2020, 1:24pm

My point was that this is not likely to be the “first time you try to control it” since the binding will send other messages before that most of the time since the heal is done when the network is quiet (ie 2am is the recommended time). It is therefore unlikely to cause any issue that users will notice.

robmac · February 20, 2020, 3:17pm

Again you misrepresent what I say. Very well aware of that.

chris · February 20, 2020, 4:02pm

My apologies - I certainly did not mean to do that and I’m not actually sure what I misrepresented. I will however refrain from any further comment and will concentrate on the binding.

robmac · February 20, 2020, 4:09pm

Thanks Chris