[SOLVED] Node heal fails during UPDATE_NEIGHBORS

Yes, I’m sure. Discovery went smoothly. See node 31 in this log, at 11:47. A little later, starting at 12:00 the heal starts failing.

If I retry the heal now, same thing, it starts, then fails at UPDATE_NEIGHBORS.

Did that already, see the log in my previous message.

So both the newly added node (NODE 31) and the edge device (NODE 30) in the vicinity of the new node (the one with the wrong route to the old node) show this behaviour when trying to heal.

There is an open issue on this problem on GH but I have not had time to look at this yet.

Hi Chris,

Are you referring to #1297?

That issue seems possibly related to the heal issues that multiple people are experiencing. Is there anything that I can do to help you get this fixed?

No - I was actually referring to this:

However you’re right that we still have that other issue open. I’ll see if I can find a way to at least resolve this for this command.

Right. I don’t know if the issue you’re referring to is applicable here since - from the description - it seems to be about battery-powered devices. In my case the problematic ones are mains-powered devices.

That would be great. I’m happy to help in any way I can.

This is not relevant to any routing. It is simply a report of neighbours discovered by a heal. It is in many ways a confusing thing. The actual route in use might not be listed here at all, and that would not be a problem.

This may be possible. If this is a recent device, then after a few attempts to use the bad route, the last working route and next working route among the return routes in this node would self-heal. The controller routing table will already have been corrected, as the node was removed from the controller.

It certainly is sensible to heal after a network change but it is not always required. One thing to understand when deciding whether to run heal regularly is that it is in no way an optimization process. Often after a heal the network takes a few days or weeks to settle and become optimal again. The best that can be said about heal is that it sets a working baseline position, not that it sets the best possible position.

That is a fair point. If you want to run heal, I can only refer you to the workaround I use: remove the USB controller and run the heal from the PC Controller program from Silicon Labs. I have posted how to use it for this purpose. Just after I posted this, Chris came through saying he had made a tweak to the heal in the binding, so this workaround may no longer be needed. Many thanks Chris. As always, great work.

The last reason not at all.

That diagram is of no help whatsoever in understanding the routing in use in your network. It is a picture of a historic moment in time, showing which nodes had which neighbours under very abnormal conditions, i.e. when communicating at -6dB signal. For this reason it will never show you the full potential of routing, nor the route actually in use, and if heal is not working it is never updated. Even if you heal using the PC Controller program, the diagram will remain broken, but the network routing table held in your controller will be fine.

To see the actual routing table follow my instructions here

This is the view read from the controller in the PC Controller program. Even this is limited, as it gives no information on the quality of communication.

If you want that, then a network health check would be worth running.

I’ve made an update to the binding - please let me know if it helps.

OK now we have an interesting situation.

There are two types of networks that have disproportionate routing issues: very small and very large.

If you have a network in an average-sized house built without very thick floors and walls, then very few people have issues between 30ish and 80ish nodes, but it is an extremely complicated thing and impossible to predict.

At 16 nodes and 5 battery devices, it now depends on how far apart the nodes are and what materials lie between them.

This is where neighbours do come into the equation, as working routes can only exist where there are nodes that can communicate. Just because two nodes are neighbours does not mean that zwave will ever route between them.

You certainly are at the less robust end of a zwave network with only that number of nodes, as it is unlikely that the routing table and return routing tables have many options stored. Many nodes may have fewer than the maximum number of alternate routes stored.

I strongly suggest, if you have time, that you put your controller in the PC Controller program, look at the actual routing table, and run a network health check from that program.

This actually sends commands to your device and tells you if you have anything to worry about.

You may find that your network has a few weaknesses beyond your edge nodes. These can sometimes be fixed by moving nodes slightly, but more often by adding extra powered nodes to act as repeaters.

I also think it is strange that you had issues with heal, as most of the people reporting issues have large networks and were seeing timeouts. Possibly there is more to the failure of heal on your network than the binding: for instance, a lot of background reporting could disrupt a heal so that it takes longer to complete than would be expected on a small network. Or possibly at -6dB there are no neighbours to report, so there was no way to reset the routes from a heal.

Hi Rob,

Thanks a lot for all the info! This helps a lot in determining logical ways of troubleshooting the issue(s) in my network.

It’s good to know that I shouldn’t value the neighbor info in HABmin (including the map) that much. I assumed that this info would be accurate, but having read your explanation it’s clear to me that it is only based on momentary historical readings, and does not say a lot about the current neighbor map or routing tables. Perhaps the HABmin UI could warn the user about this kind of misinterpretation? Just some textual changes could help a lot here.

Anyway, I will try Chris’ fix first and report back. After that I will look into PC Controller to heal any still problematic nodes and run a network health check.

Many thanks again!

That UI was supposed to be deprecated in OH2 but development stalled on the Paper UI. HABmin has already been removed from OH 3, I believe.

I am sure Chris has a lot on his plate and it does have some uses in seeing what a heal did or did not do.

As Chris may have fixed heal in openHAB, managing the network becomes a lot easier again, so the fix is very good news.

Sorry, but this isn’t completely correct.

As Rob said, the map shows the neighbours - the routes will be via some of these neighbours, but you don’t know exactly which ones from the links in the map. A node can only store a small number of possible routes, so the best ones are chosen - so if a node has 10 neighbours, then around 4 will be chosen for the routes.

This doesn’t mean that the map shown in HABmin is wrong, or inaccurate - it just means that of the “links” that are shown in the map, you don’t know exactly what route is used. The map will still be useful in troubleshooting the network topology and establishing which routes are possible.
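
To illustrate how a neighbour map can still tell you which routes are possible, here is a minimal sketch that treats the neighbour tables as a graph and searches for one candidate path. The topology, the choice of node 1 as the controller, the `possible_route` helper and the `max_hops` cap are all hypothetical and only for illustration; only the node numbers 30 and 31 come from this thread. It does not show which route is actually in use.

```python
from collections import deque

# Hypothetical neighbour tables: node id -> set of nodes it can hear.
# Node 1 stands in for the controller; 30 and 31 are the nodes from this thread.
NEIGHBOURS = {
    1: {2, 3},
    2: {1, 3, 30},
    3: {1, 2},
    30: {2, 31},
    31: {30},
}

def possible_route(neighbours, src, dst, max_hops=5):
    """Return one shortest path from src to dst along neighbour links, or None.

    max_hops is an illustrative cap on the number of links in a path.
    """
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        if len(path) - 1 >= max_hops:
            continue  # path already uses the maximum number of links
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(possible_route(NEIGHBOURS, 1, 31))
```

If the search returns `None`, no chain of neighbour links connects the two nodes, which is the kind of topology problem the map can help you spot.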

In future, I hope to be able to show the actual routes used, but that will be in a new version of the binding.

I understand what you are saying about not knowing exactly which routing paths are used.

But I also understood from Rob’s story that the ZWave map and Attribute “Neighbours” info is based on historical responses from neighbor reports coming from the nodes. Meaning that nodes may in fact have a different (newer) neighbor list than OH is aware of, because the report happened earlier in time. That’s why I called it (in other words) possibly inaccurate.

Also, thanks for the proposed fix, I will test it ASAP!

This is partly true, but that doesn’t make the neighbour tables inaccurate. The neighbour tables show possible routes, and it’s unlikely that a route will be established if it can’t hear the neighbour.

There are two methods for routing in a ZWave network (and apologies if I’m duplicating something said earlier) - either the controller can set preferred routes, or the device can try to establish a route itself. If the controller defines the routes, then it uses the information in the neighbour tables, and this is why it is useful to perform the heal periodically since this can change due to all sorts of “random” RF “magic” - even if you may not think anything has changed in the house. This is partly why the neighbour table is built with the 6dB margin - to try and ensure that the link is maintained if the power drops by 75%, and is why the binding performs a neighbour table update as part of the heal.
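
As a quick check of the 6dB figure, this small sketch converts a decibel change into a power ratio using the standard relation ratio = 10^(dB/10). The function name is just an illustration.

```python
def db_to_power_ratio(db):
    """Convert a change in decibels to a linear power ratio."""
    return 10 ** (db / 10)

remaining = db_to_power_ratio(-6)       # fraction of power left after a 6 dB drop
drop_percent = (1 - remaining) * 100    # percentage of power lost

print(f"remaining: {remaining:.3f}, drop: {drop_percent:.1f}%")
```

A 6dB reduction leaves roughly a quarter of the power, which matches the "drops by 75%" figure above.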

If the node establishes its own route, then this uses a different system (ie explorer frames) to try to find a route. However even in this case, it is highly likely that the information in the neighbour table is going to indicate the routes and it’s unlikely that a route will be established via a neighbour for which there is no link.

Just to be clear on what this change addresses… IIRC, the binding has a 75 second timer on the UPDATE_NEIGHBORS transaction. Am I remembering this correctly? And, if so, then your change ensures that that timeout will be used versus the “normal” 5 second timer used on most other transactions.

Yep - that’s correct. The timer was only being applied to certain types of responses, so it wasn’t being used in this case.
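
For anyone following along, the idea of a per-transaction timeout with a default fallback can be sketched like this. This is not the binding's code (the binding is written in Java); the names `DEFAULT_TIMEOUT_S`, `LONG_TIMEOUTS_S` and `timeout_for` are hypothetical, and the 75 and 5 second values are simply the figures recalled above.

```python
# Hypothetical values taken from the discussion above, not from the binding source.
DEFAULT_TIMEOUT_S = 5
LONG_TIMEOUTS_S = {"UPDATE_NEIGHBORS": 75}

def timeout_for(transaction_type):
    """Return the timeout in seconds for a transaction type.

    Long-running transactions get their own timer; everything else
    falls back to the default.
    """
    return LONG_TIMEOUTS_S.get(transaction_type, DEFAULT_TIMEOUT_S)

print(timeout_for("UPDATE_NEIGHBORS"))
print(timeout_for("SOME_OTHER_TRANSACTION"))
```

The point is that the lookup depends only on the transaction type, so the longer timer applies regardless of which kind of response arrives.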

Woot! I can confirm your fix instantly fixed the single node heals for my 2 problematic nodes!

This is great news, been fighting this for quite some time! Thanks a lot Chris!

I guess this fix and this issue on github are mentioned here. Right?

Sorry - I’m not sure what you mean by “mentioned here”? Can you ask the question a different way please? (sorry).

Is this “update” this fix?