[SOLVED] Node heal fails during UPDATE_NEIGHBORS

robbert · February 5, 2020, 12:22pm

Hello,

OH: 2.5.1, Aeotec Z-Stick Gen 5

I have a couple of nodes on the edge of my mesh that function normally, but I consistently cannot heal them.

So:

I can include/exclude them on the spot (NWI)
They (re)initialize fine
They operate fine

Yet when I perform a heal, they always get stuck at the UPDATE_NEIGHBORS stage, waiting for a response to “RequestNodeNeighborUpdate”.

Of course I could move the devices temporarily closer to be able to advance to the next stage of the heal, but this kind of defeats the purpose of the heal, since moving them will alter their neighbourhood and they would possibly report the wrong neighbor list.

So, I’m puzzled here. How do people heal their edge nodes?

Bruce_Osborne · February 5, 2020, 12:39pm

Is this a specific node heal or a whole network heal?

robbert · February 5, 2020, 1:19pm

Actually this happens with both. I thought that the whole network heal was just a series of specific node heals initiated at the same time. Or are there also differences in the heal mechanism?

Btw, these are mains-powered devices.

robbert · February 7, 2020, 7:40pm

I’ve captured a debug log of a single node heal failing:
https://privatebin.net/?4f6501aa4c30c8e3#AsdmsM8tTqrdy8Dy1oeZGDuGzLJWWz9MUEckbah19DwD

Node 30 is sent a heal at 20:29, after which the device drops offline, later comes back online, but does not finish the heal. I noticed some warnings in the log. What does this mean?

[WARN ] [essage.ApplicationUpdateMessageClass] - TODO: Implement Application Update Request Handling of New ID Assigned (64).

robmac · February 8, 2020, 9:45pm

Why do you want to heal them if they are not broken?

If they have good routes and return routes it seems an odd thing to want to try to change them.

One thing that may be happening is that they are not that good, but the only way to tell how bad they are is with a Zniffer.

It is hard to say why the heal is not working as you have not said a lot about your network.

There are a few things that may be happening, but one thing about heal is that it uses a lower transmit power than normal transmission: -6 dB.

What may have happened is that the route to your edge node was established by explorer frames, but that route does not work at -6 dB, so the heal fails.
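To put the -6 dB figure in perspective, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the range estimate assumes idealized free-space (inverse-square) propagation, which indoor links through walls and floors will not follow exactly.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a decibel change into a linear power ratio."""
    return 10 ** (db / 10)


def free_space_range_factor(db: float) -> float:
    """Scale factor for usable range under free-space path loss.

    Assumption: received power falls off with the square of distance.
    Real indoor propagation will differ.
    """
    return 10 ** (db / 20)


# -6 dB is roughly a quarter of the power, or about half the free-space range.
print(f"power ratio at -6 dB: {db_to_power_ratio(-6):.3f}")
print(f"free-space range factor at -6 dB: {free_space_range_factor(-6):.3f}")
```

So a link that only just works at normal power can easily be out of reach during the neighbour update.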

Heal is an odd animal and can fail for many reasons. It generates a vast amount of traffic in large or dense networks where many nodes can see each other at -6 dB. It can kill itself through the sheer volume of traffic generated and any timeout at the application end.

I have a few marginal nodes at the edge of my network and the worst thing I can do is succeed in healing them. They pick up a worse route than they had before the heal. Until the route fails badly and an explorer frame gets them back to a good route, they are laggy and generally horrible.

I never heal them. That is my strategy for nodes that are working fine.

robbert · February 9, 2020, 9:30am

Hi Rob,

Thanks for your detailed response, some very useful info there!

Fair question. While these devices do operate, I have reasons to expect they are not operating optimally. Let me explain a recent event. I replaced a device on the edge of my network with another device (upgrading a ZWave to ZWave Plus device). This resulted in two observations:

The new node, while functioning, hasn’t reported its neighbors (yet), according to OH / Habmin.
As for the old node that is removed: I have another edge device that, again according to Habmin, still has the removed node in its neighbor list and is likely to have a stale return route (that may currently not be utilized) to the controller via this removed node.
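For illustration, both observations can be checked mechanically once the neighbour lists are written down. This is only a sketch: the node IDs and lists below are hypothetical (node 12 stands in for the removed device), and the data would have to be copied by hand from the Habmin node attributes, since no binding API is assumed here.

```python
from typing import Dict, List, Set


def find_stale_neighbors(
    neighbors: Dict[int, List[int]], included: Set[int]
) -> Dict[int, List[int]]:
    """Return, per node, the listed neighbours that are no longer in the network."""
    stale = {}
    for node, listed in neighbors.items():
        missing = sorted(n for n in listed if n not in included)
        if missing:
            stale[node] = missing
    return stale


def nodes_without_neighbors(neighbors: Dict[int, List[int]]) -> List[int]:
    """Return nodes that have not reported any neighbours (yet)."""
    return sorted(node for node, listed in neighbors.items() if not listed)


# Hypothetical data: node 1 is the controller, node 12 was removed.
included = {1, 5, 30, 31}
neighbors = {5: [1, 30], 30: [5, 12], 31: []}

print(find_stale_neighbors(neighbors, included))
print(nodes_without_neighbors(neighbors))
```

With this made-up data, node 30 is flagged for still listing node 12, and node 31 is flagged for having no neighbour report.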

It’s often advised on this forum and elsewhere to perform a heal after network changes, to optimize the mesh. Also, the nightly heal being enabled in the default configuration seems to suggest that it’s a good idea to heal regularly.

So, my primary reason for wanting to heal these devices is having a healthy network and optimizing the mesh, deleting any stale return routes (that may not be used now, but could be used in the future).

Additionally, there is a secondary reason: I would still like to fix my ZWave network map as laid out in this topic. The general advice for fixing the map seems to be to heal your network/all your nodes.

Don’t you agree these are valid reasons?

I’m not sure what information would be relevant, but happy to supply more info. This is a fairly small home network, consisting of one controller and 16 nodes, 11 mains-powered, 5 battery. All devices, except one, are Zwave Plus devices. What more info would be relevant?

This is very useful information, thanks! Makes me wonder why this was designed this way though. Assuming you have only Zwave Plus devices, this would make all network functionality (normal operation, include, exclude, (re)init) supported by the mesh, except healing functionality? Seems strange.

Thanks for sharing. Your strategy is interesting because it directly contradicts the default setting of nightly heal being enabled. As a normal user with a small home network I wouldn’t expect to have problems with default settings. But probably my network isn’t as normal as I think.

And may I ask - does your Habmin ZWave network map work?

Bruce_Osborne · February 9, 2020, 9:52am

Are you sure the new node has been fully discovered? Try just healing that node. An error will be logged if it is not fully discovered.

After verifying the new node, try a node heal on that one edge device.

There is some difference of opinion on whether the nightly heal is needed. If their network is stable many have turned it off until they make network changes.

robbert · February 9, 2020, 10:24am

Yes, I’m sure. Discovery went smoothly. See node 31 in this log, at 11:47. A little later, starting at 12:00 the heal starts failing.

If I retry the heal now, same thing, it starts, then fails at UPDATE_NEIGHBORS.

Did that already, see the log in my previous message.

So both the newly added node (NODE 31) as well as the edge device (NODE 30) in the vicinity of the new node (the one with the wrong route to the old node) show this behaviour when trying to heal.

chris · February 9, 2020, 10:41am

There is an open issue on this problem on GH but I have not had time to look at this yet.

robbert · February 9, 2020, 11:00am

Hi Chris,

Are you referring to #1297?

That issue seems possibly related to the heal issues that multiple people are experiencing. Is there anything that I can do to help you get this fixed?

chris · February 9, 2020, 11:02am

No - I was actually referring to this -:

However you’re right that we still have that other issue open. I’ll see if I can find a way to at least resolve this for this command.

robbert · February 9, 2020, 11:17am

Right. I don’t know if the issue you’re referring to is applicable here since, from the description, it seems to be about battery-powered devices. In my case the problematic ones are mains-powered devices.

That would be great. I’m happy to help in any way I can.

robmac · February 9, 2020, 11:32am

This is not relevant to any routing. It is simply a report of neighbours discovered by a heal. It is in many ways a confusing thing. The actual route used might not even be listed here, and that would not be a problem.

This may be possible. If this is a recent device, then after a few attempts to use the bad route, the last working route and next working route among the return routes in this node would self-heal. The controller routing table will already have been corrected when the node was removed from the controller.

It certainly is sensible to heal after a network change but it is not always required. One thing to consider when deciding if you want to run heal regularly is to understand it is in no way an optimization process. Often after a heal the network takes a few days or weeks to settle and become optimal again. The best that can be said about heal is that it sets a working baseline position not that it sets the best possible position.

That is a fair point. If you want to run heal, I can only refer you to the workaround I use, which is to remove the USB stick and use the PC Controller program from Silicon Labs. I have posted how to use this for this purpose. Just after I posted this, Chris came through saying he had made a tweak to the heal in the binding, so this workaround may no longer be needed. Many thanks Chris. As always, great work.

The last reason not at all.

That diagram is of no help whatsoever in understanding the routing in use in your network. It is a picture of a historic moment in time, showing which nodes had which neighbours under very abnormal conditions, i.e. when using a -6 dB signal to communicate. For this reason it will never show you the full potential of routing, nor the actual route, and if heal is not working it is never updated. Even if you heal using the PC Controller program it will remain broken, but the network routing table held in your controller will be fine.

To see the actual routing table follow my instructions here

This is the view read from the controller in the PC Controller program. Even this is limited, as it gives no information on the quality of communication.

If you want that, then a network health check would be worth running.

chris · February 9, 2020, 11:33am

I’ve made an update to the binding - please let me know if it helps.

robmac · February 9, 2020, 12:34pm

OK now we have an interesting situation.

There are two types of networks that have disproportionate routing issues: very small and very large.

If you have a network in an average-sized house built without hugely thick floors and walls, very few people have issues between 30ish and 80ish nodes, but it is an extremely complicated thing and impossible to predict.

With 16 nodes, 5 of them battery-powered, it now depends on how far apart the nodes are and the materials between them.

This is where neighbours do come into the equation, as working routes can only exist where there are nodes that can communicate. Just because two nodes are neighbours does not mean that zwave will ever route between them.

You certainly are at the less robust end of a zwave network with only that number of nodes, as it is unlikely that the routing table and return routing tables have many options stored. Many nodes may have fewer than the maximum number of alternate routes stored.

I strongly suggest that, if you have time, you put your controller in the PC Controller program, look at the actual routing table, and run a network health check from that program.

This actually sends commands to your device and tells you if you have anything to worry about.

You may find that your network has a few weaknesses beyond your edge nodes. These can sometimes be fixed by moving nodes slightly, but more often by adding extra powered nodes to act as repeaters.

I also think it is strange you had issues with heal, as most of the people reporting issues have large networks and were seeing timeouts. Possibly there is more to the failure of heal on your network than the binding, for instance a lot of background reporting that would disrupt a heal so that it took longer to complete than would be expected on a small network. Or possibly at -6 dB there are no neighbours to report, so there was no way to reset the routes from a heal.

robbert · February 9, 2020, 1:39pm

Hi Rob,

Thanks a lot for all the info! This helps a lot in determining logical ways of troubleshooting the issue(s) in my network.

It’s good to know that I shouldn’t value the neighbor info in Habmin (including the map) that much. I assumed that this info would be accurate, but having read your explanation it’s clear to me that this info is only based on momentary historical readings, and does not say a lot about the current neighbor map or routing tables. Perhaps the Habmin UI could warn the user about this kind of misinterpretation? Just some textual changes could help a lot here.

Anyway, I will try Chris’ fix first and report back. After that I will look into PC Controller to heal any still problematic nodes and run a network health check.

Many thanks again!

Bruce_Osborne · February 9, 2020, 1:41pm

That UI was supposed to be deprecated in OH2 but development stalled on the Paper UI. HABmin has already been removed from OH 3, I believe.

robmac · February 9, 2020, 1:43pm

I am sure Chris has a lot on his plate and it does have some uses in seeing what a heal did or did not do.

As Chris may have fixed heal in openHAB it makes managing the network a lot easier again so the fix is very good news.

chris · February 9, 2020, 1:45pm

Sorry, but this isn’t completely correct.

As Rob said, the map shows the neighbours - the routes will be via some of these neighbours, but you don’t know exactly which ones from the links in the map. A node can only store a small number of possible routes, so the best ones are chosen - so if a node has 10 neighbours, then around 4 will be chosen for the routes.

This doesn’t mean that the map shown in HABmin is wrong, or inaccurate - it just means that of the “links” that are shown in the map, you don’t know exactly what route is used. The map will still be useful in troubleshooting the network topology and establishing which routes are possible.
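As a rough sketch of how the neighbour map can still answer "is a route possible at all?", the links can be treated as a graph and searched breadth-first. The data below is hypothetical, the function name is made up, and the limit of 4 repeaters is an assumption passed as a parameter. The result is only one possible path, not the route the node actually uses.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(
    neighbors: Dict[int, List[int]],
    start: int,
    controller: int,
    max_repeaters: int = 4,  # assumption: upper bound on repeaters in a route
) -> Optional[List[int]]:
    """Breadth-first search for one possible path from start to controller."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == controller:
            return path
        # Extending this path any further would exceed the repeater limit.
        if len(path) > max_repeaters + 1:
            continue
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical neighbour lists as shown in the map; node 1 is the controller.
neighbors = {1: [5], 5: [1, 30], 30: [5, 31], 31: [30]}

print(shortest_path(neighbors, 31, 1))
print(shortest_path(neighbors, 31, 1, max_repeaters=1))
```

In this made-up layout node 31 can reach the controller via nodes 30 and 5, while allowing only one repeater leaves it with no possible path.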

In future, I hope to be able to show the actual routes used, but that will be in a new version of the binding.

robbert · February 9, 2020, 1:53pm

I understand what you are saying about not knowing the exact routing paths used.

But I also understood from Rob’s story that the ZWave map and Attribute “Neighbours” info is based on historical responses from neighbor reports coming from the nodes. Meaning that nodes may in fact have a different (newer) neighbor list than OH is aware of, because the report happened earlier in time. That’s why I called it (in other words) possibly inaccurate.

Also, thanks for the proposed fix, I will test it ASAP!