[SOLVED] Insufficient time to complete transaction for RequestNodeNeighborUpdate for large z-wave networks

mhilbush · January 25, 2020, 2:24pm

I counted 15 in the 2.5 hour time period covered by the log.

The other log had 1 in the 1.5 hours covered by that log.

I’m not suggesting anything other than there are timeouts. I completely agree with all the other points you made in your post.

chris · January 25, 2020, 2:24pm

Hmmm - maybe I looked at the wrong log as I only saw one. I’ll double check.

chris · January 25, 2020, 2:35pm

In the log, using the log viewer, I see 6 “timeouts”. These aren’t exactly timeouts even if the log shows that. These are failures being returned from the controller, and then after 3 attempts the binding gives up, and then logs the “timeout”. This ultimately logs the 15 “timeouts” that I think you are referring to.

Looking at the transactions, they do not time out -:

They are receiving a failure message from the controller. The binding tries 3 times and then gives up - changing the timeout here from 5 to 10 seconds will not make any difference as the controller is returning a failure a lot quicker than that.

mhilbush · January 25, 2020, 2:42pm

I was referring to these in the log. It wasn’t clear to me how these map to the Neighbor update FAILED messages in the viewer.

BTW, what would cause the Neighbor update FAILED to come back so quickly (10-20 ms) after the request is sent? I was seeing this quite a lot back when I had the nightly heal enabled.

~/tmp/single_heal.log:25602: 2020-01-20 13:32:26.413 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 40: TID 495: Timeout at state WAIT_DATA. 3 retries remaining.
~/tmp/single_heal.log:64921: 2020-01-20 14:25:05.575 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 960: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65204: 2020-01-20 14:25:20.932 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65307: 2020-01-20 14:25:26.033 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 2 retries remaining.
~/tmp/single_heal.log:65886: 2020-01-20 14:27:00.667 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 969: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66548: 2020-01-20 14:28:55.574 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 978: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66657: 2020-01-20 14:32:41.188 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 979: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:69413: 2020-01-20 14:40:07.408 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1020: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:72571: 2020-01-20 14:54:55.036 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1065: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:79843: 2020-01-20 15:24:26.042 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1165: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88556: 2020-01-20 15:50:58.773 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1258: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88854: 2020-01-20 15:51:11.940 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1261: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89016: 2020-01-20 15:51:26.949 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1263: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89262: 2020-01-20 15:51:52.751 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1266: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89494: 2020-01-20 15:52:39.503 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1268: Timeout at state WAIT_REQUEST. 3 retries remaining.

chris · January 25, 2020, 2:49pm

Yep - that’s the same ones I refered to, but my image above was cut just before the timeout -:
Here it is again with the timeout message -:

These are the same events that you refer to I think.

The controller is busy doing something else. The binding really should probably wait a bit longer, but at the moment that’s how it is and I won’t change it right now.

Again though, changing the timeout is not going to help, so I think that this is a misunderstanding where something gets changed, and the next time it works, but it doesn’t explain the initial problem.

Celaeno1 · January 25, 2020, 4:11pm

See also this topic from October 2019:

I have reported also “Neighbor update FAILED” during heal.

But unfortunately you never had time for solving this…

Celaeno1 · January 25, 2020, 5:06pm

I tested a lot at the time and then gave up. In the end had a strong guess why the heal no longer worked.

With my three Fibaros (FGMS-001) some parameters were changed in the Z-Wave database.

Parameters: 40, 42, 60, 62, 64

And more precisely parameters 42 and 64:

I compared it to @Sihui parameters, which he said made no changes. And his heal is running.

Sihuis parameters:

.
.
My parameters:

42 and 64 are set to 0, that means, it is reported during the awake period.

During a heal, this also happens. The node normally is sleeping. After e.g. every hour it wakes up. The heal request still is in the queue of the controller. After the node awakes it wants to report illumination and temperature and also should get healed in the same step. And the awake period is to short, to report all these information. While node is sleeing again, after tree trials the controller gives up and sets the node to OFFLINE.

Setting the interval of “illumination report” and “temperature report” e.g. to 1800 (like Sihui) avoids this behavior, because these are reported before the heal comes on.

chris · January 25, 2020, 5:07pm

It is highly (very highly) unrelated to the configuration of the device. Heal works at the network layer, and not at the application layer so should not be influenced by configuration parameters.

ilektron · January 25, 2020, 5:39pm

Thanks for the direction! I’ll put together a better explanation and some more snippets to explain my reasoning.

ilektron · January 25, 2020, 6:01pm

My reasoning that the transaction manager timeout is too short is this,

Once the stage is put into HEAL_START at 14:25:00, the binding requests a report from the controller almost immediately after. The node responds with some state updates, but has not yet replied with the report from the RequestNodeNeighborUpdate. At 14:25:06, the binding decides to send the RequestNodeNeighborUpdate, which is logic in ZwaveTransactionManager class labeled ‘timer2’.

The binding is requesting another RequestNodeNeighborUpdate because it hasn’t received the report back from the controller, which the controller is still working on. The controller responds with a failure because the node is still working on the previous RequestNodeNeighborUpdate. Whatever logic is controlling the state of the node retries 3 times, but never waits for the node to figure out who its neighbors are.

IMHO, this is a little bit clearer from the logs themselves because you can see the retries with the same TID before the transaction manager decides the transaction failed.

2020-01-20 14:25:00.534 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Configuration update received
2020-01-20 14:25:00.540 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Configuration update set action_heal to true (Boolean)
2020-01-20 14:25:00.542 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Starting heal on node!
2020-01-20 14:25:00.544 [DEBUG] [ng.zwave.internal.protocol.ZWaveNode] - NODE 71: Starting network mesh heal.
2020-01-20 14:25:00.546 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Starting initialisation from HEAL_START
2020-01-20 14:25:00.549 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer - advancing to HEAL_START
2020-01-20 14:25:00.550 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveInitializationStateEvent
2020-01-20 14:25:00.551 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer - advancing to UPDATE_NEIGHBORS
2020-01-20 14:25:00.553 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveInitializationStateEvent
2020-01-20 14:25:00.554 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer: UPDATE_NEIGHBORS - updating neighbor list
2020-01-20 14:25:00.556 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Request neighbor update
2020-01-20 14:25:00.557 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: sendTransaction org.openhab.binding.zwave.internal.protocol.ZWaveSerialPayload@1d2753e
2020-01-20 14:25:00.559 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:00.561 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 960 to queue - size 1
2020-01-20 14:25:05.575 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 960: Timeout at state WAIT_REQUEST. 3 retries remaining.
2020-01-20 14:25:05.580 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:05.581 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: notifyTransactionResponse TID:960 CANCELLED
2020-01-20 14:25:05.585 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node Init response (0) org.openhab.binding.zwave.internal.protocol.ZWaveTransactionResponse@12a1e4e
2020-01-20 14:25:06.706 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: sendTransaction org.openhab.binding.zwave.internal.protocol.ZWaveSerialPayload@1d2753e
2020-01-20 14:25:06.708 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.710 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.751 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.753 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.759 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.763 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.765 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: CANCEL while sending message. Requeueing - 2 attempts left!
2020-01-20 14:25:06.767 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.768 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.777 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: TID 961: Transaction not completed
2020-01-20 14:25:06.795 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.796 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.798 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.800 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.801 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: CANCEL while sending message. Requeueing - 1 attempts left!
2020-01-20 14:25:06.804 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.805 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.813 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: TID 961: Transaction not completed
2020-01-20 14:25:06.830 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.832 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.833 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.835 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.837 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Retry count exceeded. Discarding message: TID 961: [CANCELLED] priority=Controller, requiresResponse=true, callback: 31
2020-01-20 14:25:06.838 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNodeStatusEvent
2020-01-20 14:25:06.840 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Setting OFFLINE
2020-01-20 14:25:06.843 [DEBUG] [ve.internal.protocol.ZWaveController] - NODE 71: Node Status event - Node is DEAD
2020-01-20 14:25:06.845 [DEBUG] [ng.zwave.internal.protocol.ZWaveNode] - NODE 71: Node is DEAD.

I’m not confident in how the underlying network functions, but I would imagine that a node querying a network with 90 neighbors might take a little longer than the 5 seconds that the binding is giving the controller to complete the transaction.

I was hesitant to create a pull request because I’m not confident that this is the correct place to make the change in handling this particular transaction, but I went ahead and created one just to show exactly where I made the change. I opened an issue and will submit a pull request with the change shortly.

ilektron · January 25, 2020, 6:42pm

I submitted the PR here, for those that want to see and test out the change for themselves.

chris · January 25, 2020, 7:41pm

Thanks.

As mentioned in the PR, I agree with your assessment of the issue, but would like to solve it in a different way as your implementation will have side effects. I won’t go into the details here as we will duplicate everything.

Celaeno1 · January 25, 2020, 7:41pm

I don’t care if it is related to network layer or application layer. Heal does not work. Either the controller is busy with something (e.g. getting illumination report or temperature report) or the device is busy (e.g. sending illumination or temperature report) while the neighbors are also requested. There was a change in handling this properly in May or June 2019. Since then heal does not work for my three Fibaro FGMS-001. Waking them up manually during a heal solved the problem. I give it up now.

chris · January 25, 2020, 7:43pm

Maybe it doesn’t work - but we need to know why - and that does require someone to care even if you don’t!

I spend an aweful lot of time working on OH and I’m sorry I’ve not had the chance to look at this for you, but there are a lot of other things I also need to work on.

My apologies again.

Andrew_Rowe · January 25, 2020, 7:52pm

No Chris
Thank you for everything you do and tirelessly diagnois all the zwave and zigbee stuff, maintain these great bindings, help in the forum… THANK YOU!!!
Many people use OpenHAB because the zwave is the best out there (from what I’ve heard)
I personally think you are OpenHAB’s greatest asset
I think I can speak for the entire community
Thank you

Celaeno1 · January 25, 2020, 8:03pm

@Andrew_Rowe @chris

Just to make that clear. I am exactly the same opinion and grateful for all the support here. Thank you very much. I’m just trying to share my ideas here to help solving the problem. Even if my ideas are wrong. I’m sorry.

Bruce_Osborne · January 25, 2020, 8:08pm

That is what brought me here after struggling with the mess that is openzwave. It seems especially bad for North American users.

sihui · January 25, 2020, 8:10pm

ilektron · January 25, 2020, 8:12pm

I’ve worked on a couple of products in my career that implement z-wave hubs, but they all wanted to work in their own walled garden that only officially supported a handful of devices. Even they had significant issues over the years and were unable to even work with the new products that the same company was releasing. One case caused a lock to go through a set of batteries in a couple of days. Between my vivint system zwave going offline every couple of days and looking at the reviews for the smartthings hubs, I ended up here!

Celaeno1 · January 25, 2020, 8:26pm

I don’t have to know that because I’m just a simple Z-Wave user. I don’t care whether there are two different layers in which the devices communicate with each other. In the end, the heal doesn’t work. I’m just trying to explain logically what might have changed in the meantime, that it no longer works. If my assumption is unlikely, then OK.

It is only surprising that heal works for some and not for others. I only have 10 nodes. So there shouldn’t be any problems. Which actually never existed until mid-2019. If it were a fundamental problem, everyone would have it. Probably the number of unreported cases is very high because many have simply deactivated the heal.