[SOLVED] Insufficient time to complete transaction for RequestNodeNeighborUpdate for large z-wave networks

Celaeno1 · January 25, 2020, 12:00pm

Where can I set this? So I could try out, too.

ilektron · January 25, 2020, 1:25pm

Good to know!

It seems to have worked as my network is still online this morning.

ilektron · January 25, 2020, 1:27pm

It’s hardcoded in the transaction manager so I compiled it myself. I can post the jar later today and put up a diff or something.

chris · January 25, 2020, 1:51pm

Please avoid posting extra JARs - it will just cause a nightmare (for me) if people start using different versions. It would be better if you can open a GH issue (or PR) and explain what you’ve done and why it is the correct approach.

I’ve had a quick look at the log and don’t currently see any problem with the log you said shows everything failing. There are no errors or timeouts etc, and from a quick look the heals seem to be completing in that log.

Since you’ve not really explained what you’ve changed, it hard to comment, so opening a PR seems the best approach so I can see what you’ve done. Often I see people say they have made a change, and it fixed a bunch of stuff, where actually the change is unrelated and it was just that the second time they tried (after they made a change) it worked and it’s not fixed anything.

Normally, the controller will timeout before 10 seconds on most commands as it also has a timeout which I think is set to 5 seconds. The binding uses a similar timeout so that in theory the controller timeout should rule.

So, I don’t want to jump to any conclusions without understanding the code you’ve changed.

mhilbush · January 25, 2020, 2:01pm

Are you sure about that? What about at 2020-01-20 14:25:05.575 in the first link posted above?

mhilbush · January 25, 2020, 2:14pm

FTR, there’s another tell-tale sign that the heal has borked things up. Running this command shows you the ZWave threads (adjust directory path and password for your install).

/path/to/openhab2/runtime/bin/client -p habopen -l 0 -- threads | grep -e '^"ZWave' | sort

Normally, after your system has been running for a while (i.e. after all nodes have been fully initialized), the output will look something like this.

"ZWaveReceiveInputThread" Id=346 in RUNNABLE
"ZWaveReceiveProcessorThread" Id=348 in WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@392d28c2

If there are issues with the heal, you’ll see a bunch of lines like this (notice the NodeNNNInit).

"ZWaveNode103Init2020-01-23 10:27:33.652" Id=515 in WAITING on lock=java.util.concurrent.FutureTask@520d825

chris · January 25, 2020, 2:15pm

Ok, I know I said none, and there was one that I saw, so yes, you’re correct

Maybe there were more, but the one timeout did not seem to impact 60+ heals - most or all seemed to complete fine.

mhilbush · January 25, 2020, 2:24pm

I counted 15 in the 2.5 hour time period covered by the log.

The other log had 1 in the 1.5 hours covered by that log.

I’m not suggesting anything other than there are timeouts. I completely agree with all the other points you made in your post.

chris · January 25, 2020, 2:24pm

Hmmm - maybe I looked at the wrong log as I only saw one. I’ll double check.

chris · January 25, 2020, 2:35pm

In the log, using the log viewer, I see 6 “timeouts”. These aren’t exactly timeouts even if the log shows that. These are failures being returned from the controller, and then after 3 attempts the binding gives up, and then logs the “timeout”. This ultimately logs the 15 “timeouts” that I think you are referring to.

Looking at the transactions, they do not time out -:

They are receiving a failure message from the controller. The binding tries 3 times and then gives up - changing the timeout here from 5 to 10 seconds will not make any difference as the controller is returning a failure a lot quicker than that.

mhilbush · January 25, 2020, 2:42pm

I was referring to these in the log. It wasn’t clear to me how these map to the Neighbor update FAILED messages in the viewer.

BTW, what would cause the Neighbor update FAILED to come back so quickly (10-20 ms) after the request is sent? I was seeing this quite a lot back when I had the nightly heal enabled.

~/tmp/single_heal.log:25602: 2020-01-20 13:32:26.413 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 40: TID 495: Timeout at state WAIT_DATA. 3 retries remaining.
~/tmp/single_heal.log:64921: 2020-01-20 14:25:05.575 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 960: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65204: 2020-01-20 14:25:20.932 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65307: 2020-01-20 14:25:26.033 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 2 retries remaining.
~/tmp/single_heal.log:65886: 2020-01-20 14:27:00.667 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 969: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66548: 2020-01-20 14:28:55.574 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 978: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66657: 2020-01-20 14:32:41.188 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 979: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:69413: 2020-01-20 14:40:07.408 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1020: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:72571: 2020-01-20 14:54:55.036 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1065: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:79843: 2020-01-20 15:24:26.042 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1165: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88556: 2020-01-20 15:50:58.773 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1258: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88854: 2020-01-20 15:51:11.940 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1261: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89016: 2020-01-20 15:51:26.949 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1263: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89262: 2020-01-20 15:51:52.751 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1266: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89494: 2020-01-20 15:52:39.503 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1268: Timeout at state WAIT_REQUEST. 3 retries remaining.

chris · January 25, 2020, 2:49pm

Yep - that’s the same ones I refered to, but my image above was cut just before the timeout -:
Here it is again with the timeout message -:

These are the same events that you refer to I think.

The controller is busy doing something else. The binding really should probably wait a bit longer, but at the moment that’s how it is and I won’t change it right now.

Again though, changing the timeout is not going to help, so I think that this is a misunderstanding where something gets changed, and the next time it works, but it doesn’t explain the initial problem.

Celaeno1 · January 25, 2020, 4:11pm

See also this topic from October 2019:

I have reported also “Neighbor update FAILED” during heal.

But unfortunately you never had time for solving this…

Celaeno1 · January 25, 2020, 5:06pm

I tested a lot at the time and then gave up. In the end had a strong guess why the heal no longer worked.

With my three Fibaros (FGMS-001) some parameters were changed in the Z-Wave database.

Parameters: 40, 42, 60, 62, 64

And more precisely parameters 42 and 64:

I compared it to @Sihui parameters, which he said made no changes. And his heal is running.

Sihuis parameters:

.
.
My parameters:

42 and 64 are set to 0, that means, it is reported during the awake period.

During a heal, this also happens. The node normally is sleeping. After e.g. every hour it wakes up. The heal request still is in the queue of the controller. After the node awakes it wants to report illumination and temperature and also should get healed in the same step. And the awake period is to short, to report all these information. While node is sleeing again, after tree trials the controller gives up and sets the node to OFFLINE.

Setting the interval of “illumination report” and “temperature report” e.g. to 1800 (like Sihui) avoids this behavior, because these are reported before the heal comes on.

chris · January 25, 2020, 5:07pm

It is highly (very highly) unrelated to the configuration of the device. Heal works at the network layer, and not at the application layer so should not be influenced by configuration parameters.

ilektron · January 25, 2020, 5:39pm

Thanks for the direction! I’ll put together a better explanation and some more snippets to explain my reasoning.

ilektron · January 25, 2020, 6:01pm

My reasoning that the transaction manager timeout is too short is this,

Once the stage is put into HEAL_START at 14:25:00, the binding requests a report from the controller almost immediately after. The node responds with some state updates, but has not yet replied with the report from the RequestNodeNeighborUpdate. At 14:25:06, the binding decides to send the RequestNodeNeighborUpdate, which is logic in ZwaveTransactionManager class labeled ‘timer2’.

The binding is requesting another RequestNodeNeighborUpdate because it hasn’t received the report back from the controller, which the controller is still working on. The controller responds with a failure because the node is still working on the previous RequestNodeNeighborUpdate. Whatever logic is controlling the state of the node retries 3 times, but never waits for the node to figure out who its neighbors are.

IMHO, this is a little bit clearer from the logs themselves because you can see the retries with the same TID before the transaction manager decides the transaction failed.

2020-01-20 14:25:00.534 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Configuration update received
2020-01-20 14:25:00.540 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Configuration update set action_heal to true (Boolean)
2020-01-20 14:25:00.542 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Starting heal on node!
2020-01-20 14:25:00.544 [DEBUG] [ng.zwave.internal.protocol.ZWaveNode] - NODE 71: Starting network mesh heal.
2020-01-20 14:25:00.546 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Starting initialisation from HEAL_START
2020-01-20 14:25:00.549 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer - advancing to HEAL_START
2020-01-20 14:25:00.550 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveInitializationStateEvent
2020-01-20 14:25:00.551 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer - advancing to UPDATE_NEIGHBORS
2020-01-20 14:25:00.553 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveInitializationStateEvent
2020-01-20 14:25:00.554 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node advancer: UPDATE_NEIGHBORS - updating neighbor list
2020-01-20 14:25:00.556 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Request neighbor update
2020-01-20 14:25:00.557 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: sendTransaction org.openhab.binding.zwave.internal.protocol.ZWaveSerialPayload@1d2753e
2020-01-20 14:25:00.559 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:00.561 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 960 to queue - size 1
2020-01-20 14:25:05.575 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 960: Timeout at state WAIT_REQUEST. 3 retries remaining.
2020-01-20 14:25:05.580 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:05.581 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: notifyTransactionResponse TID:960 CANCELLED
2020-01-20 14:25:05.585 [DEBUG] [alization.ZWaveNodeInitStageAdvancer] - NODE 71: Node Init response (0) org.openhab.binding.zwave.internal.protocol.ZWaveTransactionResponse@12a1e4e
2020-01-20 14:25:06.706 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: sendTransaction org.openhab.binding.zwave.internal.protocol.ZWaveSerialPayload@1d2753e
2020-01-20 14:25:06.708 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.710 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.751 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.753 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.759 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.763 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.765 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: CANCEL while sending message. Requeueing - 2 attempts left!
2020-01-20 14:25:06.767 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.768 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.777 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: TID 961: Transaction not completed
2020-01-20 14:25:06.795 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.796 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.798 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.800 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.801 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: CANCEL while sending message. Requeueing - 1 attempts left!
2020-01-20 14:25:06.804 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Adding to device queue
2020-01-20 14:25:06.805 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Added 961 to queue - size 1
2020-01-20 14:25:06.813 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: TID 961: Transaction not completed
2020-01-20 14:25:06.830 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: Got NodeNeighborUpdate request.
2020-01-20 14:25:06.832 [DEBUG] [equestNodeNeighborUpdateMessageClass] - NODE 71: NodeNeighborUpdate FAILED
2020-01-20 14:25:06.833 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNetworkEvent
2020-01-20 14:25:06.835 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveTransactionCompletedEvent
2020-01-20 14:25:06.837 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 71: Retry count exceeded. Discarding message: TID 961: [CANCELLED] priority=Controller, requiresResponse=true, callback: 31
2020-01-20 14:25:06.838 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Got an event from Z-Wave network: ZWaveNodeStatusEvent
2020-01-20 14:25:06.840 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 71: Setting OFFLINE
2020-01-20 14:25:06.843 [DEBUG] [ve.internal.protocol.ZWaveController] - NODE 71: Node Status event - Node is DEAD
2020-01-20 14:25:06.845 [DEBUG] [ng.zwave.internal.protocol.ZWaveNode] - NODE 71: Node is DEAD.

I’m not confident in how the underlying network functions, but I would imagine that a node querying a network with 90 neighbors might take a little longer than the 5 seconds that the binding is giving the controller to complete the transaction.

I was hesitant to create a pull request because I’m not confident that this is the correct place to make the change in handling this particular transaction, but I went ahead and created one just to show exactly where I made the change. I opened an issue and will submit a pull request with the change shortly.

ilektron · January 25, 2020, 6:42pm

I submitted the PR here, for those that want to see and test out the change for themselves.

chris · January 25, 2020, 7:41pm

Thanks.

As mentioned in the PR, I agree with your assessment of the issue, but would like to solve it in a different way as your implementation will have side effects. I won’t go into the details here as we will duplicate everything.

Celaeno1 · January 25, 2020, 7:41pm

I don’t care if it is related to network layer or application layer. Heal does not work. Either the controller is busy with something (e.g. getting illumination report or temperature report) or the device is busy (e.g. sending illumination or temperature report) while the neighbors are also requested. There was a change in handling this properly in May or June 2019. Since then heal does not work for my three Fibaro FGMS-001. Waking them up manually during a heal solved the problem. I give it up now.