Changing the timeout for responses to requests in the transaction manager from 5000 ms to 10000 ms (building from the 2.5.1 tag) resolved an issue where heal was effectively broken on my openHABian system on a Pi 4 with 4 GB RAM and a 32 GB disk, using the Linear USB controller.
Heal started breaking once I had around 60 devices; at that point the nightly heal started knocking all of my Z-Wave devices offline. Even when healing a single device, the transaction manager resent the request multiple times until it timed out completely and finally aborted the transaction.
IIRC, if you change the daily heal time to one hour before the current time, the heal will start immediately. If I enable the daily heal and let it run, my network seizes up and I usually get a “Queue full” error. I have 131 devices. These issues are related…
Please avoid posting extra JARs - it will just cause a nightmare (for me) if people start using different versions. It would be better if you can open a GH issue (or PR) and explain what you’ve done and why it is the correct approach.
I’ve had a quick look and don’t currently see any problem in the log you said shows everything failing. There are no errors or timeouts etc., and from a quick look the heals seem to be completing in that log.
Since you’ve not really explained what you’ve changed, it’s hard to comment, so opening a PR seems the best approach so I can see what you’ve done. Often I see people say they made a change and it fixed a bunch of stuff, when actually the change is unrelated: it just happened to work the second time they tried (after they made the change), and the change itself fixed nothing.
Normally the controller will time out before 10 seconds on most commands, as it also has a timeout, which I think is set to 5 seconds. The binding uses a similar timeout so that, in theory, the controller’s timeout should take precedence.
So, I don’t want to jump to any conclusions without understanding the code you’ve changed.
In the log, using the log viewer, I see 6 “timeouts”. These aren’t exactly timeouts, even if the log shows them that way: they are failures being returned from the controller, and after 3 attempts the binding gives up and logs the “timeout”. This ultimately produces the 15 “timeouts” that I think you are referring to.
Looking at the transactions, they do not time out: they are receiving a failure message from the controller. The binding tries 3 times and then gives up; changing the timeout here from 5 to 10 seconds will not make any difference, as the controller is returning the failure a lot quicker than that.
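To illustrate the point, here is a minimal sketch of that retry behaviour (hypothetical code; `Controller`, `MAX_ATTEMPTS`, `TIMEOUT_MS` and all other names are invented, not the binding’s actual classes): when the controller answers quickly with a failure, the retry loop exhausts its attempts long before any per-attempt timeout could fire, so raising that timeout changes nothing.

```java
// Hypothetical sketch of the retry behaviour described above; names are
// invented and this is NOT the binding's actual code.
public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;    // binding gives up after 3 tries
    static final long TIMEOUT_MS = 5000;  // raising this to 10000 is irrelevant below

    interface Controller {
        String send(String request);      // returns "OK" or "FAIL"
    }

    static boolean transact(Controller c, String request) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            // TIMEOUT_MS would only matter if the controller stayed silent.
            // A busy controller answers "FAIL" within milliseconds, so the
            // loop immediately moves on to the next attempt.
            if ("OK".equals(c.send(request))) {
                return true;
            }
        }
        return false; // logged as a "timeout" even though nothing timed out
    }

    public static void main(String[] args) {
        Controller busyController = request -> "FAIL";
        System.out.println(transact(busyController, "RequestNodeNeighborUpdate"));
        // prints "false" after three fast failures
    }
}
```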
These are the same events that you refer to I think.
The controller is busy doing something else. The binding probably should wait a bit longer, but at the moment that’s how it is and I won’t change it right now.
Again though, changing the timeout is not going to help, so I think this is a misunderstanding: something gets changed, it works the next time, and it looks fixed, but that doesn’t explain the initial problem.
I tested a lot at the time and then gave up. In the end I had a strong guess why the heal no longer worked.
With my three Fibaros (FGMS-001) some parameters were changed in the Z-Wave database.
Parameters: 40, 42, 60, 62, 64
And more precisely parameters 42 and 64:
I compared them to @Sihui’s parameters, which he said he had not changed, and his heal runs.
42 and 64 are set to 0, which means the values are reported during the wake-up period.
This also happens during a heal. The node normally sleeps and wakes up e.g. every hour. The heal request is still sitting in the controller’s queue. When the node wakes, it wants to report illumination and temperature and should also get healed in the same step, but the wake-up period is too short to transfer all of this information. Once the node is sleeping again, after three tries the controller gives up and sets the node to OFFLINE.
Setting the interval of the “illumination report” and “temperature report” e.g. to 1800 (like Sihui) avoids this behaviour, because these values are reported before the heal comes along.
Once the stage is put into HEAL_START at 14:25:00, the binding requests a report from the controller almost immediately afterwards. The node responds with some state updates, but has not yet replied with the report from the RequestNodeNeighborUpdate. At 14:25:06, the binding decides to send the RequestNodeNeighborUpdate again; this is the logic in the ZwaveTransactionManager class labelled ‘timer2’.
The binding is requesting another RequestNodeNeighborUpdate because it hasn’t received the report back from the controller, which is still working on it. The controller responds with a failure because the node is still processing the previous RequestNodeNeighborUpdate. Whatever logic controls the state of the node retries 3 times, but never waits for the node to figure out who its neighbors are.
IMHO, this is a little bit clearer from the logs themselves because you can see the retries with the same TID before the transaction manager decides the transaction failed.
I’m not confident in how the underlying network functions, but I would imagine that a node querying a network with 90 neighbors might take a little longer than the 5 seconds that the binding is giving the controller to complete the transaction.
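As a rough model of that race (hypothetical names and timings; this is not the actual ZwaveTransactionManager code, and the 5000 ms ‘timer2’ value is my reading of the 14:25:00 → 14:25:06 gap in the log): if the neighbor-update report is still outstanding when the timer fires, a duplicate request goes out while the node is still busy.

```java
// Hypothetical model of the 'timer2' resend decision; the names and the
// 5000 ms figure are assumptions based on the log, not the real code.
public class ResendSketch {
    static final long TIMER2_MS = 5000; // assumed per-transaction wait before resending

    // Resend iff the timer has expired and no report has arrived yet.
    static boolean shouldResend(long elapsedMs, boolean reportReceived) {
        return elapsedMs >= TIMER2_MS && !reportReceived;
    }

    public static void main(String[] args) {
        // A node with ~90 neighbors may need more than 5 s to finish its
        // neighbor discovery, so at the 5 s mark the report is still
        // outstanding: a duplicate RequestNodeNeighborUpdate is sent,
        // which the busy controller answers with a failure.
        System.out.println(shouldResend(5000, false)); // true: duplicate request goes out
        System.out.println(shouldResend(5000, true));  // false: report arrived in time
    }
}
```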
I was hesitant to create a pull request because I’m not confident this is the correct place to change the handling of this particular transaction, but I went ahead and created one just to show exactly where I made the change. I’ve opened an issue and will submit the pull request shortly.