Changing the timeout for responses to requests in the transaction manager from 5000 ms to 10000 ms (against the 2.5.1 tag) resolved an issue where heal was effectively broken on my openHABian system: a Pi4 with 4 GB RAM and a 32 GB disk, using the Linear USB controller.
Heal started breaking once I had around 60 devices; at that point the nightly heal started knocking all of my Z-Wave devices offline. Even when trying to heal a single device, the transaction manager resent the request multiple times until it timed out completely and finally aborted the transaction.
IIRC, if you change the daily heal time to one hour prior to now, the heal will start immediately. If I enable the daily heal and let it run, my network seizes up and usually gets a "Queue full" error. I have 131 devices. These issues are related…
Please avoid posting extra JARs - it will just cause a nightmare (for me) if people start using different versions. It would be better if you can open a GH issue (or PR) and explain what you've done and why it is the correct approach.
I've had a quick look at the log and don't currently see any problem with the log you said shows everything failing. There are no errors or timeouts etc, and from a quick look the heals seem to be completing in that log.
Since you've not really explained what you've changed, it's hard to comment, so opening a PR seems the best approach so I can see what you've done. Often I see people say they made a change and it fixed a bunch of stuff, when actually the change is unrelated: it simply worked the second time they tried (after they made the change), and the change itself fixed nothing.
Normally, the controller will time out before 10 seconds on most commands, as it also has its own timeout, which I think is set to 5 seconds. The binding uses a similar timeout so that, in theory, the controller timeout should rule.
So, I don't want to jump to any conclusions without understanding the code you've changed.
FTR, there's another tell-tale sign that the heal has borked things up. Running this command shows you the ZWave threads (adjust the directory path and password for your install).
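The console command itself isn't reproduced in this post. As a hedged illustration of what such a listing contains (this is not the binding's code, just a minimal sketch using the standard `Thread.getAllStackTraces()` API from inside a JVM):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ZWaveThreadList {
    // Collect "name Id=.. in STATE" lines for live threads whose name matches a prefix,
    // mimicking the format of the thread listings shown below.
    static List<String> dump(String prefix) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (t.getName().startsWith(prefix)) {
                lines.add("\"" + t.getName() + "\" Id=" + t.getId() + " in " + t.getState());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // On a live openHAB instance this would list the binding's threads.
        dump("ZWave").forEach(System.out::println);
    }
}
```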
Normally, after your system has been running for a while (i.e. after all nodes have been fully initialized), the output will look something like this.
"ZWaveReceiveInputThread" Id=346 in RUNNABLE
"ZWaveReceiveProcessorThread" Id=348 in WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@392d28c2
If there are issues with the heal, you'll see a bunch of lines like this (notice the NodeNNNInit).
"ZWaveNode103Init2020-01-23 10:27:33.652" Id=515 in WAITING on lock=java.util.concurrent.FutureTask@520d825
In the log, using the log viewer, I see 6 "timeouts". These aren't exactly timeouts even if the log shows that: they are failures being returned from the controller, and after 3 attempts the binding gives up and logs the "timeout". This is ultimately what produces the 15 "timeouts" that I think you are referring to.
Looking at the transactions, they do not time out -:
They receive a failure message from the controller. The binding tries 3 times and then gives up; changing the timeout here from 5 to 10 seconds will not make any difference, as the controller returns the failure a lot quicker than that.
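To illustrate the point (a hedged sketch, not the binding's actual transaction code): the timeout only matters when no response arrives at all. An explicit failure from the controller comes back within milliseconds and goes straight to the retry path, so stretching the timer from 5 s to 10 s never comes into play.

```java
// Hedged sketch of the retry behavior described above; names are illustrative.
public class RetrySketch {
    enum Outcome { SUCCESS, FAILURE, TIMEOUT }

    interface Controller {
        Outcome send();
    }

    // Returns true if the transaction eventually succeeded. A fast FAILURE
    // response consumes a retry immediately -- the 5 s (or 10 s) timer would
    // only fire if the controller said nothing at all.
    static boolean transact(Controller c, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Outcome o = c.send();
            if (o == Outcome.SUCCESS) {
                return true;
            }
            // FAILURE and TIMEOUT both just burn an attempt.
        }
        return false; // binding gives up and logs the "timeout"
    }

    public static void main(String[] args) {
        // A controller that always rejects: 3 attempts, then give up.
        Controller alwaysBusy = () -> Outcome.FAILURE;
        System.out.println("succeeded=" + transact(alwaysBusy, 3));
    }
}
```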
I was referring to these in the log. It wasn't clear to me how these map to the Neighbor update FAILED messages in the viewer.
BTW, what would cause the Neighbor update FAILED to come back so quickly (10-20 ms) after the request is sent? I was seeing this quite a lot back when I had the nightly heal enabled.
~/tmp/single_heal.log:25602: 2020-01-20 13:32:26.413 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 40: TID 495: Timeout at state WAIT_DATA. 3 retries remaining.
~/tmp/single_heal.log:64921: 2020-01-20 14:25:05.575 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 960: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65204: 2020-01-20 14:25:20.932 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:65307: 2020-01-20 14:25:26.033 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 963: Timeout at state WAIT_REQUEST. 2 retries remaining.
~/tmp/single_heal.log:65886: 2020-01-20 14:27:00.667 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 969: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66548: 2020-01-20 14:28:55.574 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 978: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:66657: 2020-01-20 14:32:41.188 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 979: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:69413: 2020-01-20 14:40:07.408 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1020: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:72571: 2020-01-20 14:54:55.036 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1065: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:79843: 2020-01-20 15:24:26.042 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 71: TID 1165: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88556: 2020-01-20 15:50:58.773 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1258: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:88854: 2020-01-20 15:51:11.940 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1261: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89016: 2020-01-20 15:51:26.949 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1263: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89262: 2020-01-20 15:51:52.751 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1266: Timeout at state WAIT_REQUEST. 3 retries remaining.
~/tmp/single_heal.log:89494: 2020-01-20 15:52:39.503 [DEBUG] [sactionManager$ZWaveTransactionTimer] - NODE 24: TID 1268: Timeout at state WAIT_REQUEST. 3 retries remaining.
These are the same events that you refer to I think.
The controller is busy doing something else. The binding really should probably wait a bit longer, but at the moment that's how it is and I won't change it right now.
Again though, changing the timeout is not going to help, so I think that this is a misunderstanding where something gets changed, and the next time it works, but it doesnât explain the initial problem.
I tested a lot at the time and then gave up. In the end I had a strong guess as to why the heal no longer worked.
With my three Fibaros (FGMS-001) some parameters were changed in the Z-Wave database.
Parameters: 40, 42, 60, 62, 64
And more precisely parameters 42 and 64:
I compared them to @Sihui's parameters, which he said he had not changed. And his heal is running.
Sihui's parameters:
[screenshot of parameter values]
My parameters:
Parameters 42 and 64 are set to 0, which means the values are reported during the awake period.
During a heal, this also happens. The node is normally sleeping and wakes up, e.g., every hour. The heal request is still sitting in the controller's queue. When the node wakes up, it wants to report illumination and temperature and is also supposed to get healed in the same step, but the awake period is too short to transfer all of this information. Once the node is sleeping again, the controller gives up after three tries and sets the node to OFFLINE.
Setting the interval of the illumination and temperature reports to e.g. 1800 (like Sihui) avoids this behavior, because those are reported before the heal comes around.
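Purely as a toy model of the mechanism this post proposes (the numbers, names, and costs are illustrative, not measured): treat the wake-up window as a time budget that queued reports consume before the heal gets its turn.

```java
// Toy model of the proposed wake-up-window contention; all values illustrative.
public class WakeupWindowSketch {
    // Returns true if the queued heal still fits in the wake-up window after
    // the pending reports have been sent.
    static boolean healFits(int windowMillis, int[] reportCostsMillis, int healCostMillis) {
        int used = 0;
        for (int cost : reportCostsMillis) {
            used += cost; // reports are served first
        }
        return used + healCostMillis <= windowMillis;
    }

    public static void main(String[] args) {
        // Illumination + temperature report queued at every wake-up: window exhausted.
        System.out.println(healFits(1000, new int[] { 400, 400 }, 500)); // false
        // Reports throttled (e.g. to every 1800 s), so none pending this wake-up.
        System.out.println(healFits(1000, new int[] {}, 500)); // true
    }
}
```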
It is highly (very highly) unlikely to be related to the configuration of the device. Heal works at the network layer, not at the application layer, so it should not be influenced by configuration parameters.
Once the stage is put into HEAL_START at 14:25:00, the binding requests a report from the controller almost immediately afterwards. The node responds with some state updates, but has not yet replied with the report from the RequestNodeNeighborUpdate. At 14:25:06, the binding decides to send the RequestNodeNeighborUpdate again; this is the logic in the ZWaveTransactionManager class labeled "timer2".
The binding is requesting another RequestNodeNeighborUpdate because it hasn't received the report back from the controller, which the controller is still working on. The controller responds with a failure because the node is still working on the previous RequestNodeNeighborUpdate. Whatever logic is controlling the state of the node retries 3 times, but never waits for the node to figure out who its neighbors are.
IMHO, this is a little bit clearer from the logs themselves because you can see the retries with the same TID before the transaction manager decides the transaction failed.
I'm not confident in how the underlying network functions, but I would imagine that a node querying a network with 90 neighbors might take a little longer than the 5 seconds that the binding is giving the controller to complete the transaction.
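A hedged sketch of that suspicion (illustrative names only, not the binding's API): if each resend fires immediately after the controller's fast rejection, all retries land while the controller is still busy, whereas waiting between attempts would give the neighbor discovery time to finish.

```java
// Illustrative sketch of immediate-retry vs. delayed-retry; not the binding's code.
public class NeighborUpdateSketch {
    // Fake controller that stays busy until a deadline, simulating a node that
    // needs a while to discover many neighbours.
    static class FakeController {
        final long readyAtMillis;
        FakeController(long busyForMillis) {
            this.readyAtMillis = System.currentTimeMillis() + busyForMillis;
        }
        boolean requestNeighborUpdate() {
            // While busy, the request is rejected almost instantly.
            return System.currentTimeMillis() >= readyAtMillis;
        }
    }

    // Resend immediately on each fast failure: every attempt lands while the
    // controller is still busy, mirroring the retries seen in the log.
    static boolean retryImmediately(FakeController c, int attempts) {
        for (int i = 0; i < attempts; i++) {
            if (c.requestNeighborUpdate()) return true;
        }
        return false;
    }

    // Wait between attempts, giving the controller time to finish.
    static boolean retryWithDelay(FakeController c, int attempts, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (c.requestNeighborUpdate()) return true;
            Thread.sleep(delayMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("immediate: " + retryImmediately(new FakeController(200), 3)); // false
        System.out.println("delayed:   " + retryWithDelay(new FakeController(200), 3, 150)); // true
    }
}
```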
I was hesitant to create a pull request because I'm not confident that this is the correct place to change the handling of this particular transaction, but I went ahead and created one just to show exactly where I made the change. I've opened an issue and will submit the pull request shortly.