Changing the timeout for responses to requests in the transaction manager from 5000 ms to 10000 ms (building from the 2.5.1 tag) resolved an issue where heal was effectively broken on my openHABian system on a Pi 4 with 4 GB RAM and a 32 GB disk, using the Linear USB controller.
Heal started breaking once I had around 60 devices; at that point the nightly heal started knocking all of my Z-Wave devices offline. Even when healing a single device, the transaction manager resent the request multiple times until it timed out completely and finally aborted the transaction.
IIRC, if you change the daily heal time to one hour before the current time, the heal will start immediately. If I enable the daily heal and let it run, my network seizes up and I usually get a “Queue full” error. I have 131 devices. These issues are related…
Please avoid posting extra JARs - it will just cause a nightmare (for me) if people start using different versions. It would be better if you can open a GH issue (or PR) and explain what you’ve done and why it is the correct approach.
I’ve had a quick look and don’t currently see any problem in the log you said shows everything failing. There are no errors or timeouts etc., and from a quick look the heals seem to be completing in that log.
Since you’ve not really explained what you’ve changed, it’s hard to comment, so opening a PR seems the best approach so I can see what you’ve done. Often I see people say they made a change and it fixed a bunch of stuff, when actually the change is unrelated: it just happened to work the second time they tried (after they made the change), and the change itself fixed nothing.
Normally the controller will time out before 10 seconds on most commands, as it also has a timeout, which I think is set to 5 seconds. The binding uses a similar timeout so that, in theory, the controller’s timeout should take precedence.
So, I don’t want to jump to any conclusions without understanding the code you’ve changed.
In the log, using the log viewer, I see 6 “timeouts”. These aren’t exactly timeouts, even if the log shows them that way: they are failures being returned from the controller, and after 3 attempts the binding gives up and logs the “timeout”. This ultimately produces the 15 “timeouts” that I think you are referring to.
Looking at the transactions, they do not time out: they are receiving a failure message from the controller. The binding tries 3 times and then gives up; changing the timeout here from 5 to 10 seconds will not make any difference, as the controller is returning the failure a lot quicker than that.
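To illustrate the point, here is a minimal sketch of that retry behaviour (hypothetical code; `Controller`, `MAX_ATTEMPTS`, `TIMEOUT_MS` and all other names are invented, not the binding’s actual classes): when the controller answers quickly with a failure, the retry loop exhausts its attempts long before any per-attempt timeout could fire, so raising that timeout changes nothing.

```java
// Hypothetical sketch of the retry behaviour described above; names are
// invented and this is NOT the binding's actual code.
public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;    // binding gives up after 3 tries
    static final long TIMEOUT_MS = 5000;  // raising this to 10000 is irrelevant below

    interface Controller {
        String send(String request);      // returns "OK" or "FAIL"
    }

    static boolean transact(Controller c, String request) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            // TIMEOUT_MS would only matter if the controller stayed silent.
            // A busy controller answers "FAIL" within milliseconds, so the
            // loop immediately moves on to the next attempt.
            if ("OK".equals(c.send(request))) {
                return true;
            }
        }
        return false; // logged as a "timeout" even though nothing timed out
    }

    public static void main(String[] args) {
        Controller busyController = request -> "FAIL";
        System.out.println(transact(busyController, "RequestNodeNeighborUpdate"));
        // prints "false" after three fast failures
    }
}
```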
These are the same events that you refer to I think.
The controller is busy doing something else. The binding probably should wait a bit longer, but at the moment that’s how it is and I won’t change it right now.
Again though, changing the timeout is not going to help, so I think this is a misunderstanding: something gets changed, it works the next time, and it looks fixed, but that doesn’t explain the initial problem.
I tested a lot at the time and then gave up. In the end I had a strong guess why the heal no longer worked.
With my three Fibaros (FGMS-001) some parameters were changed in the Z-Wave database.
Parameters: 40, 42, 60, 62, 64
And more precisely parameters 42 and 64:
I compared them to @Sihui’s parameters, which he said he had not changed, and his heal runs.
42 and 64 are set to 0, which means the values are reported during the wake-up period.
This also happens during a heal. The node normally sleeps and wakes up e.g. every hour. The heal request is still sitting in the controller’s queue. When the node wakes, it wants to report illumination and temperature and should also get healed in the same step, but the wake-up period is too short to transfer all of this information. Once the node is sleeping again, after three tries the controller gives up and sets the node to OFFLINE.
Setting the interval of the “illumination report” and “temperature report” e.g. to 1800 (like Sihui) avoids this behaviour, because these values are reported before the heal comes along.
Once the stage is put into HEAL_START at 14:25:00, the binding requests a report from the controller almost immediately afterwards. The node responds with some state updates, but has not yet replied with the report from the RequestNodeNeighborUpdate. At 14:25:06, the binding decides to send the RequestNodeNeighborUpdate again; this is the logic in the ZwaveTransactionManager class labelled ‘timer2’.
The binding is requesting another RequestNodeNeighborUpdate because it hasn’t received the report back from the controller, which is still working on it. The controller responds with a failure because the node is still processing the previous RequestNodeNeighborUpdate. Whatever logic controls the state of the node retries 3 times, but never waits for the node to figure out who its neighbors are.
IMHO, this is a little bit clearer from the logs themselves because you can see the retries with the same TID before the transaction manager decides the transaction failed.
I’m not confident in how the underlying network functions, but I would imagine that a node querying a network with 90 neighbors might take a little longer than the 5 seconds that the binding is giving the controller to complete the transaction.
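As a rough model of that race (hypothetical names and timings; this is not the actual ZwaveTransactionManager code, and the 5000 ms ‘timer2’ value is my reading of the 14:25:00 → 14:25:06 gap in the log): if the neighbor-update report is still outstanding when the timer fires, a duplicate request goes out while the node is still busy.

```java
// Hypothetical model of the 'timer2' resend decision; the names and the
// 5000 ms figure are assumptions based on the log, not the real code.
public class ResendSketch {
    static final long TIMER2_MS = 5000; // assumed per-transaction wait before resending

    // Resend iff the timer has expired and no report has arrived yet.
    static boolean shouldResend(long elapsedMs, boolean reportReceived) {
        return elapsedMs >= TIMER2_MS && !reportReceived;
    }

    public static void main(String[] args) {
        // A node with ~90 neighbors may need more than 5 s to finish its
        // neighbor discovery, so at the 5 s mark the report is still
        // outstanding: a duplicate RequestNodeNeighborUpdate is sent,
        // which the busy controller answers with a failure.
        System.out.println(shouldResend(5000, false)); // true: duplicate request goes out
        System.out.println(shouldResend(5000, true));  // false: report arrived in time
    }
}
```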
I was hesitant to create a pull request because I’m not confident this is the correct place to change the handling of this particular transaction, but I went ahead and created one just to show exactly where I made the change. I’ve opened an issue and will submit the pull request shortly.