Troubleshooting Z-Wave

rliffredo · March 2, 2020, 8:25am

Lately, my Z-Wave network has become a bit unreliable.
Essentially, commands are sometimes not sent, and I have little idea where the problem is.

I have a presence rule in openHAB: when the door opens (node 42) or motion is detected (node 14), the light (node 8) should go on. Switching off the light is a bit more complex, and involves a few steps and timers.
Most of the time, the setup works flawlessly; but every now and then it “stops working”, with delays of 10-20 seconds.

This morning, for instance, I had the issue in the screenshot.
What really strikes me is the huge time difference between the COMMAND RECEIVED and the actual transmission.

Is this something related to the binding/openHAB? Or to the controller? Or to something else?
Note that a few weeks ago I bought a second controller as a backup (the same model, an Aeon Stick Gen5, but with slightly newer firmware), and I started using it after copying the data over.

Here is the log: zwave.problem.log (152.5 KB)

mhilbush · March 2, 2020, 12:33pm

The very first line in your screenshot is complaining about node 65 not getting an ACK after 7297 ms. Without looking at the full log file, this would suggest there might be an issue communicating with node 65 (which appears to be a battery-powered device).

mhilbush · March 2, 2020, 12:41pm

Without seeing more log data, these timeouts, which are occurring at the same time the command is being sent to node 8, are likely the cause of the delay, IMO.

2020-03-02 06:21:45.066 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 56: TID 45626: Timeout at state ABORTED. 3 retries remaining.
2020-03-02 06:21:50.093 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 59: TID 45628: Timeout at state WAIT_REQUEST. 3 retries remaining.
2020-03-02 06:22:02.096 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 59: TID 45628: Timeout at state ABORTED. 3 retries remaining.
2020-03-02 06:22:07.123 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 65: TID 45629: Timeout at state WAIT_REQUEST. 3 retries remaining.
2020-03-02 06:22:19.128 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 65: TID 45629: Timeout at state ABORTED. 3 retries remaining.
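
To see how long these timeouts keep the transaction manager busy, the gaps between the lines can be computed. The following is a minimal Python sketch, not part of the binding: the regular expression is derived only from the five lines quoted above, and the names `parse_timeouts` and `gaps_in_seconds` are made up for illustration. Other line formats in the full log are not handled.

```python
import re
from datetime import datetime

# Matches only the ZWaveTransactionTimer timeout lines quoted above.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[DEBUG\] .*?"
    r"NODE (?P<node>\d+): TID (?P<tid>\d+): Timeout at state (?P<state>\w+)\."
)


def parse_timeouts(lines):
    """Return (timestamp, node, tid, state) tuples for every timeout line."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append(
                (ts, int(match.group("node")), int(match.group("tid")), match.group("state"))
            )
    return events


def gaps_in_seconds(events):
    """Return the time in seconds between each timeout and the previous one."""
    return [
        round((later[0] - earlier[0]).total_seconds(), 3)
        for earlier, later in zip(events, events[1:])
    ]


# The five log lines from the post, copied verbatim.
SAMPLE = [
    "2020-03-02 06:21:45.066 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 56: TID 45626: Timeout at state ABORTED. 3 retries remaining.",
    "2020-03-02 06:21:50.093 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 59: TID 45628: Timeout at state WAIT_REQUEST. 3 retries remaining.",
    "2020-03-02 06:22:02.096 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 59: TID 45628: Timeout at state ABORTED. 3 retries remaining.",
    "2020-03-02 06:22:07.123 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 65: TID 45629: Timeout at state WAIT_REQUEST. 3 retries remaining.",
    "2020-03-02 06:22:19.128 [DEBUG] [ocol.ZWaveTransactionManager$ZWaveTransactionTimer] - NODE 65: TID 45629: Timeout at state ABORTED. 3 retries remaining.",
]

if __name__ == "__main__":
    events = parse_timeouts(SAMPLE)
    gaps = gaps_in_seconds(events)
    for (ts, node, tid, state), gap in zip(events[1:], gaps):
        print(f"NODE {node} TID {tid} {state}: {gap} s after the previous timeout")
    total = (events[-1][0] - events[0][0]).total_seconds()
    print(f"Total span: {total} s")
```

For these five lines, the sequence runs from 06:21:45.066 to 06:22:19.128, a span of about 34 seconds.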

rliffredo · March 2, 2020, 1:24pm

That’s a Danfoss thermostat LC-13.
A few weeks ago I had to move the controller, and then it started acting weird; I excluded and included it again, and it seemed to work.

Is it possible that the timeouts are because of the nightly self-heal? There were a lot of long response times from many devices, after the heal.

mhilbush · March 2, 2020, 1:42pm

That’s a possibility. The nightly heal screws up my network so badly that I’ve disabled it.

5iver · March 2, 2020, 1:51pm

Moving the controller will change all the routes that have been discovered. If you have newer devices, everything should sort itself out eventually, but Z-Wave performance may be slow in the meantime. Older devices may need a heal to get communications working again. The network-wide heal currently has some issues. I also have mine disabled because it makes my network completely unusable. If/when it is needed, I heal individual devices.

How many nodes are in your network and how many are mains powered vs battery?

rliffredo · March 2, 2020, 3:08pm

I have about 40 nodes, 17 of which are mains powered; in every room there is at least one powered node. All of the nodes should be Z-Wave Plus, with the exception of the Danfoss thermostats.

Since the move I have healed every node; and since the self-heal was running every night at 2am, I would expect any routing problems to be over by now.

The strange thing is that self-heal was always enabled; but the problems seem to be worse now.
I cannot say if this is because of the change in the stick (the new one has a newer firmware), because of a change in the location (before it was in the very center of the apartment), because of some change to configuration, or because of a change in OH (honestly, I tend to exclude this last option).

Bruce_Osborne · March 2, 2020, 3:13pm

In my experience, routing does not always occur as you would expect.
I have obtained a UZB3 stick with Zniffer software to investigate my network further. Zombie or ghost nodes can cause havoc too.

rliffredo · March 2, 2020, 6:05pm

Yes, at a certain point in the past I had a ghost node, and it was causing a lot of trouble.
Then I discovered how to remove it, and for a couple of months I was in (z-wave) bliss.
But now there are no ghosts, and everything seems OK from that point of view.

I am taking your advice, and I have just ordered a UZB3 stick.
Hopefully, that knowledge will help in debugging the issues I am currently facing.

That said, it’s quite sad that to get a reliable Z-Wave mesh you have to buy this kind of equipment and get into this level of detail.

5iver · March 2, 2020, 7:34pm

You could have used your older zstick too… that is what I am using.

robmac · March 3, 2020, 8:46am

Is it not possible to put it back where it was? The fewer indirect links the better.

That would need some fixing if you move things around. You might find battery life is poor until you get it all working again.

rliffredo · March 3, 2020, 11:55am

If I can’t solve the problems, I might decide to move it back; but I would like to do whatever is possible to keep it in its current place.

What do you mean by fixing?
After moving the controller, I have healed every node manually, and on top of that, the nightly healing should still do its job. Is there anything else I should do to improve the situation?

rliffredo · March 3, 2020, 11:57am

Too late, I had already ordered it when I read your post.
Now, I want to think that this is still a good idea, since I wanted to keep the older zstick around as a backup. Please don’t tell me otherwise.

5iver · March 3, 2020, 1:13pm

You can try disabling the daily heal and restarting OH to clear the stuck threads. Many of us have to do this in order to have a useable network.

mhilbush · March 3, 2020, 1:31pm

See my post here to see how to determine if there are “stuck” heal threads.

rliffredo · March 4, 2020, 2:59pm

That’s very good info!

I started to monitor that, and at a certain point I could see that waiting threads seemed to be a limiting factor:

Restarted OH
The binding is querying all nodes
There are about 30 threads waiting for the node
No command gets through: I try to turn on a light, and it does not work
One node finally wakes up and gives a response: all commands that were “queued” until that point in time are then executed
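
The sequence above can be pictured with a toy model of head-of-line blocking. This is a minimal Python sketch under a loud assumption: it treats the queue as strictly serial, which is a simplification and not a description of how the binding actually schedules transactions. The function name `completion_times` and all durations are hypothetical.

```python
def completion_times(queue, duration_ms):
    """Simulate a strictly serial queue.

    queue:       list of (label, node) pairs, in the order they were queued
    duration_ms: dict mapping node -> milliseconds until its transaction
                 finishes (either with a response or with a timeout)

    Returns a dict mapping each label to the time (in ms since the first
    transaction started) at which that transaction completes.
    """
    clock = 0
    finished = {}
    for label, node in queue:
        # A transaction can only start once the previous one has finished.
        clock += duration_ms[node]
        finished[label] = clock
    return finished


if __name__ == "__main__":
    # Hypothetical numbers: two sleeping nodes that each block the queue
    # for 17 s, followed by a mains-powered light that answers in 50 ms.
    queue = [("poll node 59", 59), ("poll node 65", 65), ("light on", 8)]
    durations = {59: 17000, 65: 17000, 8: 50}
    for label, done_at in completion_times(queue, durations).items():
        print(f"{label}: done after {done_at} ms")
```

In this model the light command itself takes only 50 ms, but it completes after 34050 ms because it has to wait for the two slow transactions ahead of it.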

Is it possible to increase the number of threads available for Z-Wave?
Would that help in these kinds of exceptional circumstances?

mhilbush · March 4, 2020, 3:52pm

After a restart it will normally be slow. This is because the binding interrogates all the devices during startup (this is called initialization). Initialization can take some time, especially if you have a large network. It can take even longer for the initialization to fully complete when you have battery-powered devices, as they need to wake up before they can be interrogated by the binding.

The command I referenced in the post is only meaningful when things are in steady state. Running this soon after a startup will not yield meaningful results.

Increasing the number of threads is not necessary.

No, it would not help in those circumstances.

rliffredo · March 4, 2020, 4:38pm

Yes, during startup the system is slow; and that’s expected. OTOH, you might see that as a load test.

And that’s not so far from certain situations I have seen: the network works OK, but then I have many guests, quite a lot of updates from all sensors at the same time, and typically something stops working.

The problem I have is that the network is in a steady state most of the time, but issues rarely happen when the system is in a steady state.

rliffredo · March 8, 2020, 2:37pm

Just an update.

I have followed some of the advice found in this topic and in the forum.

In particular:

One node’s battery was dead, so the node was not responding. I have replaced it. Note for the future: the battery information is not always reliable, so I should have some other way to check for offline nodes.
Reduced the amount of traffic on the network, both by increasing the wake-up interval and by limiting the amount of data reported back (e.g. energy consumption for some switches).
Stopped healing; I now enable it only when adding new nodes.
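
One possible "other way" to spot offline nodes, independent of the reported battery level, is to flag any node that has been silent for much longer than its expected reporting interval. The following is a minimal Python sketch of that idea only. The function name `find_silent_nodes`, the node numbers, the intervals, and the timestamps are all made up for illustration, and how the last-seen times are collected (for example from item update timestamps) is left open.

```python
from datetime import datetime, timedelta


def find_silent_nodes(last_seen, expected_interval, now, tolerance=2.0):
    """Return the ids of nodes that have been silent for too long.

    last_seen:         dict node id -> datetime of the last message received
    expected_interval: dict node id -> timedelta between expected reports
                       (for a battery device, e.g. its wake-up interval)
    now:               datetime to compare against
    tolerance:         how many intervals may pass before a node is flagged
    """
    silent = []
    for node, seen in last_seen.items():
        limit = expected_interval[node] * tolerance
        if now - seen > limit:
            silent.append(node)
    return sorted(silent)


if __name__ == "__main__":
    # Hypothetical data: node 103 has not reported for two days.
    now = datetime(2020, 3, 8, 14, 0)
    last_seen = {
        101: now - timedelta(minutes=10),
        102: now - timedelta(hours=3),
        103: now - timedelta(days=2),
    }
    expected_interval = {
        101: timedelta(hours=1),
        102: timedelta(hours=4),
        103: timedelta(minutes=30),
    }
    print(find_silent_nodes(last_seen, expected_interval, now))
```

With this made-up data only node 103 is flagged; the `tolerance` factor keeps a single missed report from raising an alarm.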

Now, it is a bit too early to celebrate, but it looks like the situation has already improved a lot.
The system has finally started to react consistently fast and reliably.

Thanks @Bruce_Osborne, @5iver, @mhilbush, and @robmac for your help!

mhilbush · March 8, 2020, 3:27pm

Nice work!