Z-Wave troubles - unresponsive network until bundle restart

Hello, and Happy New Year to all!

I have been having so much trouble with my Z-Wave network recently and wanted to get people’s thoughts as I am running out of ideas.

My ~40 node network that had been functioning fine prior to 2.4 now sporadically becomes unresponsive to sent messages - turning on lights, setting TRV setpoints, switching on central heating etc. During this time received messages seem to be coming through just fine - temperature sensors, energy readings etc. Occasionally, the network suddenly starts responding again and the sent messages are consumed - you can tell this is happening when the kitchen lights start dimming up and down, responding to commands that were maybe sent one hour ago in consecutive order.

A bundle restart in Karaf seems to clear this up until the next time it happens.

In a desperate attempt to solve this issue, I hard reset my Z-Stick yesterday and started rebuilding my network again. It was all functioning well until I added nodes 16 (Secure Temperature Sensor), 17, 18 and 20 (TKB energy metering power plugs) today. I will go ahead and remove these one by one to see if this improves the situation.

I have gone through the Z-Wave logs and apart from some NOACK statuses after ~8000ms on a few nodes I cannot see any one culprit device, or any screaming error messages anywhere. I am attaching one recent log file which spans a period of unresponsiveness followed by a bundle restart.

Z-Wave.log
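For anyone wanting to check their own log for NOACK hot spots, here is a minimal Python sketch that counts NOACK lines per node. It assumes each relevant debug line contains a `NODE <id>:` prefix and the literal text `NOACK`; the function name `count_noack_by_node` is made up for illustration, so adjust the pattern to whatever your log actually shows.

```python
import re
from collections import Counter

# Assumption: binding debug lines carry a "NODE <id>:" prefix.
NODE_RE = re.compile(r"NODE (\d+):")


def count_noack_by_node(lines):
    """Return a Counter mapping node id -> number of lines mentioning NOACK."""
    counts = Counter()
    for line in lines:
        # Skip anything that is not a NOACK status line.
        if "NOACK" not in line:
            continue
        match = NODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open file handle for the log works, since the function only iterates over lines; `count_noack_by_node(fh).most_common()` then lists the noisiest nodes first. A node that dominates the list would be a reasonable first candidate to remove.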

Do any of our Z-Wave gurus have any experience with this sort of issue? @chris? Any help is much appreciated.

OH2.4 stable on Debian Stretch, Aeotec Z-Stick Gen5

Thanks!
Yetkin

I seem to be having a similar issue. Since 2.4, roughly every day the Z-Wave binding becomes unresponsive and stops receiving updates from devices. Commands to devices still work, but when I press a light switch, nothing gets reported to openHAB. I have not tried restarting the binding to see if it fixes things.

I’m on OH2 2.5 SNAPSHOT and I don’t see this. All is well there.
Aeotec Gen5 USB stick. Ubuntu Server 16.04 LTS. x64 Intel platform. ~60 nodes.

Okay, just grabbed the latest 2.5 snapshot and installed it. Let’s see if this fixes things.

Let’s hope so …

No such luck. Had an issue tonight where nothing was responsive, then after about 8 minutes all of the events started being processed. I had Z-Wave debug logs enabled at the time, and the weird part is that the binding was seeing the messages from the switches as they came in, but openHAB didn’t seem to be processing them. The only thing that seems out of the ordinary to me in the logs is some entries from the Ecobee binding showing connection timeouts. Here are the logs from the relevant time (ignore the link and unlink messages, I was testing if that was the cause):
openhab.log
events.log

Just to test things, I disabled the Ecobee and Harmony bindings last night, and so far things have been responsive as expected. I’m not sure how long the Ecobee issues were going on; I feel like they’ve been having some server issues the past week or so. I have not updated my Harmony to the beta firmware that re-enables XMPP. I’m wondering if, at least in my case, this is an issue with the updates from the switches getting caught in some openHAB queue, rather than strictly an issue with the Z-Wave binding.

Still having issues even with the Ecobee binding disabled. The debug data does show the binding receiving the updates on time, but nothing is happening after that.

I had a quick look at the log you posted above, and I don’t see any problems from a ZWave perspective. The binding is occasionally polling the devices, and is getting responses ok. I also see state updates coming in from devices that are unsolicited, which shows that associations are set ok, and again these are updating the channels.

There is a bit of a gap of incoming data in the middle of the log for a minute or two, but I can’t judge if this is just that devices didn’t report during that time or if there was some other problem - such gaps would not be abnormal since devices shouldn’t normally be reporting data too often. If there is a problem, I don’t think it’s with the ZWave binding as it was working before and after, and there is also a scene update in the middle and there’s no sign of any issue.

Can you explain what this means? Do you mean the log shows the state updates are received, and the channels are being updated, but you don’t see the updates in the UI? Or something else?

I was watching the logs one of the times I had the issue. I pressed the light switch and the binding showed the update from the switch in the log, but the UI didn’t report anything, and the item state didn’t change in the events log until a few minutes later. When the item state changed, the rule that was tied to the switch fired. So the Z-Wave network side of things seems fine. The lag seems to be somewhere between the binding processing the event and openHAB seeing it.
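To put a number on that lag, one option is to compare the timestamp of the binding's debug line with the timestamp of the matching line in events.log. A rough Python sketch, assuming both logs start every line with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp; the helper names and the sample log lines (including the logger name) are invented for illustration:

```python
from datetime import datetime

# Assumption: every log line begins with a timestamp in this shape.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = len("2019-01-05 20:15:03.123")


def parse_timestamp(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def lag_seconds(binding_line, event_line):
    """Seconds between the binding seeing a report and the item state changing."""
    delta = parse_timestamp(event_line) - parse_timestamp(binding_line)
    return delta.total_seconds()


# Made-up sample lines, not taken from the attached logs.
binding_line = "2019-01-05 20:15:03.123 [DEBUG] [hypothetical.binding.Logger] - NODE 5: report received"
event_line = "2019-01-05 20:18:33.623 [vent.ItemStateChangedEvent] - Hall_Switch changed from OFF to ON"
print(lag_seconds(binding_line, event_line))
```

A lag of a second or less would point back at the device or network; a lag of minutes, as described above, suggests the delay sits after the binding has already handled the message.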

What HW platform are you on? (couldn’t spot it in your first post if it was there)
Have you tried monitoring with a ‘top’ command when things are slow?
Watch out for both free memory and CPU hogs.
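If sitting in front of `top` at the right moment is impractical, the same numbers can be sampled from `/proc` on Linux and logged. A small Python sketch, assuming a Linux kernel recent enough to report `MemAvailable` in `/proc/meminfo`; the function names are illustrative only:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def snapshot():
    """Read available memory (kB, or None if not reported) and load averages."""
    with open("/proc/meminfo") as fh:
        mem = parse_meminfo(fh.read())
    with open("/proc/loadavg") as fh:
        load = parse_loadavg(fh.read())
    return mem.get("MemAvailable"), load
```

Calling `snapshot()` from a loop or a cron job and writing the result next to a timestamp gives something to line up against the periods when the system feels laggy.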

Raspi 3 B rev 1.2. Under normal load the CPU is barely in use, and about 512MB of RAM is being used, most of that by openHAB. The web UI and ssh both feel perfectly responsive when things get laggy, but I have not checked loadavg, CPU, or free mem during that time. I have an Aeotec Gen5 stick too, and I think 10 Z-Wave nodes. The only other bindings running at this point are Astro and Sonos.

Sorry. Out of ideas. Could it be Java related?
What Java version are you on?

openjdk version “1.8.0_152”. One thing that may or may not be related: the last time I had this issue, it popped up after removing my Christmas tree from the Z-Wave network. I restarted the bundle after that, and have not had issues since. I was removing and re-adding devices a lot over the past two weeks with the 2.4 upgrade, so it’s possible that’s related.

What kind of Z-wave controller?
If you just power off devices without removing them from the controller, each zombie node can introduce long delays: N x 5 s for each missing ACK.

Aeotec 5th gen Z-Stick. And by remove, I did mean “put the controller in exclusion mode, then follow the exclusion procedure on the device,” so the Christmas tree was properly removed from the network.

Ok. Then I’m truly out of ideas… Sorry.

What exactly is the device that ran your xmas lights?

Do a quick “scan” in the inbox on the Z-Wave binding to see if the removed nodes show up again.

It’s a Jasco/GE 12718. Everything has been running fine since restarting the binding after removing the device. I don’t think the device had anything to do with it, but wonder if putting the controller into exclusion mode may have caused some issues. I was excluding and re-adding a lot of devices to fix some issues where devices weren’t reporting back their status after updating to 2.4.