ZWave Binding (2.4 and 2.5M3) appears to provoke severe latency with Aeotec G5 Z-Stick after rebuild

Summary

I lost my OH2.4 system (HDD failure) 2 weeks ago, and after rebuilds, have been trying to diagnose the issue since. I sincerely hope I’m overlooking something very simple, perhaps some config/setting set years ago, etc. that I’m missing. Looking for suggesting as I’ve been troubleshooting (probably over-analyzing) this for 2 weeks now.

It doesn’t make logical sense that it is the binding as I’ve started from a clean install of OH2.4.0/Zwave Binding 2.4.0 which is the same version I’ve been on over 9 months. However, since the rebuild, when OH is started, the ZWave stick status LED (which normally cycles a color pattern ~0.5 seconds, looks like it “freezes” but if you wait long enough, will cycle to the next color. It varies, but I’ve timed this at ~12minutes as an example. The response of z-wave devices seem to follow suit. (for example, if I switch a device on/off in Paper-UI, the delay for when it actually occurs seem to coincide to a similar timing) If using the Sigma “Z-Wave PC-Controller” v4.78 software (aka ZenSys Tools), I can switch devices on/off immediately, leave it running for hours/overnight, etc., no issues I can tell.

I have switched PCs, Z-Stick, software and everything possible I can think of to rule out every variable I can think of (as much as possible without the benefit of the original drive). I’ve tried to capture all of this below which hopefully helps identify some “ah-ha” thing I’m missing or determine if actually a bug somewhere.

Details / Troubleshooting / Observations
(not necessarily in order)

  • Prior to failure:

    • OH 2.4.0 Stable Rel
    • 2 core/ 4GB RAM dedicated system. (Small Form Factor PC)
    • Aeotec G5 ZWave Stick
    • Ubuntu 14.x LTS (not 100% sure now)
    • Been running with no issues since 2.4 was released I believe last Dec
  • HDD issue when OH system was powered off inadvertently.

    • ironically when breakers off to install 14 new Z-Wave switches / devices
    • (Yes, on an APC UPS, and yes I know the battery was overdue to be replaced :slight_smile:
  • Rebuilt PC with new SSD HDD

    • (had already planned to replace spinning HDD with new SSD some many months ago, so again ironically, already had new SSD waiting for this)
    • Ubuntu 18.04 LTS
    • Java OpenJDK 8 JRE/JDK Headless (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
    • Initially installed OH 2.5.0M3, when realized this behaviour, removed and started clean install of 2.4.0
  • When OH 2.4.0 (and 2.5.0.M3) starts, Aeotec G5 stick appears to freeze during each OH initialization…

    • Z-Stick LED, which typically sequence between red/blue/yellow every ~0.5 seconds can then take about ~12 minutes to sequence.
    • After OH2.x initialization and lock-up occurs,reinserting z-stick (power cycle) returns led status sequence to normal
  • No obvious errors/issue (to me) with debug logging in OH for “org.openhab.binding.zwave” (and at ROOT level) observed.

  • Using Chris’s z-wave log viewer with zwave at debug level, nothing obvious (to me) either.

  • Devices to respond after some time, could be ~15m-ish to immediate (believe could be timing related to above window)

    • (appears related to the zwave stick led status sequence timing)
  • Monitoring/Operating Z-Wave stick (via Sigma Z-Wave PC Controller) for long periods of time:

    • No z-stick “lock-ups” (LED sequence / cycle normal)
    • see what appears to be typical zwave packets… data received, etc.
    • basic on/off control operations sent to devices are immediate
  • Originally thought Aeotec Home Energy Monitor (new device) might be “flooding” the zwave network based upon forums.

    • unplugged initially
    • then completely removed node from ZWave g5 stick.
    • remains power off / not joined
  • OH Version

    • Clean new install of 2.5.0.M3
    • downgraded Stable 2.4.0
    • Clean new install of Stable 2.4.0 (Current state)
  • Same behaviour with original/current and new physical Aeotec g5 Z-Sticks

  • OS/Ubuntu versions

    • As mentioned above, new install on 18.04LTS
    • Tested another clean OH2.4 install on 16.04LTS
      • with all newly added items removed from Z-Stick
  • permissions - running OH as root vs typically as openhab

  • removing all recently added z-wave nodes (>= node ID: 45)

    • incrementally removed z-zwave node IDs from 58 down to 44 (reboot/reset between each)
  • At various stages, gave the OH system lots of time (many hours / overnight) to let Z-Wave network healing processes, etc. basically time to do its thing, etc.

Again, hope I’m missing something simple!! Thanks all!

Sounds like you’ve done thorough troubleshooting. one thing worth trying is changing out the batteries for all battery operated devices. Even if they’re reporting +50%.

I’ve noted some locks or motion sensors start to drop off zwave networks when they reach under 70% and cause latency problems

1 Like

Originally I thought you forgot the soft reset of your zwave stick, but no, you did.
So the only thing I can think of is clearing the cache because you changed install from 2.5M3 to 2.4 and vicaversa:

2 Likes

I suggest you stay on 2.5M3, since there are a lot of improvements in OHC and the addons compared to 2.4. As for your issue, it sounds very similar to something that I had reported, but appears to not affect everyone. The workaround that I used was to disable the daily network zwave heal, restart OH, and then to manually heal the devices. Using a zwave sniffer helped a lot. The daily heal still causes commands to be severely delayed or a “Queue Full” message in the log, but hopefully someday this will be addressed.

2 Likes

Thanks for the suggestions so far:

More info, forgot on original post:

  • Tried to Zulu8 JRE (no diff, switched back to openJDK)
  • Switching from 2.5 back to 2.4, complete removal of all OH dirs. Also, new PCs and VMs so complete new OS when trying to rule out Ubuntu 18.x had some crazy usb/serial/os odd interaction.
  • OH Zwave logs at debug level. similar periods of time where no zwave activity, then lots of activity at once and the multiple devices respond at same time.

To clarify on “latency”, examples:

(via Paper-UI)
13:38:45 Turn Lamp A On
13:39:37 Turn Lamp B On

13:44:14 Lamp A & B Turn On
quite a bit of zwave debug activity at that time as well (when lamps actually turn on).

Another timing sample:

15:37:53 Turn Lamp A Off
15:42:50 Lamp A Off (4m 57s)

It has also been much longer delays (like 10-15m) in other timing tests.
.
Results for some of the suggestions:

  1. Clear Cache (v2.4.0 Stable)

  2. taking battery operated devices (BoD) offline:

Suggestion did elevate my suspicion (had in back of mind… have several Aeotec mulit-sensors… optionally BoD but have all using usb pwr src… however, one of them seems like doesn’t report all values consistently… think was reporting motions events properly but temp/hum, seems to not be working)… thought maybe the OH-Zwave init process may be waiting/hanging somehow on that device, so just took it offline completely for now.

  1. reapply 2.5.0.M3 (wanted to anyway, 2.5 REL should be out pretty soon I suspect)
  • Shutdown OH processes
  • Clear OH Cache / Temp
  • update apt distro sources to “deb https://openhab.jfrog.io/openhab/openhab-linuxpkg testing main”; apt update
  • apt install openhab2 (updates from 2.4.0-1 -->> 2.5.0~M3-1)
  • Clear OH Cache / Temp (just to be sure)
  • service openhab2 start
  • (waiting for oh to finish updating modules, etc)
  • confirmed ZWave Binding as: 2.5.0.M3

Sidenote: OH 2.5.0.M3 doesn’t appear to have completely updated correctly. Main indication, hapanel doesn’t start… shows installed, uninstalled/reinstalled, but doesn’t show avail.

I will probably do complete removal (including manually removing all dirs,etc), but I’m trying not to change too many items at once.

@5iver

I believe I’m seeing very similar activity in your issue#1178 (while a lot looks normal, entries where “Timeout at state …” and " … Transaction Cancelled…" see that a lot as well. I’m curious to know if that is normal or not. Disabled nightly heal as well.

Also, using the Silicon Labs Zniffer? Have any trouble getting it going? I did get that the other day when 1st trying to debug, but after flashing the zniffer firmware to a spare Aeotec Gen5 (used sniffer_ZW050x_USBVCP.hex 1st) Zniffer looked all good but never saw any packets. I then tried the sniffer_ZW050x.hex and looks like I bricked the stick. Said flashed OK, but now USB doesn’t identify the device anymore. I was hoping to see what is actually is actually getting sent over the air to hopefully help rule something in/out. (ordered a zwave.me ZWB and hopefully will have better look with zniffer with it)

With all the above, can’t say I’ve noticed any progress yet. Will work on a full clean/fresh install of 2.5.0.M3 asap.

You mentioned you added a new zwave device. Might it be one of your old ones also flooding the zwave network?

I recent had this exact issue with a Zooz external switch. It has a feature to report energy, but out of the box it reports once per second(!!).

When installing this switch, you must set the interval to something more reasonable. Otherwise, the massive network traffic will cause all types of unexpected delays and problems. My mistake was forgetting to set the parameters when I removed and re-paired the device to work on something else. After tracing the cause to this switch and fixing the parameters, the network recovered back to it’s usual performance.

1 Like

Latency like this is nearly always related to a device timing out, and therefore delaying other traffic. This should be pretty clear from a debug log. I’d really suggest to provide the log, or if you don’t want to provide logs, then use the log viewer to analyse it yourself.

https://www.cd-jackson.com/index.php/openhab/zwave-log-viewer

1 Like

@Chris If you could take a look at some of the logs, I would greatly appreciate it. I’ve uploaded to the log viewer several at various times and haven’t seen anything that is obvious (to me). I’ve copied a couple days worth of log via dropbox link here.

Some additional bits of info that may be useful:

  • all of the new devices (nodes 45-57), I don’t think I have seen them “work” via openhab control, even with the large delay as seen with the devices that were present prior.
  • All the new devices are 10 x “NuTone NWS15Z Smart Z-Wave Enabled Wall Switch”, and 3 x “NuTone NWD500Z Smart Z-Wave Enabled Wall Dimmer Switch” They appear to be “Linear Corp” devices according to HABmin.
  • Using Zensys Tools with the Z-Stick, all the devices (random sampling of wall switches / lamp module, existing and new) respond immediately to a basic on/off commands.
  • There was a new node “58”, being the “Aeotec Home Energy Monitor”, which was deleted from OH and device has been powered-off since about the 2nd day that this troubleshooting started. Originally thought that was “flooding” the zwave network based upon searching the symptoms. (Roy, to your thoughts, was thinking the same with this device.)

Let me know if any other info/details, but hoping it is something obvious in the logs.

Thanks!!

I have found the root of this issue and now resolved.

The problem turned out to be in the z-stick’s configuration/pairing data (eeprom) in the region where nodes ~44 - 51 are stored. (either corrupted… more likely… or possibly faulty eeprom in this region.)

General info on how this was determined icyi (Note, 44 was the last node added ~ 2yrs ago):

  • Was in the process of another 100% clean 2.5.0.m3 install. The range of nodes > 44 showed as unknown. (all the newly added devices) Odd, they appeared to before…
  • They were recognized before and was thinking… Perhaps these particular devices weren’t actually in the zwave DB. started consulting Chris’s ZWave DB for brand/model, etc… these specific brand/model weren’t listed in there, thought I had solved this here, they weren’t in the DB… (but added another mystery as to how they might have been identified before. And figured somehow in the downgrade/upgrade they got recognized. Maybe not correctly, etc… That theory really didn’t hold water, but at this point, grasping for straws :slight_smile: ) In the OH zwave node xml files, all these where showing obviously unknown manufacturer/deviceID/type (something like 0xfffffff, don’t recall exact value)
  • Still going down the path of needed to help create DB entries for these, using ZenSys tools, wanted to get details like manf / id types to confirm they didn’t exist in the DB under other brand / model names.
  • Since I was in ZenSys tools, I thought it would be good to confirm each node I could (not sensors/PIRs, etc) that it responded/worked to basic on/off commands. (had before, but did random samples). When I got to some of these nodes to send a on/off, I got the symptoms similar to when OH was started (z-wave binding starting) that the z-stick status LED “froze”, ZenSys tools froze, for 30ish seconds, etc. (I believe just trying to send a nop packet did the same) With 8 of these nodes behaving this way, it now started to make sense with OH startup/operating that it would appear to freeze for many minutes, etc.
  • Went to remove these nodes from the z-stick using the normal exclusion process. All nodes appeared to get excluded (z-stick led status / sequence, etc). However, these in this range didn’t. Using ZenSys tools to force exclude failed as well.
  • At this point, got another brand new z-stick. Excluded every node I could from this one… All excluded except this range. On the device side, they completed their “exclude” process fine.
  • I did a hard/factory reset on the z-stick and it reset/cleaned everything successfully (it appears), so I’m leaning more toward corrupted configuration than physical problems with the eeprom. With the time invested in all this, wanting to know for sure… I’ve got snapshots of the eeprom data from various points in time, can recreate the contents, test some writes to those eeprom regions, etc. but may be a bit… Need to catch up on other stuff…

Looking back to the initial troubleshooting process, and trying to rule out the z-stick, it would not have accounted for this.

At this point, most everything is configured and running (all the previously existing items, most of the new switches, the new Aeotec Home Energy Monitor, the original culprit) and more. Everything appears to be even more responsive, which makes me wonder if that last node added ~ 2hrs ago (sensor at ID 44 which wouldn’t delete) wasn’t partly corrupted, then adding 14 new devices, where 8 of them were impacted by that, brought this to light.

Thanks for the help and suggestions!

1 Like