Zwave network issues after 2.4->2.5 upgrade, reproducible

Hi!

I’m running OpenHAB on CentOS as a guest VM in ESXi. I’ve attempted the RPM update of 2.4->2.5 a few times now with the same result every time: my Zwave network becomes slow at first, then goes totally unresponsive.
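(For reference, the update step itself is nothing exotic; roughly this, assuming the standard openHAB yum repo and package name:)

```
sudo systemctl stop openhab2
sudo yum upgrade openhab2        # pulls 2.5.x from the openHAB stable repo
sudo systemctl start openhab2
```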

Luckily I have snapshotted the VM pre-upgrade, and can revert back to 2.4 and reattempt as necessary. When reverting back to the snapshot, a VM reboot to reconnect the pass-thru USB devices is all it takes for my Zwave network to be fully happy, speedy, and predictable again.

I’ve seen a few topics on this with no solid single answer, so I’m starting my own topic, in the hope that my reproducible setup will be handy in tracking down a deeper issue.

I will attempt the update yet again tomorrow, for the 4th or 5th time, and predict the same outcome. That said, are there logs etc. that would be useful to capture pre-upgrade to compare with post-upgrade?

I plan to put 2.5 in debug mode for Zwave tomorrow, upload the logs to the Zwave log parser, and post those here. I also have a UBZ3 on the way so I can Zniffer the network itself.
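In case it helps anyone following along, the debug switch is just this (assuming the default Karaf console on port 8101; the debug output then lands in openhab.log, which is what the log viewer expects):

```
ssh -p 8101 openhab@localhost            # openHAB (Karaf) console
log:set DEBUG org.openhab.binding.zwave
```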

Welcome
Are you sure the issue is Z-Wave, and not invalid addons like restdocs causing all addons, including zwave, to reload every minute? Look closely for startup errors in the logs.

I did find that error a few attempts ago: restdocs and the mail action were in addons.cfg. I fix that before I even start up 2.5 now, so those errors are not in the openhab or events log anymore. But I did see that behavior in the past, and fixing it didn’t resolve what I believe are the Zwave issues.
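For anyone else hitting that, the stale entries looked roughly like this (illustrative addons.cfg fragment, not my full file; if I have the 2.5 change right, restdocs moved under ui and the mail action was replaced by the mail binding):

```
# conf/services/addons.cfg (or /etc/openhab2/services/addons.cfg on the RPM install)
# 2.4-era entries that 2.5 can no longer resolve, causing the add-on reload loop:
#   misc   = ...,restdocs
#   action = ...,mail
# 2.5 equivalents:
ui      = basic,paper,restdocs
binding = zwave,mail
```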

Not really. What matters is what is happening - not what was happening… What’s needed are the logs from the system when it’s not working.

Alrighty, back to 2.5. The node I tried to control here is Node 5. You can see me sending the command at 9:45:01, and the controller doesn’t send the command to the node until 9:45:14. Once it DOES send the actual command, the node responds in milliseconds.


2 CPUs, and the load never even comes close to using 1; 4 GB RAM with 1 GB untouched (+1.3 GB in buffer/cache); zero swapping. So it doesn’t appear to be a load issue.

It looks like the binding is busy with other things during this time. This is pretty normal. ZWave doesn’t really support multiple transactions.

I can understand that, but what I don’t understand is why moving to 2.5 triggers such a delay. Literally the exact same step order between 2.4 and 2.5 (sketched in shell after the list below) produces two drastically different results. I’ve done this multiple times now as mentioned, and I can’t figure out what or why.

1. Stop OpenHAB2
2. Clear the logs to make it nice and easy to see THIS session
3. Start OpenHAB2
4. Wait 10-15 minutes to let everything start up and be happy
5. Send an on/off command to the exact same node (a light beside me, so I can see exactly when it responds)

2.4 = milliseconds; 2.5 = 15-20 seconds.
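For completeness, here’s roughly what those steps look like on my box (a sketch assuming the RPM layout with systemd and logs under /var/log/openhab2; MyTestLight is a placeholder item name):

```
sudo systemctl stop openhab2
sudo truncate -s 0 /var/log/openhab2/openhab.log /var/log/openhab2/events.log
sudo systemctl start openhab2
sleep 900    # give it ~15 minutes to settle
# then toggle the same item and watch how long the light takes to react,
# e.g. via the REST API instead of the UI:
curl -X POST -H "Content-Type: text/plain" -d "ON" http://localhost:8080/rest/items/MyTestLight
```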

I’ve read the other threads; I know that not much was changed in the binding, and I can accept that. However, something changed somewhere that made 2.5’s performance this much worse than 2.4’s. My Zwave binding was updated from 2.4.0 -> 2.5.5 with the 2.5 update.

Maybe something about the security handling? My door locks, nodes 20 & 25, are pretty busy/noisy in these logs. (I do have haveged installed and running to ensure the box has plenty of entropy for anything that needs it; the available entropy rarely drops below 1800, whereas ~100-200 is normally where we see /dev/urandom problems.)
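If anyone wants to check the same thing on their box, the entropy counter is exposed by the kernel:

```
cat /proc/sys/kernel/random/entropy_avail   # sits around 1800 here; ~100-200 is the danger zone
systemctl status haveged                    # confirm haveged is actually running
```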

Here’s another example. Node 30, an outlet; it was 10 seconds before the message was sent:

Same point - the binding is doing something else. Or to be clearer - the ZWave network is doing something else.

It’s hard to tell exactly what’s happening from an image, but it’s clear there are other transactions taking place, and until those complete the binding will simply queue other messages. I could guess that node 25 is a FLiRS device and that is partly slowing things down, but the main cause here is that transaction 37 doesn’t get the data it requests, so it times out after 5 seconds and then releases your command.

You can’t just assume that this is caused by the binding update - you are looking at very specific, and very short transaction sequences.

You’re right, I can’t assume this is caused specifically by the Zwave binding update, and I wasn’t finger-pointing at it directly at all. I do apologize if you got that from my messages somehow. I mentioned that I had updated my binding during the 2.5 update purely as additional information.

I am having issues with the binding’s behavior, but I understand the binding itself could be the victim of another issue or change. I am looking at absolutely everything possible and every log I can find, as I really want to get to 2.5. That is also why I have a UBZ3 on the way, to see what’s going on in the network itself.

What I can assume, and have proven over and over, is that this is caused by something, somewhere, when going from 2.4.0 -> 2.5.5. Revert to my 2.4 VM snapshot, and everything is fine again. No other OS changes at all; all OS updates were done while on 2.4 to ensure minimal change during the actual 2.5 update. I’m running Oracle Java 8, as I know that’s the suggested version in the OpenHAB docs (but I’ve also tried OpenJDK just to rule that out; same behavior).

If you have more ideas of where to look or troubleshoot, by all means I’m open to trying whatever I can/getting more logs/longer transaction histories/etc.

Something that does seem to have caused grief to others going from 2.4 to 2.5 is the “nightly heal”. I’ve no idea if this might fire up after an upgrade.

It seems to be a repeating loop of commands being sent to my Schlage locks.

Node 37, the one that failed in my first screenshot, had been powered off beside me. I got it healthy, then started analyzing log stats. Node 25 and Node 20, both Zwave locks (Schlage BE469), accounted for the majority of the traffic on my Zwave network. I’m not sure if they act as FLiRS devices as you mentioned, @chris.

It seems both locks were repeating the same command sequence over and over, every 10-15 seconds. Once ONE of those sequences was done, it released the network to act on the other commands I was trying to send. I looked at my polling config for both devices and it was set to 1 day, so I’m not sure why it was polling this fast/often.
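One way I double-checked what the binding actually has stored for those Things (path assumes a zip-style install under /opt/openhab2; binding_pollperiod is, if I have the parameter name right, the “Polling Period” setting in seconds, so 86400 = 1 day):

```
grep -n '"binding_pollperiod"' /opt/openhab2/userdata/jsondb/org.eclipse.smarthome.core.thing.Thing.json
```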

I excluded both of those devices from my network, and it’s behaving properly now. So the question now is: why did the update to OH 2.5 trigger this repeating loop that kept the network busy? From what I can see, it was OH 2.5 that was TX’ing the start of transaction 46.

When you upgraded your system, did you copy over the security keys? That would cause this sort of thing where the secure device is not responding (because the key is wrong).

For this update today, I went back to my zipped install instead of messing with the RPM again (in prior attempts I had moved from 2.4 zip -> 2.4 RPM, then 2.4 RPM -> 2.5 RPM; I’ve tried about every convoluted upgrade method/order I could come up with :smiley: )

I moved my /opt/openhab2 out of the way, unzipped the new 2.5.5, did a bin/restore of the backup data, tweaked the settings as necessary (removing old addon configs like restdocs and mail), and started it up.
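Roughly, the sequence was (just a sketch; paths and the backup zip name are examples from my setup):

```
mv /opt/openhab2 /opt/openhab2-2.4                     # park the old install
unzip ~/openhab-2.5.5.zip -d /opt/openhab2
cd /opt/openhab2
./runtime/bin/restore /path/to/openhab2-backup.zip     # restore conf + userdata from the 2.4 backup
# fix up conf/services/addons.cfg (drop the old restdocs/mail entries), then:
./start.sh
```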

My Zwave stick still shows a Network Security Key in its Network Settings, and I didn’t make any changes around that field/value/config anywhere. I have a 3rd device running security, a garage door opener, and I can still control it fine with zero changes (no exclude/re-include). I believe that means the key is still the same, otherwise I couldn’t control a device that was already securely included?

Would we expect to see a key rejection/failure in the debug log or another log if the key was wrong somewhere?

Sure - it will create a random key, so there is a near zero chance it will be the same as before if you didn’t copy it over. Maybe your backup/restore sorted this - I’m not sure.

Clearly the device I looked at earlier was not responding to secure commands, so that probably explains the issue at least.

About once a year, my BE469s will lose their keys and need to be excluded, reset and reincluded. It always seems related to letting the batteries get too low. However, my NGD00Zs also drop their keys and they are hardwired.

I’ve just accepted this as part of regular Z-Wave maintenance. :slight_smile:


I dug through the backup zip and yes, the security key is backed up and restored. I dug through months of backups and the key was always the same, so it appears that didn’t change (it’s located in a few places in userdata/jsondb).
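For reference, this is roughly how I compared them (paths and the backup filename are examples from my install; security_networkkey is the controller Thing parameter, as I understand it):

```
# key currently in use:
grep -o '"security_networkkey": *"[^"]*"' /opt/openhab2/userdata/jsondb/org.eclipse.smarthome.core.thing.Thing.json
# key inside an older backup zip:
unzip -p openhab2-backup.zip 'userdata/jsondb/org.eclipse.smarthome.core.thing.Thing.json' \
  | grep -o '"security_networkkey": *"[^"]*"'
```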

Looking at node 25, one of the nodes in question from a prior screenshot, I see TX/RX/ACKs/etc. from what appears to be back-and-forth communication. That communication/command is what was happening every 15 seconds and had been keeping the controller so busy. Can you point me at the timestamp where it fails to talk? I’m probably misreading this somehow.

If my key were wrong, the garage door opener shouldn’t be working, and I could accept that security was totally broken due to a mismatched key. However, it IS working, all while the locks were going poll-crazy.

As @5iver just mentioned, I could accept the BE469 losing a key randomly as well; however, not at the exact time I do a 2.5 update, with everything magically fine again after reverting the VM snapshot, then broken again at the next update, etc.

The very specific call the BE469 kept spamming is a USER_CODE_GET. I’m about to re-include at least one of the locks on my network to see what happens. Maybe my lock is returning something that isn’t being handled, or an exception is being swallowed somewhere and kicking off a retry timer now? I wish I had 2.4 debug logs now to see how often that was called before :smiley:

Maybe 2.4 is what was “broke” and wasn’t handling this properly, and 2.5.5 has a fixed/good/proper handler for whatever is happening, and I just wasn’t noticing 2.4 was broken for some feature I wasn’t using.

10:12:11 is a good example. It is pretty clear that the device did not respond. I can’t tell you why - whether it was a security issue or not - but the protocol-level ACK was there, so clearly it wasn’t a general comms issue.

Well, I re-included the locks, and now they’re going crazy with a different call. I filtered the log to those nodes to see how repetitive it was; it appears to be the same behavior, bogging down the controller for other Zwave commands, just with a different call now:

Well, after 3-ish inclusions that were supposedly successful (the lock beeped, I got the green checkmark, and it showed up in my Inbox, but the device never seemed to resolve manufacturer/type/etc.), I decided to try something new.

I deleted the items from OpenHAB, shut it down, moved the USB stick to my laptop, excluded the devices with Zensys, shared the USB from my laptop over VMware, started OpenHAB back up, and did a new inclusion in OpenHAB with the USB stick in my laptop right next to the lock.

The handshake/inclusion took WAY longer, the logs flowed much more heavily with different calls instead of the same call repeating, and finally everything identified and the network seems healthy.

Moved the stick back to the actual server, and everything continues to work fine.

My only conclusion is that the BE469s lost their keys back when I was on 2.4, just like 5iver has seen; I never noticed they weren’t working, and something new in 2.5 turned that into a retry-forever attempt that bogged the network down until it was fixed. Not sure why all the other inclusions were “successful” and acted like everything was fine, but it definitely behaved differently after the nearby handshake.

So I guess the answer is: stare at the logs, exclude/re-include any secure devices over and over until you see them finally behave differently, and then move on to the next one.