Zwave device initialization failing after restart

rtdog · January 2, 2024, 3:39pm

I’m having trouble with device initialization after a reboot on openHAB 4.1, although I suspect I’ve always had this problem. After restarting openHAB, my system runs ZWave initialization for about 8 hours until it gets caught up; I have about 90 nodes. There are some devices that never initialize. In my most recent boot there were 4 mains-powered devices that never initialized.

The devices get stuck in some “Heal-State” initialization stage: DELETE_ROUTES, SUC_ROUTE, etc. Their Status shows “Offline” until I interact with the device, then it goes ONLINE, but still is not initialized. I discovered this issue while troubleshooting an unrelated problem: Zwave polling unlinked channels - Add-ons / Bindings - openHAB Community. I have three basic concerns here:

1) There appears to be no way to get these devices initialized. I cannot HEAL the device because HEAL skips a device if it’s uninitialized. The option to “reinitialize the device” is not in the GUI, presumably because the device is not yet initialized. I can’t figure out how to initialize them short of deleting and re-adding the node, which is not tenable for every reboot, and I haven’t even tested whether that works yet. Any idea on how I can force a reinitialization attempt?
2) There appears to be no place I can find that displays this status. The GUI has a status detail line, although it is not updated for any of these “heal states”: https://github.com/openhab/org.openhab.binding.zwave/blob/329d0b143ff81256c51f05936d50aa8341c436ff/src/main/java/org/openhab/binding/zwave/handler/ZWaveThingHandler.java#L1542 . The JavaScript Scripting add-on I use has status and statusInfo, but I think these are the same values shown in the GUI, and I cannot find a way to get binding-specific details through the Things object. What I’ve been doing to determine the status is attempting a HEAL and then reviewing the logs, where the message saying it can’t do the HEAL gives me the status.
3) The device initialization is failing. The logs seem to indicate that these nodes are getting stuck after 3 failed timeouts at WAIT_RESPONSE. There are many other retries that succeed, and the transaction failures occur at different stages on the various devices, which are of different models, at different hours during the sequence. Is there a way for me to configure the retry count and delays?

rlkoshak · January 2, 2024, 4:35pm

I can’t think of a single reason it would take 8 hours to initialize zwave, even on an underpowered machine. There must be something very significantly wrong or your machine is way underpowered (like a RPi 0W).

Given that, I’d say that the fact that some devices didn’t initialize is a symptom of a wider problem.

I would ignore the initialization problem for now and look into why it takes so long in the first place. Even with that many nodes it should not take more than a minute or two, not 8 hours.

apella12 · January 2, 2024, 10:03pm

1) AFAIK if a powered device doesn’t get initialized in a few minutes, it will stay that way. For me that has been very rare and I restarted OH to fix it. Maybe a delete and scan will work, but I don’t know either.
2) Could be a language issue, but the heal states and initialization states are different and do not always seem to appear in the UI. A better Java class to look at is here: org.openhab.binding.zwave/src/main/java/org/openhab/binding/zwave/internal/protocol/initialization/ZWaveNodeInitStage.java at 329d0b143ff81256c51f05936d50aa8341c436ff · openhab/org.openhab.binding.zwave (github.com)

2a) Generally the Debug level has a message as the stage advances.
See the log viewer example from file (restart):

3) WAIT_RESPONSE is not good. It means the controller has not ack’ed the message. I think 2 seconds are allowed for that and 5 seconds total for WAIT_REQUEST. WAIT_REQUEST means the node hasn’t responded. I could look at a log for you if you wish.

rtdog · January 2, 2024, 10:08pm

I appreciate the feedback, rlkoshak. I suspected that something is wrong, so I’ve been digging through zwave logs for many hours now, which brought me here. I’ll continue with additional research. I actually can’t say for sure it was 8 hours; I was basing that on the time from when I started the service until the zwave logs showed the last devices initialized. The logs had wrapped too far for me to see all of the times. I realized after my post that there may be a scheduled “network heal”, maybe at 2:00am, which might be the end of what I was looking at, although that would still be 3 hours. I can’t find the settings for the network heal and don’t remember how that’s set up.

I’m running a Raspberry Pi 4 which is what I believed from reading the forums is the most reliable/stable platform to run it on. I’m also running an Aeotec 5 stick, which as far as I am aware is a fairly stable stick to run it on. In fact I understand that the 700 series zwave sticks are not supported by openHAB, but I just inferred that from a number of forum discussions so I haven’t made the effort to try it out. If I thought a higher horsepower device would improve performance and/or reliability I’d switch for sure if I knew what to move to. Maybe an Intel NUC?

For posterity’s sake, I did answer my own question about forcing a reinitialization. I am able to force a node reinitialization through an HTTP call by capturing the “heal” call the GUI makes and modifying the payload to be an action_reinit instead of an action_heal. I suspect the API call PUT /things/{thingUID}/config will do the same if I end up needing to do this systematically.
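For anyone wanting to script this, here is a minimal Python sketch of how such a request could be built. It is not verified against the binding: the host name, the example Thing UID, the `/rest` prefix, and the `{"action_reinit": true}` body are assumptions based on the description above, and the bearer token is optional depending on how API security is configured.

```python
import json
import urllib.parse
import urllib.request


def build_reinit_request(base_url, thing_uid, token=None):
    """Build (but do not send) a PUT to /rest/things/{thingUID}/config.

    The payload key "action_reinit" is taken from the post above; its
    exact value format is an assumption.
    """
    # Keep colons readable; they are legal inside a URL path segment.
    uid = urllib.parse.quote(thing_uid, safe=":")
    url = f"{base_url}/rest/things/{uid}/config"
    body = json.dumps({"action_reinit": True}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def send(req, timeout=30):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Something like `send(build_reinit_request("http://openhab.local:8080", "zwave:device:controller:node12"))` would then fire it for one node; both the host and the UID here are made up.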

rlkoshak · January 2, 2024, 10:20pm

It’s a property on the Zwave Controller Thing. If you are having these kinds of problems, I think the general advice is to disable the network heal. As I understand it, the heal is only really needed if you move a device anyway, but I recall there being some problems with the network heal in the past.

An RPi 4 should be plenty powerful and, unless @apella12 says differently, it should take minutes, not hours, to initialize everything. And there are timelines to be met, so if your machine is struggling it could cause problems. What’s your system load, available memory and swap usage?
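If it helps to capture those three numbers in one go on the Pi, a small Python sketch along these lines should work on Linux. It assumes `/proc/meminfo` is available with the usual `MemAvailable`, `SwapTotal` and `SwapFree` fields; the function names are just for illustration.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, values mostly in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def summarize(meminfo, loadavg):
    """Reduce the raw numbers to load, available memory and swap in use."""
    swap_used_kb = meminfo["SwapTotal"] - meminfo["SwapFree"]
    return {
        "load_1m": loadavg[0],
        "load_5m": loadavg[1],
        "load_15m": loadavg[2],
        "mem_available_mb": meminfo["MemAvailable"] // 1024,
        "swap_used_mb": swap_used_kb // 1024,
    }


def snapshot():
    """Read the live values from the running system (Linux only)."""
    with open("/proc/meminfo") as f:
        return summarize(parse_meminfo(f.read()), os.getloadavg())
```

Calling `snapshot()` a few times during the Z-Wave startup would show whether the load average or swap usage climbs while nodes are initializing.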

OH 4.1 brings support for 700 series controllers.

apella12 · January 2, 2024, 10:39pm

Rpi4 is fine. I have the 2GB version with about 35 nodes. I also have an Rpi3 with 25 nodes that still works okay. Keep in mind ZW radio communication is 100 kbit/s at most for ZW Plus and 40 kbit/s otherwise. IMO this is a ZW network communication issue, not a computing limitation. Might want to review this post. [SOLVED] Unresponsive Z-Wave Network: Tools and Approaches to track down the issues - Tutorials & Examples / Solutions - openHAB Community

With 90 nodes a zniffer might help. Also, the Silabs tool as well as the Aeotec tool referenced in the posting can do node health checks.

rtdog · January 3, 2024, 1:04am

I’ve been using the Silabs tool for basic troubleshooting, and OTA updates. I haven’t spent too much time looking at it yet. I previously had a product named “Zwave Toolbox” which sniffed the network a bit but it’s a pain to use and it doesn’t work well. I’ll look into a 700 series controller for sure now that I know they’re supported, thanks to all those who worked on supporting the new controllers!

rtdog · January 3, 2024, 1:09am

My problem with running the Windows diagnostic tools is that I have to pull my stick from the openHAB Pi, and then my network will be down/degraded for the remainder of the day due to the long startup times, which significantly reduces WAF (wife-acceptance-factor). I wonder if there is a way to utilize more than one controller on the network or to add another stick somehow for troubleshooting?

garyfree · January 3, 2024, 5:42am

Perhaps look at the controller itself and check you don’t have a lot of orphaned nodes. Get rid of anything that isn’t on your network.
Depending on linux/windows there are a couple options for this.

apella12 · January 3, 2024, 2:55pm

I guess if the uninitialized devices still perform their function, then the WAF may be higher living with the problem.

The zniffer would be a separate device (a spare 500 controller, reprogrammed) that can be used in Windows while OH/Zwave is running. I found it easier for identifying chatty devices than the debug logs, and it doesn’t impact WAF.

rtdog · January 7, 2024, 9:08pm

Thanks garyfree, I generally use another tool to do inclusion, exclusion, and for some configuration settings as those tools give me immediate feedback and very granular control. I’ve been using the Silicon Labs Simplicity Studio Z-Wave PC Controller recently with good results. I have a new Z-Wave stick on order to assist with the troubleshooting for now.

In other news, I switched my current Aeotec 5 stick from the USB 2.0 (black) port on my Raspberry Pi to the USB 3.0 (blue) port, and I believe I saw a considerable difference in the initialization time. It’s closer to 30 minutes now. I haven’t done any controlled study of the matter, but it appears to be dramatically improved.

That being said I still can’t tell when devices are initialized because the last few states don’t update their status on the GUI and I’m not convinced that the API allows the detailed state status of ZWave devices to be viewed, in particular the heal states. Right now I have to click on each of the 90 devices and look at the GUI to see if “Reinitialise the device” is visible or not. I’ll continue to troubleshoot, thanks all for your feedback.
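As a stopgap for clicking through all 90 Things, the generic Thing status can at least be pulled in bulk over the REST API. Below is a minimal Python sketch, assuming `GET /rest/things` returns a JSON array whose entries carry `UID` and `statusInfo` (`status`, `statusDetail`, `description`); the host name and the `zwave:` UID prefix filter are assumptions for illustration. It reports only what the GUI status line already shows, not the binding’s internal initialization or heal stage.

```python
import json
import urllib.request


def fetch_things(base_url, token=None, timeout=30):
    """Fetch all Things as parsed JSON from the REST API."""
    req = urllib.request.Request(f"{base_url}/rest/things")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def zwave_not_online(things):
    """Return (uid, status, detail, description) for Z-Wave Things not ONLINE."""
    rows = []
    for thing in things:
        uid = thing.get("UID", "")
        if not uid.startswith("zwave:"):
            continue
        info = thing.get("statusInfo") or {}
        if info.get("status") != "ONLINE":
            rows.append((
                uid,
                info.get("status"),
                info.get("statusDetail"),
                info.get("description"),
            ))
    return rows
```

Running `zwave_not_online(fetch_things("http://openhab.local:8080"))` after a restart would give a quick list of nodes that still need attention, with the host being a placeholder.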

apella12 · January 7, 2024, 9:38pm

I’m not an expert on the inner workings of the Rpi4 USB ports, but the max speed of ZW communication is 100 kbit/s, so either 60 MB/s (USB2) or 625 MB/s (USB3) will be overkill. Also, there is an issue with some of the older Aeotec 5 sticks on the Rpi4. Unless you are sure you have the model with the hardware fix, put the stick in a USB2 hub (unpowered or powered), not directly in an Rpi4 USB port.
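To put rough numbers on the overkill, here is the arithmetic as a short Python sketch. It takes the Z-Wave figure as 100 kbit/s (the protocol’s rates are specified in bits per second) and uses the nominal signalling rates of 480 Mbit/s for USB 2.0 and 5 Gbit/s for USB 3.0; real-world USB throughput is lower, so treat the ratios as order-of-magnitude only.

```python
# Nominal link rates in bits per second.
ZWAVE_BITS_PER_S = 100_000          # Z-Wave top data rate
USB2_BITS_PER_S = 480_000_000       # USB 2.0 high speed
USB3_BITS_PER_S = 5_000_000_000     # USB 3.0 SuperSpeed


def bytes_per_s(bits_per_s):
    """Convert a bit rate to bytes per second (8 bits per byte)."""
    return bits_per_s // 8


zwave_bytes = bytes_per_s(ZWAVE_BITS_PER_S)   # 12_500 bytes/s
usb2_bytes = bytes_per_s(USB2_BITS_PER_S)     # 60_000_000 bytes/s
usb3_bytes = bytes_per_s(USB3_BITS_PER_S)     # 625_000_000 bytes/s

# How many times faster each USB link is than the radio behind it.
usb2_headroom = USB2_BITS_PER_S // ZWAVE_BITS_PER_S   # 4_800
usb3_headroom = USB3_BITS_PER_S // ZWAVE_BITS_PER_S   # 50_000
```

Even USB 2.0 has thousands of times more bandwidth than the radio can fill, so a change in startup time after moving ports is more likely down to something other than raw USB speed.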