So, I have a Z-Wave network with some 30-odd nodes that acts sluggish from time to time. I often stumble on threads where people have similar issues, and they are often resolved thanks to someone looking at their logs and going “Hmm, you should consider doing X about node Y”. It’s impressive to see, but not always helpful for laggards (like myself) looking for an answer to a similar problem.
Now that I’ve decided to dig into the issues, I’m asking for your input on efficient ways of debugging the system, so that others can also follow these steps to get their sluggish networks up to speed again.
The first steps are pretty straightforward:
- Set log-level to debug
- Let the system run
- Collect the log
- Upload the log to the log viewer
- And then what? What’s the best approach to weed out the problematic devices? What to look for? And what is considered an ok response time?
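For reference, the first steps above can be sketched as console commands on an OH3 system. This is a hedged sketch: the log path and console access are the openHABian defaults, so adjust them for your install.

```shell
# In the openHAB (Karaf) console (e.g. ssh -p 8101 openhab@localhost),
# raise the Z-Wave binding's log level:
log:set DEBUG org.openhab.binding.zwave

# Let the system run, then collect the log (openHABian default path):
tail -f /var/log/openhab/openhab.log | grep -i zwave

# When done, drop the level back so the log files stay manageable:
log:set INFO org.openhab.binding.zwave
```

The resulting `openhab.log` is what you upload to the log viewer.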
In my nodes-summary, I find a few different device characteristics:
- Avg. response times 25-50 ms, max response time slightly above 100 ms, no timeouts.
- Avg. response time 100-450 ms, max response time 500-6000 ms, min response time 25-150 ms. No timeouts. These devices are around 1/5 of all my devices.
- Avg. response time 100-800 ms, max response time 1600-9000 ms, min response time listed as 0 ms, which is probably incorrect. The timeout percentage is between 7% and 34%. These devices are about 1/6 of all my devices.
I guess category #1 is nothing to worry about, but when it comes to #2 and #3, I have a hard time telling whether they all have problems of some sort, or whether perhaps #2 is causing the problems visible in #3 (or vice versa). How should I think about this, and what should I look for to find the culprits?
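To make that triage concrete, here is a minimal Python sketch of how one might bucket nodes using the three stats discussed. The thresholds, node numbers, and stats below are my own illustrative assumptions, not values from the binding:

```python
# Rough triage of Z-Wave nodes into the three categories described above,
# based on (avg response ms, max response ms, timeout percentage).
# Thresholds are assumptions chosen to match the post's buckets.

def classify_node(avg_ms, max_ms, timeout_pct):
    """Return 1, 2 or 3 matching the categories in the post."""
    if timeout_pct > 0:
        return 3          # timeouts present: likely routed, with issues
    if avg_ms <= 50 and max_ms <= 200:
        return 1          # fast and consistent: likely direct to controller
    return 2              # slower but no timeouts: likely routed/repeated

# Hypothetical node stats, purely for illustration:
nodes = {
    5:  (30, 110, 0.0),
    12: (250, 4200, 0.0),
    17: (600, 8000, 21.0),
}
buckets = {n: classify_node(*stats) for n, stats in nodes.items()}
print(buckets)  # {5: 1, 12: 2, 17: 3}
```

The point is only that timeouts, not raw averages, are the first thing to sort on; as the replies below note, the real explanation for the buckets is routing.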
- When I hopefully find the problematic device(s) in step 5, what do I do? Is there a setting that can be changed as a first step? Is a reset the last resort before replacing the device?
I’d be happy to learn how to proceed in situations like these.
[The setup celebrates 3 months soon and is an RPi 3 with a clean OH3 install and an Aeotec Z-Stick. Before that, I had been running OH2 for some years on the same hardware.]
You’re going to get a lot of advice. There is a good post here.
My advice is to get a Zniffer. It is talked about in that post. You can guess around the problem, or you can know. Plus you can watch in real time, versus debug log, then viewer.
By the way, I do not think a 9000 ms delay is possible. Messages have a five-second limit.
My 2 cents. Good luck!
Thanks for your input. None of those above 5000 ms are connected directly to the controller. Can it be that a message has a 5000 ms limit for each hop in the mesh?
You need a zniffer to troubleshoot that. The issues are most likely something outside the binding.
- Avg. response times 25-50 ms, max response slightly above 100, no timeouts.
These are direct connections.
- Avg. response time 100-450 ms, max response time 500-6000 ms, min response time 25-150. No time-outs. These devices are around 1/5 of all my devices.
These are repeated over one or more hops.
- Avg. response time 100-800 ms, max response time 1600-9000 ms, min response time is listed as 0 ms which probably is incorrect. Percent time-outs is between 7%-34%. These devices are about 1/6 of all my devices.
These are repeated over one or more hops, with issues.
I would expect the issues to start when a few devices are multiple hops away and also have poor connections, causing a lot of retries. When these start to be used, or start sending reports, things will go odd. Even a lowish volume of traffic can cause issues. You might find some devices multiple hops away are generating a lot of power reports or similar, but you may not even see these in the binding log, as they never make it to the controller.
Well, I thought that was the case, and it is for a lot of the messages I looked at, but clearly not all.
One thing to note is that in the 9 seconds of what looks like calm in the debug, as @robmac notes, the controller is flooding the network with messages trying to get the message through with different speeds, hops, etc. Another series of forum posts to read as background is here.
OH (the zwave binding) doesn’t know or care about hops. It just sends the message to the controller and waits (a certain time) for a response, or cancels.
Thanks for the replies guys, I sure learned a few new things. Looks like there is no silver bullet to this issue. Getting a zniffer is a bit more than I had planned to do but I just might have to go down that road. I was hoping for more magic findings from the log I guess.
I will start by swapping out a couple of category-3 devices that occasionally show up with suspiciously few neighbours on the network map and see if we get any improvements.
Oh, I forgot: Is there any good guide on what hardware to get for the Zniffer (preferably with EU availability)?
ACC-UZB3-E-STA is EU frequency
Around €32 + shipping.
I only found it shipping from US so far but I’ll keep looking. Thanks!
If it is the same place I use, they have free shipping over around €50 if there is something else small you need.
While you are waiting for the hardware, you can still use the PC Controller tool. I’d advise checking to make sure there are no extra nodes, so-called zombies. There is also a node health tool for your powered nodes that gives a number from 1 to 10. That will help identify possible issues.
Oh, I think I have a “zombie” node in my network – a device that I factory reset (accidentally) while it was included in the network. I added it back to the network again (via the regular inclusion process) and it showed up with a new node number. Now I always have a zombie “zwave node 5” in my OH inbox that I just ignore. I can’t get rid of it; I have tried the Habmin advanced options (mark as failed etc.) with no luck. OTOH my Z-Wave network is solid, so I can’t say it’s doing any harm.
In post no. 2 above I referenced a good post on the subject. I don’t have a lot of experience with zwave performance with zombies, but if it was a battery device, it probably will not do much harm, since it would not be involved in routing messages. If it was a powered device, I’d be more concerned. Regardless, it should be removable with the Aeotec/SiLabs PC Controller. Being a type A personality, I do not like loose ends like a node popping up in my inbox, but that’s just me.
Sooo – now I have finally received the hardware and completed the setup of the Zniffer. It sure is interesting to see which routes the messages take, but I have no idea what I am looking for. I have a few questions related to this:
- Is there an ideal location to place the Zniffer (e.g. close to the controller, midway between controller and nodes, etc.)?
- Are there patterns I should look for regarding Delta, Application, Data, etc.?
- Are there any views or tools in Zniffer that I should be aware of?
- Any lessons learned I should bring with me?
@robmac has been doing this longer than I and may have different comments.
For me, I have the Zniffer next to the controller. Because: I have only a couple of actions using association groups (and that for nodes that are farthest from the controller). Almost all of my zwave actions are generated in OH rules, so my focus is on how quickly a frame gets from the device to the OH server (RPi4) and how quickly it gets a message out to the responding device. I generally look at the deltas, speed and hops (typical speeds are noted in post #3).
Initially I focused on the overall amount of traffic and have either replaced or calmed down “noisy” devices, reduced polling and disabled the “heal”. High traffic can cause the delays you are seeing. Look for repeated messages (trying different routes) and frequent updates from a single device. All messages need to be acked per the protocol, so note any “routing error” or messages that are not acked.
For a 45+ node network I average about 5 frames a minute. IIRC the max is about 10 frames a second, but I think you will have slow response problems at that level. I also keep a spreadsheet of the routes and speeds.
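As a sanity check on traffic volume, a frames-per-minute figure like the one above can be computed from any list of frame capture times. This is a generic sketch under my own assumptions (timestamps already extracted as seconds); the Zniffer’s network load view shows the same thing directly:

```python
def frames_per_minute(timestamps_s):
    """Average frame rate, given capture times in seconds, ascending."""
    span_min = (timestamps_s[-1] - timestamps_s[0]) / 60
    if span_min == 0:
        return float(len(timestamps_s))  # degenerate: all frames at once
    return len(timestamps_s) / span_min

# Illustrative data: 300 frames, one every 12 s (roughly an hour of capture)
ts = [i * 12.0 for i in range(300)]
print(round(frames_per_minute(ts), 1))  # 5.0, i.e. about 5 frames/minute
```

At 10 frames a second, the stated rough ceiling, the same hour would hold 36,000 frames, which puts “5 a minute” in perspective.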
Zwavenodes2-14.pdf (230.8 KB)
I’m sure you will find the causes of your delays.
Total agreement with all that @apella12 says.
All I can add is a few notes about the Zniffer app.
To see how many frames there are, show the network load trace and you get a view of the number of frames per second. If there are any high numbers, clicking the number will take you to that area of the trace so you can see which nodes were involved.
In addition to routing errors, many explorer frames indicate problems. If you filter the Data column and select only explorer frames, you will see which nodes are having communication issues.
You can also filter for Router Errors and other frames.
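That explorer-frame filtering can also be done outside the tool on an exported trace. A tiny sketch, assuming a hypothetical export format (the frame dicts and the "explorer"/"src" keys here are my assumptions, not the Zniffer’s actual schema):

```python
# Tally explorer frames per source node to spot devices with
# communication issues. The frame records are illustrative.
from collections import Counter

def explorer_counts(frames):
    """Count frames labelled 'explorer', keyed by source node id."""
    return Counter(f["src"] for f in frames if f["type"] == "explorer")

frames = [
    {"src": 7, "type": "explorer"},
    {"src": 7, "type": "explorer"},
    {"src": 12, "type": "singlecast"},
    {"src": 9, "type": "explorer"},
]
print(explorer_counts(frames).most_common())  # [(7, 2), (9, 1)]
```

A node that dominates this tally is a good first candidate for the “poor connection, lots of retries” problem described earlier in the thread.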
CRC errors: in general these occur because the Zniffer is too far away from the transmitting device, and they can be ignored. There is one case I know of where they cannot be ignored. There is a bug in older Z-Wave protocol versions: if nodes with these older protocol versions are part of the route to a FLiRS device, the beaming frames are corrupted and show up as CRC errors. The only solution to this is to replace the devices.
If you get stuck save a file using the options in the file menu and post.
Thanks guys for the encouragement, pointers and insights! I will fiddle around with the zniffer and see how far I get. I just have two more questions about the tool at this point:
@apella12, can you extract the node stats directly from the tool?
@robmac, I can’t find the Network Load Trace function that was visible in your screenshot. I am running version 4.57.17 on Windows. Are you on a different version?
I am using a newer version. You can download from silicon labs.
If you mean the spreadsheet, no, that was a manual effort. I would also wait until you have found and ironed out some problems and disabled the heal; otherwise it will be changing all the time. Now only a few of my nodes’ routings move around.
Thanks for that info @robmac. I just found out that by following some guide available out there, one ends up with the old version (still downloaded from Silabs). Simplicity Studio is the way to go. Thanks again!
And thanks for the info @apella12 .
Now I’m prepared to go down the rabbit hole!