So, where can I find some documentation to interpret what’s going on here? I’ve done a test (in a different house, not the one having problems) with the same Simon Tech equipment. I simply clicked once to turn OFF and once more to turn ON, and this is the output:
I’d advise a methodical approach like your screenshot, or you will get overwhelmed. Originally, I tried to keep a table, but I’m kind of analytical. I did split my network so most nodes were direct to the controller. The last column is the health assessment from the Silabs tool. nodes.pdf (253.9 KB)
Overall, the test you did looks okay.
Starting from the left, the speed is 40K. Z-Wave Plus can get to 100K, but maybe these are older devices. The RSSI (signal strength) is okay but could be higher. If it were really low, the speed could drop to 9.6K. The delta is the time between frames; for 40K, 10-20 ms is good. The routing from 19 to 1 requires one hop (13). The only glitch is lines 8 to 14. Node 13 had to send the meter report 3 times (60 ms apart) before it was acknowledged by the controller. It also seems to have been acked twice, also about 60 ms apart. A minor note: you get a meter report with every switch. Is that needed? However, as noted above, there is no major issue. If they are all like that, you will be fine.
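To make that reading of the trace repeatable, here is a minimal Python sketch. It assumes you have copied the Zniffer rows into simple records by hand; the `Frame` fields, the payload labels, the 200 ms window and the sample values are all made up for illustration and are not the Zniffer export format. It computes the delta between frames and flags a frame that is repeated back to back, like the meter report from node 13.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    # Hypothetical record, typed in by hand from the Zniffer view.
    t_ms: float        # capture time in milliseconds
    src: int           # sending node
    dst: int           # receiving node
    speed_kbps: float  # 9.6, 40 or 100
    payload: str       # free-text label, e.g. "METER_REPORT"


def deltas(frames):
    """Time between consecutive frames, in milliseconds."""
    return [b.t_ms - a.t_ms for a, b in zip(frames, frames[1:])]


def find_retries(frames, window_ms=200.0):
    """Return (src, dst, payload, count) for each run of identical
    consecutive frames sent within window_ms of each other."""
    retries = []
    i = 0
    while i < len(frames):
        first = frames[i]
        key = (first.src, first.dst, first.payload)
        count = 1
        j = i + 1
        while (j < len(frames)
               and (frames[j].src, frames[j].dst, frames[j].payload) == key
               and frames[j].t_ms - frames[j - 1].t_ms <= window_ms):
            count += 1
            j += 1
        if count > 1:
            retries.append((*key, count))
        i = j
    return retries


# Made-up sample loosely modelled on the trace described above.
frames = [
    Frame(0, 1, 19, 40, "SWITCH_SET"),
    Frame(15, 19, 1, 40, "ACK"),
    Frame(30, 13, 1, 40, "METER_REPORT"),
    Frame(90, 13, 1, 40, "METER_REPORT"),
    Frame(150, 13, 1, 40, "METER_REPORT"),
    Frame(165, 1, 13, 40, "ACK"),
]

print(deltas(frames))
print(find_retries(frames))
```

A run with a count of 2 or 3 now and then is what the test above showed; many long runs from the same node would be worth noting in the report.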
In the “Association Groups”, for “1: Lifeline” I always have only the “Controller” as the device. Since the slower devices are actually far away from the controller, should I also select here the nodes of the repeaters closest to each slow device, to “force” it to pass through that repeater instead of going directly to the controller?
While checking the network, though, I see that node 43, a switch far from the controller, is supposedly already communicating with node 20, a repeater midway to the controller.
From node 20 to the controller I also see a direct connection, but all direct connections to the controller are unidirectional:
I will be able to physically be there in a few days and perform the Zniffer tests though.
As I want to be 100% sure where the problem is (whether it comes from the devices, the network or the setup, I just want this solved once and for all), what methodical approach do you suggest, and which steps should I follow, in order, to be able to create a proper report on this matter?
No. The node map is not useful for diagnosis. Also, having only the controller in Lifeline is correct. The mapping is a result of the find neighbors command. The controller is excluded from that command because it either bogs down (500 chip) or won’t work at all (700, 800 chips), so there will only be unidirectional arrows. On the controller UI page, under properties, it should only show itself as a neighbor.
I would suggest triggering the problem node(s), seeing what the Zniffer shows, and taking notes about hops and speeds. If you disable the nightly heal (my recommendation), you could just try to heal the nodes that are slow with many hops. The nightly heal will scramble it up again, so it would need to be disabled. Don’t be surprised if your repeaters are not used in any of the routes.
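The note-taking can be kept very simple. Below is a minimal Python sketch, assuming you write down one `(node, hops, speed_kbps)` tuple per observed transmission; the tuple layout, the thresholds and the sample numbers are assumptions for illustration, not anything the Zniffer produces. It keeps the worst case seen per node and lists the nodes that look like candidates for an individual heal.

```python
def heal_candidates(observations, max_hops=1, min_speed_kbps=40.0):
    """Return the sorted node IDs whose worst observed route used more than
    max_hops hops or dropped below min_speed_kbps.

    observations: iterable of (node, hops, speed_kbps) tuples noted by hand.
    Both thresholds are arbitrary starting points, not Z-Wave limits.
    """
    worst = {}
    for node, hops, speed in observations:
        prev_hops, prev_speed = worst.get(node, (0, float("inf")))
        worst[node] = (max(prev_hops, hops), min(prev_speed, speed))
    return sorted(
        node
        for node, (hops, speed) in worst.items()
        if hops > max_hops or speed < min_speed_kbps
    )


# Made-up notes: node 43 was seen once at 9.6K over two hops.
notes = [
    (43, 2, 9.6),
    (43, 1, 40.0),
    (20, 0, 40.0),
    (19, 1, 40.0),
]

print(heal_candidates(notes))
```

Comparing the list before and after healing a node gives a quick check of whether the heal actually improved that route.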
Thank you once again for all your help. I now know that the issue actually lies with the devices for some reason.
The distance between some devices and the controller is also not ideal, as they are far away. But there are dozens of devices, and that is usually favorable because there are multiple routes to reach the controller with only 2 hops, even for the most distant device.
As I said before, even clicking the physical button sometimes does nothing, and other times the device acts in a very weird way. Some physical buttons even have broken pins and malfunction because of that: they do not make proper contact with the back pins, so they sometimes go completely off when touched because they come out a little bit.
Those devices that go off are then obviously a bottleneck for the network: if one of them is part of a route, any device that tries to reach it won’t be able to, causing congestion on the network.
Pretty hard to figure out. I do see a lot of Basic. Those are only sent when the device is triggered by the local button. If the device doesn’t work at that point, then it is the device, because Z-Wave is not involved. A message is sent back to the controller, and if OH doesn’t show the change then there is a communication problem, but if the device itself doesn’t work, that is the device.
Yup, that’s the thing. If a device is clicked and it doesn’t work, it can never be Z-Wave’s fault, only the device itself!
The biggest problem here is the congestion caused when multiple actions are triggered. Then, for many seconds, the network just becomes stuck, before releasing multiple commands at the same time and causing lights (or whatever was triggered) to go ON and OFF very quickly. Very chunky and weird behavior.
Another very weird thing was that the routes chosen were not the best most of the time. I saw hops going “back and forth”, and then one of the more distant devices ended up being the one communicating with the controller, which makes no sense at all!
One way to stabilize the routing is to disable the network heal. I doubt that is helping you. Then heal the individual nodes that are causing problems and watch what happens with the Zniffer to see if that is faster. I did see a lot of 9.6K speeds. That is the slowest option.
I guess it must just be the geometry then. Did you try a 700 Z-Stick? I recall you considered that. They are advertised as having better range.
EDIT: There were also a number of explorer frames. Those appear when reducing the speed and trying the stored routes didn’t work, so the device sends out an “anybody out there” to find a new route back to the controller. That tends to support the conclusion that there are communication issues.
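If you want to put a number on that for the report, a small Python sketch like the one below could tally explorer frames per sending node. It assumes you note each captured frame as a `(src_node, frame_type)` pair; the pair layout, the `"explorer"` label and the sample values are hypothetical, not a Zniffer format.

```python
from collections import Counter


def explorer_counts(frames):
    """Count explorer frames per source node.

    frames: iterable of (src_node, frame_type) pairs noted by hand;
    the label "explorer" is an assumed convention for these notes.
    """
    return Counter(src for src, kind in frames if kind == "explorer")


# Made-up sample capture.
capture = [
    (43, "singlecast"),
    (43, "explorer"),
    (43, "explorer"),
    (20, "singlecast"),
    (51, "explorer"),
]

# Nodes sorted from most to fewest explorer frames.
print(explorer_counts(capture).most_common())
```

Nodes at the top of that list are the ones that most often lost their way back to the controller, so they are a reasonable place to start testing.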
Yup, I’m now using an Aeotec Z-Stick 7 in that installation.
Exactly. Something is wrong with the devices, and I have now suggested that the supplier have one of their technicians test their devices by creating a network of their own, with only their devices, and testing each button one by one.
It will take time to test nearly 150 devices (between light and blind switches), but that’s the only thing left to do now.
From my side, I can’t do more than what I did.
I even created a 24-page report explaining each step I took to reach this conclusion. Let’s see what they will do.