[SOLVED] Z-Wave on 2.5.1 Broken

nolan_garrett · February 4, 2020, 11:09pm

Hello,

I have approximately 100ish Z-Wave network nodes, several of which utilize Z-Wave Security. I have had a fairly stable install for quite a while. The very last update seems to have broken my Z-Wave installation. I am receiving either queue full messages or alternatively or a constant resending of commands/polling via the binding. When I move the Z-Stick to a different device or controller software, it works fine. As soon as I get it back on OH, the queue builds and eventually it seems the network just hangs.

Logs: Microsoft OneDrive

I’ve tried:
Removing the node XML files
Using the latest 2.5.2 snapshot build (exact same issue)
Deleting the Z-Wave Things and recreating them
Moving the Z-Stick to a different computer and Z-Wave software to validate control/heal works

I am at a loss on how to further troubleshoot. Any help would be greatly appreciated.

robmac · February 5, 2020, 12:24am

Hi,

Some very odd things in those logs around the security negotiation. Have you tested the secure nodes from your other software and not had issues? If you are using PC controller possibly try testing from the test and also the network health check.

One at random

but lots are similar and are taking 1 to 2s to get the nonce but secure_encap returns is in a more normal time

I am running 2.5.1 with 150ish nodes but only have a couple of secure nodes and they are rarely used so I can’t say I don’t have the same issue. I certainly do not have any significant issues but my secure nodes only get used a couple of times a week so may not show as an issue compared to your system that has a constant flow of secure commands.

Will have a look when I return home at the secure node logs and get a zniffer on it.

Not sure if that helps but testing with pc program is worth a try.

chris · February 5, 2020, 9:11am

You can’t apply a blanket statement that this is a problem - it is quite possibly fine, but it depends on the device. If the device is a door lock, or thermostat for example, then this is not an uncommon time since such devices are FLIRS devices. These devices only wake up periodically to look for communications - there are two standard rates for FLIRS - 250ms and 1000ms. If the device operates on 1000ms, as most do, then 1200ms is probably about right.

chris · February 5, 2020, 11:12am

My guess is the update is not what changed, since there have been very few changes to the ZWave binding for a long time. That said, you don’t actually say what you upgraded FROM, so if it was 2.3 for example, then yes there were a lot of changes. If it was 2.4 then there were not too many changes, and if it was 2.5 then probably there are next to no changes.

That said, looking at the first log what I see is a queue full exception. This happens when there are 128 packets queued and it effectively kills things. Maybe this is not a good implementation and I should take a look at that, but that said, I’m pretty sure this queue has not changed for a very long time. It is interesting that in subsequent logs this doesn’t always show up and the queue does stay small.

Serious amounts of duplication as seen below can indicate there’s something wrong at the protocol level - either routing problems such that commands are being received through multiple routes, or acks are not getting back. Potentially a heal may help here.

The problem with this sort of thing is that it swamps the network, and also swamps the controller (it’s an old, slow, microcontroller). This may flow back to the binding in that if the controller doesn’t process frames, because its internally run out of memory which will manifest as rejected frames when the binding sends data - I also see this in your logs. It can also cause general congestion in the network, adding to delays, timeouts etc that all lead to things not working properly.

Please don’t - this will just make things worse for you since the binding will need to send an awful lot more data on the network.

nolan_garrett · February 5, 2020, 12:02pm

Good point. I’d been on the 2.5 milestone releases without issue, it seemed to break when when it went general release.

I’ve tried healing the network using a different controller (PC Controller on Windows). It acts like it works, but my issues aren’t resolved when I get the stick back to OpenHAB.

Should I just hard reset and start over? That seems to be an annual event for me, which usually resolves some problems but is a huge pain.

chris · February 5, 2020, 12:10pm

If you were on a reasonably late 2.5 snapshot, and then upgraded to the release, there were probably no changes (other than database).

This is a common assumption - “I upgraded the binding and everything stopped working, therefore it’s the binding”. Often this isn’t the case, and while I can understand why people can jump to this assumption, it’s always good to know all the details

By this you mean that the duplication I mentioned above is still showing in the logs?

I’m not really sure that resetting the stick will help - it might - at the end of the day it will force a complete mesh update as everything will be reset. It could also be caused by a problematic node in the mesh - it might be worth looking through the logs to make a note of what nodes this happens on, and see if there’s something in common. For example, are they all in one end of the house? Do they all route via one node - that sort of thing.

Resetting the node might work, but if it doesn’t, due to (eg) a faulty node, then I appreciate re-including everything is a major ballache so spending a bit of time to look for commonality might be a worthwhile first step.

nolan_garrett · February 5, 2020, 1:45pm

How are you identifying the specific nodes this is occurring on from the logs? That’s one issue and likely just my lack of knowing what to look for. OH tells me all nodes are online so there’s no clear way for me to see what nodes are dead from the UI. I don’t understand the Z-Wave protocol at a deep enough level to understand if the logs are pointing at a specific node or set or nodes it’s hanging on. Can you clarify what you’d be looking at to find those? If I could narrow it down to a list of potentially bad nodes or routing, I could re-include them or even exclude and reset them.

As an aside, is there a better Z-Wave controller to use that doesn’t have these queue size / memory problems? I frequently feel like my Z-wave challenges I have from time to time are related to being at 100+ nodes and the controller simply being terrible at any kind of scale.

chris · February 5, 2020, 1:59pm

Just by looking - unfortunately I don’t have anything fancy. It would be possible to write a script to process the log and find occurrences where received packets are duplicated in very short succession.

It’s basically what I showed in the image above. In that image you see 5 messages received that are the same in around 250ms.

No - at this point in time, they are all the same. That is changing, and there are new versions available from Silabs that now use an ARM processor, but these are not supported as Silabs have forced requirements on their use for certification (and at the moment, my focus is on certification). These restrictions will be lifted later in the year.

nolan_garrett · February 5, 2020, 6:06pm

I can now confirm for certain this is not the binding or OpenHAB. Using openzwave (python-openzwave specifically) I can replicate the controller queuing and hanging issue.

I think I’m going to try using the Windows PC Controller to reinclude nodes into the network, and see if that allows me to avoid doing a hard reset on the controller.

robmac · February 5, 2020, 10:24pm

One word

ZNIFFER

robmac · February 5, 2020, 10:31pm

Just got home and can see nothing like you are seeing so definitely 2.5.1 is not broken.

I will give you a few small paragraph from the latest z-wave documents to think about.This is new advice so not something the binding complies with yet but you can help yourself. Looking at your logs you have a lot of traffic. Is it adding value or just causing you issues?

Z-Wave is a radio technology with limited bandwidth. Therefore, it is NOT RECOMMENDED to use polling.

If used, polling communication MUST comply with the requirements stated in the sections 3.7.1 through 3.7.3.

Communication MUST NOT be considered as polling if: • A node issues one or more commands in a burst initiated by a user action. This applies to any combination of commands; also requests. • A node issues one or more commands initiated by the inclusion of another node. This applies to any combination of commands; also requests.

3.7.1 Polling with no errors Two timers named CommandTime and PollTime are used for polling requirements with no error. Illustrations are given for secure and non-secure cases in Figure 3 and Figure 4 The following requirements apply to the normal case where a polling request is successful.  CommandTime MUST be measured by the application  The application MUST wait PollTime before polling any other node  PollTime SHOULD be 10 seconds + CommandTime or more  PollTime MUST NOT be less than 1 second + CommandTime

and then

3.8.1 Unsolicited data collection communication Bursts of one or more commands which carry status information transmitted repeatedly without any user intervention MUST be considered to be unsolicited data collection communication. Using the transmission of a control command or a NOP command as a heartbeat indication MUST also be considered unsolicited data collection communication. To save bandwidth, data collection communication MUST comply with the following requirements.  A device MAY issue unsolicited data collection communication in any burst size  A device MUST NOT issue new unsolicited data collection communication less than 30 seconds since the last burst.

robmac · February 6, 2020, 7:23am

Fair comment but a 1s sleep on a FLiRS would average out to 0.5s across many calls.

There are some nodes that are a lot worse than this one but I selected this example as the actual message only takes 30ms to get an ACK on the command and is relatively consistent on that. The worst you would expect the nonce in is a little over 1s and the best should be 30ms so not really a blanket statement.

Whatever devices these are do seem to not behave very well when in secure mode in this network. Certainly points at not an issue with the binding but an issue with the network and these are not behaving very well as part of that network.

With the regularity they are being called they could be a significant part of the issue and a good place to start which is why I suggested a network test from PC Controller program.

chris · February 6, 2020, 8:18am

While I agree that some of these times look a little longer than I might expect, I don’t think there’s anything particularly wrong. Again, IMHO you cannot make the statement that secure mode doesn’t work well in this network. We see if takes time to wake up for the initial request (eg the nonce request) and then the transmissions are quick since the device is already awake.

Admittedly I’ve not spent a lot of time analysing the log, but from what I’ve seen it doesn’t strike me that there is anything wrong with secure connections, and the other problems I identified with duplicate packets are more than likely to cause any delays you see. Secure frames are after all the same as any other frame, so there should be no reason that “secure mode doesn’t work well in this network”.

robmac · February 6, 2020, 1:34pm

I do not think I said that.

If you think I was saying anything regarding the binding my comment was nothing to do with the binding but questioning the behaviour of the devices and the amount of traffic.

Considering the number of nodes set to secure and the volume of traffic it would be easy to understand why this network would get clogged up and it is nothing to do with the binding.

It is just the nature of zwave and extra security messages when you have a lot of nodes and a lot of regular messages to your secure nodes may not be optimal if they are not needed.

If the type of device then steels even more time by being slow at responding to controller requests it will not help. At times in these logs the controller is in a wait state for 2.5s out of 5s and at least 2s of that is waiting for nonce and this continues for a long time. So less questioning security more questioning how these devices behave.

Now I will make a few sweeping statements.

The binding does what it does very well and despite it needing a bit of rework in some areas to be approved it is still probably the best zwave there is.

If something is not working with the zwave on openHAB, most of the time the issue will be configuration or/and issues in the network or devices not the binding.

150 nodes on openHAB is 10* better than on one of the most expensive zwave controllers available with similar settings.

chris · February 6, 2020, 1:44pm

Fair point, I misquoted, but what you said was pretty close -:

Anyway, it doesn’t matter too much - I think we’re all on the same page and I appreciate your input.

nolan_garrett · February 6, 2020, 3:46pm

I only have 7 nodes with security enabled. Are you indicating this many is a potential issue?

chris · February 6, 2020, 4:10pm

There should be no reason that this is the case, but…

A lot of people don’t manage things well - they decide they want to get data a “million times per second” and set poll or reporting rates as fast as they can and then wonder why there are problems. Security makes things worse as it roughly doubles the amount of traffic. This in itself is not a problem, but it’s all about scalability - if you configure your devices to report at high rates, and everything uses security and you continue to increase the network, then it will break with some combination of the above.

If you think about it, then it should be fine. This is why the binding sets recommended values for things like polling - to try and keep the network traffic down. However it can’t control everything, and unfortunately it’s often the case that people don’t understand the limitations (why would they) and sooner or later things will fall over…

So, no, 7 nodes should not be a problem. 100 nodes should not be a problem - assuming the network is properly configured

nolan_garrett · February 6, 2020, 4:52pm

This makes sense. I try to keep the default values for everything unless there is a very specific reason I need to make a change. I try to be cognizant of the network traffic as while I am not a Z-Wave expert, I am an IT pro with a lot of networking expertise, and I understand low bandwidth networks to some degree.

I’ve been through so many OH upgrades and such though the configuration may have drifted over time. Is there a good place I can find the default binding / Z-Wave settings?

5iver · February 6, 2020, 4:54pm

By network you mean binding?

E.g. a Polling Period of 1 hour (the default) in a network of 120 devices means one device is polled every 30s. Add to that a Default Wakeup Period of 1 hour (default) with 50% of your devices battery powered and there will be one wakeup every minute. Continuously!

The maintenance functions in the binding definitely need to be adjusted for larger networks. For our ~130 node network…

Polling Period: 1 day (our devices self report)
Command Poll Period: Disabled except for a couple (our devices self report)
Daily Heal: Disabled (broken )
Wakeup Interval: 3 hours

chris · February 6, 2020, 6:19pm

No - I mean everything. Mostly I mean the configuration of the devices.

Generally the binding does not control how a user configures a device. If they set reporting period to 1 second, or ask for a report every time the temperatures changes by a micro degree, or a watt meter to send a report when the power changes by 1%, then this sort of thing will flood the network. The binding can’t control this at all.

Most devices don’t do this by the way and it is not recommended practice from ZWave Alliance for devices to do this sort of reporting when they have been remotely commanded to change state.