OH2 Z-Wave refactoring and testing... and SECURITY

Wow, many thanks for your great support Chris!
In my test environment everything works stable now. So the update did the fix!
I will test it tonight in my production environment also.
Thank you again for all your work!


I took the plunge and installed the new binding.
It really made a difference :slight_smile:

Only NodOn Wall remotes are not included for some reason.
Log also complains about 3 nodes every 30s, even though they all work perfectly well.

2017-04-24 10:26:22.369 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:26:23.096 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:26:23.123 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:26:52.379 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:26:53.107 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:26:53.132 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:27:22.391 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:27:23.116 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:27:23.141 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:27:52.401 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:27:53.126 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:27:53.148 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!

Nodes 24 and 29 are Aeotec Multisensor 6 units on USB power.
Node 22 is a NodOn Wall remote on battery. (Node 6, also a NodOn remote, is OK.)
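
In case it helps anyone triage the same warnings, here is a minimal sketch that counts these WARN lines per node. It is not part of the binding: the class name `FailedNodeLogSummary` is made up, and the regex is inferred only from the log excerpt above, so other log formats may need a different pattern.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts "marked as failed" warnings per node. */
final class FailedNodeLogSummary {
    // Pattern inferred from the WARN lines quoted above (an assumption).
    private static final Pattern FAILED_LINE = Pattern.compile(
            "NODE (\\d+): Is currently marked as failed by the controller!");

    private FailedNodeLogSummary() {
    }

    /** Returns a map of node ID to warning count, sorted by node ID. */
    static Map<Integer, Integer> countByNode(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = FAILED_LINE.matcher(line);
            if (m.find()) {
                // Increment the counter for this node, starting at 1.
                counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it the lines of the log file (for example via `java.nio.file.Files.readAllLines`) gives one entry per node that the controller keeps reporting as failed.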

I have restarted OH2 several times.
Give me a honk if you want a debug log @chris


Great stuff :slight_smile:

So, this means that the controller thinks the device has failed. I had assumed when I wrote this that the controller would have the most reliable view of a device's state, so unlike previous bindings where I had kept an internal state, I decided to use the controller's state. However, it seems that this is less reliable than I'd hoped, so I may need to change this concept...

Yes. It seems strange that the controller insists they are faulty even though it is relaying messages to and from them.
I have not power-cycled my PC and stick. Do you think that could help?

No - probably not.

From my initial experience with this, I found that the controller would report some battery devices as FAILed until they wake up, but I've seen other situations where this isn't necessarily the case... I hope to get some more information about this from Sigma and will then decide how to proceed.

I too have a Zooz 4in1 sensor on battery power that gets marked as failed. If I manually toggle the node to wake up, it will green up for a few moments, and then the controller tries to talk to it again and marks it as failed again. It certainly seems strange, as if the controller or something else is still trying to ping a sleeping device...

Try healing the device from HABmin, that helped for me.
br,
Raymond

With all these reports, I'll add that I've seen some improper responses from some of my non-battery-powered devices. When restarting, I often find I have to monitor the startup and check that no devices get marked as "not responding", because if they are left in this state for too long, they then get marked as Failed. It's not always a guaranteed problem, but I believe the newer method has proved less reliable than the old method. So I'll add a +1 to the votes to give @chris some more work in reverting back to the old method! :wink:

What does the "not responding" state mean?

I've always found that the controller sorts things out in the end, but IMHO it's not doing a great job of detecting/deciding when a node has failed. I'd like to better understand how this works (i.e. inside the controller), and it's on my list of things to raise with Sigma when I speak to them next.

Has anyone successfully securely included a Kwikset 910/914 lock on this binding? I have other locks that are included, but I excluded one in testing. After 30 or more attempts, I cannot get this lock to securely include.

Yes, I have a 910 and it's a pain. It won't secure include unless I pull it from the door and put it within inches of the controller. I have a gen 5 stick. After the inclusion, once everything looks good in habmin (uses security is a green check mark), put it back in the door and potentially wait a while while it figures out your mesh again. My controller is one hop from the lock, so this takes a little while and a couple of restarts of OpenHab.

@chris I think that's exactly what the state was listed as at times: "node is not responding". Another time it happened because I turned the breaker off while installing another switch, so the device lost power for about 30 minutes. When I brought the power back on, it had already marked the device as failed and I had to exclude the device, then re-include it. That's just another example to help outline when it happens. It seems to me that there is room for error when the controller is re-establishing links (I'm not sure of the reason, though), and if a device goes offline for a short while, it can become detached as well. The same thing happened with the garage controller one time, which drove my chat with you about creating a portable device to secure include it again.

Now I've learned that if I'm going to restart OH, I should check on all my devices as it starts up. On occasion I see a "device not responding" message, so I hit the On/Off button for it manually or through OH, and suddenly it's "communicating" again and has no issue. But if left untouched, they often stay in that state, then turn to a Failed state, and I can no longer get them back without doing a full exclude/include process.

Ok, I don't know this message. There's one called "Node is not communicating with controller" - is that it maybe?

This definitely should NOT be necessary. Remember, the controller will do this even if in OH I have my own status, so adding an internal state won't solve this problem if it really is happening as you say (which I very much doubt).

Again, I'd be very surprised if that was really the case (like REALLY surprised). You should not need to exclude the device and reinclude again - this is clear and is certainly not my experience. As above, if this is really the case then we cannot solve this - sorry - but I think you're wrong.

Just to reiterate - if what you say is true, and as soon as the controller marks a device as failed you have to exclude and re-include, then I'm afraid that the "old method" will not solve this problem.

My experience is very different than yours here...

Ya that sounds familiar. I believe that would be the one.

On the others - I will see if I can observe it next time I need to restart. I don't restart frequently, but I'll try to be mindful if I do need to and grab some screenshots for you. I don't think I'll be cutting the power to anything though, so I can't say I'll likely have any evidence of that again, but I'll keep it in mind if it does happen.

So that means that the device is FAILED as far as the controller is concerned...

So what do you mean that after it's left in the "non responding" state for "too long" they get marked as FAILED - what does failed mean? I'm just trying to establish what the different states are so I can work out what the binding is doing, but I'm a bit confused - sorry.

Sorry, I know that my not using the exact messages is likely making this difficult.

If it gets marked as "Node is not communicating with controller", it's not actually failed. I can then hit the on/off switch manually or tell OH to turn it on/off. Suddenly it will communicate again, the node goes green, and all is well.

If, however, I leave the "Node is not communicating with controller" message alone, it will eventually turn into a different message, indicating that the node is ACTUALLY failed and has been marked as failed. I forgot the exact message that was appearing, but it was not the same as not communicating. When it reached this state, it could not be healed, reset, or made to work again without a manual exclude and then include.

Fair enough, point taken :wink:

Glad it isn't just me. I will keep trying and not give up hope. Out of curiosity, when you successfully included, were you in Low Power Inclusion mode, High Power Inclusion mode, or Network Wide Inclusion mode? I'm currently trying on Low Power and High Power.

It's no problem, but I just want to be clear so that when I'm looking for problems, I know what I'm looking for ;).

Yes - the controller says it has failed. See the code here:

            switch (event.getState()) {
                case FAILED:
                    logger.debug("NODE {}: Setting OFFLINE", nodeId);
                    updateStatus(ThingStatus.OFFLINE, ThingStatusDetail.COMMUNICATION_ERROR,
                            ZWaveBindingConstants.getI18nConstant(ZWaveBindingConstants.OFFLINE_NODE_DEAD));
                    break;
            }

and...

OFFLINE_NODE_DEAD = "Node is not communicating with controller"

So this message is set when the controller has marked the device state as FAILED.
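
To make the point concrete for anyone following along: "Node is not communicating with controller" and FAILED are the same state, seen from two sides. Below is a minimal, self-contained sketch of that mapping. `NodeState` and `NodeStatusSketch` are hypothetical stand-ins rather than the binding's real types, and the ONLINE branch is an assumption added for illustration; only the FAILED case and the message text come from the snippet above.

```java
/** Hypothetical stand-in for the node state reported by the controller. */
enum NodeState {
    ALIVE, FAILED
}

/** Illustration only: maps a node state to the status text a user would see. */
final class NodeStatusSketch {
    // Same text as the OFFLINE_NODE_DEAD value quoted above.
    static final String OFFLINE_NODE_DEAD = "Node is not communicating with controller";

    private NodeStatusSketch() {
    }

    static String describe(int nodeId, NodeState state) {
        switch (state) {
            case FAILED:
                // The controller says FAILED, so the thing goes OFFLINE with this message.
                return "NODE " + nodeId + ": OFFLINE (" + OFFLINE_NODE_DEAD + ")";
            default:
                // Assumption for the sketch: any other state is shown as ONLINE.
                return "NODE " + nodeId + ": ONLINE";
        }
    }
}
```

So in this model there is no separate "not communicating" stage that later turns into FAILED; if a second, different message exists, it would have to come from somewhere other than this code path.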

Ok, I'd like to understand what this message is, as I don't see any other messages along these lines in the source code.

This is really quite major so I'd appreciate it if you can clarify the above points. I'm not sure there's anything I can do about it though - changing the way the node is detected as dead/failed will not help in this case, but first let's understand exactly what the problem is.

Just to chime in, in case you need more data points: I also have three battery sensor nodes (2 door sensors and one multi) in the same state, where in HABmin they say "Not Communicating with the Controller", but if a trigger event happens they send a report to the controller and my related rules fire. I just don't get the regular reports on things like battery status from them, and I see the 30-second warning "NODE 44: Is currently marked as failed by the controller!"

Thanks - this is also what I see and is quite normal. My major concern lies with devices that need to be excluded from this state.

I would expect the controller to mark it as "not failed" if it received data, but I think this might only happen when the device wakes up - not just when it sends a report - as this is the only time the controller can communicate with the device.

I think it was on the default, which seems to be Network wide. Although I know for sure the 910 will not include over multiple hops - it never has for me on many different systems and two USB sticks. I've always had to put it physically close to the controller. Make sure to exclude it from OpenHab and the controller itself and then initiate the inclusion from habmin.

Also make sure the batteries are full (or brand new)! I had a bad battery at one point (one of 4 AAs that it takes) and it made it do weird things. I verified the voltage of each battery one at a time and also the voltage of the pack as a whole.
