OH2 Z-Wave refactoring and testing... and SECURITY

vossivossi · April 24, 2017, 7:57am

Wow, many thanks for your great support Chris!
In my test environment everything works stable now. So the update did the fix!
I will test it tonight in my production environment also.
Thank you again for all your work!

OMR · April 24, 2017, 8:33am

I took the plunge and installed the new binding.
It really made a difference.

Only NodOn Wall remotes are not included for some reason.
Log also complains about 3 nodes every 30s, even though they all work perfectly well.

2017-04-24 10:26:22.369 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:26:23.096 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:26:23.123 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:26:52.379 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:26:53.107 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:26:53.132 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:27:22.391 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:27:23.116 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:27:23.141 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!
2017-04-24 10:27:52.401 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!
2017-04-24 10:27:53.126 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!
2017-04-24 10:27:53.148 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!

Nodes 24 and 29 are Aeotec MultiSensor 6 units on USB power.
Node 22 is a NodOn Wall remote on battery. (Node 6, also a NodOn remote, is OK.)

I have restarted OH2 several times.
Give me a honk if you want a debug log @chris
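
For anyone who wants to check which nodes are being flagged without scrolling through the log, here is a minimal sketch that counts the "marked as failed" warnings per node. It assumes the warning text has exactly the form shown in the excerpt above; the class and method names (`FailedNodeSummary`, `countFailedWarnings`) are made up for illustration and are not part of the binding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailedNodeSummary {
    // Matches the warning text as it appears in the log excerpt above.
    private static final Pattern FAILED = Pattern.compile(
            "NODE (\\d+): Is currently marked as failed by the controller!");

    /** Counts how often each node ID appears in a "marked as failed" warning. */
    public static Map<Integer, Integer> countFailedWarnings(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = FAILED.matcher(line);
            if (m.find()) {
                counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines copied from the log excerpt; a real run would read the log file instead.
        List<String> sample = Arrays.asList(
                "2017-04-24 10:26:22.369 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 24: Is currently marked as failed by the controller!",
                "2017-04-24 10:26:23.096 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 29: Is currently marked as failed by the controller!",
                "2017-04-24 10:26:23.123 [WARN ] [rialmessage.IsFailedNodeMessageClass] - NODE 22: Is currently marked as failed by the controller!");
        System.out.println(countFailedWarnings(sample));
    }
}
```

The result is keyed by node ID, so a node that keeps being reported stands out by its count.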

chris · April 24, 2017, 8:40am

Great stuff

So, this means that the controller thinks the device has failed. I had assumed when I wrote this that the controller would have the most reliable view of a device’s state, so unlike previous bindings where I had kept an internal state, I decided to use the controller’s state. However, it seems that this is less reliable than I’d hoped, so I may need to change this concept…

OMR · April 24, 2017, 8:51am

Yes. It seems strange that the controller insists they are faulty even though it is relaying messages to and from them.
I have not power-cycled my PC and stick. Could that help do you think?

chris · April 24, 2017, 8:58am

No - probably not.

From my initial experience with this, I found that the controller would report some battery devices as FAILed, until they wake up, but I’ve seen other situations where this isn’t necessarily the case… I hope to try and get some more information about this from Sigma and will then decide how to proceed.

shadowmite · April 24, 2017, 1:20pm

I too have a Zooz 4in1 sensor on battery power that gets marked as failed. If I manually toggle the node to wake up it will green up for a few moments, and then the controller tries to talk to it again and marks it as failed again. It certainly seems strange, as if the controller or something else is still trying to ping a sleeping device…

RayBe · April 24, 2017, 1:23pm

Try healing the device from HABmin, that helped for me.
br,
Raymond

shawnmix · April 24, 2017, 1:32pm

With all these reports, I’ll add that I’ve seen some improper responses from some of my non-battery powered devices. When restarting, I often find I have to monitor the startup and validate that no devices get marked as “not responding”, because if they are left in this state for too long, they then get marked as Failed. It’s not always a guaranteed problem, but I’ll say that I believe the newer method has proved less reliable than the old method. So I’ll add a +1 to the votes to give @chris some more work in reverting back to the old method!

chris · April 24, 2017, 3:02pm

What does the “not responding” state mean?

I’ve always found that the controller sorts things out in the end, but IMHO it’s not doing a great job of detecting/deciding when a node has failed. I’d like to understand how this works (ie inside the controller) better and it’s on my list of things to speak to Sigma about when I speak to them next.

pedxing · April 24, 2017, 3:03pm

Has anyone successfully securely included a Kwikset 910/914 lock on this binding? I have other locks that are included, but I excluded one in testing. After 30 or more attempts, I cannot get this lock to securely include.

Yes, I have a 910 and it’s a pain. It won’t securely include unless I pull it from the door and put it within inches of the controller. I have a Gen 5 stick. After the inclusion, once everything looks good in HABmin (“uses security” shows a green check mark), put it back in the door and potentially wait a while for it to figure out your mesh again. My controller is one hop from the lock so this takes a little while and a couple of restarts of openHAB.

shawnmix · April 24, 2017, 3:30pm

@chris I think that’s exactly what the state was listed as at times: “node is not responding”. Another time, it happened because I turned the breaker off, so the device lost power for about 30 minutes while I was installing another switch. When I brought the power back on, the device had already been marked as failed and I had to exclude it, then re-include it. That’s just another example of when it can happen. It seems to me that when the controller is re-establishing links there is room for error (I’m not sure of the reason though), and if a device goes offline for a short while, it can become detached as well. The same thing happened with the garage controller one time, which drove my chat with you about creating a portable device to securely include it again.

Now I’ve learned that if I’m going to restart OH, I should check on all my devices as it starts up. On occasion I see a “device not responding” message, so I hit the On/Off button for it manually or through OH, and suddenly it’s “communicating” again and has no issue. But if left untouched, they often stay in that state, then turn to a Failed state and I can no longer get them back without doing a full exclude/include process.

chris · April 24, 2017, 3:38pm

Ok, I don’t know this message. There’s one called “Node is not communicating with controller” - is that it maybe?

This definitely should NOT be necessary. Remember, the controller will do this even if in OH I have my own status, so adding an internal state won’t solve this problem if it really is happening as you say (which I very much doubt).

Again, I’d be very surprised if that was really the case (like REALLY surprised). You should not need to exclude the device and reinclude again - this is clear and is certainly not my experience. As above, if this is really the case then we can not solve this - sorry - but I think you’re wrong.

Just to reiterate - if what you say is true, and as soon as the controller marks a device as failed you have to exclude and re-include, then I’m afraid that the “old method” will not solve this problem.

My experience is very different than yours here…

shawnmix · April 24, 2017, 3:51pm

Ya that sounds familiar. I believe that would be the one.

On the others - I will see if I can observe it next time I need to restart. I don’t try to restart frequently, but I’ll try to be mindful if I do need to and grab a snippet of screenshots for you. I don’t think I’ll be cutting the power to anything though, so I can’t say I’ll likely have any evidence of that again, but I’ll keep it in mind if it does.

chris · April 24, 2017, 4:04pm

So that means that the device is FAILED as far as the controller is concerned…

So what do you mean when you say that after a device is left in the “not responding” state for “too long” it gets marked as FAILED? What does failed mean here? I’m just trying to establish what the different states are so I can work out what the binding is doing, but I’m a bit confused - sorry.

shawnmix · April 24, 2017, 4:08pm

Sorry I know my lack of using the exact messages is likely causing it to be difficult.

If it gets marked as “Node is not communicating with controller” - it’s not actually failed. I can then hit the on/off switch manually or tell OH to turn it on/off. Suddenly it will communicate again, node goes green, and all is well.

If however I was to leave the “Node is not communicating with controller” message, it will eventually turn into a different message, indicating that the node is ACTUALLY failed and has been marked as failed. I forgot the exact message that was appearing, but it was not the same as not communicating. When it reached this state, it was truly incapable of being healed, reset, or to work again without manual exclude then include.

nolan_garrett · April 24, 2017, 4:18pm

Fair enough, point taken

Glad it isn’t just me. I will keep trying and not give up hope. Out of curiosity, when you successfully included, were you in Low Power Inclusion mode, High Power Inclusion mode, or Network Wide Inclusion mode? I’m currently trying on Low Power and High Power.

chris · April 24, 2017, 4:21pm

It’s no problem, but I just want to be clear so that when I’m looking for problems, I know what I’m looking for ;).

Yes - the controller says it has failed. See the code here:

            switch (event.getState()) {
                case FAILED:
                    logger.debug("NODE {}: Setting OFFLINE", nodeId);
                    updateStatus(ThingStatus.OFFLINE, ThingStatusDetail.COMMUNICATION_ERROR,
                            ZWaveBindingConstants.getI18nConstant(ZWaveBindingConstants.OFFLINE_NODE_DEAD));
                    break;
            }

and…

OFFLINE_NODE_DEAD = "Node is not communicating with controller"

So this message is set when the controller has marked the device state as FAILED.

Ok, I’d like to understand what this message is, as I don’t see any other messages along these lines in the source code.

This is really quite major so I’d appreciate it if you can clarify the above points. I’m not sure there’s anything I can do about it though - changing the way the node is detected as dead/failed will not help in this case, but first let’s understand exactly what the problem is.

rgerrans · April 24, 2017, 4:32pm

Just to chime in in case you need more data points, I also have three battery sensor nodes (2 door sensors and one multi) in the same state where in HABmin they say “Not Communicating with the Controller” but if a trigger event happens they send a report to the controller and my related rules fire. I just don’t get the regular reports on things like battery status from them and see the 30 second error warning of “NODE 44: Is currently marked as failed by the controller!”

chris · April 24, 2017, 4:35pm

Thanks - this is also what I see and is quite normal. My major concern lies with devices that need to be excluded from this state.

I would expect the controller to mark it as “not failed” if it received data, but I think this might only happen when the device wakes up - not just when it sends a report - as this is the only time the controller can communicate with the device.
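
To make the two possibilities concrete, here is a toy model. It is only a sketch of the behaviour described above, not how the controller firmware or the binding actually works, and every name in it (`FailedFlagModel`, `Event`, `onEvent`) is invented for illustration. One instance clears the failed flag on any incoming data, the other only on a wake-up.

```java
import java.util.EnumSet;

public class FailedFlagModel {
    /** Kinds of traffic a battery node might send (hypothetical names). */
    public enum Event { REPORT, WAKE_UP }

    // Which events are assumed to clear the failed flag in this model.
    private final EnumSet<Event> clearingEvents;
    private boolean failed;

    public FailedFlagModel(EnumSet<Event> clearingEvents, boolean initiallyFailed) {
        this.clearingEvents = EnumSet.copyOf(clearingEvents);
        this.failed = initiallyFailed;
    }

    /** Clears the failed flag only if this kind of event is configured to do so. */
    public void onEvent(Event event) {
        if (clearingEvents.contains(event)) {
            failed = false;
        }
    }

    public boolean isFailed() {
        return failed;
    }

    public static void main(String[] args) {
        FailedFlagModel anyData = new FailedFlagModel(EnumSet.allOf(Event.class), true);
        FailedFlagModel wakeUpOnly = new FailedFlagModel(EnumSet.of(Event.WAKE_UP), true);

        // A sensor report arrives, e.g. a door sensor being triggered.
        anyData.onEvent(Event.REPORT);
        wakeUpOnly.onEvent(Event.REPORT);
        System.out.println("cleared by any data, after report: failed=" + anyData.isFailed());
        System.out.println("cleared by wake-up only, after report: failed=" + wakeUpOnly.isFailed());

        // The device later wakes up.
        wakeUpOnly.onEvent(Event.WAKE_UP);
        System.out.println("cleared by wake-up only, after wake-up: failed=" + wakeUpOnly.isFailed());
    }
}
```

Under the wake-up-only assumption, a node can keep delivering reports (so rules still fire) while remaining flagged as failed until its next wake-up, which would match what is being reported here.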

pedxing · April 24, 2017, 4:48pm

I think it was on the default, which seems to be Network Wide. Although I know for sure the 910 will not include over multiple hops – never has for me on many different systems and two USB sticks. I’ve always had to put it physically close to the controller. Make sure to exclude it from openHAB and the controller itself and then initiate the inclusion from HABmin.

Also make sure the batteries are full (or brand new)! I had a bad battery at one point (one of 4 AAs that it takes) and it made it do weird things. I verified the voltage of each battery one at a time and also the voltage of the pack as a whole.