OH2 Z-Wave refactoring and testing... and SECURITY

nolan_garrett · July 16, 2017, 4:25pm

Sorry, yes, I was confused also. I’m now on the latest snapshot of both OH and the dev binding in this thread, will try a secure inclusion soon.

nolan_garrett · July 16, 2017, 4:52pm

I can confirm that I’m back to the same state I was in on much earlier versions of the binding. Now that my ~70 nodes are all included, I can’t get secure inclusion to work for my new NGD00Z-4 garage door controller. I tried in NWI and Low Power Inclusion modes.

I am considering testing my theory by building an RPI with the latest dev snapshot and the binding, then taking the device and the RPI somewhere away from my home where there are no other local nodes. If I do that, set the security key to the same one I have on my primary install, and it includes successfully, will I be able to bring the NGD00Z-4 back onsite and have it join the network? Any other ideas on a way to include this device? I suppose I could cut power to the entire house, run my server off of my UPS, and do a bind while all other Z-Wave equipment is offline…

nolan_garrett · July 16, 2017, 4:54pm

When I had to do this in the past, yes, I had to delete and re-add the things, but my install at the time “remembered” the Item links so it was pretty trivial to get the Things working correctly again.

chris · July 16, 2017, 4:56pm

I’m not sure I completely understand you, but I think the answer is no - you can’t use another installation to include the device, even with the same key, and expect it to work on another system.

The only thing I can think of is that all the other traffic is slowing things down. Can you send me a log (open a ticket on my website) so I can take a look. If so, then the answer might be to stop all other traffic during a secure inclusion.

shawnmix · July 16, 2017, 4:56pm

@nolan_garrett - these were my exact observations as well. I’m glad someone else had a similar experience to outline this so I know perhaps I’m not crazy.

@chris - I don’t know if you recall, but I had some issues in the past that I asked you to look into, with logs pulled and put together. It was specifically around secure inclusion failing for a device I know was not already paired. You did indicate the time window in which secure inclusion needed to happen after normal inclusion, and nothing in the logs seemed to show that it was happening. I believe this is exactly the same situation as Nolan described.

I tried by resetting my ZWave stick, and this would allow secure inclusion to work, literally first try/attempt on the secure devices. But if I had gone along and finished including all my devices and came to do secure inclusion afterward, it would just keep failing. So when this happened, I finally tried to reset my stick again, and then the secure include would work right away.

I don’t know what to say for what could be a cause or how to really debug it, since debug logs never came up with anything useful from your reviews of my logs at the time. Just indicated that the secure inclusion wasn’t happening in the timeframe you indicated it needed to. For me, I’m only looking at about 40 or so devices, vs the 70 of Nolan.

shawnmix · July 16, 2017, 4:58pm

I honestly believe this is the exact problem and solution. I noticed often for the same reason, that during startup it seemed impossible to get secure inclusion to work with a few devices. I did try one time to add a secure device with only a few items, and it took longer, but after a few tries, and letting the network slow down (aka stop all the startup messages) - then it would be more stable.

EDIT: Not sure if there is a way to “cease” the controller function when trying to securely include, without disrupting or causing problems on the ZWave network. I don’t want other devices that were trying to send their hail messages and such to suddenly show as disconnected because they couldn’t finish initial communications.

nolan_garrett · July 16, 2017, 5:39pm

This is my theory. I went from being able to securely include 4 devices without a single retry to where I am now, all on the same OH2 snapshot and binding version. I’m convinced that the number of nodes somehow impacts secure inclusion. I have a noisy network - lots of Aeon and Fibaro devices reporting temperature, humidity, motion, and energy use.

I also agree with @shawnmix, as I’ve provided logs on this in the past as well and we weren’t able to identify the cause. I will try to get more logs now that I’m on the latest version - I’m traveling now so I may not be able to get it done until next weekend unless I can convince the girlfriend to do some button pushing for me.

sloth0815 · July 16, 2017, 5:39pm

Ok, I have done this and I am now using this Z-Wave binding. Mesh healing works, which is nice since I moved quite a few items around since the original inclusion. However, I am having 2 issues:

1. I cannot include the Fibaro Double Relay Switch FGS-222. I had the same issue with the normal binding until I upgraded to the latest snapshot version. Is this device not in the database yet?
2. I have several outlet pluggable sirens (Aeotec) and they only seem to function reliably when directly linked to the controller (no routing). Am I missing something?

Many thanks in advance,
Jonas

nolan_garrett · July 16, 2017, 5:41pm

This is my experience also. Don’t even bother trying a secure include regardless of node count if you’re doing a cold start of the binding. If I restart OH because I’ve tried too many excludes / includes and the binding starts acting odd, I have to leave OH alone for an hour or more and let the Z-Wave discovery / mesh processes finish before I can even hope to try a secure include.

chris · July 16, 2017, 8:13pm

Sorry - I don’t recall looking at a comparison of logs with multiple devices and with no other devices in the network.

shawnmix · July 16, 2017, 9:17pm

@chris - I think that was the problem. The logs didn’t show anything odd or out of the ordinary IIRC. It simply showed normal activity for everything, the proper stages for the secure inclusion, but the lack of the proper communication for the secure include.

For reference - this is the “ticket ID” on your site: b8WnG2jx9LD - Subject: Yale YRD240 Secure Inclusion Failure. It seems in re-reading, you had pointed out there was a timeout happening. I think this is leading to and being correlated with what we’ve experienced and outlined.

It’s as if the timeout happens because of too much chatter on the line? I don’t know enough about the underlying messages, but perhaps the secure include messages are getting queued up behind other transactions and not getting priority, ending in a timeout on the controller side? It might be possible to test this using the method you mentioned of stopping other communications while the secure include is happening. Is there also a way to increase the timeout window? Perhaps just bumping this could give us the time needed on larger setups.
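To make the queueing idea concrete, here is a minimal Python sketch. It is not the binding’s code and says nothing about how the binding actually schedules messages; the `TransactionQueue` class, the two priority levels, and the message strings are all made up for illustration. It only shows what “getting priority” would mean: a secure inclusion message that is queued last still gets sent first.

```python
import heapq
import itertools

# Hypothetical priority levels (assumption): lower number is sent first.
PRIORITY_SECURE_INCLUSION = 0
PRIORITY_NORMAL = 1


class TransactionQueue:
    """Toy send queue: highest priority first, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so that messages with
        # the same priority come out in the order they were queued.
        self._counter = itertools.count()

    def put(self, message, priority=PRIORITY_NORMAL):
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def get(self):
        _priority, _seq, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Pop every message and return them in send order."""
    sent = []
    while len(queue):
        sent.append(queue.get())
    return sent


queue = TransactionQueue()
for node in (12, 31, 47):
    queue.put(f"poll node {node}")  # ordinary background traffic
queue.put("secure inclusion handshake", PRIORITY_SECURE_INCLUSION)

order = drain(queue)
print(order)
```

With a plain FIFO queue the handshake would wait behind the three polls; with the priority key it jumps ahead of them. Whether anything like this is relevant depends on whether the messages are really being delayed in a queue, which the logs would have to show.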

EDIT: If any of this is feasible to throw together a test side binding - I’d be happy to try and test it out to validate my experience. I have another lock sitting around that is not included and has failed to include (assuming for this reason). I can switch in the test binding and run a test with this lock device to validate.

chris · July 16, 2017, 10:12pm

I’ll take another look at the ticket - I remember looking at it in the past, but I just don’t remember the link with an empty network vs a busy network. I’ll take a look over the next couple of days.

nolan_garrett · July 17, 2017, 12:23am

Sorry, you’re right. I have only ever provided logs when I had a problem - never when it worked right. I really don’t want to have to reset my network to generate a “low traffic” example.

nolan_garrett · July 17, 2017, 12:58am

Exactly the same for me. @chris and I were thinking the first device I started having these issues with was defective or the hard reset option on it wasn’t working. It wasn’t until later I recognized it was systemic. In all, I’ve made maybe 100 attempts. Somewhere past 30 devices, inclusion became impossible for me.

Super excited we think we’ve correlated something! Hopefully it leads us to a fixable root cause.

shawnmix · July 17, 2017, 3:08pm

@nolan_garrett - funny, I did the same thing! I actually exchanged the lock I had and made sure everything was fresh and the lock had never been seen by this system or anything else before I tried the first inclusion with it. Same result. I became frustrated and thought it might be something with my config or my ZWave network itself. So I reset, and made sure to do secure includes first. All worked fine. Agreed, I’m glad we can correlate some behavior as having the same symptoms.

@chris I would agree I don’t necessarily want to have to reset my stick to generate low level logs to compare, but if it’s of interest - I do have a secondary Pine64, the ZWave module for it, and a secondary lock. I could try getting a fresh OH install, put in your binding, and run any test you need with no devices and a secure include. Then remove the pairing of the lock, switch over, and kick out logs to show you a comparison of running a secure include with the same lock on my current working full network. Downside, it’s likely going to take about 2 weeks to do - I’m heading into 2 weeks of travel and lack of availability to test until I return Aug 2.

PS - The kicker … the extra lock is a BE469

chris · July 17, 2017, 4:30pm

I don’t think this is related to the number of devices, or to blocking from other devices etc as was suggested. Looking at the logs again from this ticket, there are no other communications happening with this device during the inclusion.

So maybe there’s something else at play then.

shawnmix · July 17, 2017, 6:46pm

@chris - I think that’s the core problem. There isn’t any logging indicative of any issues. It just looks like a timeout. But I can tell you that if I reset my ZWave stick and have 0 devices, secure include works. If I switch out and then try to include after having added other nodes, secure include fails.

My idea was that OTHER ZWave devices might happen to be sending messages around the same time, and that this causes an issue with receiving the correct messages in the right order and timing? Or possibly that the controller isn’t able to actually reach the secure device due to some type of routing issue from other node communications? Not sure, I’m just throwing out rough ideas. But I can tell you, my behavior matches exactly with Nolan’s, and that’s honestly the only difference in my setup between a fresh include and an include after others have been added.

chris · July 17, 2017, 7:04pm

The device didn’t respond - to my way of thinking, that is a timeout.

Where are they sending all these messages to though? There’s nothing in your logs showing any other messages during this time, so they aren’t being sent to the controller. Unless you have lots of devices sending data directly to another device (ie not to the controller), I can’t see that this is likely.

If that is the case (which I personally don’t think is likely, but I’m happy to be convinced otherwise), then how do we stop this? There’s no way to stop other devices sending all these messages that are stopping the controller from working. And if these timeouts are causing havoc during the secure inclusion, then why aren’t they causing the same issue all the rest of the time with any other node? A secure inclusion message is no different from any other message, so do you see lots of timeouts to all nodes?
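One way to answer that last question from a log is to tally timeouts per node. The Python sketch below assumes the log has already been parsed into `(node_id, timed_out)` pairs; that record shape, the `timeout_rates` helper, and the sample values are all hypothetical and do not reflect the binding’s real log format or any real network.

```python
from collections import defaultdict


def timeout_rates(records):
    """Return {node_id: fraction of that node's transactions that timed out}.

    `records` is an iterable of (node_id, timed_out) pairs, a hypothetical
    shape chosen for this sketch, not something the binding emits directly.
    """
    totals = defaultdict(int)
    timeouts = defaultdict(int)
    for node_id, timed_out in records:
        totals[node_id] += 1
        if timed_out:
            timeouts[node_id] += 1
    return {node: timeouts[node] / totals[node] for node in totals}


# Made-up sample data, for illustration only.
sample = [
    (5, False), (5, False),
    (9, False), (9, True),
    (72, True), (72, True),
]

rates = timeout_rates(sample)
for node, rate in sorted(rates.items()):
    print(f"node {node}: {rate:.0%} timeouts")
```

If busy traffic were the cause, you would expect elevated rates across many nodes; if only the node being securely included stands out, the explanation probably lies elsewhere.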

I simply don’t think this is likely. I’m not contesting that it works if you have no nodes - at least it did once - I’m just saying that I don’t subscribe to the explanation.

Hmmm - if you aren’t including the device in direct range of the controller, then this might be the problem? ie routing isn’t relevant here as it’s secure inclusion, so the device needs to be within direct range of the controller.

Maybe ;). Let’s see… Maybe I’m being picky, but I’m just being careful not to jump to conclusions - there’s so many times I hear “I have exactly the same problem” to find it’s something totally different.

shawnmix · July 17, 2017, 8:09pm

I hear you on all fronts. I subscribe to the methodology of “show me how it’s working properly or failing to work properly”: I need evidence of an issue before I can believe the idea. I wish I knew more about the protocol so I could make more educated guesses about where the problem lies.

I can tell you it’s not a distance issue - I always worried about this. I’ve tried putting it on top of the controller, literally, to the side, below, maybe a foot away, etc. None work. If it is an early or first node though, it works and doesn’t even need to be directly next to it. Works a few feet away even. And it’s quick too!

If there is something you can think of that I can test - I’m happy to try and run the consolidated test to output the result we need to see. Like I mentioned, I have a secondary device to test with a controller. The only difference, though, is that the controller will be different. But short of resetting my network and re-attaching everything to rules, scenes, renaming etc - I don’t want to go back to total square 1. I’m happy to test other possible scenarios with the secondary setup.

vespaman · July 17, 2017, 9:06pm

What do you mean by “I cannot include the Fibaro…”? If you cannot even include it in the network, the database does not matter at all. Try to exclude and re-include. Check that the stick actually excludes, and also includes. Try a different pace on the include (sometimes too fast is not good, and definitely too slow is bad). Or use auto include if the device has this feature.
I don’t have a 222 myself, but I’d be very surprised if it is not in the database, once you get it included.

Regarding your sirens, I have no suggestions (sorry, I have no experience at all with sirens). But all troubleshooting starts with viewing the debug log.