OH2 Z-Wave refactoring and testing... and SECURITY

chris · July 7, 2018, 9:42am

It’s certainly not working as I expected -:

This shows the second message that is rejected is sent 5ms after the first one - this should have been delayed (like the third TX message is). This seems to be common…

mhilbush · July 7, 2018, 9:57am

Good catch. There’s 200 ms between transmissions on node 68, but the message for node 85 goes out immediately after the rejection.

chris · July 7, 2018, 9:58am

Ok, I’ve resolved the issue with the holdoff not working first time around - the request gets requeued before the timer is started, so that shouldn’t be a big issue.
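To make the ordering concrete, here is a minimal Python sketch (the binding itself is Java, and this is not its code). It assumes one reading of the fix: the hold-off is armed before the rejected request goes back on the queue. The class, method names and the 200 ms default are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class HoldoffSender:
    """Toy sender with a global hold-off. All names here are hypothetical."""

    def __init__(self, holdoff_ms=200):
        self.holdoff_ms = holdoff_ms
        self.blocked_until = 0   # time (ms) before which nothing may be sent
        self.queue = []          # pending requests, oldest first

    def on_rejected(self, request, now_ms):
        # Arm the hold-off first, then requeue. In a concurrent sender,
        # requeueing first could let the queue be serviced before the
        # block is in place (assumed to be the "first holdoff" problem).
        self.blocked_until = now_ms + self.holdoff_ms
        self.queue.insert(0, request)

    def next_to_send(self, now_ms):
        # The hold-off blocks all sending, whichever node a request is for.
        if now_ms < self.blocked_until or not self.queue:
            return None
        return self.queue.pop(0)
```

With a request for another node already queued, a rejection at t=1000 ms keeps everything back until t=1200 ms, after which the rejected request is retried first.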

As a matter of interest, what sort of computer are you running on? I’m wondering if I shouldn’t add a short delay between all transactions (say 20ms) to see if we can reduce the occurrences of failures in the first place…

mhilbush · July 7, 2018, 10:03am

Intel Core i5 overclocked to 4.2 GHz.

chris · July 7, 2018, 10:11am

It’s not linked to the node, so this is just coincidence… The holdoff is a total block on all sending.

In all, this does look a lot better. I see one OFFLINE in this log -:

After this, node 68 works fine, so this is really caused by comms issues with the controller. With the first holdoff being skipped, we effectively only have a single 200ms delay here before we’re offline.

After your earlier comments about it always taking 2 attempts, I increased the 200ms to 300ms - I was thinking about dropping it back down given the issue with missing the first hold, but I might leave it and try to eliminate these sorts of OFFLINEs if we can.

I wonder if that’s why some people have more problems than others? Maybe adding a standard delay between transactions of 5 to 20ms would help avoid overloading the controller when people are using fast computers (just thinking out loud - comments welcome).
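As a rough illustration of the "standard delay" idea, the helper below computes when the next frame could go out given a minimum gap between transactions. It is only a sketch: the function name and the 20 ms default are assumptions, not anything taken from the binding.

```python
def earliest_send_time(now_ms, last_send_ms, min_gap_ms=20):
    """Return the time (ms) at which the next frame may be sent.

    last_send_ms is None if nothing has been sent yet.
    """
    if last_send_ms is None:
        return now_ms
    # Never send sooner than min_gap_ms after the previous frame.
    return max(now_ms, last_send_ms + min_gap_ms)
```

For example, a frame that becomes ready 5 ms after the previous one would wait a further 15 ms with a 20 ms gap, while one that becomes ready 400 ms later goes out immediately, so a slow machine would barely notice the delay.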

@5iver and @digitaldan - what are you guys running? (Probably also something fast.)

chris · July 7, 2018, 10:14am

Updated version that fixes the first holdoff is here. This uses a 250ms holdoff.

mhilbush · July 7, 2018, 10:45am

Looks much better! Only one node is offline, and that’s a battery-powered device.

Very few REJECTED messages, and when one occurs, it’s successful on the 2nd attempt.

Sending you a log.

5iver · July 7, 2018, 10:53am

Nothing too fancy here… I’m using 10+ year old junk from the recycling pile (AMD x2 5600+ 2.8GHz w/6GB DDR2). I nearly tossed it last week when the CPU fan mount spontaneously fractured. It got lucky… I found a model for it so I printed a replacement!

5iver · July 7, 2018, 11:24am

What’s going on here? This first bit looks OK…

But then there’s a barrage of multiple reports (off screen), and then some more gets, then node dies and comes back online…

5iver · July 7, 2018, 11:33am

Another strange one…

5iver · July 7, 2018, 11:57am

The first one was from a siren (battery powered, frequently listening), the second from a dimmer, this one is from a WADWAZ-1 door sensor. I currently have 30 dead nodes and growing, all battery powered.

chris · July 7, 2018, 12:15pm

I’m shocked. I felt sure you’d be using a cryogenically cooled super-computer.

chris · July 7, 2018, 12:17pm

Can you email me over the log?

5iver · July 7, 2018, 12:17pm

On their way…

chris · July 7, 2018, 12:21pm

Great - thanks. The hold-off seems to be a big step forward.

I will probably merge this into the dev binding tonight (I’ll take a look at Scott’s log first).

FTR -:

github.com/openhab/org.openhab.binding.zwave

Update transaction management to add holdoff timer and ensure FIFO

openhab:development ← cdjackson:tranaction_updates

opened 08:38AM - 07 Jul 18 UTC

cdjackson

+9486 -493

This adds a hold-off timer if the controller sends a RESponse indicating an error. This normally means the controller is busy (eg out of buffers) and resending the request immediately will simply result in the same error. Failure of network management transactions are not considered when setting a device to DEAD. This improves the transaction comparator to ensure that transaction queue is managed in a FIFO way when transactions have the same priority. Signed-off-by: Chris Jackson <chris@cd-jackson.com>
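The "FIFO when priorities are equal" part of that description can be sketched in Python with a monotonically increasing sequence number as a tie-breaker. This is a generic illustration of the technique, not the binding's Java comparator. The class name is hypothetical, and it assumes a lower number means higher priority.

```python
import heapq
import itertools


class TransactionQueue:
    """Priority queue that stays FIFO among equal priorities (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # insertion order tie-breaker

    def add(self, priority, transaction):
        # Assumption: lower number = higher priority. The sequence number
        # guarantees that equal-priority entries come out oldest first.
        heapq.heappush(self._heap, (priority, next(self._sequence), transaction))

    def pop(self):
        _priority, _sequence, transaction = heapq.heappop(self._heap)
        return transaction

    def __len__(self):
        return len(self._heap)
```

Without the sequence number, the order of two transactions with the same priority would depend on how the payloads happen to compare, which is the kind of non-FIFO behaviour the description says the comparator change avoids.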

sihui · July 7, 2018, 12:26pm

Hey guys, that is not fair, one of those is my main working computer (upgraded with an ssd and still working fine with Win10). And yes, I bought it in 2008 …

5iver · July 7, 2018, 12:30pm

My POS has got one too!

vespaman · July 7, 2018, 12:42pm

I have always had issues starting up the Z-Wave network (~70 nodes). I have OH running on my main server, a rackmount Core i7 at 4-something GHz, so I thought I should try the new updated version. Gut feeling is that it works much better than before, and looking in the logs also gives me the feeling that this is a step up. I have not been updating much lately, so maybe this was fixed earlier during the last month or so, but I just wanted to give my feedback!

chris · July 7, 2018, 12:42pm

I suspect that there is some “bad sh!te” happening at the network level. It’s a guess as we don’t have any visibility of that at the binding level, but the multiple responses are indicative of lots of retries happening, or maybe the network being congested and retries getting queued. I thought that the controller only sends a few retries though (3, I believe). In this sequence the binding is sending 1 GET request, and we get 12 REPORTs.

The OFFLINE at the end I will look at. This is another case where I shouldn’t set the device offline. Here the frame is rejected by the controller, not the device, so we shouldn’t blame the device and set it offline.
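One way to picture "don't blame the device" is to track where each failure came from and only count failures that can be attributed to the node. The Python sketch below is purely illustrative: the enum values, the class, and the failure threshold are all assumptions, and the binding's real (Java) logic may look quite different.

```python
from enum import Enum


class Failure(Enum):
    CONTROLLER_REJECTED = 1  # controller refused the frame (e.g. busy)
    NO_ACK_FROM_DEVICE = 2   # frame went out but the node never acknowledged


class NodeHealth:
    """Hypothetical per-node failure tracker."""

    def __init__(self, max_device_failures=3):
        self.max_device_failures = max_device_failures  # assumed threshold
        self.device_failures = 0

    def record(self, failure):
        # Only failures attributable to the node count against it;
        # controller rejections say nothing about the node's health.
        if failure is Failure.NO_ACK_FROM_DEVICE:
            self.device_failures += 1

    def record_success(self):
        self.device_failures = 0

    @property
    def offline(self):
        return self.device_failures >= self.max_device_failures
```

Under this scheme any number of controller rejections leaves the node online, and it only goes offline after repeated failures that actually involve the device.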

chris · July 7, 2018, 12:43pm

Thanks - it’s really useful…