Z-Wave node reported offline yet commands do arrive at the node - how?

Hi,

I’m on OH 2.3. I have a double relay switch (FGS222) that controls my garage door. It sits somewhat on the edge of my network, so the signal is not the greatest over there. According to the log file and the UI, at seemingly random moments the node loses communication with the controller, and later regains it.

Examples:
2018-09-29 17:30:01.783 [hingStatusInfoChangedEvent] - ‘zwave:device:52422632:node3’ changed from ONLINE to OFFLINE (BRIDGE_OFFLINE): Controller is offline
2018-09-29 17:30:01.784 [hingStatusInfoChangedEvent] - ‘zwave:device:52422632:node3’ changed from OFFLINE (BRIDGE_OFFLINE): Controller is offline to ONLINE
2018-09-29 17:38:08.003 [hingStatusInfoChangedEvent] - ‘zwave:device:52422632:node3’ changed from ONLINE to OFFLINE (COMMUNICATION_ERROR): Node is not communicating with controller

However when using the device, it works 100% of the time. So even when OH thinks the device is OFFLINE, I can control it successfully.

This is rather confusing when troubleshooting node issues. Shouldn’t the online/offline logging say something about the actual ability to control a node?

Thanks in advance!

It does - it says that there is a problem communicating with the controller.

This will be caused by the responses not being received as expected. Without seeing the log I can’t say too much more than that - it could be that the device doesn’t support something that the binding expects it should (which means the database might need updating to resolve it) or it might be that the link from the device to the controller is not very strong and it’s missing some responses (which would require more repeaters to resolve).

Right, thanks for your answer. However I have been using the node for over a week on a daily basis without problems. Are you saying these messages may be a sign of a very subtle problem that I just haven’t experienced yet?

W.r.t. the logs: I can go collect some. Do I need to turn any debug logging on?

Also related: today I tried 2.4.0-M4 for a short while. For this node it seemed much more sensitive to the weak signal than 2.3. While 2.3 seems to work well and only complains a little in the log, with 2.4.0-M4 there is a lot of complaining and the node starts behaving really oddly and unreliably. Could this be a regression resulting from the new ZWave binding? Or is the new binding more strict on purpose?

Thanks

Possibly - just because the controller can send messages to the device, doesn’t of course mean that the device can send messages to the controller. All I can currently say is that this issue is caused by the controller not receiving the responses it’s expecting - this could be an RF issue, or a device compatibility issue (as explained earlier).

Yes.

I can’t really say without knowing what the issue is. The new binding is quite different to the old one, so without logs it’s impossible to guess what might be the issue.

The new binding works pretty much the same way when it comes to deciding a device is DEAD. If it doesn’t get a response to messages, then it will mark the device DEAD (and OFFLINE). The new binding might be polling differently though, so this might be what you’re seeing. A debug log is really needed to understand the issue better.
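
For reference (assuming a fairly standard OH 2.x setup), zwave debug logging can be enabled from the Karaf console with:

log:set DEBUG org.openhab.binding.zwave

and set back to INFO the same way once you have captured the events of interest.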

Ok, I’m gonna take your word for this. Wondering though (getting slightly off topic here) why the direction makes a difference. Does it have something to do with the difference between TX power and RX sensitivity on both ends?

Anyway, I’ll be back later with some debug logs.

Yes, one end might transmit at a higher power, or have a higher sensitivity, or a better antenna. RF is not completely predictable, and even less so in the environments these devices have to operate in, since a lot of compromises are made in their design.

Also, along with power/sensitivity, antenna performance, reflections, and any noise near the receiver (eg TV, microwave ovens, WiFi etc - none of which are meant to radiate in the same band as ZWave, but which will likely raise the local noise floor) can also have an impact.

Let’s take a look at the logs - they will give a better indication of what’s up.

Thanks for explaining!

I collected some logs. I trimmed them a bit to make them fit on pastebin. If I trimmed crucial info, please let me know. The “problematic” node is node 3. This is on OH 2.3.

There are 3 notable events AFAICS:

  • At 20:12, the switch (i.e. the garage door) is opened and closed successfully, while the node is registered OFFLINE (you can see it coming ONLINE hours later, at 01:00)
  • At 01:03, I actually had some trouble opening/closing the door. While the door would start opening immediately on toggling the switch, it would stop too early, before it had opened completely. Only on the 5th try did it open completely (and close again).
  • At 10:44, a successful open and close (note that the controller has considered the switch OFFLINE since 01:30)

events.log:
https://pastebin.com/xrS24SvC

openhab.log:
https://pastebin.com/EJMt1ufE

Regards,
Robbert

The log shows that 97% of messages to node 3 are timing out (ie there is no response from the device). The controller reports that there is NO ACK from the device, so this is a good indication that the problem is an RF issue. If it was a problem with a higher level response, then we would see an ACK, but no data.

If you want to look at the log in a more user friendly way, you can use my log viewer -:
https://www.cd-jackson.com/index.php/openhab/zwave-log-viewer

Alright, that doesn’t surprise me. So I just needed to adjust my understanding of the log messages and of the probably asymmetric properties of transmitting/receiving RF between nodes.

What still puzzles me though: can RF issues also cause the behaviour described at 01:03, with the door stopping halfway? The switch is configured as a simple pulse switch (500ms), so I would think that to stop the door in the middle of opening, another pulse would be required. I’m not sure how RF issues can cause an extra pulse (or an extra-long pulse?). Or is this a totally separate issue?

Thanks, very useful!

I’m not sure, but when you have RF issues like this, the thing you need to keep in mind is that the binding waits up to 5 seconds for the response. During this time, other commands will be queued, and sent once those 5 seconds have passed. If you look in the log around this time, there are quite a few commands coming in to the binding (ON/OFF/ON/OFF…); these will get queued, and that is likely to cause what you saw (I think).
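
To make that concrete, here is a minimal sketch (plain Java, not the binding’s actual code) of the queueing effect: each unanswered command blocks the queue for the full 5 second timeout before the next queued command is sent. The commands and timings are invented for illustration.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only, not the binding's actual code: one command is sent at a time,
// and an unanswered command blocks the queue for the full 5 s timeout before
// the next queued command goes out.
public class TransactionQueueSketch {
    static final long TIMEOUT_MS = 5000;        // per-transaction wait described above

    public static void main(String[] args) {
        // A burst of commands arriving while the first one is still pending
        Queue<String> queue = new ArrayDeque<>();
        queue.add("ON");
        queue.add("OFF");
        queue.add("ON");

        long now = 0;                           // simulated clock (ms)
        while (!queue.isEmpty()) {
            String cmd = queue.poll();
            System.out.printf("t=%6dms  send %s to node%n", now, cmd);
            boolean ackReceived = false;        // assume the ACK never makes it back
            if (!ackReceived) {
                now += TIMEOUT_MS;              // wait the full timeout...
                System.out.printf("t=%6dms  no ACK for %s, move on%n", now, cmd);
            }
            // ...so the next queued command is only sent seconds after it was issued
        }
    }
}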

Hmmm… I doubt this: I am definitely careful not to flood the device with extra commands while the door is still in motion. And even if that were true, it could only explain issues with subsequent toggles, right? For the very first ON command at least, knowing I just sent a single toggle, this can’t be the cause, because there are no commands queued up yet. So I really don’t understand why it stops moving early in this case.

I can probably reproduce this with some effort. Any pointers for troubleshooting this further?

Well, this is what is in the log - I’m not sure how it can be wrong. These COMMAND messages are logged right at the beginning of the command processing - ie right when the core framework passes the command to the binding. There’s no buffering or queuing at this point, and no real ability for delays etc (at least, not in the binding).

I think this is real…

Not really - I guess you have some sort of rule that is governing this if you are using timers to send a 500ms pulse? It’s not a binding issue at this point, so I would look at the way the rule works, and possibly add logging into the rule to see what it’s doing.

I think you misunderstood me: I wasn’t denying that I sent all those ON/OFF commands and that some of them may have gotten queued up and delivered later, possibly messing with the door. However, I was trying to explain that the problem (the door stopping prematurely) already happened right after the very first ON command, at which time I definitely had not sent any additional commands YET.

If you look at the timeline right after arriving at the door…

  • 01:03:14 - ON command --> door starts opening immediately
  • 01:03:18 - (guess) --> door stops opening (not even halfway open)
  • 01:03:19 - 5s ACK timeout
  • 01:03:25 - me sending OFF command in response to the door being stuck

… you see that I only sent the OFF command 11 seconds after the first ON command. If I’m correct, that should mean that the command queue was empty during those 5s between 01:03:14 and 01:03:19. So with that in mind, how can a previously queued up command be the cause of the door stopping around 01:03:18?

There is no rule, and the timer is built into the device. It’s just a simple FGS222 relay switch set to pulse mode by setting configuration parameter 4, “Relay 1: OFF-delay time (100ms steps)”, to “5”.

I agree with you it does not look like a binding issue though. Just wanted your opinion on that. I guess I will slowly start to blame the hardware for this.

Sorry - I misunderstood. However in that case, I have no suggestion as there is nothing happening at the time you say it stopped…

Right :slight_smile: Well in that case I’ll focus on getting some extra mesh coverage, trying some different hardware and trying 2.4.0-M4. Thanks for your help so far!

Hi Chris,

Still troubleshooting this issue. It’s fairly reproducible now, and I have a theory. Btw, I switched to 2.4.0-M4 in the meantime, and also reset and excluded/included the node. It’s node 12 now; the problem persists.

What may also be an interesting factor here are the retries in case of a NOACK. See below:

Following my single switch toggle (at 17:33:15), a total of 3 SWITCH_BINARY_SET ON commands are sent to the node. Since communication from controller to node is perfect (it’s the other way around that is problematic), the node pulses three times, at :15, :20 and :25. That causes the garage door to start opening, stop 5s later, and start closing again 5s later. Remember, this is just 1 switch toggle by me.

So, OH apparently assumes on a NOACK that communication TO the node has failed, while that actually can’t be concluded from a NOACK: the command may well have arrived, and only the ACK on the way back got lost. In my situation (admittedly a corner case with RF issues, but still), this stops my garage door from otherwise operating almost perfectly.
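
To illustrate what I think is happening, here is a minimal sketch in Java (not the binding’s actual code; the 3 attempts match what I see in the log, and the ~3% ACK success rate is my assumption based on the 97% timeout figure mentioned earlier):

import java.util.concurrent.ThreadLocalRandom;

// Sketch only, not the binding's real code: the controller re-sends the same
// SWITCH_BINARY_SET when no ACK comes back, so a node that receives every
// transmission (but whose ACKs get lost) pulses once per attempt.
public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;           // matches the 3 SETs seen per toggle

    public static void main(String[] args) {
        int pulses = 0;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            pulses++;                            // controller -> node works, so the relay pulses
            System.out.println("attempt " + attempt + ": node received SET, relay pulses");

            // node -> controller is broken, so the ACK (almost) never arrives
            boolean ackReceived = ThreadLocalRandom.current().nextDouble() < 0.03;
            if (ackReceived) {
                System.out.println("ACK received, no retry needed");
                break;
            }
            System.out.println(attempt < MAX_ATTEMPTS ? "no ACK, retrying" : "no ACK, giving up");
        }
        System.out.println("one toggle resulted in " + pulses + " relay pulse(s)");
    }
}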

Do you agree with this analysis, and if so: do you have a solution for this? (Can I disable the retries for this node, for example?)

Full log of three testruns here: https://pastebin.com/neiLd6Aq. All runs exhibited the same behaviour from the node/door.

Thanks for looking into this.

PS. I am obviously also going to improve my mesh coverage, but IMO that shouldn’t stop us from making corner cases that could work, actually work.

This is normal. If there is no response from the device, the binding will send a retry.

Yes - I agree, but have no solution. As you say, there is no way to know whether the command or the response failed - all the binding knows is that the transaction failed, so it retries. There is nothing else it can do.

If your device is sufficiently close, then the other conclusion here is the device is faulty.

I’m open to suggestions, but I don’t really see any option here. If your system is not working properly, the device is basically unusable as you can not communicate with it.

Well, I think what the binding could do is not retry in such situations. Not retrying would make my device work correctly, since in my situation the communication from controller to device (almost) never fails. Only the communication in the other direction is problematic, and strictly speaking I don’t need that to control the door.

So to achieve this, the retry count should be made dynamic instead of static (always 3). Probably the easy fix would be to make the number of retries configurable (per node). If that were made to happen, I would set it to 0 for this node and that would fix it. The hard fix is to figure out some criteria that could be used to automatically adjust the retry count for “weak TX nodes” - nodes that receive commands correctly, but struggle to TX the ACK back.
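
Purely as an illustration of the kind of per-node override I mean (this option does not exist in the binding today; all names and values here are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-node override suggested above; no such option
// exists in the binding today, and all names/values here are made up.
public class PerNodeRetrySketch {
    static final int DEFAULT_RETRIES = 3;                        // today's fixed behaviour
    static final Map<Integer, Integer> maxRetriesByNode = new HashMap<>();

    static int retriesFor(int nodeId) {
        return maxRetriesByNode.getOrDefault(nodeId, DEFAULT_RETRIES);
    }

    public static void main(String[] args) {
        maxRetriesByNode.put(12, 0);                             // "weak TX" garage-door node: never retry
        System.out.println("node 12 -> " + retriesFor(12) + " retries (one SET per toggle)");
        System.out.println("node  5 -> " + retriesFor(5) + " retries (default behaviour)");
    }
}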

What situations exactly - do you just mean not retry ever? As I said earlier, there is no way to detect this - you have a system that is basically not usable and really you need to resolve this. (sorry :wink: ).

I don’t see the need to add complexity to the binding for a case that needs to be fixed. The system requires 2-way communications or it starts to break down. The controller will (by itself) also mark this device as failed after a while, and then it may also stop communicating.

I’d really suggest to resolve your network issues so that you have 2 way communications to all nodes.

In situations with nodes whose TX regularly fails to reach the controller, while the node’s RX works 99.9% of the time.

No need to detect it programmatically if it is user configurable. I can detect it with my eyes, by seeing the node/door switching 3 times when it should only be switching once - every time I toggle this switch.

I understand, but ironically, in my case it has already broken down, precisely because the system assumes working 2-way comms in all cases :wink: I think the goal of having working 2-way comms is laudable, but in reality we have to deal with a lot of failures in 2-way comms, which to some extent could easily be dealt with in software. And I think this is one of them (but obviously I’m biased :slight_smile: )

By the way, one could also make an argument for the opposite case: a node may need more than 3 retries to operate reliably.

This hasn’t happened yet in 3 weeks, so I guess the percentage of comms from node to controller that does get through is sufficient to keep it alive. So the controller allows it to live, but OH can’t control it correctly at the moment…