Multiple Fibaro Wall Plugs FGWP102 keep getting excluded from the network unexpectedly

robbert · February 4, 2020, 7:03pm

Hi,

OH: 2.5.1-2 (from milestone repo), Aeotec Z-Stick Gen 5, Fibaro FGWP102, firmware 3.2.

I own a Fibaro wall plug that keeps going offline at unexpected moments. What it takes to revive it varies: sometimes an unplug/replug is enough, sometimes it needs to be reinitialized, and sometimes it has to be excluded and re-included.

Here are debug logs of one such event: https://privatebin.net/?ed558c2693173958#8PtEcNWP45w2bT8FFeAiD1VMYGhLC5bUMBPyjcWN41i5

It’s node 16. In this particular case I sent an OFF command to the wall plug at 21:21:27, after which the switch died. I tried some stuff: reinitializing, unplugging/plugging, but that did not bring it back to life.

Does anyone know why this happens and if this behaviour is fixable?

Thanks in advance.

Bruce_Osborne · February 4, 2020, 7:09pm

I will try and look at this after work. The log viewer may give some clues.

https://www.cd-jackson.com/index.php/openhab/zwave-log-viewer

robbert · February 4, 2020, 7:49pm

Hi Bruce,

Thanks. I already loaded it into the log viewer, but am having trouble interpreting what’s wrong.

I tried excluding the device. I put the Z-Stick in exclusion mode, triple clicked the button on node 16. Nothing happened at all. Tried it a few times obviously.

Anyway, since the device wasn’t responding to anything any more, I performed a device reset using the procedure at [1]. After successful reset, I tried reincluding a couple of times: still nothing.

Finally, I tried erasing the device’s energy consumption memory. Surprisingly, after this I was able to include the device again.

So, this device definitely ended up in a very weird state somehow. I am starting to blame the hardware more and more.

[1] From the manual:

Menu allows to perform Z-Wave network actions. In order to use the menu:

1. Press and hold the B-button.
2. Wait for the device to indicate desired position with a colour:
   • GREEN - erase energy consumption memory
   • VIOLET - Z-Wave network’s range test
   • YELLOW - device reset
3. Release the B-button.
4. Click the B-button to confirm selection.

robbert · February 15, 2020, 9:07am

So this happened again, but with a different plug (on the same spot as the previous one).

After the previous wall plug failure 12 days ago, I changed the wall plug at this location for another one. Same model and firmware. Today the new one dropped offline just after switching it off (similarly to the previous failure).

EDIT: This is with ZWave binding 2.5.2.202002091130.
EDIT: After the issue occurred, the wall plug was signaling RED, meaning it was no longer included.
EDIT: Steps taken to fix: 1. erase device’s energy consumption memory 2. reinclude device 3. remove ghost device from controller 4. heal network.

Any idea what’s happening here? (node 22 in this log)

robmac · February 15, 2020, 8:28pm

It would be good to see the other traffic but if those big 7s response times are just after a load of unsolicited reports then could be overload. 0.7s indicates a bit of an issue at that time.

If not is there a reliable set of routes?

You have node 22 set to defaults so you are getting a load of reports from that. If you have all other nodes at similar settings and more than a few nodes it will be an issue.

Alternatively, do you have another node that has a big issue? It would be odd if two devices were faulty.

Can only see this with a lot more log for a good few hours.
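For a quick first pass over a long log before opening the log viewer, here is a minimal Python sketch that tallies log lines per node so that a chatty node stands out. It assumes each relevant line contains the text `NODE <id>:`. The function name and the sample lines are made up for illustration and are not real binding output.

```python
import re
from collections import Counter

# Assumption: lines of interest contain "NODE <id>:" somewhere in the text.
NODE_RE = re.compile(r"\bNODE (\d+):")

def count_lines_per_node(lines):
    """Return a Counter mapping node id -> number of log lines mentioning it."""
    counts = Counter()
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Placeholder lines, NOT real log output; only the "NODE <id>:" part matters.
sample = [
    "12:00:01 NODE 22: placeholder report",
    "12:00:02 NODE 22: placeholder report",
    "12:00:03 NODE 16: placeholder command",
    "12:00:04 NODE 22: placeholder report",
    "12:00:05 line without a node id",
]

# Busiest nodes first.
for node, count in count_lines_per_node(sample).most_common():
    print(f"NODE {node}: {count} lines")
```

To run it on a real file, pass the open file object instead of `sample`, e.g. `count_lines_per_node(open("openhab.log"))` (the file name is an assumption). A raw line count is only a hint: it does not distinguish unsolicited reports from polling or retries.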

chris · February 15, 2020, 8:56pm

My suspicion is that there is an RF link issue - the link quality will change through the day and if this occurs over a poor link, then there will be timeouts.

These are not “response times” since there are no responses. There is no link.

robmac · February 15, 2020, 9:02pm

who knows from that info and yes they are no acks but not going to be picky

at 20ms to get an ACK this is not routed to this node
two devices in same place same issue

If LWR fails which must be direct it will try bouncing around but then as it is a neighbour of controller as 20ms proves it will do direct again then explorer

so I suspect we need to see more info

To be honest I suspect another node is flooding and this one is the fall guy. Possibly a repeater behaving badly.

Bruce_Osborne · February 15, 2020, 9:09pm

We should see that then unless the OP has filtered the log. Filtered logs are useless.

chris · February 15, 2020, 9:11pm

You won’t necessarily see this in the logs - it depends on where in the network it’s happening. It will only be seen if the destination is the controller, and the flooding is heard by the controller.

robmac · February 15, 2020, 9:18pm

The logs are really good for many things but they can only see what the binding processes. They might hint but this is one of those things that may not show in logs but zniffer would be a big help.

robmac · February 15, 2020, 10:36pm

So original log

retry hidden in here

again direct but then 61ms

robmac · February 15, 2020, 10:40pm

Are you running Chris’s snapshot with fix for heal?

not long after reports start flooding in

then bang you send to 16 and it does not come back

Application keeps trying but reports are still coming in.

but in the end your network is full and it dies

Two things: if this is the version with Chris’s fix for heal, it is not working for you. Do not run heal.

Restart to free any threads and do not run unless you get a version.

if you still get this then you are flooding. Too many unsolicited reports are hitting or a bad boy.

Now could be a load of stuff that logs are not showing also so it could be a bad bad node that is making a lot of noise. One type of bad boy is a phantom association but there are others.

If turning your reporting down does not fix you need a zniffer.

Have you got a lot of polling still? I doubt you need a lot.

Poll after again I doubt you need it,

then reports …

Just looking back in my notes. Those fibaro can get phantom association.

Your controller and the fibaro do support explorer so LWR will fix even if nothing is good in controller table. Heal is not needed for these to fix. The location has direct RF so no hops will be LWR.

I don’t think you have phantom though. Just too much traffic and a version of binding with broken heal.

robbert · February 16, 2020, 8:59am

Hi Rob,

Thanks for your response! Let me clarify and respond.

The first occurrence of this issue that I posted a log of happened on Feb 3rd, at which time Chris’s fix for heal did not exist yet. I was running 2.5.1-2 vanilla from the milestone repo then.

The healing fix became available Feb 9th and I installed it right away and subsequently (successfully) healed all nodes in the network.

The second occurrence I posted a log of happened on Feb 15th, and at that time I was using binding version 2.5.2.202002091130, with the fix.

So, this issue persists after fixing the healing bug.

Do you mean ghost nodes? I checked for ghost nodes using both OpenHAB/Habmin as well as PC Controller, but did not find any (apart from the ones that I have to remove right after the issue happened). Any other things to check?

I will do so, but tbh I don’t see a lot of reporting in the logs (except maybe for NODE 17 when it is active). You commented in an earlier response about having my wall plug (node 22 then) at default settings. This is true, however from the log I fail to see how this is generating too much traffic. Can you clarify from what part of the logs you concluded that having the node at default settings can be problematic?

It’s on its way as we speak.

Most of my device configs are default. I checked all device configs and they all have set their “Polling period” to “1 Day”. Do I need to do more checks to determine this?

If by “phantom” you mean ghost nodes, I can’t find any and do regular checks. The binding is also fixed now. So that would only leave “too much traffic” as an option…

One more thing I was considering. After the issue occurs, the wall plug has lost its “pairing” with the controller - I mean, it is signalling RED when plugged into a socket, indicating it is NOT included in the network. At the same time the controller still thinks the node is present - I have to remove the ghost node from the controller, reinclude the node and heal the network to fix this.

So my question: seeing that the wall plugs end up in this “suddenly” excluded state, is it at all possible that the binding, the controller or network flooding is responsible for creating such a situation?

Personally, from my observations I would be more inclined to consider a faulty device (however unlikely, because they would both need to be faulty). Or maybe an electrical problem that confuses the devices?

What do you think?

Thanks again!

robbert · February 16, 2020, 9:03am

To be clear, the logs aren’t filtered.

robbert · February 16, 2020, 9:18am

Do note that IMO these do not seem to be “normal” timeouts caused by changing RF interference throughout the day. If it were “regular” interference, I would expect:

• the link to also become active again sometimes (because interference may also lessen at times)
• the device NOT to be totally dead / suddenly not included anymore
• this to happen randomly, and not necessarily right after a switch toggle (as in both of these occurrences).

IMO these timeouts are a result of the device dropping dead all of a sudden - obviously we have a link issue then. Also, this isn’t normally a device with a poor link. It’s centrally located in my network, usually has fast response times and has at least 6 neighbors, including the controller.

chris · February 16, 2020, 9:50am

No - associations that don’t exist. E.g. if you configure an association between device 2 and 3, then remove node 3 from the network and don’t remove the association.
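To make that concrete, here is a minimal sketch in Python: given the set of nodes still included and each node's association groups, it lists the associations that point at nodes no longer in the network. The data layout and the function name are hypothetical, purely for illustration; this is not how the binding or the controller stores associations.

```python
def find_phantom_associations(included_nodes, associations):
    """List (source, group, target) triples whose target node is not included.

    associations maps source node id -> {group number: [target node ids]}.
    """
    phantoms = []
    for source, groups in associations.items():
        for group, targets in groups.items():
            for target in targets:
                if target not in included_nodes:
                    phantoms.append((source, group, target))
    return phantoms

# Node 3 was removed from the network, but node 2 still has it in group 2.
included = {1, 2}
associations = {2: {1: [1], 2: [3]}}

print(find_phantom_associations(included, associations))  # [(2, 2, 3)]
```

Node 2 would keep sending to node 3, which never answers, so every such frame runs through its retries and adds traffic.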

Maybe the device has a fault and has reset. This seems most likely now. It’s always useful to provide all the information you have - if you’d mentioned this initially that the device is in a reset state, then we may have had a different prognosis.

Agreed - it may become active again later in the day or the next day (or a few minutes or seconds later - depending on the nature of the beast)

Again - this is new information. It is however still included in the network as far as the controller is concerned (I assume?) If you restart the controller, is it listed in the nodes list?

So? That is also a feature of interference.

That could also be the case and you may have more information than the rest of us (based on your comment that it is no longer included in the network).

If it were interference though, the binding will still mark it DEAD if it fails to respond after 3 retries (as we saw in your log) so that part is completely expected.
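A rough sketch of that logic in Python, to show why the log alone cannot tell the two cases apart. The names are hypothetical and the limit of 3 attempts is a simplification of "after 3 retries"; this is not the binding's actual code.

```python
def node_state_after_send(try_send, max_attempts=3):
    """Return "ALIVE" if any attempt is acknowledged, otherwise "DEAD".

    try_send is a callable that returns True when the node acknowledges.
    max_attempts=3 is an assumption made for this sketch.
    """
    for _ in range(max_attempts):
        if try_send():
            return "ALIVE"
    return "DEAD"

# A link blocked by interference and a device that has reset itself
# both simply never acknowledge, so the outcome is identical.
def interference():
    return False  # no ACK: frame lost on the air

def device_has_reset():
    return False  # no ACK: device no longer part of the network

print(node_state_after_send(interference))      # DEAD
print(node_state_after_send(device_has_reset))  # DEAD
```

Either way the controller only sees missing acknowledgements, which is why the detail about the plug signalling that it is no longer included changes the diagnosis.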

Again, you have more information than me.

Based on the little information I had when I made the comment, it looks perfectly like interference - working one minute, and not the next. If you had said that the device has reset to factory state, I would have provided a very different response. I would generally expect it to return, but if it is now excluded from the network then that is a whole different issue.

robmac · February 16, 2020, 10:03am

I think this will only be clear when you have your zniffer.

It does not take a lot of traffic to clog when you start to get failures due to the nnn retries at all levels. There are so many things. Two devices with direct access are failing, but the location could be in range of another node that they may even be repeaters for, and it is other traffic that is not getting to the controller, so it will not be in the logs. Explorers by their very nature clog networks if there are too many, so very hard.

I must admit with only 16 devices I would clear it down and start again but wait for your zniffer and watch every add. If you wait then the zniffer log will hopefully show what is up.

chris · February 16, 2020, 10:05am

Maybe, but if the device is resetting as mentioned above, then it may not show anything on the sniffer anyway.

It will be interesting to see the sniffer log, but a device shouldn’t really just reset like this.

robmac · February 16, 2020, 10:07am

and certainly two devices

If I read the post correctly, the earlier one was removed and a new one put in the same place, but the symptoms continue.

chris · February 16, 2020, 10:10am

Yep, but I’m not sure we have all the information now to say they are the same. Maybe the earlier device had a different issue and we need to be careful about jumping to assumptions based on new information that may not be applicable to any previous issue.