Z-wave things lose their Lifeline Association Group

Landstad · October 10, 2018, 5:34pm

Update:

For now most things are working OK, although my log is filled with duplicates from one node. Hopefully it will stay as stable as it is now.

Landstad · October 14, 2018, 9:16am

Update:

Everything worked for a couple of days. Then I had to move the RPi, so I shut it down and made a backup image with win32diskimager (as it seemed to be working perfectly). Now some things have lost their lifeline again (not only in the GUI).

I waited 15 hours and it never got the lifeline. I then tried adding the lifeline (through habmin) to one thing. I have now waited two hours and it’s still not working. There is no new version through openhabian-config yet, so I will update again when a new version has been installed and tested.

chris · October 14, 2018, 9:24am

If you have the logs showing when the lifeline was lost, I would like to see this. Otherwise, I don’t think there is anything I can do - sorry.

I don’t think there are any changes in the pipeline, so a new version won’t change this. Really, I’d need to see a log capturing what is happening in order to resolve it.

Landstad · October 15, 2018, 8:28am

I was referring to this. I did not see the post from sihu about being stuck on milestone builds, so I am still on M4.

Might M5 / newest snapshot provide better handling and maybe a fix?

If I get the lifelines back (I haven’t as of yet…) I won’t make them go away again just in order to make logs as it is quite cumbersome getting them back(!). But I will turn on zwave debug logging the next time I upgrade in case the lifelines get nulled again then (which I suspect they will).

chris · October 15, 2018, 8:33am

I think the latest snapshot has this included - there is no M5 build yet as far as I know.

Sure - this is the best approach unless you know exactly how to reproduce the issue, which I don’t think is the case or we would be having a different conversation. Thanks.

Landstad · October 16, 2018, 1:10pm

My theory is that they disappear when initializing after a reboot (and maybe when restarting services), but not always I think.

And how I get them back is another story: Trying 100 times, sometimes exclude/include/restart services/add lifeline/remove “pending” lifeline/adding again a few times and suddenly they’re there.

Maybe the queues you were writing about are very long and seldom popped from or something.

Anyway: Here is the full log after installing the latest snapshot (I started the install at approx. 14:47, so the interesting log items are probably a bit later). All lifelines I checked (for Fibaro Dimmer 2 nodes) were gone after the update, for example for node 6.

Log here

chris · October 16, 2018, 1:14pm

Unless the device is being reinitialised, the associations should not be changed. Anyway, I will take a look at the log - thanks.

chris · October 16, 2018, 1:15pm

I had a look at the log, and there are no changes to the lifeline, or any other associations, in this log for node 6.

Landstad · October 16, 2018, 1:20pm

Yea, me neither. Maybe they somehow got lost some time between yesterday night (last time I know they were working) and when I did the update. That’s very unlucky…

Some working associations look like this in habmin:

and this in paperui:

The nodes I can choose between when altering look nothing like that:

Could this be an indication of what might be wrong?

chris · October 16, 2018, 1:23pm

What the UI shows should not change anything in the device - something must have sent the commands to the device to reconfigure it and this is what we need to find. As we see in the log, the binding doesn’t change configuration of the device during initialisation, unless the device is reinitialised somehow.

If you have the log from yesterday, then it would be worth looking through to see if the associations are changed - you should be able to simply search for the word association in the first instance (hopefully it doesn’t show many from other bindings etc).
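For anyone who would rather script that search than scroll through the file, here is a minimal Python sketch. The function names are made up for illustration, the `NODE n:` prefix is taken from the debug lines posted in this thread, and the path in the docstring is the one shown in the tail output (/var/log/openhab2/openhab.log). It assumes the relevant messages contain the word "association" somewhere, which is only a starting point and not a guarantee about how the binding words its log lines.

```python
import re
from typing import Iterable, Iterator, Optional

# "NODE 6:" style prefix, as seen in the zwave debug lines in this thread.
NODE_RE = re.compile(r"NODE (\d+):")


def find_association_lines(lines: Iterable[str],
                           node: Optional[int] = None) -> Iterator[str]:
    """Yield lines mentioning 'association' (case-insensitive).

    If node is given, keep only lines whose NODE prefix matches it.
    """
    for line in lines:
        if "association" not in line.lower():
            continue
        if node is not None:
            match = NODE_RE.search(line)
            if match is None or int(match.group(1)) != node:
                continue
        yield line.rstrip("\n")


def search_file(path: str, node: Optional[int] = None) -> list:
    """Convenience wrapper, e.g. search_file('/var/log/openhab2/openhab.log', 6)."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return list(find_association_lines(log, node))
```

Calling `search_file(path, 6)` would list only the hits for node 6, which keeps matches from other bindings and other nodes out of the way.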

Landstad · October 16, 2018, 1:56pm

Unfortunately, I only turn on debug logging when I know I am going to need the log for something.

I do have a few nodes that are discovered when doing a zwave discovery. When added they show as online, but no manufacturer et cetera. I guess the controller might have some old information about them (previously excluded and included as a new node).

When I try to mark the nodes as failed I get a log like this:

2018-10-16 15:51:34.730 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 2: Configuration update received

2018-10-16 15:51:34.737 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 2: Configuration update set action_failed to true (Boolean)

2018-10-16 15:51:34.741 [DEBUG] [message.RemoveFailedNodeMessageClass] - NODE 2: Marking node as having failed.

2018-10-16 15:51:34.745 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 255: Added 462 to queue - size 1

2018-10-16 15:51:34.748 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Transaction SendNextMessage 0 out at start. Holdoff false.

2018-10-16 15:51:34.752 [DEBUG] [wave.internal.protocol.SerialMessage] - Assembled message buffer = 01 06 00 61 02 01 86 1D 

2018-10-16 15:51:34.755 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - NODE 255: Sending REQUEST Message = 01 06 00 61 02 01 86 1D 

2018-10-16 15:51:34.759 [DEBUG] [ing.zwave.handler.ZWaveSerialHandler] - Message SENT

2018-10-16 15:51:34.762 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Transaction SendNextMessage started: TID 462: [WAIT_RESPONSE] priority=Controller, requiresResponse=true, callback: 134

2018-10-16 15:51:34.763 [DEBUG] [WaveSerialHandler$ZWaveReceiveThread] - Receive Message = 06

2018-10-16 15:51:34.766 [DEBUG] [nal.protocol.ZWaveTransactionManager] - processReceiveMessage input 0<>128 : Message: class=null[0], type=ACK[2], dest=255, callback=0, payload=

2018-10-16 15:51:34.769 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Received msg (0): Message: class=null[0], type=ACK[2], dest=255, callback=0, payload=

2018-10-16 15:51:34.773 [DEBUG] [nal.protocol.ZWaveTransactionManager] - lastTransaction TID 462: [WAIT_RESPONSE] priority=Controller, requiresResponse=true, callback: 134

==> /var/log/openhab2/events.log <==

2018-10-16 15:51:34.775 [me.event.ThingUpdatedEvent] - Thing 'zwave:device:512:node2' has been updated.

==> /var/log/openhab2/openhab.log <==

2018-10-16 15:51:34.776 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Received msg: ACK

==> /var/log/openhab2/events.log <==

2018-10-16 15:51:34.780 [vent.ConfigStatusInfoEvent] - ConfigStatusInfo [configStatusMessages=[]]

==> /var/log/openhab2/openhab.log <==

2018-10-16 15:51:34.783 [DEBUG] [WaveSerialHandler$ZWaveReceiveThread] - Receive Message = 01 04 01 61 08 93 

2018-10-16 15:51:34.786 [DEBUG] [nal.protocol.ZWaveTransactionManager] - ZWaveReceiveThread queue empty

2018-10-16 15:51:34.792 [DEBUG] [nal.protocol.ZWaveTransactionManager] - processReceiveMessage input 0<>128 : Message: class=RemoveFailedNodeID[97], type=Response[1], dest=255, callback=0, payload=08 

2018-10-16 15:51:34.793 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Transaction SendNextMessage 1 out at start. Holdoff false.

2018-10-16 15:51:34.796 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Received msg (0): Message: class=RemoveFailedNodeID[97], type=Response[1], dest=255, callback=0, payload=08 

2018-10-16 15:51:34.797 [DEBUG] [nal.protocol.ZWaveTransactionManager] - lastTransaction TID 462: [WAIT_RESPONSE] priority=Controller, requiresResponse=true, callback: 134

2018-10-16 15:51:34.800 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Checking outstanding transactions: 1

2018-10-16 15:51:34.802 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Last transaction: TID 462: [WAIT_RESPONSE] priority=Controller, requiresResponse=true, callback: 134

2018-10-16 15:51:34.805 [DEBUG] [ve.internal.protocol.ZWaveController] - Incoming Message: Message: class=RemoveFailedNodeID[97], type=Response[1], dest=255, callback=0, payload=08 

2018-10-16 15:51:34.807 [DEBUG] [message.RemoveFailedNodeMessageClass] - Got RemoveFailedNode response.

2018-10-16 15:51:34.809 [ERROR] [message.RemoveFailedNodeMessageClass] - NODE 2: Remove failed node failed as node not found

2018-10-16 15:51:34.811 [DEBUG] [e.internal.protocol.ZWaveTransaction] - TID 462: Transaction COMPLETED

2018-10-16 15:51:34.815 [DEBUG] [ding.zwave.handler.ZWaveThingHandler] - NODE 2: Got an event from Z-Wave network: ZWaveNetworkEvent

==> /var/log/openhab2/events.log <==

2018-10-16 15:51:34.816 [arthome.event.BindingEvent] - org.openhab.binding.zwave.event.BindingEvent@f3173

==> /var/log/openhab2/openhab.log <==

2018-10-16 15:51:34.821 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 255: Response processed after 59ms

2018-10-16 15:51:34.823 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 255: TID 462: Transaction completed

2018-10-16 15:51:34.825 [DEBUG] [nal.protocol.ZWaveTransactionManager] - NODE 255: notifyTransactionResponse TID:462 DONE

2018-10-16 15:51:34.828 [DEBUG] [nal.protocol.ZWaveTransactionManager] - ZWaveReceiveThread queue empty

2018-10-16 15:51:34.830 [DEBUG] [nal.protocol.ZWaveTransactionManager] - Transaction SendNextMessage 0 out at start. Holdoff false.

Might this have anything to do with this? The (excluded/included) node may or may not be one of the problem nodes. Any tips on how to get the dummy nodes removed so that they are not found during discovery?
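As an aside for reading the raw bytes in the log above: the two frames can be checked by hand. Below is a small Python sketch that splits them into fields, assuming the usual Z-Wave serial frame layout (start byte 0x01, length, type, function ID, payload, checksum, where the checksum is 0xFF XORed with every byte from the length byte through the last payload byte). The class and function names are invented for this example and are not taken from the binding. Both frames from the log pass the checksum test under that assumption.

```python
from dataclasses import dataclass

SOF = 0x01  # first byte of both frames in the log


@dataclass
class Frame:
    length: int
    frame_type: int  # 0x00 on the REQUEST line, 0x01 on the Response line
    function: int    # 0x61 == 97, logged as RemoveFailedNodeID[97]
    payload: bytes
    checksum: int


def checksum(body: bytes) -> int:
    """XOR 0xFF with every byte from the length byte to the last payload byte."""
    value = 0xFF
    for b in body:
        value ^= b
    return value


def parse_frame(hex_string: str) -> Frame:
    """Parse a space-separated hex dump such as '01 04 01 61 08 93'."""
    raw = bytes.fromhex(hex_string)
    if len(raw) < 5 or raw[0] != SOF:
        raise ValueError("not a data frame")
    if raw[1] != len(raw) - 2:
        raise ValueError("length byte does not match frame size")
    if checksum(raw[1:-1]) != raw[-1]:
        raise ValueError("checksum mismatch")
    return Frame(raw[1], raw[2], raw[3], raw[4:-1], raw[-1])


request = parse_frame("01 06 00 61 02 01 86 1D")
response = parse_frame("01 04 01 61 08 93")
print(request.function, request.payload.hex(" "))    # 97 02 01 86
print(response.function, response.payload.hex(" "))  # 97 08
```

The last request payload byte, 0x86, is 134 in decimal, which lines up with "callback: 134" in the transaction lines, and the single response byte 08 is what the binding reports as "node not found". How the remaining payload bytes are meant to be read is not something this sketch tries to answer.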

chris · October 16, 2018, 4:57pm

Given that we don’t know what is causing the loss of the associations, how do you know when you need to log the data? I would recommend keeping debug logging enabled for now so we can try and find the cause of the issue.

The controller doesn’t store any information about devices (well, only a very small amount, and not information such as the manufacturer data).

I’ll need to look at this further…

Landstad · October 17, 2018, 8:32am

I thought I knew - previously it has seemed like it happened after a reboot or restart services.

I don’t really want to have debugging enabled all the time because I am running from an SD-card and am worried about the SD card silently breaking. Which it might anyway I guess.

I got everything up and running again yesterday (I think). For one Fibaro Dimmer 2 I had used the switch_dimmer1 channel (and I believe that worked at some point after upgrading to 2.4), but in the logs I saw that the communication came on switch_dimmer, and switching to this worked. So here the lifeline was not lost; it was just not shown in the UI and the items were bound to the wrong channels.

For two Fibaro relays I also had to switch - now to switch_binary1. I believe both communicated on another channel on the milestone build (but I might be mistaken although I don’t think so). I don’t have a full understanding of when the (channel)1 is used vs just (channel) for the two-channel things I have that are only using one channel. Maybe it depends on which wiring diagram is used?

But! I have definitely had a few false positives for disappearing lifelines:

1. The lifeline is just not shown in the GUI, but is actually there and everything is in fact working as it should.
2. Me saving the GUI with wrong information after changing another parameter in paperUI, which deleted the association groups because they were not shown in the GUI.
3. OpenHAB switching which channels to communicate on from 2.3 to 2.4 - the lifeline was not lost, but in combination with #1 and not looking thoroughly at the logs I thought it was.

Still, more than one lifeline has definitely disappeared after a regular reboot at least once (I shut down in order to take a backup when everything was working well; after mounting and booting again, some things were not communicating).

MAYBE this was because I had deleted association groups as per #2 above, and the messages were far down in a queue and by chance were popped just after the reboot and not before. That would indicate that the queues were not actually being popped at all for a while, as the system had been working well for approx. 24 hours at that point.

Now everything is working as it should again. Thanks for your follow up @chris

Can I help in any way?

dastrix80 · October 19, 2018, 9:19am

So my devices have lost it again… one in particular

I know this happens because all of a sudden my motion detection fails.

Here is the best I can do for a debug log, from when I updated the lifeline.

Rename the file to txt

zwave.xml (33.0 KB)

chris · October 19, 2018, 9:22am

Please can you provide the log showing what happened? Without this, I can’t do anything, and I’m not sure what you are expecting?

dastrix80 · October 19, 2018, 9:24am

I attached the debug?

chris · October 19, 2018, 9:30am

Ah - ok, thanks. I didn’t see that there when I first replied…

Landstad · October 19, 2018, 9:38am

I don’t have any logging unfortunately, but I lost a lifeline for one device yesterday as well. It was very easy to get it back using habmin this time though (maybe the new version with better queue-priority is working well?).

I know this isn’t very helpful without a log, and as soon as I find the time to switch to running from an SSD-disk I will turn on debug log in order to capture the bug. But I won’t do that to the SD-card.

chris · October 19, 2018, 10:21am

The log looks like it has useful information so I will take a look at this deeper over the weekend - thanks.

dastrix80 · October 19, 2018, 10:47am

Great news! Thanks Chris