Zwave stalls at polling

Hi,
I am struggling with zwave, seems that my installation freezes on polling due to unfinished initialization. Stays like this for over an hour.

OH is in latest stable release, with zwave binding from the same source.

30-Dec-2019 19:14:16.285 [DEBUG] [wave.handler.ZWaveSerialHandler$ZWaveReceiveThread] - Receive Message = 01 09 01 41 53 9C 00 04 12 02 6D 
30-Dec-2019 19:15:08.842 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 71: Polling...
30-Dec-2019 19:15:08.843 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 71: Polling deferred until initialisation complete
30-Dec-2019 19:15:28.596 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 37: Polling...
30-Dec-2019 19:15:28.596 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 37: Polling deferred until initialisation complete
30-Dec-2019 19:15:35.879 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 107: Polling...
30-Dec-2019 19:15:35.879 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 107: Polling deferred until initialisation complete
.....
.....
30-Dec-2019 20:44:26.661 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling initialised at 1800 seconds - start in 50 milliseconds.
30-Dec-2019 20:44:26.711 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling...
30-Dec-2019 20:44:26.712 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling deferred until initialisation complete
30-Dec-2019 20:44:33.634 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Command received zwave:device:controller:node24:switch_dimmer1 --> REFRESH [RefreshType]
30-Dec-2019 20:44:33.634 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling initialised at 1800 seconds - start in 50 milliseconds.
30-Dec-2019 20:44:33.684 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling...
30-Dec-2019 20:44:33.685 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling deferred until initialisation complete

How can I correct it? Restart helps, but later the same problem occures.

Thanks in advance.

Jj

I suspect those are battery operated devices that have not been awake long enough to be fully discovered by the openHAB binding.
You need to wake them up numerous times hen first discovered until OH gets all the needed information.

Unfortunately not. Node 24 is Fibaro Dimmer 2, powered from the 240V…

OK, sorry.
Have you tried turning on debug logging? It is on the documentation under a heading for when things do not go as planned. Posting logs here can help others determine what is happening.

Zwave binding log with verbosity DEBUG is ON, I’ve pasted the tail of this log in my first post. Last night I had several restarts, maybe the watchdog is trigger happy (restarts OH after 3 minutes of Zwave debug log inactivity). I have changed the watchdog timeout to 10 minutes, as I see no apparent reason in the log for the stall.

30-Dec-2019 22:28:38.369 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 71: Polling initialised at 1800 seconds - start in 50 milliseconds.
30-Dec-2019 22:28:56.300 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Command received zwave:device:controller:node24:switch_dimmer1 --> REFRESH [Refre$
30-Dec-2019 22:28:56.301 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 24: Polling initialised at 1800 seconds - start in 50 milliseconds.
30-Dec-2019 22:28:56.304 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 71: Command received zwave:device:controller:node71:switch_dimmer5 --> REFRESH [Refre$
30-Dec-2019 22:28:56.304 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 71: Polling initialised at 1800 seconds - start in 50 milliseconds.
30-Dec-2019 22:29:12.259 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 108: Command received zwave:device:controller:node108:switch_dimmer5 --> OFF [OnOffTy$
30-Dec-2019 22:29:12.260 [DEBUG] [col.commandclass.ZWaveMultiLevelSwitchCommandClass] - NODE 108: Creating new message for command SWITCH_MULTILEVEL_SET
30-Dec-2019 22:29:12.260 [DEBUG] [.openhab.binding.zwave.internal.protocol.ZWaveNode] - NODE 108: Encapsulating message, instance / endpoint 5
30-Dec-2019 22:29:12.260 [DEBUG] [otocol.commandclass.ZWaveMultiInstanceCommandClass] - NODE 108: Creating new message for command MULTI_CHANNEL_ENCAP endpoint 5
30-Dec-2019 22:29:12.260 [DEBUG] [.openhab.binding.zwave.internal.protocol.ZWaveNode] - NODE 108: SECURITY not supported
30-Dec-2019 22:29:12.260 [DEBUG] [.openhab.binding.zwave.internal.protocol.ZWaveNode] - NODE 108: Command Class COMMAND_CLASS_MULTI_CHANNEL is NOT required to be secured
----->>>>>> Here the watchdog reset sequence was fired
30-Dec-2019 22:33:03.711 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 42: Handler disposed. Unregistering listener.
30-Dec-2019 22:33:03.723 [DEBUG] [ternal.protocol.initialization.ZWaveNodeSerializer] - NODE 42: Serializing to file /var/lib/openhab2/zwave/network_efe59261__node_42.xml
30-Dec-2019 22:33:03.739 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 21: Handler disposed. Unregistering listener.
30-Dec-2019 22:33:03.743 [DEBUG] [ternal.protocol.initialization.ZWaveNodeSerializer] - NODE 21: Serializing to file /var/lib/openhab2/zwave/network_efe59261__node_21.xml
30-Dec-2019 22:33:03.757 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 93: Handler disposed. Unregistering listener.
30-Dec-2019 22:33:03.763 [DEBUG] [ternal.protocol.initialization.ZWaveNodeSerializer] - NODE 93: Serializing to file /var/lib/openhab2/zwave/network_efe59261__node_93.xml
30-Dec-2019 22:33:03.774 [DEBUG] [rg.openhab.binding.zwave.handler.ZWaveThingHandler] - NODE 38: Handler disposed. Unregistering listener.

This message simply means that the device has not initialised (for whatever reason). If it’s a mains device, then maybe it’s not communicating with the binding. It’s hard to tell without a debug log - if you provide one I will take a look to see if I can see what’s up.

Thanks Chris,
let me investigate it further, as last stalls were at different conditions. I do not want to put a full log here, as with debug on the file is really large (+100M).

A 10 or 20MB log file should hopefully cover enough to see what’s happening, but I can simply slice larger files into more manageable blocks if needed.

I can cut the last 20MB of the log, but first let me confirm that this is not an issue with the watchdog kicking in prematurely.

How would you like me to post the log? Zipped, ss an attachment to the post?

That should work.

What watchdog is this you refer to? I don’t think there is one in the binding.

I made a simple script to monitor the OpenHab and restart it when zwave stalls.

Enough for this year, let’s see how it will run in 2020 …

Hmmm - ZWave should not “stall”. Restarting the system is probably a bad idea - how are you detecting this “stall”?

1 Like

I have a zwave network with over 40 nodes. I have noticed that from time to time it becomes unresponsive - inputs via OH interface are not executed by things.
I developed a simple watchdog script. It monitors number of frames at z-stick controller and the timestamp of the zwave log. If either of them changes, the watchdog is reset. If there is no activity for 30 minutes, I stop OH, rename the log files using the name generated using the current timestamp and start over with the clean log files. Then at the tail of the file I suppose to have some evidence of the problem cause.

I am leaving for short holidays, so my OH will be running with the DEBUG on (I’ve checked, there is enough disk space) during this time. I will continue the investigation upon my return.

All the best for the New Year.

Dear Chris,
would you be so kind and have a look at the attached log file from today. It’s from the whole day, no-one was home, so the OH was not busy.

Between 8:15 and 8:20 the frame count at the z-wave stick stopped increasing (I know it from openhab.log, where I check this value every five minutes).
When I came home around 14:47 zwave items did not react to the commands from OH. At 19:35 I stopped OH.
zwave.log
Any hints how to proceed?
Thanks in advace,
Jj

I think some of your devices are not configured well. I see a number of them sending multiple commands - this could mean that you have configured multiple association groups, or it could also mean that everything is configured ok, but there is a device in the network that is causing retries -:

There are two issues here that you should investigate. Firstly, we see the multiple commands - normally 3 in very quick succession. Secondly, they are updating every 30 seconds - given the first problem with everything being sent 3 times, this creates a lot of network traffic. You should configure the system to reduce the amount of messages sent. Here we see that the value is always the same, and there is normally the ability to only send updates when there is a significant change.

I see this on other nodes as well. The issue here is that you have a large network, and a LOT of data being sent that is really not required. This is likely swamping the controller - I see CAN messages, and messages being rejected when sending, which normally mean the controller is overloaded processing messages.

If I look at what happens when your system dies, we get a lot of messages from node 71 and 42 - these all come in very fast

Immediately after this we start seeing a lot of errors being returned from the controller when the binding tries to send a command -:

These are all bad things that you should look at resolving :wink:

However, the final problem is that there is then an error in the serial port and after this it will not be possible for the binding to work -:

06-Jan-2020 08:12:06.442 [WARN ] [g.openhab.binding.zwave.handler.ZWaveSerialHandler] - Got I/O exception Input/output error in writeArray during sending. exiting thread.

I believe that this error comes from the underlying serial driver, but I don’t know what’s causing it as it comes from the OS or hardware I think.

If you use the latest binding, it will restart automatically, however you really should look to reduce your network traffic as this may be contributing to these issues.

One point -:

Wow - really? You have a VERY heavily loaded system and the ZWave network is pretty busy. You really should look over this and reduce the traffic (as already said :wink: ). In this 30 minute log, there are around 4200 messages, so that is 2 to 3 ZWave messages received every second on average - many of these are duplicates and I suspect a very high percentage are unnecessary.

I hope this helps.

1 Like

Thank you very much. I will have a closer look at the devices and will get back to you with the findings.

I’ve had a look at the nodes you had mentioned.
71 is Fibaro RGBW controller, I have a few of the in the network. This one have 6 channels in use (4 brightness levels, total brightness and scene number). The power channels are not in use. I cannot find any associations nor a field to setup refresh rates/thresholds. I have also checked the rules that are linked with this node and enabled extra logging when they are triggered. Just to double check if they are not causing the issue. None of them is linked to the power or sensor power - they are simple “switch light on in the evening” type…
I also checked the description of the RGBW controller, by default it sends the power report every 30 seconds. I cannot find the way to check or change this setting using HABMin. Maybe I messed something up with the previous HABmin (on OH 1.X) in the past.
Node 42 is PAN 04 dual switch. I quickly examined the settings, W meter reporting is set to 720 (x5 sec = 1 hour which is the factory default), same for kWh meter (10 -> 1 hour, default), association groups for Relay 1+2 , Relay 1 and Relay 2 are set to controller.

I am really puzzled where else I could look for parameters to limit unnecessary network traffic…
I will try to factory reset and reinclude node 42 and 71, but I would prefer to find a root cause before doing it.

There were dead nodes in the network. When I deleted them, everything works fine. Problem solved. Thanks for help.

1 Like