[SOLVED] Z-Wave unreliable in 2.5.0.M4

I have the same USB key … and healing at 2 AM … updated yesterday so I’ve already passed this point

There are three current release versions of openHAB.

The stable release, normally released every 6 months, is currently 2.4, released late last year.

The Testing releases, also known as Milestone releases, are released every month. 2.5M4, or 2.5 Milestone 4, is the current version.

The Unstable, or Snapshot releases are built & released every night and are only designed for developers to test all pieces working together. Snapshot releases are not expected to totally work or be stable.

OK folks… before we all grab our pitchforks and start calling this a bug… these are common traits of a failed node. Sometimes when things have been working well for a long time and we upgrade and things suddenly break, it is easy to blame the upgrade itself when in fact the cause is a low battery, a node slightly out of range, or some other edge case. In a system with a lot of nodes, it could have gone unnoticed previously.

I have not finished my investigations because of time constraints. I’ll post the results here as soon as I find the exact reason.

The issues (#1195, #1178) you reported on GitHub are still open.
If it turns out that they still exist in M2, M3, M4 and the snapshots, why have they not been resolved so far, or why was it decided to publish the milestones anyway?

I did not report the first one, and it was a known issue. The second one was mine. My workaround was to disable the daily network heal and restart OH. I can still heal individual devices if needed.

It’s a rare issue, possibly related to large networks (>120 devices). Chris is extremely busy, and with all the changes this year, the IDE is not what it was, especially when it comes to debugging.

I’m not the right person to ask, but the 2.5M2 preparation topic may shed some light on that.

I only have 10 nodes (1 controller, 1 repeater, 5 FLiRS and 3 “non listening” nodes, i.e. 3 motion sensors). Only the last 3 cause trouble after the “daily healing”.

The workaround to both issues is easy. Just disable the nightly heal. My network of about 100 nodes has been running fine without the nightly heal for a couple months.

Did you do a manual heal for each node? How often? How do you know if it’s necessary? Do you restart the OH server each time?

I’d do that if necessary, but I haven’t needed to do it yet. My network is pretty stable. If I add some new nodes I might need to do it. If I see a node starting to become flaky, I might try running a heal on that node. But I haven’t needed to do that either.

I should note that after disabling the nightly heal, a restart of openHAB is advised just in case the binding is in a weird state due to a previously run heal.

@mhilbush

One last question.

Suppose you have a nightly heal at 2:00 A.M. and, within the next 24 hours, all the nodes have been healed except for one or two. Should you then do a manual heal of the missing nodes until it is complete, then disable the nightly heal, and then restart OH?

In my analysis, if a heal doesn’t complete in 24 hours, it will never complete. In this scenario, I’ve seen the binding get stuck in the “initialize/heal” state for that node. Once in this state, it will never get out of that state until the binding is restarted. Once stuck in this state, the device will never be polled again until you do a restart. That’s why I disable the nightly heal, then do an openHAB restart.

This is just my experience, of course. OTOH, it’s backed by log files that show this behavior.
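
For anyone who wants to check for nodes that have gone quiet after a heal, here is a minimal sketch of the idea. It assumes you can get a "last heard from" timestamp per node from somewhere (logs, an item, whatever you have). The function name `find_stale_nodes`, the node IDs and the timestamps are all made up for illustration and are not part of the binding. For battery devices the threshold would need to be longer than the device's wakeup interval, otherwise healthy sleeping nodes get flagged too.

```python
from datetime import datetime, timedelta


def find_stale_nodes(last_seen, now, max_silence=timedelta(hours=24)):
    """Return the sorted node IDs not heard from for longer than max_silence.

    last_seen maps a node ID to the datetime of its last received message.
    This is a hypothetical helper, not an openHAB or Z-Wave binding API.
    """
    return sorted(
        node for node, seen in last_seen.items() if now - seen > max_silence
    )


# Hypothetical example data: node 2 is fine, nodes 7 and 9 have gone quiet.
now = datetime(2019, 10, 1, 2, 0)
last_seen = {
    2: now - timedelta(minutes=5),
    7: now - timedelta(hours=30),
    9: now - timedelta(hours=49),
}

stale = find_stale_nodes(last_seen, now)
print(stale)
```

A node showing up in this list does not prove it is stuck in the heal state; it only tells you which nodes are worth a closer look in the log.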

With a battery-operated sensor, I found that removing & replacing the battery & making sure it was awake brought it fully back online.

@mhilbush @Bruce_Osborne

In my case, I can do a “manual heal” by waking the node up many times (triple-pressing the button), without restarting OH!

EDIT: I have to check if 1x pressing has the same effect. (–>heal successful)

If a node is hung trying to heal, that may not work.
Verify how to wake up your device. On some devices, triple-clicking is used for inclusion/exclusion mode.

But then I would have had many ghost nodes, and I have not. And the controller is NOT in inclusion/exclusion mode during a heal, is it?

But you are right. 3x press (include/exclude); 1x press “wake up” :open_mouth: :open_mouth:

The mode likely times out if the controller is not in inclusion or exclusion mode.
Some of my devices take a single press & I have one where you are supposed to press & hold for at least 5 seconds to wake up.

Hmm. I’m not sure why cycling the power and waking up the device fixes the issue completely. The primary symptom of the issue is that the initialization thread (which is used for device initialization, as well as heal), never completes. While it might’ve shown the node with a thing status of ONLINE, how do you know the binding killed the node initialization thread?

In the issue I opened, and as documented in the log file, once in this state, wake-ups have no effect on resolving the issue. I’ve found nothing short of an OH restart will fix the issue. Even a binding restart won’t fix it, because the binding never kills the orphaned initialization threads. This behavior can be demonstrated by looking at the zwave threads using the Karaf console. I’ve observed numerous node initialization threads even after a binding restart.

Edit: I should be more clear: after stopping the binding, the init threads are still there.

See discussion here.