Zigbee devices work for some time and then suddenly stop working


Same Problem here… :frowning:

Please start a new thread. Over 90% of the time you may have the same symptoms but a different problem. It gets too confusing to have multiple users and problems in the same thread.

Ok, I will take it back

This one seems to be a potential culprit, as it replaces the serial port implementation.

The issue starts with Add methods to get supported cluster list from converters (#517) · openhab/org.openhab.binding.zigbee@f7cd109 · GitHub which bumps ZSS from 1.2.3 to 1.2.8. I will investigate further…

It’s always mvn clean install

Hehe, that’s not what I meant :slight_smile: mvn install gives me a jar which I can install in openHAB via the karaf console, but I have to install all the bundles (jars) independently and also install the correct versions of the dependent libraries, all of it manually. I was hoping there’s some magic maven plugin that would generate a meta-bundle of sorts holding all of these dependencies and make the process a bit simpler…

Sorry - I misunderstood (I thought it was a strange question :wink: ).

No - there’s not at the moment. I’ve just added creation of the feature file into the pom, so it is now created automatically, but you’d still need to load that. I just have a bash script to copy everything to the addons folder - it’s as simple as anything.


So I’ve tried different version combinations today and it looks like the issue starts with ZSS 1.2.5; 1.2.4 works fine. I’ve tested it with binding 2.5.0 + OHC serial change.
Now, looking at the commits between these two versions (there are likely not many) I can’t see anything particularly suspicious. I can’t explain why the issue exists only with CC2531 and not with Ember…
I’ll keep digging…

There were no changes at all to the TI driver between these two versions. This is the history list for the driver -:

Yeah, I realize that and believe me it did surprise me a bit. I have two theories at the moment:

  1. There are some changes between 1.2.4 and 1.2.5 that are not directly related to the driver but influence it somehow. E.g. some synchronization/locking issue was introduced that causes the thread reading from the serial port buffer to starve… Just a theory, and again it doesn’t explain why other dongles are not affected.
  2. I don’t have a clear reproduction scenario. My testing boils down to creating a fresh network, joining two routers (smart switches) and toggling on/off for a while. Usually I see the issue start when the first device that joins the network is being discovered. That said, there may be some non-determinism in the issue itself that fools me into incorrect conclusions.

What I am going to do now is build a reliable, repeatable and easy-to-set-up test scenario and then do some actual debugging on the latest version of ZSS instead of hunting for a culprit commit. It would be great to eliminate openHAB from the test setup and test just ZSS with the console, or ideally with a test app (or JUnit test) that drives ZSS directly.


I’m using the HUSBZB-1 USB Hub, purchased from Amazon.
I’m using a Raspberry Pi 4.
I have had the best results by using zigbee, zwave, and serial bindings for the usb dongle I’m using.
I’m not sure but I think that I read that 57600 baud is the default rate. I have not tried 115200 yet.

*** I first tried just the zigbee and zwave bindings, and found that the dongle was hard to pair and dropped out quite a bit… but it became much more stable after adding the serial binding.
I’m not sure if it was necessary, but it made a difference for me.

If this binding combination should not be used, please advise…

Thanks to All…

Please start a new thread for this discussion. It is off-topic for this thread and adds confusion for those trying to assist.

Got it :smiley: It was fun :slight_smile:

So it’s a deadlock introduced in ZSS between 1.2.4 and 1.2.5, specifically in the “Initial implementation of fragmentation” commit (although I haven’t explicitly tried versions with and without this commit).

The reproduction scenario is that I rapidly toggle on/off in Paper UI until the device stops responding (a few toggles are usually enough). With some trace logging enabled I can see that we receive data from the dongle but don’t parse it in time: the parser thread is clearly lagging, parsing small batches of incoming packets before freezing again for a few seconds. Here are the stack traces of all the threads at the moment the parser thread freezes.

There are two main actors to look at in that thread dump:

  • zigbee-thinghandler-commands-15 - the first thread at the top - is the thread handling the openHAB action currently being processed to toggle the OnOff device (a series of waiting threads follows it, representing the subsequent toggles)
  • ZToolPacketParser - the thread that parses packets incoming from the dongle.

So the story is as follows: when sending a frame, the call stack goes through the ApsDataEntity synchronized method send(). When receiving a frame, the parser goes through the ApsDataEntity synchronized method receive(). If we receive a frame while we’re sending one, receive() waits for send() to complete. However, send() goes on to call the CC2531 driver’s sendSynchronous, which completes only when a response is received or after a timeout. But in order to receive that response we need to parse the incoming frame with ZToolPacketParser, which is still waiting to deliver the previous packet. It does unblock after a while thanks to the timeout in sendSynchronous, but the delay means ZToolPacketParser doesn’t parse incoming bytes fast enough, and the buffer overflows.
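To make the blocking pattern easier to see, here is a minimal sketch in plain Java. It is not the ZSS code: the class SharedLockStallDemo, its latches and the confirmedInTime helper are made up for illustration, and the only thing borrowed from the description above is the shape of the problem (a synchronized send() that waits for a confirmation which can only be delivered through a synchronized receive() on the same object).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the locking pattern described above; not the actual ZSS classes. */
class SharedLockStallDemo {
    // Counted down once the "confirmation" frame has been handled by receive().
    private final CountDownLatch confirmation = new CountDownLatch(1);
    // Counted down once send() holds the monitor, so the parser thread starts at the right moment.
    private final CountDownLatch sendInProgress = new CountDownLatch(1);

    /** Holds this object's monitor while waiting, like a blocking synchronous send with a timeout. */
    synchronized boolean send(long timeoutMs) throws InterruptedException {
        sendInProgress.countDown();
        // CountDownLatch.await does NOT release the monitor while waiting.
        return confirmation.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Needs the same monitor, so it cannot run until send() has returned. */
    synchronized void receive() {
        confirmation.countDown();
    }

    /** Returns true if send() saw its confirmation before the timeout expired. */
    static boolean confirmedInTime(long timeoutMs) {
        SharedLockStallDemo entity = new SharedLockStallDemo();
        Thread parser = new Thread(() -> {
            try {
                entity.sendInProgress.await(); // wait until a send is in flight
                entity.receive();              // blocks on the monitor held by send()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parser");
        parser.start();
        try {
            boolean confirmed = entity.send(timeoutMs);
            parser.join();
            return confirmed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("confirmed in time: " + confirmedInTime(500));
    }
}
```

In this sketch the send always runs into its timeout, because the only thread that could deliver the confirmation is parked on the monitor that send() itself is holding. That mirrors the "unblocks after a while" behaviour described above.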

Why is there no issue with the Ember dongle? Well, one reason could be that Ember’s equivalent of sendSynchronous() spawns a new thread, allowing the sending thread to complete; the parser can then finish receiving and eventually deliver the confirmation that sendSynchronous required.

Options to fix:

  1. Spawn a new thread for sending in the CC2531 driver, as we do for Ember
  2. Don’t sync on ApsDataEntity methods at all. Why do they need to be synchronized in the first place?
  3. Sync on ApsDataEntity methods separately for the sending and receiving paths
  4. … thoughts?
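As a rough illustration of option 3, here is a small self-contained Java sketch with separate locks for the two paths. The names (SeparateLocksDemo, sendLock, receiveLock, confirmedInTime) are hypothetical and this is not the ZSS code or the fix that was eventually made; it only shows that a waiting send no longer blocks the receive path when the two do not share a monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of option 3: one lock per direction instead of a shared monitor. */
class SeparateLocksDemo {
    private final Object sendLock = new Object();
    private final Object receiveLock = new Object();
    // Counted down once the "confirmation" frame has been handled by receive().
    private final CountDownLatch confirmation = new CountDownLatch(1);
    // Counted down once send() holds its lock, so the parser thread starts at the right moment.
    private final CountDownLatch sendInProgress = new CountDownLatch(1);

    /** Sends are still serialized against each other, but not against receives. */
    boolean send(long timeoutMs) throws InterruptedException {
        synchronized (sendLock) {
            sendInProgress.countDown();
            return confirmation.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Receives are serialized against each other only, so they can run during a send. */
    void receive() {
        synchronized (receiveLock) {
            confirmation.countDown();
        }
    }

    /** Returns true if send() saw its confirmation before the timeout expired. */
    static boolean confirmedInTime(long timeoutMs) {
        SeparateLocksDemo entity = new SeparateLocksDemo();
        Thread parser = new Thread(() -> {
            try {
                entity.sendInProgress.await(); // wait until a send is in flight
                entity.receive();              // not blocked: uses receiveLock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parser");
        parser.start();
        try {
            boolean confirmed = entity.send(timeoutMs);
            parser.join();
            return confirmed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("confirmed in time: " + confirmedInTime(2000));
    }
}
```

Each direction stays serialized with itself here, which is the property a fragment order check would presumably care about, but whether any state is shared between the real send and receive paths is something only the actual code can answer.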

Thanks for investigating @wsowa.

This is probably the best thing I think.

That’s a good question. From a quick look, if the threads got out of sync (or out of order) then the fragment order check would fail. I don’t recall if I had synchronisation issues when I wrote that code, but I’m hesitant to change it without good cause.

I don’t have an issue with this either…

Thanks.
Chris

I have raised a bug for ZSS and I suggest moving the discussion about the solution there, given we’ve confirmed this is a ZSS issue, not an openHAB one: https://github.com/zsmartsystems/com.zsmartsystems.zigbee/issues/986

I would suggest that the fix to ZSS, when available, should be released as a hotfix to the 2.5.1 binding rather than waiting for the next OH release. Currently all users of the latest OH version who run a zigbee network via a CC2531 dongle have their network rendered useless.

This is not possible unfortunately - 2.5.1 was released already and we can’t change that. People will have to use 2.5.2.


Any idea when to expect 2.5.2? @Kai originally said monthly releases but it has already been longer than that.

Sorry - I’ve no idea. In the meantime the snapshots will need to be used.


@wsowa maintainers have just been informed that the next release (2.5.2) will be today!!! Please can you address my comments, i.e. rebase your PR against master, and I will try to push to get the release delayed given the very short notice.


Hi, just replied to your comments. I have no access to a computer today until the evening so I can’t rebase it, but feel free to do so. The change is really a few lines of code.

Unfortunately I don’t think it will make it into the next release - it seems that will be done at 5PM UK time tonight so I will not have time to fix it before that. Sorry.