OH 3.2.0.M3 / ZigBee binding (Ember M35x Coordinator) / RFC2217 / hacked Silvercrest ZigBee gateway (EFR32MG1B-256k @ EmberZNet 6.7.8.0)

OK, AFAICT the RFXCom binding is rock solid, and you’re right, there might be a correlation. I’ll give the RFXCom device back to the RFXCom binding and provide a new logfile.

Steps planned:

  1. Disable all bindings connected to local physical serial ports (Z-Wave and RFXCom).
  2. Restart OH
  3. If 1. doesn’t persist: Disable all bindings connected to physical serial ports (Z-Wave and RFXCom).
  4. Create an Ember-Concentrator and configure it to use RFC 2217.

Anything else I could do to get meaningful log entries?

Well, it’s crashing from what I can see in the log, so it’s apparently got the “same” issue (at top level - ie serial port failure) as the zigbee binding :wink:

I think this is fine, but we still should not exclude the fact that there might be a common element. Still, let’s go with this as trying different things will potentially help understand what’s happening (or maybe just confuse the situation more :slight_smile: ).

All other bindings disabled,

1st_2nd_try.zip.txt (79.4 KB)
3rd_4th_try.zip.txt (55.9 KB)

Timeline:
GUI: Add Ember Coordinator
00:20 1st try: OH — WiFi — Hub — wired — ZigBee gateway
GUI: Delete Ember Coordinator
GUI: Add Ember Coordinator
00:26 2nd try: OH — WiFi — Hub — wired — ZigBee gateway
GUI: Delete Ember Coordinator
GUI: Add Ember Coordinator
00:40 3rd try: OH — wired — Hub — wired — ZigBee gateway
GUI: Delete Ember Coordinator
GUI: Add Ember Coordinator
00:45 4th try: OH — wired — Hub — wired — ZigBee gateway

Skimming through the files, I see:

  • RX ASH error: BAD PACKET
  • status=EMBER_NOT_JOINED
  • Error creating instance of EzspFrame / Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds for length 6

From a quickish look, I don’t really see any problem - obviously other than the fact that it stops working :wink:

Yes, there are a number of these and the binding seems to recover from them - as it should. The packets get retried, and life seems to go on. This sort of thing does happen on a serial / USB link as well, but it’s reasonably rare. I would suggest that this is indicative of lost data on the remote end.

That’s “OK” - the system has not got that far - it keeps falling over during the coordinator initialisation.

In themselves, these are not the cause of a problem - the frames just get dropped, but communications continues.

Ultimately, this always results in the NCP ceasing to communicate -:

2021-11-03 00:28:54.921 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH TX EZSP: EzspSetConfigurationValueRequest [networkId=0, configId=EZSP_CONFIG_SOURCE_ROUTE_TABLE_SIZE, value=16]
2021-11-03 00:28:54.921 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: TX EZSP queue size: 1
2021-11-03 00:28:54.921 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=false, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:28:54.961 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameNak [ackNum=4, notRdy=false]
2021-11-03 00:28:54.961 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:28:54.961 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:28:55.396 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:28:55.396 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:28:56.244 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:28:56.244 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:28:57.920 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:28:57.920 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:29:01.126 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:29:01.126 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:29:04.343 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - ASH: Retry Sent Queue Length 1
2021-11-03 00:29:04.343 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=4, ackNum=4, reTx=true, data=BD 00 01 53 00 1A 10 00]
2021-11-03 00:29:04.933 [DEBUG] [e.ember.internal.ash.AshFrameHandler] - Sending EZSP transaction timed out after 10 seconds

The binding tries to resend the data, but ultimately gives up.

Do you have Wireshark? It might be interesting to see what is happening on your TCP connection. IF you want to post Wireshark logs I can take a quick look at what happens when this all falls over.

There are two malformed packets from the ZigBee gateway:

I would have to correlate this to the OH log … - I will try tomorrow.

I would like to say that this shouldn’t matter. These are invalid TCP packets, so TCP should take care of this (ie by resending corrupted frames) such that what pops out in the binding is correct.

However, it will be interesting to see if the timing correlates with problems in the binding…

@chris

Serial gateway running on the ZigBee gateway is:

1st try: serial gateway without any command line option, i.e. using hardware flow control
2nd try: serial gateway with option -f, i.e. software flow control

OH logs and Wireshark packet captures:
sniffed.zip.txt (60.7 KB)

I don’t think that flow control type set in the binding will make any difference here since I don’t believe that this is passed from the client to the server. As I understand the server is started separately, and flow control might make a difference there.

From the binding log it looks pretty much the same in both logs. What is interesting, and I’d not spotted earlier, is that there is an “Error 81” in some of the logs (I’ve not checked if this is always the case). This error actually comes from the Silabs chipset and it means that the device has gotten out of sync with data, and it has asserted. The only recovery from this is to reset the device and start again. I don’t really know what would cause this here as I’m not familiar with the server application - I could guess that the ASH protocol might be getting out of sync, although delays through the network are pretty minimal compared to the relatively slow speed on the serial port.

Looking at the Wireshark log I don’t really see any issue as such - it just stops passing data which is consistent with the fact that the NCP is asserting.

@chris

And now for something completely different: :slight_smile:

Sonoff ZB Bridge (EFR32MG21@EmberZNet 6.7.9.0) running Tasmota 10.0.0 via RFC 2217

OH log:
Tasmota_Sonoff_ZB_Bridge_OH.txt (109.4 KB)

Tasmota console:
Tasmota_Console.txt (34.8 KB)

Looks familiar …

It works flawlessly with Home Assistant Operating System.

I guess this isn’t a surprise that the result is the same - isn’t the lidl device basically the same? Probably the implementation is broadly the same.

I really have no clue about Home Assistant and how it works. All I know is that I have literally 10s of thousands of installs out there with direct connections (and that’s just the ones I know of - I’ve no idea how many customers German Telekom supports with this) and they also work flawlessly.

I’m not really sure where to take this - if you can try and debug that would be great, but otherwise I’m not sure what I can do.

As a final step I will try to get RFC 2217 working under OH@Linux.The problem might be related to Windows.

ZigBee@Home Assistant seems to be based on GitHub - zigpy/bellows: A Python 3 project to implement EZSP for EmberZNet devices.

I don’t really want to start digging through Python code - at the end of the day it’s just two implementations of ASH. My implementation of ASH was produced 4 or 5 years ago and has worked without any issues on lots of projects. Of course it’s always possible there’s a bug, or some sort of tight timing issue that’s causing problems here, but it works well on the serial connection and I’m not really sure how to debug this further at the moment (sadly).

Results for openHABian 64bit 3.2.0 #2546 / Raspi4:

Sonoff ZB Bridge@Tasmota 10.0.0 / EZN 6.7.9.0:
BAD PACKET, Error 81, ASH timeout
Sonoff_ZB_Bridge_Tasmota_10.0.0_EZN_6.7.9.0_openHABian_3.2.0_2546.log.txt (352.6 KB)

Lidl Silvercrest ZigBee gateway / EZN 6.7.8.0:
ASH timeout
Lidl_Silvercrest_ZB_Gateway_EZN_6.7.8.0_openHABian_3.2.0_2546.log.log.txt (26.6 KB)

My final conclusion:
Support for RFC 2217 in OH/OH Zigbee binding is broken (Windows 10, openHABian).

Workaround for openHABian can be found here:

Successfully tested with Sonoff ZB Bridge and Lidl Silvercrest ZigBee gateway.

So I guess if the problem can be solved in the OS and it works on openHABian, then the issue lies somewhere in the OS or RFC2217 implementation rather than the binding?

All we can say is that we cannot connect the OH ZigBee binding to a ‘serial’ port using RFC 2217 (as of now - tested on Windows 10 and openHABian). However, the Z-Wave binding seems to work in combination with RFC 2217 (Share Z-wave dongle over IP (USB over IP using ser2net / socat ) guide - #253 by mvbergen). In other words: there must be be a problem in the chain of the components. Unfortunately, we don’t know where - and IMHO it would be difficult to find the root cause. Besides the RFC 2217 problem the OH ZigBee binding seems to work flawlessly. The fact that the Z-Wave binding does support RFC 2217 and the OH ZigBee binding doesn’t could be an indication (not a proof though) that the problem lies within the OH ZigBee binding or is caused by an incompatibilty between the ZigBee binding and the OS/OH. Whether the OS/OH or the OH ZigBee binding is the culprit, I don’t know.

Unfortunately I don’t think there’s a lot I can do to debug this further as I don’t have any gateway device to support any testing.

If you are interested in debugging the ZigBee Binding/RFC 2217 problem any further, I could send you a hacked Silvercrest ZigBee gateway (EFR32MG1B-256k @ 6.7.8.0). As an alternative I could send a hacked Sonoff ZigBee gateway (EFR32MG21, IIRC), but the Sonoff device is a WiFi - ZigBee bridge which might make debugging a nightmare due to WiFi latency issues.

1 Like

Thanks @anon71759204 and sorry for the slow response - I’ve been ill for the past few days. Anyway, if you’re happy to part with one, and on the basis there’s no guarantees that this can be fixed as I suspect it could be difficult to trace, then I’m happy to take a look. I do find it strange that this occurs, and am concerned there might be some underlying issue that is (mostly) only revealed when run like this, so it would be good to take a look.

1 Like

As the problem with RFC 2217 seems to be independent of the [Ember] ZigBee gateway used, I’d appreciate if you could take a closer look into the problem. I’ll PM you for the shipping address.

2 Likes