RFXCOM binding stops for no clear reason

Right - so I’ve been battling with this, on and off, for months - but it’s got a lot more frequent recently. I did think it might be aligned to the echo binding memory leak, but since that’s been resolved - I know it’s not that.

So - the problem - every so often - the RFXCOM binding just stops receiving/transmitting. No errors in logs, it just stops. Restarting the binding doesn’t fix it, but a full restart of Openhab will. Openhab otherwise continues on quite happily.

Running this on a RPI 3B with a RFX433E (about 2 years old) running the latest Pro1 firmware, running raspbian with openhab installed on top. Been using the same setup for about the same amount of time, but have upgraded underlying OS and Openhab consistently. Running 2.5.5 with the amazon echo binding up to the fix provided in the memory leak/connection problem thread. The 433e is monitoring a weather station, an energy monitor and is controlling 2 Somfy blinds which are only activated at sunrise and sunset.

There’s no consistency in timing, events or otherwise. I’m finally posting this with some logs on trace level to see if that provides anything.

So - here’s the situation with this afternoon:
14:58 - last message recieved by the RFXCOM binding
transceiver checks continue and report the device online.
Around 15:55 I notice the binding has stopped
15:58 - I attempt a restart of the binding - this disposes of the various items, but disposing of the handler takes too long (apparently). After that - the tranceiver is unitialised.
Around 2 minutes later - I attempt a USB hot port swap for the transceiver to see if this resolves the issue, but the transceiver doesn’t come back online in OpenHab. (note - I’ve done hot swaps when it IS working with no issue previously)
So I then restart OpenHab (note - I didn’t restart the PI) and it comes back absolutely fine.

Now previously when this has occurred, doing things like a USB port swap can seem to resolve the problem for a month or two, but the issue has always returned. It’s possible I’ve had multiple issues over time of course and swapping port resolved problem A - but not problem B.

Following here: RFXCOM Binding stops working after some time - I also made sure the 433e was on the latest firmware (it wasn’t two weeks ago but it is now) - but it’s still suffering.

I don’t quite believe it’s related to the device because it doesn’t require a full reboot to restore the transceiver to working - it just needs openhab to be restarted.

What’s notable about where it stops is that, in the terms of the logging, it’s made to to RFXComBridgeHandler:237, but never makes it to RFXComHandler:146 - the normal path you see from other events and I just can’t see how that can occur, but then I’m really not deep into the flow of how this binding all hangs together.

Anyway - logs attached - I’ve trimmed from before the event - but included enough to show the same rain device has reported before successfully.

Any thoughts on next steps on how to diagnose would be very welcome.

I’m about to add more blinds to the system so I’m very keen to get this resolved. Having the kitchen blinds not open/close automatically isn’t so much of a pain if I haven’t spotted the hang, if it’s going to affect more rooms, that’s a problem.

events.log (251.4 KB) openhab.log (142.6 KB)

Grrr - and again - not 1 hour after restart:

2020-06-17 16:55:17.409 [TRACE] [nternal.connector.RFXComStreamReader] - Message length is 16 bytes
2020-06-17 16:55:17.412 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 16 bytes from the message
2020-06-17 16:55:17.414 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.417 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 15 bytes from the message
2020-06-17 16:55:17.419 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.421 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 14 bytes from the message
2020-06-17 16:55:17.424 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.426 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 13 bytes from the message
2020-06-17 16:55:17.429 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.431 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 12 bytes from the message
2020-06-17 16:55:17.433 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.436 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 11 bytes from the message
2020-06-17 16:55:17.438 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.440 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 10 bytes from the message
2020-06-17 16:55:17.443 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.445 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 9 bytes from the message
2020-06-17 16:55:17.448 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.450 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 8 bytes from the message
2020-06-17 16:55:17.453 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.455 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 7 bytes from the message
2020-06-17 16:55:17.458 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.461 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 6 bytes from the message
2020-06-17 16:55:17.464 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.471 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 5 bytes from the message
2020-06-17 16:55:17.474 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.476 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 4 bytes from the message
2020-06-17 16:55:17.478 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.480 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 3 bytes from the message
2020-06-17 16:55:17.483 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.485 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 2 bytes from the message
2020-06-17 16:55:17.488 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.490 [TRACE] [nternal.connector.RFXComStreamReader] - Waiting remaining 1 bytes from the message
2020-06-17 16:55:17.492 [TRACE] [nternal.connector.RFXComStreamReader] - Received 1 bytes from the message
2020-06-17 16:55:17.494 [DEBUG] [internal.handler.RFXComBridgeHandler] - Message received: Raw data = 1056020DE80000CA000000000089CD0559, Packet type = WIND, Seq number = 13, Sub type = WIND2, Device Id = 59392, Wind direction = 202.0, Wind gust = 0.0, Average wind speed = 0.0, Temperature = 0.0, Chill temperature = 0.0, Signal level = 5, Battery level = 9
2020-06-17 16:55:17.496 [TRACE] [scovery.RFXComDeviceDiscoveryService] - Received: bridge: rfxcom:bridge:9e599abb message: Raw data = 1056020DE80000CA000000000089CD0559, Packet type = WIND, Seq number = 13, Sub type = WIND2, Device Id = 59392, Wind direction = 202.0, Wind gust = 0.0, Average wind speed = 0.0, Temperature = 0.0, Chill temperature = 0.0, Signal level = 5, Battery level = 9
2020-06-17 16:55:17.498 [TRACE] [scovery.RFXComDeviceDiscoveryService] - Adding new RFXCOM rfxcom:wind:9e599abb:59392 with id ‘59392’ to smarthome inbox
2020-06-17 16:55:21.236 [TRACE] [internal.handler.RFXComBridgeHandler] - Checking RFXCOM transceiver connection, thing status = ONLINE
2020-06-17 16:56:21.239 [TRACE] [internal.handler.RFXComBridgeHandler] - Checking RFXCOM transceiver connection, thing status = ONLINE
2020-06-17 16:57:21.242 [TRACE] [internal.handler.RFXComBridgeHandler] - Checking RFXCOM transceiver connection, thing status = ONLINE
2020-06-17 16:58:21.245 [TRACE] [internal.handler.RFXComBridgeHandler] - Checking RFXCOM transceiver connection, thing status = ONLINE
2020-06-17 16:59:21.248 [TRACE] [internal.handler.RFXComBridgeHandler] - Checking RFXCOM transceiver connection, thing status = ONLINE

So - despite nothing being in syslog to indicate a USB problem - I’ve followed the advice here: RFXCom stops sending or receiving any message with error in syslog (which didn’t show up in my googling of threads!)

–edit-- - I did have a hub I knew worked, but it is USB3 which has problems on the RPI - so I’m acquiring a USB2 powered hub to try and see if this resolves the problem.

So we’ll see.

1 Like

So the hub hasn’t helped (and it is separately powered). BUT it did draw attention to some timeouts which are also occurring. Swapped the hub out around 1pm today, last timeout was at 14:32 and then it locked up again at 19:03. So the timeouts and the lock ups aren’t coinciding - doesn’t mean they aren’t symptoms of the same problem.

Nothing in dmsg, nothing in syslog.

So now I really am stuck.

Options from here unless someone has a better idea: rebuild the PI ground up. I’m loathed to do this - it has a variety of little services running on it - including DNSMasq etc and while I have backups of all the config, remembering the entire setup will be a pain.

Get a new PI 4b and work on that one. I’m actually considering that anyway as I’m concerned about having a few services sitting on one device (especially one running on an SD card).

Get a new RFXCom device. It’s possible it’s the device failing - it is 2 years old.

The cheapest option is the rebuild - but it’s also the one which will take the longest.

I’m inclined to buy the new PI and see as I don’t particularly want to buy a new RFX if I can avoid it - I can always use a new PI, but have no use for 2 RFXComs.

Suggestions? Help? Please?

So - to keep progress and interest.

I was about to jump on and buy a new PI to start a rebuild and I thought “you know - a thread just silently ending looks like a race condition to me”. (I used to be a java dev many years ago!)

Once I found out how to find the active threads in the console - I was surprised to find 82 blocked threads. 62 of them from the HarmonyHubDiscoveryServer. Additionally - I had a suspicious blocked thread from the SamsungTV binding.

They were pretty much all blocked in the thingDiscovered Method of org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl. All waiting on the lock on cachedresults. Several others were in a similar state on thingRemoved - again - almost all waiting on the cachedresults lock.

And indeed - for the RFXCom binding - there is, indeed, a thread blocked on thingDiscovered.

The suspicious (to me) one is actually blocked in thingRemoved - but much further down the chain - trying to close a websocket in the SamungTV RemoveControllerWebSocket.java file. I’m not entirely sure - but it’s leading me to think there’s a problem there.

SO - I’ve now removed the samsungtv binding and the harmonyhub binding - we’ll see if that fixes the RFXCom lockups in any way.

I’m betting it’s the samsungtv binding at this point. It’s the only one that wasn’t just sitting waiting on the cahcedresults lock. Why it was causing the lock to be held - I’m not clear though.

Right - so - a week on, I upgraded to 2.5.6 - left that for a couple of days, and then reinstalled the HarmonyHub binding.

So far so stable. To this point I’ve now raised a bug on github pointing the finger at the SamsungTV binding. I’m still not clear why the disposal of the websocketclient in the binding should get hung up - but it does for some reason.