How to automatically restart cloud connector after an unexpected disconnection

Hi, this makes no sense to me. Why can’t we have a recovery within the cloud connector binding?

I am having the same issue, and many months ago I implemented an item that calls an exec script to restart the cloud connector. This means the issue is not the server but the local cloud connector.

Every time this happens, the cloud connector has had a disconnect and states that it is reconnected while it actually is not.

A simple handshake between the local cloud connector binding and the server should be enough.

  1. openHAB reconnects: Hello myopenHAB, I am still here
  2. No answer from myopenhab
  3. Restart openHAB cloud connector

Actually, there may not be enough evidence to draw a conclusion either way. What we know is:

  1. myopenHAB does not think the OH server is connected
  2. the OH server thinks it's connected to myopenHAB
  3. the OH server can still send notifications through myopenHAB (only verified by some users)

If the server weren’t actually connected, that third point wouldn’t be possible. That’s why some of us think that the problem is with myopenHAB.

If it’s simple, then I’d encourage you to take a look at the code for the cloud connector and try adding it. I’m not a developer, so I don’t have the ability to contribute on that end. I would if I could.

Short of that, the solution I’ve posted above is essentially a handshake. I just chose to test if a command can be received through the REST API instead.
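For illustration, here is a minimal Python sketch of that watchdog idea. The `probe` and `restart` callables are hypothetical stand-ins: `probe` would exercise the round trip (e.g. send a command through myopenHAB and verify it arrives via the local REST API), and `restart` would bounce the cloud connector (e.g. via an exec script).

```python
def check_and_restart(probe, restart, max_failures=3):
    """Run the (hypothetical) probe up to max_failures times; if every
    attempt fails, trigger the (also hypothetical) restart action.
    Returns True if a restart was triggered."""
    for _ in range(max_failures):
        if probe():          # round trip worked: connection is fine
            return False
    restart()                # all probes failed: bounce the connector
    return True
```

In a real rule this would run on a timer, with a pause between probe attempts.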

Yep, as @rpwong summarized well.

I think we have pretty well established that the client is reconnecting successfully. There is a handshake in which the client talks to the server and the server talks to the client. There is also a regular check that communications work, with one party sending a ping and the other responding with a pong. These basic healthcheck and handshake mechanisms all come from the Socket.IO protocol, which is based on WebSocket technology.
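As a rough illustration of that ping/pong liveness rule: the peer is considered dead once no pong has been seen within the ping interval plus the timeout. (In Socket.IO these parameters are negotiated during the handshake; the defaults below are only illustrative.)

```python
class Heartbeat:
    """Simplified sketch of a Socket.IO-style ping/pong liveness check.
    Times are in seconds; the values are illustrative, not the ones
    myopenHAB actually negotiates."""

    def __init__(self, ping_interval=25.0, ping_timeout=60.0):
        self.ping_interval = ping_interval
        self.ping_timeout = ping_timeout
        self.last_pong = 0.0

    def on_pong(self, now):
        self.last_pong = now  # peer answered our ping

    def alive(self, now):
        # dead once no pong arrived within interval + timeout
        return (now - self.last_pong) <= self.ping_interval + self.ping_timeout
```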

We have several reports that notifications go through even while the cloud shows the instance as offline. In fact, to my knowledge there are zero reports of notifications not working in this weird state.

The thing is that openHAB Cloud separately tracks which clients (UUIDs) are online and which are offline. Whenever we get a new connection handshake, we update the status to online. Whenever there is a disconnect (whether due to a ping/pong failure or a “clean” disconnect; it does not matter), the status is updated to offline.

This online/offline status tracking is implemented within the openhab project (openhab-cloud repo), backed by a database.

Unfortunately it remains a mystery why the online/offline tracking is not working as expected. There could be a bug in the Socket.IO library on the server or the client (e.g. a missing “connection”/“disconnect” event). Or perhaps there are race conditions that lead to the wrong state being stored on the cloud side (**). The Socket.IO library versions are oldish; perhaps an update would help? An update is not trivial, though: backwards compatibility with old clients needs to be considered, and there is no proper means to test this out safely.

This is quite hard to solve since

  • this is a volunteer project with limited hours put into it; it is a “charity”, not a business, and the maintainers likely all have day jobs
  • debugging on the cloud side depends on those volunteer hours, with limited opportunities to debug “end-to-end”, e.g. following one specific client and seeing how it looks on the cloud side
  • there are performance topics to be considered; I understand the free service is actually quite heavily used

**) I spent some time staring at the cloud-side code with help from digitaldan, and we did find one race condition. It was fixed some time ago, but clearly it was not the (main/common) root cause of what people are experiencing.


Thinking back to @SeeAge’s comment, would it be possible to make a version of the cloud connector that sends an online message to myopenHAB periodically (without having to disconnect and reconnect)?

It is already done in practice in the form of “pong” messages. This is part of the protocol.

On the client side, one can build custom logic based on those messages. For example, OH nowadays logs the time between ping and pong. I presume the same is possible on the server side.

I guess the thing is that making a database call (which is where the online/offline status is stored) all the time might not fly in practice. This comes back to the topic of limited pre-production testing capabilities.

Then again, this gave me an optimization idea: could we check only new connections on the cloud side and verify that the online status is correct?

Probably such a check would be best spread out over time (e.g. 30 s after connect, but at the latest 2 min after connect), so that things do not crash and burn when large swarms of clients connect at once (e.g. when the server is restarted).
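A sketch of that spreading-out, assuming the verification is simply scheduled with a random per-client delay (the window bounds are the example numbers from above, nothing more):

```python
import random

def verification_delay(min_s=30.0, max_s=120.0, rng=random.random):
    """Pick a per-client delay in [min_s, max_s] so that post-connect
    status verifications are spread out instead of all firing at once
    (e.g. when a cloud restart makes thousands of clients reconnect)."""
    return min_s + (max_s - min_s) * rng()
```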

And of course, what I described above is only my hunch…who knows if something else is actually broken.


Hi all, I may have a solution to the “split brain” issue going on here. There are a number of changes in the works right now, but a big one will be a push I’m doing tomorrow that will hopefully ensure our cloud code only allows one connection attempt from an authorized UUID at a time. The issue, I think, is that the underlying retry logic tries a couple of times to reconnect in the background. While those attempts happen serially on the client, the cloud service can actually process them in parallel due to load balancing, DB calls, Redis calls, etc. That means a connection the client gave up on may finish after a good connection is made, which then overwrites the DB with the wrong server address, and we no longer know how to route proxy connections to it.
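To make the idea concrete, here is one way a “latest connection wins” guard could look. This is a Python sketch, not the actual openhab-cloud code (which is Node.js): each handshake gets a fresh id, and a stale attempt that completes late is not allowed to overwrite the routing entry.

```python
class ConnectionRegistry:
    """Sketch of a guard against late, abandoned handshakes overwriting
    the routing entry written by the current good connection."""

    def __init__(self):
        self._next_id = 0
        self._current = {}  # uuid -> id of the latest handshake attempt
        self.routes = {}    # uuid -> server address (stands in for the DB)

    def begin_handshake(self, uuid):
        """Register a new attempt; any earlier attempt becomes stale."""
        self._next_id += 1
        self._current[uuid] = self._next_id
        return self._next_id

    def finish_handshake(self, uuid, conn_id, server_addr):
        """Only the latest attempt may write its server address."""
        if self._current.get(uuid) != conn_id:
            return False  # stale attempt finished late: ignore it
        self.routes[uuid] = server_addr
        return True
```

The same effect could be achieved with a conditional DB update; the in-memory version just keeps the sketch self-contained.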

We also have some changes to the cloud add-on to connect more gracefully, which we will try to get into the next 3.4.x release as well as 4.0.


This explanation makes a ton of sense. Fingers crossed! :crossed_fingers:

Is this a fix on the cloud server, or do I have to upgrade my openHAB server? My myopenHAB connection is still breaking after some time.




The next OH 3.4 release (which may be this weekend or next) will have a reconnect logic fix in the binding. Any other fixes needed will be done on the cloud side.

Hey Russ,

thanks for your workaround, it saved me a lot of nerves during continuous disconnections and missing notifications through OH Cloud. In the end it also pointed to the reason causing the issue:

My connection breaks down with the following log entry:
“Error connecting to the openHAB Cloud instance: already connected”
→ Checking the time and events in my network, I figured out that it always happens during the forced disconnection of my internet connection by the provider. Does anyone know if there is a way to solve this without restarting the cloud service?

The reason I ask is that when the service is restarted, all rules are reloaded, so some runtime data gets lost. Is there any way to restart the service without all rules being reloaded?

My setup:
openHABian 2.5.12 on RPi 3B+

Thanks again for your nice workaround.
Greetings Andy

I don’t think that should happen when you just restart a binding.

Honestly, I can’t really help you with an OH2.5 system. That’s more than two years old, and there have been a lot of updates to…everything.

Hey Russ, thanks for your reply.
To be honest…I have been thinking of an upgrade to OH3.x for a long time…looks like this is the trigger to finally start the work :slight_smile:
Greetings Andy

There are a lot of conversations about upgrading right now, as some users realize that they’re on 2.5 and see that OH4 is targeting a release later this year. If you have time, I’d suggest moving to 3.4.2, which is a significant leap forward and will prepare you for OH4. Otherwise, you’ll be dealing with breaking changes from two major releases at once.