How to automatically restart cloud connector after an unexpected disconnection

Thanks for addressing what has been a mildly vexing issue and for providing a guide that a user who is not a developer can follow and understand what is going on. Often when someone provides help, it is in the form of “do xyz”. Then I spend an hour or two searching the docs and the forum to figure out how to do xyz.

Once I get a little runtime, I will report back on how this is working.

Yeah, and I do that too. When we’re explaining something in response to a question/problem, it’s very easy to gloss over steps that seem like common sense, but are really learned through repetition. “Writing a tutorial” is a different mindset from “responding to a question”.

I’ve added a version of the rule that has a counter so that users can get a sense of how frequent the restarts are (as opposed to successful reconnections). I stopped short of actually calculating the success rate of reconnections. :wink:

I get it, but the comment still talks about openHAB_Cloud_Status and “toggled off by the REST command”.

Instead, I think the REST commands set myopenHAB_Connection as “Online”?

Oh, I see. That wasn’t in the old version, but it was more relevant to a draft I didn’t publish. I had used a switch, but replaced it with a string. I’ll update that. :wink:

I use this Item in my UI. Works fine for now and looks good :wink:

While I really appreciate your effort, wouldn’t it make more sense if the cloud connector implemented some sort of recovery mechanism by itself?

Yes and no.

Yes, we need to eventually solve this ongoing issue, which is fairly recent (relative to my four years using openHAB). Hence:

No, it doesn’t make sense to me for myopenHAB to have a recovery mechanism.

Individual OH servers go offline all of the time for various reasons (reboots, power outages, Internet outages, upgrades, etc.). We wouldn’t want myopenHAB to keep trying to reconnect to servers that are actually offline, and there’s no way for it to know if that’s the case. I’m actually not sure if that would even be possible (but I’m not a developer). I suspect that it’s not.

Hi, this makes no sense to me. Why can’t we have a recovery within the cloud connector binding?

I am having the same issue and implemented an item for calling the exec script to restart the cloud connector many months ago. In my view that means the issue is not the server but the local cloud connector.

Every time this happens, the cloud connector has had a disconnect and states that it is reconnected while it actually is not.

A simple handshake between the local cloud connector binding and the server should be enough.

  1. OpenHAB reconnects: Hello myopenhab, I am still here
  2. No answer from myopenhab
  3. Restart openHAB cloud connector
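
For anyone who wants to picture the three steps above, here is a minimal Python sketch. It is not how the cloud connector works today. The names `check_and_recover`, `probe` and `restart` are made up for illustration, and how you would actually ask myopenHAB for an answer or restart the connector is left to whatever you plug in.

```python
from typing import Callable


def check_and_recover(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    attempts: int = 2,
) -> bool:
    """Ask the remote side for an answer; restart the connector if none comes.

    probe   -- hypothetical callable, returns True if myopenHAB answered
    restart -- hypothetical callable that restarts the cloud connector
    Returns True if a restart was triggered, False otherwise.
    """
    for _ in range(attempts):
        # Step 1: "Hello myopenhab, I am still here"
        if probe():
            return False  # got an answer, nothing to do
    # Step 2: no answer after all attempts
    # Step 3: restart the connector
    restart()
    return True
```

Trying the probe more than once before restarting is a design choice of this sketch, so that a single slow answer does not cause a restart.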

Actually, there may not be enough evidence to draw a conclusion either way. What we know is:

  1. myopenHAB does not think the OH server is connected
  2. the OH server thinks it’s connected to myopenHAB
  3. the OH server can still send notifications through myopenHAB (only verified by some users)

If the server weren’t actually connected, that third point wouldn’t be possible. That’s why some of us think that the problem is with myopenHAB.

If it’s simple, then I’d encourage you to take a look at the code for the cloud connector and try adding it. I’m not a developer, so I don’t have the ability to contribute on that end. I would if I could.

Short of that, the solution I’ve posted above is essentially a handshake. I just chose to test if a command can be received through the REST API instead.

Yep, as @rpwong summarized well.

I think we have pretty well established that the client is reconnecting successfully. There is a handshake in which the client talks to the server and the server talks to the client. There is also a regular check that communication works, with one party sending a ping and the other responding with a pong. These basic health-check and handshake mechanisms all come from the Socket.IO protocol, which is based on WebSocket technologies.
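
To make the ping/pong idea concrete, here is a small Python sketch of a heartbeat check. It is only an illustration of the general technique, not Socket.IO’s actual code. The class name `HeartbeatMonitor` is hypothetical, and the interval and timeout values in the example are made up rather than taken from the openHAB setup.

```python
class HeartbeatMonitor:
    """Tracks whether the other side has answered recently enough."""

    def __init__(self, ping_interval: float, ping_timeout: float, started_at: float):
        # A pong is expected once per interval, with some extra grace time.
        self.window = ping_interval + ping_timeout
        self.last_seen = started_at

    def on_pong(self, now: float) -> None:
        # Called whenever a pong arrives from the other party.
        self.last_seen = now

    def is_alive(self, now: float) -> bool:
        # The link counts as alive while the last pong is within the window.
        return (now - self.last_seen) <= self.window


# Illustrative values only: ping every 25 s, allow 5 s of grace.
monitor = HeartbeatMonitor(ping_interval=25.0, ping_timeout=5.0, started_at=0.0)
monitor.on_pong(100.0)
print(monitor.is_alive(120.0))  # True: 20 s since the last pong
print(monitor.is_alive(131.0))  # False: 31 s since the last pong
```

A check like this only tells each side that the socket is healthy. It says nothing about what the cloud has stored as the online/offline status.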

We have several reports that notifications go through even while the cloud shows that the instance is offline. Actually, to my knowledge, there are zero reports of notifications not working in this weird state.

The thing is that openhab cloud is tracking separately which clients (uuids) are online, which are offline. Whenever we get a new connection handshake, we update the status to online. Whenever there is disconnect (whether it is due to ping/pong failure or “clean” disconnect; does not matter), the status is updated offline.

This online/offline status tracking is implemented within the openhab project (openhab-cloud repo), backed by a database.
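
As a rough picture of that tracking, here is a Python sketch with an in-memory dictionary standing in for the database. This is not the openhab-cloud code. `StatusTracker` and its methods are hypothetical names, and the out-of-order example at the end is only one way such a tracker could end up wrong, not a confirmed cause.

```python
class StatusTracker:
    """Keeps one online/offline flag per client UUID."""

    def __init__(self):
        self._status = {}  # uuid -> "online" or "offline"

    def on_connect(self, uuid: str) -> None:
        # A new connection handshake marks the client online.
        self._status[uuid] = "online"

    def on_disconnect(self, uuid: str) -> None:
        # Any disconnect (ping/pong failure or clean) marks it offline.
        self._status[uuid] = "offline"

    def status(self, uuid: str) -> str:
        return self._status.get(uuid, "offline")


tracker = StatusTracker()
tracker.on_connect("uuid-1")     # handshake of the new, working connection
tracker.on_disconnect("uuid-1")  # disconnect of an older connection, handled late
print(tracker.status("uuid-1"))  # offline, although a live connection exists
```

Because the flag is stored per UUID and not per connection, the last event handled wins, whichever connection it belonged to.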

Unfortunately, it remains a mystery why the online/offline tracking is not working as expected. There could be a bug within the Socket.IO library on the server or client (e.g. a missing “connection”/“disconnect” event). Or perhaps there are race conditions that lead to the wrong state being written on the cloud side (**). The Socket.IO library versions are oldish, so perhaps an update would help? An update is not trivial, though: backwards compatibility with old clients needs to be considered, and there is no proper means to test this out safely.

This is quite hard to solve since

  • this is a volunteer project with limited hours put into it; it is a “charity”, not a business. The maintainers probably all have day jobs.
  • debugging on the cloud side depends on those volunteer hours, and there are limited opportunities to debug “end-to-end”, e.g. following one specific client and seeing how it looks on the cloud side
  • there are performance topics to be considered; as I understand it, the free myopenhab.org service is actually quite heavily used

**) I spent some time staring at the cloud side code with help from digitaldan and we did find one race condition. This was fixed some time ago but clearly it was not the (main/common) root cause people are experiencing

Thinking back to @SeeAge’s comment, would it be possible to make a version of the cloud connector that sends an online message to myopenHAB periodically (without having to disconnect and reconnect)?

It is already done in practice in the form of “pong” messages. This is part of socket.io protocol

On the client side, one can build custom logic based on those messages. For example, OH nowadays logs the time between ping and pong. I presume the same is possible on the server side.

I guess the thing is that making a database call (that is where the online/offline status is stored) all the time might not fly in practice. Comes back to the topic of limited pre-production testing capabilities.


Then again, this gave me an optimization idea: could we check only new connections on cloud side, and verify the online status is correct?

Probably such a thing is good to spread out over time (30s after connect but at latest 2min after connect for example), so that things do not crash and burn when large swarms of client connect at once (eg when server restarted)
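
Spreading the checks out could look something like this Python sketch. It is only my illustration of the idea, not anything that exists in the cloud code. The function names are hypothetical, and the 30 s and 120 s bounds are just the example numbers from above.

```python
import random
from typing import Dict, Optional


def verification_delay(
    min_s: float = 30.0,
    max_s: float = 120.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random delay so that checks do not all fire at once."""
    if rng is None:
        rng = random.Random()
    return rng.uniform(min_s, max_s)


def schedule_checks(connect_times: Dict[str, float], rng: random.Random) -> Dict[str, float]:
    """Map each client UUID to the time its online status should be verified."""
    return {
        uuid: connected_at + verification_delay(rng=rng)
        for uuid, connected_at in connect_times.items()
    }


# Three clients reconnecting at the same moment get different check times.
due = schedule_checks({"a": 0.0, "b": 0.0, "c": 0.0}, random.Random())
for uuid, when in due.items():
    print(uuid, round(when, 1))
```

With a random delay per client, a swarm of reconnects after a server restart turns into database checks spread over roughly 90 seconds instead of one burst.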


And of course, what I described above is only my hunch. Who knows if there is something else that is actually broken.

Hi all, I may have a solution to the “split brain” issue going on here. There are a number of changes in the works right now, but a big one will be a push I’m doing tomorrow that will hopefully ensure our cloud code only allows one connection from an authorized UUID to try and connect at a time. The issue I think we have now is that the underlying socket.io retry logic tries a couple of times to reconnect in the background. While those attempts happen serially on the client, the cloud service can actually be processing them in parallel because of load balancing, DB calls, redis calls, etc. That means a connection the client gave up on may finish after a good connection is made and then overwrite the DB with the wrong server address, so we no longer know how to route proxy connections to it.
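
To illustrate the “only one connection per UUID” idea, here is a Python sketch of a first-writer-wins guard. It is not the actual cloud change. `ConnectionRegistry`, the connection ids and the node names are all hypothetical, and an in-process lock stands in for whatever shared store the real service would use.

```python
import threading
from typing import Dict, Optional, Tuple


class ConnectionRegistry:
    """Allows at most one registered connection per client UUID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners: Dict[str, Tuple[str, str]] = {}  # uuid -> (conn_id, server)

    def try_register(self, uuid: str, conn_id: str, server: str) -> bool:
        # Refuse a second connection while one is already registered,
        # so a late attempt cannot overwrite the stored server address.
        with self._lock:
            if uuid in self._owners:
                return False
            self._owners[uuid] = (conn_id, server)
            return True

    def release(self, uuid: str, conn_id: str) -> bool:
        # Only the connection that owns the entry may remove it.
        with self._lock:
            owner = self._owners.get(uuid)
            if owner is None or owner[0] != conn_id:
                return False
            del self._owners[uuid]
            return True

    def route(self, uuid: str) -> Optional[str]:
        # Server address that proxy requests for this UUID should go to.
        with self._lock:
            owner = self._owners.get(uuid)
            return owner[1] if owner else None


registry = ConnectionRegistry()
print(registry.try_register("uuid-1", "good", "node-a"))   # True
print(registry.try_register("uuid-1", "stale", "node-b"))  # False, rejected
print(registry.route("uuid-1"))                            # node-a
```

In this sketch whichever attempt registers first wins. If a stale attempt got there first, the good one would be refused and the client would have to retry, so a real fix would need to handle that case too.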

We also have some changes to the cloud addon to connect more gracefully that we will try and get into the next 3.4.x release as well as 4.0

This explanation makes a ton of sense. Fingers crossed! :crossed_fingers:

Is this a fix on the cloud server, or do I have to upgrade my openHAB server? My myopenHAB connection is still breaking after some time.

:point_down:

We also have some changes to the cloud addon to connect more gracefully that we will try and get into the next 3.4.x release as well as 4.0

The next OH 3.4 release (which may be this weekend or next) will have a reconnect logic fix in the binding. Any other fixes needed will be done on the cloud side.

Hey Russ,

thanks for your workaround. It saved me a lot of nerves during the continuous disconnections and missing notifications through OH Cloud, and in the end it pointed me to what causes the issue:

My connection breaks down with the following log:
“Error connecting to the openHAB Cloud instance: already connected”
→ Checking the times and events in my network, I figured out that it always happens during the forced disconnection of my internet connection by the provider. Does anyone know if there is a way to solve this without restarting the cloud service?

The reason I ask is that when the service is restarted, all rules are reloaded, so some runtime data gets lost. Or is there a way to restart the service without all rules being reloaded?

My Setup:
Openhabian 2.5.12 on RPI 3B+

Thanks again for your nice Workaround
Greetings Andy

I don’t think that should happen when you just restart a binding.

Honestly, I can’t really help you with an OH2.5 system. That’s more than two years old, and there have been a lot of updates to everything.

Hey Russ, thanks for your reply.
To be honest, I have been thinking about an upgrade to OH3.x for a long time. Looks like this is the trigger to start the work :slight_smile:
Greetings Andy

There are a lot of conversations about upgrading right now, as some users realize that they’re on 2.5 and see that OH4 is targeting a release later this year. If you have time, I’d suggest moving to 3.4.2, which is a significant leap forward and will prepare you for OH4. Otherwise, you’ll be dealing with breaking changes from two major releases at once.