Thanks for addressing what has been a mildly vexing issue and for providing a guide that a user who is not a developer can follow and understand what is going on. Often when someone provides help, it is in the form of “do xyz”. Then I spend an hour or two searching the docs and the forum to figure out how to do xyz.
Once I get a little runtime, I will report back on how this is working.
Yeah, and I do that too. When we’re explaining something in response to a question/problem, it’s very easy to gloss over steps that seem like common sense, but are really learned through repetition. “Writing a tutorial” is a different mindset from “responding to a question”.
I’ve added a version of the rule that has a counter, so that users can get a sense of how frequent the restarts are (as opposed to successful reconnections). I stopped short of actually calculating the success rate of reconnections.
Yes, we need to eventually solve this ongoing issue, which is fairly recent (relative to my four years using openHAB). Hence:
No, it doesn’t make sense to me for myopenHAB to have a recovery mechanism.
Individual OH servers go offline all of the time for various reasons (reboots, power outages, Internet outages, upgrades, etc.). We wouldn’t want myopenHAB to keep trying to reconnect to servers that are actually offline, and there’s no way for it to know if that’s the case. I’m actually not sure if that would even be possible (but I’m not a developer). I suspect that it’s not.
Actually, there may not be enough evidence to draw a conclusion either way. What we know is:
myopenHAB does not think the OH server is connected
the OH server thinks it’s connected to myopenHAB
the OH server can still send notifications through myopenHAB (only verified by some users)
If the server weren’t actually connected, that third point wouldn’t be possible. That’s why some of us think that the problem is with myopenHAB.
If it’s simple, then I’d encourage you to take a look at the code for the cloud connector and try adding it. I’m not a developer, so I don’t have the ability to contribute on that end. I would if I could.
Short of that, the solution I’ve posted above is essentially a handshake. I just chose to test if a command can be received through the REST API instead.
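For anyone curious about the shape of that workaround, here is a hypothetical TypeScript sketch of its logic. The names `CloudWatchdog`, `probe`, and `restartBundle` are all illustrative stand-ins (none of them is a real openHAB API): probe whether a command round-trips through the cloud REST API, restart the cloud connector when it doesn’t, and keep the restart counter mentioned earlier.

```typescript
// Hypothetical sketch of the workaround's logic. `probe` stands in for
// "send a command through the myopenHAB REST API and see if it arrives";
// `restartBundle` stands in for "restart the cloud connector bundle".
type ProbeFn = () => boolean;

class CloudWatchdog {
  restarts = 0; // how many times the connector had to be restarted
  checks = 0;   // how many health checks ran in total

  constructor(private probe: ProbeFn, private restartBundle: () => void) {}

  // Run one health check; restart the cloud connector on failure.
  check(): void {
    this.checks++;
    if (!this.probe()) {
      this.restartBundle();
      this.restarts++;
    }
  }

  // Fraction of checks that did not need a restart, i.e. the
  // "success rate of reconnections" the post above stops short of computing.
  successRate(): number {
    return this.checks === 0 ? 1 : (this.checks - this.restarts) / this.checks;
  }
}
```

In the real rule this `check()` would run on a cron trigger; the sketch only shows the counting logic.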
I think we have pretty well established that the client is reconnecting successfully. There is a handshake in which the client talks to the server and the server talks to the client. There is a regular check that communications work, with one party sending a ping and the other responding with a pong. These basic healthcheck and handshake mechanisms all come from the Socket.IO protocol, which is based on websocket technologies.
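As a rough model of that healthcheck, here is a toy TypeScript sketch — a simplification, not the actual Socket.IO implementation. The parameter names only mirror Socket.IO’s `pingInterval`/`pingTimeout` settings: if no pong arrives within `pingInterval + pingTimeout`, the peer is considered gone.

```typescript
// Toy model of a ping/pong healthcheck (NOT the real Socket.IO code).
// The peer is alive as long as a pong has been seen recently enough.
class Heartbeat {
  private lastPongAt: number;

  constructor(
    private pingIntervalMs: number,
    private pingTimeoutMs: number,
    now: number,
  ) {
    this.lastPongAt = now;
  }

  // Record that the other party answered our ping.
  onPong(now: number): void {
    this.lastPongAt = now;
  }

  // The connection counts as alive while the last pong is recent enough.
  isAlive(now: number): boolean {
    return now - this.lastPongAt <= this.pingIntervalMs + this.pingTimeoutMs;
  }
}
```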
We have several reports that notifications go through even while the cloud shows the instance as offline. Actually, to my knowledge, there are zero reports of notifications not working in this weird state.
The thing is that openHAB Cloud separately tracks which clients (UUIDs) are online and which are offline. Whenever we get a new connection handshake, we update the status to online. Whenever there is a disconnect (whether due to a ping/pong failure or a “clean” disconnect does not matter), the status is updated to offline.
This online/offline status tracking is implemented within the openhab project (openhab-cloud repo), backed by a database.
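A minimal sketch of that bookkeeping, with an in-memory `Map` standing in for the real database (the actual openhab-cloud code persists this; all names here are illustrative):

```typescript
// Sketch of the online/offline status tracking described above.
// A Map stands in for the database used by openhab-cloud.
class StatusStore {
  private online = new Map<string, boolean>();

  // A new connection handshake marks the client online.
  onConnect(uuid: string): void {
    this.online.set(uuid, true);
  }

  // Any disconnect (ping/pong failure or clean close) marks it offline.
  onDisconnect(uuid: string): void {
    this.online.set(uuid, false);
  }

  isOnline(uuid: string): boolean {
    return this.online.get(uuid) ?? false;
  }
}
```

The mystery discussed below is precisely why this simple state machine ends up in the wrong state for some clients.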
Unfortunately, it remains a mystery why the online/offline tracking is not working as expected. There could be a bug within the Socket.IO library on the server or client (e.g. a missed “connection”/“disconnect” event). Or perhaps there are race conditions which lead to updating the wrong state on the cloud side (**). The Socket.IO library versions are oldish; perhaps an update would help? An update is not trivial: backwards compatibility with old clients needs to be considered, and there is no proper means to test this out safely.
This is quite hard to solve since:
this is a volunteer project with limited hours put into it; it is a “charity”, not a business, and the maintainers presumably all have day jobs
debugging on the cloud side depends on those volunteer hours, so there are limited opportunities to debug “end-to-end”, e.g. following one specific client and seeing how it looks on the cloud side
there are performance topics to consider; I understand the free myopenhab.org service is actually quite heavily used
**) I spent some time staring at the cloud-side code with help from digitaldan, and we did find one race condition. It was fixed some time ago, but clearly it was not the (main/common) root cause of what people are experiencing.
It is already done in practice in the form of “pong” messages; this is part of the Socket.IO protocol.
On the client side, one can build custom logic based on those messages. For example, OH nowadays logs the time between ping and pong. I presume the same is possible on the server side.
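The latency logging mentioned above can be pictured with this illustrative sketch: remember when each ping went out, and compute the round trip when the matching pong returns. (This is not the actual openHAB client code, just the general idea.)

```typescript
// Illustrative sketch of ping/pong latency measurement.
class PingLatency {
  private sentAt = new Map<number, number>();

  // Remember when a ping with this id was sent.
  onPingSent(id: number, now: number): void {
    this.sentAt.set(id, now);
  }

  // Returns the round-trip time in ms, or undefined for an unmatched pong.
  onPongReceived(id: number, now: number): number | undefined {
    const sent = this.sentAt.get(id);
    if (sent === undefined) return undefined;
    this.sentAt.delete(id);
    return now - sent;
  }
}
```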
I guess the thing is that making a database call (that is where the online/offline status is stored) all the time might not…fly in practice. It comes back to the topic of limited pre-production testing.
Then again, this gave me an optimization idea: could we check only new connections on the cloud side and verify that their online status is correct?
Such a check should probably be spread out over time (e.g. 30 s after connect, but at the latest 2 min after connect), so that things do not crash and burn when a large swarm of clients connects at once (e.g. when the server is restarted).
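The spreading-out idea above boils down to picking a random delay per connection. A minimal sketch, with illustrative names and the 30 s / 2 min bounds from the post:

```typescript
// Pick a random delay in [minDelayMs, maxDelayMs) for a new connection's
// status verification, so a reconnect storm after a server restart does
// not hit the database all at once. `rand` is injectable for testing.
function verificationDelay(
  minDelayMs: number,
  maxDelayMs: number,
  rand: () => number = Math.random,
): number {
  return minDelayMs + Math.floor(rand() * (maxDelayMs - minDelayMs));
}
```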
And of course, what I described above is only my hunch…who knows if something else is actually broken.
Hi all, I may have a solution to our “split brain” issue. There are a number of changes in the works right now, but a big one is a push I’m doing tomorrow that will hopefully ensure our cloud code only allows one connection attempt from an authorized UUID at a time. The issue, I think, is that the underlying socket.io retry logic tries a couple of times to reconnect in the background. While those attempts happen serially on the client, the cloud service can actually process them in parallel (due to load balancing, DB calls, redis calls, etc.). That means a connection the client gave up on may finish after a good connection is made, which then overwrites the DB with the wrong server address, and we no longer know how to route proxy connections to it.
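One way to picture that race and its fix (purely illustrative; this is not the actual openhab-cloud code): tag each connection attempt with an increasing sequence number and only let the newest attempt write the routing entry, so a slow, abandoned handshake finishing late cannot overwrite the server address of a good connection.

```typescript
// Illustrative guard against the stale-write race described above.
class RoutingTable {
  private latestSeq = new Map<string, number>();
  private serverFor = new Map<string, string>();

  // Returns true if the write was accepted, false if it was stale.
  register(uuid: string, seq: number, serverAddr: string): boolean {
    const current = this.latestSeq.get(uuid) ?? -1;
    if (seq < current) return false; // a late write from an abandoned attempt
    this.latestSeq.set(uuid, seq);
    this.serverFor.set(uuid, serverAddr);
    return true;
  }

  // Where proxy connections for this client should be routed.
  route(uuid: string): string | undefined {
    return this.serverFor.get(uuid);
  }
}
```

Without such a guard, the slow attempt’s write lands last and wins, which matches the symptom described: the DB ends up pointing at the wrong server.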
We also have some changes to the cloud addon to connect more gracefully, which we will try to get into the next 3.4.x release as well as 4.0.
Thanks for your workaround; it saved me a lot of nerves during continuous disconnections and missing notifications through OH Cloud. In the end, it also pointed to the reason causing the issue:
My connection breaks down with the following log:
“Error connecting to the openHAB Cloud instance: already connected”
→ Checking the times and events in my network, I figured out that it always happens during the forced disconnection of my internet connection by the provider. Does anyone know if there is a way to solve this without restarting the cloud service?
The reason I ask is that when the service is restarted, all rules are reloaded, so some runtime data gets lost. Is there any way to restart the service without reloading all the rules?
Openhabian 2.5.12 on RPI 3B+
Thanks again for your nice workaround.
There are a lot of conversations about upgrading right now, as some users realize that they’re on 2.5 and see that OH4 is targeting a release later this year. If you have time, I’d suggest moving to 3.4.2, which is a significant leap forward and will prepare you for OH4. Otherwise, you’ll be dealing with breaking changes from two major releases at once.