I think this comment is still for the old version? I got confused when reading the code.
I think now with this script it should not try to restart unless it is really necessary… I like that the logreader trigger is “connected”, not “disconnected”: the addon itself tries to reconnect with exponential backoff, so this is now more of a fail-safe for when we end up with a “broken” connection.
//Wait for 10 seconds, then post a command to reset myopenHAB_Connection through the cloud
//It shouldn't be necessary to wait longer than 10 seconds, since we're just sending a command
Cloud_Test_Timer = createTimer(now.plusSeconds(10),
Could you avoid the wait by having another rule, triggered by “myopenHAB_Connection” state update to “Testing”?
I tried to copy your setup for restoring the connection automatically, but I have one issue.
When I click on “Create Thing” at point 4 (points 1-3 are OK), something happens in the background, but I cannot find the created Thing from the Exec binding in my “Things” list.
But the Thing has definitely been created, which I can see in the log.
My question now is: how can I create a switch which refers to the running channel? I am stuck between points 4.3 and 4.4.
My further approach would be to create a rule which turns on a switch on my UI (the one I have used in the past to restart the cloud connector manually) at the moment the switch at point 4 in your description turns ON.
The Thing was created but not under the name I expected.
All Exec Things are initially named “Base” in the Things list. I had named the Exec Thing TestCloud and searched under T for it.
So the problem is solved.
That’s for the 300-second waiting period before checking the status of myopenHAB_Connection. It’s just there to give time for multiple disconnect/reconnects. It’s not absolutely necessary, but I wanted to give some time for everything to settle before potentially restarting the cloud connector.
Actually, I think the 10-second wait period just isn’t necessary at all, since we’re now triggering the rule on “Connected to…” If it’s connected, the command can be sent immediately. I’ll remove that.
Thanks! I try to make my tutorials easy for beginners, and that mostly means taking time to explain why we’re doing something (without getting too technical). When someone is very familiar with a task, it becomes easy to skip steps that would not be obvious to others. In this case, I don’t spend much time on creating items since OH users should know how to do that. The challenging parts are really the unfamiliar exec and logreader things.
Yeah, exactly. It’s a simpler rule than the first version, but the curl command is daunting if you haven’t done it before. I relied on this post to figure it out.
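For anyone who finds the curl command daunting, here is a rough Python equivalent of the request it makes. This is only a sketch under a few assumptions: that the command goes to the standard openHAB REST endpoint for an item (a plain-text POST to /rest/items/<item>) reached through myopenhab.org with basic auth, that the command value is “Online” (my guess, based on the states mentioned in this thread), and that the credentials shown are placeholders. The helper name `build_cloud_command` is made up for illustration.

```python
import base64
import urllib.request


def build_cloud_command(item, command, user, password,
                        base_url="https://myopenhab.org"):
    """Build (but do not send) a POST that sends `command` to `item`."""
    # Basic auth header built from the (placeholder) myopenHAB credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/rest/items/{item}",
        data=command.encode("utf-8"),  # the command is the plain-text body
        method="POST",
        headers={
            "Content-Type": "text/plain",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical values: replace the credentials with your own account.
request = build_cloud_command(
    "myopenHAB_Connection", "Online", "user@example.com", "placeholder-password"
)

# To actually send it, uncomment the next line:
# urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to check the URL and body before anything goes over the network.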
That’s a good question. Since the general state of myopenHAB_Connection is Online, I believe that it will be persisted as such on a system shutdown/restart. I don’t think the rule would run when the system is shutting down, so it won’t be set to Testing. On a restart, I think the cloud connection will be reestablished before the rule runs.
I’m not using myopenHAB_Connection in a UI, so I haven’t tested this. Even if it is tested on restart, the rule will still run properly afterward.
Thanks for addressing what has been a mildly vexing issue and for providing a guide that a user who is not a developer can follow and understand what is going on. Often when someone provides help, it is in the form of “do xyz”. Then I spend an hour or two searching the docs and the forum to figure out how to do xyz.
Once I get a little runtime, I will report back on how this is working.
Yeah, and I do that too. When we’re explaining something in response to a question/problem, it’s very easy to gloss over steps that seem like common sense, but are really learned through repetition. “Writing a tutorial” is a different mindset from “responding to a question”.
I’ve added a version of the rule that has a counter so that users can get a sense of how frequent the restarts are (as opposed to successful reconnections). I stopped short of actually calculating the success rate of reconnections.
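If someone does want the success rate, the arithmetic is simple. Here is a minimal Python sketch of the idea, not the rule itself: it assumes you count every check the rule performs and every restart it triggers, and the class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReconnectStats:
    checks: int = 0    # how many times the connection was tested
    restarts: int = 0  # how many of those tests ended in a restart

    def record(self, restart_needed: bool) -> None:
        """Count one test and, if it failed, one restart."""
        self.checks += 1
        if restart_needed:
            self.restarts += 1

    def success_rate(self):
        """Fraction of tests that did not need a restart (None if no data)."""
        if self.checks == 0:
            return None
        return (self.checks - self.restarts) / self.checks


stats = ReconnectStats()
for restart_needed in (False, False, True, False):
    stats.record(restart_needed)

print(stats.restarts, stats.success_rate())
```

In openHAB this would amount to two Number items (one per counter) and a division, but the counting logic is the same.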
Yes, we need to eventually solve this ongoing issue, which is fairly recent (relative to my four years using openHAB). Hence:
No, it doesn’t make sense to me for myopenHAB to have a recovery mechanism.
Individual OH servers go offline all of the time for various reasons (reboots, power outages, Internet outages, upgrades, etc.). We wouldn’t want myopenHAB to keep trying to reconnect to servers that are actually offline, and there’s no way for it to know if that’s the case. I’m actually not sure if that would even be possible (but I’m not a developer). I suspect that it’s not.
Actually, there may not be enough evidence to draw a conclusion either way. What we know is:
myopenHAB does not think the OH server is connected
the OH server thinks it’s connected to myopenHAB
the OH server can still send notifications through myopenHAB (only verified by some users)
If the server weren’t actually connected, that third point wouldn’t be possible. That’s why some of us think that the problem is with myopenHAB.
If it’s simple, then I’d encourage you to take a look at the code for the cloud connector and try adding it. I’m not a developer, so I don’t have the ability to contribute on that end. I would if I could.
Short of that, the solution I’ve posted above is essentially a handshake. I just chose to test if a command can be received through the REST API instead.
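To make the “handshake” idea concrete, here is a small Python sketch of the logic as I understand it from this thread: set the item to “Testing”, send a command through the cloud, and restart the connector only if the item is still “Testing” afterwards. The function names and the simulated cloud are hypothetical, the “Online” command value is an assumption, and any wait between sending and checking is left out.

```python
def cloud_handshake(set_state, send_via_cloud, get_state, restart_connector):
    """Return True if a command sent through the cloud reached the item."""
    set_state("Testing")          # mark the item locally
    send_via_cloud("Online")      # ask the cloud to set it back
    if get_state() == "Testing":  # the command never arrived
        restart_connector()
        return False
    return True


def run_check(cloud_delivers):
    """Simulate one check against a working or a broken cloud route."""
    item = {"state": "Online"}
    restarted = []

    def send_via_cloud(command):
        # A broken route silently drops the command.
        if cloud_delivers:
            item["state"] = command

    ok = cloud_handshake(
        set_state=lambda s: item.update(state=s),
        send_via_cloud=send_via_cloud,
        get_state=lambda: item["state"],
        restart_connector=lambda: restarted.append(True),
    )
    return ok, item["state"], len(restarted)


print(run_check(cloud_delivers=True))
print(run_check(cloud_delivers=False))
```

The point of the design is that the test exercises the full round trip through myopenHAB rather than trusting what either side reports about the connection.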
I think we have pretty well established that the client is re-establishing the connection successfully. There is a handshake in which the client talks to the server and the server talks to the client. There is a regular check that communication works, with one party sending a ping and the other responding with a pong. These basic health-check and handshake features all come from the Socket.IO protocol, which is based on WebSocket technology.
We have several reports that notifications go through even while the cloud shows that the instance is offline. Actually, to my knowledge, there are zero reports that notifications do not work in this weird state.
The thing is that openhab cloud tracks separately which clients (UUIDs) are online and which are offline. Whenever we get a new connection handshake, we update the status to online. Whenever there is a disconnect (whether it is due to a ping/pong failure or a “clean” disconnect does not matter), the status is updated to offline.
This online/offline status tracking is implemented within the openhab project (openhab-cloud repo), backed by a database.
Unfortunately, it remains a mystery why the online/offline tracking is not working as expected. There could be a bug within the Socket.IO library on the server or client (e.g. a missing “connection”/“disconnect” event). Or perhaps there are some race conditions which lead to updating to the wrong state on the cloud side (**). The Socket.IO library versions are oldish, so perhaps an update would help? An update is not trivial: backwards compatibility with old clients needs to be considered, and there is no proper means to test this out safely.
This is quite hard to solve since:
this is a volunteer project with limited hours put into it; it is a “charity”, not a business. The maintainers probably all have day jobs.
debugging on the cloud side depends on those volunteer hours, and there are limited opportunities to debug “end-to-end”, e.g. debugging one specific client and trying to see how it looks on the cloud side
there are performance topics to be considered; as I understand it, the free myopenhab.org service is actually quite heavily used
**) I spent some time staring at the cloud-side code with help from digitaldan, and we did find one race condition. This was fixed some time ago, but clearly it was not the (main/common) root cause of what people are experiencing.
It is already done in practice in the form of “pong” messages. This is part of the socket.io protocol.
On the client side, one can build custom logic based on those messages. For example, OH nowadays logs the time between ping and pong. I presume the same is possible on the server side.
I guess the thing is that making a database call (that is where the online/offline status is stored) all the time might not…fly in practice. Comes back to the topic of limited pre-production
Then again, this gave me an optimization idea: could we check only new connections on cloud side, and verify the online status is correct?
Such a check should probably be spread out over time (for example, no sooner than 30s after connect but at the latest 2min after connect), so that things do not crash and burn when large swarms of clients connect at once (e.g. when the server has been restarted).
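As a rough illustration of the spreading-out idea (not anything that exists in the cloud code today), each new connection could be given a random delay within that window before its status is verified. The function name and the seeded generator are just for the example; the 30s and 2min bounds are the ones suggested above.

```python
import random


def verification_delay(rng, earliest=30.0, latest=120.0):
    """Pick a per-connection delay, in seconds, for the status check."""
    return rng.uniform(earliest, latest)


# Simulate a swarm of 1000 clients reconnecting after a server restart.
rng = random.Random(42)  # seeded only so the example is repeatable
delays = [verification_delay(rng) for _ in range(1000)]

print(min(delays), max(delays))
```

With a uniform spread over 90 seconds, 1000 simultaneous connects turn into roughly 11 database checks per second instead of 1000 at once.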
And of course what I described in the above is my hunch only…who knows if there is something else that is actually broken.
Hi all, I may have a solution to our “split brain” issue. There are a number of changes in the works right now, but a big one will be a push I’m doing tomorrow that will hopefully ensure our cloud code only allows one connection from an authorized UUID to try to connect at a time. The issue I think we have now is that the underlying socket.io retry logic tries a couple of times to reconnect in the background. While those attempts happen serially on the client, the cloud service can actually be processing them in parallel due to load balancing, DB calls, redis calls, etc. That means a connection that the client gave up on may finish after a good connection is made, which then overwrites the DB with the wrong server address, and we no longer know how to route proxy connections to it.
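To illustrate the ordering problem described above, here is a toy Python simulation. It is not the openhab-cloud code: the dictionaries stand in for the database, the node names and UUID are invented, and the “exclusive” variant is only one simple way to model “one connection per UUID”.

```python
def register_naive(db, uuid, server):
    """Last writer wins: whichever attempt finishes last owns the route."""
    db[uuid] = server


def register_exclusive(db, uuid, server):
    """Accept only one registration per UUID; reject later attempts."""
    if uuid in db:
        return False
    db[uuid] = server
    return True


# Order in which attempts finish on the cloud side: the good connection
# completes first, then a retry the client already gave up on completes.
completions = ["node-a (good)", "node-b (abandoned)"]

naive_db = {}
for server in completions:
    register_naive(naive_db, "uuid-1", server)

exclusive_db = {}
results = [register_exclusive(exclusive_db, "uuid-1", s) for s in completions]

print(naive_db["uuid-1"])      # route now points at the abandoned attempt
print(exclusive_db["uuid-1"])  # route still points at the good connection
```

A real implementation would also have to release the registration on disconnect and cope with the abandoned attempt finishing first, which this sketch does not attempt.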
We also have some changes to the cloud addon to connect more gracefully that we will try and get into the next 3.4.x release as well as 4.0