Openhab cloud connection stops working, does not try to reconnect

seanch · April 12, 2022, 5:42pm

Today I “sort of” reproduced it, but not really. (Tuesday, April 12, 2022 at ~14:05 UTC) I had been away from the house for 5 days, and the first thing I tried after getting back was a Google Home voice command. After a long pause, I got the error message “…openhab not available…”. I don’t remember the exact phrase, but I think it was the one I usually hear when I have this problem. I waited 30 seconds and tried again but it was the same result.

Then I logged in with SSH to check the openhab.log file, but there was no errors in the ping/pong messages. It was still going, and nothing seemed out of the ordinary.

Then I logged on to myopenhab.org, to see if the status was offline or online, but the status was online, so that was ok as well.

Then I tried a Google Home voice command yet again, and that time it worked. I tried another 5 voice commands and all were successful.

So I don’t really know what caused those two first errors today, but somehow it fixed itself without having to go into the OpenHAB UI and clicking ‘Save’ (without any changes) in the cloud connector settings like I usually have to do.

I don’t even know if this is related, but I thought I should mention it here anyway, just to get all the data points in one place.

I’ll keep testing.

seanch · May 4, 2022, 7:57pm

Finally today I was able to reproduce it just like it has happened several times before (this time with full logging enabled and the experimental debug version of the cloud connector).

Google Home voice commands stopped working. I retried about 10 times and all failed.
I checked openhab.log and ping/pong messages were totally fine. I check back several hours, and everything seemed normal.
I logged in to myopenhab.org and the status there was offline. (“Your openHAB is not online. Please check if your installation is running or recheck the openHAB settings in your account.”)
I tried some more voice commands, and tried reloading myopenhab (ctrl-f5), but nothing helped.
I rechecked the log and the only messages there were still ping/pong messages, and they were all normal.
Finally I logged into the local openhab web console and clicked “Save” in the Openhab Cloud settings page. That fixed everything, like it usually does.

So what I have learned today is that definitely ping/pong messages can be fine, but still myopenhab thinks the openhab instance is offline.

The only thing that was able to fix it was clicking Save in the settings page as before.

ssalonen · May 5, 2022, 6:51pm

This makes me think if the issue could be server side… Cc @digitaldan to review above comment

rpwong · May 6, 2022, 12:17am

Interestingly, I’m now seeing the same thing. My server disconnected/reconnected three times over a ten-minute period earlier today, but myopenhab.org showed me as offline. This is the first time for me (as far as I can remember).

As usual, saving the openHAB Cloud config to restart the cloud connector got me back online immediately.

Mark_VG · May 6, 2022, 7:21am

Just realized my cloud connection had also gone down sometime a few hours ago and not reconnected.
Did the save and now back connected and running.

ssalonen · May 6, 2022, 4:39pm

Did anyone else have debug logging enabled?

rpwong · May 6, 2022, 5:20pm

Not me, since I haven’t been having any issues until yesterday (and not again since restarting the connector). If it happens again, I’ll install the troubleshooting binding.

DarkoG · May 6, 2022, 9:49pm

It happened to me also, I have trace logging enabled. I also see the OH crashed completely because of memory leak I guess.

@ssalonen What exactly are you looking for in the logs?

2022-05-06 21:01:05.430 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = xxx-xc1-xxx-xxx-xxx, base URL = http://localhost:18080)
2022-05-06 21:01:12.417 [INFO ] [del.script.irrigation:Main_algorithm] -  not started. checkUserStartTime[true] itmIrrigationAuto.stat[ON] isWatering[false] WaterLeakDetectedRainSensor.state[OFF]
2022-05-06 21:01:13.676 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connecting (attempt 1)
2022-05-06 21:01:13.679 [TRACE] [io.openhabcloud.internal.CloudClient] - Manager.EVENT_TRANSPORT
2022-05-06 21:01:13.686 [TRACE] [io.openhabcloud.internal.CloudClient] - Transport.EVENT_REQUEST_HEADERS
2022-05-06 21:01:19.529 [WARN ] [okhttp3.OkHttpClient                ] - A connection to https://myopenhab.org/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);
2022-05-06 21:01:19.530 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connected successfully (attempt 1)
2022-05-06 21:01:19.531 [TRACE] [io.openhabcloud.internal.CloudClient] - Manager.EVENT_TRANSPORT
2022-05-06 21:01:19.533 [TRACE] [io.openhabcloud.internal.CloudClient] - Transport.EVENT_REQUEST_HEADERS
2022-05-06 21:01:23.694 [TRACE] [io.openhabcloud.internal.CloudClient] - Transport.EVENT_REQUEST_HEADERS
2022-05-06 21:01:26.301 [INFO ] [nternal.handler.SpotifyBridgeHandler] - Spotify returned an error during polling: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2022-05-06 21:01:29.050 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO connected
2022-05-06 21:01:29.052 [INFO ] [io.openhabcloud.internal.CloudClient] - Connected to the openHAB Cloud service (UUID = xxx-xxx-xxx-xxx-xxx, base URL = http://localhost:18080)
2022-05-06 21:01:29.054 [TRACE] [io.openhabcloud.internal.CloudClient] - Transport.EVENT_REQUEST_HEADERS
2022-05-06 21:01:43.184 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.io.monitor.internal.EventLogger@129fa48' takes more than 5000ms.
2022-05-06 21:01:51.184 [WARN ] [p.internal.http.HttpResponseListener] - Requesting 'http://192.168.4.221/real_time_data.xml' (method='GET', content='null') failed: Total timeout 3000 ms elapsed
2022-05-06 21:01:58.266 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.io.monitor.internal.EventLogger@129fa48' takes more than 5000ms.
2022-05-06 21:01:53.970 [ERROR] [io.openhabcloud.internal.CloudClient] - Error during communication: {}
io.socket.engineio.client.EngineIOException: xhr poll error
	at io.socket.engineio.client.Transport.onError(Transport.java:63) [bundleFile:?]
	at io.socket.engineio.client.transports.PollingXHR.access$100(PollingXHR.java:26) [bundleFile:?]
	at io.socket.engineio.client.transports.PollingXHR$6$1.run(PollingXHR.java:140) [bundleFile:?]
	at io.socket.thread.EventThread$2.run(EventThread.java:80) [bundleFile:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:?]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:115) ~[?:?]
	at java.net.SocketInputStream.read(SocketInputStream.java:168) ~[?:?]
	at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
	at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1449) ~[?:?]
	at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1060) ~[?:?]
	at okio.Okio$2.read(Okio.java:139) ~[?:?]
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[?:?]
	at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345) ~[?:?]
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217) ~[?:?]
	at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211) ~[?:?]
	at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) ~[?:?]
	at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[?:?]
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[?:?]
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[?:?]
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[?:?]
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[?:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[?:?]
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) ~[?:?]
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135) ~[?:?]
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) ~[?:?]
	... 3 more
2022-05-06 21:02:08.139 [TRACE] [io.openhabcloud.internal.CloudClient] - Transport.EVENT_REQUEST_HEADERS
2022-05-06 21:02:08.141 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO disconnected: transport error

.....

2022-05-06 21:52:24.408 [ERROR] [org.apache.felix.fileinstall        ] - In main loop, we have serious trouble
java.lang.OutOfMemoryError: Java heap space
.....

2022-05-06 23:07:53.034 [WARN ] [e.jetty.util.thread.QueuedThreadPool] - 
java.lang.OutOfMemoryError: Java heap space

seanch · May 6, 2022, 10:18pm

When I reproduced this issue on May 5, nothing other than ping/pong messages were present in the logs. I didn’t have any crashes, out-of-memory conditions, disconnects or anything weird at all that day, except for the Google Home commands stop working and myopenhab.org claiming my openhab was offline. I had full trace logging enabled and was using the troubleshooting binding and openhab version 3.3.0~S2784-1 on Ubuntu.

...
...
... hours and hours of ping/pong messages and nothing else since my last Google Home voice command, which was successful...
...
...
2022-05-05 11:42:08.721 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:42:08.771 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 50 ms
2022-05-05 11:42:33.773 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:42:33.823 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 49 ms
2022-05-05 11:42:58.825 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:42:58.875 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 50 ms
2022-05-05 11:43:23.878 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:43:23.927 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 50 ms
2022-05-05 11:43:48.929 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:43:48.979 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 50 ms
2022-05-05 11:44:13.981 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2022-05-05 11:44:14.030 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 49 ms

ssalonen · May 7, 2022, 4:11am

I am not sure what I am looking in the logs exactly but trying to find a common story.

It seems that we have now two confirmed cases with ping pong active but myopenhab.org saying that instance is online.

This new piece of information! To me it is saying that the client and server are connected and we can conclude that from (socket.io) client point of view everything would seem to be fine. Ping/pong watchdog is running still, and not stuck, for example.

Another piece of information is that both of you seemed to have issue at the same time, pointing to possible server side root cause.

We can also say that this is fairly uncommon scenario, I personally haven’t had any issues for months. At some point it was certainly more frequent.

Hmm. To understand the root cause, I am not sure we can do much without myopenhab.org maintainer in the loop, looking how things are connected in server side.

One more thing: when things are running normally, is there anything other than ping/pong in the logs?

EDIT: Turns out similar findings have been made already 2y ago by @wborn . Connection error recovery failed · Issue #134 · openhab/openhab-cloud · GitHub

rpwong · May 7, 2022, 4:53am

Interesting. I’m curious about this:

The openHAB Cloud Notification Actions also work while the Cloud thinks the instance is offline. This is another indicator that the connection is working properly (see community topic).

If anyone sees their server go offline again, try triggering a notification from openHAB to see if it reaches your phone.

seanch · May 7, 2022, 9:36am

Normal operation (for me at least) has just ping/pong messages in openhab.log, so nothing was out of the ordinary with the logs at the time of the recent Google Home issues.

I agree that whenever the problem occurs, a few people report here (and in other threads) having the issue at the same time.

It was more frequent in Jan/Feb 2022 that is correct. My first case of this I saw in Dec 2021. Even if it was more frequent before, it is happening way too often. For me personally an acceptable failure rate of home automation is perhaps once per year, and I’m now at 10-20 per year with this Google Home ↔ Openhab connection.

This type of issue might be very hard to diagnose, so one suggestion could be to add a secondary ping/pong mechanism on a higher network layer that run quite infrequently, perhaps once every 4 hours. This could help self-heal the broken connections when they occur and also gather diagnostics data on the server on how often and in which circumstances this type of issue occurs.

seanch · May 7, 2022, 10:03am

I have set this up now on one of my buttons in the home so I will try it the next time this problem happens.

I find the notifications menus in myopenhab quite weird. When I went there today all the notifications were from 4 years ago (mostly “openHAB is online”). The pager mechanism showed me there were 6 pages of notifications to list, but pressing pages 4 to 6 just showed empty lists.

Pressing the “Today”-button didn’t do anything, even though I had a single test notification I published earlier today. Nothing happened when I pressed the “Last week” button either.

When I pressed the “All” button, the page changed from “Notification” to “Events” (7 events were displayed: online/offline events from 2 days ago to 1 month ago).

I don’t know if that behavior is correct.

ssalonen · May 7, 2022, 12:43pm

Yep, not sure if this has been investigated here in the community in detail. One idea that has been raised is to have HTTP GET to home.openhab.org and see if it times out. I am not sure if this is trivial to implement as you cannot (to my knowledge, at least) use uuid+secret to get through the HTTP Basic authentication.

Pedro_Liberal · May 7, 2022, 10:16pm

I don’t have new information to share but can confirm that I also am affected by this. Wife acceptance factor is starting to be affected by this.

Same behavior as described:
Google home stops being able to reach openHAB.
Check my openHAB.org and my instance appears Offline.
Check my local instance and immediately after I press the “save” button in the openHAB cloud integration options, my instance switches to online in the cloud openHAB.
Google home is once again able to reach out to openHAB.
I don’t have exact dates or hours but I’ll be keeping an eye on this thread now.

seanch · May 8, 2022, 12:46am

I don’t know the code, but I was thinking more of having an application level ping/pong message exchange using the same socket.io connection. My understanding was that the existing ping/pong watchdog mechanism was at a lower level of the socket.io library.

I was thinking of a ping/pong exchange that actually required some verifiable data to be returned by the myopenhab.org server, and if that data is not retrieved correctly, then the openhab connector assumes the connection is dropped and reconnects. This mechanism being more cpu/net expensive and possibly requiring more server resources could be run at a (much) lower frequency than the existing socket.io ping/pong.

This is just my thinking without knowing any of the code of either server/connector, so I will apologize in advance if what I just wrote is pure gibberish.

Pedro_Liberal · May 8, 2022, 5:45pm

I just noticed that I’m not able to reach openhab.org, and created this post:
Openhab.org not reachable?

Do you guys reckon this might be related??
If I use a vpn in germany I can reach to openhab.org. This is weird…

hmerk · May 8, 2022, 5:58pm

No, should not be related as the two are pointing to different servers at different hosters.

ssalonen · May 15, 2022, 9:40am

The troubleshooting versions (for openHAB 3.2.0 and for openHAB 3.3.0-SNAPSHOT) have been updated.

If this works for you without issues I am proceeding to merge it to main codebase. Good improvements for everyone.

Appreciate all the testing for this, please report if it runs without issues.

You can upgrade from previous troubleshooting version as follows:

Remove the addon using MainUI
Reinstall the addon

seanch · May 18, 2022, 1:00pm

Thanks for the update. I have uninstalled the old addon and reinstalled the new one today.

For some reason I was unable to uninstall the old troubleshooting connector. Every time I clicked uninstall on either the troubleshooting connector or the regular openhab cloud connector entry I got an error message in the UI, but nothing in the logs. Restarting openhab didn’t help, so I ended up doing the following:

Cleared the cache and reset permissions. After that nothing worked, UI was just all black every time I tried to restart openhab.
Downgraded to 3.2.0 (stable).
Upgraded to 3.3.0~S2902 (I was using 3.3.0~S2784 before)
Uninstalled the connector
Restarted openhab and verified no connector was installed
Installed the new 3.3.0 snapshot troubleshooting cloud connector (bundle org.openhab.io.openhabcloud:3.3.0.202205150929)

Everything seems to be working ok now, and I see the ping/pong messages and other trace logs from CloudClient and OkHttpClient.

Note that I am not present at this house very much in the summer months. I will check the install remotely periodically and every time I am at that location.