My openHAB service is taking a long time to stop.
I have OH v4.0.2 running on Ubuntu LTS, though I’ve had this issue on and off for a while.
openHAB frequently takes up to 2 minutes to stop. I initially thought it was MySQL, but I’ve updated my service unit to make that a prerequisite service. I normally see this issue when I’m rebooting.
How do I figure out why it’s taking so long to stop?
I remember a discussion on GitHub for OH3 where I explained that every binding using a serial connection needs a few seconds, around 5, to stop.
At that time, my OH setup required 34 seconds to stop, if I remember correctly.
I will check again with 4.0.2 this evening.
The problem in that case was a thread issue with jupnp. During shutdown a race condition occurred, we ran out of threads, and the unsubscribe replies were lost. That was corrected in jupnp 2.6.0, I believe, by adding an additional thread pool. If you want to determine whether this is still an issue, disable the pooling by setting all three pools to -1 in the config.
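For reference, a sketch of what that could look like in $OPENHAB_CONF/services/runtime.cfg. The three property names below are my recollection of the jupnp pool settings, not verified against the current jupnp/openHAB docs, so double-check them before use:

```
org.jupnp:threadPoolSize=-1
org.jupnp:asyncThreadPoolSize=-1
org.jupnp:remoteThreadPoolSize=-1
```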
My intent was not to point to an outdated/resolved issue, but I know there were some interesting comments in that discussion, especially on serial connections.
I just checked:
$ date ; sudo systemctl stop openhab.service ; date
Thursday 7 September 2023, 08:38:38 (UTC+0200)
Thursday 7 September 2023, 08:39:26 (UTC+0200)
So it now takes 48 seconds in my case to stop OH 4.0.2 on a RPi 3, which is certainly more than before with a relatively similar OH setup (maybe 1 or 2 added bindings), as I had 34 seconds in mind with the first OH 3 version.
Now, if it takes 2 minutes in your case, that is probably not normal; you may have a problem somewhere, maybe with one binding?
I’ve been seeing the same problem forever but never had enough time to investigate.
I believe there must be some fundamental design flaw involved.
Even if there were some sort of wait-for timeout involved, those waits must not be serialized in the first place so that the times add up, must they?
Second, why wait for threads (beyond those doing core I/O to logs and persistence) to “flush” at all?
Ideally, people should be able to kill -9 the Java process or instantly switch off the power to their box.
Don’t get me wrong, I for sure do NOT advocate this, yet it seems to work, which indicates that longish shutdown times are a bug, not a housekeeping or other requirement.
In that sense, it does not make a difference whether it’s 2 minutes or half that.
(FTR I don’t have any database running other than standard rrd4j).
The issue is probably due to the dispose() method in the thing handlers. The OH developer documentation says that dispose() must return fast. However, many binding developers do not respect that rule, and the dispose() method blocks instead. This causes a chain reaction, since dispose() on subsequent things is delayed until the prior call returns. A proper implementation of dispose() should hand any task that may take a long time off to scheduler.submit().
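A minimal, self-contained sketch of the difference, using plain JDK executors to stand in for the handler’s shared scheduler. The class, method names, and the 200 ms delay are invented for illustration; the point is only that blocking teardowns add up linearly, while submitted ones return immediately:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DisposeDemo {

    // Stand-in for a blocking cleanup, e.g. closing a serial port (~200 ms).
    static void blockingClose() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Anti-pattern: each teardown blocks its caller, so total shutdown
    // time grows linearly with the number of handlers. Returns elapsed ms.
    static long serialDispose(int handlers) {
        long t0 = System.nanoTime();
        for (int i = 0; i < handlers; i++) {
            blockingClose();
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    // Recommended pattern: the teardown only submits the slow work to a
    // pool and returns immediately, so handler teardowns do not add up.
    static long offloadedDispose(int handlers, ExecutorService pool) {
        long t0 = System.nanoTime();
        for (int i = 0; i < handlers; i++) {
            pool.submit(DisposeDemo::blockingClose);
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        System.out.println("serial ms:    " + serialDispose(5));
        System.out.println("offloaded ms: " + offloadedDispose(5, pool));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With 5 simulated handlers, the serial loop takes roughly a full second while the offloaded loop returns almost instantly, which is exactly the shape of the shutdown delay described above.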
Odd question: when you run “shell:threads | grep -i httpclient | grep -i runnable | grep -i selector | wc -l” at the Karaf console, what do you get? I noticed that the neeo binding is creating a whole bunch of HttpClients and they aren’t cleaning up like they should. My system crashes when that number gets to around 4000. I opened an issue here…
…I’ve been trying to figure out why they aren’t closing, but I’ve had no luck so far.
EDIT: More specifically, I’m pretty sure it’s because HttpClient is not AutoCloseable (from what I can find on Google). My question is why this only recently started happening. I never had issues with the neeo binding crashing OH before. This seems new. I’ve been wondering if it was a more recent Karaf upgrade, but I can’t pinpoint it.
Generally, good practice for a binding is to use the common OH HTTP client or, if that is not possible, to create a single dedicated HTTP client.
If this binding is creating several thousand HTTP clients, there is a serious design issue in this binding.
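To sketch the reuse pattern being recommended: hold one client for the binding’s whole lifetime instead of constructing a new one per request, since each client owns worker and selector threads that linger until cleaned up, which is exactly the buildup the “shell:threads … wc -l” one-liner counts. This example uses the JDK’s java.net.http.HttpClient purely to stay self-contained; in a real binding you would instead take the shared Jetty client from openHAB’s HttpClientFactory:

```java
import java.net.http.HttpClient;
import java.time.Duration;

public class SharedClientDemo {

    // One client instance for the whole lifetime of the component.
    // Creating a fresh client per request leaks its selector threads.
    private static final HttpClient SHARED = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    // All request sites go through this accessor and reuse SHARED.
    public static HttpClient client() {
        return SHARED;
    }

    public static void main(String[] args) {
        // Every caller gets the same instance back.
        System.out.println(client() == client());
    }
}
```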