[SOLVED] httpClient keeps failing Get-Request after onQueued with Timeout

Hi to all binding devs!

I am currently stuck with a strange situation when I am using the jetty httpclient in my Nanoleaf binding:

  • I am calling a URL with a simple GET request method

Request request = httpClient.newRequest("myurl").method(HttpMethod.GET).timeout(10, TimeUnit.SECONDS);
request.send();

and it just times out. To prove that the problem is related to the httpClient:

  • I curled it on the server and it replies in milliseconds
  • I implemented the same request with the standard Java classes (URL, URLConnection.openConnection) and it works blazingly fast too, without any timeout…
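For reference, the plain-JDK approach mentioned above can be sketched as below. This is a hedged reconstruction, not the binding's actual code; the tiny local `HttpServer` only stands in for the Nanoleaf controller so the example runs without real hardware.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PlainJavaGet {

    // Plain-JDK GET with explicit connect/read timeouts: the approach
    // that worked where the Jetty client request timed out.
    static String get(String url, int timeoutMillis) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in for the controller's REST endpoint.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/api/v1/test/", exchange -> {
            byte[] body = "{\"name\":\"demo\"}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            String reply = get("http://127.0.0.1:" + server.getAddress().getPort()
                    + "/api/v1/test/", 10_000);
            System.out.println(reply);
        } finally {
            server.stop(0);
        }
    }
}
```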

I am logging everything that I pass to the httpClient request, and I am 100% sure that it is identical to what I pass to the standard Java classes.

I then went further to track down the issue by adding a RequestListenerAdapter to the Request of the httpClient to log each of the methods being called internally, and it shows that only "onQueued" and "onFailure" are being called. I also checked httpClient.isRunning() (it returns true).

After more than a day of trying to find the cause I am running out of ideas, as I am not sure whether there is a way to tell the library within OH to log detailed information, nor is remote debugging possible in OH, afaik.

If I don’t find the cause I see no other choice than to throw out the httpClient and use the good old Java library approach, though I am not sure whether that has a major downside within openHAB, and I would rather find the root cause of the issue.

I would be very happy if anyone has an idea what the problem might be or how I can investigate the issue further.

Thanks in advance
Stefan

Not sure if it helps but this is the stacktrace I am getting:

at org.eclipse.jetty.client.HttpDestination$TimeoutTask.onTimeoutExpired(HttpDestination.java:529)
	at org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:282)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

I should also mention that even directly after retrieving the httpClient in my derived BaseThingHandlerFactory the httpClient doesn’t work!

What is even more strange:

If I replace the following standard approach of retrieving the openHAB-provided http client with one I create myself, everything works well!

Before

public NanoleafHandlerFactory(@Reference final HttpClientFactory httpClientFactory) {
    this.httpClient = httpClientFactory.getCommonHttpClient();
}

After: creating my own httpClient (working)

public NanoleafHandlerFactory(@Reference final HttpClientFactory httpClientFactory) {
    this.httpClient = new HttpClient();
    try {
        httpClient.start();
    } catch (Exception e) {
        // just dirty logging for testing purposes
        logger.debug("{}", ExceptionUtils.getFullStackTrace(e));
    }
}

So my guess is that there is some kind of bug within OH3?

The common http client is shared among other bindings. So it could be that something is blocking the client when you want to use it. Or the client is stopped, which should not be done. Do you have any other bindings running?

I have lots of bindings running. It is happening right after I start openHAB. However, one binding could be suspicious, which is the unifi binding, which is not working. Either this binding happens to have the same problem or it might be the cause of it. As it is not working anyway, I’ll give that a try.

However, what worries me is that one binding could block the resource of another binding.

Other than that the bindings I am using are: astro, chromecast, exec, gpstracker, http, hue, jeelink, lgwebos, mqtt, network, onkyo, serial, shelly, squeezebox, unifi, yamaha receiver, zwave.

Update: I reinstalled the Unifi binding and it is working now but the httpClient is still not working (even when the unifi binding was uninstalled).
To completely exclude the effect of any other binding I would need to uninstall all of them, but I am currently on my production system where I don’t want to do that. I may revive one of my old RPi 3s and install a fresh OH3 on it to have a clean dev system, which is a good idea anyway.

Still, I am open to any further ideas on why the client is not working, or on how I could check whether it is blocked (the fact that the request is onQueued and then failing at least points in that direction).

thanks
Stefan

That’s a situation we cannot do anything about - after all, Java runs in a single process and hence all resources are shared across all bundles. Constraining the resources per bundle doesn’t make sense either, as this would only hide a potential resource leak.

So yes, as @hilbrand says, the shared httpClient uses a threadpool for the requests. If the pool is exhausted, other requests will have to wait until previous requests finish. The shared client must not be used for any long-polling or otherwise blocking requests. So you should indeed check which other requests are already running in your system and fix those that tend to block the threads.
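To illustrate the exhaustion mechanism described above (this is a generic stdlib sketch, not Jetty itself; a pool of 2 stands in for the shared client’s pool of 40):

```java
import java.util.concurrent.*;

public class PoolExhaustionDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch release = new CountDownLatch(1);

        // Two "long-polling" tasks occupy every thread in the pool.
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> {
                release.await(); // blocks until released
                return null;
            });
        }

        // A short, fast task now behaves like the failing GET: it is
        // queued but cannot start, so waiting one second for it times out.
        Future<String> fast = pool.submit(() -> "done");
        try {
            fast.get(1, TimeUnit.SECONDS);
            System.out.println("fast task ran");
        } catch (TimeoutException e) {
            System.out.println("fast task stuck in queue");
        }

        release.countDown(); // unblock the long-running tasks
        System.out.println(fast.get(5, TimeUnit.SECONDS));
        pool.shutdown();
    }
}
```

With the pool full, the program prints "fast task stuck in queue" first; once the blocking tasks are released, the queued task finally runs and prints "done" — the same onQueued-then-timeout pattern seen in the binding.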

The thread pool is configured to a maximum of 40 threads, which in general doesn’t sound like an issue to me. Nonetheless, you can test if it helps to increase it by adding

webclient:maxThreadsShared=100

to your services/runtime.cfg file.

I have now done 2 things

  1. I have set up a completely fresh OH3 test environment which only runs the nanoleaf binding, alongside my production environment which runs several bindings
  2. I have added a new line in the code which prints out the instance of the httpClient I am using and all “state” information I can get

httpclient= HttpClient@1a1fb02{STARTED} running=true blocking=false failed=false stopped=false, stopping= false

I will now monitor both, and I hope to see a change in those values when it finally fails.

The other thing I am thinking about is that I am using a long-polling request to get the touch information from the controller. Could it be that I am “blocking” myself by starting a long-polling request (which gets no answer from the controller because it isn’t touched), and that this blocks the same client instance for other parallel requests I am doing to get the current state of the panels or to control them?

Maybe, but I am actually not sure. Reading https://stackoverflow.com/a/44871699, it feels as if there shouldn’t be any issue even for long-polling requests, as they should not block threads from the pool. So I really wonder whether the threadpool is really exhausted; it sounds unlikely to me.
I briefly thought it might be due to maxConnectionsPerDestination maybe being too low (e.g. only allowing 1 or 2 connections to a specific server, so that a long-polling request would cause the other requests to be queued). But it seems that this value is set to 64 by default, so it does not seem to be an issue.
TL;DR: I don’t know what causes the issue, so I am looking forward to the results of your analysis. :slight_smile:

I am starting to get desperate since I have been debugging this for days already. I am finally down to the point that I need internal logging information from jetty. The following trace log shows that I have 4 controllers running.

  • ALL use the exact same http-client.
  • ALL request the same URL only with different IPs and of course different auth tokens
  • ALL show the exact same httpClient state (running etc.)
  • ALL 4 URLs work fine when I curl them or when I would call them with URLConnection

However, 3 fail while 1 works correctly and I see NO difference at all


2021-03-11 10:45:25.093 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - RequestBuilder: Sending Request 192.168.178.176:16021 /api/v1/vhBH1Uu4l07O7cssfqo9vVdzaSseAxkF/  op: /  method: GET
2021-03-11 10:45:25.093 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - RequestBuilder: Sending Request 192.168.178.175:16021 /api/v1/QPZACP5y9swuuuqcJwhYIeIlpOdr4n8U/  op: /  method: GET
2021-03-11 10:45:25.094 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - RequestBuilder: Sending Request 192.168.178.177:16021 /api/v1/zYQjBNzkOnH0PbTJaaapidZYJDLDu9hk/  op: /  method: GET
2021-03-11 10:46:17.610 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - RequestBuilder: Sending Request 192.168.178.159:16021 /api/v1/iOVj8veAF1sssowB9s2f9uTXYBr5N2d/  op: /  method: GET

2021-03-11 10:45:25.094 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - httpclient= HttpClient@16c05fb{STARTED} running=true blocking=false failed=false stopped=false, stopping= false
2021-03-11 10:45:25.096 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - httpclient= HttpClient@16c05fb{STARTED} running=true blocking=false failed=false stopped=false, stopping= false
2021-03-11 10:45:25.096 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - httpclient= HttpClient@16c05fb{STARTED} running=true blocking=false failed=false stopped=false, stopping= false
2021-03-11 10:46:17.612 [INFO ] [nding.nanoleaf.internal.OpenAPIUtils] - httpclient= HttpClient@16c05fb{STARTED} running=true blocking=false failed=false stopped=false, stopping= false

2021-03-11 10:45:25.096 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Sending Request http://192.168.178.176:16021/api/v1/vhBH1Uu4l07O7cssfqo9vVdzaSseAxkF/ no query parameters
2021-03-11 10:45:25.096 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Sending Request http://192.168.178.177:16021/api/v1/zYQjBNzkOnH0PbTJaaapidZYJDLDu9hk1/ no query parameters
2021-03-11 10:45:25.097 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Sending Request http://192.168.178.175:16021/api/v1/QPZACP5y9swuuuqcJwhYIeIlpOdr4n8U/ no query parameters
2021-03-11 10:46:17.613 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Sending Request http://192.168.178.159:16021/api/v1/iOVj8veAF1sssowB9s2f9uTXYBr5N2d/ no query parameters

2021-03-11 10:45:25.097 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Request method:GET uri:http://192.168.178.176:16021/api/v1/vhBH1Uu4l07O7cEofqo9vVdzaSseAxkF/ params{}
2021-03-11 10:45:25.098 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Request method:GET uri:http://192.168.178.175:16021/api/v1/QPZACP5y9swuuuqcJwhYIeIlpOdr4n8U/ params{}
2021-03-11 10:45:25.097 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Request method:GET uri:http://192.168.178.177:16021/api/v1/zYQjBNzkOnH0PbTJaaapidZYJDLDu9hk/ params{}
2021-03-11 10:46:17.615 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - Request method:GET uri:http://192.168.178.159:16021/api/v1/iOVj8veAF1sssowB9s2f9uTXYBr5N2d/ params{}


2021-03-11 10:45:35.103 [WARN ] [al.handler.NanoleafControllerHandler] - Status update failed for controller nanoleaf:controller:F0ED4F9351AF : Failed to send OpenAPI request
2021-03-11 10:45:35.111 [WARN ] [al.handler.NanoleafControllerHandler] - Status update failed for controller nanoleaf:controller:D81E7A7E424E : Failed to send OpenAPI request
2021-03-11 10:45:35.111 [WARN ] [al.handler.NanoleafControllerHandler] - Status update failed for controller nanoleaf:controller:FA2F28A53D2C : Failed to send OpenAPI request
2021-03-11 10:46:17.608 [DEBUG] [al.handler.NanoleafControllerHandler] - Update channels for controller nanoleaf:controller:5301E2D13CD1
2021-03-11 10:46:17.630 [TRACE] [nding.nanoleaf.internal.OpenAPIUtils] - API response from Nanoleaf controller: {"name":"5301E2D13CD1","serialNo":"S19422A2758","manufacture.....

It shows that only the last of the 4 gives a result.

So I added the following to the log4j2.xml

<Logger level="TRACE" name="org.eclipse.jetty"/>

but it doesn’t help; I still don’t see the jetty logs. Can someone help me set up this logging correctly, so I can view the logs of Jetty?
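One thing worth checking: the `<Logger>` element usually has to sit inside the `<Loggers>` section of `userdata/etc/log4j2.xml` and reference an appender, otherwise nothing is written. A sketch (the appender name `LOGFILE` is an assumption; use whatever your existing `<Logger>` entries reference):

```xml
<!-- Sketch: inside the <Loggers> section of userdata/etc/log4j2.xml.
     "LOGFILE" is a placeholder; match it to the appenders already
     defined in your installation. -->
<Logger level="TRACE" name="org.eclipse.jetty" additivity="false">
    <AppenderRef ref="LOGFILE"/>
</Logger>
```

Alternatively, `log:set TRACE org.eclipse.jetty` in the Karaf console achieves the same at runtime without editing the file.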

Just an addition: restarting OH again seemed to make it work. I see the logs now.

TIA
Stefan

Finally I was able to find a solution, even though to this day I do not know the real root cause of the behaviour.

What I finally did:

  • I more or less investigated all situations in which the httpClient wouldn’t work anymore
  • I boiled it down to an internal Timeout Exception
  • I investigated the httpClient by checking all flags like running, cancelled, etc., and all look fine
  • My code only uses THE one httpClient I am given by OH (I can tell by the Java instance id)
  • You can use httpClient.dump() to investigate the internals of the queued requests, and it provides more than a hundred lines of information. Though even when looking at that, everything seems fine from what I can tell (and believe me, I was pretty thorough in analysing it)

"Solution":

  • So finally I decided to catch the specific situation when the httpClient won’t serve the requests anymore. In that case I do a httpClient.stop() and then a httpClient.start() and, guess what, it works again.

To me it looks like a bug within the httpClient but at least this makes the binding much more stable now.
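The recovery pattern described above can be sketched roughly as follows. This is a hedged, self-contained illustration: the stop/start calls are passed in as callbacks so the snippet compiles without Jetty on the classpath; in the binding they would be `httpClient.stop()` and `httpClient.start()`, and `sendWithRecovery` is a hypothetical helper name.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class RestartOnTimeout {

    /** Functional stand-in for httpClient.stop()/start(), which may throw. */
    interface Restart {
        void run() throws Exception;
    }

    // Try the request once; on a TimeoutException, restart the client
    // and retry a single time, mirroring the workaround in the binding.
    static <T> T sendWithRecovery(Callable<T> request, Restart stop, Restart start)
            throws Exception {
        try {
            return request.call();
        } catch (TimeoutException e) {
            stop.run();   // httpClient.stop() in the real binding
            start.run();  // httpClient.start()
            return request.call();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated client: fails with a timeout until "restarted".
        boolean[] healthy = {false};
        String result = sendWithRecovery(
                () -> {
                    if (!healthy[0]) throw new TimeoutException("request stuck in queue");
                    return "200 OK";
                },
                () -> healthy[0] = false,
                () -> healthy[0] = true);
        System.out.println(result);
    }
}
```

Note that stopping a client that is shared with other bindings can abort their in-flight requests, so this pattern is safest on a client the binding owns itself.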


Since the client is a shared resource, what’s the effect of doing a stop/start on other bindings that are using the client, especially those that may have a transaction in flight?

Hi Mark,

Sure, it “might” affect other bindings as well if by chance they use the same httpClient instance, but I would assume that the other binding would then have the same issue as I have in my binding.
As I said, I still have no clue what is really happening inside the httpClient, but what I can say is that if a client instance starts to have a problem, I can at least fix the issue with that restarting approach. Believe me, I have spent hours if not days analysing what the actual root cause is and definitely wish I knew, but this is the only solution I have come up with, and it finally seems to be pretty stable in my prod environment, which runs quite a number of bindings in parallel.

cheers
Stefan

After reading what you’ve done to try to diagnose the issue, I definitely can feel your pain.

However, if doing a stop/start of the shared client breaks other bindings (I’m not saying it does or doesn’t, I just don’t know), I can’t see how this could be considered an acceptable solution. But I’ll defer to @Kai or @hilbrand, who know a lot more than me about the Jetty client.

After reading what you’ve done to try to diagnose the issue, I definitely can feel your pain.
Yep, sure, very much appreciated. I am more than happy if Kai and hilbrand have a good idea how to approach this.

Thanks, Mark :sweat_smile:

Btw, Kai was very much involved in an offline thread from the beginning, where I kept him up to date with what I did. I doubt his changes to the binding created the issue; more likely they just made it visible. My guess is it was only hidden before because I had already been experiencing some instability for a while, but I was never able to track it down or find the time to analyse it to that point.