Timeout on http call using HttpUtil

markus7017 · January 19, 2020, 1:27pm

I developing a binding (Shelly), which uses a HTTP REST API to retrieve device information for various purpose and control the device. Works fine in general. However, multiple users see http Timeouts in the log, which then cause a command to fail or the thing rapidly changing between ONLINE and OFFLINE state (timeout results in setting thing to OFFLINE state). One user even has this problem persistant, even can’t add discover Shelly things, because the call for initially retrieving device settings fails.

I use the OH build-in HttpUtil class

            httpResponse = HttpUtil.executeUrl(HttpMethod.GET, url, headers, null, "", SHELLY_API_TIMEOUT_MS);

The timeout is set to 6sec, even increasing it to 30sec doesn’t help.

I found this post:

seems to be the same symptom, but but this one is related to a https related problem, whereas I’m using generic http, so no certificates (should be) are involved. The device also doesn’t support https.

I my setup I see those timeouts very rarely (implemented a channel to count them), but other users are reporting relevant issues. I’m using OH 2.4 on a RPI, other users 2.5. The problem is reported for different device types and firmware versions. WiFi signal strength is verified and good. Manuel execution of the URLs in a browser or with curl don’t show these problems. Also the Shelly mobile and Web Apps don’t report errors.

The binding has 3 use cases for the http calls

Retrieve device information on discovery (as part of the createResult() call)
Executing channel commands (handleCommand() of the thing handler)
Status updates within a thread scheduled by scheduler.scheduleWithFixedDelay()

I though already about a problem around synchronization or thread context, but was not able to proof this. However, sometimes I see in the debugger that a call to HttpUtil.executeUrl() takes longer that just a sec, what I wouldn’t expect on a local WLAN.

@martinvw Maybe you have an idea?

Next step to find a work around could be to replace HttpUtil by generic JDK calls (I had this at the beginning, but then changed to HttpUtil).

tmrobert8 · January 21, 2020, 3:31am

FYI - I just use JAX RS for http since I need alot more than what HttpUtil can provide (ie filters). With that you can set a connection/read timeout as well.

The issue with that is that the version of JAX that openHAB includes doesn’t have a standard way of setting those timeouts - so you have to assume that the JAX client is a jersey client and set configuration properties specific to it. If openHAB upgraded the version (to 2.1) - then we’d have a nice client free way of doing it…

Here’s the example from sony:

import org.glassfish.jersey.client.ClientConfig;
import org.glassfish.jersey.client.ClientProperties;

// NOTE: assumes jersey client (no JAX compliant way of doing this)
// NOTE2: jax 2.1 has a way but we don't use that
final ClientConfig configuration = new ClientConfig();
configuration.property(ClientProperties.CONNECT_TIMEOUT, 15000);
configuration.property(ClientProperties.READ_TIMEOUT, 15000);

client = ClientBuilder.newClient(configuration);

markus7017 · January 21, 2020, 6:36am

thanks for the info, this could be an option

martinvw · January 22, 2020, 4:27pm

@markus7017 the most important thing is whether those timeout really occurs after your configured timeout? Is the timeout given after 30 seconds when you configure 30 seconds?

Do you have a reference to the error you/they see in the log?

No timeout’s or no / never slow responses?

AFAIK jetty uses a connection pool so it could be a limiting factor, but I would hope the timeout does not include the time waiting for a connection from the pool but this could be a smoking gun to look into.

Maybe it’s too easy to ask, but could it be that the binding sends so many requests to the devices that it causes them to lack/slowdown. Maybe they only have single thread to process all the incoming requests. Could the problem be related to using both the app, the binding and maybe another product which talks to the shelly api?

Just to check my assumptions (based on how you described it) the shelly has a local API inside the network on the if I recall correctly ESP chip?

tmrobert8 · January 22, 2020, 6:29pm

I don’t know shelly - but this was a big issue for the sony binding as it is very easy to overwhelm the sony processor…

markus7017 · January 22, 2020, 8:17pm

timeout occurs after the configured value, I tried 3…30sec
when no commands are issued the binding does 1 API call per 60sec (default)
no timeout issues in the browser or with curl, response is always received
could happen on a get or set call, always 1 call, no sequences
signal strength was verified and is good

martinvw · January 22, 2020, 8:55pm

Received quick/instant or in the end?

markus7017 · January 22, 2020, 9:36pm

receive quick/instant

martinvw · January 22, 2020, 9:38pm

Maybe there is a miscommunication with headers about the length, so http-util remains waiting for a few bytes which never come. In a normal browser this might not be visible.

Which could also be related to coding/encoding problems.

markus7017 · January 22, 2020, 9:48pm

puh, how to debug that?

tmrobert8 · January 22, 2020, 11:44pm

Try fiddler or if you really want - wireshark - either will generally show you what’s going on between endpoints

markus7017 · January 23, 2020, 8:28am

that came also into my mind, but’s I see it in my setup very rarely, other users are reporting this problem. I’ll ask.

Nevertheless, I tried changing HttpUtil to

       URL location = new URL(url);
        HttpURLConnection request = (HttpURLConnection) location.openConnection();
        request.setRequestMethod(HttpMethod.GET);

        request.setConnectTimeout(SHELLY_API_TIMEOUT_MS); // set timeout to 15 seconds
        BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
        StringBuilder response = new StringBuilder();
        Integer responseCode = request.getResponseCode();
        if (responseCode != 200) {
            httpResponse = responseCode.toString();
        }
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();
        httpResponse = response.toString();

this runs with 6 devices, but after a while I get this exception.

Bildschirmfoto 2020-01-23 um 09.21.18

So it smells like a protocol issue - strange

markus7017 · January 27, 2020, 7:31am

The Shelly QA team did stress testing (10 requests per sec) without issues. Any other idea?

Running the above situation the user reported more or less no timeouts: Shelly Binding - #813 by Nightman

The tcpdump shows a sequence like this m(timestamp matching the one in openhab.log with the timeout message):

@martinvw Any idea appreciated

martinvw · January 27, 2020, 8:54am

Can you just send me the dump, so that I can analyze it myself? Normally I would use the follow TCP-stream to get the summary I can not easily follow it from such a screenshot. I only work with Wireshark a few times per year

I would, however, say that red-on-black part looks like a problem on the shelly side of things.

Do we also have a TCP-dump with the HttpUtil method? Because I still cannot believe that Jetty would be completely wrong on breaking.

markus7017 · January 27, 2020, 10:27am

sent you a PM with the dump file

martinvw · January 27, 2020, 11:01am

Okay, all knowledge was indeed a bit rusty but refreshed now.

What I see when opening it in Wireshark:

Filtering on tcp.stream eq 18 gives us a successful call to open:

It does a SYN to setup the TCP connection, it fails, so a second later it sends a new SYN, this succeeds and the rest of the communication is handled in 0,1 seconds.

Next one, filtering on tcp.stream eq 19 shows something interesting, the communication initiation starts good and the response is received after 0,02 seconds however the connection is NOT closed, that happens on second 35 of the dump. And could indicate a flaw in the connection handling in the (new) code. The call is to: /shelly/event/shellyswitch25-f36f94/roller/0?type=roller_open does that maybe follow a different path?

Next one, filtering on tcp.stream eq 20 - tcp.stream eq 27 show us failed attempts to setup a connection with the Shelly device. The first one at 5.578 seconds in the dump, later on we do succeed in getting a response on the SYN.

Looking at tcp.stream eq 28 (20 seconds after the first attempt and the connection from stream 19 is still open) our connection succeeds and we receive a response in 0,1 seconds.

Next we have a call to fetch some settings this one takes 0,5 seconds and contains a retransmission because the openHAB system believes the ACK was not received. Most other calls look okay.

So sometimes the shelly does not listen SYN’s so we cannot establish a connection this could be linked to the connection which somehow remains stuck.

When load testing the device I would combine one unclosed transaction with a lot of smaller requests and maybe some settings calls in parallel because they seem heavy.

markus7017 · January 27, 2020, 5:51pm

that’s good input, thanks
I’ll forward your analysis to the Shelly QA team

That is the only strange peace for me, because code-wise if should close the session after each request.

martinvw · January 27, 2020, 6:07pm

Would it then be that request which caused the timeout error, so that being a request which looks to be waiting for something. Is there a way to find the more likely call to trigger the error for which we have the stacktrace in the Shelly thread.

markus7017 · January 28, 2020, 8:27am

I think it is not related to a specific URI

Java shows connect timeouts, which would correspond to your analysis
Example 1 from Shelly Binding thread
Example 2 from Shelly Binding thread

martinvw · January 28, 2020, 1:52pm

Is is the dump from the first example?