Too many HTTP connections?

Hi!
I have 3 LIFX light bulbs, and they have recently started to respond slowly to openHAB commands (~6 sec).
After some tinkering, I deactivated my Network binding, which was pinging around 100 devices in my network and measuring their latency.
Now, the LIFX-Latency is gone and everything is responsive again.
Is there a limit on how many open connections I can have and can this be increased?

First things first, the Network binding doesn’t use HTTP. It uses ping and arping.

Secondly, when the LIFX lights are reacting slowly, is it just those or is everything slow? What’s the computer doing at these times? (Run htop to see how much CPU and RAM is being used.)

Finally, openHAB makes a horrible network monitoring tool. That’s not what it’s made to do. There are times where knowing whether a device is reachable over the network might impact your home automations (e.g. when receiving a command to open a garage door, making sure OH can reach the garage door and if not generating an alert so the user knows why the door didn’t open). But the use of bindings like Network and SystemInfo should be limited to those use cases. Use a proper networking or system monitoring tool to monitor the health and status of your system overall. It will work better, faster, and provide you better information than openHAB ever could. And it’ll be less work to set it up.

If you just want to see what devices are online on your network, plus latency and such, Fing, Zabbix, and Nagios are all pretty good choices.

I did check that, looking at the amount of threads. CPU is low, RAM consumption is low, 290 threads is OK… I also see HDD I/O is low. So I assume the computer is running OK. Also, I have a combination of other lights like Shelly; those turn on immediately, just the LIFX is slow.

My concern is that if a binding is used so heavily and relies on an OS command (ping), it might slow down other bindings that need OS features (maybe LIFX uses ping as well), through some kind of wait. I have no details, but maybe one session (openHAB) can only send one ping at a time? I can reproduce it: Network binding on -> LIFX slow…

I looked into other tools, check-mk as well, but ended up with openHAB. One reason is probably that I know it best and can analyse it via Grafana automatically… and the other is that I can monitor Z-Wave, Zigbee or other “devices” as well, independent of their protocol. I could use the UniFi binding or a little script if the Network binding is not suited… or maybe point the finger at a problem so the developer finds something to improve.

It seems reasonable to use openHAB to do some basic network monitoring, at ping level.
The exceptional part of your usage is having 100 devices instead of 3 or 4.
Which might well highlight some blocking behavior in some part of the mechanism. Give us a clue, how often are you pinging these things?

Hmm… I use the defaults: 5 sec wait time, one minute refresh interval, 1 retry. I assume those 100 are sent in parallel though, otherwise it would be difficult to keep the refresh interval… but I don’t see spikes in my threads (at least in htop).
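To put rough numbers on the sequential-versus-parallel question, here is a back-of-the-envelope sketch. It assumes the worst case where every device is offline and every attempt (the first ping plus each retry) waits for the full timeout; the `workers` parameter is hypothetical and only models how many pings run at once. It says nothing about how the Network binding actually schedules its pings.

```python
def worst_case_cycle_seconds(devices, timeout_s, retries, workers=1):
    """Upper bound on one refresh cycle if no device ever answers.

    Assumption: each retry waits for the full timeout again, and a fixed
    pool of `workers` pings devices in rounds.
    """
    attempts = 1 + retries                # first ping plus retries
    per_device = attempts * timeout_s     # seconds spent on one dead device
    rounds = -(-devices // workers)       # ceiling division
    return rounds * per_device


if __name__ == "__main__":
    # Values from the post above: 100 devices, 5 sec timeout, 1 retry.
    for workers in (1, 10, 100):
        print(workers, worst_case_cycle_seconds(100, 5, 1, workers))
    # 1 worker -> 1000 s, 10 workers -> 100 s, 100 workers -> 10 s
```

With a 60 sec refresh interval, a strictly sequential scan only keeps up if most devices answer quickly; a handful of offline devices at 10 sec each would already eat a large part of the minute.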

If I were writing the binding I’d be doing one at a time in sequence, and treating the interval as a minimum refresh. Mostly they’d each be over and done with in a few milliseconds, and no queue would build up.
I’ve no idea how it really works.

Even with one-at-a-time, I would expect them to come in bursts, as they’d all get scheduled for much the same time at initialization. You might see that simply by noting Item change timings in events.log

What kind of router do you have? 100+ devices is a lot for a cheap router to handle and you may be hitting a hardware limitation.

Also I know you don’t want to change the interval, but to see the result it is worth doing for fault finding reasons. If you don’t experiment then you don’t learn and you won’t progress.

Further thought: indirect effects. Updating, what, 200 Items per minute is not a huge load. But a very real load. Are you persisting these? Further workload, this time tying up I/O resources. None of these is an obvious showstopper, but it all adds up.

Sometimes intelligent configuring can make a lot of difference.
Do you really need 100 latency values? (Why!)
Do you need to persist 100 unchanging ONLINE states every minute? Etc.

Thanks for the Input!

I have UniFi switches and UniFi access points (about 30 clients per NanoHD). Routing is done internally in the Proxmox server to reach the subnet of the LIFX lights, and this server doesn’t show any load.

I am all up for it, I don’t actually “need” those pings, I can live without them, but I want them because I can:) So I’m not really looking for a workaround, but the source of the problem.

Interesting thought. At peak times, I am updating about 500 items per second, which the system handles quite nicely. I don’t think this is the issue, also because other devices (Shelly lights) are not affected, only the LIFX connection.

Yeah, I’m not there yet :smiley: For now, it is more like: put everything I can in there and figure out if that was helpful later :slight_smile: I don’t need any latency values nor do I need 99% of the online statuses. But it was installed to find the problem with unresponsive Shelly devices (before a patch fixed that).

I can do workarounds: I can open MQTT sessions from most devices and work with last will messages or so. I can also do it with scripts, or use the UniFi binding… but for now, I was interested in the “why”.

I will analyse this a bit. Currently the binding is disabled; maybe there is a difference between enabled, enabled with 10 and enabled with 100 devices. Maybe only offline devices slow the process. Will keep you updated when I have progress.

The LIFX LAN protocol is based on UDP, which has no guaranteed delivery. So if there is also a lot of ping traffic in your network, the devices routing the UDP traffic may opt for dropping excess UDP packets. This may cause the binding to have to send packets several times for them to arrive (causing slowness) or not arrive at all (causing unreliability). LIFX lights also have a limit on how many packets per second they can handle. LIFX advises at most 20 messages per second. The older LIFX lights can handle fewer messages per second than newer lights.
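As an illustration of what staying under a "20 messages per second" ceiling means in practice, here is a minimal pacing sketch. The class and method names are made up for this example and this is not how the LIFX binding is implemented; it only spaces out send times so that no two messages are closer together than 1/20 sec.

```python
class MessagePacer:
    """Hands out send times that respect a maximum message rate.

    Hypothetical helper for illustration only; the caller passes in the
    current time so the logic stays easy to test.
    """

    def __init__(self, max_per_second=20.0):
        self.min_gap = 1.0 / max_per_second  # 0.05 sec at 20 msg/s
        self.next_allowed = 0.0

    def reserve(self, now):
        """Return when the next message may go out and book that slot."""
        send_at = max(now, self.next_allowed)
        self.next_allowed = send_at + self.min_gap
        return send_at


if __name__ == "__main__":
    pacer = MessagePacer(max_per_second=20)
    # Three messages queued at the same instant get spread 50 ms apart.
    print([round(pacer.reserve(0.0), 2) for _ in range(3)])  # [0.0, 0.05, 0.1]
```

A sender would sleep until the returned time before transmitting. Note that every retry counts against the same budget, so a lossy network eats into the rate limit as well.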

I wonder what the timeout period is before a retry is invoked. Seconds, likely?

It retries when there is no ACK after 250 ms, with a maximum of 3 retries. I’ve tested this with 40+ lights in my network, where it’s very common that packets are dropped. If you wait longer between retries, you get annoyed by slow light response times whenever packets are dropped.
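The retry scheme described here (resend when no ACK arrives within 250 ms, at most 3 retries) can be sketched generically as follows. This is not the binding's code: the function name is made up, and it treats any incoming datagram as the acknowledgement, whereas a real client would have to match the ACK to the request it sent.

```python
import socket


def send_with_retries(sock, payload, addr, ack_timeout=0.25, max_retries=3):
    """Send payload to addr, resending while no reply arrives in time.

    Returns the attempt number that was answered, or None if every
    attempt (the first send plus max_retries resends) timed out.
    """
    sock.settimeout(ack_timeout)
    for attempt in range(1, max_retries + 2):
        sock.sendto(payload, addr)
        try:
            sock.recvfrom(2048)  # simplification: any datagram counts as the ACK
            return attempt
        except socket.timeout:
            continue  # no reply within ack_timeout, so resend
    return None
```

With these defaults the caller gives up after 4 x 250 ms = 1 sec, which matches the point above: a longer wait between retries directly becomes visible delay at the light whenever a packet is dropped.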

Very occasionally I still have a light that doesn’t properly respond. I still have to investigate if that can be improved by slightly increasing the max retries. But if the light was momentarily offline when lights are switched, increasing the max retries will not solve it. :frowning:

That’s quite a short timeout really, although I’m sure it’ll be sensible for most home networks which is after all the target audience.
I get the feeling this isn’t the average home network though.

I am always impressed how quickly a relevant developer finds and joins a discussion :slight_smile: so thanks @wborn for dropping in!

I had a little contact with the Yeelight protocol. They use UDP as well, and they send something like a thousand UDP requests per second… just to make sure one comes through. That is bad practice I guess, but it seems to work. I did tcpdump that and didn’t count whether “all” were received, but the screen filled up quickly on the sending and receiving side, so I don’t think packets are dropped here.

I would consider my home network to be professional hardware built with medium knowledge. I prefer IP-based protocols over Zigbee/Z-Wave, so all my outlets, switches and lights use IP, and that grows quickly. Hence the bigger investment in hardware that can support the load, but besides this, it is still just a little home here :slight_smile:

Based on this input I will probably ssh into the different devices in my network (openHAB itself, Proxmox, switch, access points) and see (tcpdump) where the UDP packets are lost to confirm/deny that theory.

So, a little update.
I ran tcpdump on the openHAB host to see if the UDP packets are even created.
There is a constant UDP packet every 3 sec that polls the state of the light. All good.
Then I activated some Network things. The 3 sec polling paused for a couple of seconds, then went back to normal.
Then I activated some more Network things… same thing: there is a pause, but then it is back to the normal 3 sec polling.
At around 65 Network things there is no “back to normal” anymore. The polling interval is then between 10 and 60 secs…
If I go back down to 60 or 50, it does not recover easily. But back at around 10, the 3 sec polling returns.
I mixed up which Network things I added to see if “certain” things are the problem, but I suspect it is the number, not the things.
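For anyone wanting to repeat this measurement without eyeballing the screen, here is a small sketch that reads tcpdump output and reports where the polling gap exceeds the expected 3 sec. It assumes the capture was made with `tcpdump -tt`, which prints the timestamp as epoch seconds in the first column; the function name and the 1 sec tolerance are my own choices.

```python
def polling_gaps(lines, expected_interval=3.0, tolerance=1.0):
    """Return (timestamp, gap) pairs where consecutive packets are too far apart.

    `lines` are text lines from `tcpdump -tt`, so the first whitespace-
    separated field of each line is a floating point timestamp.
    """
    stamps = [float(line.split()[0]) for line in lines if line.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if gap > expected_interval + tolerance:
            gaps.append((cur, round(gap, 3)))
    return gaps


if __name__ == "__main__":
    import sys
    # Example: tcpdump -tt -n udp port 56700 | python3 gaps.py
    # (56700 is, as far as I know, the default LIFX LAN port; adjust as needed.)
    for stamp, gap in polling_gaps(sys.stdin):
        print(f"{stamp:.3f}: {gap:.1f} sec since previous packet")
```

A capture filter that only matches the outgoing poll requests gives the cleanest result, because replies from the lights would otherwise shorten the measured gaps.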

So I monitored my openHAB more. The UDP packets are not created, but why?

pidstat -d 10 shows that my NVMe SSD is not doing a lot: I have about 100kb read/write speed with 0% IO delay.
htop shows 3% CPU with 1 of 2G memory used and 51,255 threads, with or without the Network things, so no problem here.

I am running out of ideas. Maybe I have to switch to a different concept :confused:

It appears you’re using Linux. I do know that you can run into things like port exhaustion in Windows; I’m wondering if the same thing is possible in Linux. The only issue is that it would be per remote host vs collective remote hosts. Maybe some sort of other limit on the network side? UDP isn’t a ‘connection’ so it wouldn’t be anything surrounding that. It could be in the networking gear though if you have firewalls segregating things and/or NATs.

Thread management, blocking methods, and possibly a combination of the two could explain it as well, as you hint at. You could possibly get to a point where things can’t be ‘processed’ fast enough and before you’re finished with one cycle, the next one has been added and things now start to exacerbate. This would also explain why even when you lower the number, things don’t immediately clear up.
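A toy model makes the snowball effect concrete. The numbers below are hypothetical (a capacity of 60 checks per cycle is picked only because the trouble started at around 65 things); the sketch tracks how much unfinished work carries over from one cycle to the next.

```python
def backlog_history(jobs_per_cycle, capacity_per_cycle, cycles, initial_backlog=0):
    """Backlog left over after each cycle when new jobs keep arriving.

    Purely illustrative: a job is one ping check, a cycle is one refresh
    interval, and capacity is how many checks can finish per cycle.
    """
    backlog = initial_backlog
    history = []
    for _ in range(cycles):
        backlog = max(0, backlog + jobs_per_cycle - capacity_per_cycle)
        history.append(backlog)
    return history


if __name__ == "__main__":
    print(backlog_history(65, 60, 4))                      # [5, 10, 15, 20]
    print(backlog_history(55, 60, 4, initial_backlog=20))  # [15, 10, 5, 0]
    print(backlog_history(10, 60, 2, initial_backlog=20))  # [0, 0]
```

Slightly over capacity, the queue grows without bound; dropping slightly below capacity drains it only slowly; dropping far below clears it within one cycle. That pattern is at least consistent with the observation that going back to 50 or 60 things did not help quickly while going back to 10 did.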

Can you set the max RAM in Linux for Java like you do in Windows? I would assume so.

Also, 5 seconds is an insane timeout, especially on an internal network. If something doesn’t ping back within 1, something’s up. I’d suspect that you could actually get away with 500ms or even 100ms.

I do have a “network” of devices, from firewall to NAT etc. But because the UDP packets are not sent (I ran tcpdump on the openHAB host), I think this cannot be the problem.

I could, but the current RAM is not exhausted, at least as far as I can see. The heap size is already increased, although I don’t have the number ready right now…

I agree, but some devices (like Shelly devices) have latencies of >1 sec… not often, maybe 2 or 3 times a day, but they do. I don’t want offline warnings when that occurs.

I missed the part where zero UDP are being sent. Hmmmm.
