CPU Load jumps up in several steps

  • Platform information:
    • Hardware: Mac mini (Late 2012) / 2.3 GHz Intel Core i7 / 16 GB RAM
    • OS: macOS 10.14.6
    • Java Runtime Environment: Zulu 8.50.0.51
    • openHAB version: 2.5.10

Hi All,

I’m experiencing a peculiar problem. I have a long-running installation of openHAB which I’ve migrated from Docker on a Synology to run natively on a Mac. Everything is configured via text files, the migration went without any problems, and the system works as before.

I’ve now noticed that openHAB maxes out the CPU on the Mac mini; it didn’t do that in the Docker installation, which ran on a much weaker CPU.

What’s interesting is that the CPU load, after running normally for a while, jumps up in several steps:


[screenshot: CPU load graph showing the stepwise increases; the initial spikes are just the initialisation]

When I sorted the output of threads --list by CPU time, the following threads were on top (everything else was less than 0.1% of CPU time):

Id      Name                        State           CPU time    CPU %
244     ESH-httpClient-common-244   RUNNABLE        7568939     16.7%
744     ESH-httpClient-common-744   RUNNABLE        4359601     9.6%
228     ESH-httpClient-common-228   RUNNABLE        3768833     8.3%
227     ESH-httpClient-common-227   TIMED_WAITING   3523188     7.8%
235     ESH-httpClient-common-235   RUNNABLE        3305965     7.3%
1995    ESH-httpClient-common-1995  TIMED_WAITING   3154545     7.0%
226     ESH-httpClient-common-226   RUNNABLE        3034625     6.7%
229     ESH-httpClient-common-229   RUNNABLE        2997592     6.6%
242     ESH-httpClient-common-242   RUNNABLE        2780304     6.1%
241     ESH-httpClient-common-241   RUNNABLE        2502290     5.5%
234     ESH-httpClient-common-234   RUNNABLE        2297561     5.1%
6255    ESH-httpClient-common-6255  TIMED_WAITING   2053549     4.5%
5198    ESH-httpClient-common-5198  TIMED_WAITING   1797113     4.0%
6494    ESH-httpClient-common-6494  TIMED_WAITING   1439466     3.2%
270     HTTP Refresh Service        RUNNABLE        91702       0.2%
106     OH-EventHandlerExecutor-1   WAITING         67410       0.1%
108     pool-6-thread-1             WAITING         59677       0.1%
264     Stanza Listener Thread      TIMED_WAITING   35329       0.1%
70      OH-OSGiEventManager         TIMED_WAITING   34587       0.1%
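
(For anyone wanting to reproduce this: the listing comes from the openHAB Karaf console. The port, user, and password below are the openHAB defaults, so adjust them if you changed yours:)

  # interactive console
  $ ssh -p 8101 openhab@localhost        # default password: habopen
  openhab> threads --list

  # or non-interactively, roughly sorted by the CPU-time column
  # (the field count breaks for thread names containing spaces)
  $ ssh -p 8101 openhab@localhost 'threads --list' | sort -rn -k4 | head -20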

I’ve been logging the CPU load every two seconds to pinpoint the exact time of the load jump, but there was nothing in the logs at those times that could explain it (just values updating constantly at the normal pace, as they do 24/7).
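
(The logging itself is just a small shell loop along these lines; the pgrep pattern is an assumption about how the process is named on your machine, and it assumes a single match:)

  # append a timestamped CPU sample for the openHAB Java process every 2 s
  $ while true; do
      echo "$(date '+%F %T') $(ps -o %cpu= -p "$(pgrep -f openhab)")"
      sleep 2
    done >> /tmp/openhab-cpu.log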

I can restart openHAB every once in a while to bring the CPU load back down, but that is not really a long-term solution.

I’d appreciate any help. Thanks,
Stefan

Although I can’t offer a quick solution to your issue, I can suggest listing which bindings you use and the number of items/things etc. How big is the system?

Hi Andrew,

Thanks for the helpful suggestion. It’s a system with roughly 150 things and 500 items.

I’m using the following bindings, etc.:

action="pushover"
binding="daikin,exec,expire1,http1,hue,modbus,mqtt,neato,network,nuki,robonect,snmp,unifi,astro,freeathome,homeconnect"
misc="market,hueemulation"
package="standard"
persistence="influxdb,mysql"
service.pid="org.openhab.addons"
transformation="exec,javascript,jsonpath,map"
ui="dashboard,basic,paper,habpanel,homebuilder,restdocs"

The system had been running stably for years in this configuration on much weaker hardware, with only minimal CPU load. The issue only popped up after I migrated it to the new machine. I made no changes during the migration; I only copied the configuration files over and started the service.

Thanks,
Stefan

The Mac is running version 2.5.10
What version was your Docker install running?

Usually it’s memory consumption that people come to the forum complaining about, but your CPU usage seems to have the same sort of effect, and I’m guessing troubleshooting it would follow a similar path. That is quite a long list of bindings. Standard practice is to unload the bindings one by one, or unload them all and re-add them one by one, until the offending binding is isolated.
Yeah, I know… yuck :confounded:
I’m guessing this is a production install and this type of troubleshooting may be painful. How long have you been using the install?
There are a few bindings which stand out as possible offenders. The number of httpClient threads in your listing points to something as well, to me. Do you have any rules which make calls to a web site that might be returning slowly or not at all?

Did anything specific happen at the times when the CPU load jumped to the next step? Are there any entries in openhab.log or events.log for those specific times?

From the first post: “there was nothing in the logs at those times that could explain it”

The same version. I deliberately did a straight migration without any changes, to keep the number of variables small.

Is there a way to quickly unload (and reload) a binding via the console? Ideally I’d just start unloading bindings while the issue is occurring, and identify the culprit by watching for the CPU load to drop after unloading one.
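
(Something along these lines is what I have in mind; I’m assuming the standard Karaf bundle commands apply to openHAB bindings:)

  $ ssh -p 8101 openhab@localhost       # openHAB console, default password: habopen
  openhab> bundle:list | grep -i neato  # find the bundle id of a binding
  openhab> bundle:stop <id>             # unload it while the system keeps running
  openhab> bundle:start <id>            # reload it afterwards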

Yes, it’s my home setup and it’s in constant use. I’ve been using this install for about three years.

Which ones do you have in mind?

No rules, but I have three items which call a local API (local as in on my network). Those items haven’t been used since the migration. And I have a cached weather call (once per hour) to the Dark Sky API. As far as I can tell it is working correctly, and I also did a manual check to confirm that the API is responsive, which it is.
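
(The manual check can be done from the shell; the endpoint shape below is from the Dark Sky docs as I remember them, so treat it as an assumption:)

  # print status code and total response time for one forecast request
  $ curl -s -o /dev/null \
      -w 'HTTP %{http_code}, %{time_total}s total\n' \
      'https://api.darksky.net/forecast/<key>/<lat>,<lon>'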

What’s strange from my perspective is that the jumps in CPU usage don’t follow a regular pattern (e.g. scheduled rules).

Overall I’m still happy with the migration, as the new system is faster and more responsive than the old one, especially with rules and MQTT. And that’s still the case even when the CPU load is very high.

And thanks again for taking the time to respond to my issue. I really appreciate it!

Thanks,
Stefan

Does the system eventually become unresponsive?
Is this normal behavior? I have no idea, because I know very little about Macs. Maybe this is normal, or maybe something on the host computer is causing the spikes, although the main consumer of CPU seems to be ESH-httpClient-common, which I would guess is some component of openHAB. I don’t know the inner workings of openHAB well enough to do more than speculate.

A blind stab in the dark… http1
(especially because ESH-httpClient-common smells like http1… ESH = Eclipse Smart Home… old)
2nd guess: hueemulation, though it should be OK if it’s the new version for 2.5.10
These I know nothing about:
daikin,modbus,neato,nuki,robonect,unifi,freeathome,homeconnect
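
One way to see what those ESH-httpClient-common threads are actually doing is a plain thread dump (jstack ships with the JDK; the pgrep pattern is an assumption about the process name, and it assumes a single match):

  # dump all thread stacks of the openHAB JVM, then pull out the busy ones
  $ jstack "$(pgrep -f openhab)" > /tmp/openhab-threads.txt
  $ grep -A 15 'ESH-httpClient-common' /tmp/openhab-threads.txt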

Check out this one:

Thanks, that’s very helpful!

I started to go through the bindings one by one to identify the culprit. I’m focusing on the ones I know are using HTTP behind the scenes and which don’t provide a transparent output (i.e. it’s not immediately obvious when they fail or are stuck).

The first one to check was Nuki, but the issue reappeared within half an hour of disabling the Nuki binding.

Next was Daikin. After disabling this one there were no issues for the whole day and evening yesterday. But when I checked again this morning, the CPU usage had spiked again during the night, though less severely.

I’ll continue to monitor it to see if, when, and how the usage goes up again. I want to find out whether disabling the Daikin binding actually caused a later/slower rise of the CPU usage (i.e. whether there’s a general issue with all bindings using HTTP), or whether this was just a fluke (so far, the first CPU usage spike has usually appeared within 30–60 minutes).

Thanks again,
Stefan

Quick Update:

I hooked up JProfiler to the JVM to inspect the stuck threads and noticed that they were using SSL. This ruled out a couple of possibilities like HABPanel (I use a separate SSL proxy for that).
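
(Without a profiler, roughly the same clue can be pulled from a plain thread dump; on Java 8 the JSSE classes live in sun.security.ssl, but treat that grep pattern as an assumption:)

  # dump all thread stacks and show the ones sitting inside SSL code
  $ jcmd "$(pgrep -f openhab)" Thread.print > /tmp/openhab-dump.txt
  $ grep -B 5 'sun.security.ssl' /tmp/openhab-dump.txt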

In the end it turned out to be the Neato binding. Since removing it, the system has been running stable at about 5% CPU instead of 700%.

I’m still curious why the Neato binding (and only the Neato binding) suddenly had issues when I changed servers. I’m not actively using the binding right now anyway, so it’s not a big problem that it’s deactivated. I’m also working on an offline control for the Neato Botvacs, so I guess the issue will eventually be solved in another way.

When I have more time I might look into what the issue with the Neato binding was specifically.

Thanks again for all your helpful comments. They pointed me in the right direction and helped me solve the issue much more quickly.

Thanks,
Stefan

You could create an issue on GitHub for the Neato binding to track this problem and keep the binding maintainer informed. Maybe they will find a fix.

I’ll do that as soon as I have more time and can provide a reproducible case. My setup includes three vacuums, so it may take a while to narrow down a root cause that can be debugged.