[SOLVED] High CPU usage in OH 3.1

I am experiencing high CPU usage in OH 3.1 after a few hours of uptime. CPU usage is normally around 1 to 10%, then suddenly jumps to 100% and stays there.
My system runs on an Intel Celeron N3450 4 cores / Windows 10-64.

I tried the following with no success:

  • clearing the cache/tmp folders and restarting the system several times,
  • stopping the bindings one by one using bundle:stop xxx, expecting a CPU usage drop that would have helped me find the binding responsible for the issue,
  • stopping org.openhab.core.model.script.runtime in case my scripts or threads were looping infinitely,
  • stopping ALL bundles (including core) to determine if one of them could be the cause,
  • setting the log level to DEBUG for all bindings (and even ROOT) to look for log flooding,
  • using the systeminfo binding to detect when the CPU increase happens, but unfortunately the process#load channel monitoring the Java PID always returns 0 and cpu#load1 is never updated.

Any advice on other troubleshooting methods (excluding a reinstall of OH, which will be the last option…) is really appreciated! Thank you in advance for your suggestions!

My system runs on an Intel Celeron N3450 4 cores / Windows 10-64.

RAM?

Just asking because I had to migrate from a Raspi 3B to a VM when CPU usage was at 140++% most of the time. The system was acting weird in many ways and simply wasn’t usable.

Assigned 2 GiB of RAM to the VM and suddenly the sun was shining again: CPU usage down to single digits (mostly), scripts running again.

I felt the RAM was more important than CPU power.

(running on Debian here, memory management and CPU usage reporting is probably different than on win*)

Definitely look at RAM for sure. Also be aware that there are some known problems with the systeminfo binding in 3.2, likely carried over from 3.1, where it runs away consuming CPU and RAM until OH crashes.

You’ve pretty much taken all the steps I would suggest, so at this point the best I can offer is to upgrade to 3.2 and, if the problem persists, file an issue.

One thing to note though: if the CPU was already pegged when you started disabling bundles, the fact that the CPU didn’t immediately drop is no guarantee that that bundle wasn’t the root cause of the problem in the first place. It might be that a misbehaving bundle causes a different bundle to run amok.

Good luck! These types of problems are really hard to diagnose.

Upgrade to 3.2 stable; there are 6 months of bug fixes you’re missing out on.

Check what is happening with the Java heap space. If it is getting low, the JVM will spend more and more time on garbage collection, and that uses CPU.
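If it helps, here is a minimal Jython sketch (assuming the JSR223/Jython scripting add-on is installed) that logs the JVM heap figures via java.lang.Runtime; the logger name is only an example:

# Minimal sketch, assuming a Jython scripting environment inside openHAB.
# The logger name is illustrative.
from java.lang import Runtime
from org.slf4j import LoggerFactory

log = LoggerFactory.getLogger("org.openhab.heapcheck")

rt = Runtime.getRuntime()
used_mb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
max_mb = rt.maxMemory() / (1024 * 1024)
log.info("Heap used: {} MB of {} MB max".format(used_mb, max_mb))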

Also disable your rules: there may be a rule with an issue causing the load.

Thank you for your suggestions.
What is quite strange is that the responsiveness of OH is not impacted (unless the issue has been present in my setup since OH 1.8…).
I discovered the problem while investigating web connection drops from the WAN. I also noticed some RAM over-usage, but that investigation finally led me to nginx: nginx was crashing because its log files had reached their size limit (I fixed that by setting up log rotation for the nginx logs).

The OH process is using less than 1 GB of RAM and the system has 4 GB. Windows is only using 230 kB of the 1 GB swap it allocates and reports 65-75% memory usage, so I don’t think low RAM is the reason.
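For reference, those figures can be cross-checked with a few psutil calls from any Python on the Windows host; this is just a sketch and the PID is a placeholder:

# Sketch only: cross-checking the memory figures with psutil from any Python
# interpreter on the Windows host. The PID below is a placeholder.
import psutil

oh_pid = 1234  # PID of the openHAB java.exe process (placeholder)
proc = psutil.Process(oh_pid)

print("OH process RSS: {:.0f} MB".format(proc.memory_info().rss / 1024.0 ** 2))
print("System memory used: {}%".format(psutil.virtual_memory().percent))
print("Swap used: {:.0f} MB".format(psutil.swap_memory().used / 1024.0 ** 2))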

Having the systeminfo binding working would have helped me find a relation between a faulty script or binding and the rise in CPU usage. But if the binding is known to have issues in 3.2, I won’t try it.

At least this issue will be a good reason to speed up migration to OH3.2 :relaxed:

I finally found a way to troubleshoot the problem, using a workaround to replace the systeminfo binding.
I created a Python script that monitors the CPU usage of the OH Java process (I am scripting from EventGhost, which handles the MQTT communication).

import psutil
import time
import threading

def mem_mon(pid):
    print("mem_mon() started")
    mqtt_server = 'my.oh.ip.address'
    mqtt_port = '1883'
    while getattr(t, "do_run", True):
        try:
            process = psutil.Process(pid)
        except psutil.Error:
            # the monitored process is gone (or not accessible): stop monitoring
            break
        # the first call sets the reference point, the second one (in the publish
        # below) measures the CPU usage over the 5 s interval
        process.cpu_percent()
        time.sleep(5)

        eg.plugins.MQTTthreads.publishMQTT(u'eg-client',
            mqtt_server, mqtt_port,
            u'/openHAB/out/SYS_System_ProcessorLoad/state',
            "{}: {}MB {}%".format(pid, process.memory_info()[0] / 1024 ** 2, process.cpu_percent()),
            0, False, u'', True, u'mqttaccount', u'mqttpassword', False, u'', u'', u'', u'TLSv1.2')
    print("mem_mon() stopped")

# the PID of the OH Java process is received through an MQTT event (0 = stop monitoring)
pid = eg.event.suffix.split(".")[1]
if eg.event.prefix == u"MQTT" and pid != u"0":
    try:
        t
        print("Stopping current monitor...")
        t.do_run = False
        t.join()
    except NameError:
        # no monitor thread running yet
        pass
    print("Starting monitor for PID {}...".format(pid))
    t = threading.Thread(target=mem_mon, args=[int(pid)])
    t.start()
elif eg.event.prefix == u"MQTT" and pid == u"0":
    print("Stopping monitor...")
    try:
        t.do_run = False
        t.join()
    except NameError:
        pass

The value is returned to OH using an MQTT channel and linked to an item.

String SYS_System_ProcessorLoad             "Processor load"                         { channel="mqtt:topic:28bdcded6f:mem_monitor:value", autoupdate="false" }

The item state is then logged in OH by a rule, with a log level that depends on the CPU value (error level above 50%).

from core.rules import rule
from core.triggers import when

@rule("test.mem_monitor", description=u"Memory monitor", tags=[])
@when("Item SYS_System_ProcessorLoad received update")
def test_mem_monitor(event):
    # state format published by the script: "<pid>: <mem>MB <cpu>%"
    cpu = float(event.itemState.toString().split(" ")[2].replace("%", ""))
    if cpu <= 25:
        test_mem_monitor.log.debug(cpu)
    elif cpu > 25 and cpu < 50:
        test_mem_monitor.log.warning(cpu)
    elif cpu >= 50:
        test_mem_monitor.log.error(cpu)

Looking at the log after the CPU over-usage occurred made it possible to identify which bindings / scripts were triggered right before the issue.
Bingo: one of the threads started by a rule raises an exception that is never logged.
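For anyone hitting the same thing, here is a minimal sketch (not the actual library code from my setup; function and logger names are illustrative) of how a thread target can be wrapped so that its exceptions end up in the log instead of disappearing silently:

# Sketch only: wrapping a thread target so uncaught exceptions are logged.
# Function and logger names are illustrative.
import threading
import traceback
from org.slf4j import LoggerFactory

log = LoggerFactory.getLogger("org.openhab.threadwatch")

def my_task():  # illustrative thread target started from a rule
    raise RuntimeError("boom")

def logged(target):
    def wrapper(*args, **kwargs):
        try:
            return target(*args, **kwargs)
        except Exception:
            log.error("Uncaught exception in thread:\n" + traceback.format_exc())
    return wrapper

threading.Thread(target=logged(my_task)).start()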

Hope that helps anyone having this type of issue.

Do you mind posting the rule that was the cause? Good to hear it is sorted. The rules now have a pause button, which is very handy for disabling a rule for a period of time.

Indeed, the threading issue was not directly in a rule but in a Jython library used by the rule.
That library implements a Python Run(Thread) class which provides a convenient execution scheduler:

  • Run.schedule(myfunction, [args]).delay(delay_secs).repeat(repeat_secs).stopafter(stop_after_secs) to schedule a call in the future (typically sendCommand, postUpdate or any callable function)
  • Run[myfunc].stop() to cancel any ongoing scheduled call

I use it as a replacement for the Java Timer class, which does not behave as expected when used from Jython.
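For illustration only, a hypothetical usage based on the API described above (the Run class itself is not shown in this thread; update_presence and the item name are invented for the example):

# Hypothetical usage, based only on the API described above.
# update_presence and the item name are invented for the example.
def update_presence():
    events.sendCommand("Presence_Timeout", "OFF")

# call update_presence after 30 s, then every 60 s, stop after 10 minutes
Run.schedule(update_presence).delay(30).repeat(60).stopafter(600)

# cancel the scheduled call
Run[update_presence].stop()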

OK, so solved, right?
Please put [SOLVED] in your thread title so folks know this is a resolved issue.

The OP can just click the tick icon on the post which solves the issue. This will add a solved symbol to the title. No need to manually edit the title.

even better! :joy:


This topic was automatically closed 41 days after the last reply. New replies are no longer allowed.