The demo effect at play. As I am typing up the bug report, I can’t reproduce it anymore.
Well, at least it makes the trigger a whole lot simpler.
Circling back to this issue: I've been setting both options on my vim instance when editing the rules, which seems to go OK most of the time, but sometimes the process still runs away. A couple of threads eat up all the CPU, load skyrockets, and openHAB slowly dies.
There is no double loading going on anymore.
The rule is also as simple as printing a log line with an item type.
Any idea how I can troubleshoot this and turn it into a valid bug report if needed?
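For reference, the options in question are vim's write-backup behaviour; a sketch of setting them permanently, as shell appending to ~/.vimrc (assumption: the exact options may differ from the ones meant above):

# Sketch: make vim write files in place, so the watcher sees a single
# modify event rather than a delete + create pair.
cat <<'EOF' >> ~/.vimrc
" write to the original file instead of rename + recreate
set backupcopy=yes
" skip the temporary backup file normally made while writing
set nowritebackup
EOF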
Perhaps create a shell script that repeatedly does this (a rough sketch follows below); if it manages to consistently trigger the problem, it could be submitted with the problem report?
Also try doing it with Rules DSL or some other scripting language, to pinpoint whether the fault is in JRuby or elsewhere in the core.
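Something like this, perhaps (a rough sketch; the path matches the one used later in the thread, and the iteration count and sleep are arbitrary):

#!/bin/bash
# Sketch: repeatedly recreate the rule file and sample the load average
# to see whether the runaway triggers consistently.
RULE=/etc/openhab/automation/jsr223/ruby/personal/test.rb
for i in $(seq 1 20); do
    cp /tmp/test.rb "$RULE"    # assumes a prepared copy of the rule in /tmp
    sleep 15                   # give the script watcher time to react
    echo "iteration $i: load $(cut -d' ' -f1-3 /proc/loadavg)"
done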
Here is a link to a thread in this forum about chasing down a memory leak.
I suspect this isn't a memory leak. I have seen openHAB using up the CPU in my dev/test instance too. It must have entered some sort of endless loop, possibly in the file watcher area. I haven't investigated this any further though, so this is just a guess.
If someone can create a reproducible test case, it would greatly help the core developers in getting to the bottom of it.
OK, I've done some more testing, and I can now quite reliably kill my system.
To start off with, the code used is just a simple "log this every minute" rule:
JRuby:
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Rules DSL:
rule "Log every minute"
when
    Time cron "0 * * * * ?"
then
    logInfo("CronTest", "Test rule executed")
end
The first thing I noticed was that when the file is first created, whether via vim (with backup options disabled), via cat <<EOF, or by creating a temp file and copying it into the directory, both JRuby and Rules DSL trigger the "double loading" (see the watcher sketch after the log excerpt):
2022-08-22 10:46:11.872 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:46:19.999 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:44:58.428 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
2022-08-22 10:44:59.660 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
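A way to check what the script watcher actually sees is inotifywait from the inotify-tools package (a hedged sketch; editors and copies that go through a temp file or a fresh inode typically emit a create event plus a close-write event, which would line up with two load passes):

# Sketch: stream raw inotify events for the scripts directory while
# saving the file via vim, cat <<EOF, or tee. A create + close_write
# pair on a new file would match the double load; an in-place append
# via tee -a should show only modify/close_write.
inotifywait -m -e create -e modify -e delete -e moved_to -e close_write \
    /etc/openhab/automation/jsr223/ruby/personal/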
If we later add something to the file, which I chose to do using tee, we do not get the double entry, just a single one. (Example: echo '# foo' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb)
I've done the creation, deletion, and adding of lines with Rules DSL, and I could not get the system to run away. I've also noticed that loading the rule is quite a bit quicker; in the "double loading" case you can see the difference in the timestamps as well.
Trying this same rule as JRuby is almost a surefire way of getting it to go out of control; sometimes it requires two tries. In this case it does not matter whether you copy/cat the file in (that is maybe a bit more reliable at triggering the issue) or add a line or two to the file.
There will be a couple of threads, all at 100% CPU usage, and load skyrockets. I've tried to strace a couple of the threads to see what's going on, and they all seem to loop with more or less the following output:
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781332861}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781464652}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781598152}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781730985}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781862609}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781995401}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782126983}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782260650}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782392274}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782524524}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782658190}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782791773}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782923814}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783058355}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783190230}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783326229}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783460937}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783594562}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783726895}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783865061}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784068477}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784181018}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784277809}) = 0
futex(0xb6028978, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
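(For the record: the trace above is plain strace -p against one of the busy thread ids. A sketch of slightly richer captures, assuming a reasonably recent strace build:)

# TID = a 100%-CPU thread id taken from `top -H` or htop.
# Summarise syscall counts for ten seconds instead of streaming:
timeout 10 strace -c -p "$TID"
# Or keep the stream but add wall-clock timestamps (-tt) and per-call
# time (-T), written to a file that can be attached to a bug report:
strace -tt -T -p "$TID" -o /tmp/openhab-thread.trace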
I'd love to hear your thoughts on this @jimtng; let me know if you want me to grab more extensive strace output, and whether you have preferred output options.
EDIT:
I created a test.rb file with this in it:
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Then I ran this command several times (up arrow, enter, repeatedly).
I tried this on my production system (openHAB docker, on Ubuntu 20) but couldn't make my openHAB spike to 100%. It just reported that it detected a change and reloaded the script. I noticed the openHAB process briefly went up to 25-29%, then back down to 0-5% (its normal state).
Is that the same test you did?
I'll be afk for a week (going camping, with no laptop). I'd suggest posting this as an issue over on Issues · boc-tothefuture/openhab-jruby · GitHub so we can keep track of it better.
Yes, that usually does it.
I cobbled together a simple test script:
#!/bin/bash
check_load() { top -b -n1 -H | head -n 20; }
echo -e '>>> Starting testing\n'
# Print a baseline CPU usage
check_load
echo -e '\n\n>>> Writing testing script\n'
cat <<EOF > /etc/openhab/automation/jsr223/ruby/personal/test.rb
require 'openhab'
rule 'Log every minute' do
cron '0 * * * * ?'
run { logger.info "Test rule executed" }
end
EOF
echo -e '>>> Waiting for script to be loaded (10s), log shows the following:\n'; sleep 10
tail -n 5 /var/log/openhab/openhab.log
echo -e '\n\n>>> Checking CPU usage\n'
check_load
echo -e '\n\n>>> We will now wait for the CPU to settle down (60s), if it is not already in runaway\n'; sleep 60
check_load
echo -e '\n\n>>> Poke the script file again using the tee command\n'
echo '# touching script file' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb; sleep 2; echo
tail -n 3 /var/log/openhab/openhab.log
echo -e '\n\n>>> Waiting for a bit (30s), then checking our CPU usage\n'; sleep 30
check_load
And this results in the following output:
>>> Starting testing
top - 18:14:49 up 7:22, 3 users, load average: 0.82, 2.12, 2.12
Threads: 462 total, 1 running, 461 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.6 us, 4.7 sy, 0.0 ni, 92.3 id, 0.0 wa, 0.4 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 47.1 free, 806.3 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27711 root 20 0 10192 2728 2120 R 29.6 0.1 0:00.20 top
18326 root 20 0 0 0 0 I 3.7 0.0 0:00.09 kworker/5:1-events
25601 openhab 20 0 767912 612348 15064 S 3.7 30.0 0:00.01 pool-27-thread-
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
>>> Writing testing script
>>> Waiting for script to be loaded (10s), log shows the following:
2022-08-22 18:14:29.590 [INFO ] [chine.state_machine_-_washingmachine] - Device is now Off
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Checking CPU usage
top - 18:14:59 up 7:22, 3 users, load average: 1.15, 2.15, 2.13
Threads: 463 total, 3 running, 460 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.3 us, 5.3 sy, 0.0 ni, 78.3 id, 0.0 wa, 0.5 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 47.2 free, 806.1 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27717 openhab 20 0 767912 612700 15064 R 99.9 30.1 0:00.61 OH-file-process
14386 root 20 0 9016 3976 1864 S 25.0 0.2 1:16.86 htop
27720 root 20 0 10192 2808 2200 R 20.0 0.1 0:00.08 top
25267 openhab 20 0 767912 612700 15064 S 10.0 30.1 0:08.59 RXTXPortMonitor
27579 openhab 20 0 767912 612700 15064 S 5.0 30.1 0:00.50 Thread-386
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
>>> We will now wait for the CPU to settle down (60s), if it is not already in runaway
top - 18:15:59 up 7:23, 3 users, load average: 3.31, 2.66, 2.31
Threads: 455 total, 1 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 5.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.5 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 46.9 free, 806.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27750 root 20 0 10192 2772 2164 R 12.5 0.1 0:00.14 top
13373 root 20 0 0 0 0 I 4.2 0.0 0:03.86 kworker/u16:1-events_unbound
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.51 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.63 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root rt 0 0 0 0 S 0.0 0.0 0:00.53 migration/1
>>> Poke the script file again using the tee command
# touching script file
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Waiting for a bit (30s), then checking our CPU usage
top - 18:16:32 up 7:23, 3 users, load average: 5.39, 3.25, 2.52
Threads: 460 total, 9 running, 451 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.9 us, 7.4 sy, 0.0 ni, 3.2 id, 0.0 wa, 1.1 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 45.9 free, 807.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1136.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22083 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:13.44 GC Thread#3
22084 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#4
22085 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#5
22330 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.64 GC Thread#7
22064 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.99 GC Thread#0
22082 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.55 GC Thread#2
22081 openhab 20 0 769676 613404 15064 R 87.0 30.1 1:12.87 GC Thread#1
22329 openhab 20 0 769676 613404 15064 R 65.2 30.1 1:13.06 GC Thread#6
14386 root 20 0 9016 3976 1864 S 26.1 0.2 1:20.72 htop
27774 root 20 0 10192 2728 2120 R 21.7 0.1 0:00.13 top
3302 root 20 0 12720 6256 4592 S 4.3 0.3 0:06.75 sshd
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
It almost feels like the Odroid XU4 doesn't have enough horsepower to run openHAB, especially when loading JRuby scripts.
Looking at the last top output, it seems the garbage collector is doing the damage. I can confirm those GC threads are the top CPU users every time the system runs away, and the strace output is the same for the threads I checked with htop.
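A quick way to confirm the GC suspicion is jstat from the JDK (a sketch; it assumes jstat is on the PATH and that the pgrep pattern actually matches the openHAB JVM):

# If O (old generation) sits near 100% and FGC/GCT keep climbing while
# sampling once a second, the heap is exhausted and the GC threads are
# the ones burning CPU.
PID=$(pgrep -f openhab | head -n 1)   # hypothetical match pattern
jstat -gcutil "$PID" 1000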
Hi,
I wanted to circle back to this one.
Turns out that @Andrew_Rowe wasn't too far off: it was a Java heap space limit in the end. Basically I was running the Java process out of memory, which sent the garbage collector into overdrive.
It became clear when I added some more devices, after which openHAB basically came to a halt and started spewing out-of-memory errors. I increased the Xms and Xmx values, after which everything began running smoothly.
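For anyone landing here with the same symptoms: on a package-based Linux install, the limits can be raised through EXTRA_JAVA_OPTS in /etc/default/openhab, followed by a service restart (a sketch; pick values to suit the board's RAM, the XU4 here has 2 GB):

# /etc/default/openhab -- read by the openHAB service on startup.
# Start with a 512 MB heap and let it grow to 1 GB:
EXTRA_JAVA_OPTS="-Xms512m -Xmx1024m"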
I've done quite a bit of testing with scripts in the meantime. They still double-load when I use vim without any options set, but they have never brought openHAB down again.
Thanks for putting the time and effort into this @jimtng; I am sorry it turned out to be a memory limitation that sent us down a rabbit hole.