The demo effect at play. As I am typing up the bug report, I can’t reproduce it anymore.
Well, at least it makes the trigger a whole lot simpler.
Circling back to this issue: I've been setting both options on my vim instance when editing the rules, which seems to go OK most of the time, but sometimes the process still runs away. A couple of threads eat up all the CPU, load skyrockets, and openHAB slowly dies.
There is no double loading going on anymore.
The rule is also as simple as printing a log line with an item type.
Any idea how I can troubleshoot this and turn it into a valid bug report if needed?
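For reference, the options in question are vim's write-backup behaviour; a sketch of setting them permanently, as shell appending to ~/.vimrc (assumption: the exact options may differ from the ones meant above):

# Sketch: make vim write files in place, so the watcher sees a single
# modify event rather than a delete + create pair.
cat <<'EOF' >> ~/.vimrc
" write to the original file instead of rename + recreate
set backupcopy=yes
" skip the temporary backup file normally made while writing
set nowritebackup
EOF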
Perhaps create a shell script that repeatedly does this (a rough sketch follows below); if it manages to consistently trigger the problem, it could be submitted with the problem report?
Also try doing it with Rules DSL or some other scripting language, to pinpoint whether the fault is in JRuby or elsewhere in the core.
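Something like this, perhaps (a rough sketch; the path matches the one used later in the thread, and the iteration count and sleep are arbitrary):

#!/bin/bash
# Sketch: repeatedly recreate the rule file and sample the load average
# to see whether the runaway triggers consistently.
RULE=/etc/openhab/automation/jsr223/ruby/personal/test.rb
for i in $(seq 1 20); do
    cp /tmp/test.rb "$RULE"    # assumes a prepared copy of the rule in /tmp
    sleep 15                   # give the script watcher time to react
    echo "iteration $i: load $(cut -d' ' -f1-3 /proc/loadavg)"
done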
Here is a link to a thread in this forum about chasing down a memory leak.
I suspect this isn't a memory leak. I have seen openHAB using up the CPU in my dev/test instance too. It must have entered some sort of endless loop, possibly in the file watcher area. I haven't investigated this any further though, so this is just a guess.
If someone can create a reproducible test case, it would greatly help the core developers in getting to the bottom of it.
OK, I've done some more testing, and I can now quite reliably kill my system.
To start off with, the code used is just a simple "log this every minute" rule:
JRuby:
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Rules DSL:
rule "Log every minute"
when
    Time cron "0 * * * * ?"
then
    logInfo("CronTest", "Test rule executed")
end
The first thing I noticed was that when the file is first created, whether via vim (with backup options disabled), via cat <<EOF, or by creating a temp file and copying it into the directory, both JRuby and Rules DSL trigger the "double loading" (see the watcher sketch after the log excerpt):
2022-08-22 10:46:11.872 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:46:19.999 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:44:58.428 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
2022-08-22 10:44:59.660 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
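A way to check what the script watcher actually sees is inotifywait from the inotify-tools package (a hedged sketch; editors and copies that go through a temp file or a fresh inode typically emit a create event plus a close-write event, which would line up with two load passes):

# Sketch: stream raw inotify events for the scripts directory while
# saving the file via vim, cat <<EOF, or tee. A create + close_write
# pair on a new file would match the double load; an in-place append
# via tee -a should show only modify/close_write.
inotifywait -m -e create -e modify -e delete -e moved_to -e close_write \
    /etc/openhab/automation/jsr223/ruby/personal/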
If we later add something to the file, which I chose to do using tee, we do not get the double entry, just a single one. (Example: echo '# foo' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb)
I've done the creation, deletion, and adding of lines with Rules DSL, and I could not get the system to run away. I've also noticed that loading the rule is quite a bit quicker; in the "double loading" case you can see the difference in the timestamps as well.
Trying this same rule as JRuby is almost a surefire way of getting it to go out of control; sometimes it requires two tries. In this case it does not matter whether you copy/cat the file in (that is maybe a bit more reliable at triggering the issue) or add a line or two to the file.
There will be a couple of threads, all at 100% CPU usage, and load skyrockets. I've tried to strace a couple of the threads to see what's going on, and they all seem to loop with more or less the following output:
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781332861}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781464652}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781598152}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781730985}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781862609}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781995401}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782126983}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782260650}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782392274}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782524524}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782658190}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782791773}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782923814}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783058355}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783190230}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783326229}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783460937}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783594562}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783726895}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783865061}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784068477}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784181018}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784277809}) = 0
futex(0xb6028978, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
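(For the record: the trace above is plain strace -p against one of the busy thread ids. A sketch of slightly richer captures, assuming a reasonably recent strace build:)

# TID = a 100%-CPU thread id taken from `top -H` or htop.
# Summarise syscall counts for ten seconds instead of streaming:
timeout 10 strace -c -p "$TID"
# Or keep the stream but add wall-clock timestamps (-tt) and per-call
# time (-T), written to a file that can be attached to a bug report:
strace -tt -T -p "$TID" -o /tmp/openhab-thread.trace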
I'd love to hear your thoughts on this @jimtng; let me know if you want me to grab more extensive strace output, and whether you have preferred output options.
EDIT:
I created a test.rb file with this in it:
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Then I ran this command several times (up arrow, enter, repeatedly).
I tried this on my production system (openHAB docker, on Ubuntu 20) but couldn't make my openHAB spike to 100%. It just reported that it detected a change and reloaded the script. I noticed the openHAB process briefly went up to 25-29%, then back down to 0-5% (its normal state).
Is that the same test you did?
I'll be afk for a week (going camping, with no laptop). I'd suggest posting this as an issue over on Issues · boc-tothefuture/openhab-jruby · GitHub so we can keep track of it better.
Yes, that usually does it.
I cobbled together a simple test script:
#!/bin/bash
check_load() { top -b -n1 -H | head -n 20; }
echo -e '>>> Starting testing\n'
# Print a baseline CPU usage
check_load
echo -e '\n\n>>> Writing testing script\n'
cat <<EOF > /etc/openhab/automation/jsr223/ruby/personal/test.rb
require 'openhab'
rule 'Log every minute' do
cron '0 * * * * ?'
run { logger.info "Test rule executed" }
end
EOF
echo -e '>>> Waiting for script to be loaded (10s), log shows the following:\n'; sleep 10
tail -n 5 /var/log/openhab/openhab.log
echo -e '\n\n>>> Checking CPU usage\n'
check_load
echo -e '\n\n>>> We will now wait for the CPU to settle down (60s), if it is not already in runaway\n'; sleep 60
check_load
echo -e '\n\n>>> Poke the script file again using the tee command\n'
echo '# touching script file' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb; sleep 2; echo
tail -n 3 /var/log/openhab/openhab.log
echo -e '\n\n>>> Waiting for a bit (30s), then checking our CPU usage\n'; sleep 30
check_load
And this results in the following output:
>>> Starting testing
top - 18:14:49 up 7:22, 3 users, load average: 0.82, 2.12, 2.12
Threads: 462 total, 1 running, 461 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.6 us, 4.7 sy, 0.0 ni, 92.3 id, 0.0 wa, 0.4 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 47.1 free, 806.3 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27711 root 20 0 10192 2728 2120 R 29.6 0.1 0:00.20 top
18326 root 20 0 0 0 0 I 3.7 0.0 0:00.09 kworker/5:1-events
25601 openhab 20 0 767912 612348 15064 S 3.7 30.0 0:00.01 pool-27-thread-
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
>>> Writing testing script
>>> Waiting for script to be loaded (10s), log shows the following:
2022-08-22 18:14:29.590 [INFO ] [chine.state_machine_-_washingmachine] - Device is now Off
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Checking CPU usage
top - 18:14:59 up 7:22, 3 users, load average: 1.15, 2.15, 2.13
Threads: 463 total, 3 running, 460 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.3 us, 5.3 sy, 0.0 ni, 78.3 id, 0.0 wa, 0.5 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 47.2 free, 806.1 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27717 openhab 20 0 767912 612700 15064 R 99.9 30.1 0:00.61 OH-file-process
14386 root 20 0 9016 3976 1864 S 25.0 0.2 1:16.86 htop
27720 root 20 0 10192 2808 2200 R 20.0 0.1 0:00.08 top
25267 openhab 20 0 767912 612700 15064 S 10.0 30.1 0:08.59 RXTXPortMonitor
27579 openhab 20 0 767912 612700 15064 S 5.0 30.1 0:00.50 Thread-386
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
>>> We will now wait for the CPU to settle down (60s), if it is not already in runaway
top - 18:15:59 up 7:23, 3 users, load average: 3.31, 2.66, 2.31
Threads: 455 total, 1 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 5.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.5 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 46.9 free, 806.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27750 root 20 0 10192 2772 2164 R 12.5 0.1 0:00.14 top
13373 root 20 0 0 0 0 I 4.2 0.0 0:03.86 kworker/u16:1-events_unbound
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.51 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.63 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root rt 0 0 0 0 S 0.0 0.0 0:00.53 migration/1
>>> Poke the script file again using the tee command
# touching script file
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Waiting for a bit (30s), then checking our CPU usage
top - 18:16:32 up 7:23, 3 users, load average: 5.39, 3.25, 2.52
Threads: 460 total, 9 running, 451 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.9 us, 7.4 sy, 0.0 ni, 3.2 id, 0.0 wa, 1.1 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 45.9 free, 807.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1136.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22083 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:13.44 GC Thread#3
22084 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#4
22085 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#5
22330 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.64 GC Thread#7
22064 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.99 GC Thread#0
22082 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.55 GC Thread#2
22081 openhab 20 0 769676 613404 15064 R 87.0 30.1 1:12.87 GC Thread#1
22329 openhab 20 0 769676 613404 15064 R 65.2 30.1 1:13.06 GC Thread#6
14386 root 20 0 9016 3976 1864 S 26.1 0.2 1:20.72 htop
27774 root 20 0 10192 2728 2120 R 21.7 0.1 0:00.13 top
3302 root 20 0 12720 6256 4592 S 4.3 0.3 0:06.75 sshd
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
It almost feels like the Odroid XU4 doesn't have enough horsepower to run openHAB, especially when loading JRuby scripts.
Looking at the last top output, it seems the garbage collector is doing the damage. I can confirm those GC threads are the top CPU users every time the system runs away, and the strace output is the same for the threads I checked with htop.
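A quick way to confirm the GC suspicion is jstat from the JDK (a sketch; it assumes jstat is on the PATH and that the pgrep pattern actually matches the openHAB JVM):

# If O (old generation) sits near 100% and FGC/GCT keep climbing while
# sampling once a second, the heap is exhausted and the GC threads are
# the ones burning CPU.
PID=$(pgrep -f openhab | head -n 1)   # hypothetical match pattern
jstat -gcutil "$PID" 1000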
Hi,
I wanted to circle back to this one.
Turns out that @Andrew_Rowe wasn't too far off: it was a Java heap space limit in the end. Basically I was running the Java process out of memory, which sent the garbage collector into overdrive.
It became clear when I added some more devices, after which openHAB basically came to a halt and started spewing out-of-memory errors. I increased the Xms and Xmx values, after which everything began running smoothly.
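For anyone landing here with the same symptoms: on a package-based Linux install, the limits can be raised through EXTRA_JAVA_OPTS in /etc/default/openhab, followed by a service restart (a sketch; pick values to suit the board's RAM, the XU4 here has 2 GB):

# /etc/default/openhab -- read by the openHAB service on startup.
# Start with a 512 MB heap and let it grow to 1 GB:
EXTRA_JAVA_OPTS="-Xms512m -Xmx1024m"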
I've done quite a bit of testing with scripts in the meantime. They still double-load when I use vim without any options set, but they have never brought openHAB down again.
Thanks for putting the time and effort into this @jimtng; I am sorry it turned out to be a memory limitation that sent us down a rabbit hole.