Astro actions in JRuby

That would be plain vim, but you might have a point there. Just touch’ing the file won’t trigger the double entries.

I’ve been using VSCode and it doesn’t cause the double notify / double loading. Try a Google search for “vim inotify twice” and see whether it’s related to your openHAB crashing.

I can now confirm :slight_smile:

sed -i 's/astro_t/astro_time/' test.rb
sed -i 's/astro_time/astro_t/' test.rb

No issues whatsoever. A single ScriptFileWatcher line per command.

vim -c ':%s/astro_t/astro_time/g' -c ':wq' test.rb
vim -c ':%s/astro_time/astro_t/g' -c ':wq' test.rb

Double log entries, and the system is going down.

I’ve been editing the script files using vim for quite some time now, and with these astro actions, it’s the first time this has manifested itself.
I guess it’s good to know, and might also be related to the limited CPU power of the device openHAB is running on.
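For what it’s worth, if anyone wants to see what the file watcher sees, inotifywait (from the inotify-tools package, assuming it’s available on your system) should show the difference between a vim write and a plain touch. A rough sketch:

# terminal 1: watch the rules directory for file events
inotifywait -m -e create,close_write,moved_to /etc/openhab/automation/jsr223/ruby/personal/

# terminal 2: make an edit the way that triggers the problem
vim -c ':%s/astro_t/astro_time/g' -c ':wq' /etc/openhab/automation/jsr223/ruby/personal/test.rb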

Sorry for sending you all down a rabbit-hole!

1 Like

It’s hardly obvious. Well done for spotting and reporting the double rule load, few would.

My guess is the underlying issue is some horror about (re)loading rules before the last attempt has finished, and likely something about Actions in general or Astro in particular.

It would be “nice” to get that fixed, because I’ll bet it can manifest in other nasty obscure ways. So I’d recommend logging a GitHub issue for posterity, outlining how you can reproduce this.
I doubt it’ll get looked at, to be honest, but it might help someone else one day.

I have encountered this double loading too, although not related to vim. It happens programmatically when moving a (temp) script file from a different mountpoint on a Linux filesystem to conf/automation/jsr223/ruby/personal/. I just implemented a workaround by creating that temp file in the same mountpoint before moving it. Since this was easily avoidable I didn’t bother opening an issue with openhab-core and just moved on.
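Roughly like this, in case it helps anyone (the paths are just examples, adjust them to your own layout):

#!/bin/bash

# build the script in a temp file on the SAME filesystem as the target directory,
# so the final step is a rename instead of a cross-mountpoint copy
target_dir=/etc/openhab/automation/jsr223/ruby/personal
tmp_file=$(mktemp "$target_dir/.test.rb.XXXXXX")

cat <<'EOF' > "$tmp_file"
require 'openhab'
# ... generated rules go here ...
EOF

# mv within one filesystem is a rename, so the watcher only sees a single event for test.rb
mv "$tmp_file" "$target_dir/test.rb"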

Come to think of it, I have had CPUs going crazy (caused by openHAB) on my development laptop too, but I didn’t link the two issues together and instead just moved development onto a Linux server. I’m not sure that I experienced or noticed the double loading on my laptop, so I’m not sure if they’re related.

I can confirm that this happened on my system too, except my openHAB process didn’t crash, nor did it cause a CPU spike. My system has lots of RAM, so perhaps that helped. I am also using an openHAB 3.4 snapshot from a few days ago.

Furthermore, I tried this on a RulesDSL file in conf/rules/test.rules and yes, openhab loaded the rule twice there too.

This vim command will not trigger the double loading:

vim -c ':set nowritebackup' -c ':set nobackup' -c ':%s/astro_t1/astro_t2/g' -c ':wq' test.rules 
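If you’d rather not type those flags on every invocation, the same settings can go into ~/.vimrc (this is just the standard vim way of disabling the backup-file write, nothing openHAB-specific):

" write the buffer in place instead of going via a backup file,
" so the file watcher only sees one change per :w
set nobackup
set nowritebackup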

I thought it might be nice to share where I was going with this. Maybe someone else can take inspiration from it :slight_smile:

This is part of my bedroom automation, closing the blinds when the sun’s down and the lights are turned on.

require 'openhab'

BEDROOM_LIGHTS = Bedroom
  .equipments(Semantics::Lightbulb)
  .members
  .points(Semantics::Control)

rule 'Turn off lights when sleep tracking has started' do
  changed SleepasAndroid_Event,
    from: ->f { ! ['sleep_tracking_started', 'sleep_tracking_paused'].include?(f) },
    to:   ->t {   ['sleep_tracking_started', 'sleep_tracking_paused'].include?(t) }

  delay 5.seconds

  run do
    # Turn the lights off
    BEDROOM_LIGHTS.ensure.off

    # Close the blinds
    BlindsBedroomWindow_Vane.ensure.command(0)
  end
end

rule 'Close blinds when it is dark outside and lights are on' do
  changed BEDROOM_LIGHTS, to: ->t { t.on? }

  run do |event|
    logger.info("#{event.item} turned on while dark outside, closing the blinds")
    BlindsBedroomWindow_Position.ensure.down
    BlindsBedroomWindow_Vane.ensure.command(0)
  end

  only_if { actions("astro","astro:sun:local").getElevation(nil) <= -6 }
end

Interesting use of lambda! I assume it’s because your lights can either be a switch or a dimmer, so using lambda -> t { t.on? } covers both cases?

Would you mind sharing which blinds you’re using?

I’m also wondering why you aren’t simply using this trigger:

rule 'Turn off lights when sleep tracking has started' do
  changed SleepasAndroid_Event, to: %w[sleep_tracking_started sleep_tracking_paused]

You are completely right. I have a switched light, and a ceiling dimmer in the bedroom. The on? method works on both.

Sure! I have Luxaflex (HunterDouglas) Powerview blinds around the house, the vertical venetian type. In the bedroom they have room darkening fabric, which works really well :slight_smile:

For one, I totally forgot about the word-array operator in Ruby, and the fact that you can pass the to: parameter an array. But! There is also an issue I ran into:
I’d like to trigger the rule only at the very start of sleep tracking. When SleepAsAndroid starts the session, it will pause tracking for a couple of minutes, sending sleep_tracking_started and sleep_tracking_paused in very quick succession. While both end up fine on the MQTT broker, openHAB seems to have trouble keeping up, and sometimes just skips the first sleep_tracking_started state.

Without this bug, it could simply be to: "sleep_tracking_started".

If openhab skips it, I would call that a bug in openhab! In any case, even if you got only one of them, won’t it work the same anyway?

The demo effect at play. As I am typing up the bug report, I can’t reproduce it anymore.

Well, at least it makes the trigger a whole lot simpler :smiley:

1 Like

Circling back to this issue: I’ve been setting both options on my vim instance when editing the rules, which goes OK most of the time, but sometimes the process still runs away. A couple of threads eat up all the CPU, the load skyrockets, and openHAB slowly dies.
There is no double loading going on anymore.

The rule is also as simple as printing a log line with an item type.

Any idea how I can troubleshoot this and turn it into a valid bug report if needed?

Perhaps create a shell script that does this repeatedly, and if it manages to consistently trigger the problem, that could be submitted with the bug report?

Also try doing it with RulesDSL or some other scripting language, to pinpoint whether the fault is in JRuby or elsewhere in the core.

Here is a link to a thread in this forum about chasing down a mem leak

I suspect this isn’t a mem leak. I have seen openhab using up the cpu in my dev / test instance too. It must have entered some sort of endless loop, possibly in the file watcher area. I haven’t investigated this any further though, so this is just a guess.

If someone can create a reproducible test case, it would greatly help the core developers in getting to the bottom of it.
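If you can catch it while it’s spinning, a Java thread dump would show what those 100% CPU threads are actually doing. A rough sketch, assuming a JDK (for jstack) is installed and openHAB is the only java process on the box:

# dump all Java thread stacks of the openHAB JVM
pid=$(pidof java)
jstack -l "$pid" > /tmp/openhab-threads.txt

# 'top -H' shows thread ids in decimal; jstack prints them as hex in the nid= field
busy_tid=12345   # replace with the TID of a spinning thread from 'top -H'
printf 'nid=0x%x\n' "$busy_tid"   # search for this value in the dump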

OK, I’ve done some more testing, and I can now quite reliably kill my system.

To start off with the code used: just a simple ‘log this every minute’ rule.

JRuby

require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end

Rule DSL

rule "Log every minute"
when
  Time cron "0 * * * * ?"
then
  logInfo("CronTest", "Test rule executed")
end

The first thing I noticed was that when you first create the file, whether by vim (with backup options disabled), cat <<EOF, or by creating a temp file and copying it into the directory, both JRuby and Rule DSL trigger the “double loading”.

2022-08-22 10:46:11.872 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:46:19.999 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'


2022-08-22 10:44:58.428 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
2022-08-22 10:44:59.660 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'

If we later add something to the file, which I chose to do using tee, we do not get the double entry, just a single one. (Example: echo '# foo' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb)

I’ve done the creation, deletion, and adding of lines with Rule DSL, and I could not get the system to run away. I’ve also noticed that loading the rule is quite a bit quicker; in the case of the “double loading” you can see the difference in the timestamps as well.

Trying this same rule as JRuby is almost a surefire way of getting it to go out of control. Sometimes it requires two tries. In this case it does not matter whether you copy/cat the file in (that’s maybe a bit more reliable at triggering the issue) or add a line or two to the file.

There will be a couple of threads, all at 100% CPU usage, and the load skyrockets. I’ve tried to strace a couple of the threads to see what’s going on, and they all seem to loop with more or less the following output.

clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781332861}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781464652}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781598152}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781730985}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781862609}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781995401}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782126983}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782260650}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782392274}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782524524}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782658190}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782791773}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782923814}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783058355}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783190230}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783326229}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783460937}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783594562}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783726895}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783865061}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784068477}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784181018}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784277809}) = 0
futex(0xb6028978, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0

I’d love to hear your thoughts on this @JimT, let me know if you want me to grab more extensive strace output and whether you have preferred options.

1 Like

EDIT:

I created a test.rb file with this in it:

require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end

Then I ran this command several times (up arrow, enter, repeatedly).

I tried this on my production system (openHAB in docker, on Ubuntu 20) but couldn’t make my openHAB spike to 100%. It just reported that it detected a change and reloaded the script. I noticed the openHAB process went up to 25-29% briefly, then back down to 0-5% (its normal state).

Is that the same test you did?

I’ll be afk for a week (going camping with no laptop), I’d suggest posting this as an issue over on Issues · boc-tothefuture/openhab-jruby · GitHub so we can keep track of it better.

Yes, that usually does it.

I cobbled together a simple test script:

#!/bin/bash

# snapshot the 20 busiest threads (top in batch + threads mode) to spot any runaway CPU consumer
check_load() { top -b -n1 -H | head -n 20; }

echo -e '>>> Starting testing\n'

# Print a baseline CPU usage
check_load

echo -e '\n\n>>> Writing testing script\n'

cat <<EOF > /etc/openhab/automation/jsr223/ruby/personal/test.rb
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
EOF

echo -e '>>> Waiting for script to be loaded (10s), log show the following: \n'; sleep 10

tail -n 5 /var/log/openhab/openhab.log

echo -e '\n\n>>> Checking CPU usage\n'

check_load

echo -e '\n\n>>> We will now wait for the CPU to settle down (60s), if its not already in run-away\n'; sleep 60

check_load

echo -e '\n\n>>> Poke the script file again using the tee command\n'

echo '# touching script file' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb; sleep 2; echo

tail -n 3 /var/log/openhab/openhab.log

echo -e '\n\n>>> Waiting for a bit (30s), and lets see our CPU usage\n'; sleep 30

check_load

And this results in the following output:

>>> Starting testing

top - 18:14:49 up  7:22,  3 users,  load average: 0.82, 2.12, 2.12
Threads: 462 total,   1 running, 461 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.6 us,  4.7 sy,  0.0 ni, 92.3 id,  0.0 wa,  0.4 hi,  0.0 si,  0.0 st
MiB Mem :   1990.4 total,     47.1 free,    806.3 used,   1137.1 buff/cache
MiB Swap:    995.2 total,    995.0 free,      0.2 used.   1137.3 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
27711 root      20   0   10192   2728   2120 R  29.6   0.1   0:00.20 top
18326 root      20   0       0      0      0 I   3.7   0.0   0:00.09 kworker/5:1-events
25601 openhab   20   0  767912 612348  15064 S   3.7  30.0   0:00.01 pool-27-thread-
    1 root      20   0   31516   7804   5416 S   0.0   0.4   0:08.97 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0   0.0   0:00.14 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0   0.0   0:03.49 rcu_preempt
   11 root      rt   0       0      0      0 S   0.0   0.0   0:00.62 migration/0
   12 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1


>>> Writing testing script

>>> Waiting for script to be loaded (10s), log show the following:

2022-08-22 18:14:29.590 [INFO ] [chine.state_machine_-_washingmachine] - Device is now Off
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'


>>> Checking CPU usage

top - 18:14:59 up  7:22,  3 users,  load average: 1.15, 2.15, 2.13
Threads: 463 total,   3 running, 460 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  5.3 sy,  0.0 ni, 78.3 id,  0.0 wa,  0.5 hi,  0.5 si,  0.0 st
MiB Mem :   1990.4 total,     47.2 free,    806.1 used,   1137.1 buff/cache
MiB Swap:    995.2 total,    995.0 free,      0.2 used.   1137.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
27717 openhab   20   0  767912 612700  15064 R  99.9  30.1   0:00.61 OH-file-process
14386 root      20   0    9016   3976   1864 S  25.0   0.2   1:16.86 htop
27720 root      20   0   10192   2808   2200 R  20.0   0.1   0:00.08 top
25267 openhab   20   0  767912 612700  15064 S  10.0  30.1   0:08.59 RXTXPortMonitor
27579 openhab   20   0  767912 612700  15064 S   5.0  30.1   0:00.50 Thread-386
    1 root      20   0   31516   7804   5416 S   0.0   0.4   0:08.97 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0   0.0   0:00.14 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0   0.0   0:03.49 rcu_preempt
   11 root      rt   0       0      0      0 S   0.0   0.0   0:00.62 migration/0


>>> We will now wait for the CPU to settle down (60s), if its not already in run-away

top - 18:15:59 up  7:23,  3 users,  load average: 3.31, 2.66, 2.31
Threads: 455 total,   1 running, 454 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  5.0 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :   1990.4 total,     46.9 free,    806.3 used,   1137.2 buff/cache
MiB Swap:    995.2 total,    995.0 free,      0.2 used.   1137.3 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
27750 root      20   0   10192   2772   2164 R  12.5   0.1   0:00.14 top
13373 root      20   0       0      0      0 I   4.2   0.0   0:03.86 kworker/u16:1-events_unbound
    1 root      20   0   31516   7804   5416 S   0.0   0.4   0:08.97 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0   0.0   0:00.14 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0   0.0   0:03.51 rcu_preempt
   11 root      rt   0       0      0      0 S   0.0   0.0   0:00.63 migration/0
   12 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0   0.0   0:00.53 migration/1


>>> Poke the script file again using the tee command

# touching script file

2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'


>>> Waiting for a bit (30s), and lets see our CPU usage

top - 18:16:32 up  7:23,  3 users,  load average: 5.39, 3.25, 2.52
Threads: 460 total,   9 running, 451 sleeping,   0 stopped,   0 zombie
%Cpu(s): 87.9 us,  7.4 sy,  0.0 ni,  3.2 id,  0.0 wa,  1.1 hi,  0.5 si,  0.0 st
MiB Mem :   1990.4 total,     45.9 free,    807.3 used,   1137.2 buff/cache
MiB Swap:    995.2 total,    995.0 free,      0.2 used.   1136.2 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
22083 openhab   20   0  769676 613404  15064 R  95.7  30.1   1:13.44 GC Thread#3
22084 openhab   20   0  769676 613404  15064 R  95.7  30.1   1:12.86 GC Thread#4
22085 openhab   20   0  769676 613404  15064 R  95.7  30.1   1:12.86 GC Thread#5
22330 openhab   20   0  769676 613404  15064 R  95.7  30.1   1:12.64 GC Thread#7
22064 openhab   20   0  769676 613404  15064 R  91.3  30.1   1:12.99 GC Thread#0
22082 openhab   20   0  769676 613404  15064 R  91.3  30.1   1:12.55 GC Thread#2
22081 openhab   20   0  769676 613404  15064 R  87.0  30.1   1:12.87 GC Thread#1
22329 openhab   20   0  769676 613404  15064 R  65.2  30.1   1:13.06 GC Thread#6
14386 root      20   0    9016   3976   1864 S  26.1   0.2   1:20.72 htop
27774 root      20   0   10192   2728   2120 R  21.7   0.1   0:00.13 top
 3302 root      20   0   12720   6256   4592 S   4.3   0.3   0:06.75 sshd
    1 root      20   0   31516   7804   5416 S   0.0   0.4   0:08.97 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd

It almost feels like the Odroid XU4 does not have enough horsepower to run openHAB, especially when loading JRuby scripts.

Looking at the last top output, it seems the garbage collector is doing the damage. I can confirm those GC threads are the top consumers every time the system runs away, and the strace output of the threads I checked via htop looks the same as above.
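The next step is probably to confirm it really is GC churn rather than the collector just being the visible symptom. Something like jstat can show heap occupancy and collection counts live; a rough sketch, again assuming a JDK is available on the box:

# one line per second: survivor/eden/old-gen occupancy (%) plus GC counts and accumulated times
jstat -gcutil $(pidof java) 1000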

Hi,

I wanted to circle back to this one.

Turns out that @Andrew_Rowe wasn’t too far off: it was a Java heap space limit in the end. Basically I was running the Java process out of memory, which sent the garbage collector into overdrive.

It became clear when I added some more devices, after which openHAB basically came to a halt and started spewing out-of-memory errors. I increased the Xms and Xmx values, and everything has been running smoothly since.
I’ve done quite a bit of testing with scripts in the meantime; they still trigger the double loading when using vim without the backup options, but they have never brought openHAB down again :slight_smile:
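For anyone running into the same thing: on a Debian/Ubuntu package install the heap settings can be passed via EXTRA_JAVA_OPTS in /etc/default/openhab (the values below are only what I would try on a 2 GB box, not an official recommendation):

# /etc/default/openhab
EXTRA_JAVA_OPTS="-Xms512m -Xmx1024m"

# then restart the service
sudo systemctl restart openhab.service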

Thanks for putting the time and effort into this @JimT, I am sorry it turned out to be a memory limitation that made us go down a rabbit-hole.

3 Likes