That would be plain vim, but you might have a point there. Just touch’ing the file won’t trigger the double entries.
I’ve been using VSCode and it doesn’t cause the double notify / double loading. Try searching for “vim inotify twice” and see whether it’s related to your openHAB crashing.
I can now confirm:
sed -i 's/astro_t/astro_time/' test.rb
sed -i 's/astro_time/astro_t/' test.rb
No issues whatsoever. A single ScriptFileWatcher line per command.
vim -c ':%s/astro_t/astro_time/g' -c ':wq' test.rb
vim -c ':%s/astro_time/astro_t/g' -c ':wq' test.rb
Double log entries, and the system goes down.
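For what it’s worth, the difference likely comes down to how each tool touches the file: sed -i writes a temp file and renames it over the original (one atomic replace), while vim’s default backup behaviour involves extra writes, which is presumably why the watcher fires twice. Here is a minimal plain-Ruby sketch (not openHAB code; all names below are made up for illustration) of the two edit styles:

```ruby
require "tmpdir"

# Plain-Ruby sketch (no openHAB involved): compare the two edit styles
# a file watcher would see. All names here are made up for illustration.
def edit_styles_inode_check
  Dir.mktmpdir do |dir|
    path = File.join(dir, "test.rb")
    File.write(path, "astro_t = 1\n")
    original_inode = File.stat(path).ino

    # "sed -i" style: write the new content to a temp file, then atomically
    # rename it over the original. The path now points at a new inode.
    tmp = "#{path}.tmp"
    File.write(tmp, File.read(path).gsub("astro_t", "astro_time"))
    File.rename(tmp, path)
    after_rename = File.stat(path).ino

    # "tee -a" style: append to the existing file in place, same inode.
    File.open(path, "a") { |f| f.puts "# poked" }
    after_append = File.stat(path).ino

    [after_rename != original_inode, after_append != after_rename]
  end
end

rename_changed, append_changed = edit_styles_inode_check
puts "rename replaced the inode: #{rename_changed}"   # true
puts "append kept the same inode: #{!append_changed}" # true
```

This only illustrates the filesystem side; vim’s exact event sequence depends on its backup, writebackup, and backupcopy settings.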
I’ve been editing the script files using vim for quite some time now, and with these astro actions, it’s the first time this manifests itself.
I guess it’s good to know, and might also be related to the limited CPU power of the device openHAB is running on.
Sorry for sending you all down a rabbit-hole!
It’s hardly obvious. Well done for spotting and reporting the double rule load; few would have.
I guess the underlying issue is some horror about (re)loading rules before the last attempt has finished, and it’s likely about Actions in general or Astro in particular.
It would be “nice” to get that fixed, because I’ll bet it can manifest in other nasty, obscure ways. So I’d recommend logging a GitHub issue for posterity, outlining how you can reproduce this.
I doubt it’ll get looked at, to be honest, but it might help someone else one day.
I have encountered this double loading too, although not related to vim. It happens programmatically when moving a (temp) script file from a different mountpoint in a Linux filesystem to conf/automation/jsr223/ruby/personal/. I implemented a workaround by creating the temp file on the same mountpoint before moving it. Since this was easily avoidable, I didn’t bother opening an issue with openhab-core and just moved on.
Come to think of it, I have had CPUs going crazy (caused by openHAB) on my development laptop too, but I didn’t link the two issues together and instead just moved development to a Linux server. I’m not sure that I experienced or noticed the double loading on my laptop, so I’m not sure if they’re related.
I can confirm that this happened on my system too, except my openHAB process didn’t crash, nor did it cause a CPU spike. My system has lots of RAM, so perhaps that helped. I am also using an openHAB 3.4 snapshot from a few days ago.
Furthermore, I tried this on a Rules DSL file in conf/rules/test.rules, and yes, openHAB loaded the rule twice there too.
This vim command will not trigger the double loading:
vim -c ':set nowritebackup' -c ':set nobackup' -c ':%s/astro_t1/astro_t2/g' -c ':wq' test.rules
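If you want this permanently instead of passing -c flags each time, the same options can presumably go in ~/.vimrc. Note this disables vim’s backup-file safety net entirely, so it’s a trade-off:

```vim
" ~/.vimrc: avoid the extra file writes that seem to trigger the double load
set nobackup        " do not keep a backup file after writing
set nowritebackup   " do not write a backup file before overwriting
```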
I thought it might be nice to share where I was going with this. Maybe someone else can take inspiration from it.
This is part of my bedroom automation, closing the blinds when the sun’s down and the lights are turned on.
require 'openhab'

BEDROOM_LIGHTS = Bedroom
                 .equipments(Semantics::Lightbulb)
                 .members
                 .points(Semantics::Control)

rule 'Turn off lights when sleep tracking has started' do
  changed SleepasAndroid_Event,
          from: ->(f) { !['sleep_tracking_started', 'sleep_tracking_paused'].include?(f) },
          to: ->(t) { ['sleep_tracking_started', 'sleep_tracking_paused'].include?(t) }
  delay 5.seconds
  run do
    # Turn the lights off
    BEDROOM_LIGHTS.ensure.off
    # Close the blinds
    BlindsBedroomWindow_Vane.ensure.command(0)
  end
end
rule 'Close blinds when it is dark outside and lights are on' do
  changed BEDROOM_LIGHTS, to: ->(t) { t.on? }
  run do |event|
    logger.info("#{event.item} turned on while dark outside, closing the blinds")
    BlindsBedroomWindow_Position.ensure.down
    BlindsBedroomWindow_Vane.ensure.command(0)
  end
  only_if { actions("astro", "astro:sun:local").getElevation(nil) <= -6 }
end
Interesting use of lambda! I assume it’s because your lights can be either a switch or a dimmer, so using the lambda ->(t) { t.on? } covers both cases?
Would you mind sharing which blinds you’re using?
I’m also wondering why you aren’t simply using this trigger:
rule 'Turn off lights when sleep tracking has started' do
  changed SleepasAndroid_Event, to: %w[sleep_tracking_started sleep_tracking_paused]
You are completely right. I have a switched light and a ceiling dimmer in the bedroom. The on? method works on both.
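To illustrate (with toy stand-ins, not the real openHAB item classes): the lambda only needs the state it receives to respond to on?, so switch and dimmer states both work:

```ruby
# Toy stand-in classes for illustration; the real openHAB item states
# differ, but the duck-typing idea is the same.
SwitchState = Struct.new(:state) do
  def on?
    state == "ON"
  end
end

DimmerState = Struct.new(:level) do
  def on?
    level.positive?
  end
end

# The trigger lambda only cares that its argument responds to #on?.
light_is_on = ->(t) { t.on? }

puts light_is_on.call(SwitchState.new("ON")) # true
puts light_is_on.call(DimmerState.new(40))   # true
puts light_is_on.call(DimmerState.new(0))    # false
```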
Sure! I have Luxaflex (HunterDouglas) PowerView blinds around the house, the vertical venetian type. In the bedroom they have room-darkening fabric, which works really well.
For one, I totally forgot about the word-array operator in Ruby, and the fact that you can pass the to: parameter an array. But there is also an issue I ran into: I’d like to trigger the rule only at the very start of sleep tracking. When SleepAsAndroid starts the session, it will pause tracking for a couple of minutes, sending sleep_tracking_started and sleep_tracking_paused in very quick succession. While both end up fine on the MQTT broker, openHAB seems to have trouble keeping up and sometimes just skips the first sleep_tracking_started state. Without this bug, it could simply be to: 'sleep_tracking_started'.
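As a quick reference for anyone else who forgot it, the word-array operator is plain shorthand for an array of strings:

```ruby
# %w[...] builds an array of strings without quotes or commas.
states = %w[sleep_tracking_started sleep_tracking_paused]

puts states == ["sleep_tracking_started", "sleep_tracking_paused"] # true
puts states.include?("sleep_tracking_paused")                      # true
```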
If openHAB skips it, I would call that a bug in openHAB! In any case, even if you get only one of them, won’t it work the same anyway?
The demo effect at play. As I am typing up the bug report, I can’t reproduce it anymore.
Well, at least it makes the trigger a whole lot simpler.
Circling back to this issue; I’ve been setting both options on my vim instance when editing the rules, which seems to go OK most of the time, but sometimes the process still runs away. A couple of threads eat up all CPU, load skyrockets, and openHAB slowly dies.
There is no double loading going on anymore.
The rule is also as simple as printing a log line with an item type.
Any idea how I can troubleshoot this and turn it into a valid bug report if needed?
Perhaps create a shell script that repeatedly does this; if it manages to consistently trigger the problem, it could be submitted with the bug report?
Also try doing it with Rules DSL or some other scripting language, to pinpoint whether the fault is in JRuby or elsewhere in the core.
Here is a link to a thread in this forum about chasing down a memory leak.
I suspect this isn’t a memory leak. I have seen openHAB using up the CPU on my dev/test instance too. It must have entered some sort of endless loop, possibly in the file-watcher area. I haven’t investigated this any further though, so this is just a guess.
If someone can create a reproducible test case, it would greatly help the core developers in getting to the bottom of it.
OK, I’ve been doing some more testing, and I can now quite reliably kill my system.
To start off, the code used is just a simple “log this every minute” rule:
JRuby
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Rule DSL
rule "Log every minute"
when
  Time cron "0 * * * * ?"
then
  logInfo("CronTest", "Test rule executed")
end
The first thing I noticed was that when you first create the file, whether via vim (with backup options disabled), cat <<EOF, or by creating a temp file and copying it into the directory, both JRuby and Rules DSL trigger the “double loading”:
2022-08-22 10:46:11.872 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:46:19.999 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 10:44:58.428 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
2022-08-22 10:44:59.660 [INFO ] [el.core.internal.ModelRepositoryImpl] - Loading model 'test.rules'
If we later add something to the file, which I chose to do using tee, we do not get the double entry, just a single one. (Example: echo '# foo' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb)
I’ve done the creation, deletion, and adding of lines with Rules DSL, and I could not get the system to run away. I’ve also noticed that loading the rule is quite a bit quicker; in the case of the “double loading” you can see the difference in the timestamps as well.
Trying this same rule as JRuby is almost a surefire way of getting it to go out of control; sometimes it requires two tries. In this case it does not matter whether you copy/cat the file in (that’s maybe a bit more reliable in triggering the issue) or add a line or two to the file.
There will be a couple of threads, all at 100% CPU usage, and load skyrockets. I’ve tried to strace a couple of the threads to see what’s going on, and they all seem to loop with more or less the following output:
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781332861}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781464652}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781598152}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781730985}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781862609}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=781995401}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782126983}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782260650}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782392274}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782524524}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782658190}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782791773}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=782923814}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783058355}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783190230}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783326229}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783460937}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783594562}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783726895}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=783865061}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784068477}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784181018}) = 0
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=4555, tv_nsec=784277809}) = 0
futex(0xb6028978, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0xb6028960, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, FUTEX_BITSET_MATCH_ANY) = 0
Love to hear your thoughts on this, @JimT; let me know if you want me to grab more extensive strace output and whether you have preferred output options.
EDIT:
I created a test.rb file with this in it:
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
Then I ran this command a few times (up arrow, enter, repeatedly).
I tried this on my production system (openHAB in Docker, on Ubuntu 20) but couldn’t make my openHAB spike to 100%. It just reported that it detected a change and reloaded the script. I noticed the openHAB process went up to 25-29% briefly, then down to 0-5% (its normal state).
Is that the same test you did?
I’ll be afk for a week (going camping with no laptop), I’d suggest posting this as an issue over on Issues · boc-tothefuture/openhab-jruby · GitHub so we can keep track of it better.
Yes, that usually does it.
I cobbled together a simple test script:
#!/bin/bash

check_load() { top -b -n1 -H | head -n 20; }

echo -e '>>> Starting testing\n'
# Print a baseline CPU usage
check_load

echo -e '\n\n>>> Writing testing script\n'
cat <<EOF > /etc/openhab/automation/jsr223/ruby/personal/test.rb
require 'openhab'

rule 'Log every minute' do
  cron '0 * * * * ?'
  run { logger.info "Test rule executed" }
end
EOF

echo -e '>>> Waiting for script to be loaded (10s), log show the following: \n'; sleep 10
tail -n 5 /var/log/openhab/openhab.log

echo -e '\n\n>>> Checking CPU usage\n'
check_load

echo -e '\n\n>>> We will now wait for the CPU to settle down (60s), if its not already in run-away\n'; sleep 60
check_load

echo -e '\n\n>>> Poke the script file again using the tee command\n'
echo '# touching script file' | tee -a /etc/openhab/automation/jsr223/ruby/personal/test.rb; sleep 2; echo
tail -n 3 /var/log/openhab/openhab.log

echo -e '\n\n>>> Waiting for a bit (30s), and lets see our CPU usage\n'; sleep 30
check_load
And this results in the following output:
>>> Starting testing
top - 18:14:49 up 7:22, 3 users, load average: 0.82, 2.12, 2.12
Threads: 462 total, 1 running, 461 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.6 us, 4.7 sy, 0.0 ni, 92.3 id, 0.0 wa, 0.4 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 47.1 free, 806.3 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27711 root 20 0 10192 2728 2120 R 29.6 0.1 0:00.20 top
18326 root 20 0 0 0 0 I 3.7 0.0 0:00.09 kworker/5:1-events
25601 openhab 20 0 767912 612348 15064 S 3.7 30.0 0:00.01 pool-27-thread-
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
>>> Writing testing script
>>> Waiting for script to be loaded (10s), log show the following:
2022-08-22 18:14:29.590 [INFO ] [chine.state_machine_-_washingmachine] - Device is now Off
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Checking CPU usage
top - 18:14:59 up 7:22, 3 users, load average: 1.15, 2.15, 2.13
Threads: 463 total, 3 running, 460 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.3 us, 5.3 sy, 0.0 ni, 78.3 id, 0.0 wa, 0.5 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 47.2 free, 806.1 used, 1137.1 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27717 openhab 20 0 767912 612700 15064 R 99.9 30.1 0:00.61 OH-file-process
14386 root 20 0 9016 3976 1864 S 25.0 0.2 1:16.86 htop
27720 root 20 0 10192 2808 2200 R 20.0 0.1 0:00.08 top
25267 openhab 20 0 767912 612700 15064 S 10.0 30.1 0:08.59 RXTXPortMonitor
27579 openhab 20 0 767912 612700 15064 S 5.0 30.1 0:00.50 Thread-386
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.49 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
>>> We will now wait for the CPU to settle down (60s), if its not already in run-away
top - 18:15:59 up 7:23, 3 users, load average: 3.31, 2.66, 2.31
Threads: 455 total, 1 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 5.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.5 hi, 0.0 si, 0.0 st
MiB Mem : 1990.4 total, 46.9 free, 806.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1137.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27750 root 20 0 10192 2772 2164 R 12.5 0.1 0:00.14 top
13373 root 20 0 0 0 0 I 4.2 0.0 0:03.86 kworker/u16:1-events_unbound
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
9 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
10 root 20 0 0 0 0 I 0.0 0.0 0:03.51 rcu_preempt
11 root rt 0 0 0 0 S 0.0 0.0 0:00.63 migration/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root rt 0 0 0 0 S 0.0 0.0 0:00.53 migration/1
>>> Poke the script file again using the tee command
# touching script file
2022-08-22 18:14:49.073 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
2022-08-22 18:14:56.935 [INFO ] [rulesupport.loader.ScriptFileWatcher] - Loading script '/etc/openhab/automation/jsr223/ruby/personal/test.rb'
>>> Waiting for a bit (30s), and lets see our CPU usage
top - 18:16:32 up 7:23, 3 users, load average: 5.39, 3.25, 2.52
Threads: 460 total, 9 running, 451 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.9 us, 7.4 sy, 0.0 ni, 3.2 id, 0.0 wa, 1.1 hi, 0.5 si, 0.0 st
MiB Mem : 1990.4 total, 45.9 free, 807.3 used, 1137.2 buff/cache
MiB Swap: 995.2 total, 995.0 free, 0.2 used. 1136.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22083 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:13.44 GC Thread#3
22084 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#4
22085 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.86 GC Thread#5
22330 openhab 20 0 769676 613404 15064 R 95.7 30.1 1:12.64 GC Thread#7
22064 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.99 GC Thread#0
22082 openhab 20 0 769676 613404 15064 R 91.3 30.1 1:12.55 GC Thread#2
22081 openhab 20 0 769676 613404 15064 R 87.0 30.1 1:12.87 GC Thread#1
22329 openhab 20 0 769676 613404 15064 R 65.2 30.1 1:13.06 GC Thread#6
14386 root 20 0 9016 3976 1864 S 26.1 0.2 1:20.72 htop
27774 root 20 0 10192 2728 2120 R 21.7 0.1 0:00.13 top
3302 root 20 0 12720 6256 4592 S 4.3 0.3 0:06.75 sshd
1 root 20 0 31516 7804 5416 S 0.0 0.4 0:08.97 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 kthreadd
It almost feels like the Odroid XU4 does not have enough horsepower to run openHAB, especially when loading JRuby scripts.
Looking at the last top output, it seems the garbage collector is doing the damage, and I can confirm those GC threads are the top CPU users every time the system runs away. The strace output looks the same each time; I checked with htop.
Hi,
I wanted to circle back to this one.
Turns out @Andrew_Rowe wasn’t too far off: it was a Java heap-space limit in the end. Basically I was running the Java process out of memory, sending the garbage collector wild.
It became clear when I added some more devices, after which openHAB basically came to a halt and started spewing out-of-memory errors. I increased the Xms and Xmx values, after which everything began running smoothly.
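For anyone else hitting this: on package-based installs, the heap settings typically live in /etc/default/openhab via EXTRA_JAVA_OPTS. The exact file location and sensible values vary by install method and hardware, so treat this as a sketch:

```shell
# /etc/default/openhab (location and defaults vary by install method)
# Raise the JVM heap so the garbage collector stops thrashing:
EXTRA_JAVA_OPTS="-Xms512m -Xmx1024m"
```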
I’ve done quite a lot of testing with scripts in the meantime. They still double-load when vim is used without the backup options disabled, but it never brought openHAB down again.
Thanks for putting the time and effort into this, @JimT. I’m sorry it turned out to be a memory limitation that sent us all down a rabbit-hole.