Rule engine not running although bundle active?

mstormi · January 8, 2025, 10:18am

I have a weird problem with the rules engine not running

2025-01-08 11:16:17.521 [DEBUG] [org.openhab.core.automation.internal.RuleEngineImpl    ] - Rule engine not yet started - not executing rule 'meter-3',
2025-01-08 11:16:17.523 [DEBUG] [org.openhab.core.automation.internal.RuleEngineImpl    ] - Rule engine not yet started - not executing rule 'Rechnung-3'

although the console says it does:

openhab> bundle:list|grep -i rule
166 x Active x  80 x 4.3.1                 x openHAB Core :: Bundles :: Automation Script RuleSupport
205 x Active x  80 x 4.3.1                 x openHAB Core :: Bundles :: Model Rules
206 x Active x  80 x 4.3.1                 x openHAB Core :: Bundles :: Model Rule IDE
207 x Active x  80 x 4.3.1                 x openHAB Core :: Bundles :: Model Rules Runtime

Actually NO rules are working. FWIW, I only use DSL rules.
Restarting, cache cleaning, even reinstalling oh packages did not help.
Any idea?

vanja · January 8, 2025, 11:12am

can you show the 2 problematic rules mentioned in the debug?
are you using cron?

mstormi · January 8, 2025, 12:11pm

it’s just 2 example log lines, actually NO rule is working

Will · January 8, 2025, 12:12pm

@vanja thanks the for linking.

@mstormi I run into the issue. @vanja linked my thread.

My questions to troubleshoot would be:

do you have processes (java ?) running at 100% CPU ?
can you run top -H -p $(pgrep -x java) to see what is using up time ? Does it look like the screenshot below ?

image1222×132 24.5 KB

I like the from @vanja asking if you use a cron that could be erroneous ?
Here is a magic oneline that helps:

grep -rEi 'Time cron' . | grep -vE ':\s*//'  | awk '{print $0}' | column -t

mstormi · January 8, 2025, 12:17pm

The OH install is actually fine except for the rules engine which is down.
Hell, I even purged-up-&downgraded the oh package (4.3.0/.1).
No change.
Rules are fine, too. Taken from a backup that used to work, plus they work on several other machines I have running.

vanja · January 8, 2025, 12:17pm

Can you remove all rules and see if OH starts to runtime 100 ?

Will · January 8, 2025, 12:29pm

I read you and that sounds A LOT like my path.

My rules were also running fine. Initially I though the upgrade to 4.3.1 broke it but I think the restart was the issue.

I could solve the issue by:

mkdir rules/disabled
mv *.rules/disabled
then restarting, that should run now
then bring back rules in small batches. Once OH settles down, do restart it to validate. Move to the next batch. This is how I was left with one VERY simple rule that never was an issue but blocked OH on restart.

Here is a python script that may help check cron, it may need tweaking for the OH specifics though.

file: rules/check-cron.py

import os
import re
from croniter import croniter
from datetime import datetime

# ANSI escape code for red text
RED = '\033[31m'
RESET = '\033[0m'

# Function to validate cron expressions
def validate_cron(cron_expr):
    try:
        croniter(cron_expr, datetime.now())
        return True
    except Exception:
        return False

# Function to search for "Time cron" in files and validate cron expressions
def search_and_validate_crons(root_dir='.'):
    # Regex to find lines containing 'Time cron' and extract cron expression
    cron_pattern = re.compile(r'Time cron\s+"([^"]+)"')

    # Walk through all files in the given directory
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            file_path = os.path.join(root, file)

            # Read file and search for cron expressions
            with open(file_path, 'r', encoding='utf-8') as f:
                for line in f:
                    # Search for cron expressions
                    match = cron_pattern.search(line)
                    if match:
                        cron_expr = match.group(1)
                        # Validate the cron expression
                        if validate_cron(cron_expr):
                            print(f"{file_path}: OK - {cron_expr}")
                        else:
                            print(f"{file_path}: {RED}NOT OK{RESET} - {cron_expr}")

# Run the script
if __name__ == "__main__":
    search_and_validate_crons()

Usage:

pip install croniter
cd <your_rules_folder>
python check-cron.py

mstormi · January 8, 2025, 2:48pm

Yes, removing some rules makes the engine start up.
However on next or next-but-one restart, the engins starts but this time the dashboard is empty and the “pages” admin menu is missing, too.
I had seen that before but I’m not getting to the bottom of why this is happening.

mstormi · January 8, 2025, 4:42pm

It does not seem to be a specific rule that causes the problem, it’s somewhat random with which rules it appears so I think it’s not related to the rule code but merely to something like memory usage or so.

rlkoshak · January 8, 2025, 5:36pm

I’ve opened an issue to merge the three threads which are reporting this behavior. Please participate there with information gathering and testing so we can hopefully get this fixed quickly.

github.com/openhab/openhab-core

OH 4.3.0/4.3.1 is never getting past startlevel 30 for some users

opened 05:33PM - 08 Jan 25 UTC

rkoshak

bug

## Expected Behavior When OH starts the rule engine starts and begins execu…ting rules. ## Current Behavior It's been reported at least three times now on the forum: - https://community.openhab.org/t/rule-cron-blocking-openhab/161469 - https://community.openhab.org/t/rule-engine-does-not-start-after-openhab-restart-due-to-high-amount-of-rules/161440 - https://community.openhab.org/t/rule-engine-not-running-although-bundle-active/161517 The first link has the most thorough investigation. I'm not positive that it's the same root problem for all three as they differ in details. In all three cases the problem appeared when upgrading to 4.3.1, but backporting to 4.3.0 does not fix the problem. So it might be a regression from before 4.3.0 release. File based Rules DSL rules seem to be involved in all three cases. ## Possible Solution ## Steps to Reproduce (for Bugs) This is trickly because clearly it's not happening for everyone. ## Context Users who face this have no rules running. Mitigation is to go back to 4.2. ## Your Environment * Version used: (e.g., openHAB and add-on versions) * Environment name and version (e.g. Chrome 111, Java 17, Node.js 18.15, ...): * Operating System and version (desktop or mobile, Windows 11, Raspbian Bullseye, ...):

mstormi · January 9, 2025, 10:07am

thanks Rich. I’ll start investigating.
I have the problem on at least one installation, 4.3.1 or 4.3.0 doesn’t matter.
Went back to 4.2.3 there, that seems(?) to have solved the problem for now.

Then again, I have two other installations with 4.3.0 and 4.3.1 that do not show the problem.
They all run almost the same config so I wouldn’t think the difference is related to bindings with mem leaks or the OH software version.

As a first premature shot, the difference seems to be 64 bit OS & Java on the latter. cc @ndye
Java process on Raspi was huge ~4GB while on those systems where it works it was only 1,2GB.

@Will your screenshot shows a large Java, too. What hardware and OS are you on, what’s your output of java -version?

vanja · January 9, 2025, 11:01am

can you remove all cron based rules and test? Maybe it’s something with the quartz…

jimtng · January 9, 2025, 11:07am

A bug was found in CronAdjuster, a PR submitted. But as far as I know, it only happens with “incorrect” cron field combinations such as setting date = 31, month = 2. This issue seems to have been around for at least 5 years.

I’m curious to know if there are other bugs causing the issue you’re having.

mstormi · January 9, 2025, 11:14am

No cannot, I went back to 4.2.3.
But on 4.3 I had the issue without any rule that used cron, too.

mstormi · January 9, 2025, 9:40pm

Anyone to see this, check is if the rules engine is running. In my case it wasn’t although in bundle:list was displayed to be ‘Active’.
Use log:set debug org.openhab.core.automation that’ll show you when it does not run.

I’m currently inclined to believe it’s only happening if you run in 64bit.
Let me know your HW, OS version (and if 32 or 64 bit!) and the output of java -version.

Will · January 9, 2025, 10:01pm

I am not sure what you mean by “large Java” but here are the infos:

java -version
openjdk version "17.0.13" 2024-10-15
OpenJDK Runtime Environment (build 17.0.13+11-Debian-2deb12u1)
OpenJDK 64-Bit Server VM (build 17.0.13+11-Debian-2deb12u1, mixed mode, sharing)

It is a RPi4 running OpenHabian.
If by large Java you mean the high CPU, what you see on the screenshot is totally NOT normal and untypical.

Now that I fixed the issue, it looks much quieter:

Inside the Java process, everything is mostly sleeping as it should now:

mstormi · January 9, 2025, 10:29pm

I mean the VIRT column which is the virtual process size. 5.6G ! Even the in-memory fraction (RES column) is 1.5G.

I, too, have a RPi4 of course with openHABian, too, but with a 32bit OS and Java.
I’m at only 1.2G and I don’t have the problem there.
On my other box with the 64bit OS+Java and a similar config, I’m at 4+ G and I DO have the problem there.

I’m wondering if some limit was hit along the way and if that caused our issue.
Maybe the rules engine thread tried to allocate even more memory but failed because it ran out of virtual space, then it might have gotten hung in that state.

Will · January 9, 2025, 10:42pm

I still have the same value today 5610492 but my system is working fine.
I could joke and say that I have many light bulbs but I don’t think my system is particularly huge. I am probably in the range of 100-200 items. I did install many addons to test but again nothing crazy and I cleaned up recently.

Some more details in case it rings a bell:

top - 23:39:18 up 3 days,  6:20,  1 user,  load average: 0.11, 0.08, 0.08
Tasks: 140 total,   1 running, 139 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  0.5 sy,  0.0 ni, 98.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3841.8 total,     52.3 free,   2255.9 used,   1614.6 buff/cache
MiB Swap:   3072.0 total,   3071.5 free,      0.5 used.   1585.9 avail Mem

Actually now that you mention it, it would be nice to cut down that swap and probably would fix a few issues I ran into with slow i/o.

bjscheue · January 10, 2025, 1:04pm

I had reported a similar behavior about two weeks ago (Rules no longer executing after upgrade to OH 4 - #4 by ldm). Thanks, @mstormi, for pointing me here.

As I stated in that other thread, I got these problems after restoring a --full backup from OH 3 into OH 4.3. I later tried using a backup that was made without the --full option, and since then 4.3 has been running fine here, with hundreds of rules. Depending on how you got to your current installation, it might be worth to create a backup (without --full), do a clean/purged re-installation (again), restore the backup and see what that brings?

The two backups I experimented with were taken back-to-back, just a few seconds apart, from the same source installation, one with --full, the other one without. So that is definitely what made the difference in my case.

For completeness: My installation is working now (at least for the time being). It’s a relatively large setup with close to 1000 items and almost 10k lines of rules in DSL code. I do not see excessive CPU usage, but I do see the huge memory footprint mentioned in this thread. In my case it’s 8.4G of virtual process size (on a 64 bit Ubuntu x86 machine, Zulu 17 JRE).

rlkoshak · January 10, 2025, 4:48pm

I don’t think the problems you experienced after doing this is related to the problem seen here. The restore of a full backup from a different version of OH, especially a different major version of OH, is never going to work. The full backup includes the actual installed bindings files. An OH 3 binding is never going to work on an OH 4 and it is known that a “bad” binding can cause OH to not come up or fail in all sorts of strange and interesting ways.

During an upgrade, the cache (which is where the add-ons get installed to) is deletec forcing OH to install the bindings anew as part of the upgrade process. When you restore a backup, you skip that process and fill up the cache with a bunch of jar files that can’t work.

This is why the supported and recommended approach is to always restore a backup to the same version of OH the backup was taken from. Then, if desired, upgrade OH post the restore. Then OH will do a bunch of stuff including clearing the cache to make your configs work with the new version of OH.