Rules stop executing after a while

Ah, got it. Wasn’t sure what the etiquette was for this type of recurring issue. I didn’t want to needlessly create new threads for similar items. Thanks for letting me know. I’ll start a new thread.

It’s actually quite severe, at least in my case. ALL rules stop running.

Read (OH 1.x and OH 2.x Rules DSL only] Why have my Rules stopped running? Why Thread::sleep is a bad idea to see if that applies…

please link your new thread

As luck would have it, I setup a watchdog rule and I’ve been trying to collect clear logs. However, the system kindly decided to behave the last few days!

Will create a new thread once the rules stop running again and I can get some better logging in place.

1 Like

I’ve been fighting with this “rules suddenly stop working” issue for several months now as well and am becoming very frustrated and disillusioned with OpenHabian.

I set up some very simple rules in Paper UI to operate switiches that turn lights ON/OFF in my home. The rules make use of Astro Channel events such as local Sun Civil Dusk and local Sun Sunrise events to run a command that turns ON/OFF Z-Wave and Wemo Wifi Switches that are connected to lights.

I use an Aeon Labs Aeotec Z-Wave Z-Stick, Gen5 Z-Wave Plus - ZW090 and a Raspberry PI 2 and OpenHabian 4.19.42-v7+ #1219 SMP Tue May 14 21:20:58 BST 2019 armv7l. I’m running OpenHabian from an SD Card.

The rules work great for anywhere from 1 up to 3 weeks. Then they just stop working. ALL OF THEM.
At this point when I go to ssh into my OpenHabian Raspberry PI, I can’t. Paper UI is also hosed at this point. I have to reboot the Raspberry PI.

After a reboot, OpenHabian Paper UI comes back on line and I’m able to ssh in again.

However at this point the rules appear to be corrupt/broken.

If I edit the rules by making a minor change to the rule description and savie it, then manually run the rule with the “play button” icon, it appears to work again. Sometimes however, the rule while functional with the “play button” doesn’t trigger based on the Channel event or fixed time and I need to delete all the rules and recreate them again. Then things work for a while again before the entire cycle repeats.

So it appears that every couple of weeks or so OpenHabian is crashing/hanging and some sort of corruption is occurring with the rules.

This is really frustrating and I’m not sure where to start to debug the issue.

Guidance and suggestion for debugging and tracing the root cause would be appreciated.

I’m not doing anything fancy. Very simple and straight forward rules. I have seen several posts here by others experiencing what appears to be similar behavior for the rules.

Thanks in advance for your help.

…Steve

Dying SD-card maybe…
What is the age of your card ?

That suggests you are using the Experimental Rules engine. I’m not sure how robust that is.

The PaperUI Rules are labeled as Experimental for a reason. They are not super robust yet. The problem where you need to make an edit and resave the Rule is a known issue and I believe it has been fixed in the snapshots if not in 2.5 M3. If not, there is an issue open for it.

But the problem you are experiencing is more fundamental than that and isn’t directly related to your Rules. Something is crashing your whole RPi if you can’t ssh into it. The source of the problem could be a failing SD card. It could also be a memory leak in OH or something else you are running. I do know that if you try to use Grafana to create static image charts on an RPi it will consume all the memory and then crash your OH.

This is going to take some investigation and I recommend you open a new thread for that.

It’s a couple of months old. I will make an image copy on a new SD and try that to see if I can run without experiencing this issue for a longer period of time, more than 2-3 weeks.

Can you clarify what is “experimental” about the Paper UI Rules. The UI itself, or the underlying Rule engine implementation. I can see my rules in the REST API after adding them via the Paper UI, so is the Rule REST API also “experimental”? Should I be deploying my rules via the REST API? If the Paper UI and REST API are both experimental… what is the recommended way of writing/deploying rules using the “Non-Experimental” Rules interfaces/implementations?

I’m happy to open a new thread and will do that. I posted here because the rule behavior seemed similar to what I am experiencing, granted less the freezing of OpenHab and the inability to SSH in.

Would be great if someone could help me methodically diagnose this issue. I’m happy to navigate the logs, share configuration details, enable trace level debugging etc., but I need some guidance on where to start.

Thanks.

…Steve

I believe the recommended way of writing rules is to use text based *.rules files. At least that’s how I’ve been doing it:

The UI itself and some parts of how the roles are managed in the jsondb

The underlying engine is undergoing some changes and getting some additional but it’s otherwise rock solid.

PaperUI uses the rest API itself.

I am using OH since version 1.3 and I never got more than maybe 3 days of continuous usability. I only managed to get it running by adding several watchdog mechanisms that fire a reboot if something misbehaves. This includes internal, rule-based watchdogs (check if values that should change stay the same over some longer period), a rule-based homematic trigger-ack-watchdog, uptime watchdog (yes, I reboot intentionally after 1,5 days now, since this is safer on avergae than letting in run longer), a file-based watchdog (all rule files contain a time-based continuous trigger to touch a /tmp-file with the same name of the rule file, but only if my internal uptime counter increases. A cron job verifies the updates and reboots if one file gets old). There is even more - for each time I had to manually reboot (becaus all other watchdogs did not trigger), I tried to figure out an additional watchdog for this specific scenario. I estimate that 10% of my rules and more than 50% of my OH configuration time is spent around watchdogs and error analysis.

Long story short, I can now keep my OH 2.4 running for weeks without manual reboot, but I rarely get even one day of continuous uptime. This results in frustration, because after a reboot, it may take even more reboots until the system is up and fine again. This means downtime of my home automation, and you don’t want that with a wife and kids…

I decided to try some bottom up analysis and took most rules I had offline, as well as most bindings, especially the two self-written. I updated what was left to new stable bindings. I have a quite stripped-down system (without watchdogs now), and I got my first crash today after 1135min uptime. No events anymore, “nearly” no rule execution (apparently timer work). No exception in debug logging, but I got this in journalctl ( journalctl -u openhab2 -b):

Nov 28 15:42:43 odroid karaf[1147]: Exception in thread "ESH-OSGiEventManager" java.lang.IncompatibleClassChangeError: vtable stub
Nov 28 15:42:43 odroid karaf[1147]:         at java.util.AbstractCollection.toArray(AbstractCollection.java:199)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.String.split(String.java:2378)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.String.split(String.java:2422)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.events.AbstractEventFactory.getTopicElements(AbstractEventFactory.java:108)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.getItemName(ItemEventFactory.java:140)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.createStateChangedEvent(ItemEventFactory.java:128)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.createEventByType(ItemEventFactory.java:84)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.events.AbstractEventFactory.createEvent(AbstractEventFactory.java:51)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.createESHEvent(EventHandler.java:121)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.handleEvent(EventHandler.java:95)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.handleEvent(EventHandler.java:72)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler.lambda$0(ThreadedEventHandler.java:67)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.Thread.run(Thread.java:748)

In all fairness, I have to say that I changed a rules and an items file hours before the crash, so might be related.

See (OH 1.x and OH 2.x Rules DSL only] Why have my Rules stopped running? Why Thread::sleep is a bad idea. This is a classic symptom of long running rules or rules that get stuck because of locks.

I’ve never seen that exception before but I rarely look in journalctl.

All I can say beyond that is I typically get months of uptime on my OH instances. The only reason I need to restart it is to update it.

It is good to hear that this is theoretically possible to have that long uptimes.

I switched away from Thread::sleeps years ago. I have one left, in my reboot rule, in case the createTimer to execute reboot “after logging” fails.

I even changed the eMMC card on my odroid once, and I still have not ruled out hardware issues.
Java is Java™ SE Runtime Environment (build 1.8.0_201-b09) (is Oracle still recommended?) on Ubuntu 16.04.

32-bit Zulu has been the recommended Java on an RPi for quite some time now (64-bit version had some issue with the serial devices). Most users run with Zulu on other platforms too. For awhile OpenJDK was recommend against because there was a known problem with OH, but I don’t know if that’s still the case. I don’t know of any specific problem with Oracle Java beyond the license but I suspect the minority of oh that’s are running on it.

Will switch.

I had the following exceptions in journal in the night (second one was fatal again, no events since then):

Nov 29 03:52:18 odroid karaf[1161]: Exception in thread "EventAdminAsyncThread #12" java.lang.IllegalMonitorStateException
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:151)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Nov 29 03:52:18 odroid karaf[1161]:         at java.lang.Thread.run(Thread.java:748)
Nov 29 08:47:20 odroid karaf[1161]: Exception in thread "EventAdminAsyncThread #11" java.lang.NullPointerException
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.enq(AbstractQueuedSynchronizer.java:590)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Nov 29 08:47:20 odroid karaf[1161]:         at java.lang.Thread.run(Thread.java:748)

Yikes… this is shocking

Agreed, I’ve used OpenHAB a little over a year, starting with 2.3. With a battery backup running on a small desktop PC running Deb 9, I go months without having to ever restart. Months

OK, that’s what I’m wondering is have you ever tested on different hardware because that is just way over the top, I can’t believe you didn’t give up.

Which odroid model are you using? There’s an issue in the dual core architecture not playing nicely with Java for the xu4, causing OH to crash regularly. Problem impacts at least odroid xu4 but maybe other models as well. The workaround documented here is to disable one of the cores for Java

Either way, the issue I mention above only applies to broad OH, not just rules.

EDIT:
Found the reason :upside_down_face:. In every single rule file I mentioned, I had a rule having the same “display name”. It seems that openHAB has an issue with equal rule names if they aren’t in the same .rules file. Anyway its working just fine.

Anyone ever had the problem that only the last saved/modified *.rules file works? I have a lot of “older” rules files that are working continously. But some of my younger rules files are making problems. Not using Thread.Sleep or Locks in any of my rules. I’m using openHAB quite a few years now and I’ve never had an issue like that.

Example:

ABC.rules gets modified

  • All rules in ABC.rules will trigger (simple item changed rules)
  • All other *.rules files are working.
  • No rules in DEF.rules are triggering.

DEF.rules gets modified

  • All rules in DEF.rules will trigger (simple item changed rules)
  • All other *.rules files are working.
  • No rules in ABC.rules are triggering.

Anyone has an idea? Installation is just a few months old. So the SD-Card.