Rules stop executing after a while

I believe the recommended way of writing rules is to use text based *.rules files. At least that’s how I’ve been doing it:

The UI itself and some parts of how the roles are managed in the jsondb

The underlying engine is undergoing some changes and getting some additional but it’s otherwise rock solid.

PaperUI uses the rest API itself.

I am using OH since version 1.3 and I never got more than maybe 3 days of continuous usability. I only managed to get it running by adding several watchdog mechanisms that fire a reboot if something misbehaves. This includes internal, rule-based watchdogs (check if values that should change stay the same over some longer period), a rule-based homematic trigger-ack-watchdog, uptime watchdog (yes, I reboot intentionally after 1,5 days now, since this is safer on avergae than letting in run longer), a file-based watchdog (all rule files contain a time-based continuous trigger to touch a /tmp-file with the same name of the rule file, but only if my internal uptime counter increases. A cron job verifies the updates and reboots if one file gets old). There is even more - for each time I had to manually reboot (becaus all other watchdogs did not trigger), I tried to figure out an additional watchdog for this specific scenario. I estimate that 10% of my rules and more than 50% of my OH configuration time is spent around watchdogs and error analysis.

Long story short, I can now keep my OH 2.4 running for weeks without manual reboot, but I rarely get even one day of continuous uptime. This results in frustration, because after a reboot, it may take even more reboots until the system is up and fine again. This means downtime of my home automation, and you don’t want that with a wife and kids…

I decided to try some bottom up analysis and took most rules I had offline, as well as most bindings, especially the two self-written. I updated what was left to new stable bindings. I have a quite stripped-down system (without watchdogs now), and I got my first crash today after 1135min uptime. No events anymore, “nearly” no rule execution (apparently timer work). No exception in debug logging, but I got this in journalctl ( journalctl -u openhab2 -b):

Nov 28 15:42:43 odroid karaf[1147]: Exception in thread "ESH-OSGiEventManager" java.lang.IncompatibleClassChangeError: vtable stub
Nov 28 15:42:43 odroid karaf[1147]:         at java.util.AbstractCollection.toArray(AbstractCollection.java:199)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.String.split(String.java:2378)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.String.split(String.java:2422)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.events.AbstractEventFactory.getTopicElements(AbstractEventFactory.java:108)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.getItemName(ItemEventFactory.java:140)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.createStateChangedEvent(ItemEventFactory.java:128)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.items.events.ItemEventFactory.createEventByType(ItemEventFactory.java:84)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.events.AbstractEventFactory.createEvent(AbstractEventFactory.java:51)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.createESHEvent(EventHandler.java:121)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.handleEvent(EventHandler.java:95)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.EventHandler.handleEvent(EventHandler.java:72)
Nov 28 15:42:43 odroid karaf[1147]:         at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler.lambda$0(ThreadedEventHandler.java:67)
Nov 28 15:42:43 odroid karaf[1147]:         at java.lang.Thread.run(Thread.java:748)

In all fairness, I have to say that I changed a rules and an items file hours before the crash, so might be related.

See (OH 1.x and OH 2.x Rules DSL only] Why have my Rules stopped running? Why Thread::sleep is a bad idea. This is a classic symptom of long running rules or rules that get stuck because of locks.

I’ve never seen that exception before but I rarely look in journalctl.

All I can say beyond that is I typically get months of uptime on my OH instances. The only reason I need to restart it is to update it.

It is good to hear that this is theoretically possible to have that long uptimes.

I switched away from Thread::sleeps years ago. I have one left, in my reboot rule, in case the createTimer to execute reboot “after logging” fails.

I even changed the eMMC card on my odroid once, and I still have not ruled out hardware issues.
Java is Java™ SE Runtime Environment (build 1.8.0_201-b09) (is Oracle still recommended?) on Ubuntu 16.04.

32-bit Zulu has been the recommended Java on an RPi for quite some time now (64-bit version had some issue with the serial devices). Most users run with Zulu on other platforms too. For awhile OpenJDK was recommend against because there was a known problem with OH, but I don’t know if that’s still the case. I don’t know of any specific problem with Oracle Java beyond the license but I suspect the minority of oh that’s are running on it.

Will switch.

I had the following exceptions in journal in the night (second one was fatal again, no events since then):

Nov 29 03:52:18 odroid karaf[1161]: Exception in thread "EventAdminAsyncThread #12" java.lang.IllegalMonitorStateException
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:151)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Nov 29 03:52:18 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Nov 29 03:52:18 odroid karaf[1161]:         at java.lang.Thread.run(Thread.java:748)
Nov 29 08:47:20 odroid karaf[1161]: Exception in thread "EventAdminAsyncThread #11" java.lang.NullPointerException
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.enq(AbstractQueuedSynchronizer.java:590)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1261)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:457)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Nov 29 08:47:20 odroid karaf[1161]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Nov 29 08:47:20 odroid karaf[1161]:         at java.lang.Thread.run(Thread.java:748)

Yikes… this is shocking

Agreed, I’ve used OpenHAB a little over a year, starting with 2.3. With a battery backup running on a small desktop PC running Deb 9, I go months without having to ever restart. Months

OK, that’s what I’m wondering is have you ever tested on different hardware because that is just way over the top, I can’t believe you didn’t give up.

Which odroid model are you using? There’s an issue in the dual core architecture not playing nicely with Java for the xu4, causing OH to crash regularly. Problem impacts at least odroid xu4 but maybe other models as well. The workaround documented here is to disable one of the cores for Java

Either way, the issue I mention above only applies to broad OH, not just rules.

EDIT:
Found the reason :upside_down_face:. In every single rule file I mentioned, I had a rule having the same “display name”. It seems that openHAB has an issue with equal rule names if they aren’t in the same .rules file. Anyway its working just fine.

Anyone ever had the problem that only the last saved/modified *.rules file works? I have a lot of “older” rules files that are working continously. But some of my younger rules files are making problems. Not using Thread.Sleep or Locks in any of my rules. I’m using openHAB quite a few years now and I’ve never had an issue like that.

Example:

ABC.rules gets modified

  • All rules in ABC.rules will trigger (simple item changed rules)
  • All other *.rules files are working.
  • No rules in DEF.rules are triggering.

DEF.rules gets modified

  • All rules in DEF.rules will trigger (simple item changed rules)
  • All other *.rules files are working.
  • No rules in ABC.rules are triggering.

Anyone has an idea? Installation is just a few months old. So the SD-Card.