[SOLVED] Most rules stop working after a few days/week/ on Openhab 2.4

My rules randomly stop everyday/week and I don’t know if this is just a 2.4 issue or if a small change I made at some point in the last month is the cause. It’s kind of a paint to figure out what might be wrong.

  • Platform information:
    • Hardware: VM uppted to 8gb/2cpu/500Gb - upgraded ram and cpu, and kill X, giving me like 6Gb free, so there should be more than enough resource.
    • OS: Openhabian. Openhab 2.4.01 (Only after upgrade)
    • Java Runtime Environment: java version “1.8.0_191”
      Java™ SE Runtime Environment (build 1.8.0_191-b12)
      Java HotSpot™ 64-Bit Server VM (build 25.191-b12, mixed mode)
    • openHAB version: 2.4.01
  • Issue of the topic: Lots of runes stop running - daily occurrence.
  • Please post configurations (if applicable): Very complicated setup, lots of rules and external scripts using every method imaginable , probably in your nightmares. But it’s all been rock solid or “worked” exactly as programmed for years, until this last update. I simply don’t have time to go through thousands of LSP warning
    in dozens of rules. I need a way to narrow down what the issue is.
    • It effects items that should be changed in rules.
    • It seems like most of the rules just stop working, one timmer keeps on running, but it seems strange how everything has just stopped working.
  • There isn’t anything obvious in the logs, just the fact the rules stop running.

My main issue is I have no clue what the problem is, I can’t try a fix or disable my light rules,. while waiting a day to a week to see if it’s fixed it. So I really need to know how to properly debug and find out why multiples rules just stop running.

Here are my threads and top. I uninstall red node after and killed x so there shouldn’t be any problem there.

"CM Configuration Updater" Id=25 in WAITING on lock=java.util.LinkedList@46fef272
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:79)
    at java.lang.Thread.run(Thread.java:748)

"CM Event Dispatcher" Id=26 in WAITING on lock=java.util.LinkedList@141914bc
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:79)
    at java.lang.Thread.run(Thread.java:748)

"fileinstall-/usr/share/openhab2/addons" Id=39 in TIMED_WAITING on lock=org.apache.felix.fileinstall.internal.DirectoryWatcher@262fcfd3
    at java.lang.Object.wait(Native Method)
    at org.apache.felix.fileinstall.internal.DirectoryWatcher.run(DirectoryWatcher.java:320)

"pool-1-thread-1" Id=43 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@74deba23
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
    at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
    at sun.nio.fs.AbstractWatchService.poll(AbstractWatchService.java:108)
    at org.apache.karaf.jaas.modules.properties.AutoEncryptionSupport.run(AutoEncryptionSupport.java:77)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"HTTP Refresh Service" Id=267 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.openhab.core.service.AbstractActiveService$RefreshThread.pause(AbstractActiveService.java:186)
    at org.openhab.core.service.AbstractActiveService$RefreshThread.run(AbstractActiveService.java:170)

"MQTT Rec: dellDebra" Id=269 in RUNNABLE (running in native)
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at java.io.DataInputStream.readByte(DataInputStream.java:265)
    at org.eclipse.paho.client.mqttv3.internal.wire.MqttInputStream.readMqttWireMessage(MqttInputStream.java:65)
    at org.eclipse.paho.client.mqttv3.internal.CommsReceiver.run(CommsReceiver.java:107)
    at java.lang.Thread.run(Thread.java:748)

"MQTT Snd: dellDebra" Id=270 in WAITING on lock=java.lang.Object@71c96327
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.eclipse.paho.client.mqttv3.internal.ClientState.get(ClientState.java:743)
    at org.eclipse.paho.client.mqttv3.internal.CommsSender.run(CommsSender.java:100)
    at java.lang.Thread.run(Thread.java:748)

"MQTT Call: dellDebra" Id=271 in WAITING on lock=java.lang.Object@204a7b3a
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.eclipse.paho.client.mqttv3.internal.CommsCallback.run(CommsCallback.java:150)
    at java.lang.Thread.run(Thread.java:748)

"openHAB-job-scheduler_QuartzSchedulerThread" Id=112 in TIMED_WAITING on lock=java.lang.Object@ac2edd9
    at java.lang.Object.wait(Native Method)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:311)

"main" Id=1 in WAITING on lock=java.util.concurrent.atomic.AtomicReference@51b7329
    at java.lang.Object.wait(Native Method)
    at org.eclipse.osgi.container.SystemModule.waitForStop(SystemModule.java:168)
    at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle.waitForStop(EquinoxBundle.java:250)
    at org.eclipse.osgi.launch.Equinox.waitForStop(Equinox.java:181)
    at org.apache.karaf.main.Main.awaitShutdown(Main.java:641)
    at org.apache.karaf.main.Main.main(Main.java:188)

"Framework Event Dispatcher: org.eclipse.osgi.internal.framework.EquinoxEventPublisher@1b45c0e" Id=12 in WAITING on lock=org.eclipse.osgi.framework.eventmgr.EventManager$EventThread@2cb1d36b
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.getNextEvent(EventManager.java:400)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:336)

"Start Level: Equinox Container: dd64a1de-4327-4887-8db7-a727c3527481" Id=13 in WAITING on lock=org.eclipse.osgi.framework.eventmgr.EventManager$EventThread@345f5119
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.getNextEvent(EventManager.java:400)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:336)

"Karaf Lock Monitor Thread" Id=14 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.karaf.main.Main.doMonitor(Main.java:404)
    at org.apache.karaf.main.Main.access$100(Main.java:74)
    at org.apache.karaf.main.Main$3.run(Main.java:377)

"Karaf Shutdown Socket Thread" Id=15 in RUNNABLE (running in native)
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.karaf.main.ShutdownSocketThread.run(ShutdownSocketThread.java:65)

"fileinstall-/var/lib/openhab2/etc" Id=29 in TIMED_WAITING on lock=org.apache.felix.fileinstall.internal.DirectoryWatcher@4c101bba
    at java.lang.Object.wait(Native Method)
    at org.apache.felix.fileinstall.internal.DirectoryWatcher.run(DirectoryWatcher.java:320)

"Bundle File Closer" Id=37 in WAITING on lock=org.eclipse.osgi.framework.eventmgr.EventManager$EventThread@3f3db11c
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.getNextEvent(EventManager.java:400)
    at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:336)

"SCR Component Actor" Id=41 in WAITING on lock=java.util.LinkedList@48e746d0
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.felix.scr.impl.ComponentActorThread.run(ComponentActorThread.java:83)
    at java.lang.Thread.run(Thread.java:748)

"pool-3-thread-1" Id=47 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@38c429fb
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
    at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
    at sun.nio.fs.AbstractWatchService.poll(AbstractWatchService.java:108)
    at org.apache.karaf.jaas.modules.properties.AutoEncryptionSupport.run(AutoEncryptionSupport.java:77)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"Thread-23" Id=60 in RUNNABLE (running in native)
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
      - locked java.net.PlainDatagramSocketImpl@6b805f5f
    at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143)
      - locked java.net.PlainDatagramSocketImpl@6b805f5f
    at java.net.DatagramSocket.receive(DatagramSocket.java:812)
      - locked java.net.DatagramPacket@5ac2c292
      - locked java.net.DatagramSocket@74790f9f
    at org.openhab.binding.milight.internal.protocol.MilightDiscover.run(MilightDiscover.java:205)

"Dir Watcher" Id=62 in WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@69699c5d
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
    at sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118)
    at org.eclipse.smarthome.core.service.WatchQueueReader.run(WatchQueueReader.java:210)
    at java.lang.Thread.run(Thread.java:748)

"qtp1561008917-72" Id=72 in RUNNABLE (running in native)
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
      - locked sun.nio.ch.Util$3@19562341
      - locked java.util.Collections$UnmodifiableSet@1943750a
      - locked sun.nio.ch.EPollSelectorImpl@5d6d93de
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
    at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:396)
    at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
    at org.eclipse.jetty.io.ManagedSelector$$Lambda$158/1203234312.run(Unknown Source)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)

"qtp1561008917-73-acceptor-0@5460aa4d-0.0.0.0:8443@44c3df5a{SSL,[ssl, http/1.1]}{0.0.0.0:8443}" Id=73 in RUNNABLE (running in native)
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
      - locked java.lang.Object@4397e2
    at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:369)
    at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:639)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)

"qtp1561008917-77-acceptor-0@5340419e-default@2097ec48{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}" Id=77 in RUNNABLE (running in native)
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
      - locked java.lang.Object@7c8cb003
    at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:369)
    at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:639)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)

"Thread-30" Id=80 in RUNNABLE (running in native)
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
      - locked java.net.PlainDatagramSocketImpl@35097566
    at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143)
      - locked java.net.PlainDatagramSocketImpl@35097566
    at java.net.DatagramSocket.receive(DatagramSocket.java:812)
      - locked java.net.DatagramPacket@3d6ae0a8
      - locked java.net.DatagramSocket@6561b3e8
    at org.openhab.binding.milight.internal.protocol.MilightDiscover.run(MilightDiscover.java:205)

"Thread-33" Id=88 in RUNNABLE (running in native)
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
      - locked java.net.PlainDatagramSocketImpl@1c72b67a
    at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143)
      - locked java.net.PlainDatagramSocketImpl@1c72b67a
    at java.net.DatagramSocket.receive(DatagramSocket.java:812)
      - locked java.net.DatagramPacket@7fd62c2
      - locked java.net.DatagramSocket@6381956e
    at org.openhab.binding.milight.internal.protocol.MilightDiscover.run(MilightDiscover.java:205)

"ESH-OSGiEventManager" Id=92 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1432fae
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler.lambda$0(ThreadedEventHandler.java:61)
    at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler$$Lambda$211/1621258594.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

......
"Karaf ssh console user openhab" Id=592287 in RUNNABLE
    at sun.misc.Unsafe.unpark(Native Method)
    at java.util.concurrent.locks.LockSupport.unpark(LockSupport.java:141)
    at java.util.concurrent.SynchronousQueue$TransferStack$SNode.tryMatch(SynchronousQueue.java:265)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:383)
    at java.util.concurrent.SynchronousQueue.offer(SynchronousQueue.java:913)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.felix.gogo.runtime.CommandSessionImpl$JobImpl.start(CommandSessionImpl.java:795)
      - locked org.apache.felix.gogo.runtime.CommandSessionImpl$JobImpl@13862838
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:300)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:169)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:149)
    at org.apache.felix.gogo.runtime.CommandSessionImpl.execute(CommandSessionImpl.java:179)
    at org.apache.karaf.shell.impl.console.ConsoleSessionImpl.doExecute(ConsoleSessionImpl.java:462)
    at org.apache.karaf.shell.impl.console.ConsoleSessionImpl.run(ConsoleSessionImpl.java:402)
    at java.lang.Thread.run(Thread.java:748)

openhab+ 14764  0.0  0.0  38628  3608 pts/1    R+   22:59   0:00 ps aux --sort -rss
root      8670  0.0  0.0 375684  3600 ?        Ssl   2018   0:03 /usr/lib/udisks2/udisksd --no-debug
root      1471  0.0  0.0 320180  3568 ?        S     2018   0:08 /usr/sbin/smbd
lightdm  28461  0.0  0.0  44992  3416 ?        Ss    2018   0:00 /usr/bin/dbus-daemon --session --address=systemd: --nofo
lightdm  28463  0.0  0.0  44992  3404 ?        S     2018   0:00 /usr/bin/dbus-daemon --config-file=/usr/share/defaults/a
homegear  1461  0.2  0.0 1465924 3400 ?        SLsl  2018  96:38 /usr/bin/homegear -u homegear -g homegear -p /var/run/ho
openhab+  8681  0.0  0.0 368752  3248 ?        Sl    2018   0:00 /usr/lib/gvfs/gvfsd-trash --spawner :1.17 /org/gtk/gvfs/
openhab+  8578  0.0  0.0 390812  3108 ?        Ssl   2018   0:23 xfsettingsd --display :0.0 --sm-client-id 2e348dda8-1236
message+   637  0.0  0.0  45872  2932 ?        Ss    2018   0:32 /usr/bin/dbus-daemon --system --address=systemd: --nofor
openhab+  8713  0.0  0.0  21352  2560 pts/2    Ss+   2018   0:00 /bin/bash
root      1434  0.0  0.0 320196  2552 ?        Ss    2018   0:03 /usr/sbin/smbd
root       678  0.0  0.0 119712  2516 ?        Ssl   2018   0:30 /usr/sbin/findserver -mqtt localhost:1883 -mqttadmin fin
root      1436  0.0  0.0 310020  2488 ?        S     2018   0:02 /usr/sbin/smbd
openhab+  8574  0.0  0.0 189272  2376 ?        S     2018   0:00 Thunar --sm-client-id 29c48a3a9-2aab-4b63-a6a3-6c92f93a3
root       635  0.0  0.0 250116  2344 ?        Ssl   2018   0:17 /usr/sbin/rsyslogd -n
root      8585  0.0  0.0 310344  2300 ?        Ssl   2018   0:00 /usr/lib/upower/upowerd
mosquit+   674  0.0  0.0  42288  2276 ?        S     2018  11:30 /usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf
root       828  0.0  0.0  20484  2124 ?        S     2018   0:00 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp
root       829  0.0  0.0  20480  2060 ?        S     2018   0:00 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp
openhab+  2224  0.0  0.0  56676  2012 ?        S     2018   0:02 /usr/lib/x86_64-linux-gnu/xfce4/xfconf/xfconfd
avahi      657  0.0  0.0  47140  1892 ?        Ss    2018   1:21 avahi-daemon: running [debra.local]
root       390  0.0  0.0  46172  1884 ?        Ss    2018   0:01 /lib/systemd/systemd-udevd
openhab+  2221  0.0  0.0 220216  1860 ?        Sl    2018   0:00 /usr/lib/at-spi2-core/at-spi2-registryd --use-gnome-sess
root       357  0.0  0.0 134060  1852 ?        Ss    2018  20:42 /usr/bin/vmtoolsd
lightdm  28448  0.0  0.0 169108  1640 ?        S     2018   0:00 (sd-pam)
openhab+  1170  0.0  0.0  45452  1472 ?        Ss    2018   0:04 /usr/bin/dbus-daemon --session --address=systemd: --nofo
root       723  0.0  0.0  69952  1352 ?        Ss    2018   0:00 /usr/sbin/sshd -D
root       625  0.0  0.0  36212  1308 ?        Ss    2018   1:26 /usr/sbin/inetd
root       385  0.0  0.0 107200  1232 ?        Ss    2018   0:00 /sbin/lvmetad -f
www-data   767  0.0  0.0 430516  1220 ?        Sl    2018   6:24 /usr/sbin/apache2 -k start
Debian-+  1446  0.0  0.0  58212  1112 ?        Ss    2018   0:00 /usr/sbin/exim4 -bd -q30m
openhab+  2104  0.0  0.0  45120   972 ?        S     2018   0:00 /usr/bin/dbus-daemon --config-file=/usr/share/defaults/a
root       631  0.0  0.0  29664   964 ?        Ss    2018   0:12 /usr/sbin/cron -f
www-data   766  0.0  0.0 561540   804 ?        Sl    2018   6:25 /usr/sbin/apache2 -k start
openhab  17489  0.0  0.0   5880   800 ?        S     2018   0:18 tail -n 200 -F /var/log/openhab2/openhab.log /var/log/op
root      9028  0.0  0.0  44992   788 ?        Ss    2018   0:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-addres
root      1437  0.0  0.0 310036   760 ?        S     2018   0:02 /usr/sbin/smbd
root       630  0.0  0.0 422460   660 ?        Ssl   2018   0:00 /usr/sbin/ModemManager
openhab+ 11604  0.0  0.0  11092   336 ?        Ss   Jan11   0:00 ssh-agent -s
openhab+ 15904  0.0  0.0  11092   336 ?        Ss   18:46   0:00 ssh-agent -s
openhab+   927  0.0  0.0  11092   332 ?        Ss    2018   0:00 ssh-agent -s
openhab+  3712  0.0  0.0  11092   332 ?        Ss   Jan01   0:00 ssh-agent -s
openhab+  4813  0.0  0.0  11092   332 ?        Ss   22:15   0:00 ssh-agent -s
openhab+ 16653  0.0  0.0  11092   332 ?        Ss    2018   0:00 ssh-agent -s
openhab+ 17708  0.0  0.0  11092   332 ?        Ss    2018   0:00 ssh-agent -s
openhab+ 18647  0.0  0.0  11092   332 ?        Ss    2018   0:00 ssh-agent -s
openhab+ 20029  0.0  0.0  11092   332 ?        Ss   Jan12   0:00 ssh-agent -s
openhab+ 21523  0.0  0.0  11092   332 ?        Ss    2018   0:00 ssh-agent -s
openhab+ 22117  0.0  0.0  11092   332 ?        Ss   Jan12   0:00 ssh-agent -s
openhab+ 23877  0.0  0.0  11092   332 ?        Ss   21:21   0:00 ssh-agent -s
openhab+  2931  0.0  0.0  11092   328 ?        Ss    2018   0:00 ssh-agent -s
openhab+  6168  0.0  0.0  11092   328 ?        Ss   Jan11   0:00 ssh-agent -s
openhab+ 30675  0.0  0.0  11092   328 ?        Ss   Jan02   0:00 ssh-agent -s
openhab+  8645  0.0  0.0 290472   292 ?        S<l   2018   0:00 /usr/bin/pulseaudio --start --log-target=syslog
root      1049  0.0  0.0 245684   284 ?        Sl    2018   0:00 lightdm --session-child 14 17
nobody     675  0.0  0.0 128664   208 ?        Ssl   2018   1:20 /usr/sbin/telldusd
openhab+  1453  0.0  0.0  11092   112 ?        Ss    2018   0:07 /usr/bin/ssh-agent x-session-manager
openhab+  1088  0.0  0.0  86964   104 ?        S     2018   0:00 (sd-pam)
root       765  0.0  0.0  75628    72 ?        Ss    2018   1:05 /usr/sbin/apache2 -k start
systemd+   620  0.0  0.0 129348    68 ?        Ssl   2018   0:02 /lib/systemd/systemd-timesyncd
avahi      661  0.0  0.0  47016    24 ?        S     2018   0:00 avahi-daemon: chroot helper
openhab+  1125  0.0  0.0   4288     8 ?        Ss    2018   0:00 /bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserver
root       627  0.0  0.0 153492     4 ?        Ss    2018   0:00 /usr/bin/VGAuthService
root       722  0.0  0.0  36572     4 ?        S     2018   0:00 /usr/sbin/xrdp-sesman
openhab+  1986  0.0  0.0 348696     4 ?        Ssl   2018   0:00 /usr/lib/at-spi2-core/at-spi-bus-launcher
openhab+  8569  0.0  0.0 213124     4 ?        Sl    2018   0:00 gnome-keyring-daemon --start
openhab+  8601  0.0  0.0 284392     4 ?        Ssl   2018   0:00 /usr/lib/gvfs/gvfsd
openhab+  8640  0.0  0.0 187292     4 ?        Sl    2018   0:00 /usr/lib/dconf/dconf-service
root      9027  0.0  0.0  48132     4 pts/2    S     2018   0:00 dbus-launch --autolaunch 56150b9afef84668a01c4560eac8fd0
root         2  0.0  0.0      0     0 ?        S     2018   0:00 [kthreadd]

Ps aux --srt rss

You are running out of threads
Did you see: Why have my Rules stopped running? Why Thread::sleep is a bad idea

I am experiencing similar issue.
I ran about 4 months long version 2.3 without any changes on my rules-files. Did not have any noticable issues.
I lately, last week, upgraded to 2.4.0 running as docker container. I am experiencing that rules stop working.
All of my rules have this similar snipplet:

rule "Initiate gardenlights.rules"
when
        System started
then
        logInfo("Initiate gardenlights.rules", "finished")
end

Which indicates compiling of this changes rule.
I this particular error case I never see the “finished” message, when I refresh/change the rule. This means the rule is not compiled. Though I can see:

2019-02-06 17:56:25.979 [INFO ] [el.core.internal.ModelRepositoryImpl] - Refreshing model 'gardenlights.rules'

I do not have a good feeling about this

My rules also stop working after a day or so on Openhab 2.4. The rules has been working fine for months in 2.3.

I have the same issue, after upgrading to 2.4 it happened at least twice that rules are not working anymore. I had to restart the openhab2 service and system is back in business.
I cannot find any reason why it behaves like this, all rules seem to be ok.

-ben

From what I’ve gathered from reading some comments on the other threads it seems like they did make a major change in how 2.4 works and that’s caused lots of hassle and issues. So while they will say you have a problem with your rules, it’s only because they probably changed something.
The fix is the same either way. You probably have too many rules running at once or rules that last a long time. So you need to try and remove any sleeps from your rules and make sure they run as fast as possible.
I’ve just moved over anything big over to jython, it seems like that will probably be the main rule engine in openhab v3 anyway.
You can also move your long running rules to a lambda, I don’t think they count or have the same issue as long running rules. But lambdas are frowned upon.

This thread seems to be the place to go for for the solution

Well, the running out of Rules threads problem is highly timing driven. An upgrade or change in OH version is necessarily going to change the timing so situations where it used to work before but doesn’t know often are because either something is taking longer now or you were just lucky before.

As far as the fundamentals are concerned, the issue has existed since the beginning of OH. There are a limited number of Rules that can run at the same time. If you have long running Rules, you run a risk of running out of those threads and all your Rules end up stopping.

Also, not many users write their Rules once and never change them. It seems equally if not more likely that the user changed a Rule or added a Rule that changes the timing enough that it becomes a problem.

executeCommandLine and the sendHttp*Request Actions can be a problem too. And locks are a huge potential problem.

NGRE will become the default. And it supports Jython, JavaScript, and Groovy. It’s quite mature and with the helper libraries it is almost as easy to write Rules in these languages as the Rules DSL. I do not hesitate to recommend working in JSR223.

Lambdas will run in the same thread so they will not solve this issue. You can move your long running code to Timers, but there are only two Timer threads (by default) so you run the risk of just moving your problem instead of solving it.

The reason lambdas are frowned upon are:

  • they are not thread safe, if the same instance of a lambda is called from two separate Rules at the same time, they will stomp over each other
  • exceptions get hidden and you end up with a much less useful error message
  • most of the time they are unnecessary; there are other ways to achieve the same reduction in duplicated code
1 Like

Thanks for the clarification on the various points. I’ve used your response as the new solution.

I’ve been using openhab for years without problems and just assumed when I saw pretty much everyone having the same issue with 2.4, that was the issue. But of course I never checked on any of the other upgrades so maybe every upgrade had similar sorts of posts.

Also I was looking to make a binding but and was browsing the dev threads and I quickly decided that setting up the environment seemed like it was too much hassle at the moment and would wait until the recent issues are resolved. I also wondered upon your post about the issues around the breaking changes of 2.4 and just assumed that you mean these issues, I probably took it out of context.

I had some long sleeps, executes scripts with long timeouts, that all run over the top of each other for a number of seconds, etc that I couldn’t easily fix or change and they didn’t cause any problems even though they seem to break every single rule by wide margins. So I assumed that it was the fact they were lambda functions. In comparison I only ported a few small minor short acting rules over to jython. But you are right I didn’t test sticking them into normal rules, so I’m probably wrong. I would try it out but there is no easy way to test to see if the rules will break things without running them for days.

The post you linked to was written for OH 2.3 after I got tired or writing the same thing over and over since OH 1.8 when I realized this was a potential problem.

Good idea. There are lots of changes that are the result of merging ESH back into the OH baseline so there are some problems getting a dev environment up and running. I’d wait until the end of the month. things should have settled down by then.

If you don’t wash your hands it doesn’t mean you will immediately get sick. It just increases the chances you will get sick (and yuck). It’s the same thing with these rules against long running Rules. You won’t automatically have problems if you violate them, but it does increase the chances that you might run into problems at some point. But when/if you do have problems, it can be difficult to figure out what is causing the problem and a whole lot of work to refactor your Rules to solve it. So wash your hands by avoiding long running rules. :wink:

I’m pretty sure that Jython Rules do not run out of the Rules Execution Threads and have no artificial limit to the number of simultaneous running rules. Obviously, every Rule running will consume memory and CPU so you will eventually run out of resources. But that number will be in the dozens or hundreds instead of 5.

This is the problem always with trying to solve problems that depend on timing.

Sure if I knew at the time of writing them I wouldn’t have copied some design styles from the forums. I just write stuff and see if it works. If it’s complicated I use unit testing. I don’t have the slightest clue how I would write a rule that wouldn’t trigger some of the syntax warning when run or know if it’s possible to write/run a rule without some-kind of warning, so there isn’t much hope when trying to consider how long a rule would work, multiple threads, concurrent runs, etc. that is as advanced as it gets. You can’t even test to see if you are meeting some arbitrary conditions.

If somethings are really bad and shouldn’t be done then the forum probably isn’t the place to tell people. Maybe some kind of linter or change to the lsp to say DON’T USE sleep, etc. Would help prevent the issues arising in the first place. Like you said it’s very hard to debug these kinds of issues.

While it would be great to rewrite all my rules, in practice I only need to port stuff over or rewrite stuff when a release breaks them. So to me it’s more like summiting Killi, there is no basin or sink about, the water is just for drinking if it’s not frozen, so no-one washes there hands after a quick pss. It’s going to take some real sht to get people to wash them.

But there are places where it can be used safely. There are places where it is the only viable approach.

And it’s not only Thread::sleep. ReentrantLocks, executeCommandLine, and the sendHttp*Request Actions can take a long time to run as well. The latter two are core features of OH. Should they generate a warning too?

Ultimately, Rules DSL will be deprecated by OH 3 so I don’t think it is worth spending a whole lot of effort on doing anything to it at this point. Note that deprecated does not mean no longer supported. It just will no longer be the default and many of us who support it on the forum will probably move on to NGRE as the focus.

I’m just pointing out that writing rules requires extremely high levels of programming knowledge and understanding most professional programmers would have problems with. It seems like there is no easy or simple solution especially if the functions causing issues are also core then that requires difficult calls by whoever is writing the rules. So maybe my example was too crude, maybe it’s more like expecting people to perform brain surgery to remove a tumor than just washing their hands. Anyway it sounds like the move to NGRE is probably the correct solution. I don’t know the details of NGRE or how it would work but the issues you mention and how rules work sounds like you want an event driven rule system maybe something like javascript. (Or javascript but with everything asynced and awaited in the background.) Does NGRE/jython solve any of the underlying issues or does it just throw more than 5 threads at the problem?

It’s fundamentally different. It uses the threading native to the JSR223 languages that are supported (JS, Jython, and Groovy).

The base architecture is event driven. Events are published on an event bus. Events can cause things to happen. Those things can include Rules running. Events can occur close together so it is possible that more than one Rule will run at the same time. When more than one Rule runs at the same time there may be side effects and unexpected behaviors.

NGRE is not going to change this. If having to deal with this is beyond what should be expected, then unfortunately OH is always going to expect more than what should be expected.

Ahh, I always assumed JSR223 was some kind of alias for jython. I was trying to point out rules essentially are event driven, all the issues you are pointing out are issues caused by mixing event driven frameworks and non-event driven frameworks. But if you use javascript, those issues disappear since you are using an event driven framework with an event driven language. It’s difficult conceptually mixing two different styles of programming and that’s probably where my issue was.

I came from a C/C++ background and that’s probably why I just didn’t understand how to write rules or understand why when I checked a state it might be old and not the state received. So jython seemed like a more natural fit, but if home automation is fundamentally event driven… I’ve been learning JavaScript recently, while I would give an arm and leg to kill whoever invented it, it seems like this event driven style fits naturally with rules and events. In js those long running function had to be fine running on a single thread and don’t cause any issues. Of course you may have issue with two conflicting rules as you mentioned but I think this thread is about long running rules with sleeps and long timeout execs blocking other thread. That’s simply not a thing in js, well it probably is but it would be difficult for me to do. Whereas I don’t even know how to make python save in that respect, I’d have to use those openhab specific lambada things.

So I disagree with you that openhab has to have these really difficult judgement decisions to make sure it doesn’t crash on the next release. But I’m pretty sure that there are more competent people that can comment on these issues.

I managed to get my spotify python script working with jython, I’ll see if I can port over my javascript tuya scripts to this JS223 thing.

I can only see one thread with 4 post about not being able to get javascript working, has anyone got javascript rules working?

JSR223 or writing rules in JS isn’t going to charge this. The reason why in some cases you cannot use the state of an Item immediately after commanding it or updating it is the result of a deliberate design decision.

Consider this. You have a device that reliably reports its state back to OH. Now we send a command from OH. We will want OH to wait for the device to report back that it has changed to the commanded state, not just assume that the device responded to the command. That round trip is going to take time. The few milliseconds inside the Rule between sending the command and checking the state isn’t enough time.

Also consider that some Item types can receive multiple commands but store their state differently. For example a Rollershutter can be commanded with UP, DOWN, and STOP but the state is a PercentType. The command is completely separate from the update and change. In fact, there could be multiple updates and changes as a result of one UP command, for instance. In the case of STOP, there may not be an update or a change at all. So we can’t delay triggering the Rule until it changes or updates as a result if a command.

To handle cases like this OH operates on an event bus. Updates, commands, charges, and triggers get published on the bus. Bindings, Rules, and persistence subscribe to the events and they each process the events in parallel. A command gets published to the event bus before the update or change.

And why is it all done in parallel? Do you really want to have to wait half a second or so while the charge is saved to the database, the UI is updated, before the rule even starts running? If course not. The latency would be totally unacceptable.

Running rules in javascript is not going to change any of this. Everything I talk about above happens outside the Rules. And the way that Rules are defined and how they are triggered is pretty much identical to Rules DSL. The code is written in a different language but that doesn’t change the fundamentals about how events are generated and processed.

Also, the Javascript engine it runs on is Nashorn, a version built specifically for Java. It is not the same as the Javascript that runs in Node or browsers. It does not have the same sorts of event processing features you find in this other JS implementations.

What did change is the artificial limit of five simultaneous threads. So the number of simultaneous rules that can run before they’re are problems is greatly increased, but it will still be possible to use up all resources.

Yes I have. I recommend Lewie’s library if you want to use it. But if you are more comfirtable in Python, I suggest sticking with that. The reasons you state for wanting to use JS don’t apply.

The point I was making is that the behaviour from a C or a python style of programming is very unusual. Openhab does stuff you don’t expect or understand. But from an event driven language it makes perfect sense. So I’m not criticising how openhab works, but just saying that it’s fundamentally event driven. Mixing in non-event driven language or styles is confusing and will give rise to issues. One of the main features of javascript is that it’s event driven, you don’t want a function to lock up the browser while it’s fetching a page. Which is effectively the problem we are seeing here. So I’m just commenting on that it seems like javascript is the natural fit for an event based system. Yes you can use other languages but people need to be aware that they can’t use them like normal and need to take into account the event based way openahb works.