Rules stop executing after a while

Are the events still occuring? Look in events.log to see that your Items are still changing state and receiving the commands that would trigger your rules.

Do you have any Thread::sleeps, while(true) (or the like), or other long running commands like executeCommandLine or sendHttp*Request in your Rules?

Do you see any spike in memory or CPU usage when the Rules top working?

Yes the events are still occurring, I can see them being logged - just the rules stop triggering on them. They get recorded in the persistence files as well. Cron based rules still fire, and timers work, (just new ones do not get created because the rules that create them never fire).

I do have rules with Thread::sleeps, but they are all to do with remote control commands, none of them are automatic, and don’t seem to be anything to do with this (which only started a few weeks ago). There are a couple of executeCommandLine but they don’t seem to be associated with the issue, I have a few sendHttp requests, but they are all manually triggered (ie click a control, and it sends an http request to home a camera).

No spike in memory or CPU that I can see, but this is always at about 3:30am.

It’s frustrating, as if i could find what is causing it, I could fix it, but the rules just silently stop triggering. No messages, errors or anything, and I can’t find a way to restart rule triggers without stopping and restarting OH as a whole.

If the theory is that I’m running out of threads, or a rule is getting into a deadlock, how can I find the culprit? is there a way to tell how many rule threads are in use at any one time, so I could see when it starts to rise?

I changed the number of rule threads from 5 to 50, but it doesn’t really seem to make a difference.

Thanks,

Just to eliminate them as being a problem (this is the number 1 cause of this behavior) I’d add some logging at the ends of your Rules that have these commands, even the manually triggered ones, just to verify that these Rules are exiting and doing so in a timely manner.

I do think it is unlikely to be the cause but it is the usual cause and I’m hesitant to eliminate it as a possibility without some more confirmation.

There may be some tools (profilers) that can be hooked up to OH while it is running to see the thread counts and such but it has been so long since I’ve done anything like that I don’t know what modern Java programs use. There may be some commands that you can run in the Karaf console to get the thead counts.

But before spending too much time researching and testing on this, I would recommend backing up (always backup), and upgrading to 2.3 to verify that it is still a problem there. This may be something that has been previously identified and fixed. 2.3 is nearing release so it shouldn’t be too unstable. Note: it will require some changes to your Rules, mainly you need to put a space after all [ and before all ].

I have already added the spaces to my rules in anticipation of 2.3 - but I’m waiting for the release, it’ll probably take a few days to sort OH out after the update.

Regarding Thread::sleep, everyone seems to think this is a terrible thing - does it sleep all threads? or just the rule that’s running? How would I replace it? (I know some people say “use timers” but how?)

Here is an example rule where I use it:

rule "Skype State Machine"
when
    Item TV_Skype_State changed
then
    /* States for State machine
        0 - idle
        1 - just switch to Anynet
        2 - incoming Skype call, switch to Anynet and set up for receive
        3 - switch back to HDMI 1 (end of Skype Call?)
    */
    
    logInfo("Skype", "Skype state changed to: " + TV_Skype_State.state)
    // 0 is idle state - do nothing
    if(TV_Skype_State.state == 0) return;
    
    if(TVON.state != ON && TV_Power.state != ON) {
        logInfo("Skype", "But Samsung TV is OFF, so ignore")
        postUpdate(TV_Skype_State, "0")
        return;
    }
    if (TV_Skype_State.state == 1 || TV_Skype_State.state == 2) {
        logDebug("Skype", "Answering Skype Call")
        if (TV_Status.state != "ONLINE") {
            sendCommand(TVTools, "KEY_TOOLS")
            Thread::sleep(250)
            sendCommand(TVKey_Enter, "KEY_ENTER")
            Thread::sleep(250)        //might need this, might not...
            sendCommand(TVKey_DOWN, "KEY_DOWN")
            Thread::sleep(250)
            sendCommand(TVKey_Enter, "KEY_ENTER")
            Thread::sleep(250)
            sendCommand(TVKey_Enter, "KEY_ENTER")
        }
        else
            sendCommand(TVSource_Name, "HDMI-CEC")
        if(TV_Skype_State.state == 2) {
            //wait for source to switch to HDMI-CEC
            // use google home mini 1 to announe incoming call
            say("Incoming Skype Call, Incoming Skype Call")
            var int count = 0
            if (TV_Status.state != "ONLINE") count = 31
            while(TVSource_Name.state != "HDMI-CEC"  && count <= 30) {
                Thread::sleep(100)  //100 ms delay
                count = count + 1
            }
            logDebug("Skype", "Answer Exit Loop: TV Source is: " + TVSource_Name.state)
            Thread::sleep(2000)
            //select video answer button on Skype screen
            sendCommand(TVKey_DOWN, "KEY_DOWN") 
            Thread::sleep(250)
            sendCommand(TVKey_DOWN, "KEY_DOWN")
            Thread::sleep(250)
            sendCommand(TVKey_DOWN, "KEY_DOWN")
            Thread::sleep(250)
            sendCommand(TVKey_LEFT, "KEY_LEFT")
        }
    }
    else if (TV_Skype_State.state == 3){
         logDebug("Skype", "Hangup Skype Call")
         var int count2 = 0
         if (TV_Status.state != "ONLINE") {
             sendCommand(TVInput_HDMI, "KEY_HDMI")
             count2 = 31
         }
         sendCommand(TVSource_Name, "HDMI1/DVI")
         var prev_source = TVSource_Name.state
         //sendCommand(TVInput_HDMI, "KEY_HDMI")
         while(TVSource_Name.state != "HDMI1/DVI"  && count2 <= 30) {
            Thread::sleep(100)  //100 ms delay
            if (TVSource_Name.state != prev_source && TVSource_Name.state != "HDMI1/DVI") {
                prev_source = TVSource_Name.state
                logDebug("Skype", "Hangup: Sending HDMI command")
                sendCommand(TVInput_HDMI, "KEY_HDMI")
            }
            count2 = count2 + 1
         }
         //Thread::sleep(2750)
    	 //sendCommand(TVInput_HDMI, "KEY_HDMI")
         logDebug("Skype", "Hangup Exit Loop: TV Source is: " + TVSource_Name.state)
    }
    else postUpdate(TV_Skype_State, "0")
    logInfo("Skype", "Skype Rule Ended")
end

This rule reacts to an update from an android box, that there is an incoming Skype call, the rule then sends the remote control commands to put the TV on the correct input. The delays are due to the hardware response timing.

How would you do this without Thread::sleep?

Thanks,

In your case it is an appropriate use of sleep.

It only sleeps the one thread but unless you change the Karaf config you only get 5 meaning only 5 rules can run at the same time. When you have lots of Thread::sleeps or long Thread::sleeps you increase the chances that you run out of threads. When you run out of threads, no new rules can start executing until the ones that are blocked by sleeps or long running commands complete and the running rules start to execute. It really becomes a problem when you inadvertently create a feedback loop or other types of choke points.

Symptoms include noticeable latency between the time an event occurs and a rule runs, rules start to run erratically, events are processed out of order, or the rules simply stop executing.

Your pool is 50 so you have a lot larger of a buffer before problems might show up, but if you have a feedback loop or a whole lot of events you could run out even at 50.

I’m not convinced this is the problem, but I’d like to eliminate it.

It largely depends on why you are sleeping and surrounding context.

For example, you could replace “Answering Skype Call” with something like the following:

createTimer(now.plusMillis(1), [ |
  sendCommand(TVTools, "KEY_TOOLS")
  Thread::sleep(250)
  sendCommand(TVKey_Enter, "KEY_ENTER")
  Thread::sleep(250)        //might need this, might not...
  sendCommand(TVKey_DOWN, "KEY_DOWN")
  Thread::sleep(250)
  sendCommand(TVKey_Enter, "KEY_ENTER")
  Thread::sleep(250)
  sendCommand(TVKey_Enter, "KEY_ENTER")
])

This will move the sleeps to the Timer threads which have their own pool so you won’t be consuming a Rule thread and allow your Rule to exit immediately.

“Incoming Skype Call, Incoming Skype Call” gives me some concern. If I read this correctly, you can sleep up to six seconds. So, dealing with this section might require a bit more cleverness, though the same trick as above could work.

Personally, I like to have events occur and separate Rules trigger rather than while loops waiting for something to happen. At least you have a max wait in place. But you can shunt the while loop off to a Timer pretty easily with something like:

  say("Incoming Skype Call, Incoming Skype Call")
  val loopStart = now
  var answerCallTimer = createTimer(now.plusMillis(1), [|
    if(TVSource_Name.state != "HDMI-CEC" && now.minusSeconds(3).isBefore(loopStart) {
        answerCallTimer.reschedule(now.plusMillis(100)) // loop
    }
    else {
        logDebug("Skype", "Answer Exit Loop: TV Source is: " + TVSource_Name.state)
        Thread::sleep(2000)
        //select video answer button on Skype screen
        sendCommand(TVKey_DOWN, "KEY_DOWN") 
        Thread::sleep(250)
        sendCommand(TVKey_DOWN, "KEY_DOWN")
        Thread::sleep(250)
        sendCommand(TVKey_DOWN, "KEY_DOWN")
        Thread::sleep(250)
        sendCommand(TVKey_LEFT, "KEY_LEFT")
    }

There are other approaches that could help as well but I don’t think we need to go down that path.

One approach that doesn’t really alleviate the sleeps but could make the code a little less prone for unexpected interactions (e.g. you have other Rules that need to send the command to the TV) is Design Pattern: Gate Keeper. Instead of having your Rules send command with sleeps scattered throughout your Rules you send commands to a Design Pattern: Proxy Item and a Rule that triggers and centralizes the Thread::sleeps so you can have multiple Rules issuing commands to the TV at the same time and they won’t stomp on each other. For example:

import java.util.concurrent.locks.ReentrantLock
val lock = new ReentrantLock

rule "TV Command"
when
    Item TV_Commands received command
then
    lock.lock // Ensures only one instance of the Rule can run at a time
    try {
        val cmds = TV_Commands.state.toString.split(" ")
        cmds.forEach[ cmd | 
            sendCommand("TV"+cmd.replace("KEY_", "").toLower.charAt(0).toUppercase(), cmd)
            Thread::sleep(250)
        ]
    }
    catch(Exception e) {
        logError("TV", "Error handling TV command: " + e)
    }
    finally {
        lock.unlock
    }
end

rule "Skype State Machine"
when
    Item TV_Skype_State changed
then
    ...
        if (TV_Status.state != "ONLINE") {
            TV_Commands.sendCommand("KEY_TOOLS KEY_ENTER KEY_DOWN KEY_ENTER KEY_ENTER")
        }
    ...  
end

NOTE that I’m using Design Pattern: Associated Items to come up with the Item to send the command to based on the cmd string to be sent (it will require normalization of how you do capitalization of your Item names.

You could get even more clever to remove the Thread::sleep in the Rule thread:

import java.util.concurrent.locks.ReentrantLock
import java.util.List

val lock = new ReentrantLock
val List<String> cmds = createArrayList
var cmdsTimer = null

rule "TV Command"
when
    Item TV_Commands received command
then
    // add the commands to the queue
    cmds.addAll(TV_Commands.state.toString.split(" ") 

    // Start the timer thread to work off the commands
    if(cmdsTimer == null) {
        cmdsTimer = createTimer(now.plusMillis(1), [ |
            try {
                val cmds = TV_Commands.state.toString.split(" ")
                cmds.forEach[ cmd | 
                      sendCommand("TV"+cmd.replace("KEY_", "").toLower.charAt(0).toUppercase(), cmd)
                      Thread::sleep(250)
                ]
            }
            catch(Exception e) {
                logError("TV", "Error handling TV command: " + e)
            }
            finally {
                lock.unlock
            }
        ])
    }

end

thanks,

There are some interesting things to try in your examples. I notice you put your lock before the try{} catch(){} , I usually put mine as the first entry in the try{} section - is your example better? I ask because I have been trying to eliminate locks in case that is the source of the problem, but maybe I’m putting them in the wrong place.

many thanks for putting time into this…

No difference really.

Either place could be a problem, but locks are another thing that can cause you to quickly run out of execution threads. It will be exacerbated even further if you are using locks with sleeps.

Update from last night:

15-May-2018 03:26:40.925 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting ISO timestamp: 2018-05-15T03:26:37.041-04:00 to: McTrack12733Time
15-May-2018 03:26:40.926 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 207.149 to: McTrack12733aM
15-May-2018 03:26:40.927 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 10.6627 to: McTrack12733dP
15-May-2018 03:26:40.927 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 0 to: McTrack12733lux
15-May-2018 03:26:40.928 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 00000064 to: McTrack12733DestinationUid
15-May-2018 03:26:40.928 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 23.2163 to: McTrack12733t
15-May-2018 03:26:40.929 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 1059771153 to: McTrack12733SequenceNumber
15-May-2018 03:26:40.929 [INFO ] [pse.smarthome.model.script.JSON Data] - Posting: 17300 to: McTrack12733MessageId
15-May-2018 03:26:40.929 [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'Convert MCT1 Lux to text value'
15-May-2018 03:26:45.485 [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'MCThings 111BC Basement Entry Battery Monitor Status'
15-May-2018 03:27:00.000 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'Test Rules operation'
15-May-2018 03:27:00.001 [INFO ] [smarthome.model.script.Testing Rules] - rules_operating status: True
15-May-2018 03:28:00.000 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'Test Rules operation'
15-May-2018 03:28:00.001 [INFO ] [smarthome.model.script.Testing Rules] - rules_operating status: False
15-May-2018 03:28:00.002 [INFO ] [eclipse.smarthome.model.script.Rules] - Rules have stopped triggering
15-May-2018 03:28:00.219 [DEBUG] [model.script.actions.ScriptExecution] - Scheduled code for execution at 2018-05-15T15:28:00.218-04:00
15-May-2018 03:29:00.001 [DEBUG] [ntime.internal.engine.ExecuteRuleJob] - Executing scheduled rule 'Test Rules operation'
15-May-2018 03:29:00.001 [INFO ] [smarthome.model.script.Testing Rules] - rules_operating status: False
15-May-2018 03:29:00.002 [INFO ] [eclipse.smarthome.model.script.Rules] - Rules have stopped triggering - so restarting OH2
15-May-2018 03:29:31.437 [DEBUG] [org.eclipse.smarthome.model.rule.ide] - BundleEvent STOPPING - org.eclipse.smarthome.model.rule.ide
15-May-2018 03:29:31.437 [DEBUG] [org.eclipse.smarthome.model.rule.ide] - BundleEvent STOPPED - org.eclipse.smarthome.model.rule.ide
15-May-2018 03:29:31.439 [DEBUG] [g.eclipse.smarthome.model.script.ide] - BundleEvent STOPPING - org.eclipse.smarthome.model.script.ide
15-May-2018 03:29:31.440 [DEBUG] [g.eclipse.smarthome.model.script.ide] - BundleEvent STOPPED - org.eclipse.smarthome.model.script.ide
15-May-2018 03:29:31.874 [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'check for Workshop outlet overload'
15-May-2018 03:29:31.910 [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'Driveway Camera Motion Detect Status'
15-May-2018 03:29:31.910 [INFO ] [arthome.model.script.Driveway Camera] - Driveway Camera Motion Detection status is: ACTIVE
15-May-2018 03:29:32.007 [DEBUG] [ntime.internal.engine.RuleEngineImpl] - Executing rule 'McModule McTrack 12733 Location Tracking'
15-May-2018 03:29:32.008 [INFO ] [e.smarthome.model.script.Last Update] - McTrack12733LocationTimestamp DateTimeItem updated at: Tue 03:29:32

This is driving me crazy - I wish I knew what was happening at 3:30am!

Interestingly, as soon as these stop:

15-May-2018 03:29:31.437 [DEBUG] [org.eclipse.smarthome.model.rule.ide] - BundleEvent STOPPING - org.eclipse.smarthome.model.rule.ide
15-May-2018 03:29:31.437 [DEBUG] [org.eclipse.smarthome.model.rule.ide] - BundleEvent STOPPED - org.eclipse.smarthome.model.rule.ide
15-May-2018 03:29:31.439 [DEBUG] [g.eclipse.smarthome.model.script.ide] - BundleEvent STOPPING - org.eclipse.smarthome.model.script.ide
15-May-2018 03:29:31.440 [DEBUG] [g.eclipse.smarthome.model.script.ide] - BundleEvent STOPPED - org.eclipse.smarthome.model.script.ide

Rules start executing again, even though OH is shutting down and restarting.

And - I have been inspecting system logs, and although I said no unusual load on the system - this does not seem to be true, because I found this:

May 15 03:29:41 proliant nagios: SERVICE ALERT: proliant;Current Load;WARNING;SOFT;1;WARNING - load average: 6.06, 5.55, 3.73

Which is not massively high, but normally the server runs at less than 1.0.

I’m pretty sure those bundles are the rule engine itself. So they shutdown which effectively resets the rules but you have a bunch of events in the queue so the rules start triggering again.

Something is happening at that time. It might be a fluke that it is always happening at 3:30am. Under other circumstances it might happen at midnight or 6 am.

Check the cron jobs on your machine. Perhaps there is something else being triggered that is causing a resource contention or the like.

Though the fact that resetting the Rules really points to there being a problem with your Rules. Despite your having 50 threads I think you might be running out. You can perhaps test this by reducing the number back to 5 and seeing if the rules stop running sooner and upping it to 100 to see if it takes longer.

Otherwise, you need to trace through the events that take place a few minutes prior to 3:30 and figure out if you have a run away feedback loop or something causing you to run out of threads.

I’d still recommend upgrading to the latest 3.2 snapshot before spending any more time on this. If it is a bug it might already be fixed. And if it isn’t, you will have to redo all of the above again on 2.3 to provide needed details to the devs anyway. In the long run it will save you some time, unless the problem is indeed your Rules.

Another thing to do is add some logging to verify that all your locks are being unlocked. That could be another source of problems.

I have added the logging as you suggested for the locks.

It seems from my investigation that something starts running at 3:00am, and slowly the system load goes up, not apparently due to CPU usage, but due to iowait hitting 40%. Once I hit that threshold, rules stop triggering - but when the iowait drops, rules start triggering again.

So OH may be the symptom of high iowait, not the cause.

My current suspect is crashplan, which does ramp up iowait when it’s running.

I’m running it now, and iowait hits 40% and OH stops triggering, I stop it, and iowait drops, and OH starts again…

I will probably bite the bullet and upgrade to 2.3 - it just took me 2 days to get everything working after 2.2 upgrade, so not keen to get into that again if there is some other ongoing issue.

Ha! found this on the crashplan web site:

The default file verification scan schedule, every 1 day at 3:00 AM, works for many users. However, there are situations when you may want to change these settings, as described in the sections below. To change the file verification schedule:

For what it’s worth, the upgrade to 2,3 went a whole lot smoother for me than going from 2.1 to 2.2. The only gotcha was the need for spaces inside the [ ].

I’m glad you found the culprit. It would have taken me quite some time to find that as the source of the problem.

Well it took me forever as well. I have just changed crashplan to verify every 7 days at 15:00, so if I have no problems until then, that will validate that this is the culprit.

If this is the case, then it is worth noting that if iowait goes above 40% (at least on my system), rules stop triggering.

Thanks for the help.

1 Like

OK,

2 days now and no issues. I have disabled crashplan altogether. It seems that this was the culprit.

Thanks to all for the help.

1 Like

Nicholas - I am having a similar problem with rules just stopping to work. How do I disable crashplan?

Depends on your OS, I’m using Ubuntu 14.04, so I use:

sudo service crashplan stop

But it depends on your OS version as to whether it uses upstart or not to start services.

Nick Waterton P.Eng.

National Support Leader - Nuclear Medicine, PET and RP

GE Healthcare

M+1 416 859 8545

F +1 905 248 3089

E nick.waterton@med.ge.com

2300 Meadowvale Boulevard

Mississauga, ON L5N 5P9 Canada

GE imagination at work

thank you for the quick reply - crashplan is not running on my PI. time to keep looking :frowning:

Crashplan has ended their home plans so it is unlikely that this would be causing your problems.

Have you reviewed Why have my Rules stopped running? Why Thread::sleep is a bad idea? I know you are using a bunch of locks and long running commands (sendHttp*Request) which can also cause problems.

Just to explain further what I found.

It’s not so much crashplan running that is the issue, it’s when there are a lot of changes in the filesystem. Crashplan scans the filesystem for changes every so often. If there are a lot of changes, then your iowait starts to climb. If the iowait goes over 40% or so, then OH2 rules stop executing.

Way round this is to exclude files that change a lot from crashplan (like .log files). If that doesn’t help, then what I do is stop OH2 for a few hours, and let crashplan run. It will scan the filesystem (may take a while) and start backing up. Once the backing up starts you can restart OH2, and you should be OK (until you have large changes in your file system again)…

This may have something to do with crashplan running at a nice of 19 (ie super low priority), so it’s iorequests start clogging everything up. I have tried changing the nice of crashplan, but doesn’t really seem to help.

Now why OH2 rules stop executing if iowait goes over 40% - I don’t know.

I have set up a rule that stops crashplan if rules stop executing (on a timer – timers still execute), and restarts OH2 if things don’t get going with just stopping crashplan. It’s a band-aid though.

Nick Waterton P.Eng.

National Support Leader - Nuclear Medicine, PET and RP

GE Healthcare

M+1 416 859 8545

F +1 905 248 3089

E nick.waterton@med.ge.com

2300 Meadowvale Boulevard

Mississauga, ON L5N 5P9 Canada

GE imagination at work