Sometimes rules don't get executed?

Thanks! I have tried it. Renamed my rule and add another rule (to the same file, maybe try adding it to a new file?)
However I can’t see neither of the two rule running (no logging). Plus now I can’t even see any of my rule producing logs…
One thing to add (I don’t know when it happens or it has anything related to it…) I have these switches in a group (OR) , and that group is displayed in a sitemap… However sometimes the group label says that it is OFF, but there are items in it which is ON…

Edit2: for example for the Time Of Day rule (day_state) I have a dummy switch for debugging which can trigger this rule. If I press that switch the rule gets triggered (I can see it in the logs), but not other rules running now… So it seems that the triggers which has to happen ‘automatically’ are not triggering the rules.

What are you using to trigger these other Rules? If you are triggering them using changed but you manually trigger the ToD Rule then those Rules won’t trigger because day_state isn’t changing. Make sure that the Rules are using the appropriate trigger.

Over all I don’t have any ideas. If you are not out of Rule threads and the Item is changing states then the Rule should trigger.

I have around 40 rules, which is triggered by mostly Item changes, so it should work at any time…
Any good method to track available/used rule execution threads?

I think you miss my point. Rules triggered by changes will only trigger when the Item changes state.

If you trigger your time of day Rule by hand, the day_state will not change. Thus those Rules that trigger on changes to day_state will not trigger.

See post 12 by 5iver in the link I posted above.

I see your point, but for eg.: mobile phone’s location changes perodically and I have some rules which is triggered by that change. Sometimes all of these rules stops working, since it has logging and it won’t update the info…
Secondly I have rules which are triggered for eg.: light switches. They also stop working at these times…

Okay, that’s clear and we’re wasting time concentrating on this one rule.

How-to monitor thread usage is here somewhere

Yes and sorry for that, I’m also just found that this happens… hard to spot these errors. So for me (not sure) seems that sometimes all rules stops working and after an hour or so (not sure…) all rules starts to work again… no reboot or restart is required to work again.
Thanks, I’ll try to set-up thread monitor and maybe that helps us to investigate this error.

Oh yes. “Everything stopping” is not at all as clear-cut as it sounds, in an event driven system !

You might add a little rule that logs at ten-minute intervals or suchlike, to try and see what events lead up to seizure.

Yes this is what is strange to me also. No errors or any other sign of problem, just like the ruleEngine stops and everything else works as expected (bindings, etc…)

I will make a rule and try to monitor the threads

Thanks!

So I had a little time to play around with this. I have been watching the ruleEngine with this command:

shell:threads --list |grep "RuleEngine" |wc -l

If I’m right it should return how many threads are used by the rule engine, right? It is really unpredictable. Sometimes none of my rules are running and it returns 6 (which is 5 running threads if you distract the 1 from this caused by this call).

I have also set-up a test rule which runs every 10 minutes. For example, now the status is the following:

  • The test rule is working, every 10 minute, it produces some log.
  • The command above returns only 1 thread is used.
  • My other rules don’t want to work…

After 5 mins:

  • It seems that somewhere the execution is stuck, because all of the rules are executed together (around 100 log entry in my events.log that my rules ran around 20 times) which should ran in the previous 10 minutes. So it seems that it “knew” that the rule should run, but unable to execute them and now executes together.

For example the rule which I inserted in the first post, ran together, which should ran when it detects motion in one of my room:

2018-10-13 20:03:26.006 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.379 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.480 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.490 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.492 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.566 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.888 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.931 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:26.954 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:30.285 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

2018-10-13 20:03:30.402 [INFO ] [smarthome.model.script.wifiled.rules] - Night Light rule started!

…and many more…

After this “catching up” everything works as it should… and it happens after a random period…

Hope it is understandable what I’m trying to say.

Yes, I think so. It’s all rather odd. Keep a lookout for if the “10 minute ticker” rule misses its expected time, that may help reveal the nature of the logjam.

See if you can figure out which is the first rule to appear when the “catching up” happens - it may be something about that one or its triggering that is the logjam.

What kind of host are you running on ? - I’m wondering if there is an external cause, like having trouble fetching rules file from an SD card or something.

I’m using a RPi 3B with openHabian on it. The next I will try I think is to move my whole setup to another RPi and reserve it just for that (now I have some other small scripts running on this RPi as well - perfomance doesn’t seems an issue, but might cause some problem…).

The timer seems to get executed every time, on time (10-150ms after the 10min, so it is really good).
Two rules which seems to “catch up” first:

  • The one I posted here in my original post.
  • I have another rule which takes every phone’s location and uses the Google API to return a postal address (this is good to have, because you know where that phone is without zooming and looking at a map). However his rule has a big, 20sec timeout specified because unfortunately it won’t return anything if I reduce it to, lets say 10sec… I don’t know why, because execution is much faster… But maybe this can cause issues. I only have 4 devices which location is transformed with this rule and I have increased the ruleEngine execution threads to 7, so always it should have at least 3 free thread (the location is polled - with iCloud binding and Owntracks - so it shouldn’t be a problem if I’m correct, because the rule execution is max. 20 secs and the poll time is 10 min…).

Ps.: I have another rule which calculates if the phone is Home or Away. Usually it is ran together with my LocationToString (that’s how I call the rule above, which translates the Location to Postal code) and can it cause any problem if it wants to use 8 thread (4 * 2) and only 7 available and maybe other rules triggered?

Another approach is I’m thinking of, that I remove some rules - maybe this one - and see if it makes any difference. I really don’t know what other options I have…

Did you create some kind of recursion/circular dependency of rules (or items you trigger upon) on each other ?

You can increase the number of threads (org.eclipse.smarthome.threadpool:ruleEngine=20
in runtime.cfg) if you believe that to be an issue (it shouldn’t under normal conditions)

Write performance can also be causing this (if you still write logs to SD card which you shouldn’t be doing).

I have eliminated circular calls, because I had problems with that as well in other rules.
Unfortunately I’m writing the logs to the SD… I also thought of redirecting the var/log to RAM but I didn’t had time for that. Or are there any better solution for this?

You mean because the slow SD card (but as far as I can remember it is not a bad card at all - or it was good when it was new) can cause not executing rules? Because not only the logs are not there, the rule are not executed as well - or it stops executing because it can’t read/write files?

Well repetitive writes to SD can queue up and block threads that are then unavailable to execute rules, slowing everything down. And logging+swapping can corrupt an SD fast so there can even be a problem with a fairly new card. Yes there’s better solutions, tmpfs being one, separate USB or NAS storage are others.

I have a NAS, do you have any experience redirecting the whole var/log folder to it, how will it do performance-wise? Both my RPi and NAS is connected via Ethernet, but my NAS is an older cheap D-Link…

What is a better approach for now? tmpfs or nas?

I don’t like the sound of this, potential multiple copies of the rule hanging up awaiting Google responses.
It rather depends on the triggering conditions to “guarantee” only one run per device.
I wonder how Google reacts if you bang off multiple requests before it responds.
At the least, I would implement some locking so that only one Google lookup is in progress at a time.

Though it sounds like it should simple enough to temporarily omit the lookup in this rule, for elimination purposes.

Yes that’s what I’m afraid of, that maybe this can cause some problem… I have tried today to implement a lock, but I couldn’t find a quick solution which seems to work…

NAS. See this post.
Now while I don’t know if it is to help with your specific rules problem, for sure it’s a good idea to setup your system for less writes, with a UPS and backup in place.

Thanks, I’ll read through that. Hope that fixing my rules and moving logging to somewhere else, would make sense and this “unsolvable” error will be gone…