Design Pattern: Gate Keeper

It’s a weird quirk, state ON is not the same as command ON. It’s a bit more obvious for e.g. Dimmer where state 75% is not the same as command INCREASE.
The system will sort out string “ON” into the appropriate type.

I started with the code at the top, and got an error like above where 5iver said to remove the

Queue<String>

part. Then I got crashes that I determined was the timer by commenting things out. The code wasn’t obvious why it was wrong, and looking at it I could see the timer would run forever (you should make it stop rescheduling when the queue is empty). So instead just use a cron rule if it would run forever, and that worked.

Getting back to the timer part I added logging statements and basically found that maybe the queue function is thread safe or whatever (has it’s own locking I guess), but the if timer is null is NOT. I could see from my logs that two instances of the rule would each see it null and separately start timers. I also added a check before the restart to only do it if not null. I think that’s where the crash came from, the first time it ran the timer was null. That seems a little odd to me, it’s like the timer for “now” started right away and executed before the timer variable for that timer was filled in.

My cron method works. I’m not overly concerned about it running forever, but then it will wait when just one things comes, where the timer method won’t delay the first one. I’m doing this with a 15 second between, because it’s for sending a message to Alexa. That binding doesn’t work if you send multiple messages close together (Alexa textToSpeech function).

I’m trying a different method to avoid the problems above. Instead of looking at the timer being null to start a timer when adding to the queue, I turn a flag switch item on. When that item goes on, it starts the timer. I was also thinking the timer needed to start later than “now” but I haven’t had a crash yet from that. So maybe somehow the multiple threads doing the start timer thing made it hit null for one of them on the reschedule.

I haven’t had a problem with the timer starting at now, so I tried to make it like the originally one to recreate it. I seem to be getting a different error then:

2020-04-29 17:20:32.460 [WARN ] [el.script.internal.actions.TimerImpl] - An error occurred while rescheduling the job 'DEFAULT.Timer 1560 2020-04-29T17:20:32.456-05:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
  org.eclipse.xtext.xbase.impl.XIfExpressionImpl@4596c2ea (conditionalExpression: false)
} ]': Unable to store Job : 'DEFAULT.Timer 1560 2020-04-29T17:20:32.456-05:00: Proxy for org.eclipse.xtext.xbase.lib.Procedures$Procedure0: [ | {
  org.eclipse.xtext.xbase.impl.XIfExpressionImpl@4596c2ea (conditionalExpression: false)
} ]', because one already exists with this identification.

One of the first errors I got was actually because I didn’t have “toString” for the call to Alexa (I can’t understand why it worked with logging if it isn’t a string). Also with the above error I can see that it had multiple timers. This is the code that got the above error:

import java.util.concurrent.ConcurrentLinkedQueue

val commands = new ConcurrentLinkedQueue()
var Timer timer = null

rule "Amazon Echo announcement Controller"
when
    Item AmazonController received command
then
    commands.add(receivedCommand.toString)

    if (timer === null){
      timer = createTimer(now, [ |
        if(commands.peek !== null) {
           val cmd = commands.poll
           logInfo("Testing", "value: " + cmd)
           timer.reschedule(now.plusSeconds(15))
           }
        else{
           logInfo("Testing", "Timer done")
           timer = null
        }
      ])
    }
end

And with logging statements I could see multiple timers were started, all at the same time (so it wouldn’t space out the Alexa commands).

So this is my final method:

import java.util.concurrent.ConcurrentLinkedQueue

val commands = new ConcurrentLinkedQueue()
var Timer timer = null

rule "Amazon Echo announcement Controller"
when
    Item AmazonController received command
then
    commands.add(receivedCommand.toString)
    AmazonControlFlag.sendCommand(ON)
end


rule "Amazon startup"
when
    System started
then
    AmazonControlFlag.sendCommand(OFF)
end


rule "Amazon work"
when
    Item AmazonControlFlag changed from OFF to ON
then
    timer = createTimer(now, [ |
        if(commands.peek !== null) {
           val cmd = commands.poll
           Alexa_Kitchen_TTS_Volume.sendCommand(100)
           Alexa_Kitchen_TTS.sendCommand(cmd.toString)
           timer.reschedule(now.plusSeconds(15))
           }
        else{
           timer = null
           AmazonControlFlag.sendCommand(OFF)
        }
     ])
end


rule "Check for more work"
when
    Item AmazonControlFlag changed from ON to OFF
then
   if(commands.peek !== null) {
       AmazonControlFlag.sendCommand(ON)
   }
end

So when the first rule has multiple ones at the same time, when they all turn the flag on, it will just start the timer once since it triggers when changing from OFF to ON. I’m not sure if a race can happen at the end where a new one puts it on the queue and turns on the flag in between the timer seeing there is no more work and shutting off the flag. Just in case, the last rule checks if the queue is empty, and if not turns the flag on to start a new timer. So I like this better than the cron method because it doesn’t have to wait for the first one and won’t be running all the time.

That’s quite ingenious.
I’ll hazard a guess that you are running openHAB on a fairly potent box. This does throw up timing races that the majority of users never see.
There was a bug that silently dropped createTimers scheduled for the same instant, no-one spotted that for years. (fixed in OH2.5)
On a plodding PI, now+5 is different every time you call it, but on a powerful box you might call that several times and get the “same” answer at the millisecond level.

I’ll make a guess about that error; the timer management subsystem is Quartz, and that is granular down to 1mS only.
With the bug fix above, it is now possible to create timers with the same target time, to the millisecond. I wonder though if they both try to reschedule themselves for the same instant, one blows up?

Depending on context the Rules DSL is able to recognize that it needs to call toString on an Object. For example "My Item is = " + MyItem will work because it can see that you are trying to add to a String. However, MyItem + " = this is my Item" will not work because it sees you are trying to add a String to an Item and there is no method that does that.

When you use the sendCommand Action, it requires two Strings as arguments (name of the Item and command to send to that Item) and in this case it’s not smart enough to recognize that it needs to call toString on the command being sent.

One thing to note about this DP is that this problem does not exist in Scripted Automation. In Scripted Automation (note, I’ve seen hints that even Rules DSL may be running in Scripted Automation for OH 3) two instances of the same Rule cannot run at the same time so the two Timers problem wouldn’t be possible.

Another approach that might have worked but I’ve not had time to try it out (hence my lack of previous replies) would be to put a lock around the creation of the Timer. That would make the creation of the timer an atomic operation so even if two copies of the Rule are running at the same time only one could enter the create Timer code at a time and therefore the Timer would exist by the time the second one acquires the lock.

I’m running 2.5 so would have whatever fix you are talking about. For the box, I’m running on a Synology DS718+ in a docker container.

I was using the item_name.SendCommand format, but I guess even then it wasn’t smart enough to know to do toString. Thanks for the info.

Is scripted automation the JSR223 scripting or something else. I’ve been meaning to ask also: it seems like I’ve seen things that JSR223 scripting as just an alternative to DSL (meaning not experimental), but the top of the document section says you need to have the experimental rule engine installed.

The JSR223 name is an anachronism but for this discussion yes, they are the same. The only reason the “Experimental” hasn’t been removed is because PaperUI is not all that usable for defining any but the most simple Rules. The engine itself is quite mature and stable. The Python version in the OP is using Python in Scripted Automation.

1 Like

Not directly related to the gate keeper, but my discussions above were about races. I think I got a race with this:

rule "Mudroom Door Open"
when
   Item DoorMudroom changed to ON
then
   mr_time = now
   NotifyController.sendCommand("HI Mudroom door opened.")

   mudroom_timer = createTimer(now.plusMinutes(5), [|
        mudroom_timer = null
        NotifyController.sendCommand("HA Mudroom door is left open.")
        ])
end

rule "Mudroom Door Close"
when
   Item DoorMudroom changed from ON to OFF
then
   val long delta = now.millis - mr_time.millis
   sendBroadcastNotification("Mudroom door was open for " + Long.toString(delta) + "ms")

   mudroom_timer.cancel()

end

I actually have a few more statements in that execution block for other stuff. Hmm, I think I should move it so the create timer is first, but would that be fool proof? Basically, the ON and OFF happened close together:

2020-05-18 22:09:22.766 [ome.event.ItemCommandEvent] - Item 'DoorMudroom' received command ON
2020-05-18 22:09:22.766 [nt.ItemStatePredictedEvent] - DoorMudroom predicted to become ON
2020-05-18 22:09:22.770 [vent.ItemStateChangedEvent] - DoorMudroom changed from OFF to ON
2020-05-18 22:09:23.050 [ome.event.ItemCommandEvent] - Item 'DoorMudroom' received command OFF
2020-05-18 22:09:23.050 [nt.ItemStatePredictedEvent] - DoorMudroom predicted to become OFF
2020-05-18 22:09:23.054 [vent.ItemStateChangedEvent] - DoorMudroom changed from ON to OFF
2020-05-18 22:09:23.073 [ome.event.ItemCommandEvent] - Item 'NotifyController' received command HI Mudroom door opened.

The timer fired and gave the notification. I guess canceling a timer doesn’t give an error if it’s null? I see nothing in the openhab.log saying such, but it must have been if the timer didn’t get canceled. I can test the door item in the timer rule, but any other option to prevent such races?

The reason they are so close together is the alarm.com integration with Lutron sucks (and with Alexa too for that matter). Make it turn on a light when the door opens. I open the door and stand there waiting for the light to come on. More than a minute. But, shut the door and then it does both the open and close events. I have the doors controlling virtual light switches to get the events in openHAB. The Alexa integration, you have to say to Alexa “Tell alarm dot com to …” to get it to do anything, and it can’t be used in Alexa routines.

It should. That is why we usually use timer?.cancel. The ? skips the rest of the statement if timer is null. Without the ? you should see an error in the logs saying that null does not have a method cancel() or something like that. Are you looking in openhab.log too?

If the rule ran I can’t imagine how that didn’t cancel the actual Timer or generate an exception. Add some logging to verify that the Rule actually ran. There is almost a half a second between the commands. That should be more than long enough for the timer to have been created before the closing Rule ran unless you are on something below the minimum requirements to run OH (e.g. an RPi 0W).

It wouldn’t hurt to move the createTimer to the top of the Rule, particularly if there are some long running Rules like a sleep, executeCommandLine, or sendHttp*Request.

But over all there should have been an exception in the logs if the Timer was null.

Definitely looked at openhab.log, and there’s no error. Rule didn’t run? Well, the other thing in there is a sendBroadcastNotification, which didn’t get received. But I notice I haven’t received ANY since 3:13PM yesteday (I’m getting rid of them because of the lack of reliability).
The cancel doesn’t change the value of the variable, so it can be canceled again, right? I think the prior timer, that got canceled, was still in the variable, because it only gets set to null in the timer itself. No error if a cancel is given to a timer already canceled?
Curious, can one rule be executed at the same time? If it was a single rule for just “changed” or “received command” or something, will it wait for the rule to finish before the next one starts, or they can overlap?

I’m talking about the Close Rule. There is no sendBroadcaseNotification in that code.

You should always post the full code. Don’t edit it for us. Often we know far better than you do what is and is not relevant and we are perfectly capable of ignore those parts. But when you edit them out, you potentially remove something that is important to understand what is going on.

Yes but that Close Rule is the only place where a Timer gets cancelled, unless you have other Rules working on this Timer that are not shown. While the order of triggering of Rules is not guaranteed, again, the 400 msec or so between the ON and the OFF commands is more than enough to ensure that the Open Rule runs first and the Close Rule second (unless it’s a really under powered machine or otherwise resource constrained (e.g. the CPU is pegged).

It’s often good practice that when you are using null to mean something, in this case that there is no timer, that you should set the variable to null when you cancel it.

In Rules DSL, yes. The same Rule can have more than one instance of itself running at the same time. It is possible to have up to five Rules total running at the same time.

With the new rule engine you can have any number of Rules running at the same time, but only one instance of any specific Rule can run at the same time.

Thus, merging these two Rules into one would not solve the problem.

Add logging to the Rules so you can see when they run in relation to when the events occur. And set the Timer variable to null after you cancel it.

Or you can change your overall approach. Add a check to the Timer to only send the notification if the door is still open.

I’m running on a Synology DS718+ with Docker (the standard 2.5.4 docker image). I edited my post with the full rule. I first put in a timer to see what alarm.com was doing (most normal door open/close are close together). The smallest recently was 155 ms. The event.log from that case:

2020-05-14 15:29:50.679 [ome.event.ItemCommandEvent] - Item 'DoorMudroom' received command ON
2020-05-14 15:29:50.679 [nt.ItemStatePredictedEvent] - DoorMudroom predicted to become ON
2020-05-14 15:29:50.685 [vent.ItemStateChangedEvent] - DoorMudroom changed from OFF to ON
2020-05-14 15:29:50.686 [ome.event.ItemCommandEvent] - Item 'NotifyController' received command HI Mudroom door opened.
2020-05-14 15:29:50.687 [vent.ItemStateChangedEvent] - NotifyController changed from HI Single garage door opened. to HI Mudroom door opened.
2020-05-14 15:29:50.689 [ome.event.ItemCommandEvent] - Item 'NotifyControlFlag' received command ON
2020-05-14 15:29:50.691 [vent.ItemStateChangedEvent] - NotifyControlFlag changed from OFF to ON
2020-05-14 15:29:50.834 [ome.event.ItemCommandEvent] - Item 'DoorMudroom' received command OFF
2020-05-14 15:29:50.834 [nt.ItemStatePredictedEvent] - DoorMudroom predicted to become OFF
2020-05-14 15:29:50.839 [vent.ItemStateChangedEvent] - DoorMudroom changed from ON to OFF

Here the NotifyController update for the open happened in between. In the failing case, that was the next thing after the close (I’ll add that above).

A potential problem with checking in the timer, if the door was closed and opened again when the timer expires, it would trigger even though the door wasn’t open for 5 minutes.

Sure, there’s no reason why that can’t happen
When the door-open rule runs, it issues a command to NotifyController.
That’s an asynchronous thing, your rule doesn’t stop and wait for effects.
It fires off the command onto OH event bus.

Some time later, the event will trigger other processes - autoupdate, other rules, bindings, whatever you have configured. And of course the log entry you are looking at.
You would normally expect all that to happen within 100mS, but the log in particular can have its I/O file writing delayed.
It’s not that uncommon to see evidence of a rule executing before the log of the event that triggered it, so don’t be alarmed.

So far as I can see that is all unrelated to your actual fail-to-cancel problem, treat as red herring.

No matter what happens here, the solution is going to be a work around the limitations of the Rules DSL. So you can check that the door is still open and has been open for the full time and if not reschedule. mr_time is available inside the Timer body. And as is usual in programming, continue to find and deal with the edge cases. That would, IMO, be preferable to resorting to locks in this case.

Or you could use the new Rule engine and merge these two Rules into one. There is a kind of built in locking as only one instance of the Rule can run at a time. If the Rules are merged, as long as you are only dealing with two events that are really close together (open then closed) there shouldn’t be a problem with processing them out of order.

155 ms is getting to the range of potential timing problems. Ultimately you will have to decide whether it’s worth the time to work around this with your current approach or choose another one. But if you want to continue, a key piece of information is going to be the timing of the Rules execution in relation to the events in events.log and for that you need to add some logging to the Rules.

Could you explain this? If closed is executed before open, it won’t work. And are you saying that the new rules engine, while not running the rule instances together might run them out of order?

Ok. The timer can see the door is open and mr_time is at least 4:55 or something earlier.

Even at 100 msec that is plenty of time for the Rule to start processing the first event and with the new Rule engine all the subsequent events will be queued up awaiting their turn to use the Rule. If there is only one subsequent Rule there is no problem. If there are two or more, I don’t think that those two or more are guaranteed to be processed in order.

I added some debug log statements, but that problem hasn’t occurred again so far. But now I have a new one that makes no sense.

The rules:

import java.util.concurrent.ConcurrentLinkedQueue

val commands = new ConcurrentLinkedQueue()
var Timer timer = null

rule "Amazon Echo announcement Controller"
when
    Item AmazonController received command
then
    commands.add(receivedCommand.toString)
    AmazonControlFlag.sendCommand(ON)
end


rule "Amazon startup"
when
    System started
then
    AmazonControlFlag.sendCommand(OFF)
end


rule "Amazon work"
when
    Item AmazonControlFlag changed from OFF to ON
then
    timer = createTimer(now, [ |
        if(commands.peek !== null) {
           val cmd = commands.poll
           Alexa_Kitchen_TTS_Volume.sendCommand(100)
           Alexa_Kitchen_TTS.sendCommand(cmd.toString)
           timer.reschedule(now.plusSeconds(10))
           }
        else{
           timer = null
           AmazonControlFlag.sendCommand(OFF)
        }
     ])
end

First, the logs for items Alexa_Kitchen_TTS and AmazonController commands received:

2020-05-21 11:27:40.668 [ome.event.ItemCommandEvent] - Item 'AmazonController' received command Mudroom door is left open.
2020-05-21 11:27:40.670 [ome.event.ItemCommandEvent] - Item 'AmazonControlFlag' received command ON
2020-05-21 11:27:50.675 [ome.event.ItemCommandEvent] - Item 'AmazonControlFlag' received command OFF
2020-05-21 11:36:34.878 [ome.event.ItemCommandEvent] - Item 'AmazonController' received command Main garage door is left open.
2020-05-21 11:36:34.881 [ome.event.ItemCommandEvent] - Item 'AmazonControlFlag' received command ON
2020-05-21 13:06:47.488 [ome.event.ItemCommandEvent] - Item 'AmazonController' received command Main garage door is left open.
2020-05-21 13:06:47.490 [ome.event.ItemCommandEvent] - Item 'AmazonControlFlag' received command ON

After the middle command, the Flag is not turned off. The timer either reschedules or turns the flag off, so how can that timer exit without turning the flag off?

Proof the timer actually ran at that point, the entire log around that time:

2020-05-21 11:36:34.878 [ome.event.ItemCommandEvent] - Item 'AmazonController' received command Main garage door is left open.
2020-05-21 11:36:34.880 [vent.ItemStateChangedEvent] - AmazonController changed from Mudroom door is left open. to Main garage door is left open.
2020-05-21 11:36:34.881 [ome.event.ItemCommandEvent] - Item 'AmazonControlFlag' received command ON
2020-05-21 11:36:34.881 [vent.ItemStateChangedEvent] - AmazonControlFlag changed from OFF to ON
2020-05-21 11:36:34.889 [ome.event.ItemCommandEvent] - Item 'Alexa_Kitchen_TTS_Volume' received command 100
2020-05-21 11:36:34.897 [ome.event.ItemCommandEvent] - Item 'Alexa_Kitchen_TTS' received command Main garage door is left open.
2020-05-21 11:36:34.899 [nt.ItemStatePredictedEvent] - Alexa_Kitchen_TTS_Volume predicted to become 100
2020-05-21 11:36:34.904 [nt.ItemStatePredictedEvent] - Alexa_Kitchen_TTS predicted to become Main garage door is left open.
2020-05-21 11:36:34.909 [vent.ItemStateChangedEvent] - Alexa_Kitchen_TTS changed from  to Main garage door is left open.

To that means the timer started, and should then have been rescheduled. When it then had no work it should have turned off the flag. Since it didn’t, the future ones didn’t execute. I updated an unrelated items file and that somehow turned off the flag, so the next event caused Alexa to go crazy with a whole list of pending things to say.

You’ve removed the code from the rule that prevents you starting off a new timer if one is already in use (the null check). I don’t know if that was original problem, but it will bite you eventually.

Yes, editing stuff in flight can cause Items to go to state NULL and System started rules to run.