[JSR223][Jython]What is the correct way to avoid a race condition getting an item value inside a /received update/ triggered rule?

leif · June 14, 2019, 1:29am

Platform information:
- Hardware: Raspberry Pi 3 B+, 120gb mSATA SSD in a USB enclosure
- OS: Raspbian Stretch
- Java Runtime Environment: 1.8.0_212
- openHAB version: 2.4.0 release build. Jython 2.7.0

This morning my night lights didn’t turn off.

Here’s how I have them set up.
Two Switch items triggered by astro channel events, and other rules triggered based on the switches.

Items:

Switch DarkOutside "Dark" <moon> (Bench)
Switch SleepMode "Sleep" <bedroom_blue> (Bench)

night_lights.py:

from org.slf4j import LoggerFactory

from core.rules import rule
from core.triggers import when

def logprint(text):
	pass
	LoggerFactory.getLogger("org.eclipse.smarthome.automation.examples").info(text)


@rule("Sunset Rule", description="", tags=["schedule"])
@when("Channel \"astro:sun:local:civilDusk#event\" triggered START")
def Sunset_Decorators(event):

	item=ir.getItem("DarkOutside")
	item.send(ON)
	

@rule("Sunrise Rule", description="", tags=["schedule"])
@when("Channel \"astro:sun:local:civilDawn#event\" triggered START")
def Sunrise_Decorators(event):

	item=ir.getItem("DarkOutside")
	item.send(OFF)
	
	


@rule("Night Lights", description="", tags=["schedule"])
@when("Item DarkOutside received update")
@when("Item SleepMode received update")
def NightLights_Decorators(event):

	itemDarkOutside=ir.getItem("DarkOutside")
	itemSleepMode=ir.getItem("SleepMode")

	isDarkOutside=1 if itemDarkOutside.getState() == ON else 0
	isSleepMode=1 if itemSleepMode.getState() == ON else 0


	itemNightLights=ir.getItem("NightLights")

	logprint("dark: " + str(isDarkOutside) + "   sleep: " + str(isSleepMode))

	itemNightLights.send(ON if isDarkOutside!=0 else OFF)

	itemMainGateLights=ir.getItem("MainGateLights_On")
	itemMainGateLights.send(ON if isDarkOutside!=0 and isSleepMode==0 else OFF)

I looked at my logs and saw the following:

2019-06-14 00:07:11.985 - bedroom wallmote: 1.0
# that was me turning on my bedside light

2019-06-14 00:07:12.996 - bedroom wallmote: 2.0
# that was me turning off my living room ceiling lights

2019-06-14 05:17:26.087 - wallmote: 1.0
# that was me trying to turn on my workstation monitors

2019-06-14 05:17:35.193 - wallmote: 1.0
# that was me trying again to turn on the workstation monitors since the z-wave switch didn’t react the first time…

2019-06-14 05:32:00.041 - dark: 1 sleep: 0
# this was civilDawn today, when the night lights should have turned off.

2019-06-14 05:46:58.872 - WashingMachineFrontLoaderWatts entering washing cycle
# someone is doing laundry

It looks to me like the “Night Lights” rule was correctly triggered by Item DarkOutside received update, but when we got the state of itemDarkOutside, it still reported ON.

What is the correct way of getting the state of an item that has triggered a rule?

Also, even if the getting the item state the way I did it is wrong, it would probably make things easier for everyone who scripts openHAB in the future if the underlying implementation could be modified (some may say bug fixed) to make the trigger consistently happen after the item states have been updated. As it is now, it behaves as if there’s a race condition in the code, and even if it can be worked around by more scripting, it might be better if it just didn’t happen at all.

Let me also please emphasize that I didspend a good hour googling and searching the forum trying to find a better looking way to do it, before implementing in this fashion to begin with. Frantic searching gets more difficult over time though as more and more related links end up showing as already visited and thus it becomes difficult to know whether I visited a page with regards to the current problem at hand or a previous problem. I was not able to find a better answer, but that doesn’t mean there isn’t one, of course.

“Implicit Variables” seemed to offer a clue, but I never found good documentation about those… until now, as I’m writing this post, of course. This post seems to be it: forum post
The “But how do I…?” page has many good tidbits of information, like event.itemState but it doesn’t say what methods are available on the received object! Different items (numbers / switches / strings) have very different accessor functions, and it would be great to have examples for how to both update all of those item types as well as how to write “received update” rules… I feel this is a highly fundamental part of OpenHAB and could probably use more documentation. I do understand that this is an open source project and as such written by volunteers, and documentation is not the fun part of development, to be sure…

This really feels like a trap, though. I just found another way to do it:

@rule("MyButtonRule", description="", tags=["utility"])
@when("Item MyButton received update")
def MyButtonRule(event):

	itemstate = items["MyButton"]
	logprint("itemstate is " + str(itemstate))

…and this way ALSO doesn’t work! Here’s what the log says when I toggle the MyButton switch item on and off.

2019-06-14 09:39:41.971 [INFO ] itemstate is ON
2019-06-14 09:39:42.255 [INFO ] itemstate is OFF
2019-06-14 09:39:42.618 [INFO ] itemstate is ON
2019-06-14 09:39:42.957 [INFO ] itemstate is OFF
2019-06-14 09:39:43.804 [INFO ] itemstate is ON
2019-06-14 09:39:44.244 [INFO ] itemstate is OFF
2019-06-14 09:39:44.764 [INFO ] itemstate is ON
2019-06-14 09:39:45.138 [INFO ] itemstate is OFF
2019-06-14 09:39:45.712 [INFO ] itemstate is ON
2019-06-14 09:39:46.244 [INFO ] itemstate is OFF
2019-06-14 09:39:47.022 [INFO ] itemstate is ON
2019-06-14 09:39:47.693 [INFO ] itemstate is OFF
2019-06-14 09:39:48.168 [INFO ] itemstate is ON
2019-06-14 09:39:48.717 [INFO ] itemstate is OFF
2019-06-14 09:39:51.887 [INFO ] itemstate is ON
2019-06-14 09:39:53.163 [INFO ] itemstate is OFF
2019-06-14 09:39:54.608 [INFO ] itemstate is OFF
2019-06-14 09:39:55.159 [INFO ] itemstate is OFF
2019-06-14 09:39:55.996 [INFO ] itemstate is ON
2019-06-14 09:39:57.596 [INFO ] itemstate is ON
2019-06-14 09:39:59.669 [INFO ] itemstate is ON
2019-06-14 09:40:00.372 [INFO ] itemstate is OFF

Seriously. This CAN’T just be me, can it? Is this a really obvious bug? What am I missing?
I’m about to put a freaking sleep statement in at the top of all my events just so I can get something done.

Thank you for reading! I’m very impressed by this forum so far, I’ve gotten excellent answers every time, from one person in particular but nonetheless!

///Leif

leif · June 14, 2019, 3:35am

Okay, I believe I have the answer now.
The answer is:

event.itemState

It works the same way for switches, numbers AND strings, and it’s reliable.

Now, it took a lot of frustration and time to find this – i don’t know why, but it did. I’m leaving the original post up there to underline that fact.

I still think the race condition could/should be fixed in the underlying code. I’m sure it will save plenty of people plenty of frustration down the line. Why not just simply delay calling the triggers until the values have been updated? We are not in that much of a hurry, sheesh!

Best regards,
///Leif

leif · June 14, 2019, 4:04am

…and, that’s not quite the answer, because what do you do if there’s more than one trigger on a rule?

This cannot possibly be the right answer, I hope:

@rule("MyButtonRule", description="", tags=["utility"])
@when("Item MyButton received update")
@when("Item Test_Switch_1 received update")

def MyButtonRule(event):

	stateMyButton=event.itemState if event.itemName=="MyButton" else ir.getItem("MyButton").getState()
	stateTestSwitch=event.itemState if event.itemName=="Test_Switch_1" else ir.getItem("Test_Switch_1").getState()

	logprint("MyButton=" + str(stateMyButton) +"    Test_Switch_1=" + str(stateTestSwitch) )

This works, but come on… it’s atrocious.

rossko57 · June 14, 2019, 10:26am

I don’t do Jython, so I can’t see the detail of your rules.
But I can share experience from DSL rules - it’s the same event-driven system after all.

It’s important to distinguish update from changed. It is possible to update an Item to the same value. This is perfectly normal, e.g. the state could be coming from a device periodically reporting or polled.
(If it does change, you get both changed and updated events)
Be careful what rule triggers you choose.

This may be the explanation behind your myButton example as well; without knowing what the button is or what the technology/binding it is associated with might be doing.

matt1 · June 14, 2019, 10:55am

Don’t forget the third one which is command.
It is covered at this link under event-based triggers.

rossko57 · June 14, 2019, 11:00am

Absolutely. That’s usually the first hole people fall into - “I trigger on command but the state hasn’t changed!”. Glib answer - why should it? Long answer - it probably will in a while, when the binding has actioned it and the device responded. If you’re interested in the state, trigger from the state.

leif · June 14, 2019, 12:13pm

MyButton was just a switch which I toggled from the basic UI, defined like so:

Switch MyButton 	"My Button" 	<lightbulb> (Bench)

Actually I do understand the difference between changed and update, but in this case I sat there toggling the switch back and forth by clicking on it in the basic UI. The rule fires every time. Usually the item state has updated when rule runs, but sometimes not, and that’s when we see duplicate states in the log (OFF, ON, OFF, ON, ON,ON,OFF).

Speaking from my experience as a programmer, it looks to me like a typical race condition. If the core code waited until it has updated the item state before it fires the rule, I don’t believe this behavior would be possible.

rossko57 · June 14, 2019, 12:36pm

The race I think you are looking at with your UI switch is I think with the UI.
It takes time for the UI to send a command (NOT an update) to OH. Then it takes time for OH autoupdate feature (which I think you rely on here) to process the command into a state update.
Now at last your rule can trigger.
Meantime, the new state has not yet arrived back at the UI. The UI switch is still at the old state, and clicking it again will produce a duplicate command and eventually a no-change update.

Check your events log for duplicate commands that explain duplicate updates.

leif · June 14, 2019, 1:31pm

Thank you for replying, rossko57!

I was going to dismiss the theory outright because I first saw the problem on a “received update” rule on an item that was updated by postUpdate in a rule triggered by astro, not the GUI.

But, I figured you might be right, so I decided to test it thoroughly, and the test results shows that it is not in the gui.

Here’s the test setup:

Items:

Switch RaceGuiSwitch 	"Race Condition Gui Item" 	<lightbulb> (Bench)
Switch RaceItemSwitch    "Race Condition Item"

racecondition.py

from core.rules import rule
from core.triggers import when

from core.log import logging, LOG_PREFIX
log = logging.getLogger(LOG_PREFIX + ".racetest")

def logprint(text):
	log.debug(text)

@rule("RaceGuiSwitch", description="", tags=["utility"])
@when("Item RaceGuiSwitch changed from OFF to ON")
def RaceGuiSwitch(event):
	events.postUpdate(ir.getItem("RaceItemSwitch"),ON)

@rule("RaceGuiSwitch", description="", tags=["utility"])
@when("Item RaceGuiSwitch changed from ON to OFF")
def RaceGuiSwitch(event):
	events.postUpdate(ir.getItem("RaceItemSwitch"),OFF)


@rule("RaceItemSwitch", description="", tags=["utility"])
@when("Item RaceItemSwitch received update")
def RaceItemSwitch(event):
	logprint("ir.getItem().getState(): " + str(ir.getItem("RaceItemSwitch").getState()) + "    event.itemState:" + str(event.itemState))

Here’s the console output:

2019-06-14 20:06:13.706 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF
2019-06-14 20:06:14.990 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:ON
2019-06-14 20:06:15.685 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF
2019-06-14 20:06:16.500 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): ON    event.itemState:ON
2019-06-14 20:06:16.992 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF
2019-06-14 20:06:17.446 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:ON
2019-06-14 20:06:17.743 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF

The discrepancy between getItem().getState() and event.itemState could not have come from the GUI because the RaceItemSwitch is decoupled from RaceGuiSwitch by two rules: “changed from ON to OFF” and “changed from OFF to ON”.

I also implemented the test rule in Rules DSL, as follows:

rule "test race condition"
when
	Item RaceItemSwitch received update
then
	logInfo("rules DSL","item state: " + RaceItemSwitch.state)
	
end

pi@openhab:~ $ tail -f /var/log/openhab2/openhab.log | grep jython
2019-06-14 20:18:50.457 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): ON    event.itemState:ON
2019-06-14 20:18:50.658 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF
2019-06-14 20:18:50.891 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:ON
2019-06-14 20:18:51.071 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): OFF    event.itemState:OFF
2019-06-14 20:18:51.263 [DEBUG] [jsr223.jython.racetest              ] - ir.getItem().getState(): ON    event.itemState:ON
^C
pi@openhab:~ $ tail -f /var/log/openhab2/openhab.log | grep DSL
2019-06-14 20:18:50.471 [INFO ] [pse.smarthome.model.script.rules DSL] - item state: ON
2019-06-14 20:18:50.670 [INFO ] [pse.smarthome.model.script.rules DSL] - item state: OFF
2019-06-14 20:18:50.916 [INFO ] [pse.smarthome.model.script.rules DSL] - item state: ON
2019-06-14 20:18:51.085 [INFO ] [pse.smarthome.model.script.rules DSL] - item state: OFF
2019-06-14 20:18:51.273 [INFO ] [pse.smarthome.model.script.rules DSL] - item state: ON
^C

Note how it does NOT show any problem! I spammed the GUI button for several minutes, could not trip it up. So this problem looks to be inherent to JSR223, or perhaps the next gen rule engine, or perhaps Jython. That makes more sense – a bug of this magnitude couldn’t remain if it affected the majority of users and uses.

Now, the reason I’m not using Rules DSL is that as I read through the documentation, I came across some (to my eyes as a programmer) ugly workarounds around the lack of being able to define functions, and then in this document i came across this:

Upon reading this, and seeing as I was dumping my $600+ Fibaro Home Center 2 and switching to Openhab mostly because of annoying scripting limitations that make anything beyond the simplest of systems a maintenance nightmare, I realized that Rules DSL would indeed NOT be for me, and that perhaps it was time to learn a new language. And that is how after 20 years of C++ experience, I started learning Python!

So anyway. Could anyone else help confirm this race condition in JSR223/Python rule handling?

rossko57 · June 14, 2019, 1:52pm

Well, I’ve no idea what that does or how long winded about it it is. It remains possible here for the “real” Item state to have changed after the rule was triggered, while the “event” version is captured forever. It seems pretty unlikely as this is early on in the sequence.

Maybe we’re looking at it wrong, and it is actually slow performance of rule that gives a discrepancy. With button presses half a second apart it would have to be dismal.

As before, side by side comparison with events log timestamps would aid understanding.

leif · June 14, 2019, 2:46pm

JSR223/Python:

ir.getItem("RaceItemSwitch").getState()

…is supposed to be the exact equivalent of
Rules DSL:

RaceItemSwitch.state

Yes, the event.itemState version works, and is usually usable. That’s no problem, it’s great that it works. The problem is that getItem().getState doesn’t work reliably, and it’s not obvious that it doesn’t work well until you start relying on it. It complicates things for new users for no good reason.

When you have a rule triggered by two item states, which makes decisions based on both item states, then event.itemState doesn’t cut it anymore, because you still have to read the state of the item that didn’t change… but the one that did change must be read from event.itemState because reading the item state directly is not reliable.

And that’s how you end up with this atrocity:

	stateMyButton=event.itemState if event.itemName=="MyButton" else ir.getItem("MyButton").getState()
	stateTestSwitch=event.itemState if event.itemName=="Test_Switch_1" else ir.getItem("Test_Switch_1").getState()

The rules seem to be running instantly. If anything they’re running too quickly – they’re executing so quickly that the item state hasn’t had time to update yet! Adding a short delay statement before reading the item statement band-aids the problem (it behaves properly) but at the expense of performance.

rlkoshak · June 14, 2019, 3:11pm

I think this might be a case where understanding how OH works under the covers might explain the behavior. This might not be something that has been seen or reported before as perhaps other things cause Rule triggers to slow down enough that there isn’t such a race condition. Note: I’m taking your word for it that there is a race condition here. I haven’t looked at the posted code that closely to confirm.

I’m hoping that @5iver, our foremost Jython and JSR223 expert will chime in. If this is indeed a problem with Jython, as the evidence seems to indicate (whether or not it’s Jython’s fault) we will have more people run into this problem. Also, rossko57 and I might be off base and their might be some detail we don’t know about that would make this behavior.

OK, now for a deepish dive into OH. OH is driven by an EventBus. Events like updates, commands, and changes get published to this event bus. All other parts of OH operate off of this bus. That means that when an update gets published to th

e bus every part of the system that cares about that event pick it up and start processing it in parallel. This means that the Rule get’s triggered at the same time that the Item Registry (I’m guessing which part actually changes the Item) picks up the event to update the Item which happens at the same time that Persistence picks up the event to store it into the database, and so on.

Given this is all occurring at the same time, strictly speaking you should never rely on anything else to have finished before the Rule get’s triggered. So, as you have found, if you want guaranteed to get the state/command that triggered the Rule then you need to get that from the event, not from the Items. NOTE: changed is a little different because it is the Item registry that generates the event so you know the Item has entered the new state before the Rule triggers. But even here there is no guarantee that the Item hasn’t changed state again before your Rule triggers.

In practice, the timing usually works out that you can rely on the Item’s state for update and even command triggered Rules. But not all the time, especially for command triggered Rules. But every now and then, especially on fast machines, you discover cases where the timing breaks down.

NOTE: I’m basing the above on years of experience using OH, helping users on the forum, and experience I’ve had coding similar event based systems. I have not looked at the OH code so I may be wrong.

But if I am correct on this, it means that it is a fundamental limitation of the architecture and not something that can easily be fixed. It probably isn’t a problem with Rules DSL because Rules DSL is s-l-o-w. It takes long enough for the Rule to kick off that the Item registry has already processed the update as necessary.

It is the functional equivalent. There is no such thing as an exact equivalent when you are dealing with to such drastically different programming languages with vastly different structure, initialization, etc. In this case, when the Rules DSL Rule runs, the ir.getItem("RaceItemSwitch") part is executed before the Rule triggers (if I understand correctly).

There is no solution to this problem that won’t be at the expense of performance. But in a home automation context, so what? If 100 msec or so makes enough of a difference to become a problem then OH isn’t the tool for you and you should be using a real time system (or moving the operation to the device itself).

leif · June 14, 2019, 4:45pm

Thanks for all your tutorials and design patterns, @rlkoshak! Pleased to make your acquaintance.

Interesting. It could indeed be that Rules DSL is slow enough that it’s a not an issue.

Interesting about the EventBus too, that makes sense. Although, there are a couple of ways to solve the issue.

One way, which would be a pretty big change, would be if the eventbus listeners had priorities, so that the ItemRegistry got the update synchronously, first, and then every other listener got it. Without knowing anything about the inner workings of OH, I’m rather audacious to even suggest it. It is a viable and pretty obvious programming model (i’ve done similar things myself in event-based systems) – it could even be that OH already has it and the JSR223 binding forgot to flag itself as lower priority than the item registry? Wishful thinking with no factual basis.

Here’s a possible safe way to solve the problem in the JSR223 layer, though.
What if “ir” when referenced within a rule got you a subclassed version of the itemregistry, where getItem() instead returns a subclassed version of the item, which in turn returns event.itemState if available in the current context, but otherwise passes the request through to the real item registry like it normally does?

You are 100% correct, please forgive my hyperbole! I meant functional.

Indeed a 100 millisecond delay would be inconsequential. But, couldn’t it be in the back-end so that it “just works” without one having to worry about it when writing scripts?

I like the subclassing idea better though, that would probably be transparent, and appear to “just work”.

My self-proclaimed atrocity above wouldn’t be that if it were in the core code… then it just becomes business logic. It’s when you have to work around low-level issues like synchronization in high-level scripting that things get ugly, in my humble opinion.

ricadelic · June 14, 2019, 10:04pm

I’ve also seen race conditions with Jython. Especially it seems the item registry is loaded once the rule is run. When updating an item, it seems the state of the item isn’t updated until the script is released.

I’m always using the items.item_name wait to find out the state of the item, and finding out which item triggered the rule by checking the events object. I only use ir.getItem(“MyButton”) when needed. you should be able to use items.mybutton which returns the state of the item.

To make sure unexpected things don’t happen, i usually create an initialize part of a script that checks the state of all the relevant items. If items end in unexpected states, a rule is not executed, but rescheduled by a timer.

I do agree we should be able to fix this in the JSR223 Jython helper code.

rossko57 · June 14, 2019, 10:37pm

It may well not compare, but in DSL
someItem.postUpdate(newState)
does not, and is not intended to, have immediate effect. It puts the update request onto openHABs bus is all.
The actual update occurs asynchronously. It is not a variable assignment, it is a request for action.
For example, this will almost always “fail”
someItem.postUpdate(newState)
logInfo("test", "Bet you this is the old state " + someItem.state.toString)

It’s not a bug and it’s not a programming handicap, the rule does after all know what the shortly expected newState is.
Certainly confuses authors until they grasp the event bus idea.

mjcumming · June 14, 2019, 11:17pm

Rick have you opened an issue on github?

leif · June 15, 2019, 12:16am

That is the same in Jython.
It’s also the same in windows application (WIN32 API) programming. SendMessage() waits for a response, PostMessage() returns immediately. Makes perfect sense to me.

leif · June 15, 2019, 1:04am

THIS intrigued me – could you give me an example?

CrazyIvan359 · June 16, 2019, 2:23am

@leif the race condition you are experiencing is not a bug, it is a result of the parallel nature of openHAB, as @rlkoshak mentioned. The reason you are unable to duplicate it is because, as you mentioned, the DSL rules are slower to run than JSR223 rules, combined with your platform. My openHAB is running on an old desktop and I could duplicate your race example in the DSL if I were to set it up.

The solution, from experience, is to use event.itemState to get the new state the item will have instead of getting it from the registry. You should also be using the equivalent in the DSL if you need to make rules that trigger on received update to prevent race conditions. I had the same confusing race conditions when I was first starting out with openHAB and the DSL.

You may want to check out the new documentation for the Helper Libraries. We just finished the majority of it, finishing touches to come. You should look at the Event Object Attributes page for details about what values you get for each of the trigger types.

Feel free to reach out to me if you have any other issues. Rich, feel free to tag me as well as Scott if you need an assist on JSR223 stuff.

leif · June 16, 2019, 3:41am

Thanks, Michael. Are you saying this is the correct solution to acting on multiple items in one rule?

leif:

	stateMyButton=event.itemState if event.itemName=="MyButton" else ir.getItem("MyButton").getState()
	stateTestSwitch=event.itemState if event.itemName=="Test_Switch_1" else ir.getItem("Test_Switch_1").getState()

If that has to be done in every rule that acts on multiple items (not an uncommon thing), wouldn’t it be better if it was in the back-end code so we wouldn’t have to script it every time?

I do understand the “it has always worked this way”, but more and more people use openHAB every day, and it’s always evolving, so I’m asking, is there a way to make this better? Would there be any drawbacks to eliminating this race condition and making the behavior consistent?
It’s hard to imagine that there could be any script relying on the race condition to function properly… short of using it is a 1-bit random number generator?

As a product developer, I’m often on the other side of this equation. Often, “by design” is the right answer. But, there are times when I’m explaining a tricky concept to a customer (or the support department) when I realize that it would be easier and save everybody a lot of future grief if I just made it better, rather than explain the complicated current procedure. I’m wondering if this is one of those times.

New documentation looks great by the way! Bookmarked, will read.