[SOLVED] Rule 'Extractor Fan sensor changed': null - another case

CDriver · December 5, 2019, 7:22pm

That’s me trying to break it by setting the group to 23 and 24. The error only seems to creep in when both items are change very close to each other in time. It will, in that instance, kick off two rules; one for each item change.
I have my rules thread set to 30; is there a way I can log, at time of running, how many threads are in play? Even if it’s a different log and I can correlate them.

@rlkoshak
I’ll pepper it with more hereiam logging and see if it makes it clearer. I do not recall seeing the output of logWarn(“loggerName”, triggeringItem.toString) when it does fail so it’s like it does not actually get around to executing. I may just have missed that though - will double check.

The thing that confuses me is why can I not get it to fail when I update one item at a time? If it was because of working with a null item or something like that (I think I have taken care of all of them though?)

Come to think of it, I don’t think this ever failed when I was only monitoring the temperature item. It started after I put humidity in the mix. While not necessarily a problem with the humidity item per se, but rather that there is a rapid “double-execution”.

As part two, I will remove the humidity item for a few days and see what that does.

C

rlkoshak · December 5, 2019, 7:42pm

Or it’s the call to triggeringItem.toString that failed for some reason. You need to use static Strings to log especially at the beginning and end of the Rule to be sure that the error isn’t caused by one of the log statements themselves.

That shouldn’t matter. The Rule isn’t trying to use something outside of itself (e.g. a global variable) and thus is wholly self contained. I don’t think that Rules are not thread safe so you should be able to have two or more running at the same time without stomping on each other. It could be the case that the sendCommand from each instance of the Rule get’s processed out of order, but there shouldn’t be a null error because you have two instances running at the same time.

rossko57 · December 5, 2019, 8:56pm

I would guess your rule blows up on this line or one of its sisters. That’s guessing that one or more of that group is getting updated somewhere in your process.

Without ever quite pinning it down, there seems to be the odd issue accessing a Group membership while that group is dealing with an update. Maybe the group update (an iterative process) locks rules out of an Item briefly, but you seem to get the occasional null result.
Casting null as SwitchItem is going to fail before you get to the if() test later, so those won’t help here.

I suspect it might just work if you lose the casting.
Or maybe this casting method would do it, not sure

val itemFan = g_ExtractorFanProxy.members.findFirst[GenericItem item | item.name == location + "_ExtractorFan"]

CDriver · December 5, 2019, 9:57pm

I’ve long suspected / guessed this is the case. I have had a number of posts on here where it looks like it could be something like and internal deadlock. There have been some pretty strong resistance to that idea. Maybe not a deadlock per se but it is concurrency.

I just wish I could create a case that is recreatable so the devs can see.

Maybe it’s just time to switch to jython you’re banging on about

edit:
I’ve also added the GenericItem thing. will see if that makes a difference too.

rossko57 · December 5, 2019, 10:09pm

If it is a timing sensitive issue, then “reproducible” on your system isn’t going to be reproducible in some other environment.

I’d imagine there a number of members of g_ExtractorFanProxy, that probably boosts the odds by extending the churning time of any member update.
Does this group have a type and/or an aggregation function? It would be surprising if much happened to it, if it did not.

Not me banging on - DSL does all I need,
If as we suspect this is something to do with core group handling, I do not think the rules language will matter. It might fail more gracefully of course.

rlkoshak · December 6, 2019, 2:53am

I forgot about that problem. Indeed it could still be an issue.

There are ways to avoid it though.

use the Image registry to get the Item instead of filtering on a group. See Design Pattern: Associated Items for an example.
use changed triggers instead of received update where possible
try to avoid having a rule that triggers really fast if you can help it, sometimes this requires totally restructuring your approach though.

Actually in Python at least, you really need to access the Group in the way. For example, in the Rule above, you’d just pull the Item states it of items[“itemName”] and never touch the Group as it’s changing. So there is no access to the Group’s members while it’s changing through the Group Item and no error.

CDriver · December 6, 2019, 8:25am

As the dev always says “huh. It didn’t do that on my machine”

Actually, only 4 items, two pairs of two (temperature and humidity)

Nope.
My test one, which can make it fail, is this:

Group tmpf

Normal one is this:

Group:Number    g_ExtractorFanTriggers  "Group of sensors to trigger extractor fans"

I’ll change it to the registry option now and see how that behaves. I stopped using it because of the exception it throws if an item is not found. I found that “untidy”.

Let’s see what the logs say this morning…

CDriver · December 6, 2019, 8:50am

Well, no failures overnight. I removed the humidity items from the triggering group and no issues.
I’ve now put them back in to see if it returns.

Unrelated to these items but used in the group, I have removed the two “max” items and hardcoded them.
Is my thinking correct here? By the time the rule fires, the binding has finished updating the two items. Therefore, is it unlikely it’s a deadlock type of issue? However, now that both rules are running, they’re both searching through the group to find the associated items. Perhaps the issue is only on the group search? I’d expect to see a lot more complaints in the forum if it were this.

rossko57 · December 6, 2019, 10:22am

This dubious. Item updates are individual events. If you trigger rule on either one, the second one can update part way through the first rule. And trigger a second parallel run, of course.

Choosing rule triggers carefully can be very important; see Rich’s point about changed triggers.

CDriver · December 6, 2019, 10:46am

Fair enough. It’s a stretch

Anyway, it still fails. So, having both the humidity and temp items in, both of which are in the triggering group, the errors return. Having only one means no errors.
Going to change now to use the registry and see if that makes a difference.

CDriver · December 6, 2019, 9:29pm

Well, it’s pretty confirmed then
I change it to look up from the registry and not a single error all day.

So, [on my system] a group with two items, both being updated (either by binding ot push from another group) and a 3rd item being set will cause an error when you are searching the group for the item. Using the registry will not cause an error.

Now, to recreate this and give a working example

Thanks for the help folks!

CDriver · December 6, 2019, 9:57pm

and here is something that can recreate it

Items

Group:Number g_SearchGroup
Group:Number g_TriggerGroup
Group:Number g_FakePushGroup


Number Item1    (g_SearchGroup, g_TriggerGroup, g_FakePushGroup)
Number Item2    (g_SearchGroup, g_TriggerGroup, g_FakePushGroup)
Switch Switch1  (g_SearchGroup)

rule:


import org.eclipse.smarthome.model.script.ScriptServiceUtil

rule "FFGG"
when Member of g_TriggerGroup changed 
then
    logWarn("PPP", triggeringItem.name + " has changed and is firing")



    // //option 1
    val itemSwitch          = g_SearchGroup.members.findFirst[SwitchItem item | item.name == "Switch1"] as SwitchItem
    val itemItem1           = g_SearchGroup.members.findFirst[NumberItem item | item.name == "Item1"] as NumberItem
    val itemItem2           = g_SearchGroup.members.findFirst[NumberItem item | item.name == "Item2"] as NumberItem


    //option 2
//    val itemSwitch          = ScriptServiceUtil.getItemRegistry.getItem("Switch1")
//    val itemItem1           = ScriptServiceUtil.getItemRegistry.getItem("Item1")
//    val itemItem2           = ScriptServiceUtil.getItemRegistry.getItem("Item2")


    if (itemSwitch === null)        {logWarn("ItemSwitch is null Cannot continue.");return;}
    if (itemItem1 === null)        {logWarn("itemItem1 is null Cannot continue.");return;}
    if (itemItem2 === null)        {logWarn("itemItem2 is null Cannot continue.");return;}


    val Number val1         = (if (itemItem1.state == NULL) 80     else itemItem1.state as Number)
    val Number val2         = (if (itemItem2.state == NULL) 26  else itemItem2.state as Number)

    logWarn("PPP", "val1:" + val1.toString + " val2:" + val2.toString + " Switch:" + itemSwitch.state)

    if (
        (
                val1 > 25 
            ||  val2 > 30
        )
        && itemSwitch.state != ON){
        logWarn("PPP", "Turning on switch")
        itemSwitch.sendCommand(ON)
        return;
    }

    if (
            (
                    val1 <= 25
                &&  val2 < 50
            ) && itemSwitch.state == ON){
        logWarn("PPP", "Turning off switch")
        itemSwitch.sendCommand(OFF)
        return;
    }

end 

end

If you have two consoles open - one with log:tail and the other rapidly (as fast as my fingers can go - maybe two changes a second?) changing the fake group value

You will eventually (after less than ten goes!) see this:

If you comment out option 1 and put option 2 in, you cannot make it fail.
It would be interesting if this example fails for others too?

C

cweitkamp · December 7, 2019, 10:09am

Hi @CDriver,

Thank you for reporting this issue. I am not sure if we can provide a proper solution in the framework for this. Why? Because we are working in a threaded environment. Your group g_SearchGroup is a global entity. A rule will be executed in a separate thread. Meaning each time you try to access a property of this group - in your case members - in a rule (thread A) it may have been touched in the meantime by another thread B. Thus a rule should be implemented thread safe too.

For your example I would suggest to store the members of the group in a local variable first before applying findFirst on them:

...
    // //option 3
    val localMembers = g_SearchGroup.members
    val itemSwitch          = localMembers.findFirst[SwitchItem item | item.name == "Switch1"] as SwitchItem
    val itemItem1           = localMembers.findFirst[NumberItem item | item.name == "Item1"] as NumberItem
    val itemItem2           = localMembers.findFirst[NumberItem item | item.name == "Item2"] as NumberItem
...

rlkoshak · December 7, 2019, 5:03pm

Not knowing the internals if how OH does this, but is OH using a thread safe Collection to store the Group members. What’s confusing to me is that the error looks a little like what you get when your are iterating over a collection and some other thread modified the collection. So would a CopyOnWriteArrayList or something like that be a possible solution?

I don’t think that will work because localMambers does not get a copy of the list, it will point to the same object in memory that .members points to. They would have to create a copy of the List, but that will run into the same concurrency problem as he iterates over the list to make the copy.

Unless something else is going on, such as a brand new List of members being created when the Group processed an update of one of it’s members.

Not having looked at the code I may be way off base, but I’ve had to dealt with similar issues in the past on stuff I have coded myself in this way.

cweitkamp · December 7, 2019, 5:13pm

Something like that. The GroupItem uses a CopyOnWriteArrayList for managing its members internally. Thus it is thread safe inside the framework. Once - and every time - you access the members (e.g. getMembers) a complete new copy of the members list - not the members itself - will be created and returned as immutal set (LinkedHashSet). Have a look here for details. A local variable is not only thread safe but additionally faster and saves a little bit memory.

rlkoshak · December 7, 2019, 5:23pm

So would it be possible to change .members to do this for us so that this internal implementation detail need not be exposed to the rules user? I think it would be a great boon to all users as this has been an issue that crops up from time to time since I’ve been using OH.

At a minimum, I need to add this to the docs and change all of my DPs and tutorials and we will have lots of examples in this forum that will never get updated. If .members were refactored with a method that returned the copy instead of pointing at the internal data member that would fix it for everyone.

rossko57 · December 8, 2019, 9:00pm

Woah, this is something of a revelation.

cweitkamp · December 9, 2019, 8:58pm

I am afraid I did not get your point. Or we misunderstood each other. What I explained in my post above reflects the current implementation …

CDriver · December 9, 2019, 9:16pm

Well, I’m glad we’ve got to a point where this is now a known thing. Be it good or “bad” - I’ve had a number of rules misbehave which I am sure can be pinned to this. My cron-for-motion one had the exact same logic in it and too failed randomly. Never could get to the bottom of it.

I will try option 3 for clarification and report back.

Question though, you option 3 will effectively give me a read-only copy of the list? I won’t be able to update it. Or, is it a by-ref copy (not sure what the Java equivalent is?)

Should this, as Rich is saying, not be handled internally? Thinking from a data point of view (I’m a database guy by trade) it can become complicated quickly. But it does seem though that the we’ll always have this gotcha lurking?

Cheers

rossko57 · December 9, 2019, 9:18pm

Why would you want to edit a Group’s membership list in this kind of rule?