[SOLVED] Error: Rule '<rulename>': null

It’s all above, quoted again. First excerpt is openhab.log, 2nd is events.log

cat /var/log/openhab2/openhab.log | grep 2018-06-10\ 10:23
cat /var/log/openhab2/events.log | grep 2018-06-10\ 10:23 (result is cut, two messages before and one after)

OK, that is what I was trying to confirm… there were no log entries that actually came from your rule to determine which line caused the exception. See if you get some logging with this…

rule "calc diff temp"
when 
    Member of gActTemp changed
then
    logInfo("calc diff temp", "{}: state changed to {}",triggeringItem.name,triggeringItem.state)
    val setitem = gSetTemp.members.findFirst[ i | i.name == actitem.name.split("_").get(0) + "_SETTEMP" ]
    if (setitem === null) {
        logInfo("calc diff temp", "{}: Could not find setitem",triggeringItem.name)
    }
    else {
        logInfo("calc diff temp", "{}: setitem.name=[{}]",triggeringItem.name,setitem.name)
        if (setitem.state == NULL) {
            logInfo("calc diff temp", "{}: setitem state is NULL {}",triggeringItem.name,setitem.name)
        }
        else {
            logInfo("calc diff temp", "{}: setitem.state=[{}]",triggeringItem.name,setitem.state)
            val diffitem = gDiffTemp.members.findFirst[ i | i.name == actitem.name.split("_").get(0) + "_DIFFTEMP" ]
            if (diffitem === null) {
                logInfo("calc diff temp", "{}: Could not find diffitem",triggeringItem.name)
            }
            else {
                logInfo("calc diff temp", "{}: diffitem.name=[{}]",triggeringItem.name,diffitem.name)
                if (diffitem.state == NULL) {
                    logInfo("calc diff temp", "{}: diffitem state is NULL {}",triggeringItem.name,diffitem.name)
                }
                else {
                    logInfo("calc diff temp", "{}: diffitem.state=[{}]",triggeringItem.name,diffitem.state)
                    var Number diff = (triggeringItem.state as DecimalType) - (setitem.state as DecimalType)
                    logInfo("calc diff temp", "{}: diff=[{}]",triggeringItem.name,diff)
                    diffitem.postUpdate(diff)
                }
            }
        }
    }
end

The failing line of code for me is:

val isHomeItem = gIsHome.members.findFirst[ t|t.name == isHomeName ]

and all other .findFirst / .filter lines where you are getting the item from the group. My null checks are not catching this but the logs immediately above these lines are working and the ones immediately below are not.

OK, this is some progress.

First, log/test everything on that line.

  • what is gIsHome? logInfo("test", gIsHome)
  • how big is the list of members? logInfo("test", gIsHome.members.size)
  • what is isHomeName? logInfo("test", isHomeName)

Do these all make sense?

I thought a bit about my case:

  • the rule is called more than once
  • the only items in common of the two rule instances are the groups (gDiffTemp, gActTemp, gSetTemp). All other items are completely disjunct.
  • the only change of item states is the calculation of the new diff temp. This is secured by the guards before, so the result and the inputs can not be NULL
  • the group gActTemp is (hopefully) changed before the first line of the rule
  • the group gDiffTemp is changed after the last line of the rule
  • the group gSetTemp does not change at all
  • i think an item state change operation is an atomic operation, otherwise this error would happen more often
  • the exception happens with the 2nd instance (ArWT), between the change of gActTemp and gDiffTemp by the first instance (F2WT), according to timestamps.

My guess would be, the rule engine (instance ArWT) cannot access a group (gDiffTemp or gActTemp), while a group member is being changed (by instance F2WT) or while the group average is being calculated by openHAB.

This is my combined (openhab.log+events.log) log put into sequence by time:

2018-06-10 10:23:30.890 [vent.ItemStateChangedEvent] - LM3_Spannung changed from 228.40 to 228.80
2018-06-10 10:23:35.193 [vent.ItemStateChangedEvent] - F2WT_RSSI changed from -74 to -72
2018-06-10 10:23:35.282 [vent.ItemStateChangedEvent] - F2WT_ACTUALTEMP changed from 24.20 to 24.30
2018-06-10 10:23:35.345 [vent.ItemStateChangedEvent] - ArWT_ACTUALTEMP changed from 25.60 to 25.70
2018-06-10 10:23:35.450 [GroupItemStateChangedEvent] - gActTemp changed from 25.10 to 25.11 through F2WT_ACTUALTEMP
2018-06-10 10:23:35.630 [ERROR] [ntime.internal.engine.RuleEngineImpl] - Rule 'calc diff temp': null
2018-06-10 10:23:35.872 [GroupItemStateChangedEvent] - gDiffTemp changed from 9.29 to 9.30 through F2WT_DIFFTEMP
2018-06-10 10:23:35.878 [vent.ItemStateChangedEvent] - F2WT_DIFFTEMP changed from 3.20 to 3.30
2018-06-10 10:23:41.648 [vent.ItemStateChangedEvent] - AZWT_ACTUALTEMP changed from 25.40 to 25.50

It’s really fast but I don’t know we can make that assumption. One would have to review the Item Registry code to verify that is the case.

That might very well be the case. You might be able to test for that. Check to see if the Group is null before the findFirst line. If it is, that could be evidence that it is the problem.

Given your reasoning, the gDiffTemp would be the Group that would be changing while this Rule is running since you say gSetTemp doesn’t change. So focus on that Group.

The member of gActTemp definitely changed. But I suspect the state of gActTemp itself is still being calculated in the background when this Rule triggers. Also, the state of gDiffTemp is also potentially being recalculated from the previous run of this Rule, depending on how close together new ActTemps come in.

I think we need to gather more evidence before we file an issue, but if your guess is right, then there will likely need to be an issue filed. It will probably be a major pain to correct though so we should also look for a work-around.

This is also an assumption that we may not be able to make. We know it is not NULL at the time of checking it in the if statement. But the actual line where we use it occurs some milliseconds later and the state could have changed out from under us. It’s unlikely but if you have enough Items updating all at once it might occur, though rarely.

We can eliminate this possibility as well:

rule "calc diff temp"
when 
    Member of gActTemp changed
then
    val actitem = triggeringItem // we don't really need to cast this to a NumberItem
    val act = actitem.state
    // skip the rule if the actitem changed to NULL
    if(act == NULL)
    {
        logInfo("calc diff temp", "%s's state is NULL!", actitem.name)
        return;
    }

    val setitem = gSetTemp.members.findFirst[ i | i.name == actitem.name.split("_").get(0) + "_SETTEMP" ]
    // skip the rule if setitem or diffitem were not found in their respective Groups
    if(setitem === null)
    {
        logInfo("calc diff temp", "Could not find setitem for %s", actitem.name)
        return;
    }
    val set = setitem.state
    if(set == NULL)
    {
        logInfo("calc diff temp", "%s's state is NULL!", setitem.name)
        return;
    }

    val diffitem = gDiffTemp.members.findFirst[ i | i.name == actitem.name.split("_").get(0) + "_DIFFTEMP"]
    if(diffitem == null)
    {
        logInfo("calc diff temp", "Could not find diffitem for %s", actitem.name)
        return;
    }

    // at this point we have all three relevant Items and know their states are not NULL

    var Number diff = (act as DecimalType) - (set as DecimalType)
    diffitem.postUpdate(diff)
end

The big change is we save the state to a val and then only use the val in the Rule. This will prevent any changes taking place behind the scenes from impacting the Rule. Based on the observed behavior and your reasoning I don’t think this is the problem. But it will eliminate it as a possiblity.

I can think of a number of mitigations for the observed behavior but we need more info before I’ll know which one is most appropriate. But they include:

  • Don’t use findFirst to get the Item out of gDiffTemp and instead use postUpdate(actitem.name.split("_").get(0) + "_DIFFTEMP", diff). Since gDiffTemp is the only one that we think is causing problems, if we avoid trying to findFirst on the Group then the error should not appear.
  • Design Pattern: Gate Keeper, but it will require the members of gActTemp to receive a command, not just update.

I have tried doing that in my code but it has not helped - I think this has to do with the way we extract the item from the group.

postUpdate(actitem.name.split("_").get(0) + "_DIFFTEMP", diff)

I will try this version on mine and report back.

Hopefully we are getting somewhere!

Thanks

No, it’s a fact defined by my devices. The don’t publish the temperature in miliseconds, they publish the current temperature sometimes between 5 and 10 minutes. (IIRC roughly every 8 minutes).

I will make the change regarding the elimination of the findfirst for diffitem, this is the most promising change in my opinion. Thanks. Let’s see what happens.

Edit:
This will be difficult from now on. This error occured only twice after the rule change (on 2018-06-08). The rate went down from three times a day to once every three days. :wink:

Moving from members.findFirst to postUpdate(x,y) seems to have worked - no errors in the last 24 hours!

Having said that, I implemented this change in a fairly simple rule where I directly update the item I fetched from the group without any state checks. Most of my other rules include code that checks an item’s state before doing something - @rlkoshak, do you have any alternative methods to replace a piece of code like below?

val distanceName = userCode + "_DistanceFromHome"
val distanceItem = gDistanceFromHome.members.findFirst[ t|t.name == distanceName ]
if (distanceItem.state < 10000) {
  //do_something 
}

Thanks

val distanceItem = gDistanceFromHome.members.findFirst[ t|t.name == distanceName && t.state < 10000 ]
if (distanceItem !== null) {
    //do stuff
}

yes, thanks, but we are trying to avoid using

gDistanceFromHome.members.findFirst[x]

altogether as we suspect that is what’s causing the null errors…

Afraid not. If you need the state of the Item you found you must have a reference to the Item and the only way I know how to do that is through the members of a Group. Well, this isn’t exactly true. You could get the state of the Item by using the sendHttpGetRequest Action and the OH REST API, but that is going to be much slower and potentially introduce additional errors.

The true solution will be to figure out why we can’t access the members of a Group sometimes. Unfortunately that is going to be hard to reproduce so it might take some time.

Are you seeing the periodic error in these other Rules too? I doubt you have dozens of members of gDistanceFromHome so if our guess is right the problem should almost never occur on this Rule because the Group isn’t changing all that often.

I am seeing the error on all rules where I use members.findFirst (I was using members.filter before and I was getting the same error). I get it most frequently on one of my rules which executes every 20sec on a group with 3 members (Unifi binding updating client’s uptime), so that leads me to think it’s more about the frequency of rule execution rather than size of the group. I have left the rest of my rules unchanged and added logging so I am waiting to get the error and see what the logs say.

To be honest I expect other people are having the same problem but just have not noticed it - I only noticed it because one day I woke up to a cold shower, turns out my rule to update target temperatures on my 8 zones + hot water tank failed with the error that morning. If you get the error in a non-critical rule which executes frequently then there is no real impact, your code gets executed correctly the next time it runs.

I would love to get to the bottom of this and understand the root cause, but in the worst worst case, I can always go back to repeating the code and not using Groups at all if it means I get 100% execution!

Thanks

It really is more about how close together the Items are updating. If all three Items update at nearly the same time then our guess as to the cause of the problem would be supported. If the updates are spread out then something else is going on.

I’m sure others are seeing the error but I’m also certain that the error is quite rare. I have a dozen rules that use findFirst and filter on Groups that are constantly being updated and I’ve never seen this error.

I can’t remember if you’ve mentioned which version of OH you are running. 2.3 release? A 2.4 snapshot?

Most of the heavy users of OH frequently watch their logs for errors. Someone even wrote a Logreader Binding to get alerts when errors appear in their openhab.log. If the error were common I would have expected one of the users with a big deployment like sihui or sipvoip to have reported it already.

That sounds about right, the Unifi binding fires up every 20sec and updates all Uptime items at the same msec. My heating rule is cron based so it updates setpoints at the same time as well.

Sorry I forgot to mention, I am on 2.4 snapshot.

Let’s wait for the errors and see what the logs say. Thanks a lot for your guidance Rich!

I tend to look for outside-the-box solutions to problems. It sounds like you do have some control over the heating. Is there a way you can stagger the updates to the individual Items a little using a Thread::sleep or wait (if it is a shell script) or something like that? You don’t really have control over the Unifi binding but you do for the other.

What happens if you add a Thread::sleep(100) just before the findFirst lines for the Unifi binding Rule? If the problem is related to the Group updating in the background while the Rule is running, the sleep should give that enough time to finish before the Rule runs. It will add some latency (up to 600 msecs) to when the updates occur and when the rules execute for the last set of Items in the 30 temp updates case (only five can run at a given time).

On second thought that probably won’t work.

What if you deploy Design Pattern: Gate Keeper for that Rule? This will make the latency be even longer but what it will let us do is add a sleep after updating the member of gDiffTemp Items to give gDiffTemp a change to settle before the next Rule instance is run. You don’t need a proxy Item like described in the DP, just keep the same Rule trigger.

That’s right, at certain cron triggers I do the following:

gCHSchedule.members.forEach[ item|item.postUpdate(schedule) ]

I am guessing .forEach is a simple loop so the below would help?

gCHSchedule.members.forEach[ 
 item|item.postUpdate(schedule)
 Thread::sleep(100)]

In the beginning I had thought this was a race condition related issue so I added Thread::sleep(100) at every other line, but that did not help.

So would that be just a reentrant lock? I was thinking of experimenting with those today.

Yep. This is exactly what I’m talking about. Experiment with the sleep time to find the smallest value that works reliably.

I usually do the spacing a bit different but I think it won’t matter functionally.

gCHSchedule.members.forEach[ item |
    item.postUpdate(schedule)
    Thread::sleep(100)
]

The DP is more than just a reentrant lock. The reentrant lock provides the locking mechanism to prevent more than one instance of the Rule from actively running at the same time. But what you do inside the locked part of the Rule is also important. In this case you will want to capture the all the states that you can before the lock but not run the findFirst until after acquiring the lock. Then you would add a small Thread::sleep at the end just before unlocking to give the Item Registry a chance to finish updating the Group.

I’m always hesitant to suggest this approach because, in the 30 Item example, if they all update at the same time and you have a sleep of 100 msecs it will take 2.5 seconds before ANY other Rules will be able to run. So if you have other Rules in your system where a 2.5 second delay between the Rule trigger and the Rule execution occurs you will have only traded one problem for another.

There is a way to increase the Thread pool (I think it is one of the files in userdata/etc) so if you do go down the gatekeeper route, you will probably want to increase that from 5 to 35. It will increase the amount of required RAM significantly so that might not be a good option if you are on an RPi. You will have to experiment to find the largest number of threads in the pool that will fit in your available RAM.

You will also have to go and change it back everytime you upgrade OH.

Got it. Slapped a Thread::sleep(50) on there so let’s see how that works.

Will add the locks tonight. I don’t think I have massive groups - the largest is about 10 items, so there shouldn’t be a massive timing impact.

I’ve upped it to 10 for now. OH2 is running on a server with 32g memory so I doubt it will be a bottleneck.

3 days with the gatekeeper and no null errors on any of my rules so far! Keeping fingers crossed but this might be the solution for anyone else having the same problem. Thanks @rlkoshak!

1 Like