triggeringItem problem (concurrency?)

openHAB 2.4.0.

I have a rule that can monitor “open” items (windows) and count how many minutes it is open. However, it seems when I open 2 windows at the same time, then after a while the rule runs into an exception and it stops monitoring one window.

For each item to monitor, it has to be in the group “Timed” and there must be a helper item with the name “_Minutes” appended.

import java.util.Map

val Map<String, Timer> timers = newHashMap

rule "Timed"
when
    Member of Timed changed
then
    if(triggeringItem.state == ON || triggeringItem.state == OPEN) {
      logWarn("Timed", triggeringItem.name + " = " + triggeringItem.state.toString)
      val minutes = TimedMinutes.members.findFirst[ dt | dt.name == triggeringItem.name + "_Minutes" ]
      minutes.postUpdate(0)

      val timer = createTimer(now.plusSeconds(60), [|
        logWarn("Timed", triggeringItem.name + " Timer fired")
        val minutes = TimedMinutes.members.findFirst[ dt | dt.name == triggeringItem.name + "_Minutes" ]
        val current = minutes.state as Number
        val newValue = current + 1
        logWarn("Timed", "New value = " + newValue)
        minutes.postUpdate(newValue)
        val timer = timers.get(triggeringItem.name)
        timer.reschedule(now.plusSeconds(60))
      ])
      timers.put(triggeringItem.name, timer)

    } else {
      logWarn("Timed", triggeringItem.name + " = " + triggeringItem.state.toString)
      val timer = timers.get(triggeringItem.name)
      if(timer !== null) {
        timer.cancel()
        timers.remove(triggeringItem.name)
      }
      val minutes = TimedMinutes.members.findFirst[ dt | dt.name == triggeringItem.name + "_Minutes" ]
      minutes.postUpdate(0)
    }
end

When the problem happens, I get 2 exceptions and I can see this in the log (I have reduced the output of the exception):

2019-07-10 09:49:44.729 [WARN ] [eclipse.smarthome.model.script.Timed] - HM_FensterSchlafzimmerLinks_1_StateContact Timer fired
2019-07-10 09:49:44.729 [WARN ] [eclipse.smarthome.model.script.Timed] - HM_FensterSchlafzimmerMitte_1_StateContact Timer fired
2019-07-10 09:49:44.730 [ERROR] [org.quartz.core.JobRunShell
java.lang.IllegalStateException: null
        at org.eclipse.xtext.xbase.typesystem.references.UnboundTypeReference.doCopyInto(UnboundTypeReference.java:677) ~[?:?]

2019-07-10 09:49:44.730 [WARN ] [eclipse.smarthome.model.script.Timed] - New value = 225
2019-07-10 09:49:44.748 [ERROR] [org.quartz.core.ErrorLogger
threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.

One of the timers continued to work and the other one got the exception and so stopped monitoring the open window.

I can only imagine that perhaps triggeringItem was null for one timer in the short lambda code.

What is the definition, meaning what is supposed to happen if 2 items of a group fire at the same time when I am using “Member of abc changed”?

Any ideas? Thanks!

I run into a similar problem. I thought this was caused by a procedure I use which is not thread safe. But now it looks like you mentioned that the timer causes this problem. I think the hashmap is not thread safe.

An idea how to fix this is handled here:

I had no time to test it in my installation. Maybe this helps.

Looking at the events log. two different timers execute practically simultaneously. Both announce themselves with logWarn “fired”.
Then it looks like one gets to the next logWarn “New value”, but the other timer blows up with a null error before it gets to that point.

Of note, the map-array doesn’t get used in the timer code before “New value”.

I don’t think there is firm doctrine on this, but I believe concurrent access of Group members while updates are going on can give null results, so I’d point the finger at the .members action. Rich @rlkoshak may have a view here.

If it were me, I would put a queuing lock on the timer execution block.

But it might be worth trying the other direct registry method to retrieve members.

Be interested to know if that helps - my own code has .members.findFirst etc. with locks, be useful to know if direct registry read was more robust.

Some things I notice.

  • You don’t need to get minutes inside the timer again. You already have it because the timer will inherit everything that exists in the context up to the point where the timer’s lambda was created. That’s why you can access triggeringItem inside the Timer. When you call minutes.state a Number it goes out to the Item registry so you will get the current value.

  • I think this whole Rule would be simpler using Design Pattern: Expire Binding Based Timers.

  • I’ve never seen a doCopyInto error before but it is coming from an UnboundTypeReference. Perhaps if you specify the types of everything inside the Rule that will fix it or give us an error that is easier to figure out.

  • What of those variables does it make sense would become 255?

  • I think the two exceptions are really the same error.

What is supposed to happen is what you would expect. Two separate instances of the Rules get triggered and create two separate Timers. But I gotta say, it’s really hard to get two things to happen in a computer at exactly the same time. I find it highly unlikely that it’s physically possible for you to open two windows and/or have that event get reported to openHAB within the same millisecond. There should be at least a couple of milliseconds between the two just because of noise on the network and in the processor. The fact that the two log statements happen at exactly the same millisecond is suspicious to me.

I imagine it might be possible that Quartz can’t handle two new Timers being created at the exact same time from two different threads. That could cause some problems, but I wouldn’t expect that the two Timers could work at all, meaning that the two log statements wouldn’t have been printed.

I think there are two ways you can go about this.

  1. Preserveer debugging the problem to figure out what is happening, why, and how to work around it.

  2. Pivot and use an alternative implementation (Exipire binding timers) which will not have this same problem.

Interesting thought. I remembered incorrectly. HashMap is indeed not thread safe. I think you can use newConcurrentHashMap to get a thread safe version. If not, the link shows how to create a new ConcurrentHashMap.

But I don’t think that explains the error. If that were the case, I would expect some Timers to not appear in the HashMap when they should (e.g. get null back in the else clause) but it shouldn’t impact the actual running of the Timer’s lambda.

That is true. If you are trying to loop over or filter on a Group and the Group changes it will generate an error. But I can’t remember if it’s just a null error or something else. You can prove this is the case I think by adding a short Thread::sleep between the findFirst and the minutes.postUpdate (maybe 50 msec should do it), but like I said above, the findFirst inside the Timer lambda is unnecessary. So maybe all you need to do is eliminate that line so the Timer uses minutes that was retrieved in the Rule body.

Many thanks to all, especially @rlkoshak.

  1. I will change the rule so that it uses the Minutes from outside the lambda. Although I did not think that this is the problem. To me it seemed more like triggeringItem got lost somehow.
  2. I opened the 2 windows at about the same time. But it took 225 minutes to trigger the problem. Pure coincidence, I guess, but this did happen to me once or twice in the last months.
  3. 255 is no problem, why should it be? I had minutes go up to 800 or so.
  4. I looked at the Expire Binding Based Timers. If I understand correctly, I would have to create another item for each window and also attach it to the expire binding. Seems like a lot of boilerplate to me.
  5. HashMap should not be the problem because when the error happened, it is only reading from the HashMap. But I could of course change it to the Concurrent version to be safe with the writes also.
  6. The group members do not change, so findFirst should be OK, I think.

It was just a question to make a sanity check on the number. If 255 doesn’t make sense then that might point at the cause of the problem.

I’d trade 100 lines of Items for 15 lines of complicated Rule code any day. It’s really hard to mess up an Item definition, especially an Unbound Item. Items are also really easy to understand. If using more Items results in much simpler and easier to understand code it means a very large reduction in the overall complexity of your system’s configuration. Less complexity and easier to understand means easier to debug and easier to maintain. It also usually means shorter and less prone to the types of errors identified above.

Is “boilerplate” really so bad that you’d rather have longer, more complex, and more brittle Rules?

The problem is when any member of the Group change’s it’s state. That is absolutely happening here.

For completeness here is what the Expire Binding Timer Rules would look like.

import org.eclipse.smarthome.model.script.ScriptServiceUtil

rule "Timed changed"
when
    Member of Timed changed
then
    // See How to structure a Rule Design Pattern
    // 1. Always run the Rule
    logWarn("Timed", triggeringItem.name + " = " + triggeringItem.state.toString)

    // 2. Calculate what needs to be done, do we start or cancel the Timer
    val minName = triggeringItem.name + "_Minutes"
    val timerName = triggeringItem.name + "_Minutes_Timer"

    // If the new state is ON or OPEN, turn on the Timer, else cancel the Timer
    var timerState = if(triggeringItem.state == ON || triggeringItem.state == OPEN)  ON else OFF

    // 3. Do it, set or cancel the Timer, reset the minutes
    postUpdate(timerName, timerState)
    postUpdate(minName, 0)

end

rule "Increment open time"
when
    Member of TimedMinutesTimers received command OFF
then
    // Increment the minutes
    val minutes = ScriptServiceUtil.getItemRegistry.getItem(triggeringItem.name.replace("_Timer", ""))
    val newValue = (minutes.state as Number) + 1
    minutes.postUpdate(newValue)

    // Reset the Timer
    triggeringItem.sendCommand(ON)
end

The above code is way simpler and easier to follow. There are fewer indentations (i.e. no indentations), no lambdas, no book keeping required to keep up with the Timers, it’s significantly fewer lines of code (if you don’t count the comments and white space, ), and it will avoid all the potential concurrency issues that rossko57 and I have brought up because we don’t need to loop through or filter the members of a Group. Even if you had 20 members of Timed, it would be totally worth the extra Items to get the simplification in the code above.

This isn’t a scientific calculation because I know of no tool to calculate Cyclomatic Complexity on Xtend, but a quick by hand estimate is a complexity of 5 for the OP and a complexity of 2 for my version.

1 Like

@rlkoshak
Wow, thanks a lot for taking the time to provide and explain the code. I will certainly look into it.

What I still don’t understand what the problem with the 255 is because the problem did not happen at 255.

It’s not a problem. I was only asking if that number made sense at the time the logs were taken. When debugging a problem, knowing what all the variables and Items are set to can be as important if not more important as the exceptions being generated.

Ah, ok. The number was 225 and it was the correct number of minutes since opening the window.

Hmmm, looking back at my code I see a minor bug. It should be


    // 3. Do it, set or cancel the Timer, reset the minutes
    // If the new state is ON or OPEN, turn on the Timer, else cancel the Timer
    if(triggeringItem.state == ON || triggeringItem.state == OPEN) sendCommand(timerName, "ON")
    else postUpdate(timerName, "OFF")

    postUpdate(minName, 0)

I don’t think the Expire Binding timer will get triggered if we only update the Item to ON. We need to send a command. But to cancel the Timer, we must use postUpdate or else the second Rule will trigger.

Expire will trigger on updates (it was the original purpose of it)

That kind of goes against the usual behavior where updates don’t go to the binding. That’s why I doubted my self on reflection.

Last night I found your above rule-example, but forgot to save it immediately. Only made a link on it.

When now looking in this post I missed a line (something with generic-item and replace). Would it be possible to bring this line back for me ?:innocent: As I do not know the correct syntax anymore. And this isa very interesting possibility of manipulating item-names.

Thx in advance
Peter

Edit: When one more time thinking about what I have read last night and about the hints you gave me in the past, I found the solution, which works for me.

import org.eclipse.smarthome.model.script.ScriptServiceUtil

rule "gLWT_Logging"

when Member of gLWT changed to OFF
then
   if(Sonoff_LWT_Logging.state == ON) {
//      logInfo("gLWT_Logging", "Ausgabe Gerät: " + triggeringItem.label + ": " +triggeringItem.state.toString)
      var GenericItem tempItemName = ScriptServiceUtil.getItemRegistry.getItem(triggeringItem.name.replace("_Unreach","_Version")) as  GenericItem  // ***thx to @5iver and @rlkoshak for help***
      logInfo ("Rich - Timed", "Ausgabe-Gerät: "  + triggeringItem.label + ": " +triggeringItem.state.toString + " Release-Version: " + tempItemName.state.toString)     
    }
end
//===============================================================================================

and the result in the log looks like:

2019-07-12 13:36:58.175 [INFO ] [.smarthome.model.script.Rich - Timed] - Ausgabe-Gerät: Stehlampe Ereichbarkeit: OFF Release-Version: 6.5.0(sonoff)
2019-07-12 13:37:08.905 [INFO ] [.smarthome.model.script.Rich - Timed] - Ausgabe-Gerät: Schreibtischlampe Ereichbarkeit: OFF Release-Version: 6.6.0(sonoff)

thx again for inspiration