Nullpointer hazard with timers?

Hundertvolt1 · November 28, 2023, 9:50pm

Hi all,
I have a general question about timers in rules DSL.

From what I found, it’s usual practice (and so did I many times in my config) to use timers in a fashion like:

if (myTimer === null) {
  myTimer = createTimer(now.plusSeconds(myTimeout), [ |
    <doSomeStuff>
    myTimer = null
  ])
}

to create a timer stored in a global variable if it does not yet exist and to delete it by itself (set to null) after it finished running.

In other places (other conditions, other rules, etc.), there may be an occasion to let it run longer like

if (myTimer !== null) {
  myTimer.reschedule(now.plusSeconds(myTimeout))
}

…so to first check if it exists and then do some action on it (may be reschedule, may be cancel, or whatever).

As this is all asynchronous in general (the timers in separate threads, triggers of different rules, etc), I wonder if there might be a slight chance for this to fail:

What happens if by bad chance the second code piece runs in parallel with the code within the timer in such a way that myTimer = null is executed and the variable is written precisely after the if (myTimer !== null) check, but before the myTimer.reschedule? This would lead to an action on a null pointer

Is this a “real” hazard which might actually happen? Or is there some background thing that prevents this?

Even if the odds for this are very low - if it’s possible, it will happen right on your first day of holiday and leave your lights turned on all the time

rlkoshak · November 28, 2023, 11:27pm

Is it possible? Sure.

I’ve never seen it happen in all the years I’ve used OH.

You can test for isRunning and hasTerminated in addition to testing for null to further reduce the likelihood of this time of check time of use error happening.

In practice, this is so unlikely as to not be worth worrying about. And anything that you might do to make this more thread safe is more likely to cause new problems than stop the problem.

Udo_Hartmann · November 29, 2023, 3:57am

Imho best practice is, only to use timer.reschedule() inside the timer, but never outside.

like that:


rule "some timer"
when
    // some trigger which should cause 
    // the timer to be scheduled or rescheduled
then
    tTimer?.cancel                    // throw away any old timer (if there is any)
    // do what needs to be done...
    tTimer = createTimer(now.plusSeconds(10), [|
        if(!(action should be still delayed)) {
             // action
        } else
            tTimer.reschedule(now.plusSeconds(10))
    ])
end

Hundertvolt1 · November 29, 2023, 7:49am

Hi, thanks for your suggestions!

This brought me to some ideas…
First, rescheduling only inside timers actually makes it safe for this funcion, which is cool already.
In some cases, it may be necessary to reschedule to some certain asynchronous event plus a fixed time, but I guess it would work with a global variable e.g. myTimerRe and this code:

if (myTimer === null) {
  myTimerRe = null
  myTimer = createTimer(now.plusSeconds(myTimeout), [ |
    if (myTimerRe !== null) {
      myTimer.reschedule(myTimerRe)
      myTimerRe = null
    } else {
      myTimer = null
    }
  ])
}

and if I want to reschedule, I just set myTimerRe = now.plusSeconds(10) or the like from an asynchronous call. I guess this would work?

Overengineering definitely leads to more pain absolutely agree. No good feeling about “good code” with this hazard anyway. So how about not avoiding the rare case, but dealing with it in a simple way?

The case with rescheduling can be solved as seen
The case of myTimer.cancel() is only sensible when calling from outside the timer but may be affected with the same issue.

But: Canceling the timer just stops it, you have to set it to null afterwards anyway.
If the timer runs out before canceling, even if it’s only one command line, is an allowed situation.
So even the very rare case can be dealt with this, I guess:

if (myTimer !== null) {
  try { myTimer.cancel() } catch (Exception e) { }
  myTimer = null
}

If it’s still running: stop and delete it.
If it expires right at that moment: catch the nullpointer that the script can continue running and have everything in a defined state.
I suppose that there is not much which can go wrong apart from the timer expiring. Either the pointer is valid or the timer is expired.

And, actually, a similar issue might occur with rescheduling as described above - myTimerRe may also be asynchronously written. No nullpointer though, only wrong or missed reschedules.

So maybe the safest way with asynchronously updated timers is to never reschedule, but always cancel the “safe” way as described and create a new timer…?

What do you think?

rlkoshak · November 29, 2023, 2:50pm

I’m not sure that’s feasible in all use cases. For example, what if the timer is there to detect when motion stops, i.e. you want the timer to run 10 seconds after the last motion detection event?

With a good deal of book keeping you probably could reschedule the timer inside the timer body, but it’s going to be really complicated to do so.

Don’t use null. As I said above, you can test for isRunning (i.e. the timer’s code is actively executing) and hasTerminated (i.e. the timer expired and ran its code to completion). You’ve also isActive (i.e. it’s currently scheduled to run sometime in the future and hasn’t completed nor been cancelled) and isCancelled which is true if the timer never got to run and was cancelled before that point.

if(myTimer === null) is a lot simpler than if(myTimer === null || myTimer.hasTerminated || myTimer.isCancelled) but if you don’t reset the timer variable to null inside the timer body the window for a TOCTOU error is reduced even further.

If it really bothers you, you can file an issue to make these calls synchronized. From what I can tell they do not appear to be though I didn’t dig that deep into the code.

Only for some use cases and often with a large increase in complexity.

In JS Scripting calling cancel() inside the timer results in an error.

Not really. That’s common practice but not required by any means. The “dead” Timer can stay there forever. You still need to test for null because the variable will initialize to null and remain so until that first Timer is created.

The big problem with Rules DSL, and why trying to address this in the rules code directly, is that not all exceptions make it back up to the rule to be caught. Is this one of those cases? I’ve never seen such an exception so I can’t say. And therein lies the problem. Unless and until someone actually experiences this problem and we can observe what happens, any solution is going to be theoretical at best and a Galileo’s Column at worst.

I think this potential problem is most simply solved by simple not resetting the timer variable back to null and use the Timer functions to see the current state of the timer, in specific hasTerminated and isCancelled. You still have the potential of a TOCTU because the timer might become cancelled or terminate after you test, but the timing is such that it will be nearly impossible to occur in practice.

Beyond that you can try to set up semaphores and/or reentrant locks but those come with their own problems and complexities. The biggest complexity is what I mentioned above. Not all exceptions make it back up to the Rules DSL rule meaning you cannot always guarantee that the lock will always become unlocked. And now you’ve just moved the problem from the Timer to the lock, but you’ve not actually solved anything.

Finally, because as far as I know this problem has never actually occurred, how do we know if any of these suggestions actually solves it? You’ll have to set up some sort of test case where you can be pretty sure that the error does occur to verify that your solution works. I’m not sure how to do that in this case.

All-in-all, I still think this is a Galileo’s Column.

Hundertvolt1 · November 29, 2023, 3:54pm

First of all, thanks for your detailed answer!

It’s generally good to know that the timers have so many functions to obtain information about their state! This did not make it to my eyes yet, although reading a lot about that function. Nevertheless, this only shifts the issue (as do semaphores, as you mentioned) but does not really solve it.

Real-time OSes usually have system-wide ways to deal with such issues, but I’m not sure if this is what we really would need for a rather “low-frequency” thing like smart home … overengineering, to the rescue!

So, if I imagine the rare event that everything happens at the same time, there still are at max two states we can end up with if we finish each timer with setting its variable to null: either the timer still exists and we can still call functions on it, or the timer is dead as its variable turned to null right after the safety check and we make a call to the nullpointer.

With this assumption, I tested what happens:

if I call myTimer.cancel() with myTimer = null, I do get a nullpointer exception on the console. Script execution breaks at this point.
if I call try { myTimer.cancel() } catch (Exception e) { }, there is no exception on the console, and code execution continues.
if I call try { myTimer.cancel() } catch (Exception e) { } with a valid timer behind myTimer, the same happens as without the try-except - the timer is canceled as expected.

With this background, I checked my (dozens of) different timers.
Actually, I could solve all my use cases with this scheme!

There were some changes in the if-statement order about what to check before creating or canceling a timer(some of which even made my code easier ).
Every cancel statement, I safeguarded with the try-except as mentioned. As the cancellation was setting the timer variable to null anyway, it does not even matter if the timer runs out just in that moment or if it’s canceled. It ends up in a defined state in any case.
In any case, there is the possibility of dealing with the situation in more detail by using the exception code brackets as well (code does get exectued there in case of a null pointer, I checked as well!)
Every reschedule outside of the timer’s code I replaced with such a try-cancel-except, followed by recreating the same timer at some different places. Which was easy in my cases; the code inside these timers was simple, short and context-independent anyway.

So at least from this side, I would suppose that even the rare case of parallel execution is dealt with like this. The additional complexity is absolutely low and at some places the code even became nicer.

So far, thanks to all of you for the nice discussion, I learned quite a lot from it