deltaSince or historicState sometimes returns null on openHAB 3.2

Beluk · January 5, 2022, 7:18am

For historical data rrd4j is used. The data is saved every minute. But deltaSince or historicState sometimes returns null. Any ideas why that is happening?

test.rules:

rule "Energie"
when
    Time cron "0 * * * * ?"
then
    var Number delta1 = Smartmeter_Energie.deltaSince(now.minusHours(1)) as Number
    var Number past = Smartmeter_Energie.historicState(now.minusHours(1)).state as Number
    var Number now = Smartmeter_Energie.state as Number
    var Number delta2 = now - past
    logWarn("test.rules", "Energy last hour: " + delta1 + " kWh (deltaSince) " + delta2 + " kWH (histotricState)")
end

rrd4j:

Items {
	Smartmeter_Energie					: strategy = everyChange, everyMinute
}

Smartmeter_Engerie.json (60.5 KB)

rossko57 · January 5, 2022, 9:53am

The timestamp of your failing log entry is 8 seconds out of alignment with the rule trigger. That’s a clue that one of the retrievals has taken a long time, or more likely the deltaSince has just timed out.

Question; has a reboot happened in the previous hour or so? rrd4j will place “NaN” into time-series pigeonholes where no per-minute persisting can have taken place, which can cause glitches later.

Beluk · January 5, 2022, 10:26am

There was no reboot since a day. Also the system was untouched during night when this happens. I also checked the data in rrd file and also plotted the graph. I could not find a “NaN” entry.

But during cheking the log i found that i only filtered for warning. There were also a few error due to null values of historicState.

All warning have a delay off a few minutes. But the historicState seams to have no delay.

Do you know how long it waits until a timout occur? Can it be increased?

rossko57 · January 5, 2022, 10:58am

So that’s not a timeout then, and there’s no point trying to change timeout periods.

Next step is probably put DEBUG logging on persistence service.

Beluk · January 5, 2022, 4:18pm

I have set the log level to debug. But the value seams to be stored 50 ms before. What looks strange to me is that all entries have a second log entry with (again).

rossko57 · January 5, 2022, 4:24pm

Remember rrd4j has an archive structure, with multiple tables for each Item, all representing the same times but more compressed. I’ve no idea if that is normal debug log, but the different messages (“again”) suggest it is expected.

Anyway, debug of that module didn’t tell us anything about read activity. I don’t know what to suggest to get ‘persistence extensions’ logging.

opus · January 5, 2022, 5:55pm

I would doublecheck which items are really persisted, the individual item and a group containing the item?

Beluk · January 5, 2022, 6:29pm

@rossko57 Thanks for all that information
@opus The group item is not persisted. Under OpenHAB 3.1 i had never seen these errors.

Beluk · January 11, 2022, 7:09pm

I tried to isolate the problem. Deactivated nearly every binding, transformation, ui,… Made the DSL rule also in Blockly and also in the new JS Scripting. All the same. But then i changed the timing to trigger 30 seconds later.

  triggers: [triggers.GenericCronTrigger("30 * * * * ? *")],

The last days i observed it i had never seen a null value again.

So it looks like to me, that when the rrd4j file is written at the same time when the rule is quering the data it cannot get a value and returns null.

rossko57 · January 11, 2022, 7:47pm

Seems a reasonable deduction. rrd4j has a fairly big job every minute.
Are you running with default “everything you can” settings? Might improve by selective persistence.

Of course what we really want is for your ad-hoc query to go in the work queue and give a result. Maybe it’s just timing out silently.

Beluk · January 11, 2022, 8:13pm

Before over 600 items werd stored in rrd4j, but currently only 50. But the intervall the null value happens was nearly the same on both configurations.

During my test before I have run three different language rules at the same time triggering every full minute. Sometimes one failed but the other both returned values and the difference of the logging where only 100 milliseonds more or less. I can also make it working with a second try when the first time a null is returned. So not sure why it should timout on one and not on the others.

rossko57 · January 11, 2022, 10:35pm

Have you tidied away the unwanted rrd files? I don’t know if will still carefully update/compress the archives as time marches on, it might keep populating the per-minute pigeonholes even if it is with NaN, Meaning it still processes 600 archives every minute.

The random nature of failure does not surprise me.
There was recent post about similar jdbc workload-related errors. It would be nice if the persistence service logged WARNs for timeouts, if that is what is happening.

Beluk · January 12, 2022, 6:07am

Yes, I had deleted all files in the directory var/lib/openhab/persistence/rrd4j and also cleaned the cache with openhab-cli clean-cache.

Beluk · January 25, 2022, 7:52pm

After several hours of research I think i found the root cause. If the trigger overlaps with the timing when an archive in rrd4j is created (e.g. per default every hour, week, year, 5years and 10 years) it could happen that the persistance function returns null. I have tested this on different hardware and different operation system - everywhere the same.

PS:
When i changed the archive creation in rrd4j.cfg for example to 2 hours i have the null value with minusMinutes(120) and not anymore with minusMinutes(60).

Step to reprodice:

Make a timer with 0 * * * * ? * and the rule

  var item = items.getItem("SmartmeterEnergie");  
  var beforeOneHour = time.ZonedDateTime.now().minusMinutes(60);
  var todayEnergyConsumption = item.history.deltaSince(beforeOneHour, "rrd4j");
  console.warn("Energy:", todayEnergyConsumption);

If it is not shown the null value you may retrigger the rule so it get recheduled and can execute at the same time the archive is created.

Would be nice if someone else can test and confirm that…

rossko57 · January 25, 2022, 10:48pm

I think I follow what you’re saying. Say we have an archive configured for 2 hours, it will cover from now to 2 hours ago.
But - not exactly. We don’t update it every millisecond, so sometimes it must cover a little less than 2 hours.

So if we come along and ask for data from 1:59 hours ago, it might or might not be in this nominally “2 hour” archive. If we ask for 2 hours, it more than likely won’t be there yet. If we ask for 2:01 that’s no problem - we look in the “8 hour” archive instead.

There must be some code in the rrd4j persistence service that works out from the requested time which archive to look in. I guess that needs to made more cautious or clever.

Very well done for figuring it out

system · March 8, 2022, 2:48pm

This topic was automatically closed 41 days after the last reply. New replies are no longer allowed.