RRD4J seems to sum/average weird (after one hour)

A little mystery (for me, hopefully not for you :wink: ). I’m experimenting with a rule to create average/ trends configurable with tags. Well work in progress but behaves like expected in the first hour. It takes an quantifiable item (numeric:power) and calculates and average and a reference (also (numeric:power). This to make (yeah a little timeshifted) graphs with let’s say a five minute average compared to the period middle (2,5 minute behind) actual value. Image for the last hour (so the 10 second values in rrd4j). Blue line original data, yellow the ‘timeshifted’ value and green the average.


I would expect this to do (more or less) the same for older periods. rrd4j (default settings) will sum up the 10 seconds to a minute values by averaging them (since it’s quantifiable items). I would expect sometimes a 10 sec value just to be in the next or previous ‘bucket/bag’ so little differences. What I’m getting is this:

So huge differences between the yellow reference and blue original data. The peaks could be smeared a bit, but they’ve completely gone. Only thing I can think of it uses the default numeric option which is last value instead of averaging them. But according to documentation this shouldn’t happen.

Don’t know if relevant but I’m running on the 3.4.0 version. And as said, default settings for rrd4j (so no configuration files present).

Anyone know what I’m missing?!?

Well, did some more research. Using the API the graph shows the same as persisted. Just to be sure.

And using the aggregation mode in de the graph the daily totals are almost the same. So it must be in the distribution of the data when rrd4j averages the 10 sec intervals to minute intervals.

Next test, change my default period from 5 to 6 minutes which will make the reference run exact 3 minutes behind. So it matches the next minute bucket/bag of rrd4j.

Could you be a bit more pecific on what you did persist?
You persisted the original data (always the blue line Iassume), 'timeshifted" data (the ‘reference’? how did you do the timeshift?) and an average (how did you calculae this avewrage?)
The graphs shown did use which timeframe? (first chart: Last Hour, second chart: ??)

Which AP?

Which timeframe wass requested for this third graph? (week?)

As long as we do not know what was really persisted during the timeframe 07:00 to 07:59 we can’t really say if rrd4j did something weird. Since this timeframe is older then an hour, rrd4j will not have that old data! You might have it in the logs!

Timeframe for the first and second picture is a hour. But the first is the last.hour, so the actual 10 second data. Second is a previous hour, so the 10 second is aggregated to minute intervals. The last is the daily total for the last week.

The reference is 2,5 minutes delayed from the actual data. Which the first picture shows.

Worst case I would expect is a minute peak for lets say 3000W just in a minute interval. Since the reference is 2,5 minutes behind this is divided into 2 (minute) intervals with a value of 1500W.

But this would still show as a smeared peak and not disappearing.

Enough for now, happy new year!

That is undstood, however the question was how you did those datapoints into a .rrd file. There migth be the problem ( if those peaks were valid data ).
How does a chart look like for the timeframe of chart 1 ( i.e. when this timeframe is not the last hour)?

We have another 40 minutes to spend in 2022, see/read you next year.

Ok, first the 3 minutes test. Doesn’t change anything.

Let me try to explain my reasoning while answering your questions. Let’s make it as simple as possible, so forget the averaging and focus on the reference . I have a rule running on a cron-job. This rule takes the historic state “3 minutes ago” of “netto totaal” and writes it in another item “netto totaal ref”. That’s all you can do with rrd4j, there’s (as far as I know) no way to change old values. Only the actual value can be manipulated. So something like (actual code is very long):

var his = time.ZonedDateTime.now().minusMinutes(3);
var = items.getItem("netto_totaal").history.historicState(his);

So your first question. Yes the blue line is the actual/original data made by a http-thing. The green line is created data by a rule. And this part of the rule is just a timeshift. It’s running 3 minutes behind on the actual data.

Using the openHab gui, select the item “netto totaal”, choose button “analyse”, add item “netto totaal ref” and select period “1h” shows the last hour of both items. I get what I expect. And this is why I think the actual rule is not that important. Looking at the last hour it seems to do it’s job. And the rule cannot to anything with values in the past. It’s all up to rrd4j now. Here’s the 3 minute graph.


So in answer to the second set of questions. I just use analyse item option in the openHab GUI to display the last hour of both items.

Now to the issue. Using the navigation buttons top-right I can move to past hours. I included the time, a few hours back.


And here’s what I don’t understand. It’s less than 24 hours ago, so I expect rrd4j to take the average from six 10 second periods and make this the new 1 minute value. This shouldn’t change a lot. Maybe some smearing of peaks, but not making them dissapear. But this is what seems to happen.


So yeah, you’re right. This isn’t the actual data anymore. But in my mind it doesn’t matter since the rule had a simple job and I assume (yeah I know assumptions) it will do it’s job “copying data”.

So this is the next test. I made a printscreen of now (11-12). And will wait an hour and make a screenshot of that same period when it’s aggregated. Then we know we look at the same data. To be continued!

Did it 2 times and they’re good. So your suggestion seems a good hint. Now put in more logging in the script to test the two things I can think of as a cause:

  1. The script does sometimes not run and misses a 10 second interval. And this interval contains a ‘lucky’ peak.
  2. The script does run, but a little early. So it reads the item just before it’s updated. And misses a ‘lucky’ peak.

I bet on #2. Hope my extra code can detect this situation (I use the item.history.lastUpdate() and compare it to now. If it’s off to much I log a warning to see if this happens a lot.

Why don’t you use the logging for rrd4j? Setting it to DEBUG for the rrd4j bundle would show each and every value that is put into the database.

I think my issue happens “before” the value is persisted, so that’s why I added some logging to the rule. I think the log is confirming option #2.

2023-01-02 14:12:01.043 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:12:11.539 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated false
2023-01-02 14:12:21.044 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:12:31.094 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:12:41.082 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:12:51.081 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:13:01.044 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:13:11.514 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated false
2023-01-02 14:13:21.045 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true
2023-01-02 14:13:31.057 [INFO ] [nhab.automation.script.ui.da64f660dd] - so update run EnergieMeter_nettototaal 1 state-updated true

It runs like clockwork at xx:xx:x1 except at the 11 seconds past. That interval is 500ms late and (other log line) also the update of the “netto totaal” is (even more) late. So every 11 past it (re)uses the old value. If values are steady not a big issue but if that period hits a big spike it’s missed.

Now think of something to improve/solve this. I would like to keep the setup that it handles several items. It seems my OH is busy at xx:xx:10, maybe running this rule 5 seconds later will do wonders. Or should I run it twice and detect is the source items is changed? Suggestions are welcome!

I think knowing which values are really used by rrd4j would show what is going on.
Remember that the actual state might nof be the value that is used by rrd4j at this moment. It all depends at which event or timestep openHAB is providing a value to be persisted, rrd4j is using this value at the10 second steps.

I’ll give this a go too. Tinkering with the cron timing was a big improvement.

You have a good link on how rrd4j works with openhab? I’m curious what happens when 2 values are written in de 10 sec period. Or how it times that 10 sec periods. This seems to be the thing behind my observations. The data is provided also at roughly a 10 sec interval. So all these 10 sec events probably interfere …

I’m wondering if rrd4j is the right service for that task, as it does data compression.
See here:

Read that page, but for my purpose “visually ok for 0-24 hrs graphs” aggregating 10sec to a minute interval should be fine. I don’t need mathematically correct until the last decimal or pixelperfect. I’ve seen posts using an influxDb and grafana to accomplish that. Maybe it’s inevitable but for now I would like to keep the standard configuration. And it’s kind of a learning opportunity to understand openHAB a bit more.

As I understood it openHAB does provide a value to be persited accorsing the used strategy (i.e. everyMinute, everyUpdate, everyChange or the like).
This provided value is used by rrd4j according the archive configuration ( the heartbeat setting). As long as no new value is provided by openHAB the last value will be persisted at each heartbeat.

It’s the moment of persisting that causes my artifacts. The *** is the update of the source item. Note it’s only one “source” 10 second period 387 W. However, looking at graph and database this value is persisted in two “rrd4j” periods.

The rule that calculates my average and reference ‘sees’ this now. The first value in the ‘check update’ line is the state of the source item. That follows exact the source item. The second value is the source item state retrieved from the persistance (using the history methods). Usually it’s just 1 behind, but the 387 is present twice and the (first) 203 that it should be is not persisted.

Well I trapped the source item now, I presume the same might happen to my average and reference value. My calculated state will be every 10 seconds in the item states but might not be persisted that way.

Sounds all logic, but I wouldn’t have thought of this at the start. So in the graps the reference might be a better representation of the actual values the thing passed. Because in this example the graph of the reference doesn’t show the doubled 387 W.

Now back to my goal, a visual good enough looking graph (also when looking beyond one hour ago) … food for thought!

2023-01-03 13:20:48.380 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 159 W
2023-01-03 13:20:54.560 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 false false 159 W 159 W
2023-01-03 13:20:58.383 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 387 W
2023-01-03 13:21:04.565 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 false false 387 W 159 W
2023-01-03 13:21:08.382 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 203 W
2023-01-03 13:21:14.537 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 true true 203 W 387 W
2023-01-03 13:21:18.379 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 207 W
2023-01-03 13:21:24.568 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 false false 207 W 387 W
2023-01-03 13:21:28.381 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 203 W
2023-01-03 13:21:34.600 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 false false 203 W 207 W
2023-01-03 13:21:38.381 [INFO ] [org.openhab.rule.b0e0ef3b4f         ] - *** new state nettototaal 130 W
2023-01-03 13:21:44.563 [INFO ] [nhab.automation.script.ui.da64f660dd] - check update EnergieMeter_nettototaal 4 -1 false false 130 W 203 W

@opus thanks for pointing me in the right direction!

To summarize all I learned:

  • In a rule the current state the value is usually the same as the value persisted, but it’s not guaranteed!
  • Especially when all have an update frequency of 10 seconds (in my case the thing, rrd4j and the rule) and other ‘heavy’ processes (or rules) run (that might delay the three process named).
  • If the rule calculated and writes a new value same may apply, what you set might not be whats persisted
  • In the end when you put it all in a chart it causes deviations. The more spikes in the data, the bigger the deviations.

Knowing this, I had a long talk with myself and decided consistency is more important than accuracy. Or I like my reference (except for the timeshift) to be in synch with the source. Best way to achieve this is not to take the actual state of an item but use the (persisted) historic state. To reduce timing issues the script runs at 4/14/24 etc and gets the historic state for (just before) 0/10/20.

It did the trick. I’ve got one hour with exact match between source and (delayed) source. Now the data is ok I can continue development of my script :white_check_mark:

(when it’s done I’ll put it in a new post for anyone interested)