Persistence: rrd4j with strategy = everyUpdate, everyMinute, restoreOnStartup
I just came across a strange behaviour of the maximumSince function.
It seems like the function filters out very short maximums when asked for longer timeframes.
This leads to the situation that maximumSince(1h) returns 1 while maximumSince(1d) returns 0.
The following code isolates the behaviour. Make sure to use a new test item with no history and persistence enabled:
val example_item = exampleitem
// Switch Item On, wait 5s, switch off again
example_item.sendCommand(ON)
Thread::sleep(5000)
example_item.sendCommand(OFF)
// Check Maximum for one Hour and One Day
logWarn("test", example_item.maximumSince(now.minusMinutes(60), "rrd4j").state.toString()) // Returns ON
logWarn("test", example_item.maximumSince(now.minusDays(1), "rrd4j").state.toString()) // Returns OFF
Am I missing something, or is this something that should at least be mentioned in the documentation?
Also, does someone know of a way to get the “real” maximum in a longer timeframe?
@jimtng it may be. But I nevertheless think this should be mentioned somewhere in the documentation, including which “on-time” within which timeframe is needed so that maximumSince actually recognizes the maximum.
When I know there is a maximum value in the DB and the DB offers a maximum function, I’d expect that function to return the maximum unless stated otherwise.
I have been using openHAB for years, rely heavily on maximumSince in my rules, and never knew about this until I discovered it by accident.
In contrast to a “normal” database such as db4o, a round-robin database does not grow in size - it has a fixed allocated size. This is accomplished by saving a fixed amount of datapoints and by doing data compression, which means that the older the data is, the less values are available. The data is kept in several “archives”, each holding the data for its set timeframe at a defined level of granularity. The starting point for all archives is the actually saved data sample (Item value). So while you might store a sample value every minute for the last 8 hours, you might store the average per day for the last year.
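To make the archive idea concrete, here is a rough sketch of a datasource definition in services/rrd4j.cfg. The datasource name, item name and all numbers are purely illustrative, not the add-on’s defaults:
# hypothetical GAUGE datasource: heartbeat 90 s, no min/max, one sample every 60 s
example.def=GAUGE,90,U,U,60
# first archive: every 60 s sample kept for 480 rows (~8 hours)
# second archive: one averaged value per 1440 steps (one day) kept for 365 rows (~1 year)
example.archives=AVERAGE,0.5,1,480:AVERAGE,0.5,1440,365
example.items=exampleitem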
Thanks for providing the documentation paragraph, but I don’t think it states what I am talking about.
First of all, the data I want to query doesn’t qualify as old: this also happens with values persisted only seconds ago, as long as the timespan you query with maximumSince() is long enough.
Secondly, the documentation only states that data is lost or compressed.
Since I can see the datapoints are there (e.g. in the graph in the item view), I would also expect maximumSince() to return them.
I think the documentation should state something like:
“maximumSince/minimumSince might not return values, even if they are available in rrd4j. The functions group the data by a xyz timespan and ignore maximum/minimum values that are shorter than x.”
Did you make sure the data of the item you’re now looking at is actually persisted?
If you configure persistence on everyChange but the item doesn’t actually change, nothing may get written to that item’s persistence.
If you configure persistence on everyMinute, you will be missing at least one of the changes from your example. Something similar may apply with h vs d.
Enable debug level for rrd4j persistence and double-check item data IS persisted before searching for other reasons.
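For reference, a minimal rrd4j.persist matching the strategies mentioned at the top of the thread might look like the sketch below (the item name is a placeholder; note that with everyMinute alone, a 5-second ON pulse can easily fall between two persisted samples):
Strategies {
    everyMinute : "0 * * * * ?"
    default = everyUpdate
}
Items {
    exampleitem : strategy = everyUpdate, everyMinute, restoreOnStartup
}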
My apologies. I have misunderstood your original problem.
I tried to reproduce your problem and indeed, it seems weird.
Even if you ask for maximumSince 61 minutes ago, it would return OFF, as if it’s querying from a completely different database. Perhaps it makes sense from an rrd point of view, but IMO, it’s unexpected.
I’m not sure if it’s a bug, or an expected outcome due to the peculiarity of rrd.
The easiest solution would be to use a different persistence service.
Nevertheless, if you try the code example I provided initially, you can see that the data has to be there (as the first log command prints ON), yet the query immediately after returns OFF.
Based on experience, my understanding is that rrd4j does not tend to work correctly if you query beyond the earliest saved data point. Does the problem persist with items that have data older than the 1d (or whatever you are using)?
I assume you tested on OH 4.2 or 4.3. @uupascal is using 3.4, and there have been a number of changes to the persistence extensions since then which might have solved the issue. But if you get the same behavior in the latest OH, then it’s something that persists.
I think this could be tested another way. As I understand it, the persistence extensions are implemented generically. They pull all the records between now and the passed-in time and then do the operation on those results. So theoretically one should be able to see what records maximumSince is working on by querying the REST API endpoint for those same records and visually inspecting them. Do you get a different set of records between 1m and 1d where that maximum isn’t in the latter set of records?
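Something along these lines should work against the persistence REST endpoint (host, item name and timestamps are placeholders for your own setup):
# records rrd4j returns for roughly the last hour vs. the last day
curl "http://openhab:8080/rest/persistence/items/exampleitem?serviceId=rrd4j&starttime=2024-09-26T09:00:00.000Z"
curl "http://openhab:8080/rest/persistence/items/exampleitem?serviceId=rrd4j&starttime=2024-09-25T10:00:00.000Z"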
Well, that’s very weird and unexpected. Is there any data for TestSwitch2 older than 61 minutes?
It’s clearly a wrong result and I suspect the problem is in the add-on. Just because you’ve moved into a new bucket doesn’t mean the earlier bucket should be ignored.
That’s indeed what happens in the persistence extensions. The extensions don’t consider the specific underlying service at all (apart from the difference between Queryable and Modifiable). They stay away from the details of the persistence service implementations.
This reminds me again of the discussion about pushing these calculations into the actual service, so the DB itself can use an optimized method. I wonder if that would give different results. I believe there are some statistical methods in rrd4j.
Either way, the query function of rrd4j is IMO wrong and needs to be fixed independently of the statistical extensions. Extending the query beyond the time of the earliest data should not result in zero results.
I can see there being fewer results as the time period spans across buckets, if it’s the case that it normalizes the data (i.e. if you’ve moved into a one-record-per-hour bucket, the one-record-per-minute values are averaged to get one per hour, even for the more recent values). That kind of makes sense. But the results @jimtng shows do not make any sense to me.
rrd4j is REALLY not suitable for retrieving the data back for general analysis.
Take this for example: you have 5 changes occurring consecutively within 500ms.
The minimum granularity for rrd4j is one second, so none of that data will actually be persisted; it will be discarded.
You’ll end up with this:
19:58:18.255 [WARN ] [rd4j.internal.RRD4jPersistenceService] - Could not persist 'TestSwitch1' to rrd4j database: Bad sample time: 1727344680. Last update time was 1727344693, at least one second step is required
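For what it’s worth, a DSL sketch like this (assuming a Switch item named TestSwitch1 that is persisted on every update) is enough to provoke that warning:
// fire five commands within roughly half a second
var i = 0
while (i < 5) {
    TestSwitch1.sendCommand(if (i % 2 == 0) ON else OFF)
    Thread::sleep(100)
    i = i + 1
}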
It’s OK for analysis if you understand the limitations and those limitations don’t matter to your use case. It’s rare in a home automation context that analyzing three events happening within a second is meaningful. Most of the time it’s going to be seconds to minutes to hours, and rrd4j is fine for that.
But either way, it makes no sense to get 713 records for the past 60 minutes but only 61 for the past two hours and none from more than a year ago (note, IIRC rrd4j does get rid of records older than two years). There is something incorrect in how the queries are being done in the add-on, above and beyond the built-in quirks of how rrd4j works.