Dynamic persistence

I am a data junkie and I have a lot of sensors in my home for which I want to keep records. Most of them measure temperature and electrical power, and I want to see short peaks in the persisted data, so today I store them very frequently or even on every change. As a result, a lot of the MySQL tables are flooded with data that is not very interesting and could be “compressed”. For example, if the temperature decreases linearly by 1 °C per minute over 30 minutes, I wouldn't lose much information by storing a value only every 30 minutes; but as soon as there are fast changes I want to see them as accurately as possible, and the min and max values (or rather every peak) shall be kept. If possible, even the integral should match the original data. All without producing GBs of data.

To solve this, I'd wish for a dynamic persistence strategy that monitors every change of the respective Items, (temporarily) stores all of them, and then filters them down so that the input dataset can be reproduced by proper interpolation within a targeted accuracy.

In theory I could duplicate all Items, persist the original ones in a round-robin database and do the long-term storage with rules. But first, this looks quite complex to achieve, and duplicating ALL Items is a lot of effort and error-prone; second, I don't like the idea of copy-pasting a rule just because I added a new Item…

Any idea is welcome :slight_smile:

No matter how you slice this it’s going to be complex and challenging.

There is no reason to ever do this. Look at the DRY Design Pattern for details.

There might be something you can do with InfluxDB retention policies. If not, you have to ask yourself whether this is really worth the effort. I have 31 Items stored on every change (most as often as once a minute) and have been running this way with InfluxDB for about 3 years. My InfluxDB currently consumes a mere 129 MB of disk space.

Hey Rich,

thanks for your answer. Sometimes it feels as if you were the only one to answer! Please keep up that fast and helpful responding, I highly appreciate every post from you!

But now let’s come back to the topic:

That's exactly the reason why I hoped that someone had come across this topic before and maybe found a good solution - or has some plans for OH 3 or 4 :slight_smile:
On the other hand, a colleague of mine has a mantra: it would not be fun if it were not hard.

Professionally, I agree we have to avoid unwanted redundancy. But still, sometimes it is a fast solution for a problem that might never be solved in a clean way, and then I prefer to have something that doesn't make me happy as a software guy but still does the job.
Besides the fact that I don't plan to win a software architecture contest with my OH config, my OH experience is still quite limited, and as this is a hobby project I also cannot invest too much time into it.

This also brings me to your proposal to look into InfluxDB retention policies. I am not too happy about the need to invest in yet another database technology and would prefer a solution that is more or less independent of the persistence add-on used (otherwise the abstraction of the persistence layer in OH would be kind of absurd).

I have MySQL/MariaDB set up and can smoothly integrate it into my backup strategy and all that. But if you say it is worth taking a look, I will - not today, but before building something on my own, so thanks for the idea.

I am at roughly 400 MB with 218 tables (Items), and my setup is far from satisfying and has not been running for long, yet the biggest amount of data comes from Items added only 3 months ago, which was actually the trigger to think about what I'd wish for.

Once a minute does not show the short switching of, e.g., an electric water boiler running for 50 seconds. Whether it is ultimately worth it I don't yet know; maybe the choice will be to accept a big amount of data, but I want to try and ask here first. Storing “useless” data is also not clean, and in a sense we can apply the DRY principle to data as well, so let's see if someone else is also not happy with 1-minute (which is 60 000 ms, BTW) intervals.

So I still hope for a clean solution to this topic, which might be something like an intermediate layer in the OH persistence machinery or a new persistence add-on doing it one way or the other. There might even be a need to first build a small model of how it should behave in Matlab, and if anybody has ideas about what should or could be done for dynamic persistence, please raise your voice now!

Don't take this the wrong way, but you are trying to have your cake and eat it too. You explicitly state “I don't like the idea of copy-pasting a rule just because I added a new Item…”, but applying a little bit of effort to review several well-documented ways to avoid this is too much work?

It isn't totally clear what this Rule would look like, but you can write one rule using:

rule "Handle persistence"
when
    Member of AdvancedPersistItems changed
then
    // do advanced persistence stuff with triggeringItem which is the Item that changed
end
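
Just to illustrate what the body could do (this is only a sketch, not a finished solution: the Group name AdvancedPersistItems, the tolerance, the 30-minute interval and the "mysql" service id are assumptions you would replace with your own), a simple "deadband plus maximum interval" filter could look like this in a .rules file on OH 2.x:

import java.util.Map

// Sketch only: persist a member of AdvancedPersistItems when its value has moved
// by more than a tolerance since the last stored value, or when a maximum
// interval has passed. Assumes numeric Item states and the Joda-based "now" of OH 2.x.
val Map<String, Double> lastStoredValue = newHashMap
val Map<String, Long> lastStoredTime = newHashMap

rule "Handle persistence"
when
    Member of AdvancedPersistItems changed
then
    val name = triggeringItem.name
    val value = (triggeringItem.state as Number).doubleValue
    val tolerance = 0.2                  // deadband: ignore smaller changes
    val maxIntervalMs = 30 * 60 * 1000L  // ... but store at least every 30 minutes

    val prevValue = lastStoredValue.get(name)
    val prevTime = lastStoredTime.get(name)

    if (prevValue === null
            || Math::abs(value - prevValue) > tolerance
            || (now.millis - prevTime) > maxIntervalMs) {
        triggeringItem.persist("mysql")   // "mysql" = id of the persistence service
        lastStoredValue.put(name, value)
        lastStoredTime.put(name, now.millis)
    }
end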

OH supports a wide variety of databases (round robin, time series, NoSQL, and SQL) with a single API. This API must therefore be the lowest common denominator. Something like this is probably never going to be implemented in the OH API, because it would be impossible or meaningless for one or more of these database types.

If you are looking for something that already exists that you can use, InfluxDB retention policies are the only thing I know of that might have what you need. Beyond that, you will have to do it yourself. The kind of highly flexible database compression algorithm you are looking for is a niche requirement, would require a completely separate implementation for each database type, and won't work at all for some of them. Therefore it doesn't really belong in the Persistence layer of OH at all.

For sure I cannot argue with that, and even though I have some Groups configured (and use them for dedicated persistence strategies), I did not know of that rule trigger, maybe because I have only written a few very simple rules so far. So thanks for opening my eyes.

Now I have also found the DRY Design Pattern post you were referring to. Before, I had just googled for it and ended up on Wikipedia.

For Influx, the downsampling could probably be achieved with a continuous query, but doing the real filtering in SQL feels harder than doing it in the rules DSL. In MySQL/MariaDB, stored procedures could also be used for cleanup. For the short-term (maybe an hour or a day) high-resolution storage it might be an interesting option, because it looks like rrd4j cannot be used for that, and just deleting everything older than a day is not that hard in SQL :slight_smile:

So, to summarize: it is quite some effort, nobody has done it before with this level of complexity, and to be portable the best approach is to do it in a rule with a trigger condition on a Group.
Simpler solutions are described in

Another niche discussion was in

Except that the Rules DSL doesn't have a good query capability. All you get is MyItem.historicState(now.minusMinutes(10)) and the like. That's going to be way uglier than code that can use SQL directly.
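
For illustration, this is roughly the full extent of what a rule can ask the persistence layer for (MyItem and the "mysql" service id are placeholders, and the calls would sit inside a rule body):

// A couple of the existing persistence extensions, as far as querying goes:
val previous = MyItem.historicState(now.minusMinutes(10), "mysql")   // state at one point in time
val peak = MyItem.maximumSince(now.minusMinutes(10), "mysql")        // aggregate over a time window
// There is no call that iterates over all stored rows, and none that writes a
// value back with an arbitrary (old) timestamp.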

And then once you have managed to extract the data you want and figured out how to decimate/compress it, you have no way to put it back into the database because you can’t specify the timestamp. All your old data will appear as new again.

And you don’t have access to the SQL libraries necessary to directly interact with the DB from Rules.

The only reason Rules DSL feels easier I think is because you are more familiar with it. When the only tool you have is a hammer, all problems look like nails.

You might be able to do something from JSR223 rules. It is easier to get access to third party libraries there.

??? rrd4j keeps the first 24 hours' worth of entries unmodified. That is at least one entry per minute. Only after a day does it start to downsample the data. Though if you don't care about anything older than a day, then indeed rrd4j doesn't work.

I just saw this older thread. I have another request, maybe the same one. So the question, if I understand right, is:

You have persistence on every change and every minute, for example. And if the value stays the same for 10 minutes, the eight values in the middle could be deleted, so you have for example:

a) 7
b) 2
c) 2
d) 2
e) 2
f) 2
g) 2
h) 4

later reduced to

a) 7
b) 2
g) 2
h) 4

to reduce data and get the right charts, correct?

I just added a tutorial for deleting old Items, so this could be a nice next task :wink:

This is a possibility, yes. Whether the values are deleted later or just not stored in the first place, I don't care. And I have some Items where there should be more or less tolerance.
Temperatures
12.1
12.2
12.0
12.3
Could become
12.1
12.3
And even more aggressive
12
10
11
12
13
14
15
16
17
Could be
12
10
17
As a linear interpolation would fit perfectly.

Those are some really different tasks.

As I understood it:
My plan: delete unneeded values where the history (per minute, hour, …) shows the same value
==> Would be possible.

Your tasks:

  1. Delete unneeded values below a limited threshold. Question: in the first example, why should the result be 12.1 and 12.3 (the threshold would be neither 0.2 nor 0.3 but changing ==> not possible) and not 12.1, 12.0 and 12.3 (which would be a threshold of 0.2 ==> I think that could be possible)?
  2. Looks like a threshold of 2 plus additionally keeping the last value. What about a time factor? Otherwise, some time later the charts would show a strong increase ==> looks like a hard task.

From my point of view: removing unneeded values is sensible maintenance to save storage.

  1. Could save storage, but raises additional questions: if the value rises over a long time while staying under the threshold, all changes would be lost. Or, more complex: check the threshold against the last value that was actually stored.
  2. Same arguments as in 1).

The problem is that for a universal procedure in MySQL, additional information would be needed, e.g. the openHAB 2 Item type, or a check against the column type, so that only numeric values are processed.

I never fully thought that through. For sure, your approach of deleting duplicates is the simplest one and probably already very powerful. For my temperatures with 0.1 °C resolution it is maybe not perfect, but clearly good as a first step, and best of all: very simple to implement.

My idea was basically to remove everything that is not needed to show what I am interested in. This means: if an interpolated graph looks the same (or at least very similar), it is OK for me. And the queries I usually do - most prominently min and max - should not be messed up. In the best case, even the integral (average) should stay the same.

Mathematically, you eliminate only sections of polynomial degree 0, while my approach would also remove degree 1 - which is perfectly defined by the start and end points. But I don't see how to continue this, and even less how to properly extend it under the given constraints. Maybe a cubic-spline-based approach would be great, but I am not fluent in all that. So far it is simpler to just ensure enough storage space.
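
To make the degree-1 idea concrete: a sample can be dropped if it lies within some tolerance of the straight line through the kept samples on either side of it. A rough sketch of that single test in the rules DSL, reusing the second example above and assuming the samples are one minute apart (the variable names and the tolerance are invented; this would have to live inside a rule or script and is not a complete decimation algorithm):

// Can the middle sample be dropped? Only if linear interpolation between its two
// neighbours reproduces it within the tolerance. Timestamps in minutes are assumed.
val double tA = 1.0      // timestamp of the last kept sample
val double vA = 10.0     // its value
val double tM = 4.0      // timestamp of one candidate sample in between
val double vM = 13.0     // its value
val double tB = 8.0      // timestamp of the next sample
val double vB = 17.0     // its value
val double tolerance = 0.2

val double expected = vA + (vB - vA) * (tM - tA) / (tB - tA)   // value on the straight line at tM
val boolean canDrop = Math::abs(vM - expected) <= tolerance    // true here: 13 lies exactly on the line

Applied repeatedly over a buffer of recent samples, and only committing a drop when every skipped sample passes the test, peaks survive by construction, because a peak violates the tolerance and is therefore never removed.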

For your 0.1 thing, I would configure the devices with a different threshold to reduce traffic too, so that they only send on a change of 0.2 or 0.3, for example. That's more efficient.

The script for duplicate values is done and checked.