Hi all,
a feature request for adding Median to the group functions, in addition to SUM, AVG, MAX and MIN.
Would that be a difficult one?
Is this to add it to group functions (so the median of values of items in a group), or to calculate the median over a time series of historical values? Anyway, the best place is to create an enhancement request on GitHub.
> median of values of items
@GeertClaes would you mind explaining the use case for this?
There are many applications. I, too, am looking for this.
For example, if you have multiple sensor values for one field and want to exclude out-of-range readings from broken sensors, you cannot use the average.
@GeertClaes, as Mark said, please request this on GitHub.
Please make it two requests, one for the group aggregation function and one for a persistence medianSince() similar to averageSince().
Energy contracts with hourly tariffs (but as @mstormi mentioned, there are many use cases). The idea for a feature request came from this post.
So it's for persistence, and not a group function?
For persistence, you could already calculate the median yourself by using getAllStatesSince / getAllStatesBetween / getAllStatesUntil.
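For illustration, a minimal sketch of the calculation such a rule would perform (plain Java, not openHAB API code; in a rule you would first collect the numeric states from getAllStatesBetween into a list):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MedianExample {

    // Median of a list of values: sort a copy, then take the middle element,
    // or the mean of the two middle elements for an even count.
    static double median(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(List.of(3.0, 1.0, 4.0, 1.0, 5.0))); // prints 3.0
    }
}
```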
No, it's for a group function. There are 24 items (or more) in my case. As an example, look at this binding.
Ugh, but that would be highly inefficient if done often.
That's why I would like to have a medianSince(). It is much simpler to track the median inside the database than to compute it every time you want it.
A quick google search reveals there's this for InfluxDB:
Use the median() function to return a value representing the 0.5 quantile (50th percentile) or median of input data.
rrd4j has a percentile method, too.
I guess both of these can be used for this purpose; please refer to them in your request.
Not arguing that it wouldn't be nice to have this in persistence, but AFAIK, none of the existing implementations of persistence methods utilise specific functionality from the underlying persistence backend. Furthermore, it isn't even currently supported by the openHAB core framework. Not saying that it can't be implemented, but it wouldn't be as straightforward as simply calculating the median "manually" from getAllStatesBetween.
For example, the average persistence methods simply call getAllStatesBetween and perform the calculations in the code.
So Median would be done in a similar manner.
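If it were added to core, a rough sketch of that pattern might look like the following (the medianOf name and the HistoricItem accessor are assumptions for illustration, not the actual core API):

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for the persistence model; the real HistoricItem
// exposes a State rather than a BigDecimal.
interface HistoricItem {
    BigDecimal getNumericState();
}

class MedianPersistenceSketch {

    // Same pattern as the existing average implementation: fetch all states
    // for the window, then calculate in code rather than in the database.
    static BigDecimal medianOf(List<HistoricItem> states) {
        List<BigDecimal> sorted = states.stream()
                .map(HistoricItem::getNumericState)
                .sorted()
                .collect(Collectors.toList());
        int n = sorted.size();
        if (n == 0) {
            return null; // mirror the "no data" behaviour of the other methods
        }
        return (n % 2 == 1)
                ? sorted.get(n / 2)
                : sorted.get(n / 2 - 1).add(sorted.get(n / 2))
                        .divide(BigDecimal.valueOf(2)); // division by 2 is always exact
    }
}
```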
All I'm saying is you can already implement it yourself in your rule rather than waiting.
The only difference is the code runs in Java vs. in your chosen scripting language. The difference in performance would be negligible.
Not at all!
Your routine would have to run and retrieve all states every time you want to compute the median. Compare that to when it is already pre-calculated and available inside the DB: then it's a simple readout.
The overall difference in end-to-end performance and efficiency is HUGE.
(Yeah, sure, there are ways of pre-calculating and caching that yourself, but why reinvent the wheel for every instance of this problem when a generically applicable solution is already available in the major DBs?)
Take a look at how it's currently done in core for the existing methods (average, sum, etc.)
What do you want to tell me by that?
That it's inefficiently implemented because it doesn't use DB features to compute?
That you cannot replicate it for the median?
BTW, I'm no core programmer.
You assume a database pre-calculates all these metrics and actually stores them. I don't think any database does that. A number of databases have functions to do some calculations (but not all of them do). And the implementation is most likely more optimized in the context of the database. Also, it obviously avoids shipping the full result set back from the database. But the calculation will still happen.
If you search for calculating the median with SQL, you will find algorithms that do two sorted selects, then pick the two middle values and calculate their average. That is hardly an optimization, because the selects will return all values as well. Depending on the database, this will be different.
For OH to do this, all of these calculation methods would have to be exposed as OH APIs to the persistence add-on developers from core, leaving the responsibility to every persistence add-on developer to create the specific optimized code for it. I am not convinced the potential optimization is worth the much more complex code. The value of the persistence service at the moment is that it allows querying any supported persistence service and getting results, irrespective of the underlying database.
I do. It's quite the opposite. Most if not all serious/major DB implementations do (at least as an option), because the simplest and most efficient way to track medians and averages is to update them on every DB write transaction (= item persistence update in OH); the algorithm to calculate it when you have access to the time series of previous values is much simpler than without that access.
That's why I want OH to make use of it by making that DB API function accessible from inside OH rather than implement it in OH Java core or even leave implementation to the user.
Much more efficient in terms of software development efforts and benefits.
At least rrd4j and InfluxDB do; see my previous post. And that was just a quick google search.
I actually don't think it is. The median needs to be calculated over all data points in a selection.
If your selection window stays the same, the previous median is at best a seed value for finding the new one, unlike average calculations, where updating is easy if you keep the previous average and the number of data points.
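To illustrate the asymmetry, a small sketch (plain Java, not openHAB code): a running average can be maintained with constant state, while a median cannot.

```java
class RunningAverage {
    private double avg = 0.0;
    private long count = 0;

    // O(1) update: the new average depends only on the previous average,
    // the count, and the new value.
    void add(double x) {
        count++;
        avg += (x - avg) / count;
    }

    double average() {
        return avg;
    }

    // There is no analogous constant-state update for the median: even
    // streaming-median schemes (e.g. two heaps) must retain all values seen,
    // and changing the query window invalidates the maintained state.
}
```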
And that is if you keep the selection constant. But you may not. What if you use another time horizon in your new query?
And you are right, InfluxDB and rrd4j have functions to do the calculations. To my knowledge, they do not store the results for every possible set of query parameters. As I wrote, they can avoid returning the whole dataset as they do the calculation themselves, and can optimize that calculation, but they do not store results (except for some caching). Please prove otherwise if you insist this is not correct.
If that DB function exists (and that is not true for all databases), it is specific to the database, so it needs to be implemented for each database independently. Again, that is a major effort, and I am not sure it is worth it, especially as you would still require the general mechanism as a fallback, since not all databases have a function for it.
It would be easy/fast to find the median if the database has an index for the data/value field in addition to the primary key (timestamp).
A quick google shows that MariaDB has a median function too.
https://wiki.postgresql.org/wiki/Aggregate_Median
Indexing on the value field might also be handy for things like max/min, but again, this isn't currently supported in the openHAB persistence framework due to the generic nature of the design.
Since rrd4j summarises its data, I'm not really sure how that works, and I haven't looked it up.
I agree with @mstormi that it "would be nice" to use the more optimised version from the underlying persistence/database service should it be available, and fall back to a generic implementation if not. But I know this will be a lot of work!
Perhaps we can start by extending the FilterCriteria to include ordering by the value. That would greatly enhance our ability to optimise min/max/median and even add percentile.
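To make the idea concrete, a hypothetical sketch of such an extension (the OrderBy enum and setOrderBy method are invented for illustration; today's FilterCriteria orders results by time only):

```java
class FilterCriteriaSketch {

    enum Ordering { ASCENDING, DESCENDING }

    // Hypothetical: today only time ordering exists; VALUE is the proposal.
    enum OrderBy { TIME, VALUE }

    private Ordering ordering = Ordering.DESCENDING;
    private OrderBy orderBy = OrderBy.TIME;

    FilterCriteriaSketch setOrdering(Ordering ordering) {
        this.ordering = ordering;
        return this;
    }

    // With value ordering, min/max are the first/last row of the result and
    // the median is the middle row, letting the backend use a value index.
    FilterCriteriaSketch setOrderBy(OrderBy orderBy) {
        this.orderBy = orderBy;
        return this;
    }
}
```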
Yes, median functions exist in many databases, and they can optimize under the condition that the table is indexed on the value. That is not always the case, but it could be done, depending on the database. A database will still find the median with such a function even if there is no index (and some databases will dynamically create the index if this query is used often).
Having a FilterCriteria that returns a list sorted on value (and not on date, as it is now) can definitely help in getting better performance for the median function, again depending on the configuration and support of the specific database.
If I look at the overall persistence actions, the following would most likely not benefit from using a database function. For these functions, the added value is limited (they only retrieve one or two values before calculating, and the natural ordering is time):
The following would benefit provided the database has indexes on values, either pushed into the database or by working with a list of value-sorted results from the database (which would be the case with an enhanced FilterCriteria):
The following would typically benefit in all cases if they can be pushed to the database:
I only see major value in the median being pushed into the database, as that has the highest calculation complexity and does not scale well. Working on a sorted list returned from the DB would also speed it up, provided the database indexes on that value and can therefore optimize the sort. If it does not, sorting is more expensive than a good median calculation algorithm.
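For reference, a "good median calculation algorithm" usually means selection rather than sorting: quickselect finds the k-th smallest value in O(n) average time, so no full O(n log n) sort is needed. A sketch (plain Java, illustration only, assumes a non-empty array):

```java
import java.util.concurrent.ThreadLocalRandom;

class Quickselect {

    // Returns the k-th smallest element (0-based) of a, reordering a in place.
    static double select(double[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            // A random pivot keeps the average cost linear.
            int p = partition(a, lo, hi, ThreadLocalRandom.current().nextInt(lo, hi + 1));
            if (k == p) return a[k];
            if (k < p) hi = p - 1; else lo = p + 1;
        }
        return a[lo];
    }

    // Lomuto partition: place the pivot in its final sorted position.
    static int partition(double[] a, int lo, int hi, int pivotIndex) {
        double pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    static void swap(double[] a, int i, int j) {
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }

    // Median via quickselect; averages the two middle elements for even n.
    static double median(double[] a) {
        int n = a.length;
        return (n % 2 == 1)
                ? select(a, n / 2)
                : (select(a, n / 2 - 1) + select(a, n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(new double[] {5, 1, 4, 2, 3})); // prints 3.0
    }
}
```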