Feature Request: Median

Hi all,

A feature request for adding MEDIAN to the group functions, in addition to SUM, AVG, MAX and MIN.

Would that be a difficult one?

Is this to add it to group functions (so the median of the values of items in a group), or to calculate the median over a time series of historical values? Either way, the best place is to create an enhancement request on GitHub.

median of values of items

@GeertClaes would you mind explaining the use case for this?

There are many applications. I, too, am looking for this.
For example, if you have multiple sensor values for one field and want to exclude out-of-range readings from broken sensors, you cannot use the average in these cases.

@GeertClaes, as Mark said, please request this on GitHub.
Please make it two requests: one for a group aggregation function, and one for a persistence medianSince() similar to averageSince().

@JimT,

Energy contracts with hourly tariffs (but as @mstormi mentioned, there are many use cases). The idea for a feature request came from this post.

So it's for persistence, and not a group function?

For persistence, you could already calculate the median yourself by using getAllStatesSince / Between / Until.
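
For illustration, a minimal sketch of that approach in Java (the same logic works in any rule language). It assumes core's PersistenceExtensions.getAllStatesSince(Item, ZonedDateTime) returning Iterable<HistoricItem>, and numeric item states:

    import java.time.ZonedDateTime;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.openhab.core.items.Item;
    import org.openhab.core.library.types.DecimalType;
    import org.openhab.core.persistence.HistoricItem;
    import org.openhab.core.persistence.extensions.PersistenceExtensions;

    public class MedianSketch {
        // Collect all numeric states since 'since', sort them, and take the middle value.
        public static double medianSince(Item item, ZonedDateTime since) {
            List<Double> values = new ArrayList<>();
            for (HistoricItem h : PersistenceExtensions.getAllStatesSince(item, since)) {
                DecimalType d = h.getState().as(DecimalType.class);
                if (d != null) {
                    values.add(d.doubleValue());
                }
            }
            if (values.isEmpty()) {
                throw new IllegalStateException("no persisted values since " + since);
            }
            Collections.sort(values);
            int n = values.size();
            // Odd count: middle element; even count: mean of the two middle elements.
            return (n % 2 == 1) ? values.get(n / 2)
                    : (values.get(n / 2 - 1) + values.get(n / 2)) / 2.0;
        }
    }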

No, it's for a group function. There are 24 items (or more) in my case. As an example, look at this binding.

Ugh, but that would be highly inefficient if done often.
That's why I would like to have a medianSince(). It is much simpler to track the median inside the database than to compute it every time you want it.

A quick Google search reveals there's this for InfluxDB:

Use the median() function to return a value representing the 0.5 quantile (50th percentile) or median of input data.

rrd4j has a percentile method, too.

I guess both of these can be used for this computation; please refer to them in your request.

Not arguing that it wouldn't be nice to have this in persistence, but AFAIK, none of the existing implementations of persistence methods utilise specific functionality from the underlying persistence backend. Furthermore, it isn't even currently supported by the openHAB core framework. Not saying that it can't be implemented, but it wouldn't be as straightforward as simply calculating the median "manually" from getAllStatesBetween.

For example, the average persistence methods simply call getAllStatesBetween and perform the calculations in the code.

So Median would be done in a similar manner.

All I'm saying is that you can already implement it yourself in your rule rather than waiting.

The only difference is that the code runs in Java rather than in your chosen scripting language. The difference in performance would be negligible.

Not at all!
Your routine would have to run and retrieve all states every time you want to compute the median. Compared to a value that is already pre-calculated and available inside the DB, it's a simple readout.
The overall difference in end-to-end performance and efficiency is HUGE.

(Yeah, sure, there are ways of pre-calculating and caching that yourself, but why reinvent the wheel for every instance of this problem when a generically applicable solution is already available in the major DBs?)

Take a look at how it's currently done in core for the existing methods (average, sum, etc.).

What do you want to tell me by that?

That it's inefficiently implemented because it doesn't use DB features to compute?
That you cannot replicate it for the median?

BTW, I'm no core programmer.

You assume a database pre-calculates all these metrics and actually stores them. I don't think any database does that. A number of databases have functions to do some of these calculations (but not all of them do). And the implementation is most likely more optimized in the context of the database. It also obviously avoids the round trip to the database with the resulting data. But the calculation will still happen.
If you search for calculating a median with SQL, you will find algorithms that do two sorted selects and then pick the two middle values and average them. That is hardly an optimization, because the selects still have to process all values. Depending on the database, this will differ.
For OH to do this, all of these calculation methods would have to be exposed from core as OH APIs to the persistence add-on developers, leaving the responsibility to every persistence add-on developer to create the specific optimized code for it. I am not convinced the potential optimization is worth the much more complex code. The value of the persistence service at the moment is that it allows querying any supported persistence service and getting results, irrespective of the underlying database.
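
For reference, the two-sorted-selects idea could look like this JDBC sketch (the per-item table with a numeric "value" column is a hypothetical layout). With LIMIT/OFFSET only the middle row or two cross the wire, but the database still has to sort the whole column to find them, which is the point made above:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class SqlMedianSketch {
        public static double median(Connection conn, String table) throws SQLException {
            // First select: how many rows are there?
            long count;
            try (PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM " + table);
                    ResultSet rs = ps.executeQuery()) {
                rs.next();
                count = rs.getLong(1);
            }
            if (count == 0) {
                throw new SQLException("no rows in " + table);
            }
            // Second select, sorted on value: fetch only the one (odd count)
            // or two (even count) middle rows and average them.
            String sql = "SELECT value FROM " + table + " ORDER BY value LIMIT ? OFFSET ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, count % 2 == 0 ? 2 : 1);
                ps.setLong(2, (count - 1) / 2);
                try (ResultSet rs = ps.executeQuery()) {
                    double sum = 0;
                    int n = 0;
                    while (rs.next()) {
                        sum += rs.getDouble(1);
                        n++;
                    }
                    return sum / n;
                }
            }
        }
    }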

I do. It's merely the opposite. Most if not all serious/major DB implementations do (at least as an option), because the simplest and most efficient way to track medians and averages is to update them on every DB write transaction (= item persistence update in OH); the algorithm to calculate them when you have access to the time series of previous values is much simpler than without that access.

That's why I want OH to make use of it, by making that DB API function accessible from inside OH rather than implementing it in the OH Java core or even leaving the implementation to the user.
That is much more efficient in terms of software development effort and benefits.

At least rrd4j and InfluxDB do; see my previous post. And that was just a quick Google search.

I actually don't think it is. The median needs to be calculated over all data points in a selection.

If your selection window stays the same, the previous median is at best a seed value for finding the new median, unlike average calculations, where it is easy if you keep the previous average and the number of data points.
And that is if you keep the selection constant. But you may not. What if you use another time horizon in your new query?
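
For contrast, a minimal sketch of the incremental update that makes a running average cheap; only the previous average and the count need to be kept, and there is no equivalent shortcut for the median:

    public final class RunningAverage {
        private double avg = 0.0;
        private long n = 0;

        // Incremental update: avgNew = avgOld + (x - avgOld) / (n + 1).
        public void add(double x) {
            n++;
            avg += (x - avg) / n;
        }

        public double get() {
            return avg;
        }
    }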

And you are right, InfluxDB and rrd4j have functions to do the calculations. To my knowledge, they do not store the results for every possible set of query parameters. As I wrote, they can avoid returning the whole dataset, as they do the calculation themselves and can optimize that calculation, but they do not store results (except for some caching). Please prove otherwise if you insist this is not correct.

If such a DB function exists (and that is not true for all databases), it is specific to the database, so it needs to be implemented for each database independently. Again, that is a major effort, and I am not sure it is worth it, especially as you would still need the general mechanism, because not all databases have a function for it.

It would be easy/fast to find the median if the database has an index for the data/value field in addition to the primary key (timestamp).

A quick Google search shows that MariaDB has a MEDIAN function too.

https://wiki.postgresql.org/wiki/Aggregate_Median

Indexing on the value field might also be handy for things like max/min, but again, this isn't currently supported in the openHAB persistence framework due to the generic nature of the design.

Since rrd4j summarises its data, I'm not really sure how that would work, and I haven't looked it up.

I agree with @mstormi that it "would be nice" to use the more optimised version from the underlying persistence/database service should it be available, and fall back to a generic implementation if not. But I know this would be a lot of work!

Perhaps we could start by extending the FilterCriteria to include ordering by value. That would greatly enhance our ability to optimise min/max/median and even add percentile.
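
Purely as a sketch of the idea (the OrderBy enum and setOrderBy method are hypothetical, not part of the current API):

    // Hypothetical extension of org.openhab.core.persistence.FilterCriteria.
    public class FilterCriteria {
        public enum Ordering {
            ASCENDING, DESCENDING
        }

        // Hypothetical addition: what the ordering applies to.
        public enum OrderBy {
            TIME,  // current behaviour: results are sorted by timestamp
            VALUE  // proposed: lets a backend use a value index for min/max/median/percentile
        }

        private Ordering ordering = Ordering.DESCENDING;
        private OrderBy orderBy = OrderBy.TIME;

        public FilterCriteria setOrdering(Ordering ordering) {
            this.ordering = ordering;
            return this;
        }

        // Hypothetical fluent setter for the proposed field.
        public FilterCriteria setOrderBy(OrderBy orderBy) {
            this.orderBy = orderBy;
            return this;
        }

        // ... existing criteria (item name, begin/end date, page size) omitted
    }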

Yes, median functions exist in many databases, and they can optimize under the condition that the table is indexed on the value. That is not always the case, but it could be done, depending on the database. A database will still find the median with such a function even if there is no index (and some databases will dynamically create an index if the query is used often).

Having FilterCriteria return a list sorted on value (and not on date, as it is now) can definitely help in getting better performance for the median function, again depending on the configuration and support of a specific database.

If I look at the overall persistence actions, the following would most likely not benefit from using a database function:

  • average: (we use an average weighted by the time duration between data points, not a plain average, so it would only be of benefit with equal durations between persisted values, which is not guaranteed; see the sketch after this list)
  • variance: relies on average
  • deviation: relies on variance
  • countStateChanges: would require a complex query to filter on changes only and then count
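
As mentioned in the average bullet, a minimal sketch of the time-weighted average (the Sample type is illustrative); a plain SQL AVG() weights every row equally, which only matches this when samples are equally spaced:

    import java.time.Duration;
    import java.time.ZonedDateTime;
    import java.util.List;

    public class TimeWeightedAverageSketch {
        public record Sample(ZonedDateTime time, double value) {}

        // Each value is weighted by how long it was valid, not counted once per row.
        // Samples must be in ascending time order; 'end' closes the last interval.
        public static double average(List<Sample> samples, ZonedDateTime end) {
            double weightedSum = 0;
            double totalSeconds = 0;
            for (int i = 0; i < samples.size(); i++) {
                ZonedDateTime next = (i + 1 < samples.size()) ? samples.get(i + 1).time() : end;
                double seconds = Duration.between(samples.get(i).time(), next).getSeconds();
                weightedSum += samples.get(i).value() * seconds;
                totalSeconds += seconds;
            }
            return weightedSum / totalSeconds;
        }
    }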

For these functions, the value is limited (as they only retrieve one or two values before calculating, and the natural ordering is time):

  • persistedState
  • last/nextUpdate
  • last/nextChange
  • previous/nextState
  • updated
  • delta
  • evolutionRate

The following would benefit, provided the database has an index on values, either by pushing the computation into the database or by working with a list of sorted values from the database (which would be the case with the enhanced FilterCriteria):

  • min
  • max
  • median

The following would typically benefit in all cases if they can be pushed to the database:

  • sum
  • count

I only see major value in the median being pushed into the database, as it has the highest calculation complexity and does not scale well. Working on a sorted list returned from the DB would also speed it up, provided the database indexes that value and therefore optimizes the sort. If it does not, sorting is more expensive than a good median-selection algorithm (see the sketch below).
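
For illustration, one such algorithm is quickselect, which finds the k-th smallest element in O(n) on average, so the median can be computed without a full O(n log n) sort (a minimal, self-contained sketch):

    import java.util.Random;

    public final class QuickSelect {
        private static final Random RND = new Random();

        // Returns the k-th smallest element (0-based); partially reorders 'a'.
        public static double select(double[] a, int k) {
            int lo = 0, hi = a.length - 1;
            while (lo < hi) {
                int p = partition(a, lo, hi);
                if (p == k) {
                    return a[p];
                } else if (p < k) {
                    lo = p + 1;
                } else {
                    hi = p - 1;
                }
            }
            return a[lo];
        }

        // Lomuto partition around a randomly chosen pivot.
        private static int partition(double[] a, int lo, int hi) {
            swap(a, lo + RND.nextInt(hi - lo + 1), hi);
            double pivot = a[hi];
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    swap(a, i++, j);
                }
            }
            swap(a, i, hi);
            return i;
        }

        private static void swap(double[] a, int i, int j) {
            double t = a[i];
            a[i] = a[j];
            a[j] = t;
        }

        // Median: one selection for odd length, two for even length.
        public static double median(double[] a) {
            int n = a.length;
            double m = select(a, n / 2);
            return (n % 2 == 1) ? m : (m + select(a, n / 2 - 1)) / 2.0;
        }
    }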
