Delete all InfluxDB measurements except for the most recent for an openHAB item

Hi there!

This morning I was looking for a way to remove unneccesary data records (measurements) from my InfluxDB time series database.

I have a lot of Items for which I only need the last state stored so that those items get their last known value restored when restarting openHAB. That is, for those items I don’t need more than a single value for each item stored. It turns out that I have thousands of measurements that serves no purpose.

All of my Items belongs to an openHAB group that I’ve set up to make the persistence working, e.g:

Group G_PersistOnChange // Persist on change and system start up
Group G_PersistOnUpdate // Persist on change and system start up. For temp, humidity sensors etc.

First I tried out the restoreOnStartup predefined strategy which was supposed to load and initialize the last persisted state of the Item on openHAB startup. After some trial and error I realized that restoreOnStartup doesn’t store anything in the db itself. If I used it alone, nothing was saved to the db hence nothing was restored on startup. The strategy seems to work only if I combine it with another strategy like everyChange or everyUpdate.

I also learned that I can not have different InfluxDB retension policies for different openHAB items.

I thought it would be nice to be able to define on item level for how long time I want to keep the data stored. I made a small python script that can take care of deleting all measurements but the latest for an openHAB item.

purge_influx_data.py

import json
import sys

import requests
from dateutil.parser import parse # Install this by issuing "pip install python-dateutil"
import argparse
 
# create parser
parser = argparse.ArgumentParser()
 
# add arguments to the parser
parser.add_argument('-itemname', type=str, required=True, help="The openHAB item name for which all measurements except the most recent one shall be deleted")
parser.add_argument('-db', type=str, default='openhab_db', help="The InfluxDB database name")
parser.add_argument('-hostname', type=str, default='localhost', help="The InfluxDB server host name")
parser.add_argument('-port', type=int, default=8086, help="The InfluxDB server port name")
 
# parse the arguments
args = parser.parse_args()

def is_date(string, fuzzy=False):
    """
    Return whether the string can be interpreted as a date.

    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    try:
        parse(string, fuzzy=fuzzy)
        return True

    except ValueError:
        return False

url = 'http://{}:{}/query'.format(args.hostname, args.port)

params = dict(
    db=args.db,
    q='SELECT LAST(value) FROM "{}" ORDER BY time DESC LIMIT 1'.format(args.itemname)
)
response = requests.get(url=url, params=params)
response_dict = json.loads(response.text)

try:
    last_date = response_dict['results'][0]['series'][0]['values'][0][0]
except KeyError:
    print('The InfluxDB data for item: [{}] was not found'.format(args.itemname))
    sys.exit()

if not is_date(last_date):
    print('Last date: [{}] is not recognized as a date'.format(last_date))
    sys.exit()

query= 'DELETE FROM "{}" WHERE TIME < \'{}\''.format(args.itemname, last_date)
print('Running query: [{}]'.format(query))
params = dict(
    db=args.db,
    q=query
)
response = requests.get(url=url, params=params)
response_dict = json.loads(response.text)

db_result_code = response_dict['results'][0]['statement_id']

if db_result_code == 0:
    print('Query ran successfully. All InfluxDB measurements except the most recent are now deleted for the openHAB item [{}]'.format(args.itemname))
else:
    print('Query returned result code [{}]'.format(db_result_code))

I can now assign a group to all Items for which I want all InfluxDB measurements except the most recent one to be purged. After doing that I just make a simple rule that runs at regular intervals.
I use a jython rule, you might do it in any flavour

import os
from logging import DEBUG, ERROR, INFO, WARNING

from core.rules import rule
from core.triggers import when

@rule("Purge Influx Data", description="Deletes InfluxDB measurements except the most recent for openHAB items belonging to the group gPersist_Last_Value.", tags=[])
@when("Time cron 3 11 10 ? * MON *") # Run at 10:11:03 every Monday
#@when("System started")
def purge_influx_data(event):
    purge_influx_data.log.setLevel(DEBUG)
    '''
    Deletes InfluxDB measurements except the most recent for openHAB items belonging to the group gPersist_Last_Value.
    '''
    purge_influx_data.log.debug('Checking the group gPersist_Last_Value for items to delete measurements except for the most recent one')
    for item in ir.getItem("gPersist_Last_Value").members:
        purge_influx_data.log.debug('Deleting unneccesary measurements for item [{}]'.format(item.name))
        cmd = '/usr/bin/python /openhab/conf/scripts/purge_influx_data.py -db {} -itemname {} -hostname {} -port {} 1>/dev/null 2>&1'.format('openhab_db', item.name, 'MYHOST', 8086)
        #purge_influx_data.log.debug('Running script: {}'.format(cmd))
        os.system(cmd)

It does what I expect from it. It works here but if you decide to try it out you do it at your own risk and you’ll need to take a backup of your database first of course.

I’m thinking that it might be a nice idea to alter the script so that I choose to preserve data for an arbitrary number of days… thinking about it.

Please feel free to comment on this.

Cheers! :grinning:

Look here, maybe this is what you need?

That’s true, but in many cases you can use the continuous query to move from one retention policy to another after it is collected. For example, I gather temperature data as it changes, and it goes into the 1 month retention policy. Then I use a continuous query to get the min/max daily and store those forever.

Also, you don’t have to use the persistence to store the data to InfluxDB. It is possible to POST directly from a rule. In that case you have control over the retention policy.

There’s some work on a new InfluxDB persistence binding here that mentions retention policy specification.

Thanks! That’s exactly what I need for most of my openHAB items. I’ve added mapdb as an additional persistence service and I’ll update my items files to use mapdb persistence where appropriate.
:grinning:

Thanks. I’ll have a deeper look into that too as it might be useful for some items.