rrd4j reliability, data disappearance

I’ve been using rrd4j to log data for various items in my solar power system, and everything was working smoothly until last week. I was able to view all the monthly data on my graphs, but recently, the data seems to have mysteriously disappeared. As a result, my charts are empty, and I can’t figure out what went wrong.

I’m wondering if anyone else has encountered a similar issue with rrd4j. It’s frustrating because I’ve been relying on these graphs to monitor my system’s performance over time, and suddenly, they’re blank.

I’ve attached some screenshots to illustrate what I’m experiencing. Here’s an example: [insert screenshot link]

Additionally, I’m contemplating whether it’s a good idea to switch to InfluxDB for a production environment. Has anyone had a positive experience with InfluxDB? I’m looking for a reliable solution that won’t result in data disappearing unexpectedly.

If you have any insights or suggestions, please share them! I’m open to trying different approaches to get my data back and ensure it doesn’t vanish again in the future.

Thanks in advance for your help.



Given the way rrd4j works, there has to be data there. However, it's likely zero values.

rrd4j requires a value to be saved every minute whether there is a change or not. If there is a gap, the database is basically corrupted and won't return any values at all. Over time, to keep the database from growing, it decimates the values by calculating the average of ten entries and replacing those ten with the average. It can't do that if there are gaps in the data, so the fact that you are getting anything at all points to there being data there.
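
As a toy illustration of that decimation step, here is ten made-up 1-minute samples being collapsed into their average (the numbers are invented, not from any real archive):

```shell
# Replace ten 1-minute samples with their single average value -- a toy
# model of the rrd4j decimation described above. Sample values are invented.
printf '%s\n' 6 7 8 9 10 11 12 13 14 15 | awk '{s+=$1} END {print s/NR}'
```

With a gap (fewer than ten samples in the window), there is no honest average to compute, which is the point made above.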

You can confirm by using the API Explorer to query for what is in the database for the dates where the graph isn’t working.
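
A sketch of the same query from the command line, assuming a default host/port and an Item name taken from later in this thread; the dates are placeholders:

```shell
# Build the persistence REST query URL that the API Explorer would generate.
# Host, port, Item name, and dates are assumptions -- adjust to your system.
OH_HOST="192.168.0.200:8080"
ITEM="DeyeModbusProdDaily"
START="2023-11-01T00:00:00.000Z"
END="2023-11-30T23:59:59.000Z"

enc() { printf '%s' "$1" | sed 's/:/%3A/g'; }   # percent-encode the colons

URL="http://${OH_HOST}/rest/persistence/items/${ITEM}?starttime=$(enc "$START")&endtime=$(enc "$END")"
echo "$URL"
# curl -s "$URL" | jq '.datapoints'   # uncomment to run against a live OH
```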

If there is in fact no data, I've no ideas; it shouldn't work at all in that case. If there is zero data at that point, what were your Items doing at that time? Maybe the problem is the source instead of the database?

Mine disappeared as well when I upgraded to OH4.
I had to specifically add the items or group to the persistence where I didn’t have to do that before.
I also changed the persistence from file to the UI.
All is working again now.

Can you give me an example of how to query the Item for an exact period? And another thing: every month I only see the last day filled in... it's been like this from the start:


This is my code:

config:
  chartType: month
  label: Fotovoltaico mensile
  sidebar: false
slots:
  dataZoom:
    - component: oh-chart-datazoom
      config:
        orient: horizontal
        show: true
        type: slider
  grid:
    - component: oh-chart-grid
      config:
        show: true
  legend:
    - component: oh-chart-legend
      config:
        left: left
        orient: vertical
        show: true
  series:
    - component: oh-aggregate-series
      config:
        aggregationFunction: last
        dimension1: date
        gridIndex: 0
        item: DeyeModbusProdDaily
        name: Produzione
        offsetAmount: 1
        offsetUnit: day
        type: bar
        xAxisIndex: 0
        yAxisIndex: 0
    - component: oh-aggregate-series
      config:
        aggregationFunction: last
        dimension1: date
        gridIndex: 0
        item: DeyeModbusLoadDaily
        name: Consumo
        offsetAmount: 1
        offsetUnit: day
        type: bar
        xAxisIndex: 0
        yAxisIndex: 0
    - component: oh-aggregate-series
      config:
        aggregationFunction: last
        dimension1: date
        gridIndex: 0
        item: DeyeModbusBuyDaily
        name: Comprata
        offsetAmount: 1
        offsetUnit: day
        type: bar
        xAxisIndex: 0
        yAxisIndex: 0
    - component: oh-aggregate-series
      config:
        aggregationFunction: last
        dimension1: date
        gridIndex: 0
        item: DeyeModbusSellDaily
        name: Venduta
        offsetAmount: 1
        offsetUnit: day
        type: bar
        xAxisIndex: 0
        yAxisIndex: 0
  tooltip:
    - component: oh-chart-tooltip
      config:
        show: true
  xAxis:
    - component: oh-category-axis
      config:
        categoryType: month
        gridIndex: 0
        monthFormat: default
        weekdayFormat: default
  yAxis:
    - component: oh-value-axis
      config:
        gridIndex: 0
        name: kWh

Go to Developer Tools → API Explorer → Scroll down to “persistence”

@rlkoshak ok I did this query:
http://192.168.0.200:8080/rest/persistence/items/DeyeModbusProdDaily?starttime=2023-11-06T00%3A00%3A00.000Z

So it starts from 6 November, but the first set of results starts from 15 November (just like in the previous graph). openHAB is running on a Raspberry Pi and has never been restarted.

{
  "name": "DeyeModbusProdDaily",
  "datapoints": "1866",
  "data": [
    {
      "time": 1700041500000,
      "state": "6.3"
    },
    {
      "time": 1700042400000,
      "state": "6.8"
    },
    {
      "time": 1700043300000,
      "state": "8.2"
    },
    {
      "time": 1700044200000,
      "state": "9.1"
    },

1700041500000 = Wed Nov 15 2023 09:45:00
But how can data just disappear???
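
For reference, a shell one-liner to convert the REST API's millisecond epoch timestamps to a readable date (GNU `date` assumed):

```shell
# Convert a millisecond epoch timestamp, as returned by the persistence
# REST API, to a human-readable UTC date.
ts_ms=1700041500000
date -u -d "@$((ts_ms / 1000))" '+%a %b %d %Y %H:%M:%S'
```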

Go back far enough to get results on both sides of the data gap.

Some possibilities:

  1. running on an SD card, or sometimes even an SSD, and the machine lost power (potential for file system corruption)
  2. running openHABian and the machine lost power (zram doesn’t write out to disk except on a normal shutdown, on power loss you’ll lose all the data that was stored since the last start of the machine)
  3. SD card is wearing out

Note, all three of these would cause loss of data for InfluxDB too. I think openHABian puts InfluxDB in zram.

This is really strange: the Raspberry Pi is on a UPS inside a CED (server room), and the SD card is 256 GB with 230 GB free. How can I trust rrd4j in a production environment? Is there a way to back it up? The data for November, for example, was definitely there, so why did it just disappear? Wouldn't InfluxDB on another server be safer? I'm asking because I need to install a solution in a production environment, and keeping the data is very important.

How did you do it?

Don’t use an SD card for storage, take lots of backups, and continue to use the UPS I guess. Note that any restore is going to lose any data saved after that backup was taken.

All the files are in $OH_USERDATA/persistence/rrd4j. There’s one file per Item.

Because this is part of userdata, it gets backed up when backing up OH itself.

rrd4j writes to the file whenever it receives a value to save. If not immediately, then within seconds.

So either you’ve restored a backup that didn’t include that data or something happened to cause that data to not be saved (1, 2 or 3 above are three potential ways this can happen).

At a top level there are only two ways to generate a gap in the data like that:

  1. OH stopped writing to the file
  2. Changes made to the file between those dates were lost

The only way 1 can occur is if an Item isn’t included in persistence, but in that case rrd4j removes the file (or at least it used to) so you would have lost all the data. But since you are sure there was data there before, 1 isn’t possible. OH did write out the data.

As shown above and below, there are lots of ways the data could be lost outside of OH, though. But the key is that they are outside of OH, and because of that, whatever caused the loss in rrd4j would be a risk for any similarly configured database.

Depends on how it's set up. It could be safer, but that's not guaranteed. rrd4j could be made safer where it is now as well.

Properties like “safe” and “production ready” are system wide properties. Neither rrd4j nor InfluxDB are production ready by themselves. What makes them production ready is how the entire system is set up. How do you do backups and how often and how do you restore? How do you handle power loss, unexpected system shutdowns, file system problems, etc?

You need to do the systems engineering to ensure that data isn’t lost which means understanding what happens in various failure situations and mitigating any loss of data. You need to know where all the data is stored. You need to know how it’s stored. You need to know how to back it up and restore it independently.

If InfluxDB is set up on an SD card with zram, it’s exactly as safe/unsafe as your current rrd4j setup (presumably). You didn’t lose data because rrd4j isn’t reliable. You lost data because something happened to cause the data between those dates to get lost outside of OH and rrd4j.

If you cannot afford the loss of any data, then you’re going to need a cluster of InfluxDB instances (for example) with redundant storage (e.g. RAID 5) and probably off site backups and stuff like that. Avoiding the loss of data isn’t easy but it’s also very well covered.

The easiest approach is to:

  1. take a backup
  2. Go to the rrd4j Persistence Configuration:
    a. On OH 4.0: Settings → Other Add-ons → rrd4j → blue gear icon → Persistence Configuration
    b. On OH 4.1: Settings → rrd4j persistence under Add-on Settings → Persistence Configuration
  3. Click the code tab and copy the YAML you find there
  4. remove the rrd4j.persist file (check to see if the files are removed from userdata/persistence/rrd4j or not, I’m not sure if they are)
  5. Back out and return to step 2.
  6. Click the code tab and paste the YAML you copied from before

That will configure everything exactly like it was in the .persist file.
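
For reference, a minimal sketch of what the file-based `rrd4j.persist` format looks like (the format step 4 removes); the group name `gRrd4j` is borrowed from later in this thread and is an assumption:

```text
// Hypothetical minimal rrd4j.persist -- strategy names and group are examples.
Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    gRrd4j* : strategy = everyChange, everyMinute, restoreOnStartup
}
```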

Yes, exactly. My idea came about because I'd like to set up an OH instance on Ubuntu to monitor a very big PV plant (20 inverters). I need to keep all the data safe and connect it to a production DB like Postgres (on another protected server with another IP, by the way). I just want to be sure to keep all the data safe, because I noticed this data loss in my home environment.

This is just a doubt, because obviously with rrd4j this is done automatically by openHAB; with Postgres or another DB, does it need to be done manually?

I'm asking all these questions only because I need a production environment for a big PV plant, like I said before, and I don't want to lose its data. I will set up a fresh OH on Ubuntu and will have to collect data from each inverter.

I put the Items I want to persist into groups, e.g. gRrd4j:

configurations:
  - items:
      - LocalSun_Daylight_Duration
    strategies:
      - everyChange
      - everyUpdate
    filters: []
  - items:
      - gBatterycheck*
    strategies:
      - everyChange
      - everyDay
    filters: []
  - items:
      - gblindsposition*
    strategies:
      - everyChange
      - everyUpdate
      - restoreOnStartup
      - every5Minute
      - everyHour
    filters: []
  - items:
      - gFronius*
      - gRrd4j*
      - gTemp*
    strategies:
      - everyChange
      - every5Minute
      - everyDay
    filters: []
cronStrategies:
  - name: everyMinute
    cronExpression: 0 * * * * ?
  - name: every5Minute
    cronExpression: 0 */5 * ? * *
  - name: everyHour
    cronExpression: 0 0 * ? * *
  - name: everyDay
    cronExpression: 0 0 0 * * ? *
defaultStrategies:
  - everyChange
  - everyUpdate
thresholdFilters: []
timeFilters: []
equalsFilters: []
includeFilters: []

OH can talk to PostgreSQL too through the JDBC add-on I believe.

It's not done by openHAB. It's how rrd4j works; it's the whole point of using rrd4j. Its databases never grow in size. They are fixed, so you'll never run out of disk space. But to do this, it has to decimate the data as it ages. The older the data, the fewer values are saved. Year-old data might have only one entry in the database for a whole day, while yesterday will have one entry every minute.

You will need to design your data collection to be reliable and resistant to data loss. It takes a lot of work to create a fault tolerant system. Figure out what’s the most amount of time you can lose data for and design your data system to support that. That’s going to drive how much you need to do.

For example, if you can afford to lose up to one day of data, then that can be relatively easy to meet. Just have an automated backup that runs twice a day (so that one failed backup doesn’t cause you to blow your timelines) and test the backup frequently.
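
Such a twice-a-day schedule could look like this in a crontab; the script path is a placeholder:

```text
# Hypothetical crontab entries: two backup runs 12 hours apart, so a
# single failed run still keeps you within a one-day loss budget.
# m h dom mon dow  command
0 3  * * *  /usr/local/bin/openhab-backup.sh
0 15 * * *  /usr/local/bin/openhab-backup.sh
```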

On the other end of the spectrum you have no loss of data is permitted in which case you need redundant clusters of databases with hot backup servers and automated fail over and stuff like that.

Your system probably needs something in between at the lower end of the scale, but it’s driven by the requirements.

You're totally right.
What I need to get is:

  • the actual production, every 5 minutes
  • the day total, which relates only to that day, so the value can just be the day's maximum, overwritten in place
  • the total production, which is a single value that is always overwritten 1-to-1

When you say "to design your data collection", do you mean removing the default behaviour of collecting data on every update?

No, I mean: if your machine melts because the Avengers were fighting Thanos next door and a stray laser hit your data collection machine (i.e. the openHAB machine and/or database machine), what's the most data, measured as an amount of time, that you can afford to lose? Half a day? An hour? A week?

What’s the largest gap in your data that is allowed? Design your data reliability approach to that. If it’s half a day or more, you can probably manage it through automated backups. If it’s less than that, you’re going to have to build in some redundancy and backup services and such.

Note that all of this is outside openHAB and the database. This is a system level design. Not just some settings in openHAB or rrd4j or InfluxDB.

:rofl:

Yes, this is true, but it can be done on rrd4j too. And how do I know that something happened (like in my environment) when I only realized it after some days, and in the meantime I'm backing up already-lost data? And by the way, for rrd4j we need to stop OH to back it up.
I'm very scared of rrd4j because it made me lose months of data without understanding why. The Raspberry Pi has been online for 4 months and is under a UPS, and the SD card is mostly free, so I can't explain this loss. None of the events you listed occurred.

I replaced rrd4j with InfluxDB as I needed measurements timed in the future. InfluxDB is stable and does what it is expected to do in my system.

An unexamined and untested backup is worth little more than no backup at all. You have to check it. You could set up an automated check.

I do things pretty simply. I have a cron job that runs, packages up the backup and moves it to my NAS and then sends me an email with the contents of the backup. If the contents look fishy I know it’s a bad backup. Then I periodically manually check the backups to ensure that they look reasonable. I used to restore the backups once a month or so but haven’t done that in a while.

An example of one of my simple backup scripts:

#!/bin/bash
echo "Backing up openhab"
file=/srv/backups/openhab/openhab-$(date +%Y-%m-%d_%H%M).tgz
cd /srv/openhab
tar cfz "$file" .

fsize=$(ls -lh "$file" | cut -d ' ' -f 5)
toc=$(tar tfz "$file")
body=${file}'\nBackup size: '${fsize}'\n\nContents:\n'${toc}

sendmail=/usr/sbin/sendmail
email=rlkoshak@gmail.com
to='To: '$email'\n'
from='From: '$email'\n'
subject='Subject: openhab Backed Up\n\n'
msg=${to}${from}${subject}${body}

echo -e "$msg" | $sendmail $email

I've not found that to be the case, but I've probably been lucky. I don't really care if I lose some or all of the data in rrd4j, so I've never looked closely into it. Though looking back through my history of using rrd4j (since OH 3.0), I've not lost any data.

You could take the backup from the OH REST API instead of just grabbing the files. That wouldn’t require OH to be stopped.
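
A minimal sketch of that idea, assuming the host from earlier in the thread and a hand-picked Item list; the `serviceId` query parameter selects the rrd4j service:

```shell
# Dump each Item's persisted rrd4j data to JSON over REST while OH keeps
# running. Host and Item names are assumptions taken from this thread.
OH="http://192.168.0.200:8080"
for item in DeyeModbusProdDaily DeyeModbusLoadDaily; do
  url="$OH/rest/persistence/items/$item?serviceId=rrd4j"
  echo "$url"
  # curl -s "$url" > "backup-$item-$(date +%F).json"   # uncomment to fetch
done
```

Note this exports the data as JSON rather than the `.rrd` files themselves, so a restore would mean re-importing values rather than dropping files back in place.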

Like I said, it’s all driven by what your actual requirement for data backup and safety is.

All I can say is that if rrd4j were randomly flaky, it wouldn't be used all over the place in the real world. Whatever caused the loss of data is almost certainly external to rrd4j itself. It's theoretically possible that OH caused the loss, but if that's the case, switching to InfluxDB or any other database isn't going to fix the problem (and one would expect to see lots of complaints on the forum if it were).

My main point though is that just blindly switching to InfluxDB or some other DB is not sufficient to prevent data loss. If this data is important, you have to do more.

I totally agree. By the way, I want to do some tests with Postgres, creating procedures to prevent the growth of the DB (for example, for the daily total production I need just one value per day, and I can delete the rest with a stored procedure). With Postgres, I'm quite sure I can control my data better, so I will do some tests.
Thanks for all your suggestions.