Could not create rrd4j database file error in OH3

Everything you can do to configure rrd4j is covered in the docs. But understanding what each property means and how they interact requires in-depth knowledge of how rrd4j works. Also, if you were to shrink the size of the files, that would mean it has to compress the data more frequently. When rrd4j compresses data, it takes several readings (e.g. ten) and replaces them with a single value that's the average. It's a lossy operation, so make sure you don't lose the precision you need for recent data in the quest to shrink the files. You are almost certainly better off limiting which Items are saved in the first place than trying to shrink the files.
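
For context on what those properties look like, here is a rough sketch of a custom datasource definition in $OH_CONF/services/rrd4j.cfg, following the format described in the rrd4j persistence docs. The datasource name, values, and Item name below are made-up examples, not recommendations:

# sketch only; syntax per the rrd4j add-on docs
# def      = <type>,<heartbeat>,<min>,<max>,<step in seconds>
# archives = <consolidation fn>,<xff>,<steps merged per entry>,<entries kept>
power.def=GAUGE,90,0,U,60
power.archives=AVERAGE,0.5,1,1440:AVERAGE,0.5,10,1008
power.items=Current_Rooms

The second archive in that sketch is where the lossy averaging happens: every 10 one-minute readings get consolidated into a single averaged entry.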


@rlkoshak thank you for the guidance. I've read the docs more carefully, and yes, I see managing this isn't trivial. Okay, I'll limit the Items.
Just curious, why did you decide to turn on such a space-consuming feature by default? What's the purpose of archiving everything? I guess a basic openHAB setup doesn't include any graphing, and if I go as far as enabling Grafana, I can also go as far as configuring persistence explicitly.
Moreover, I get the impression that you've changed the default behavior to "store everything" for all persistence types. I think so because I tried to disable InfluxDB (for an unrelated purpose) by moving its .persist file away; then, upon stopping InfluxDB, I saw data submission errors in the OH log.
Why? I think it's more user-friendly to enable such a heavyweight feature only if explicitly requested by the user.

Oh but it does. MainUI includes graphing out of the box. You don’t have to do anything at all on a basic OH instance. It will show up as a trend line on the background of all Number Items by default.

Anywhere you see “Analyze” or “Analyze All” in MainUI it will generate a chart.

You can define your own charts and even pages of charts as well.

Because there is no way to set a persistence strategy through MainUI, persistence add-ons now implement a default strategy, which is almost always going to be all Items, every change, restoreOnStartup. If you want to disable a persistence add-on, uninstall it. Each persistence add-on should document its default behavior when there is no .persist file to override it.
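
If you do want to limit what gets persisted, a minimal $OH_CONF/persistence/rrd4j.persist along these lines overrides the default and restricts it to a few Items (the Item names are placeholders). Note that rrd4j needs an everyMinute strategy for its fixed-step archives to chart properly:

Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    // placeholder Item names; list only what you actually chart
    Temperature_Outside, Current_Rooms : strategy = everyChange, everyMinute, restoreOnStartup
}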

Because the developers and maintainers decided it was more user-friendly to provide something that works out of the box instead of a situation where users are forced to go edit a text file to get basic functionality after installing an add-on. The presumption is that if you installed the add-on, you want to use the add-on.

RRD4J is installed by default so that the charts in MainUI just work out of the box with no configuration required. And since rrd4j files don't grow over time, it was the best choice for this.

Well, sorry, I didn't know. That explains things.
Frankly speaking, I haven't configured MainUI. I migrated my install from v2.5, so I'm keeping things as they were: Grafana + InfluxDB + BasicUI + the native Android app on my wall panel. You may call me lame if you want :hugs:
I just don't have much time to fiddle with configuration. And anyway, my wall panel is an old Android 4.1 device; it can't handle fancy HTML5 web graphics.

BTW, I've got a suggestion. The docs say default persistence includes all the enumerable data types, including switches. Maybe restrict the default to "number" only? Does it make much sense to graph booleans? WDYT?

Yes it does. It's particularly useful, for example, to correlate a device's state with some reading (e.g. the outside temperature with when the heater turns ON).

NOTE: I don’t have my thermostat connected to OH so that’s actually the outside temp with the state of one of my light switches on the same chart, but it shows the sorts of things that can be correlated.

These types of Items can also be included in bar charts and other types of data representations.

Back to the topic of zram usage; my zramctl shows:

# zramctl --output-all
NAME       DISKSIZE   DATA COMPR ALGORITHM STREAMS ZERO-PAGES TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram3     450M 390.3M 44.9M zstd            4        794  200M      200M     200M       0B /opt/zram/zram3
/dev/zram2     350M   364K  5.2K zstd            4          0  184K      150M     184K       0B /opt/zram/zram2
/dev/zram1     350M  65.3M  1.5M zstd            4         58 32.6M      150M    32.6M       0B /opt/zram/zram1
/dev/zram0     450M     4K   86B lzo-rle         4          0    4K      200M       4K       0B [SWAP]

Size of /var/lib/openhab/persistence/rrd4j is 36M.

Funnily enough, rrd4j is stored on zram1, while zram3 is used for logs, and the size of the logs is 390M. The largest files (128 MB!) are log/daemon.log and log/syslog, with the majority of their contents showing that very error:

Feb  2 03:47:25 openhabian karaf[813]: Exception in thread "OH-eventexecutor-149284" java.lang.NullPointerException
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)

Looks like the error is cascading: a failure in one place (persistence) causes tons of these messages to be spat into the logs, which in turn aggravates the failure until complete collapse. @rlkoshak I wonder, can something be done about it? One way would be to catch all DB submission errors in openHAB and report only the first error until things resolve. I understand that it's easier said than done, but IMHO this is a major issue.
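
To make that suggestion a bit more concrete, here is a minimal sketch in plain Java (not actual openHAB code) of the "report the first failure, stay quiet until it recovers" idea:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: remember which error keys have already been reported and log again
// only after a successful operation has cleared the flag.
public class DeduplicatingErrorReporter {
    private final Map<String, Boolean> reported = new ConcurrentHashMap<>();

    // Returns true only the first time an error with this key is seen.
    public boolean shouldLog(String errorKey) {
        return reported.putIfAbsent(errorKey, Boolean.TRUE) == null;
    }

    // Call after a successful write so the next failure is logged again.
    public void clear(String errorKey) {
        reported.remove(errorKey);
    }
}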

If you’re interested, this is how it began:

Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-1" Exception in thread "OH-eventexecutor-2" java.lang.NullPointerException
Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-3" #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlock
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.lang.Thread.run(Thread.java:829)
Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-4" java.lang.NullPointerException
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.lang.Thread.run(Thread.java:829)

So it happened at 03:45, and the last such line is stamped 03:47. In two minutes the log filled up with 1,044,989 lines (I used "grep karaf syslog | wc -l" to count); it looks like OH just keeps retrying and failing with no delay.

kern.log has this:

Feb  2 03:47:23 openhabian kernel: [365506.992045] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 47231)
Feb  2 03:47:23 openhabian kernel: [365506.997667] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 47507)
Feb  2 03:47:23 openhabian kernel: [365506.997744] Buffer I/O error on device zram3, logical block 47507
Feb  2 03:47:23 openhabian kernel: [365506.997760] Buffer I/O error on device zram3, logical block 47508
Feb  2 03:47:23 openhabian kernel: [365506.997771] Buffer I/O error on device zram3, logical block 47509
Feb  2 03:47:23 openhabian kernel: [365506.997781] Buffer I/O error on device zram3, logical block 47510
Feb  2 03:47:23 openhabian kernel: [365506.997791] Buffer I/O error on device zram3, logical block 47511
Feb  2 03:47:23 openhabian kernel: [365506.997801] Buffer I/O error on device zram3, logical block 47512
Feb  2 03:47:23 openhabian kernel: [365506.997811] Buffer I/O error on device zram3, logical block 47513
Feb  2 03:47:23 openhabian kernel: [365506.997821] Buffer I/O error on device zram3, logical block 47514
Feb  2 03:47:23 openhabian kernel: [365506.997831] Buffer I/O error on device zram3, logical block 47515
Feb  2 03:47:23 openhabian kernel: [365506.997840] Buffer I/O error on device zram3, logical block 47516
Feb  2 03:47:23 openhabian kernel: [365507.010431] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 33278)
Feb  2 03:47:23 openhabian kernel: [365507.025818] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24322)
Feb  2 03:47:23 openhabian kernel: [365507.040076] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24602)
Feb  2 03:47:23 openhabian kernel: [365507.054284] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24882)
Feb  2 03:47:23 openhabian kernel: [365507.068931] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25154)
Feb  2 03:47:23 openhabian kernel: [365507.082773] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25438)
Feb  2 03:47:23 openhabian kernel: [365507.097182] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25703)
Feb  2 03:47:23 openhabian kernel: [365507.111421] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25979)

Note the timestamp of 03:47, i.e. this was a consequence of filling up the log partition.

And one more thing. I see that by default events.log records all events at INFO level, i.e. every change of every measured value. Does it make sense to have this enabled by default? The log is huge, and it's all data like this:

2022-02-02 01:02:32.994 [INFO ] [openhab.event.ItemStateChangedEvent ] - Item 'Current_Rooms' changed from 0.697876071936 to 0.696626070016
2022-02-02 01:02:33.642 [INFO ] [openhab.event.ItemStateChangedEvent ] - Item 'Current_Rooms' changed from 0.696626070016 to 0.699751074816

This comes from my multi-channel power meter; it is quite precise. Plug in a phone charger and you get tons of such events per minute.

I admit it can be useful for debugging the setup, but I doubt it's good to have it always enabled.

daemon.log and syslog are operating system log files, and these specific errors are not specific to the DB. They have to do with the file system running out of space. Anything that tries to write to that file system will generate errors, and the OS will report those to its logs.

But OH doesn't know specifically that it's failing because the file system ran out of space. It gets the same error if the file's permissions are such that OH isn't allowed to write to it, the file system is mounted read-only for some reason, something else holds a write lock on that file for too long, etc. So all OH can do is keep trying, and every time it tries and fails it generates errors.

There are some arguments about that, but in my experience, when trying to debug openHAB behavior problems, especially problems with rules, this log is critical. OH is an event-driven system; events.log shows you the events.

You can easily turn that off in $OH_USERDATA/etc/log4j2.xml (note that everything logged to events.log is at the INFO level). Or you can configure the files to be smaller before rotating and keep fewer backups.
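
For example, since the noisy lines come from the openhab.event.ItemStateChangedEvent logger, something along these lines in $OH_USERDATA/etc/log4j2.xml should do it. This is a sketch: the logger entry goes into the existing <Loggers> section, and the two policy elements are the standard log4j2 knobs for rotation size and backup count, to be adjusted in the existing events.log appender definition (the values here are examples):

<!-- silence only Item state changes; other events still go to events.log -->
<Logger name="openhab.event.ItemStateChangedEvent" level="OFF"/>

<!-- in the existing events.log appender: rotation size and number of kept backups -->
<SizeBasedTriggeringPolicy size="8 MB"/>
<DefaultRolloverStrategy max="3"/>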

In practice, the current logging and even persistence configuration is just fine for the vast majority of users. Some users, like you, will encounter problems, and there are lots of ways to adjust your settings to deal with them. That doesn't mean the adjustments you need to make are generally applicable or should be the defaults. As with everything, functionality needs to be balanced with efficiency.

I understand. I mean, a good countermeasure would be to log the same error only once and avoid logging it again while it keeps repeating. Or, let's say, log it 10 times, then give up. Because otherwise it just blows up. Interestingly, the blowup isn't Raspberry-specific: even with a huge hard drive it will eventually blow up, it would just take longer. And you could be on vacation, for example. It's very frustrating to find your system has simply gone offline. Aren't things like openHAB supposed to be autonomous and robust?

By the way, the log explosion on zram turned out even worse: the device locked up in read-only mode, requiring a reboot, so I was unable to use tools like logrotate to clean it up.

No, because those logs will rotate out. By default, logs on these systems have a fixed size and cannot go over it. For example, at 50 k the log will rotate and a new file is started. Older rotated files get compressed and only the most recent 5 are kept. All of the logs mentioned so far have configurable policies for this.

Yes, because logrotate is cron-based and only runs once per day, while openHAB, continuously failing, can fill the drive to 0% free in, say, a couple of hours.
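
For what it's worth, logrotate can also rotate on size via maxsize, but that only takes effect when logrotate actually runs, so it would have to be invoked more often than the default daily cron job (e.g. from an hourly cron entry or a systemd timer). A hedged sketch, with example paths and values that may not match an openHABian install:

# /etc/logrotate.d/size-cap (illustrative only)
/var/log/syslog /var/log/daemon.log {
        maxsize 50M
        rotate 5
        missingok
        notifempty
        compress
        delaycompress
}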

Ultimately this is dancing around the golden calf, and it happens on all sorts of servers and applications. There is no fundamental 'solution' to the problem, as there can always be environmental conditions (e.g. loss of network connectivity) that result in filling the disk with errors, no matter the application, whether you use zram, or how large your disk is. Those parameters only affect how quickly or how often you hit problems, not whether you hit them at all.

So rotate or just clear your system logs (the 'permanent' copies in /opt/zram/log.bind/), possibly increase the size in ztab, then move on with life.
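
On openHABian those sizes live in /etc/ztab. As an approximation only (the exact column layout is documented in the comments of your own ztab, so follow that file rather than this sketch), the log entry looks something like the line below; changes take effect after the zram service is restarted, e.g. via a reboot:

# type  alg   mem_limit  disk_size  target_dir  bind_dir
log     zstd  300M       600M       /var/log    /log.bind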

@mstormi Well, I kinda agree… but I think openHAB's robustness could be improved in this regard. There may be no fundamental solution, but certain things could perhaps be improved.
Okay, I've adjusted openHAB's log file size limits, priorities, etc. Rebooting the system cleared the zram.
Meanwhile I've found another, related issue: Openhabian dhcpcd[353]: received SIGPIPE. This alone generates 2 MB.

If you have a concrete proposal, I'm listening. But the mere fact that you got hit by a rare combination of events is no proof of that statement; it is rather to be expected.
Even more so as it seems you caused at least part of it yourself with your static IP …

Well, my v3 setup has only been running for one month. In fact the total uptime was approximately 7 days, as all this time I was fighting an independent issue (Raspberry pi 4 hard freeze, possibly overheating - #5 by chris4789), due to which my continuous uptime never exceeded 4 days. So it took only about 7 days to hit the problem. Isn't that too fast for a rare problem?

Well, perhaps you can blame me for being stupid and forgetting to lower the log size limits after installing v3; in fact I had done exactly that on my old v2.5 installation. I've accepted that.

Sorry, that was wrong; see the reply. I didn't respond immediately because I went to test this suggestion.

Hello again. The problem is back after one week, but with a different scenario.
I've restricted the openHAB log sizes, so now my zram isn't overflowing:

# zramctl --output-all
NAME       DISKSIZE  DATA COMPR ALGORITHM STREAMS ZERO-PAGES TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram3     450M 70.7M  6.6M zstd            4       3925 27.7M      200M    27.7M       0B /opt/zram/zram3
/dev/zram2     350M  364K  5.1K zstd            4          0  184K      150M     184K       0B /opt/zram/zram2
/dev/zram1     350M   52M  1.6M zstd            4         49 25.9M      150M      26M       0B /opt/zram/zram1
/dev/zram0     450M    4K   86B lzo-rle         4          0    4K      200M       4K       0B [SWAP]

rrd4j's directory was working fine: I could write files, and it didn't go read-only. Yet I again saw the "Could not create rrd4j database file" message, for existing file names. And there was one more suspicious log message this time:

Feb  7 07:17:56 openhabian kernel: [71351.619799] cgroup: fork rejected by pids controller in /system.slice/openhab.service

Any ideas where that could originate from? Perhaps this was the root cause of my previous failure too; it's just that back then the openHAB logs weren't limited in size, so they filled up zram.

Restarting openHAB via systemctl fully regained control. Meanwhile I've googled this message; it has to do with some process limits. Unfortunately I've already restarted, so the "status" output isn't available.
But what can openHAB fork, in principle? I have a Python script run by an event, but that's only triggered 5 days per month (and those have passed). What else?
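
For what it's worth, "fork rejected by pids controller" means the openhab.service cgroup hit its task limit, which systemd enforces via TasksMax, and every Java thread counts against that limit, not just external scripts. A sketch of how to inspect and raise it (the value is an example, not a recommendation):

# check the current limit and usage
systemctl show openhab.service -p TasksMax -p TasksCurrent

# raise it with a drop-in: sudo systemctl edit openhab.service, then add
[Service]
TasksMax=8192

# apply
sudo systemctl daemon-reload
sudo systemctl restart openhab.service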

There also seems to have been a concurrency issue that resulted in the "Could not create rrd4j database file" error. So if you run into it with Items that are persisted/queried multiple times within a short interval, it should be fixed by openhab-addons#13332. :slight_smile: