Could not create rrd4j database file error in OH3

Yes, it does. It’s particularly useful, for example, to correlate a device’s ON/OFF state with some reading, e.g. the outside temperature with when the heater turns ON.

NOTE: I don’t have my thermostat connected to OH, so that’s actually the outside temperature charted with the state of one of my light switches, but it shows the sorts of things that can be correlated.

These types of Items can also be included in bar charts and other types of data representations.

Back to the topic of zram usage; my zramctl shows:

# zramctl --output-all
NAME       DISKSIZE   DATA COMPR ALGORITHM STREAMS ZERO-PAGES TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram3     450M 390.3M 44.9M zstd            4        794  200M      200M     200M       0B /opt/zram/zram3
/dev/zram2     350M   364K  5.2K zstd            4          0  184K      150M     184K       0B /opt/zram/zram2
/dev/zram1     350M  65.3M  1.5M zstd            4         58 32.6M      150M    32.6M       0B /opt/zram/zram1
/dev/zram0     450M     4K   86B lzo-rle         4          0    4K      200M       4K       0B [SWAP]

The size of /var/lib/openhab/persistence/rrd4j is 36M.

Funnily enough, rrd4j is stored on zram1, while zram3 is used for logs, and the logs total 390M. The largest files (128 MB!) are log/daemon.log and log/syslog, with the majority of their contents showing that very error:

Feb  2 03:47:25 openhabian karaf[813]: Exception in thread "OH-eventexecutor-149284" java.lang.NullPointerException
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:47:25 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)

It looks like the error is cascading: a failure in one place (persistence) causes tons of these messages to be spat into the logs, which in turn aggravates the failure until complete collapse. @rlkoshak, I wonder, can something be done about it? One way would be to catch all DB submission errors in openHAB and report only the first one until things recover. I understand that it’s easier said than done, but IMHO this is a major issue.
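Just to illustrate what I mean, here is a rough sketch (the class and method names are made up, this is not actual openHAB code): count repeats of the same error message and stop logging after a handful of occurrences, until the error clears again.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper: report each distinct error only a limited number of times,
// then stay silent until the error is cleared again.
class RepeatedErrorReporter {
    private static final Logger logger = LoggerFactory.getLogger(RepeatedErrorReporter.class);
    private static final long MAX_REPORTS = 10;

    private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

    // called on every failed persistence write
    void reportFailure(String message, Throwable cause) {
        long n = counts.computeIfAbsent(message, k -> new AtomicLong()).incrementAndGet();
        if (n <= MAX_REPORTS) {
            logger.error("{} (occurrence {})", message, n, cause);
        } else if (n == MAX_REPORTS + 1) {
            logger.error("{} keeps failing, suppressing further reports", message);
        }
        // anything beyond that is swallowed instead of flooding the log
    }

    // called when a write succeeds again, so new failures get reported anew
    void reportSuccess(String message) {
        counts.remove(message);
    }
}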

If you’re interested, this is how it began:

Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-1" Exception in thread "OH-eventexecutor-2" java.lang.NullPointerException
Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-3" #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlock
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.lang.Thread.run(Thread.java:829)
Feb  2 03:45:28 openhabian karaf[813]: Exception in thread "OH-eventexecutor-4" java.lang.NullPointerException
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:214)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb  2 03:45:28 openhabian karaf[813]: #011at java.base/java.lang.Thread.run(Thread.java:829)

So it started at 03:45, and the last such line is stamped 03:47. In two minutes the log was filled to the brim with 1044989 lines (I used "grep karaf syslog | wc -l" to count); it looks like OH just keeps retrying and failing with no delay.

kern.log has this:

Feb  2 03:47:23 openhabian kernel: [365506.992045] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 47231)
Feb  2 03:47:23 openhabian kernel: [365506.997667] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 47507)
Feb  2 03:47:23 openhabian kernel: [365506.997744] Buffer I/O error on device zram3, logical block 47507
Feb  2 03:47:23 openhabian kernel: [365506.997760] Buffer I/O error on device zram3, logical block 47508
Feb  2 03:47:23 openhabian kernel: [365506.997771] Buffer I/O error on device zram3, logical block 47509
Feb  2 03:47:23 openhabian kernel: [365506.997781] Buffer I/O error on device zram3, logical block 47510
Feb  2 03:47:23 openhabian kernel: [365506.997791] Buffer I/O error on device zram3, logical block 47511
Feb  2 03:47:23 openhabian kernel: [365506.997801] Buffer I/O error on device zram3, logical block 47512
Feb  2 03:47:23 openhabian kernel: [365506.997811] Buffer I/O error on device zram3, logical block 47513
Feb  2 03:47:23 openhabian kernel: [365506.997821] Buffer I/O error on device zram3, logical block 47514
Feb  2 03:47:23 openhabian kernel: [365506.997831] Buffer I/O error on device zram3, logical block 47515
Feb  2 03:47:23 openhabian kernel: [365506.997840] Buffer I/O error on device zram3, logical block 47516
Feb  2 03:47:23 openhabian kernel: [365507.010431] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 33278)
Feb  2 03:47:23 openhabian kernel: [365507.025818] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24322)
Feb  2 03:47:23 openhabian kernel: [365507.040076] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24602)
Feb  2 03:47:23 openhabian kernel: [365507.054284] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 24882)
Feb  2 03:47:23 openhabian kernel: [365507.068931] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25154)
Feb  2 03:47:23 openhabian kernel: [365507.082773] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25438)
Feb  2 03:47:23 openhabian kernel: [365507.097182] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25703)
Feb  2 03:47:23 openhabian kernel: [365507.111421] EXT4-fs warning (device zram3): ext4_end_bio:349: I/O error 10 writing to inode 56 starting block 25979)

Note the timestamp of 03:47, i.e. this was a consequence of the log partition filling up.

And one more thing. I see that by default events.log records every event at the INFO level, i.e. every change of every measured value. Does it make sense to have this enabled by default? The log is huge, and all of it looks like this:

2022-02-02 01:02:32.994 [INFO ] [openhab.event.ItemStateChangedEvent ] - Item 'Current_Rooms' changed from 0.697876071936 to 0.696626070016
2022-02-02 01:02:33.642 [INFO ] [openhab.event.ItemStateChangedEvent ] - Item 'Current_Rooms' changed from 0.696626070016 to 0.699751074816

This comes from my multi-channel power meter, which is quite precise. Plug in a phone charger and you get tons of such events per minute.

I admit it can be useful for debugging a setup, but I doubt it’s a good idea to have it always enabled.

daemon.log and syslog are operating system log files, and these specific errors are not specific to the DB. They have to do with the file system running out of space. Anything that tries to write to that file system will generate errors, and the OS will report those to its logs.

But OH doesn’t know specifically that it’s failing because the file system ran out of space. It gets the same error if the file’s permissions are such that OH isn’t allowed to write to it, if the file system is mounted read-only for some reason, if something else holds a write lock on the file for too long, etc. So all OH can do is keep trying, and every attempt that fails generates more errors.

There are some arguments about that, but in my experience, when trying to debug openHAB behavior problems, especially problems with rules, this log is critical. OH is an event-driven system; events.log shows you the events.

You can easily turn that off in $OH_USERDATA/etc/log4j2.xml; note that everything logged to events.log is at the INFO level. Or you can configure the files to rotate at a smaller size and keep fewer backups.
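A minimal sketch of both options (the logger and appender names below follow a typical OH3 log4j2.xml; check your own file, as the exact names, paths and defaults may differ):

<!-- silence events.log entirely; it only ever logs at INFO -->
<Logger name="openhab.event" level="WARN" additivity="false">
    <AppenderRef ref="EVENT"/>
</Logger>

<!-- or keep it, but rotate at a smaller size and keep fewer backups -->
<RollingRandomAccessFile name="EVENT" fileName="${sys:openhab.logdir}/events.log"
        filePattern="${sys:openhab.logdir}/events.log.%i.gz">
    <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} [%-26.26c] - %m%n"/>
    <Policies>
        <SizeBasedTriggeringPolicy size="4 MB"/>
    </Policies>
    <DefaultRolloverStrategy max="2"/>
</RollingRandomAccessFile>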

In practice, the current logging and even persistence configuration is just fine for the vast majority of users. Some users, like you, will encounter problems, and there are lots of ways you can adjust your settings to deal with them. That doesn’t mean the adjustments you need to make are generally applicable or should be the defaults. As with everything, functionality needs to be balanced against efficiency.

I understand. What I mean is that a good countermeasure would be to log the same error only once and not log it again while it keeps repeating. Or, say, 10 times, then give up. Otherwise it just blows up. Interestingly, the blowup isn’t Raspberry-specific: even with a huge hard drive it will eventually blow up, it would just take longer. And you could be on vacation, for example. It’s very frustrating to find your system has simply gone offline. Aren’t things like openHAB supposed to be autonomous and robust?

By the way, the log explosion on zram turned out even worse: the device locked up in read-only mode, requiring a reboot, so I was unable to use tools like logrotate to clean it up.

No, because those logs will rotate out. By default, logs on these systems have a fixed size and cannot grow beyond it. For example, at 50 k the log rotates and a new file is started; older rotated files get compressed and only the most recent 5 are kept. All of the logs mentioned so far have configurable policies for this.
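For the OS logs that policy lives in /etc/logrotate.conf and /etc/logrotate.d/; an illustrative stanza matching the numbers above (file names and values here are examples, not the openHABian defaults) would look like:

/var/log/syslog /var/log/daemon.log {
    size 50k
    rotate 5
    compress
    delaycompress
    missingok
    notifempty
}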

Yes, because logrotate is cron-based and only runs once per day, while openHAB, continuously failing, can fill the drive to 0% free in, say, a couple of hours.

Ultimately this is dancing around the golden calf, and it happens on all sorts of servers and applications. There is no fundamental ‘solution’ to the problem, as there can always be environmental conditions, e.g. a loss of network connectivity, that result in filling the disk with errors, no matter the application, whether you use zram, or how large your disk is. Those parameters only affect how fast or how often you hit problems, not whether you hit them at all.

So rotate or just clear your system logs (the ‘permanent’ copies in /opt/zram/log.bind/), increase the sizes in ztab if need be, then move on with life.
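For reference, the log line in /etc/ztab looks roughly like the one below (column layout as in zram-config’s ztab shipped with openHABian; your file may differ, e.g. older versions lack the oldlog_dir column, and the sizes here are just an example of bumping the 200M/450M seen above):

# log   alg    mem_limit   disk_size   target_dir   bind_dir    oldlog_dir
log     zstd   300M        600M        /var/log     /log.bind   /opt/zram/oldlog

After editing, reboot for the new sizes to take effect.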

@mstormi Well, I kind of agree… but I think openHAB’s robustness could be improved in this regard. There may be no fundamental solution, but certain things could perhaps be improved.
Okay, I’ve adjusted openHAB’s log file size limits, priorities, etc., and rebooted the system, which cleared the zram.
Meanwhile I’ve found another, related issue: Openhabian dhcpcd[353]: received SIGPIPE. This alone generates 2 MB.

If you have a concrete proposal, I’m listening. But the fact alone that you got hit by a rare combination of events is no proof of that statement; it is rather to be expected.
Even more so as it seems you caused at least part of it yourself with your static IP…

Well, my v3 setup has only been running for one month, and in fact the total uptime was approximately 7 days, because all this time I was fighting an independent issue (Raspberry pi 4 hard freeze, possibly overheating - #5 by chris4789), due to which my continuous uptime never exceeded 4 days. So it took 7 days of uptime to hit the problem. Isn’t that too fast for a rare problem?

Well, perhaps you can blame me for being stupid and forgetting to lower the log size limits after installing v3; in fact I had done that on my old v2.5 installation. I’ve accepted that.

Sorry, that’s wrong; see the reply. I didn’t respond immediately because I went to test this suggestion.

Hello again, the problem is back after one week, but with a different scenario.
I’ve restricted openHAB’s log sizes, so now my zram isn’t overflowing:

# zramctl --output-all
NAME       DISKSIZE  DATA COMPR ALGORITHM STREAMS ZERO-PAGES TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram3     450M 70.7M  6.6M zstd            4       3925 27.7M      200M    27.7M       0B /opt/zram/zram3
/dev/zram2     350M  364K  5.1K zstd            4          0  184K      150M     184K       0B /opt/zram/zram2
/dev/zram1     350M   52M  1.6M zstd            4         49 25.9M      150M      26M       0B /opt/zram/zram1
/dev/zram0     450M    4K   86B lzo-rle         4          0    4K      200M       4K       0B [SWAP]

rrd4j’s directory was working fine: I could write files and it didn’t go read-only. Yet I again saw the “could not create rrd4j database file” message, for existing file names. And there was one more suspicious log message this time:

Feb  7 07:17:56 openhabian kernel: [71351.619799] cgroup: fork rejected by pids controller in /system.slice/openhab.service

Any ideas where that could originate from? Perhaps that was the root cause of my previous failure; it’s just that back then the openHAB logs weren’t limited in size, so they filled up the zram.

Restarting openHAB via systemctl fully regained control. Meanwhile I’ve googled the message; it has to do with some process limits. Unfortunately I’ve already restarted, so “status” isn’t available anymore.
But… what can openHAB fork, in principle? I have a Python script run by an event, but it’s only triggered 5 days per month (and those have passed). What else?
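Next time I’ll check the systemd task accounting before restarting, something like this (plain systemctl commands; the limit value below is just an example, not a recommendation):

systemctl show openhab.service -p TasksCurrent -p TasksMax

and, if the limit really turns out to be too low, raise it with a drop-in (sudo systemctl edit openhab.service) containing e.g.:

[Service]
TasksMax=5000

followed by a restart of the service.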

There also seems to have been a concurrency issue that resulted in the “Could not create rrd4j database file” error. So if you run into it with items that are persisted/queried multiple times within a short interval, then it should be fixed with openhab-addons#13332. :slight_smile:

@wborn hi and thank you very much for the information. Which release will the fix be included in? I definitely need to give it a try.

The latest entry in the pull request (the link posted by wborn) says it was added to the 3.4 milestone.

Thought I’d add my experience for anyone that may find it useful.

I also experienced this error, and this was the first thread that popped up. Checking permissions on the file helped me work out what my problem was.

It turns out I had two persistence strategies defined - one for everyChange and one for everyMinute + everyChange, each with its corresponding Item Group. I was only experiencing this error for one Item, and it was indeed persisting - looking more closely, I had assigned both Item Groups to the Item in question. I can only assume there was a concurrency issue at times when both strategies attempted to persist at the same time and hit an I/O error. I removed the redundant group and all is good in the world!
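Roughly like this (group names here are simplified, not my exact config), with the affected Item a member of both groups:

Strategies {
    everyMinute : "0 * * * * ?"
}

Items {
    gPersistChange*       : strategy = everyChange
    gPersistMinuteChange* : strategy = everyMinute, everyChange
}

Removing the redundant group membership was enough to make the error go away.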

Could you please reproduce the issue with a new item then open a GitHub issue in openhab/openhab-core?

After installing a new binding and creating the things/items, I also had the error:
*Could not create rrd4j database file*
I doubled the memory reservation, and since then the error has disappeared, for the time being.
I run my system on a 4 GB Raspberry Pi with an SSD.