openHAB suddenly unstable/crashing

supersjel · December 10, 2023, 9:55am

Hi,

Got openHAB 4.04 running on a Raspberry Pi 4 (4 GB, Bullseye, Java 17). The setup has been running for years; not on 4.x all that time, but ever since it became available. It always did its job.

Yesterday it suddenly crashed. The logs I know how to find don’t show anything special; it is just running along and then “bang”. It has happened a few times now. Symptoms I see (not consistent):

Runs for 4-8 hours
Rise in CPU usage just before crash
Raspberry reboots (but not cleanly: openHAB doesn’t resume properly and a second reboot is needed)
Sometimes it keeps running, but SSH remote access is lost and the OH web GUI is not reachable. I also run some Node.js web services on it, and they keep running.
The openHAB log seems normal right up until the reboot, but just before crashing it shows a few unexpected messages:
2023-12-09 23:39:14.814 [WARN ] [nce.extensions.PersistenceExtensions] - There is no default persistence service configured!
2023-12-09 23:39:14.815 [WARN ] [nce.extensions.PersistenceExtensions] - There is no queryable persistence service registered with the id ‘null’
(this is repeated, rules fail and UI shows no historic graphs anymore).
2023-12-09 23:39:16.919 [INFO ] [e.io.rest.internal.filter.CorsFilter] - enabled CORS for REST API.
(lots of them, one every 100 ms; normally only present after startup, I think). These are often the last lines in the openHAB log.

The Zigbee thing (USB POPP stick) didn’t recover after the first crash. Disabling it doesn’t change the issues above.

So help is appreciated. Two questions I can think of:

Do the persistence errors mean the rrd4j is broken? Or the database is corrupt? Is there a tool/way to discover which of the item-files is affected?
Any other logs I could check for information? (I also checked the openHAB event and audit logs, but found nothing beyond the normal in them.) I tried the Raspberry logs in /var/log but don’t know what to look for in them (or what’s normal there). I can see that it reboots after openHAB runs into trouble.

edit/added: I did look into the Zigbee device. It seems broken. Attaching it blocks all USB ports and probably the hard drive (with Raspbian). SSH commands like ‘ls’ fail with an input/output error. So I think I found the root cause of the issues. But the crash/reset of the Raspberry seems to have corrupted something. Or could it have broken the USB hardware?

Thanx in advance!

rlkoshak · December 11, 2023, 2:43pm

What’s using the CPU when it rises?

Fails with an error or doesn’t even try to start?

Not necessarily. The fact that it works at all after a first clean boot indicates it’s working just fine.

The behavior of OH is almost certainly a symptom of some problem outside of OH.

Look at the syslog as a start.

That is a possibility. If it doesn’t shut down nicely, it is easy to end up with a corrupted file system.

A full file system can cause problems similar to these as well.
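A quick way to rule that out is to look at how full the file system is. Here is a minimal sketch in Python using only the standard library; the 90% threshold and the function names are arbitrary choices for illustration, not anything openHAB prescribes.

```python
import shutil


def usage_percent(path: str = "/") -> float:
    """Return how full the file system holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    """True when usage is at or above `threshold` percent (90 is an arbitrary default)."""
    return usage_percent(path) >= threshold


# Report on the root file system; pass another path to check a different mount.
print(f"/ is {usage_percent('/'):.1f}% full")
```

Running `df -h` in a shell gives the same information; the script form is only handy if you want to check it on a schedule.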

You don’t mention if you are using openHABian or tmpfs or any other approach to mitigate SD card wear out. If you are not using any of those, then it’s possible the SD card has worn out. Very strange behaviors occur when SD cards wear out. If you are using openHABian with zram enabled or have set up your own RAM disks, it might be possible you are losing data on the crash, as the changes in RAM are not saved to disk or are incompletely saved to disk.

The fact that serial devices are broken points to a very sick little machine. It is almost always easier to create a backup of the data you need and rebuild using a fresh image. If you’ve not been using zram etc I’d recommend a new SD card as well.

supersjel · December 11, 2023, 5:27pm

Thanx for the help/response. This is a bit of a quest.

Don’t know. The logging is a quite simple openHAB item. Just CPU total and memory (no peaks on that). Or do you know of a log that keeps more detail?

No errors in the openHAB log (but I fear parts of the log are missing because of the reboots; there are gaps in the timestamps). The GUI responds, but random things stay offline/initializing/unresponsive.

Ok, maybe dead end. I had a theory (not based on actual knowledge) that something was corrupt in a rrd4j file that caused troubles when it was time to compact the data of the first hour.

Your guess is as good as mine. I might have jumped to premature conclusions because the Node.js scripts kept running. Of course, SSH failing points the other way.

So I will look into that. At first glance this log is not my cup of tea. Are there any keywords I can look for with “grep”? I tried “fail”, “error” and “warning”, but they don’t seem to be the right words for this log. I will also try Google for that. When I grep for “openhab” it only returns the lines below. I think they’re normal when the log file is rotated.

frontail[635]: tail: ‘/var/log/openhab/openhab.log’ has become inaccessible
frontail[635]: tail: ‘/var/log/openhab/openhab.log’ has appeared; following new file
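For anyone else searching a syslog for this kind of hardware trouble, here is a minimal sketch of a keyword scan. The keyword list is an assumption: these are fragments that commonly show up in Linux kernel messages about disk, USB, power and memory problems, and the list is certainly not exhaustive. The default path `/var/log/syslog` is also an assumption about the OS image.

```python
import re
from typing import Iterable, List

# Assumed keyword list: fragments often seen in kernel messages when storage,
# USB, power supply or memory misbehaves. Extend it for your own system.
PATTERNS = [
    r"I/O error",
    r"under-voltage",
    r"over-current",
    r"USB disconnect",
    r"EXT4-fs error",
    r"Out of memory",
]
_MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if _MATCHER.search(line)]


def scan_file(path: str = "/var/log/syslog") -> List[str]:
    """Scan one log file; undecodable bytes are replaced instead of raising."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling `scan_file()` and printing the result does roughly what a case-insensitive `grep -E` with the same alternation would do. Reading the log usually needs root rights.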

openHABian installed by apt (according to the documentation page), no RAM disk or similar approach. It’s running on a 1000 GB hard drive with 400 GB free, so space and wear should not be an issue (the drive is 2-3 years old).

I have backups, but they are a few months old. Bit lazy, I guess. Or spoiled, because until now the setup ran effortlessly. Of course I created a fresh one, but it could contain the “backed-up” issue. I could use the older one; I would lose some history on items that I like, but I will survive. The code/rules/widgets I created are also kept on GitHub, so I can get the latest versions back. That’s the most important part.

I did manage to diagnose a bit more. I use several USB devices, and to power them all they’re connected to a (powered) USB hub. At least one port of the hub is fried, so I replaced the hub. I also removed one rrd4j file. The files seem to have similar sizes (the same few numbers keep turning up), except for one, so I removed that one. OH will start a new file anyway. Of course it’s a bit soon to cheer, but so far it’s running a “record” of 24 hours with no issues. Maybe I got the sickness contained, fingers crossed!

rlkoshak · December 11, 2023, 5:49pm

When the CPU use starts growing you’ll need to log into the machine and run htop or top to see what processes are consuming the CPU.

What matters is what is being logged near when OH crashes.

openHABian isn’t installed by apt. Do you mean openHAB installed by apt or openHABian installed by cloning the git repository?

I’m all but certain the problem is outside of OH. So any backup you have of openHAB will be fine. You can rebuild the machine (i.e. new OS, reinstall the software) and then restore the backed up config and data and you should be good to go.

Creating a backup of your OH configs and data (and other software you may be running too) should also be fine.

The whole point of rrd4j is to remain fixed in size. The files for Items of the same type should all be the same size no matter how long you’ve been saving data into it.

The names of the files match the names of the Item it goes with. Is that Item behaving?
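To spot an odd one out without eyeballing a directory listing, something like the sketch below could work. It assumes the rrd4j files end in `.rrd`, and the folder you pass in is up to you (on an apt install it is usually `/var/lib/openhab/persistence/rrd4j`, but treat that path as an assumption). The helper names are made up for illustration.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List


def rrd_sizes(folder: str) -> Dict[str, int]:
    """Map each *.rrd file name in `folder` to its size in bytes."""
    return {path.name: path.stat().st_size for path in Path(folder).glob("*.rrd")}


def odd_sized_files(sizes: Dict[str, int], min_group: int = 2) -> List[str]:
    """Return the names whose size is shared by fewer than `min_group` files."""
    counts = Counter(sizes.values())
    return sorted(name for name, size in sizes.items() if counts[size] < min_group)
```

Usage would be `odd_sized_files(rrd_sizes(folder))`. Keep in mind that an Item which is the only one of its type will also be reported, so a hit is a reason to look closer, not proof of corruption.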

supersjel · December 11, 2023, 9:15pm

@rlkoshak, thanx again. On mobile now, so quoting large text doesn’t work that well.

When in CPU trouble, the transition is quite fast. And one of the first things that fails is SSH. So typing commands like top is not possible.
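One way around that would be to write the busiest processes to a file continuously, so the last lines before a crash show what was eating the CPU. Below is a minimal sketch, assuming a Linux `/proc` file system. All function names and the log format are made up for illustration. It measures CPU use as the increase in clock ticks (utime + stime from `/proc/<pid>/stat`) between two samples, which reflects recent activity rather than a lifetime average.

```python
import os
import time
from typing import Dict, List, Tuple

Sample = Dict[int, Tuple[str, int]]  # pid -> (command name, CPU ticks used so far)


def parse_stat(text: str) -> Tuple[int, str, int]:
    """Parse one /proc/<pid>/stat record into (pid, name, utime + stime)."""
    head, _, tail = text.rpartition(")")  # the name itself may contain spaces
    pid_text, _, name = head.partition("(")
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are fields 14 and 15.
    return int(pid_text), name, int(fields[11]) + int(fields[12])


def take_sample() -> Sample:
    """Read the CPU tick counters of all processes currently listed in /proc."""
    sample: Sample = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                pid, name, ticks = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process ended while we were reading it
        sample[pid] = (name, ticks)
    return sample


def busiest(before: Sample, after: Sample, limit: int = 5) -> List[Tuple[int, str, int]]:
    """Return up to `limit` (pid, name, ticks used between samples), busiest first."""
    deltas = [
        (pid, name, ticks - before[pid][1])
        for pid, (name, ticks) in after.items()
        if pid in before
    ]
    deltas.sort(key=lambda row: row[2], reverse=True)
    return deltas[:limit]


def log_busiest(path: str, interval: float = 10.0) -> None:
    """Append the busiest processes to `path` every `interval` seconds, forever."""
    before = take_sample()
    while True:
        time.sleep(interval)
        after = take_sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as handle:
            for pid, name, ticks in busiest(before, after):
                handle.write(f"{stamp} pid={pid} name={name} ticks={ticks}\n")
            handle.flush()
            os.fsync(handle.fileno())  # push to disk so the lines survive a crash
        before = after
```

Calling `log_busiest` with a file path of your choice starts the loop. If the disk itself drops out, as it did here, the log will stop too, so this only helps when the storage stays reachable up to the crash.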

You’re right. It was long ago, it’s cloning a repo, modifying a config file and running a script. My bad.

Thanx for the confirmation that the rrd4j files should have the same sizes (probably depending on item type). So the one with the strange size was worth deleting. Too late now, but I should have used the analyse item option to check a year of history and see what happened.

Still running, let’s see what happens the next days …

supersjel · December 13, 2023, 4:33pm

Still running with no issues. Yesterday I got the new Zigbee device. After installing it and updating the things and items, the lights became operational too. So with help from @rlkoshak this is my conclusion:

It was not an openHAB issue. The USB Zigbee stick failed and fried the USB hub too. Somehow this blocked USB on the Raspberry and with it the contact with serial devices (including the hard drive). The system rebooted several times before I discovered the faulty hardware. This caused file corruption in an rrd4j file (weird file size). Deleting the file and replacing the USB hub and Zigbee stick solved all issues. Probably the USB stick protected the Raspberry itself.

Lesson learned: I’ll check the community to see if someone has made a cron job that creates a weekly rotating backup. Mine was quite old, and it’s a nice feeling to have a good recovery option.
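For reference, a rotating backup could look roughly like the sketch below. It assumes `openhab-cli backup <file>` is available on the system and writes a zip to the given file name; the `openhab-backup-` file name prefix, the folder and the number of copies to keep are hypothetical choices of this sketch.

```python
import subprocess
import time
from pathlib import Path
from typing import List

PREFIX = "openhab-backup-"  # hypothetical naming scheme used by this sketch


def backups_to_delete(names: List[str], keep: int = 4) -> List[str]:
    """Given backup file names with an ISO date in them, return all but the `keep` newest."""
    ordered = sorted(names)  # YYYY-MM-DD in the name sorts chronologically
    return ordered[:-keep] if keep > 0 else ordered


def make_backup(folder: Path, keep: int = 4) -> Path:
    """Create a dated backup in `folder`, then remove the oldest surplus copies."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{PREFIX}{time.strftime('%Y-%m-%d')}.zip"
    # Assumption: `openhab-cli backup <file>` writes the backup zip to <file>.
    subprocess.run(["openhab-cli", "backup", str(target)], check=True)
    existing = [path.name for path in folder.glob(f"{PREFIX}*.zip")]
    for name in backups_to_delete(existing, keep):
        (folder / name).unlink()
    return target
```

A weekly cron entry (for example one using the schedule `0 3 * * 0`, Sunday at 03:00) that runs a small script calling `make_backup` with a folder on a different drive would then keep the last four weeks. The backup command usually needs root rights, so the entry would belong in root’s crontab.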