OH3: unexpected power loss; major drama and question marks after restart

  • Platform information:
    • Hardware: Raspberry Pi 4 Model B Rev 1.1 ; 4GB memory
    • Host: Linux openhabian 5.15.76-v7l+ #1597 SMP Fri Nov 4 12:14:58 GMT 2022 armv7l GNU/Linux
    • Distro: Raspbian GNU/Linux 11 (bullseye)
    • openjdk version “11.0.18” 2023-01-17
    • OpenJDK Runtime Environment (build 11.0.18+10-post-Raspbian-1deb11u1)
    • OpenJDK Server VM (build 11.0.18+10-post-Raspbian-1deb11u1, mixed mode)
  • OH Version: 3.4.3 (Build)
    • Installation method: openhabian
Directories: Folder Name      | Path                        | User:Group
             -----------      | ----                        | ----------
             OPENHAB_HOME     | /usr/share/openhab          | openhab:openhab
             OPENHAB_RUNTIME  | /usr/share/openhab/runtime  | openhab:openhab
             OPENHAB_USERDATA | /var/lib/openhab            | openhab:openhab
             OPENHAB_CONF     | /etc/openhab                | openhab:openhab
             OPENHAB_LOGDIR   | /var/log/openhab            | openhab:openhabian
             OPENHAB_BACKUPS  | /var/lib/openhab/backups    | openhab:openhab

Extras installed via openhabian-config:
zigbee2mqtt
SD card mirror

I have a whole-property UPS which, in the eleven years it has been running continuously, has worked without fail. Yesterday it unexpectedly lost power for 2 seconds after the grid had brown-out issues.

The following issues occurred after all systems restarted.

As for openHABian:

  1. The OS time came back one hour earlier than it actually was
  2. The Zigbee dongle for the OH Zigbee binding stayed offline (reason: USB port change)
  3. No log files since August 2023, which may have been the last restart/reboot of the whole machine
  4. Zigbee2MQTT is (still) not accessible; it is running, but has no front-end (which it had before). I restarted mosquitto, because I read that zigbee2mqtt won't work if the broker isn't online first. I eventually fixed this by recreating a directory (see [edit1] below)
  5. As a result of Zigbee not working, the fridge freezer and other important devices did not resume power (default state off; I had to switch them on manually); lights and switches don't work either
  6. I lost all persistence data prior to the last shutdown in August 2023
  7. After adjusting the OS time, persistence restarted (deleting what had been captured since the reboot)

Since I barely touch the system (it just runs), I am not familiar with many of the settings (I tend to forget these and have to reread… getting old); hence the following questions and statements:

  1. I think I configured SD card mirroring. How can I check that it is working? When is the image overwritten, and hence no longer useful as a backup of the pre-failure state?
  2. How can the issues above be prevented? (Loss of persistence, logs…)
  3. I think I can install a UPS HAT for the Pi to prevent data loss.
  4. I could disable ZRAM, which seems to be the cause of the data loss since the data is held in memory. Should I rather install an SSD or HDD?
  5. Why is the OS time out by an hour?

I also get these errors, and have no idea what triggers them:

2024-03-16 06:44:31.858 [ERROR] [ui.internal.items.ItemUIRegistryImpl] - transformation throws exception [transformation=org.openhab.transform.map.internal.MapTransformationService@172d3a4, value=-]
org.openhab.core.transform.TransformationException: Target value not found in map for '-'

2024-03-16 07:34:52.865 [ERROR] [ab.core.ui.icon.internal.IconServlet] - Failed sending the icon byte stream as a response: null
2024-03-16 07:34:52.868 [WARN ] [org.eclipse.jetty.server.HttpChannel] - /icon/solarpanels

I have this backup; is it a full or config backup?

lrwxrwxrwx  1 openhab openhab   61 Feb 18  2023 latest.zip -> /var/lib/openhab/backups/openhab-backup-23_02_18-10_05_07.zip
-rw-r--r--  1 openhab openhab 6.7M Feb 18  2023 openhab-backup-23_02_18-10_05_07.zip

Can I restore the persistence files only?
I noticed that when I stop the OH service, it comes back up by itself, leaving no time to restore the persistence files.

Would it be better to install OH manually and run it in Proxmox, which allows for snapshot backups?

Can ZRAM be set up to write the logs to disk at defined intervals?

Anything else that would prevent these issues… or better recover from them?

Thanks, any hints appreciated.

The ideal would be a system that comes up as required every time.

[edit1] I see this in journalctl; seems like zigbee2mqtt is restarting endlessly.

-- Journal begins at Sat 2024-03-16 06:26:39 AEST, ends at Sat 2024-03-16 09:46:34 AEST. --
Mar 16 06:26:39 openhabian systemd[1]: zigbee2mqtt.service: Main process exited, code=exited, status=1/FAILURE
Mar 16 06:26:39 openhabian systemd[1]: zigbee2mqtt.service: Failed with result 'exit-code'.
Mar 16 06:26:39 openhabian systemd[1]: zigbee2mqtt.service: Consumed 7.246s CPU time.
Mar 16 06:26:39 openhabian systemd[1]: zigbee2mqtt.service: Scheduled restart job, restart counter is at 3146.
Mar 16 06:26:39 openhabian systemd[1]: Stopped Zigbee2MQTT.
Mar 16 06:26:39 openhabian systemd[1]: zigbee2mqtt.service: Consumed 7.246s CPU time.
Mar 16 06:26:39 openhabian systemd[1]: Started Zigbee2MQTT.
Mar 16 06:26:40 openhabian npm[18779]: > zigbee2mqtt@1.33.1 start
Mar 16 06:26:40 openhabian npm[18779]: > node index.js
Mar 16 06:26:45 openhabian npm[18805]: Error: EACCES: permission denied, mkdir '/var/log/zigbee2mqtt'
Mar 16 06:26:45 openhabian npm[18805]:     at Object.mkdirSync (node:fs:1382:3)
Mar 16 06:26:45 openhabian npm[18805]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:199:16)
Mar 16 06:26:45 openhabian npm[18805]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:197:12)
Mar 16 06:26:45 openhabian npm[18805]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:197:12)
Mar 16 06:26:45 openhabian npm[18805]:     at Object.mkdirSync (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:82:10)
Mar 16 06:26:45 openhabian npm[18805]:     at Object.init (/opt/zigbee2mqtt/lib/util/logger.ts:43:12)
Mar 16 06:26:45 openhabian npm[18805]:     at new Controller (/opt/zigbee2mqtt/lib/controller.ts:53:16)
Mar 16 06:26:45 openhabian npm[18805]:     at start (/opt/zigbee2mqtt/index.js:106:18)
Mar 16 06:26:45 openhabian npm[18805]:     at processTicksAndRejections (node:internal/process/task_queues:96:5)
Mar 16 06:26:45 openhabian systemd[1]: zigbee2mqtt.service: Main process exited, code=exited, status=1/FAILURE
Mar 16 06:26:45 openhabian systemd[1]: zigbee2mqtt.service: Failed with result 'exit-code'.
Mar 16 06:26:45 openhabian systemd[1]: zigbee2mqtt.service: Consumed 7.263s CPU time.
Mar 16 06:26:45 openhabian systemd[1]: zigbee2mqtt.service: Scheduled restart job, restart counter is at 3147.
Mar 16 06:26:45 openhabian systemd[1]: Stopped Zigbee2MQTT.
Mar 16 06:26:45 openhabian systemd[1]: zigbee2mqtt.service: Consumed 7.263s CPU time.
Mar 16 06:26:45 openhabian systemd[1]: Started Zigbee2MQTT.
Mar 16 06:26:47 openhabian npm[18817]: > zigbee2mqtt@1.33.1 start
Mar 16 06:26:47 openhabian npm[18817]: > node index.js
Mar 16 06:26:51 openhabian npm[18831]: Error: EACCES: permission denied, mkdir '/var/log/zigbee2mqtt'
Mar 16 06:26:51 openhabian npm[18831]:     at Object.mkdirSync (node:fs:1382:3)
Mar 16 06:26:51 openhabian npm[18831]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:199:16)
Mar 16 06:26:51 openhabian npm[18831]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:197:12)
Mar 16 06:26:51 openhabian npm[18831]:     at mkdirSyncRecursive (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:197:12)
Mar 16 06:26:51 openhabian npm[18831]:     at Object.mkdirSync (/opt/zigbee2mqtt/node_modules/mkdir-recursive/index.js:82:10)
Mar 16 06:26:51 openhabian npm[18831]:     at Object.init (/opt/zigbee2mqtt/lib/util/logger.ts:43:12)
Mar 16 06:26:51 openhabian npm[18831]:     at new Controller (/opt/zigbee2mqtt/lib/controller.ts:53:16)
Mar 16 06:26:51 openhabian npm[18831]:     at start (/opt/zigbee2mqtt/index.js:106:18)

I have fixed this with Zigbee2MQTT @ OpenHab - #30 by Sleugner

The reason I can see in the logs is that the directory /var/log/zigbee2mqtt cannot be created.
As this log lives in zram, those files are restored during restart from the last successful write to the SD card.
I would try to fix the permissions in openHABian first. They should be:

drwxr-xr-x   1 openhabian openhab         4.0K Mar  9 20:36 zigbee2mqtt/
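A minimal sketch of that directory fix. It runs against a scratch path so it is safe to try anywhere; on a real openHABian box the path is /var/log/zigbee2mqtt, you would run it via sudo, and you would add a chown to openhabian:openhab to match the listing above.

```shell
# Recreate the Zigbee2MQTT log directory. LOGDIR defaults to a scratch path
# for safe testing; set LOGDIR=/var/log/zigbee2mqtt (and run via sudo, plus
# chown openhabian:openhab) on the real system.
LOGDIR="${LOGDIR:-/tmp/z2m-demo/var/log/zigbee2mqtt}"
mkdir -p "$LOGDIR"
chmod 755 "$LOGDIR"
ls -ld "$LOGDIR"
```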

A port change (which can happen if you have multiple USB devices) should not cause any problems in a correct installation: even if the port changes, the system creates a stable link under /dev/serial/by-id (and /dev/serial/by-path), and this is what gets used if you install with openhabian.

I understand you know this stuff, having programmed in this space.
My installation is stock-standard openHABian, with Z2M installed via openhabian-config.

Given that, how can the USB not have been assigned via /dev/serial/by-id, thus creating this problem in the first place?
All I can add is that Z2M was installed while OH was running and was not restarted (as far as I can recall).
Can I check somewhere whether the USB ports are addressed by ID rather than by port name?
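A quick read-only check; /opt/zigbee2mqtt/data/configuration.yaml is the openHABian default location and is an assumption if Z2M was installed differently:

```shell
# Stable identifiers for attached serial adapters; the directory only exists
# while at least one adapter is plugged in
ls -l /dev/serial/by-id 2>/dev/null || echo "no serial adapters attached"

# What Zigbee2MQTT is configured to open: a /dev/serial/by-id/... path here
# survives reboots, while /dev/ttyUSB0 or /dev/ttyACM0 can renumber
grep -A2 'serial:' /opt/zigbee2mqtt/data/configuration.yaml 2>/dev/null \
  || echo "configuration.yaml not found at the default path"
```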

However, my questions go deeper, and I may not have conveyed this well.

How can I ensure that no data loss occurs to the degree I have experienced?

Having done further research post trauma :slight_smile: I am opting for an APC UPS with a USB port, connected to a machine that runs NUT, which notifies the Pi running OH to shut down after a certain time without power.

Still, the ZRAM business (based on this experience) is not to my liking.

I am thinking of backing up OH on at least a daily basis. However, I suspect this will not capture the ZRAM data.

Is there any guidance on how to get around this issue?
Disabling ZRAM seems a silly idea given its benefits.

Any input appreciated.

Based on the above, as far as I understand it, it should be possible to regularly sync ZRAM content back to the disk, which is also described here: sync-files-to-disk


It is by design that you lose data from zram in the case of a power loss.
I have done some testing by simply unplugging my Pi/SSD from the power supply, and in all of these tests the system came back up and running without problems.
The reason might be that my production system is a dedicated server for openHAB, runs at 0.1% CPU load, and the SSD does not need to do a lot of I/O, so the cache is almost empty before a crash.
But that doesn't necessarily mean there are no problems in other constellations.
Long story short, to answer your question: yes.

RPis do not have a built-in hardware clock. If the machine stays off for too long, I think NTP will refuse to bump the clock all at once, so it could take some time to get back in sync (assuming you have NTP configured in the first place). It might even refuse to bump it at all (I don't remember).
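For what it's worth, if the box runs chrony (an assumption on my part; stock Raspberry Pi OS ships systemd-timesyncd, which does step the clock), the behaviour is governed by the makestep directive:

```conf
# /etc/chrony/chrony.conf (sketch): step the clock instead of slewing it
# when the offset exceeds 1 second, but only during the first 3 updates
# after startup
makestep 1 3
```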

They make HATs for RPis that add a hardware clock, though.

I would use a default state of ON if at all possible for devices like these. As for the lights, :person_shrugging: . A long time ago we had a lengthy discussion about whether it made sense to skip having light switches at all and just use automation. It's situations like this that inform my opinion on the matter. It's simply not feasible for an individual with commodity hardware/software to build a fault-tolerant system with more than one or two 9s of uptime.

Which persistence? I can imagine rrd4j having challenges with a clock change like this but the other ones shouldn’t.

It’s in the openHABian docs. I think it’s once a month IIRC.

  • Stop OH and flush zram periodically
  • Reboot the machine periodically
  • Configure automated backups (e.g. Amanda)
  • Disable zram (but then of course you need to watch for SD card wear)
  • Take frequent off-machine backups (and test the backups)
  • UPS
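One way to cover the off-machine backup item is a nightly rsync of the config and userdata folders. The sketch below runs against temp directories so it is safe to try; the real sources (/etc/openhab, /var/lib/openhab) and the destination are assumptions to substitute:

```shell
# Nightly backup sketch. On a real box SRC would be /etc/openhab and
# /var/lib/openhab, DEST an NFS mount or a user@host: ssh target, and this
# would run from cron or a systemd timer.
SRC="${SRC:-/tmp/oh-backup-demo/src}"
DEST="${DEST:-/tmp/oh-backup-demo/dest}"
mkdir -p "$SRC" "$DEST"
echo "demo persistence data" > "$SRC/demo.rrd"
rsync -a --delete "$SRC/" "$DEST/" 2>/dev/null || cp -a "$SRC/." "$DEST/"  # cp fallback if rsync is absent
ls "$DEST"
```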

You don't have a mapping for "-" in your .map file, nor a default entry, and your Item is NULL or UNDEF.
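For reference, a MAP file can carry both an explicit "-" entry and a default (a line with an empty key) that catches anything unmapped; the file name and labels here are hypothetical:

```properties
# hypothetical status.map
ON=On
OFF=Off
-=Unknown
# empty key = default for any value not listed above
=Unknown
```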

I didn’t know about the icon one.

It should include all your configs and embedded persistence (mapdb and rrd4j)

Depends on the DB. Probably. But rrd4j might have a challenge with the jump in time from the last entry to now. Give it a try and find out (back up what you have now first, of course).

If it's external, then no.

There are lots of ways. What’s better or not depends on specifics we can’t know.

Which version?

Ultimately, configure it so the data is saved in the event of a power loss: frequent off-machine backups, a UPS which allows the machine to gracefully shut down, etc.

Just be aware that when an RPi suddenly loses power while it's actively writing to the card, you can lose not only what's being written but anything sharing that sector of flash. This can, and frequently does, lead to file system corruption: your machine won't even boot when power is restored, and the SD card may be unreadable.

Zram actually mitigates this problem by making sure there isn't much being written to the SD card, meaning it's unlikely a write is taking place when power is lost, thereby preventing file system corruption.

Depending on the vendor, this can be a problem on SSDs too, but they usually have capacitors that allow the SSD to finish a write before power is completely gone.

If you back up those folders it will capture the zram data. I don’t think SD card mirroring captures zram though.


systemctl list-timers will tell you the last and next scheduled runs of the sdrawcopy timer.
Mount the secondary SD card to check. Depending on how old your install is, you might find a recent copy of your persistence data in /<mount point>/var/lib/openhab/persistence, too.
Did you try to boot from that SD card?
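A read-only way to peek at the mirror without booting from it; the device name /dev/sda2 is an assumption, so check the lsblk output first:

```shell
# Find the standby card's partitions first
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT 2>/dev/null || echo "lsblk not available here"

# Then mount the mirror's root partition read-only and inspect the copied
# persistence data (uncomment and adjust the device name on the real system):
# sudo mount -o ro /dev/sda2 /mnt
# ls -l /mnt/var/lib/openhab/persistence
# sudo umount /mnt
```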

You should be (or should have been) doing that independently of ZRAM. I haven't checked in a while, but if you use Amanda it should catch the data in ZRAM, too.

I’ve just added that (untested, in main branch only). It might have worked or not in the past.


Thanks Rich, you’re gold.


NTP

I installed NTP after noticing this.
I have a local NTP server with a GPS clock.

Zigbee

The plugs I have for the fridge freezer and other devices default to OFF.
I have openHAB rules which monitor their state against a 'desired state' Item and, if they don't match, switch them to that state.

Light switches

… or not. Yes, I do remember the conversation, and after years of agonising about it, I have decided to install one light switch per room. All the other lights, however, run on Zigbee.

uptime

I also understand uptime and nines; I used to build and install data centres in a former life :slight_smile: The whole property never failed in 10+ years.
Yesterday I ordered a battery for a small CyberPower UPS, just for the Pi and its current configuration (as an immediate fix).

Persistence

Yes, rrd4j. I have 640 Items persisted. I know I have to revisit and work on what gets persisted. My comment was just to point out: hey, this is what can happen. :slight_smile: With NTP in place, this issue should be gone.
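Narrowing down the 640 Items could look like the sketch below; the group names are hypothetical, and rrd4j needs an everyMinute strategy for its fixed-step archives to chart properly:

```conf
// /etc/openhab/persistence/rrd4j.persist (sketch; group names are made up)
Strategies {
    everyMinute : "0 * * * * ?"
    default = everyChange
}
Items {
    // persist only the groups worth charting instead of every Item
    gTemperatures*, gPower* : strategy = everyMinute, restoreOnStartup
}
```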

SD card mirror

Thanks, will check the docs.

If it's a month old, I might be able to use it to restore the rrd4j files…
… if this is possible, as in rrd4j not discarding them.

ZRAM

I have seen a recent response from Markus Storm on zram sync… but this is not the default setting → feeding my suspicion that any backup will not capture whatever is managed by zram.

Mapping warning

I should have been clearer in my description; while this happened, the issue is that OH does not tell me which map file is the cause.

Image stream error

I have to search for it. I did create an item with that icon, but the icon does not exist.

openHABian config

Which version?

2024-03-16_21:49:00_AEST [openHABian] openHABian configuration tool version: [openHAB3]{2023-12-12T18:59:04+01:00}(8e591be)

prevent data loss, zram or not

From what I have gathered so far (though I have to confirm): ZRAM data is not backed up, meaning I would have to shut down OH, which is simply not feasible. I will try mstormi's suggested sync command via cron. If this works, the problem is solved. If not, ZRAM will have to go, which implies running on an SSD.


Just saw mstormi’s post #7

systemctl list-timers
NEXT                         LEFT                  LAST                         PASSED               UNIT                         A>
Sun 2024-03-17 16:11:34 AEST 7h left               n/a                          n/a                  apt-daily.timer              a>
Sun 2024-03-17 21:29:59 AEST 13h left              Sat 2024-03-16 21:29:59 AEST 10h ago              systemd-tmpfiles-clean.timer s>
Mon 2024-03-18 00:00:00 AEST 15h left              Sun 2024-03-17 00:00:01 AEST 8h ago               atop-rotate.timer            a>
Mon 2024-03-18 00:00:00 AEST 15h left              Sun 2024-03-17 00:00:01 AEST 8h ago               exim4-base.timer             e>
Mon 2024-03-18 00:00:00 AEST 15h left              Sun 2024-03-17 00:00:01 AEST 8h ago               logrotate.timer              l>
Mon 2024-03-18 00:00:00 AEST 15h left              Sun 2024-03-17 00:00:01 AEST 8h ago               man-db.timer                 m>
Mon 2024-03-18 00:00:35 AEST 15h left              Mon 2024-03-11 00:06:24 AEST 6 days ago           fstrim.timer                 f>
Mon 2024-03-18 00:57:09 AEST 16h left              Sun 2024-03-17 00:58:39 AEST 7h ago               amdump-openhab-dir.timer     a>
Mon 2024-03-18 01:56:38 AEST 17h left              Sun 2024-03-17 02:01:17 AEST 6h ago               amandaBackupDB.timer         a>
Mon 2024-03-18 02:01:44 AEST 17h left              Sun 2024-03-17 02:06:02 AEST 6h ago               sdrsync.timer                s>
Mon 2024-03-18 03:28:55 AEST 19h left              Sun 2024-03-17 02:39:19 AEST 5h 44min ago         firemotd.timer               f>
Mon 2024-03-18 06:07:07 AEST 21h left              Sun 2024-03-17 06:44:54 AEST 1h 38min ago         apt-daily-upgrade.timer      a>
Sun 2024-03-24 03:10:19 AEST 6 days left           Sun 2024-03-17 03:10:02 AEST 5h 13min ago         e2scrub_all.timer            e>
Mon 2024-07-01 01:15:00 AEST 3 months 14 days left Mon 2024-01-01 01:15:01 AEST 2 months 15 days ago sdrawcopy.timer              s>

14 timers listed.
Pass --all to see loaded but inactive timers, too.

No… I only have one system, and it needed to get back online as quickly as possible. However, I will create a test system to try this on. For the time being, I have given up on restoring any persistence data.

Well, given the zram uncertainties (that exist for me, at least, at the moment), the backup seems incomplete, based on what is held in zram.
When I take a backup via openhab-cli, OH does not get shut down (for the zram data to be written). Unless the backup process does a 'flush zram data to disk', there won't be a complete backup.


Can someone tell me whether zram data is included when backing up via the openhab-cli tool?


In any case, I do appreciate the responses.

That's a badly put and irrelevant question. It dumps the config, no matter where it's located.
openhab-cli is not a backup tool; it doesn't catch persistence or third-party tools.
That's one of the reasons why you should be using Amanda.

'Badly' is debatable… 'irrelevant': how so?

I go by the book: install openHABian, use its built-in backup, and wonder whether it backs up zram data (a reasonable expectation) or not.

Well, openhab-cli says 'backup'…
However, there is a difference between people who work intimately on a specific area and those who only occasionally use it.
You are right; literally reading this line:

Possible commands:
  backup [--full] [filename]   -- Stores the current configuration of openHAB.

… says ‘config’ not ‘data’…

In any case… taking this:

… at face value, Amanda must either have a mechanism to sync zram to disk, or shut down OH while backing it up.

I tried Amanda in the past, but couldn’t get it to work… and never looked back.

What I seemingly do not understand is how ZRAM works in conjunction with backups.
It is 'tricking' the system into using its in-memory file system for the specified directories. This is a software configuration. How would backup software write to a virtual space if the software creating this virtual file system is not running? Hence my assumption that the backup cannot be restored into it.


As I currently understand it, OH needs to run on an SSD/HDD without zram; then a standard tool like rsync can back it up at any time, and hence a restore of a full backup will restore the system.

Looking at the SD card backup timer set-up, it writes to the SD card every six months. While I may only touch my system every six months (and then major changes happen), this is not often enough to be useful IMHO.

Anyway, keep it coming; I am a life-long learner :slight_smile:

Neither. It backs up the full system with OH running.

I don't understand your thinking there; there's nothing special about it. It's just files, zram or not. openhab-cli doesn't include all of them, such as persistence.
So you need a more comprehensive tool, Amanda that is.

You can manually run it at any time when you have made bigger changes.
There's a second timer that runs every day (doing a 'diff' only) to sync, mainly for persistence and logs.

Probably file an issue. Little stuff like adding the file name to an error message usually gets fixed fast.

I still think Amanda does capture zram. Markus says he just made sure SD card mirroring captures zram.

To put @mstormi's answer another way: openhab-cli is ignorant of what type of file system is involved. It just grabs the current contents of specific folders. It doesn't know or care whether that's zram, an NFS mount, or anything else; it just sees a folder and copies what's in it.

So yes, openhab-cli will save the contents of the zram folders if zram is running.

This is the misunderstanding. Tools like Amanda and openhab-cli just see folders and files. They don't need zram to flush the data in order to back it up, just as openHAB doesn't need it flushed to disk to use zram in the first place. From these tools' perspective they see /etc/openhab and /var/lib/openhab and the files therein. The fact that it's zram is invisible to them.

Zram 'tricks' the backup software too. As long as zram is running, the backup tool just sees some folders and files to back up. If zram isn't running, the backup tools see whatever zram flushed to disk while shutting down last time.

A standard tool like rsync does work with zram.

About the only thing that doesn't work is yanking the power before zram has a chance to flush. At every other time, the fact that you are even using zram is transparent to the software reading and writing the files in zram.

There are a lot of moving parts here that are not being accounted for:

  • if you use Amanda, openhab-cli, rsync, or any other filesystem backup mechanism, it will back up what's in zram as long as zram is running at backup time. It doesn't know, nor need to know, anything about zram; it's completely transparent.

  • when zram shuts down normally it flushes its contents to the SD card so that when it starts up again it can pick up where it left off. In other words, it flushes the contents on shutdown and reads them back on startup. This too is transparent to the backup tools mentioned above; they don't know whether zram is running or not.

  • SD card mirroring isn't backup, per se. It's a way to make a duplicate of your card periodically (every six months, I guess, based on the above) so that in an extreme emergency you can get back up and running by simply swapping the SD cards. It doesn't happen frequently enough to be a viable backup on its own.

Ultimately, you are under the mistaken assumption that zram must flush in order to be backed up, and that's not the case. It only needs to flush to restart itself. Most file-system-focused backup tools don't know, nor would they care, that zram even exists.
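You can see for yourself that zram is just another mounted filesystem; both commands are read-only, and the zram entries will only show up on a box that actually runs zram:

```shell
# Which directories are served from zram right now
df -h 2>/dev/null | grep zram || echo "no zram mounts on this machine"

# Kernel view of the zram devices, where the util-linux tool is installed
zramctl 2>/dev/null || true
```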


You got me; thanks. :slight_smile: Yes… it didn't click that ZRAM is just another file system. I haven't read or checked further, but I assume it is set up as part of the OS launch and OH simply uses it. So whether OH runs or not, ZRAM is there. Yes…

Thank you guys for enlightening me on this topic.

I know it's not popular to say, but everyone should read the openHABian docs before deploying.
While we cannot explain all the details and rationales in there, they’re clear on what the essentials are and how we cover them.

PS: the docs are being rewritten ATM.


I don't mind, and appreciate the nudge… I reckon long-term users (like me) have read the documentation and rarely come back unless they're doing something new. :slight_smile:

Appreciate your work; reading now…