Corrupt file systems every 2-3 months?

I stopped using a Raspberry Pi 2 because I was running short of memory. openHAB, being Java based, put quite a strain on it.

I switched to a PC Engines APU 2 with a 16 GB SSD, running under Ubuntu 14.04. The first SSD failed, and, after the second one had been running for a while, I realised there were periodic disk problems which caused the system to freeze.

So now, periodically, I automatically restart OH in the middle of the night. The disk problems are a thing of the past. I recently had to switch off the system for some major rewiring: it had been running perfectly for 358 days according to uptime.

Steve

Hi Steve. Great idea! I have been using the same board with IPFire for almost 3 years. I think I will try this too :slight_smile:

I had what I thought was corruption, but it was actually my cards getting filled up with logs. I thought I’d just throw that out there in case it might help.

See the discussion about corrupted file systems on the Raspberry Pi 1-3 at

Even when you

  • shut down before pulling the power plug
  • use a good PSU / power plug (min. 2A)
  • use SD cards of known brands

in my experience there is no guarantee the file system doesn’t get corrupted.
With some SD cards (which worked well in other systems), the

apt-get update
apt-get upgrade

after a fresh install triggered enough write activity on my Raspi 2 that the file system got corrupted and the system didn’t come up after a reboot.

Something in the hardware and/or firmware of the Raspberries keeps eating SD cards; whether it is voltage spikes while shutting down or timing problems while writing data to the card, I don’t know.

So my solution was to move the system onto a USB stick and boot from a boot partition on an SD card mounted read-only (of which I have a backup).

BTW: OH on my Synology NAS proved to be too inflexible, and after each DSM (OS) update there was a high risk that OH wouldn’t run anymore.

As already written in the thread, the log files and rrd4j will corrupt every SD card within a couple of months.
Frequent writes are what kill the card, so they have to be prevented.
This can be done via a read-only partition or by mounting the affected folders on tmpfs (a RAM disk).
To prevent data loss on reboot, the data can be persisted on shutdown or periodically, e.g. once a day.
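A minimal sketch of the tmpfs variant (the mount point, size and archive path are assumptions based on a default openHAB 2 layout; adjust to your system):

# /etc/fstab - keep the openHAB log directory in RAM
tmpfs  /var/log/openhab2  tmpfs  nodev,nosuid,size=64m  0  0

#!/bin/sh
# /etc/cron.daily/persist-logs - once a day, copy the RAM-disk contents
# to permanent storage, so a reboot loses at most one day of logs
rsync -a /var/log/openhab2/ /home/openhabian/log-archive/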


Most statements on SD card wearout and file system corruption that you can read on this forum and the internet are missing the important points, so I’ll try to demystify and correct misconceptions with this post. It is constantly updated and meant to serve as a reference for new users. You probably got directed here by me or some other forum responder in response to a question or post of yours.
If you find any of the information contained here to be incorrect, please let me know.
Note I’m assuming you’re using a Raspberry Pi booting off the internal SD reader.
The information may or may not apply to other SBCs or modified RPi setups.

Power

First, file storage corruption can happen when your server loses power while writing to disk - SD cards in particular - because every flash controller provides some caching memory, so not every write command really means that the data was successfully written to the medium. Note I’m not talking about file-system-level handling.
Power losses happen a lot in home automation setups, particularly if you’re in a build phase and you or others are working on the electrical system of the house. Fortunately there’s a simple solution: get a UPS. Most add surge protection, too, and allow you to run the OH server, Internet router and other critical systems on battery for at least a couple of minutes.
If you’re using a RPi, you might use a simple powerbank as a UPS, but make sure to get one that allows charging and powering at the same time - most do not provide this capability.
Also, it must be able to provide the full amount of power the regular power adapter for your RPi provides. The Raspberry Pi Foundation recommends 2.5A for a RPi3 with power-hungry USB peripherals, and for the RPi4 they even had to move to USB-C and supplies that provide 3A. Common supplies are 1A or 2.1A at most. You usually get away with 1A, but you must not forget to factor in all your HATs and USB-attached devices, and remember that you need to size your system for peak power consumption, such as at boot or backup time, not for the average value.
Note that if underpowered, a RPi3 or older will power down the USB chipset among the first components, and Ethernet is connected via USB, so a first symptom of this happening is usually network problems.
You’ll usually get to see ‘under-voltage’ messages in syslog, too, as well as the lightning symbol on the boot screen. The power LED on newer RPis (3, 4) will also flicker off while the supply voltage is insufficient.
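If you suspect an undersized supply, the firmware keeps a throttling flag you can query; a quick check (vcgencmd ships with Raspbian):

# 0x0 means all good; bit 0 = under-voltage right now,
# bit 16 = under-voltage has occurred since boot
vcgencmd get_throttled

# the kernel log usually records these events as well
dmesg | grep -i voltage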

SD and other media

Second, with maybe one exception (see #1 below), there’s no way to increase the reliability of an SD card. They suffer from wearout leading to corruption, and you can do little about it. Even worse, this is not an SD thing but a memory chip thing. The same technology is used in USB sticks, eMMC cards and even SSDs, so the following applies even if you use one of those.
There is some variance w.r.t. error-free runtime - some cards or models or brands are better than others - but all but the cheapest SD cards already do wear levelling to some extent. Read on if you’re interested in the details. There are also cards tagged ‘industrial grade’ that have a larger spare buffer built in.
So do SSDs, but these have a DRAM cache in addition, which effectively results in relatively few writes to flash memory, and that is why they’re far less affected than SD and eMMC.

Unfortunately, all of this ignores the fact that once set up, openHAB keeps writing to the same files again and again, in rapid succession. Wear levelling may not apply, or may simply not be enough, in this case.
For what it’s worth, contrary to what many believe, SD card size is not a good indicator of buffer size: twice the size does not imply twice the buffer. Wear levelling algorithms are proprietary and undisclosed, and you never know how large your safety buffer is (no vendor tells you). And even cards with a large buffer fail at some point in time. Don’t get fooled by advice to buy a specific card because some guy told you he’s been running it without problems. He was just lucky.
And while we’re at it: get an “A1” or even “A2” rated SD card. It’s not more reliable, but it is quite a bit faster than traditional ones under the random-access conditions we have in openHAB(ian).


Advertising break
If you shop on Amazon, use smile.amazon.com and select openHAB foundation to donate to. Thanks !


Either way, selecting a ‘better’ card or ‘proper’ medium is no solution to the corruption problem.
You need to take a complementary measure (#2 below).

There are two really useful things you can do to fight corruption:

  1. reduce write operations (to SD or flash memory in general)
  • Ideally, put persistence, logs and swap into RAM and sync them to a permanent medium.
    You can use any permanent medium (USB stick, SSD or NFS mount on a NAS) to put these on.
    Losing RAM (on reboot) or the medium holding these files is not critical: openHAB usually keeps working, and you can restore them from backup.
    Corruption of the system and of the data you need to keep, on the other hand, is critical.

  • in a nutshell: use ZRAM.
    That’s a RAM disk with compression for swap and the most active directories.
    See this thread.
    I recommend keeping the existing swap as a fallback solution. Note the ZRAM swap is created with a higher priority, so it’s used first.

  • adding an option like commit=60 to /etc/fstab will result in changed files being flushed to disk only every 60 seconds, greatly reducing the number of writes, but note it doesn’t apply to swap or NFS.
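As an illustration, a root entry in /etc/fstab using this option might look like the following (the device name is an assumption; check yours with lsblk). The trade-off is that you can lose up to 60 seconds of data on a power cut, which is why this pairs well with a UPS:

# ext4 root with delayed commits: dirty data is flushed at most
# every 60 seconds instead of the default 5
/dev/mmcblk0p2  /  ext4  defaults,noatime,commit=60  0  1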

Backup

Moving write-intensive files is a small one-time effort and will greatly reduce the risk of a crash caused by SD card corruption, but it won’t fully mitigate it. The worst thing that can then happen is that you lose logs because that medium fails - and eventually this makes OH hang, so you need to restart it.

The much more important point is that offloading logging and persistence by itself is not sufficient. So either way, you also need to

  1. make daily backups
    This will not increase runtime, but it will mitigate the impact of an SD (or USB stick, or USB-attached SSD, or other disk) crash or an accidental admin failure.
    openHABian now comes with Amanda, a professional backup system.

You might be unaware that openHABian is not just an RPi disk image - it is a set of scripts that can be installed on top of any Debian-like UNIX as well. Once you’ve installed these, you don’t have to migrate to an openHABian-based setup - you can choose to install only some of the optional components, such as Amanda.
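For reference, the openHABian docs describe the manual install on an existing Debian roughly like this (a sketch from memory; verify against the current README before running):

# as root on any Debian-like system
git clone https://github.com/openhab/openhabian.git /opt/openhabian
ln -s /opt/openhabian/openhabian-setup.sh /usr/local/bin/openhabian-config
openhabian-config   # then pick only the optional components you want, e.g. Amanda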

Use the new auto backup feature in openHABian to clone your SD card right at installation time or via a menu option. In case of a crash, you just need to exchange cards and you are good to go.

Find below what it effectively does so you can execute it manually as well. But you will forget to do things when they’re not automated, so only ever use this in addition to an automated solution.
The manual way is to attach a USB card writer and clone the SD card. Basically the command to do so is something like dd if=/dev/mmcblk0 of=/dev/sdX. X depends on your HW setup, so it can change when you attach/detach USB devices; fdisk -l will list your current devices.
If you have storage mounted, you can also directly send the output to a file there (of=/path/to/file) so you have a backup.
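If you back up to an image file, it is worth piping through gzip, since free space compresses well; a sketch (the backup path is an assumption):

sudo dd if=/dev/mmcblk0 bs=4M status=progress | gzip > /mnt/backup/sd-$(date +%F).img.gz

# restore onto a replacement card by reversing the pipe, e.g.:
# gunzip -c /mnt/backup/sd-2019-01-01.img.gz | sudo dd of=/dev/mmcblk0 bs=4M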

I would add moving /var/lib/openhab/persistence, in addition to /var/log, to another medium. A mid-to-large-sized system with persistence configured will generate a whole ton of writes as well, perhaps as much if not more than logging, depending on the number of Items and the persistence approach.

Though losing that medium has more of an impact than just losing the logs, particularly if you are using persistence to track long-term data.

Also, moving ‘everything that writes’ isn’t really all that hard. I just had a catastrophic failure of one of my Pis (not SD card related) and rebuilt it as read-only. I’ve written an Ansible playbook (posted below for those interested) to do it for me. I’m not arguing about whether it provides that much benefit (I did it more for security reasons and to learn how to do it), but it isn’t that difficult. NOTE: this Pi is not running openHAB.

My Read Only Jessie Ansible Playbook:
NOTES:

  • Treat this as alpha; it is only partially tested.
  • With a read-only runtime, Tripwire probably doesn’t add anything, so much of that stuff will probably go away.
  • The command that runs the relink.sh script relinks a bunch of /var folders to the new tmpfs /tmp. Sometimes Ansible gets stuck after that, as the networking gets interrupted. I had to put this into a script because sshd refuses new logins after messing with /var/run and /var/lib/dhcp, and each line in an Ansible script is a separate ssh into the target machine.
  • I’ve configured my Pis to remotely rsyslog to my main server (a one-line client config; see the sketch after these notes), so I don’t bother with any of the techniques to periodically write the logs to the SD card.
  • Some of the steps are unnecessary for Jessie Lite, but they were part of the posting I based this on and I’ve not scrubbed them: https://hallard.me/raspberry-pi-read-only/
  • I plan on posting all of my Ansible scripts, which include setting up a bunch of other security-related changes (UFW, Tripwire, config changes), to GitHub at some point when I have a chance to test them more fully.
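The remote rsyslog forwarding mentioned in the notes is a one-line client-side config; a minimal sketch (the server name and port are assumptions):

# /etc/rsyslog.d/90-forward.conf - ship everything to the central log server
# a single @ means UDP; use @@ for TCP
*.* @logserver.local:514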
---
- name: Add aliases and fancy prompt to show status of FS
  blockinfile:
    state: present
    dest: /etc/bash.bashrc
    block: |
      # set variable identifying the filesystem you work in (used in the prompt below)
      set_bash_prompt(){
          fs_mode=$(mount | sed -n -e "s/^\/dev\/.* on \/ .*(\(r[w|o]\).*/\1/p")
          PS1='\[\033[01;32m\]\u@\h${fs_mode:+($fs_mode)}\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
      }

      alias ro='sudo mount -o remount,ro / ; sudo mount -o remount,ro /boot'
      alias rw='sudo mount -o remount,rw / ; sudo mount -o remount,rw /boot'

      # set up the fancy prompt
      PROMPT_COMMAND=set_bash_prompt

- name: Set up /tmp, /var/log, and /var/tmp as tmpfs
  blockinfile:
    state: present
    dest: /etc/fstab
    insertafter: "#   use  dphys-swapfile swap[on|off]  for that"
    block: |
      tmpfs           /tmp            tmpfs   nosuid,nodev         0       0
      tmpfs           /var/log        tmpfs   nosuid,nodev         0       0
      tmpfs           /var/tmp        tmpfs   nosuid,nodev         0       0

- name: Set permissions on /tmp
  file:
    mode: a+rwx
    path: /tmp
    state: directory

- name:  Mount /tmp
  mount:
    name: /tmp
    src: /tmp
    fstype: tmpfs
    state: mounted

- name: Mount /var/log
  mount:
    name: /var/log
    src: /var/log
    fstype: tmpfs
    state: mounted

- name: Mount /var/tmp
  mount:
    name: /var/tmp
    src: /var/tmp
    fstype: tmpfs
    state: mounted

- name: Remap folders to /tmp
  script: relink.sh

- name: Waiting for {{ inventory_hostname }} to come back from reboot
  local_action: wait_for host={{ inventory_hostname }} state=started delay=30 timeout=300
  become: false

- name: Configure boot command line
  replace:
    dest: /boot/cmdline.txt
    regexp: 'otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait'
    replace: 'otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait fastboot noswap ro'

- name: Move the dhcpcd pid file to the tmpfs-backed /var/run
  replace:
    dest: /etc/systemd/system/dhcpcd5
    regexp: '=/run/dhcpcd.pid'
    replace: '=/var/run/dhcpcd.pid'

- name: update fake-hwclock cron job to remount / rw and then mount it back
  copy:
    dest: /etc/cron.hourly/fake-hwclock
    src: fake-hwclock

- name: Check for presence of Tripwire cron job
  stat: path=/etc/cron.daily/tripwire
  register: tripwire_cron

- name: Update cron job to remount rw before running check
  copy:
    dest: /etc/cron.daily/tripwire
    src: tripwire-cron
  when: tripwire_cron.stat.exists == True

- name: Remove some start scripts
  shell: /sbin/insserv -r bootlogs; /sbin/insserv -r console-setup

- name: Set boot FS as readonly
  replace:
    dest: /etc/fstab
    regexp: '/dev/mmcblk0p1  /boot           vfat    defaults          0       2'
    replace: '/dev/mmcblk0p1  /boot           vfat    defaults,ro          0       2'

- name: Set root as readonly
  replace:
    dest: /etc/fstab
    regexp: '/dev/mmcblk0p2  /               ext4    defaults,noatime  0       1'
    replace: '/dev/mmcblk0p2  /               ext4    defaults,noatime,ro  0       1'

- name: Reboot as read only
  include: tasks/reboot.yml
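Note the YAML above is a task list rather than a complete play, so you would include it from a play or use it as a role’s task file; a hypothetical wrapper (the hosts group and file names are assumptions):

# readonly.yml
---
- hosts: pis
  become: true
  tasks:
    - import_tasks: readonly-tasks.yml

# run it with:
# ansible-playbook -i inventory readonly.yml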

relink.sh

#!/bin/bash

rm -rf /var/lock
ln -s  /tmp /var/lock

rm -rf /var/spool
ln -s /tmp /var/spool

rm -rf /var/run
ln -s /tmp /var/run

cp -r /var/lib/dhcp/* /tmp
# remove the directory itself, not just its contents, so the symlink
# below becomes /var/lib/dhcp -> /tmp rather than /var/lib/dhcp/tmp
rm -rf /var/lib/dhcp
ln -s /tmp /var/lib/dhcp

fake-hwclock: the main addition is a remount of / as rw so it can write, then remounting it back to ro

#!/bin/sh
#
# Simple cron script - save the current clock periodically in case of
# a power failure or other crash

if (command -v fake-hwclock >/dev/null 2>&1) ; then
  mount -o remount,rw /
  fake-hwclock save
  mount -o remount,ro /
fi

tripwire-cron: same as above, remount rw, then back to ro when done

#!/bin/sh -e

mount -o remount,rw /

tripwire=/usr/sbin/tripwire

[ -x $tripwire ] || exit 0

umask 027

$tripwire --check --quiet --email-report

mount -o remount,ro /

Agreed, I edited my post.

To be honest, I am not a pro, so maybe someone can help me with concrete steps. What I have done so far:

  • Installed Ubuntu 16 on a new PC Engines APU 2 on the 16 GB SSD
  • Created an NFS share on the Synology
  • Mounted the NFS share on the APU2 as /mnt/openhab (fstab)
  • Installed openHAB 1.8.3 (so far without autostart)

What I plan to do but don’t know how:

  • Store the data from rrd4j to /mnt/openhab/rrd4j
  • Store the logs in a tmpfs location (how do I create this at startup and mount it to /var/log?)
  • Create a script to compress the logs from /var/log and copy them to /mnt/openhab/logs
  • Run the script every day at 24:00

I can use Google, but I don’t know how to start :slight_smile:

Thanks for any hint

Some standard Ubuntu? If you are only using this box for OH, I would disable virtual memory.
Apart from that, something along the lines of Rich’s suggestions. 16 GB isn’t very much to wear-level, so I suspect it might cause some problems at some point in the distant future.
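For the tmpfs mount and nightly archive job from your list, a minimal sketch (the size, script path and schedule are assumptions; /mnt/openhab is the NFS share from your post):

# /etc/fstab - mount /var/log in RAM at boot
tmpfs  /var/log  tmpfs  nodev,nosuid,size=100m  0  0

#!/bin/sh
# /usr/local/bin/archive-logs.sh - compress the logs and copy them to the NFS share
tar -czf /mnt/openhab/logs/logs-$(date +%F).tar.gz -C /var/log .

# /etc/crontab entry - run the script every night at midnight
0 0 * * * root /usr/local/bin/archive-logs.sh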

Yes, it should be a standard Ubuntu. But since I had to install a package called policykit-1 before I was able to run it as a service, I am not so sure anymore. On top of that, it’s not working even though it tells me it’s running:
openhab@openhab:~$ service openhab status
● openhab.service - LSB: OpenHAB Daemon
Loaded: loaded (/etc/init.d/openhab; bad; vendor preset: enabled)
Active: active (exited) since Tue 2016-08-16 21:07:15 CEST; 4min 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 1919 ExecStart=/etc/init.d/openhab start (code=exited, status=0/SUCCESS)

Aug 16 21:07:15 openhab systemd[1]: Starting LSB: OpenHAB Daemon…
Aug 16 21:07:15 openhab openhab[1919]: * Starting Open Home Automation Bus Daemon
Aug 16 21:07:15 openhab openhab[1919]: …done.
Aug 16 21:07:15 openhab systemd[1]: Started LSB: OpenHAB Daemon.

But the log is empty, HTTP is not running, nothing is OK!

If I start it with start.sh, everything is fine?!

I tried every step-by-step possibility to run it as a service. Do I have to install other packages?
I only copied my backup to /opt/openhab

BTW, I use this box only for OH, and as it has 4 GB of RAM I also disabled the virtual memory.

Any reason you are not using apt-get to install?

If you had, it would have handled all this for you.

When doing it by hand, particularly if you don’t know what you are doing, there is so much that could have gone wrong that I can’t even begin to guess what went wrong.

OK, I installed using apt-get. Same result. I think I will install a new version of Ubuntu with all options enabled. Another 3 days lost :frowning:

The only thing I don’t understand is: why can everything be OK when I start OH in console mode, while the system is not working as a service?

When you run it from the console you are running it as a specific user (either yourself or root, depending on how you run it), which includes an entire configured environment: shell, permissions, environment variables, etc. When it runs as a service, the process gets none of that. When you run it as a service properly (i.e. as a non-privileged user) you don’t even have a shell to rely upon.

Questions:

  • Which version of Ubuntu are you running? Since 15 the startup system moved from upstart to systemd. The two are very different in how they manage starting services when Linux boots.
  • I’ve not tried to install or use PolicyKit before, but I could see it potentially messing everything up if you just installed it without then configuring it to allow the openhab user access to the resources it needs to run openHAB. I recommend installing openHAB and getting it working first, and only trying PolicyKit later, once you understand how to configure it.
  • The fact that it didn’t work with apt-get makes me think either you messed something up with systemd previously or PolicyKit is messing things up.

Finally! I installed a new version of the OS (Ubuntu 16.04) and followed the steps from the tutorial again. I made the NFS shares on the Synology, mounted them to /mnt/openhab and created two symbolic links (for logs and persistence).

Now all seems to be OK :slight_smile:

Dear Spaceman, could you please post this script?

My file system got corrupted on day 1!

After copying the Raspbian image, an apt-get upgrade and a reboot, the file system was corrupted. But after formatting the SD card I could copy the Raspbian image again without fault. So this was clearly not a case of wear levelling in the SD card, but a failure of the OS or firmware to keep the file system in order.

Some SD cards work for months, some fail almost immediately, even from the same brand. So this must have to do with the low-level writing procedure, in my opinion. But since I cannot fix it myself and the setup is always in danger of getting lost, I keep on with my workaround of using a USB stick (and placing some directories on tmpfs as well).

I had a similar issue, a couple of times. Usually it happened when powering off the Raspberry the not-so-nice way (power outage etc.).

My solution finally was to put the whole Raspbian installation on my Synology NAS and even boot the Raspi via NFS. Since then I have had zero issues. Of course a NAS could often run openHAB directly as well, which might also be a good option. In my case it was not, because the NAS is located in the cellar and my Z-Wave signal was very poor from that location. The Raspi was therefore much easier to place in a more central location…

There are plenty of articles on how to boot the RPi via NFS, e.g.
http://blogs.wcode.org/2013/09/howto-netboot-a-raspberry-pi/
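The core of the setup is the kernel command line pointing the root filesystem at the NFS export; a sketch (the server IP and export path are assumptions):

# /boot/cmdline.txt (all on one line) - root on NFS instead of the SD partition
console=tty1 root=/dev/nfs nfsroot=192.168.1.10:/volume1/rpi-root,vers=3 ip=dhcp rootwait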

You could put the controller on your Pi and then use ser2net to let your NAS access it across the network.
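ser2net needs just one line per device; a sketch in the classic ser2net.conf format (the TCP port, device path and baud rate are assumptions for a typical Z-Wave stick):

# /etc/ser2net.conf - expose the serial controller on TCP port 3333
# format: <tcp-port>:<mode>:<idle-timeout>:<device>:<options>
3333:raw:0:/dev/ttyACM0:115200 8DATABITS NONE 1STOPBIT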

You could then make your Raspberry Pi completely read-only while running.

Lots of options…

… or buy a USB SSD hard disk and boot without the use of any SD card: