Corrupt FileSystems every 2-3 month?

rlkoshak · August 12, 2016, 3:57pm

I would also move /var/lib/openhab/persistence, in addition to /var/log, to another medium. A mid to large sized system with persistence configured produces a whole ton of writes as well, perhaps as many as logging or more, depending on the number of Items and the persistence approach.

Though losing that medium has more of an impact than just losing the logs, particularly if you are using persistence to track long-term data.
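
As a rough illustration of moving a write-heavy directory such as the persistence folder to another medium, here is a minimal sketch. The function name relocate_dir and the /mnt/usb target are assumptions made for the example, not something from this thread; stop openHAB before running anything like this.

```shell
#!/bin/sh
# Sketch: relocate a write-heavy directory to another medium and leave a
# symlink behind. The target path in the example is a hypothetical mount point.

relocate_dir() {
    src=$1      # e.g. /var/lib/openhab/persistence
    dest=$2     # e.g. /mnt/usb/openhab/persistence (assumed mount point)

    # Nothing to do if the source is already a symlink.
    [ -L "$src" ] && return 0

    mkdir -p "$dest" || return 1
    # Copy the contents (including dotfiles), preserving ownership and modes,
    # then remove the original directory.
    if [ -d "$src" ]; then
        cp -a "$src"/. "$dest"/ || return 1
        rm -rf "$src" || return 1
    fi
    # Leave a symlink at the old location so openHAB keeps using the same path.
    ln -s "$dest" "$src"
}

# Example (as root, with openHAB stopped):
#   relocate_dir /var/lib/openhab/persistence /mnt/usb/openhab/persistence
```

Using a symlink keeps the openHAB configuration untouched; the trade-off is that openHAB will fail to write persistence data if the other medium is not mounted at boot.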

Also, moving ‘everything that writes’ isn’t really all that hard. I just had a catastrophic failure of one of my Pis (not SD card related) and just rebuilt it as read only. I’ve written an Ansible playbook (posted below for those interested) to do it for me. I’m not arguing about whether it provides that much benefit (I did it more for security reasons and to learn how to do it) but it isn’t that difficult. NOTE: this Pi is not running openHAB.

My Read Only Jessie Ansible Playbook:
NOTES:

Treat as Alpha, it is only partially tested.
With a read only runtime Tripwire probably doesn’t add anything so much of that stuff will probably go away
The relink.sh script relinks a bunch of /var folders to the new tmpfs /tmp. Sometimes Ansible gets stuck after that as the networking gets interrupted. I had to put this into a script because sshd refuses new logins after messing with /var/run and /var/lib/dhcp, and each task in an Ansible playbook is a separate ssh into the target machine
I’ve configured my Pis to remotely rsyslog to my main server so I don’t bother with any of the techniques to periodically write the logs to the SD card
Some of the steps are unnecessary for Jessie Lite but they were part of the posting I based this off of and I’ve not scrubbed it: Protect your Raspberry PI SD card, use Read-Only filesystem – Charles's Blog
I plan on posting all of my Ansible scripts, which include setting up a bunch of other security related changes (UFW, Tripwire, config changes), to GitHub at some point when I have a chance to more fully test them.

---
- name: Add aliases and fancy prompt to show status of FS
  blockinfile:
    state: present
    dest: /etc/bash.bashrc
    block: |
      # set variable identifying the filesystem you work in (used in the prompt below)
      set_bash_prompt(){
          fs_mode=$(mount | sed -n -e "s/^\/dev\/.* on \/ .*(\(r[wo]\).*/\1/p")
          PS1='\[\033[01;32m\]\u@\h${fs_mode:+($fs_mode)}\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
      }

      alias ro='sudo mount -o remount,ro / ; sudo mount -o remount,ro /boot'
      alias rw='sudo mount -o remount,rw / ; sudo mount -o remount,rw /boot'

      # setup fancy prompt
      PROMPT_COMMAND=set_bash_prompt

- name: Set up /tmp, /var/log, and /var/tmp as tmpfs
  blockinfile:
    state: present
    dest: /etc/fstab
    insertafter: '#   use  dphys-swapfile swap\[on\|off\]  for that'
    block: |
      tmpfs           /tmp            tmpfs   nosuid,nodev         0       0
      tmpfs           /var/log        tmpfs   nosuid,nodev         0       0
      tmpfs           /var/tmp        tmpfs   nosuid,nodev         0       0

- name: Set permissions on /tmp
  file:
    mode: a+rwx
    path: /tmp
    state: directory

- name:  Mount /tmp
  mount:
    name: /tmp
    src: /tmp
    fstype: tmpfs
    state: mounted

- name: Mount /var/log
  mount:
    name: /var/log
    src: /var/log
    fstype: tmpfs
    state: mounted

- name: Mount /var/tmp
  mount:
    name: /var/tmp
    src: /var/tmp
    fstype: tmpfs
    state: mounted

- name: Remap folders to /tmp
  script: relink.sh

- name: Waiting for {{ inventory_hostname }} to come back from reboot
  local_action: wait_for host={{ inventory_hostname }} state=started delay=30 timeout=300
  become: false

- name: Configure boot command line
  replace:
    dest: /boot/cmdline.txt
    regexp: 'otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait'
    replace: 'otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait fastboot noswap ro'

- name: Move dhcpcd PID file to temp
  replace:
    dest: /etc/systemd/system/dhcpcd5
    regexp: '=/run/dhcpcd\.pid'
    replace: '=/var/run/dhcpcd.pid'

- name: update fake-hwclock cron job to remount / rw and then mount it back
  copy:
    dest: /etc/cron.hourly/fake-hwclock
    src: fake-hwclock

- name: Check for presence of Tripwire cron job
  stat: path=/etc/cron.daily/tripwire
  register: tripwire_cron

- name: Update cron job to remount rw before running check
  copy:
    dest: /etc/cron.daily/tripwire
    src: tripwire-cron
  when: tripwire_cron.stat.exists == True

- name: Remove some start scripts
  shell: /sbin/insserv -r bootlogs; /sbin/insserv -r console-setup

- name: Set boot FS as readonly
  replace:
    dest: /etc/fstab
    regexp: '/dev/mmcblk0p1  /boot           vfat    defaults          0       2'
    replace: '/dev/mmcblk0p1  /boot           vfat    defaults,ro          0       2'

- name: Set root as readonly
  replace:
    dest: /etc/fstab
    regexp: '/dev/mmcblk0p2  /               ext4    defaults,noatime  0       1'
    replace: '/dev/mmcblk0p2  /               ext4    defaults,noatime,ro  0       1'

- name: Reboot as read only
  include: tasks/reboot.yml
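
Once the Pi has rebooted, a quick check that / and /boot really came up read only can be scripted. The following is a minimal sketch and not part of the playbook above; the function name fs_mode and its optional second argument (a mounts table to read instead of /proc/mounts) are assumptions made for the example.

```shell
#!/bin/sh
# Print "ro" or "rw" for a mount point, reading a /proc/mounts style table.
# The optional second argument (table path) exists only to make testing easy.
fs_mode() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    # Fields: device mountpoint fstype options dump pass.
    # The last matching line wins, since later mounts shadow earlier ones.
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = ""
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
        }
        END { if (mode != "") print mode; else exit 1 }
    ' "$table"
}

# Example:
#   fs_mode /       # prints the current mode of the root filesystem
#   fs_mode /boot
```

The function returns a non-zero status when the mount point is not found, so it can also be used in an Ansible command task with a failed_when condition if desired.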

relink.sh

#!/bin/bash

rm -rf /var/lock
ln -s  /tmp /var/lock

rm -rf /var/spool
ln -s /tmp /var/spool

rm -rf /var/run
ln -s /tmp /var/run

cp -r /var/lib/dhcp/* /tmp
# remove the directory itself, otherwise ln creates the link inside it
rm -rf /var/lib/dhcp
ln -s /tmp /var/lib/dhcp

fake-hwclock: the main addition is a remount of / as rw so it can write then mounting it back to ro

#!/bin/sh
#
# Simple cron script - save the current clock periodically in case of
# a power failure or other crash

if (command -v fake-hwclock >/dev/null 2>&1) ; then
  mount -o remount,rw /
  fake-hwclock save
  mount -o remount,ro /
fi

tripwire-cron, same as above, mount rw then as ro when done

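#!/bin/sh -e

tripwire=/usr/sbin/tripwire

# Bail out before touching the mount if tripwire is not installed
[ -x "$tripwire" ] || exit 0

umask 027

# Remount / read-write for the check. The trap restores read-only on exit,
# including when tripwire returns non-zero under "sh -e".
mount -o remount,rw /
trap 'mount -o remount,ro /' EXIT

"$tripwire" --check --quiet --email-report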
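
Both cron scripts above follow the same pattern: remount / read-write, do the work, remount read only. A minimal sketch of a reusable wrapper for that pattern follows. The name with_rw_root and the overridable MOUNT variable are assumptions made for the example (the variable is only there so the function can be tried without root).

```shell
#!/bin/sh
# Run a command with / temporarily remounted read-write, then always
# return to read-only and pass the command's exit status through.
# MOUNT is overridable so the function can be exercised without root.
MOUNT=${MOUNT:-mount}

with_rw_root() {
    "$MOUNT" -o remount,rw / || return 1
    "$@"
    status=$?
    # Remount read-only regardless of whether the command succeeded.
    "$MOUNT" -o remount,ro /
    return "$status"
}

# Example, as root:
#   with_rw_root fake-hwclock save
```

Capturing the exit status before the second remount means a failing command still leaves the filesystem read only, which is the property the hand-written scripts have to take care of individually.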

mstormi · August 12, 2016, 4:12pm

Agreed, I edited my post.

aco · August 13, 2016, 7:03pm

To be honest, I am not a pro, so maybe someone can help me with concrete steps. What I have done so far:

Installed Ubuntu 16 on a new PC Engines APU 2 on the 16 GB SSD
Created an NFS share on the Synology
Mounted the NFS share on the APU2 as /mnt/openhab (fstab)
Installed openHAB 1.8.3 (so far without autostart)

What I plan to do but do not know how:

Store the data from rrd4j in /mnt/openhab/rrd4j
Store the logs in a tmpfs location (how do I create this at startup and mount it at /var/log?)
Create a script to compress the logs from /var/log/ and copy them to /mnt/openhab/logs
Run the script every day at midnight

I can use Google, but I don’t know where to start

Thanks for any hint
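
The log part of the plan above could look roughly like the following sketch. The fstab line mirrors the tmpfs entries in the playbook earlier in this thread; the cron entry, the script path /usr/local/bin/archive-logs.sh and the function name archive_logs are assumptions made for the example.

```shell
#!/bin/sh
# Sketch for the log plan above. The cron entry and script name are
# illustrative assumptions.
#
# /etc/fstab, to keep logs in RAM (contents are lost on every reboot):
#   tmpfs  /var/log  tmpfs  nosuid,nodev  0  0
#
# root crontab, to archive every day at midnight:
#   0 0 * * * /usr/local/bin/archive-logs.sh

archive_logs() {
    src=$1     # directory holding the live logs, e.g. /var/log
    dest=$2    # archive directory on the NFS share, e.g. /mnt/openhab/logs
    stamp=$(date +%Y-%m-%d)

    mkdir -p "$dest" || return 1
    # Pack the whole log directory into one dated, gzip-compressed tarball.
    tar -czf "$dest/logs-$stamp.tar.gz" -C "$src" . || return 1
}

# The cron script would end with a call such as:
#   archive_logs /var/log /mnt/openhab/logs
```

Anything logged between the last archive run and a crash or power loss is gone with this approach, so it trades log completeness for fewer writes to the SSD.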

vespaman · August 13, 2016, 10:54pm

Some standard Ubuntu? If you are only using this box for OH, I would disable virtual memory.
Apart from that, something along the lines of Rich’s suggestions. 16GB isn’t very much to wear-level, so I suspect it might cause some problems some time in the distant future.

aco · August 16, 2016, 7:23pm

Yes, it should be a standard Ubuntu. But since I had to install a package called policykit-1 before I was able to run it as a service, I am not sure anymore. On top of that, it’s not working even though it tells me it’s running:
openhab@openhab:~$ service openhab status
● openhab.service - LSB: OpenHAB Daemon
Loaded: loaded (/etc/init.d/openhab; bad; vendor preset: enabled)
Active: active (exited) since Tue 2016-08-16 21:07:15 CEST; 4min 3s ago
Docs: man:systemd-sysv-generator(8)
Process: 1919 ExecStart=/etc/init.d/openhab start (code=exited, status=0/SUCCESS)

Aug 16 21:07:15 openhab systemd[1]: Starting LSB: OpenHAB Daemon…
Aug 16 21:07:15 openhab openhab[1919]: * Starting Open Home Automation Bus Daemon
Aug 16 21:07:15 openhab openhab[1919]: …done.
Aug 16 21:07:15 openhab systemd[1]: Started LSB: OpenHAB Daemon.

But the log is empty, HTTP is not running, nothing is ok!

If I start with the start.sh all is fine?!

I tried all the step-by-step options for running it as a service. Do I have to install other packages?
I have only copied my backup to /opt/openhab

btw, I use this box only for OH and as it has 4 GB RAM I also disabled the virtual memory.

rlkoshak · August 16, 2016, 7:31pm

Any reason you are not using apt-get to install?

If you had, it handles all this for you.

When doing it by hand, particularly if you don’t know what you are doing, there is so much that could have gone wrong that I can’t even begin to guess what went wrong.

aco · August 16, 2016, 8:37pm

Ok, installed using apt-get. Same result. I think I will install a new version of ubuntu with all options enabled. Another 3 days lost

The only thing I do not understand is: why is everything ok if I start OH in console mode, while it does not work as a service?

rlkoshak · August 16, 2016, 9:01pm

When you run it from the console you are running it as a specific user (either yourself or root depending on how you run it) which includes an entire configured environment including shell, permissions, environment variables, etc. When it runs as a service the process gets none of that. When you run it as a service properly (i.e. as a non-privileged user) you don’t even have a shell to rely upon.
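
One way to see that difference in practice is to dump the environment in both contexts and compare the variable names. The following is a minimal sketch; the function name env_missing and the /tmp file names are assumptions made for the example, and values that span several lines will confuse it.

```shell
#!/bin/sh
# List variable names that appear in the first environment dump but not in
# the second. Dumps are plain `env` style output, one NAME=value per line.
env_missing() {
    a=$(mktemp) && b=$(mktemp) || return 1
    cut -d= -f1 "$1" | sort -u > "$a"
    cut -d= -f1 "$2" | sort -u > "$b"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (replace PID with the process ID of the running service):
#   env > /tmp/console.env
#   sudo cat /proc/PID/environ | tr '\0' '\n' > /tmp/service.env
#   env_missing /tmp/console.env /tmp/service.env
```

Variables that show up only in the console dump (JAVA_HOME or a customised PATH, for instance) are candidates for what the service is missing.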

Questions:

Which version of Ubuntu are you running? Since 15.04 the startup system moved from upstart to systemd. The two are very different in how they manage starting services when Linux boots.
I’ve not tried to install or use policyKit before, but I could see it potentially messing everything up if you just installed it without then configuring it to allow the openhab user access to the resources it needs to run openHAB. I recommend installing openHAB and getting it working first, and only then installing policyKit, once you understand how to configure it.
The fact that it didn’t work with apt-get makes me think either you messed something up with systemd previously or policykit is messing things up.

aco · August 17, 2016, 8:24pm

Finally! I installed a new version of the OS (Ubuntu 16.04), and followed the steps from the tutorial again. Made the NFS shares on the Synology, mounted them to /mnt/openhab and created symbolic links (for logs and persistence)

Now all seems to be ok

nattila1 · August 18, 2016, 8:48am

Dear Spaceman could you please post this script?

greenoid · October 3, 2016, 5:23pm

My file system got corrupted on day 1!

After copying the Raspbian image, running apt upgrade and rebooting, the file system was corrupted. But after formatting the SD card I could copy the Raspbian image again without fault. So this was clearly not a case of wear-leveling in the SD card but a failure of the OS or firmware to keep the file system in order.

Some SD cards work for months, some fail almost immediately, even from the same brand. So in my opinion this must have to do with the low-level writing procedure. But since I cannot fix it myself and the setup is always in danger of getting lost, I keep on with my workaround of using a USB stick (and placing some directories on tmpfs as well).

reyem · October 3, 2016, 6:13pm

Had a similar issue a couple of times. Usually it happened when powering off the Raspberry the not so nice way (power outage etc.)

My solution finally was to put the whole Raspbian installation on my Synology NAS and even boot the Raspi from NFS. Since then I have had 0 issues. Of course a NAS could often run openHAB directly as well, which might also be a good option. In my case it was not, because the NAS is located in the cellar and my Z-Wave signal was very poor from this location. The Raspi was therefore much easier to place in a more central location…

There are plenty of articles on how to boot the RPi via nfs. E.g.
http://blogs.wcode.org/2013/09/howto-netboot-a-raspberry-pi/

rlkoshak · October 3, 2016, 6:37pm

You could put the controller on your Pi and then use ser2net to let your NAS access it across the network.

You could then make your Raspberry Pi completely Read Only while running.

Lots of options…

sihui · October 3, 2016, 8:06pm

… or buy a usb ssd harddisk and boot without the use of any sd card:

greenoid · October 5, 2016, 6:11pm

Only the Raspi 3 is able to boot without any SD card, for example from USB. For the Raspi 1 & 2 the best option is a small read-only boot partition on an SD card.

mstormi · October 27, 2017, 12:51pm

And I just added the recommendation to move the swapfile, too.

summx · October 31, 2017, 5:02pm

I wish my Raspi 3 would go more than 3 weeks! It has now killed my USB thumb drive and I can’t help but suspect something is wrong with the Pi itself. As it stands, it is totally unusable for home automation and data logging (weather and whatnot) in my case

Saracen · November 2, 2017, 10:55am

My USB flash drive is also corrupt and totally unreadable.

Once I get a new USB stick, at what point in the install process do I copy my backup conf and user data folders to the stick?

I tried without a USB stick (copying to the SD card) but everything broke so I’m starting again.

Thanks!

Lionello_Marrelli · November 5, 2017, 5:05pm

I also experienced corruption of the Pi 3 SD filesystem: I’ve been running openHAB on the Raspberry Pi 3 for a year and a half and corruption happened two or three times. At first I blamed the SD cards but I noticed that the corruption was very likely related to improper shutdown due to power outages.

As I wish to keep a very low power solution relying on the Pi itself, I decided to use a power bank as a UPS (selecting a power bank that simultaneously supplies the Pi and recharges its own lithium batteries).
Moreover I implemented a daily SD image rsync backup to my Synology NAS.

The UPS+backup system has been in place for one month and no corruption has occurred so far, but even if it happens I can restore the previous day’s image in a few minutes.
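
The post does not give the backup command itself. As a rough illustration of a daily rsync backup to a NAS, here is a minimal sketch of a file-level variant (it copies the files of the running system, not a block image of the card). The NAS destination, the exclude list, the function name backup_root and the overridable RSYNC variable are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch of a daily file-level backup to a NAS. The destination and exclude
# list are assumptions; the original post does not show its command.
# RSYNC is overridable so the function can be tried without a real NAS.
RSYNC=${RSYNC:-rsync}

backup_root() {
    dest=$1    # e.g. backup@nas.local:/volume1/pi-backup/ (hypothetical)
    # -a archive mode, -A ACLs, -X extended attributes, -H hard links.
    # --delete removes files on the NAS that no longer exist on the Pi.
    "$RSYNC" -aAXH --delete \
        --exclude=/proc/ --exclude=/sys/ --exclude=/dev/ \
        --exclude=/run/ --exclude=/tmp/ --exclude=/mnt/ \
        / "$dest"
}

# Example root crontab entry for a run at 03:00 every night, assuming the
# function is wrapped in a script at this (hypothetical) path:
#   0 3 * * * /usr/local/bin/pi-backup.sh
```

Restoring from a file-level copy means writing a fresh image first and then syncing the files back, which takes longer than restoring a block image but keeps each nightly transfer small.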

Saracen · November 10, 2017, 2:00pm

I’ve bought a powerbank and I’ll give it a go.

Did you put any software on your Pi to manage it? A chap here has written some code which computes battery-time remaining. I can see it being useful to trigger a script to let you know when and if the battery back-up is being utilised.