Disaster recovery? Can you restore your OH in minutes?

Max_G · May 18, 2017, 12:35pm

Given some of the very elaborate system configurations (many hours, days, even months of configuration effort), have you thought about a hardware fault, e.g. the rPi dies? Now what? How do you restore your OH server onto new (yet identical) hardware, here a rPi?

My wife is scared by the thought that the house goes down (as in the OH hardware dies) and becomes unusable.

I have two rPis:

1. provides proxy, DHCP, DNS, mail services
2. runs OH, mosquitto

My idea is to back up one to the other. (Not sure how yet.)
Or are there better methods? E.g. VM type scenarios?

sihui · May 18, 2017, 12:56pm

Works like a charm:

https://www.linux-tips-and-tricks.de/en/backup

Max_G · May 18, 2017, 1:00pm

Thanks sihui… though I am reading: “All running services should be stopped before the backup is started and restarted when the backup finished to get a consistent backup.”

sihui · May 18, 2017, 1:02pm

I’ve been using raspiBackup for two years and never stop ANY services (openHAB, influxdb, …) during nightly backups.
Back in the old days when I was using SD cards I had to restore my Pi twice, and it worked without any problems.
Now I’m using a USB SSD (no SD card needed at all) and I’m still backing up with this nice piece of software.
Hopefully I won’t need the restore function again.

rlkoshak · May 18, 2017, 6:30pm

One of the advantages of putting my configs into a git server and running OH in docker is that I’m a git clone and docker run away from having a new install. In fact, my upgrade scripts do the same.

I’ve automated it all using Ansible, which I gotta say I’m loving. It is far easier to learn, set up, and use than Chef or Puppet, and in the end my entire collection of server and raspi configs is just some more code checked into git.

I’m putting a lot of this to the test right now, as you know, and it is working well. I should have my OH back up this weekend now that I’ve got my bootstrap machines (Gogs and a desktop VM) back up and running. It is hard to make a lot of progress working 15-30 minutes a day.

However, it doesn’t completely solve the failover situation you are looking at, but it could. Basically, I’d have hot spares for your Pis that are online but only minimally installed and configured, with sshd, ssh certs, and python (all required for Ansible). In the event of a failure, I’d run some Ansible playbooks (could be automated but I wouldn’t) to configure one or both of the spares and update the DNS to point to the new Pi(s).

The big gotcha is dealing with hardware like the zwave dongle, which you would either have to transfer manually or set up on some other machine and expose to OH using ser2net and socat or the like.

Of course, I’m mainly talking theoretically here. I’ve done this sort of thing at work before and the approach is sound. However, I’ve not yet built up my new environment to practice what I preach. But the main thrust of the theory is that everything that goes into configuring a server gets captured in code, so all you really need to back up is this code and your data.

mstormi · May 18, 2017, 8:36pm

IMHO the best method is to have a raw dump of the SD card ready. That way you can’t miss any program or file that you might forget if you had to list it in a backup tool. Get an external card writer, attach it to your primary Pi and do dd if=/dev/mmcblk0 of=/dev/sdX after significant changes or nightly via cron.
In case of a breakdown, exchange the SD card and boot. Keeping volatile data (logs, persistence DB) on an external medium (USB stick, NAS) helps but is not a must.
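
For anyone who would rather script the raw dump than call dd by hand, here is a minimal Python sketch of the same idea: copy the card in chunks, then read the copy back and compare checksums. The function names (copy_image, sha256_of, copy_and_verify) and the chunk size are assumptions made up for illustration; they are not part of any tool mentioned in this thread. Reading a block device needs root, and a card that is mounted read-write while it is copied can still contain files caught mid-write.

```python
import hashlib
import os

CHUNK = 4 * 1024 * 1024  # read 4 MiB at a time (arbitrary choice)


def copy_image(src, dst):
    """Copy src to dst in chunks; return (sha256 hex digest, bytes copied)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(CHUNK):
            fout.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # make sure the data really hit the medium
    return digest.hexdigest(), copied


def sha256_of(path, limit):
    """Hash only the first `limit` bytes, since the target may be larger."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(CHUNK, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Return True if the copy reads back identical to what was written."""
    written_digest, copied = copy_image(src, dst)
    return sha256_of(dst, copied) == written_digest
```

A call would look like copy_and_verify("/dev/mmcblk0", "/dev/sdX"), with the same device names as in the dd command above. Double-check the target before running it, because everything on it is overwritten.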

EDIT: I have added the Amanda backup tool to openHABian. It lets you back up and restore selected files from multiple machines, in parallel to backing up the raw SD cards. That way you can restore any system state from any date in the past and/or create an SD card clone.

PS: you also need to take care to back up your zwave controller data and any gateway you might be running, but obviously there is no generally applicable manual on how to accomplish that.

Brian · May 18, 2017, 11:06pm

I think about this a lot - but it seems that’s all I do. I have yet to put any automated plan into action - it’s all very manual. I back up my configurations, scripts, etc. bi-monthly to flash drives that live off our property.
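
A manual routine like this could be scripted fairly easily. Below is a rough Python sketch that packs a list of directories into one timestamped .tar.gz file, which could then be copied to a flash drive. The function name archive_configs, the archive naming scheme and the example paths are all assumptions for illustration; adjust them to wherever your configs actually live.

```python
import tarfile
import time
from pathlib import Path


def archive_configs(sources, dest_dir, stamp=None):
    """Pack every existing path in `sources` into one timestamped archive.

    Returns the path of the archive that was written.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"oh-config-{stamp}.tar.gz"  # hypothetical naming scheme
    with tarfile.open(target, "w:gz") as tar:
        for src in map(Path, sources):
            if src.exists():
                # Store each source under its own directory name.
                tar.add(src, arcname=src.name)
    return target
```

For example, archive_configs(["/etc/openhab2", "/var/lib/openhab2"], "/media/usb/backups") would be one way to call it, assuming those are the config and userdata directories on your install and that the flash drive is mounted at the (made-up) path /media/usb.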

I can also point you to two threads:

A thread about fault tolerance
A thread I had started along these lines as well that generated some very good ideas - ones that I really need to work on!

I’ll be watching this thread too!

Max_G · May 19, 2017, 12:39am

The problem here is that most have a different approach to anything… a different drive, a different skill set, etc.

The background for my post was a blown power supply on my PC at 21:30 Wednesday night. I live in the country, got onto a 24/7 IT service, had a replacement power supply in 30 minutes, and was up and running in 1 hour; all systems go an hour later.

This event shook me to the core (I am still feeling the stress today, entirely a mental thing) and drove home how vulnerable life is, given the technology (literally) creeping up on us.

Thinking about this earlier in the week, I started a wiki (based on DokuWiki), documenting systems that go beyond the obvious, e.g. a light switch.
It turned out that the task will cost me many weeks/months to get to basic documentation.
Here is a snapshot after a week of writing…

Apart from DokuWiki being the easiest and most efficient software I have used in decades (really), the exercise revealed how much there is in these systems that cannot be “fixed” by the average tradesperson, let alone my wife, if I am not around.

What I am thinking of is two rPis in a fault-tolerant failover set-up. If one dies, the other takes over; the dead one is replaced; happy to put a sort of boot/reconnect file on it, fire it up and it’s fixed. My wife can do this!
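
To make the "if one dies, the other takes over" part a bit more concrete, here is a small Python sketch of the detection half only: the standby Pi probes a TCP port on the primary and decides to take over after several failed probes in a row. The function names, the threshold of three and the choice of probing over TCP are assumptions for illustration. What "taking over" actually involves (starting OH, moving the zwave dongle, updating DNS) is not covered here.

```python
import socket


def is_alive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all as "not alive".
        return False


def should_take_over(results, threshold=3):
    """True once the last `threshold` probes have all failed.

    `results` is the probe history, oldest first, e.g. [True, False, False].
    Requiring several failures in a row avoids reacting to a single blip.
    """
    return len(results) >= threshold and not any(results[-threshold:])
```

The standby could call is_alive("primary-pi.local", 8080) once a minute from cron or a loop, append the result to a list and check should_take_over on it. The hostname is made up, and 8080 is assumed to be the port the primary's OH web interface listens on.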

So even if one were to say OH only augments existing systems, look at your rules: what gets triggered based on system states and combinations of them, and what would hurt you in some way if OH dies?

So, I’d say keep it coming guys… I am sure there is a serious solution forthcoming.

mitchmitchell · May 29, 2017, 12:19am

I’m using a script I found on the Raspbx image (FreePBX for the pi) to make nightly image backups. The script uses rsync to update the image, so if the SD card fails I just burn the image onto a new one and am back to where I was the night before.

I’ve not spent too much time on the other components you’re covering in your wiki, but I probably need to start. My system is architected so that each subsystem (lights, HVAC, security, sprinklers etc.) operates independently of openHAB, so even a total openHAB failure would just eliminate a few automated actions that cross subsystems.

Please keep posting about your activities in this area, as I am sure my openHAB system will become more mission critical in the future, so I will probably adopt some of your ideas. The thought of a wiki so my wife might have a chance to figure things out is really appealing.