Failover System for a Raspberry Pi at a Remote Location

Mark111 · August 20, 2020, 12:08pm

Openhab on Raspberry Pi.

For a scenario where a Pi (running Openhab) crashes and no-one is on hand to restart it: has anyone explored a second “failover” Pi, also with Openhab installed, with associated scripts to ensure continued smooth operation?

This is not a backup question - as I am assuming that key config files would be backed up to a remote server in any event.

Presumably a solution would involve something like a software watchdog, which runs on startup.
I anticipate that features would include the following:

First Pi:
Openhab is running
First Pi sends regular pings to Second Pi to indicate normal operation, presumably across GPIO pins or LAN.
Pings should come from Openhab itself, rather than some other software
On Crash, Pings will stop

Second Pi
Openhab is not running
Watchdog listener takes no action as long as pings received from First Pi
When Pings Stop:
Triggers power reset for First Pi, waits to see if Pings restart
If no restart after a longer period of time, starts its own Openhab instance.
There’ll be no Failover support for this instance of Openhab, since the First Pi is still not working.

First Pi
On Power reset - unclear at this point if Openhab is running or is still inoperable
Cron script or similar to stop then start Openhab
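
The escalation described for the Second Pi could be sketched roughly like this in Python. It is only an illustration under assumptions: the class name `FailoverWatchdog`, the two thresholds and the `power_cycle` / `start_openhab` callbacks are all hypothetical, and how the pings actually arrive (GPIO or LAN) is left out.

```python
import time

WAITING, RESET_SENT, FAILED_OVER = "waiting", "reset_sent", "failed_over"


class FailoverWatchdog:
    """Two-stage escalation: power cycle the first Pi, then take over."""

    def __init__(self, reset_after_s, takeover_after_s,
                 power_cycle, start_openhab, clock=time.monotonic):
        self.reset_after_s = reset_after_s        # silence before a power reset
        self.takeover_after_s = takeover_after_s  # longer silence before takeover
        self.power_cycle = power_cycle            # hypothetical: resets the first Pi
        self.start_openhab = start_openhab        # hypothetical: starts local Openhab
        self.clock = clock
        self.last_ping = clock()
        self.state = WAITING

    def ping(self):
        """Call when a ping from the first Pi arrives."""
        self.last_ping = self.clock()
        if self.state == RESET_SENT:
            self.state = WAITING  # the reset worked, go back to listening

    def poll(self):
        """Call regularly; escalates based on how long pings have been missing."""
        silent_for = self.clock() - self.last_ping
        if self.state == WAITING and silent_for >= self.reset_after_s:
            self.power_cycle()
            self.state = RESET_SENT
        elif self.state == RESET_SENT and silent_for >= self.takeover_after_s:
            self.start_openhab()
            self.state = FAILED_OVER
        return self.state
```

Once the state is `failed_over` the sketch stays there, matching the point above that the second instance has no failover of its own.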

Thoughts welcome - would be grateful. If anyone has done this before I would appreciate it if you could share steps / code for a beginner…

Thedannymullen · August 20, 2020, 12:15pm

I have seen this discussed. The issue in doing this is all the associated processes that need to be switched over when doing it.

USB dongles for zwave or zigbee are the big problem, but mqtt can also be an issue if you are using it.

The swap of just the Pi is the easy part; there are many videos on things like this. What hardware / bindings do you use to run your home automation?

In general it is more economical and less complex to buy more robust hardware than to set up Pi failover. For example, a small PC with a redundant hard drive would be a significantly more robust system than two Pis. I say this because the hard drive has usually been the thing I lose on computers in the factory I support. That is due to them being spinning drives. But in general I have Unix computers that have lasted over 20 years and PCs that are going on 10 and counting with no issues.

Bruce_Osborne · August 20, 2020, 12:27pm

Just have the second Pi monitor & restart the first?

MDAR · August 20, 2020, 12:36pm

Does it have to be a Pi?

Could a simple ESP01 (or LAN) relay do the trick?

As long as it gets a heartbeat to reset a timer, it won’t power cycle the Pi.

(This is assuming the power to the Pi is connected to the “Normally Closed” relay output)

Udo_Hartmann · August 20, 2020, 12:45pm

Then better use the builtin watchdog (yes, the Raspberry has a builtin watchdog, you just have to activate it…)

Bruce_Osborne · August 20, 2020, 12:56pm

I assumed something with OH intelligence or similar.

Mark111 · August 20, 2020, 1:10pm

Thank you all!

It has to be something with Openhab (like a Pi) in case the first Pi will not restart, so I don’t think an ESP01 would work (though as a simple physical watchdog it would be great).

I appreciate the idea of something more stable than a Pi in the first place… any suggestions? I don’t have any spare Linux PCs lying around - Synology or similar seems like overkill.

I would appreciate any more guidance regarding the videos on the “swap of the Pi” - to be clear, is this without any intervention by an operator (which I would need) ? Web links would be hugely appreciated!

Thank you for the suggestion of the built-in Pi software watchdog - will look into it.

many thanks

MDAR · August 20, 2020, 1:11pm

How clever does it need to be?

Start counting seconds
Count heartbeats from OH
@ n Seconds, “have I seen 2 heartbeats?”
Yes, reset counters and start again
No, pulse relay for x seconds, then start all over again
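
As a rough Python illustration of the counting loop above (not tested firmware): the class name `HeartbeatWindow`, the window length and the `pulse_relay` callback are assumptions made for the sketch, and on an ESP01 the same logic would be written in whatever language its firmware uses.

```python
import time


class HeartbeatWindow:
    """Count heartbeats per fixed window; pulse a relay if too few arrive."""

    def __init__(self, window_s, min_beats, pulse_relay, clock=time.monotonic):
        self.window_s = window_s        # "n seconds" in the steps above
        self.min_beats = min_beats      # 2 in the steps above
        self.pulse_relay = pulse_relay  # hypothetical callback that power cycles the Pi
        self.clock = clock
        self.window_start = clock()
        self.beats = 0

    def heartbeat(self):
        """Call whenever a heartbeat from OH arrives."""
        self.beats += 1

    def poll(self):
        """Call regularly; returns True if the relay was pulsed."""
        now = self.clock()
        if now - self.window_start < self.window_s:
            return False                # window still open, keep counting
        healthy = self.beats >= self.min_beats
        if not healthy:
            self.pulse_relay()
        # Either way, reset the counters and start again.
        self.window_start = now
        self.beats = 0
        return not healthy
```

A real version would probably want a longer first window after a pulse, so the Pi has time to boot and OH has time to start before it is judged again.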

MDAR · August 20, 2020, 1:15pm

I’m really liking the ODroid C2 or C4.

There’s a good amount of support on this forum for the platform now too.

https://community.openhab.org/search?q=ODroid

It really comes down to taste / budget & skills

Bruce_Osborne · August 20, 2020, 1:22pm

The OS can still respond while OH has locked up.

MDAR · August 20, 2020, 1:47pm

I meant, a heartbeat sent from OH

mstormi · August 20, 2020, 4:40pm

A Pi isn’t substantially less stable than any other SBC or a PC if you take care of the SD wearout issue (see ZRAM for that purpose).
Anything else is peripherals-related or a software problem and as such just as likely to happen on any other hardware.

Marijn_Oudijk · August 20, 2020, 4:55pm

To take care of network and software issues, my RPI pings Google, and when that fails a couple of times it reboots the Pi.
This ‘automatically’ solves some of the problems related to unexpected network / modem updates etc.
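
A minimal sketch of that idea in Python, assuming a Linux `ping` and using 8.8.8.8 (Google's public DNS) as the target. The function and class names, the target address and the threshold of 3 are illustrative choices, not the poster's actual script.

```python
import subprocess


def ping_ok(host="8.8.8.8"):
    """Return True if a single ping succeeds (Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class FailureCounter:
    """Signal a reboot only after several consecutive failed checks."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ok):
        """Record one check result; return True when a reboot is due."""
        if ok:
            self.failures = 0  # any success clears the streak
            return False
        self.failures += 1
        return self.failures >= self.max_failures
```

In a long-running loop that checks once a minute, `counter.record(ping_ok())` returning True would be the point at which to trigger the reboot. Counting consecutive failures avoids rebooting on a single dropped packet.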

marcel_erkel · August 20, 2020, 5:54pm

Yes, there have been a number of attempts here on the forum to discuss a highly available (HA) setup, e.g.:

If you search the forum you’ll find more, but you’ll also notice that the majority of these topics don’t provide a solution. Also, in my opinion the vibe in these topics isn’t always very positive towards the idea of building a highly available setup.

I agree that the RPI is not the ideal platform for openHAB, but don’t tell Markus because according to him RPIs are the holy grail and ZRAM is the solution for everything. Give it a little time and the XY problem will make its appearance in this thread as well.

mstormi · August 20, 2020, 7:30pm

I never claimed that so why are you digging at me ?

NilsOF · August 21, 2020, 10:48am

Redundant failover is cool, but it will get complicated very fast.
And it will be a beast to keep up to date and running/tested.

I would go for remote access and a more or less scripted, manual approach.
If high availability is needed, you need remote access anyway.

The Rpi4 is cheap, and now that it can boot from SSD it will be more robust.

mdnx · August 26, 2020, 10:53pm

Hej @Mark111 - I faced the same challenge: my openhab on an rpi stalled from time to time. Sometimes only openhab stalled, sometimes the whole rpi was unresponsive. The latter made me stay away from the above-mentioned watchdog solution. Overall I could find neither a clear cause nor an underlying pattern for the crashes. I assume that a better rpi and specialist knowledge of linux and the internals of my Openhab installation would help to get the system more stable - however at the moment I have neither the time nor the nerves for any of this.

My failover solution uses a second rpi (OH2) with openhab to monitor the first openhab (OH1). OH2 has only some scripts and openCV stuff running and has turned out to be super stable. OH1 has all the zwave, logging, influxDb, grafana etc. running - and crashes quite regularly. One could just as well imagine a setup where both rpis monitor each other.
As OH2 has a flask server running, which hosts some JSON files, I use this setup to have an OpenHAB rule on OH1 write to this JSON file every 60 sec via a simple http command. OH2 has an Openhab rule running, which checks whether the value in the JSON is updated regularly. If a time threshold is exceeded, OH2 sends a command to a hue plug, which cuts the power for OH1 - causing it to reboot.
This type of reboot might be harsh - however it has so far been effective in all my “crash” cases on OH1. I also found that a hue bridge can be addressed from several OpenHab installations, which reduces costs, as my home system already has a hue bridge controlled by OH1.
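
The actual scripts are not shown in this thread, so the following Python is only a guess at the core of such a check, using the standard library alone. The file layout (a single `last_seen` timestamp), the function names and the 180 second threshold are assumptions.

```python
import json
import time


def write_heartbeat(path, now=None):
    """What the HTTP handler on OH2 might do each time OH1 checks in."""
    stamp = time.time() if now is None else now
    with open(path, "w") as fh:
        json.dump({"last_seen": stamp}, fh)


def heartbeat_is_stale(path, max_age_s=180, now=None):
    """Return True if OH1 has not checked in within max_age_s seconds."""
    current = time.time() if now is None else now
    try:
        with open(path) as fh:
            last_seen = json.load(fh)["last_seen"]
    except (OSError, ValueError, KeyError, TypeError):
        return True  # a missing or unreadable file counts as no heartbeat
    return current - last_seen > max_age_s
```

With OH1 writing every 60 seconds, a threshold of a few missed writes (here three) gives some tolerance before the plug is switched off and on again.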

In the failover system I have built in some monitoring capabilities and can see that my OH1 is rebooted on a regular basis (unfortunately) - however so far it has always rebooted, and my “Smart” Home system stayed afloat even when I was traveling for several weeks in a row, without me having to intervene from afar via VPN.
Tell me if you need more info or scripts for this setup.

mstormi · August 27, 2020, 10:33am

These statements are somewhat orthogonal to the original failover discussion and provide a dangerous twist. Why ?
Any single system has to be stable all by itself, so if yours isn’t, invest time in investigating why not.
You must not deploy a failover system to solve instability problems - that’s not solving anything, just hiding problems for some time until they surface again (and they WILL surface)…
There are lots of things that can go very badly wrong if you keep running unstable standalone systems. The failover detection and execution mechanism may fail, configuration sync may go wrong, and many more hard-to-find issues will be waiting for you then.

thaake · August 27, 2020, 3:14pm

I faced the same challenge and wanted to solve it as much as possible with openhab itself.
In short, this is what i did:

I run 2 identical RPi 4s with identical openhabian and all the oh conf.
Interestingly, all my bindings can work with 2 instances at the same time:
- my homematic ccu3 works fine with both oh instances
- my hue works fine with both oh
- my wemo binding works with both oh
- all other bindings as well

Each oh instance has its own influxdb, and each binding where I control some real thing has a switch item to tell the oh instance whether it is active or passive.

Both oh instances update each other with their status every other minute, so they can monitor each other. If the passive oh instance runs into a timeout (expire binding) on the status of the other oh instance, it will set itself active and activate all switch items for the controlling bindings.
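
Outside of openHAB rules, the promotion logic described here can be sketched in a few lines of Python. This is an illustration only: the `Instance` class, its method names and the 300 second timeout are assumptions, and the real setup uses switch items and the expire binding rather than code like this.

```python
import time


class Instance:
    """One of two oh instances in an active/passive pair."""

    def __init__(self, name, active, timeout_s=300, clock=time.monotonic):
        self.name = name
        self.active = active          # mirrors the per-binding switch items
        self.timeout_s = timeout_s    # plays the role of the expire timeout
        self.clock = clock
        self.peer_last_seen = clock()

    def on_peer_status(self):
        """Call when a status update from the other instance arrives."""
        self.peer_last_seen = self.clock()

    def tick(self):
        """Promote a passive instance once the peer has gone quiet."""
        peer_silent = self.clock() - self.peer_last_seen > self.timeout_s
        if not self.active and peer_silent:
            self.active = True
        return self.active
```

The sketch deliberately does not cover what should happen when the previously active instance comes back, since the post does not describe that step.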

So both instances run in parallel and do exactly the same, except for the controlling of e.g. roller shutters, actuators, my thermostats, my hot water, etc.

It has been running very smoothly for more than half a year now.
It’s a great constellation for testing new things or updates as well. I do that on the passive oh instance and keep the active one away from changes. If all changes are well tested, I’ll activate the instance with the changes and repeat the changes on the other (now passive) oh instance.

It’s great for rule development and testing, too. If the changed or new rule runs well, I’ll copy the rule to the passive instance and both are identical again.

Yes, there are times when the systems are not identical, but that’s short and ok for me.
If someone wanted to be “purer” on this, they would have to set up a third oh instance for testing, etc.

Its my kind of failover-system and i never had any outage since…

mstormi · August 27, 2020, 4:01pm

That’s not a failover system, and it isn’t applicable to many technologies and devices.
How does a sensor of yours know which server to send its messages to and when to switch servers ?
How do you control devices in ZWave, ZigBee, BT, EnOcean etc. ?

It’s ok to run like you do, but that’s not a comprehensive resilience or even redundancy concept.