Failover System for a Raspberry Pi at a Remote Location

Mark111 · August 20, 2020, 1:10pm

Thank you all!

It has to be something running openHAB (like a Pi) in case the first Pi will not restart, so I don’t think an ESP-01 would work (though it would be great as a simple physical watchdog).

I appreciate the idea of something more stable than a Pi in the first place… any suggestions? I don’t have any spare Linux PCs lying around, and a Synology or similar seems like overkill.

I would appreciate more guidance regarding the videos on the “swap of the Pi”. To be clear, does this work without any intervention by an operator (which is what I would need)? Web links would be hugely appreciated!

Thank you for the suggestion of the built-in Pi software watchdog - will look into it.

Many thanks

MDAR · August 20, 2020, 1:11pm

How clever does it need to be?

Start counting seconds.
Count heartbeats from OH.
At n seconds, ask: “have I seen 2 heartbeats?”
Yes: reset the counters and start again.
No: pulse the relay for x seconds, then start all over again.
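
The counting logic above could be sketched in Python roughly as follows. This is only an illustration, not code from any real watchdog: the class name `HeartbeatWatchdog`, the `pulse_relay` callback and the injectable `clock` are all assumptions made for the example, and how the heartbeat actually arrives from OH (GPIO, serial, HTTP) is left open.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Counts heartbeats per time window and pulses a relay if too few arrive."""

    def __init__(self, window_s: float, min_beats: int,
                 pulse_relay: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s        # the "n seconds" from the post
        self.min_beats = min_beats      # e.g. 2 heartbeats per window
        self.pulse_relay = pulse_relay  # hypothetical hook that power-cycles the Pi
        self.clock = clock
        self.window_start = clock()
        self.beats = 0

    def heartbeat(self) -> None:
        """Call this whenever a heartbeat from OH is seen."""
        self.beats += 1

    def poll(self) -> bool:
        """Check the window; return True if the relay was pulsed."""
        if self.clock() - self.window_start < self.window_s:
            return False                # window still open, keep counting
        healthy = self.beats >= self.min_beats
        if not healthy:
            self.pulse_relay()
        # Either way, reset the counters and start all over again.
        self.window_start = self.clock()
        self.beats = 0
        return not healthy
```

Because the heartbeat comes from OH itself rather than from the OS, a locked-up openHAB on a still-responsive Pi would also trip the relay.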

MDAR · August 20, 2020, 1:15pm

I’m really liking the ODroid C2 or C4.

There’s a good amount of support on this forum for the platform now too.

https://community.openhab.org/search?q=ODroid

It really comes down to taste / budget & skills

Bruce_Osborne · August 20, 2020, 1:22pm

The OS can still respond while OH is locked up.

MDAR · August 20, 2020, 1:47pm

I meant a heartbeat sent from OH.

mstormi · August 20, 2020, 4:40pm

A Pi isn’t substantially less stable than any other SBC or a PC if you take care of the SD wear-out issue (see ZRAM for that purpose).
Anything else is peripheral-related or a software problem, and as such is just as likely to happen on any other hardware.

Marijn_Oudijk · August 20, 2020, 4:55pm

To take care of network and software issues, my RPi pings Google, and when that fails a couple of times it reboots the Pi.
This ‘automatically’ solves problems related to unexpected network / modem updates etc.
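
A ping-and-reboot check of this kind might look roughly like the Python sketch below. It is not the poster’s actual script: the function names, the choice of 8.8.8.8 (Google’s public DNS) as the target, the three attempts and the pause between them are all assumptions, and the `ping` flags are those of the Linux `ping` command.

```python
import subprocess
import time
from typing import Callable


def ping_ok(host: str = "8.8.8.8", timeout_s: int = 2) -> bool:
    """Send one ICMP ping with the Linux `ping` command; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_reboot(check: Callable[[], bool], max_failures: int = 3,
                  pause_s: float = 10.0) -> bool:
    """Return True only if `check` fails max_failures times in a row."""
    for attempt in range(max_failures):
        if check():
            return False                # one success is enough to stay up
        if attempt < max_failures - 1:
            time.sleep(pause_s)         # give the network a moment to recover
    return True


# Possible use from cron or a systemd timer (needs root for the reboot):
#     if should_reboot(ping_ok):
#         subprocess.run(["reboot"])
```

Requiring several consecutive failures avoids rebooting on a single dropped packet.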

marcel_erkel · August 20, 2020, 5:54pm

Yes, there have been a number of attempts here on the forum to discuss a highly available (HA) setup, e.g.:

If you search the forum you’ll find more, but you’ll also notice that the majority of these topics don’t provide a solution. Also, in my opinion the vibe in these topics isn’t always very positive towards the idea of building a highly available setup.

I agree that the RPi is not the ideal platform for openHAB, but don’t tell Markus, because according to him RPis are the holy grail and ZRAM is the solution for everything. Give it a little time and the XY problem will make its appearance in this thread as well.

mstormi · August 20, 2020, 7:30pm

I never claimed that, so why are you digging at me?

NilsOF · August 21, 2020, 10:48am

Redundant failover is cool, but it will get complicated very fast.
And it will be a beast to keep up to date, running and tested.

I would go for remote access and a more or less scripted, manual approach.
If high availability is needed, you need remote access anyway.

The RPi 4 is cheap, and now that it can boot from SSD it will be more robust.

mdnx · August 26, 2020, 10:53pm

Hej @Mark111 - I faced the same challenge: my openHAB on an RPi stalled from time to time. Sometimes only openHAB stalled, sometimes the whole RPi was unresponsive. The latter made me stay away from the watchdog solution mentioned above. Overall I could find neither a clear cause nor an underlying pattern for the crashes. I assume that a better RPi and specialist knowledge of Linux and the internals of my openHAB installation would help to get the system more stable - however, at the moment I have neither the time nor the nerves for any of this.

My failover solution uses a second RPi (OH2) with openHAB to monitor the first openHAB (OH1). OH2 only has some scripts and OpenCV stuff running and has turned out to be super stable. OH1 has all the Z-Wave, logging, InfluxDB, Grafana etc. running - and crashes quite regularly. One could just as well imagine a setup where both RPis monitor each other.
As OH2 has a Flask server running, which hosts some JSON files, I use this setup to have an openHAB rule on OH1 write to one of these JSON files every 60 s via a simple HTTP command. OH2 has an openHAB rule running which checks whether the value in the JSON file is updated regularly. If a time threshold is exceeded, OH2 sends a command to a Hue plug, which cuts the power to OH1 - causing it to reboot.
This type of reboot might be harsh - however, it has so far been effective in all my “crash” cases on OH1. I also found that a Hue bridge can be addressed from several openHAB installations, which reduces costs, as my home system already has a Hue bridge controlled by OH1.
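
The check on the OH2 side could be sketched in Python roughly as below. This is not the actual rule or script from this setup: the JSON field name `last_seen` (an epoch timestamp), the 180 s threshold and the `set_plug` callback standing in for the Hue plug command are all assumptions made for illustration.

```python
import json
import time
from typing import Callable, Optional


def heartbeat_is_stale(json_text: str, max_age_s: float = 180.0,
                       now: Optional[float] = None) -> bool:
    """True if the timestamp written by OH1 is older than max_age_s seconds."""
    if now is None:
        now = time.time()
    try:
        # "last_seen" is a hypothetical field holding seconds since the epoch.
        last_seen = float(json.loads(json_text)["last_seen"])
    except (ValueError, KeyError, TypeError):
        return True                     # an unreadable heartbeat counts as stale
    return now - last_seen > max_age_s


def power_cycle(set_plug: Callable[[bool], None], off_s: float = 10.0) -> None:
    """Cut the power via the smart plug, wait, then switch it back on."""
    set_plug(False)
    time.sleep(off_s)
    set_plug(True)
```

With OH1 writing every 60 s, a threshold of a few missed writes (here three) gives OH1 some slack before its power is cut.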

I have built some monitoring capabilities into the failover system and can see that my OH1 is rebooted on a regular basis (unfortunately) - however, so far it has always rebooted and my “Smart” Home system has stayed afloat, even when I was traveling for several weeks in a row, without me having to intervene from afar via VPN.
Tell me if you need more info or scripts for this setup.

mstormi · August 27, 2020, 10:33am

These statements are somewhat orthogonal to the original failover discussion and provide a dangerous twist. Why?
Any single system has to be stable all by itself, so if yours isn’t, invest time in investigating why not.
You must not deploy a failover system to solve instability problems - that’s not solving anything, just hiding problems for some time until they surface again (and they WILL surface)…
There are lots of things that can go very badly wrong if you keep running unstable standalone systems. The failover detection and execution mechanism may fail, configuration sync may go wrong, and many more hard-to-find issues will be waiting for you then.

thaake · August 27, 2020, 3:14pm

I faced the same challenge and wanted to solve it as much as possible with openhab itself.
In short, this is what I did:

I run 2 identical RPi 4s with identical openHABian and all the OH conf.
Interestingly, all my bindings can work with 2 instances at the same time:
- my Homematic CCU3 works fine with both OH instances
- my Hue works fine with both OH instances
- my WeMo binding works with both OH instances
- all other bindings as well

Each OH instance has its own InfluxDB, and each binding where I control some real thing has a switch item that tells the OH instance whether it is active or passive.

Both OH instances update each other with their status every other minute, so they can monitor each other. If the passive OH instance runs into a timeout (expire binding) on the status of the other OH instance, it sets itself active and activates all switch items for the controlling bindings.
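
The promotion step described here can be modelled in a few lines of Python. This is only a sketch of the idea, not the openHAB rules or expire configuration actually used: the `Instance` class, the 300 s timeout and the switch names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Instance:
    """One OH instance that promotes itself when its peer goes silent."""
    name: str
    active: bool
    peer_timeout_s: float = 300.0       # assumed expiry time for the peer status
    last_peer_status: float = 0.0
    # One switch per controlling binding; True means "this instance may control".
    control_switches: Dict[str, bool] = field(default_factory=dict)

    def on_peer_status(self, now: float) -> None:
        """Record a status update received from the other instance."""
        self.last_peer_status = now

    def tick(self, now: float) -> bool:
        """Promote a passive instance if the peer timed out; True if promoted."""
        if self.active:
            return False
        if now - self.last_peer_status <= self.peer_timeout_s:
            return False                # peer is still reporting in
        self.active = True
        for switch in self.control_switches:
            self.control_switches[switch] = True
        return True
```

Note that this sketch only covers the passive side taking over; what happens when the former active instance comes back is left open.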

So both instances run in parallel and do exactly the same, except for controlling things such as roller shutters, actuators, my thermostats, my hot water, etc.

It has been running very smoothly for more than half a year now.
It’s a great setup for testing new things or updates as well. I do that on the passive OH instance and keep the active one away from changes. Once all changes are well tested, I activate the instance with the changes and repeat the changes on the other (now passive) OH instance.

It’s great for rule development and testing, too. If the changed or new rule runs well, I copy the rule to the passive instance and both are identical again.

Yes, there are times when the systems are not identical, but those are short and that’s OK for me.
If someone wanted to be “purer” about this, they would have to set up a third OH instance for testing, etc.

It’s my kind of failover system and I have never had any outage since…

mstormi · August 27, 2020, 4:01pm

That’s not a failover system, and it isn’t applicable to many technologies and devices.
How does a sensor of yours know which server to send its messages to and when to switch servers?
How do you control devices in ZWave, ZigBee, BT, EnOcean etc.?

It’s OK to run like you do, but that’s not a comprehensive resilience or even redundancy concept.

mstormi · August 27, 2020, 4:11pm

FWIW, there’s a feature upcoming in openHABian that lets you mirror your SD card to another SD in a card reader.
Yes that’ll require some helping hands on the remote site, but combined with a full set of (cold) spare hardware, that’ll allow you to recover fast from almost any type of outage.

Mark111 · September 4, 2020, 9:03am

Thanks all for the responses!
Really appreciate the offer of scripts. Will need to think about this some more before reaching out.

Two identical Pis is an interesting solution. Not failover as such, but perhaps it achieves the same results for my situation (there’s absolutely no onsite help available at the remote location!).
Presumably I can try to use the second Pi to reboot the first Pi in case of a crash.

Does this two-identical-Pi configuration cause issues if all of the openHAB clients/things are (only) HTTP-based? Presumably more network traffic, but perhaps this would be manageable… Am thinking it may just work… Thoughts very welcome!

MDAR · September 4, 2020, 11:58am

Just for information’s sake:

If you had a Velbus infrastructure, you can have as many instances of openHAB (and others) as you wish.
(Either through 1 TCP gateway, or multiple USB interfaces)

marcel_erkel · September 6, 2020, 9:38am

That largely depends on the devices you want to operate. How do they respond when they get the same trigger twice? For example, a Z-Wave switch doesn’t care if you tell it to switch on twice. On the other hand, if you’d like to receive a notification when something occurs, then you will now get two notifications (one from each openHAB instance). It becomes troublesome when you have a device that toggles on/off using a single command: the first openHAB instance may toggle it on and the second instance will then toggle it off (or vice versa).
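
The difference between the two kinds of command can be shown with a toy model in Python. The `Device` class and `send_from_both` helper are purely illustrative and do not correspond to any openHAB API.

```python
from typing import Callable


class Device:
    """Toy device with an explicit ON/OFF command and a single toggle command."""

    def __init__(self) -> None:
        self.on = False

    def set_on(self, value: bool) -> None:
        self.on = value          # idempotent: repeating it changes nothing

    def toggle(self) -> None:
        self.on = not self.on    # not idempotent: repeating it undoes itself


def send_from_both(command: Callable[[], None]) -> None:
    """Simulate two openHAB instances reacting to the same trigger."""
    command()  # first instance
    command()  # second instance


lamp = Device()
send_from_both(lambda: lamp.set_on(True))
print(lamp.on)    # True: the duplicate ON command is harmless

plug = Device()
send_from_both(plug.toggle)
print(plug.on)    # False: the second toggle cancelled the first
```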

thaake · September 23, 2020, 6:37pm

As I tried to explain in my post above:
Each of the identical Pis knows whether it is active or passive.
Only the active one executes the switching, notifications, etc.
So, I see no harm.
It’s running like a charm.

If you use bindings that “hiccup” with more than one OH instance connected, then you may have to find other ways. But all my bindings run without “hiccups”.

Thomas

denominator · September 23, 2020, 8:13pm

How do you test the failure?

At work we use different ways of making it quick to fix a problem. Hundreds of people stand idle while I run around fixing it.

Spending your effort on getting the system stable is more beneficial than writing more unstable code that is hard to test.

One method we use is to run VMs on enterprise servers. We had a memory DIMM die; I replaced it with zero downtime.

Another is to have a 100% cold backup: 2 identical pieces of hardware with a physical changeover switch, so you can only power one at a time. I used it once because of a corrupted SD card. Flick 1 switch and fix later.

One is automated, the other is manual. Which one do you want to put your time towards?

Writing software is like making love because sometimes you have to look after it for the rest of its life.

If