How to Restore Second System Remotely

softpretzel · August 1, 2021, 10:18pm

Raspberry Pi 4
Openhab 3
Aeotec z-wave USB

I was travelling recently and, during a software update, my system went offline. I lost the ability to control my smart system. I’d like to have a second Pi on my network, with Openhab installed and ready to take over in the event the first system fails and I’m unable to access it over the network. Any suggestions for how to do this?

Let’s assume the first system won’t even boot. It has an Aeotec z-wave USB stick connected to it.

mstormi · August 1, 2021, 10:35pm

This question comes up every now and then. Search the forum yourself but the bottom line is that there are no reasonably working high availability solutions without major drawbacks.

Check out openHABian Auto-backup. For one, openHABian is tested by many users, so upgrades are a lot less likely to get you into such a situation.
And it mirrors the SD “disk”. You will need some remote hands to exchange SD cards, but it’ll get you back to the latest snapshot before the upgrade, even if the HW breaks.
The remote hands can even be a non-IT-savvy person.

softpretzel · August 1, 2021, 10:53pm

Thank you for the reply. I have searched and was hoping for some fresh ideas since I don’t have a solution yet. If I’m traveling, then I won’t have anyone around to physically replace an SD card.

I’ve been using Openhab for a bit over a year and updates or changes have broken the system many times. I don’t mind rebuilding the system or replacing with an SD backup when I’m home.

I wonder if I could give a second Pi access to the z-wave USB over the network? Could the z-wave USB always be online and connected to a separate system, then whatever Pi is live uses it?

rpwong · August 1, 2021, 11:38pm

You could get a separate zwave hub, but what if that fails? You can’t account for every scenario, but you can drive yourself insane trying. It’s an endless rabbit hole, because no matter how much you plan, something can always fail.

Out of curiosity, why are you running OH updates remotely? It seems to me that the solution would be to not do that, but I don’t want to make assumptions about your reasons.

Personally, I won’t run an update without taking the SD card out and cloning it right beforehand. I’ve been meaning to set up mirroring in openhabian, but haven’t gotten around to it.

softpretzel · August 2, 2021, 12:19am

Very good point about not performing updates while I’m out of town. That was the latest mistake I made. I think it was an update outside of Openhab that broke the system.

Also a good point about trying to account for every situation. I just want to account for this situation, where I have a hardware failure on my Pi and I’m away for months.

Let’s say I want to plan for a failure of a Pi with a z-wave stick and Openhab. Hardware failure. I’m out of town for two months. I have remote access to a power distribution unit to cycle power on the Pi, but it won’t boot.

When I determine the first Pi has failed, I want to start Openhab on the second Pi, which has all the same definitions and rules. It’s ready to take over and only needs control of the same z-wave USB stick.

Would I need a third Pi that the z-wave USB stick is connected to, that the other Openhab instances can use over the network? Is that possible?
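For the "determine the first Pi has failed" step, one possible approach is a small reachability check run from another machine on the network. This is only a sketch under assumptions: port 8080 is assumed to be openHAB's HTTP port on the first Pi, the host name in the usage note is hypothetical, and `is_reachable` and `primary_has_failed` are illustrative helper names, not part of openHAB.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def primary_has_failed(host: str, port: int,
                       attempts: int = 3, timeout: float = 3.0) -> bool:
    """Report failure only if every attempt fails.

    Requiring several failed attempts avoids reacting to a single
    dropped connection.
    """
    return not any(is_reachable(host, port, timeout) for _ in range(attempts))
```

Something like `primary_has_failed("pi-a.local", 8080)` could then be run from the second Pi. A successful TCP connect only shows that something is listening on that port. It does not prove that rules or the Z-Wave stick are working.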

rpwong · August 2, 2021, 7:14am

Yep, that’s possible using the remote binding, but that third RPi will be just as susceptible to failure as the first one. So if it ends up as your weak point, you’re right back where you started.

You could use some other Z-Wave controller such as a SmartThings hub, but that’s adding complexity, latency, and dependency on an Internet connection. And it’s still a single controller that could fail for some reason.

Alternatively, don’t use Z-Wave, ZigBee, or anything else that requires a controller. A WiFi switch/plug can be connected to multiple OH servers at the same time. Of course, this is only viable if there are WiFi options for your Z-Wave devices.

I’ve only had an RPi die on me once, and that was before ZRAM was introduced to openhabian (and with an aging SD card). I’ve never had a problem since then. So I guess I’m just wondering if this is worth the effort and added cost, because I think your system will likely be fine if you don’t make any major changes while you’re away. However, it doesn’t matter what I think–you have to be comfortable with the risk.

In any case, the zwave controller is always going to be the weak link since it’s not possible to have a backup ready to go at a moment’s notice.

denominator · August 2, 2021, 7:33am

You could get something external to the pi to switch the power off then on again.

mstormi · August 2, 2021, 8:50am

Which itself becomes the SPOF (single point of failure).
There is no magical resolution. Russ has pinpointed it:

You said you only want to account for HW outages when away, but that’s not what has ever happened.
Remember, all the trouble you’ve seen so far you caused yourself by updating from remote.
Real HW outages are rare. In professional operations, they account for only 20% of all outages; the majority are caused by software (40%) and people’s mistakes (40%).

Reinstall based on openHABian so you will have a more stable system in the first place, one that is less likely to fail during operations (think disk full, SD wearout) and on upgrades (40% SW).
Don’t upgrade from remote. Don’t upgrade if not in true need (40% people).
Give your neighbour a key, then instruct him by phone how to exchange the SD card (or other HW) if really needed (that is the mere 20% HW you’re currently looking at).

softpretzel · August 2, 2021, 12:59pm

I like the idea of showing a neighbor how to swap an SD card. We often have our place rented on Airbnb when we travel so maybe I can talk a renter through an SD card swap too.

I’d still like to have a backup Pi system. Having one fail has low odds, so having two fail is even lower odds. I’ll have a third Pi for a VPN system and other stuff I want to play with, even remotely, so if I break that system via software changes, it’s not the Openhab system.

Pi A will run Openhab.
Pi B will have the same Openhab setup, but the service will be stopped. Pi B will have the z-wave USB controller connected to it.
Pi A will connect to the z-wave controller on Pi B via ethernet. I haven’t done that before, but there is software to share a USB connection over the network. Are you saying Openhab also has a network binding that could connect to the remote z-wave controller?

mstormi · August 2, 2021, 1:16pm

Wrong, that’s a common misunderstanding.
The probability that a system consisting of two more-or-less-active boxes fails is HIGHER due to increased complexity! Again: it’s 40% SW and another 40% people.

I’ve been working in telco service and systems design for years, which is why openHABian was designed the way it is: it focuses on reducing MTTR rather than increasing MTBF.

softpretzel · August 2, 2021, 1:38pm

Mathematically speaking, the odds of total system failure are lower with a backup system. I understand you prefer not to have a backup system or redundancy.

I also have a second fridge in our house for convenience and when the first one broke a few years ago, we moved the food to the second one. Low odds of a fridge breaking, but when it did, it was nice to have a second one. Are you saying having a second fridge decreases the odds of the first fridge working or the total two fridge system working?

mstormi · August 2, 2021, 2:34pm

No, math that simple is inappropriate here as it misses the point. It is about the probability of the combined system failing, and that system is not just two boxes: it adds a large, complex, fragile component, the failover mechanism.
Your fridge comparison falls short. Those are simple, passive elements. You have to be on site and have to move the food yourself.
You must not compare that to complex active components like home automation servers that in turn control many devices and run complicated application logic.
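To make the two positions concrete, here is a rough sketch of the arithmetic. All numbers are made up for illustration, and the model is an assumption: the two Pis fail independently with the same probability, and `p_failover` lumps together everything the redundancy adds (the switchover logic, a shared Z-Wave host, operator mistakes) as one extra way to cause an outage.

```python
def p_single(p_pi: float) -> float:
    """Outage probability with one Pi and no failover."""
    return p_pi


def p_redundant(p_pi: float, p_failover: float) -> float:
    """Outage probability for two Pis plus a failover mechanism.

    Assumed model: an outage happens if both Pis fail, OR if the
    failover mechanism (including any shared component) causes one.
    """
    p_both_pis = p_pi * p_pi
    return 1 - (1 - p_both_pis) * (1 - p_failover)


def break_even_failover(p_pi: float) -> float:
    """Failover failure probability at which redundancy stops helping."""
    return p_pi / (1 + p_pi)


p = 0.05  # hypothetical chance that one Pi fails during the trip
perfect = p_redundant(p, 0.0)    # failover that never fails
fragile = p_redundant(p, 0.10)   # failover that fails 10% of the time
limit = break_even_failover(p)
```

With a perfect failover the redundant setup is far better than a single Pi, which is the "mathematically speaking" argument. Once the added machinery is more failure-prone than `break_even_failover(p)`, the combined system is worse than the single Pi, which is the complexity argument.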

rlkoshak · August 2, 2021, 3:09pm

This is the part that makes it complicated. It is my understanding, though, that you can back up and restore the settings on many Zwave controllers. So you could back up your main Zwave controller, restore it to a second one, and plug that into a clone of your RPi which is kept powered off. Then, to restore, power off the main RPi and power on the spare RPi, and the new controller will take over.

But beware. There is a story I once heard that I think DaVinci (maybe Aristotle, I can’t find the source again) observed a stone obelisk supported at each end. However, sometimes one of the supports fails so to make the obelisk more secure he added a new support in the middle. Unfortunately that caused the obelisk to break because one support sunk a little under the weight and half the weight was cantilevered over the middle support. By adding the additional support the system was made worse, not better.

All of this failover stuff adds a ton of complexity to the system, and where there is complexity there is a greater chance for failure. It is often a better idea, or at least an idea worth exploring, to approach the problem in other ways. For example, I advocate letting smart devices remain smart devices and retain the ability to act autonomously. So, for example, even if OH is down, my thermostat and sprinkler system will still work on their own using their current settings.

It depends on how the backup is configured and designed. In your case you are increasing the likelihood of failure overall by adding the third OH that serves up the Zwave controller to the other two. There is also the chance that the remotely controlled switchover from the running system to the backup would itself fail.

I agree this is not an apt comparison. An apt comparison is if you also created a robot to move the food from the broken one to the working one in the case of a failure. And as anyone who has ever messed with robots knows, that greatly increases the chance of failure of the whole system.

If you had two completely independent, fully parallel systems, one offline and one online, then maybe the comparison would be apt. But that isn’t what is being proposed. What is proposed is the creation of a whole system of three RPis, one of which is shared by the other two. And that doesn’t even address the issue of all the devices that the OH system itself controls.

I always recommend in cases like this that one run the risk calculation. Risk = Impact * Likelihood. That means one can decrease the risk by either reducing the impact or reducing the likelihood. So far you are focused on reducing the impact. But what about steps to reduce the likelihood? These are steps such as:

don’t update or change anything when not physically present
put the RPi on a UPS so a power outage doesn’t corrupt the SD card
simplify the system as much as possible; more parts and more subsystems increase the likelihood of failure
where possible, make the end devices autonomous so the house still functions to some degree even when OH is offline

Finally, when you run that risk calculation (and it’s more a mental exercise, not a true equation with right answers), make sure that whatever you do to mitigate the risk:

actually reduces the risk; for example decreasing the impact but increasing the likelihood can result in the same or an even higher risk
the cost of the mitigation (time, money, whatever units you are using) is less than the risk itself. When calculating the cost of the mitigations, don’t forget to account for opportunity costs (e.g. because I’ve done X I can no longer use Y) and the cost of your own time implementing them.
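As a rough illustration of that mental exercise, the Risk = Impact * Likelihood comparison and the two checks above can be sketched like this. Every number and option name here is invented purely for the example, and `Option` and `worth_it` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    likelihood: float             # chance of an outage while away (0..1)
    impact: float                 # cost of that outage, in arbitrary units
    mitigation_cost: float = 0.0  # time/money/complexity, same units

    @property
    def risk(self) -> float:
        # Risk = Impact * Likelihood
        return self.impact * self.likelihood


def worth_it(baseline: Option, option: Option) -> bool:
    """Apply the two checks: the option must actually reduce the risk,
    and it must cost less than the risk it is mitigating."""
    reduces_risk = option.risk < baseline.risk
    affordable = option.mitigation_cost < baseline.risk
    return reduces_risk and affordable


# Made-up numbers, only to show the shape of the comparison.
baseline = Option("single Pi, remote updates", likelihood=0.10, impact=100)
no_remote = Option("single Pi, no remote changes", likelihood=0.04, impact=100)
three_pi = Option("three-Pi failover", likelihood=0.15, impact=40,
                  mitigation_cost=12)

for opt in (no_remote, three_pi):
    print(f"{opt.name}: risk={opt.risk:.1f}, worth it: {worth_it(baseline, opt)}")
```

In this invented example the failover setup lowers the impact but raises the likelihood, and its cost exceeds the original risk, so it fails the second check even though its risk number is lower.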

Personally I see it much like Markus and Russ do. The proposed mitigations do not actually reduce the risk enough to be worth the cost. They lower the impact but raise the likelihood, and the cost in complexity is as high or higher than the original risk, assuming mitigations like “don’t change anything when not physically present” are applied.

As Markus can tell you, this stuff is hard. You have to take a holistic view of your entire end-to-end system, and often what makes intuitive sense to do (e.g. just throw in a backup system) can make matters worse, not better.

rpwong · August 2, 2021, 5:33pm

As the others have said, system complexity is the enemy of reliability.

If you can get a neighbor or renter to assist, then you remove system complexity in favour of human intervention. That seems like a bad thing for home automation, but it’s actually better for disaster recovery since we can make physical changes to the system. Specifically, a human can move not only the SD card, but also the Z-Wave controller. Now this matches your refrigerator example.

All you need is a second Pi, an interface that the person can access to run a clean shutdown (to avoid corrupting the SD card), and written instructions for how to do the swap.

denominator · August 2, 2021, 10:08pm

Agreed, it’s adding a SPOF to help mitigate many more SPOFs, and it’s only needed for remote management.

I can’t agree more with this statement.

Having 2 fridges is different as others have said.

What you could do is to have a cold backup. It would need to be configured with a static IP address.
Then have a second system including all USB sticks and everything. Then you could get a changeover switch for power that can only power one of the Pis at a time.

It’s very hard to manage changes, and if persistence is on the Pi you will lose that too.

rlkoshak · August 2, 2021, 10:18pm

Pun intended?

njJim · August 4, 2021, 11:47am

Good question. In my opinion a completely redundant OpenHAB Pi, including a second z-wave stick, is the most robust and simplest solution. Both need to be accessible via remote SSH login. Devise some scheme to mirror the live Pi’s SD card to the backup Pi periodically, or manually whenever significant changes to the setup are made. There would have to be an item to indicate which Pi is live. Then if the live Pi dies while you are away, simply flip the live Pi indicator item to the backup Pi and it takes over.
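One possible shape for the "mirror periodically" part is a file-level copy of the openHAB folders over SSH with `rsync`, rather than a full SD card image. This is a sketch under assumptions: the folder paths, the `openhabian` user and the host name `pi-b.local` are placeholders to adapt, and `build_mirror_cmd` and `mirror` are hypothetical helper names.

```python
import subprocess
from typing import List, Optional

# Assumed locations of the openHAB config and userdata; adjust to your install.
OPENHAB_DIRS = ["/etc/openhab", "/var/lib/openhab"]


def build_mirror_cmd(src_dir: str, backup_host: str,
                     user: str = "openhabian",
                     dst_dir: Optional[str] = None) -> List[str]:
    """Build an rsync command that mirrors src_dir to the backup Pi.

    -a preserves permissions and timestamps, --delete removes files on
    the backup that no longer exist on the live Pi, -e ssh uses SSH.
    """
    dst_dir = dst_dir or src_dir
    return [
        "rsync", "-a", "--delete", "-e", "ssh",
        src_dir.rstrip("/") + "/",  # trailing slash: copy the contents
        f"{user}@{backup_host}:{dst_dir.rstrip('/')}/",
    ]


def mirror(dirs: List[str], backup_host: str, dry_run: bool = True) -> None:
    """Print the commands (dry_run) or actually run them."""
    for d in dirs:
        cmd = build_mirror_cmd(d, backup_host)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `mirror(OPENHAB_DIRS, "pi-b.local")` only prints the commands. Running it for real requires write permission on the target folders, and copying userdata while openHAB is running may capture an inconsistent state. It also does nothing about the Z-Wave pairing stored on the stick.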

softpretzel · August 4, 2021, 12:18pm

Good discussion all. For those who feel adding a second system as backup is too complex to manage, I respect your opinion. Now for those who would like to discuss the solution that involves a second or even third Pi…

Would a second z-wave stick be able to communicate with the devices that are paired to the first z-wave stick without me being home to re-pair devices? I don’t want to have to un-pair and re-pair devices. That’s why I was thinking of using one z-wave stick, then multiple Pis with multiple OH.

njJim · August 4, 2021, 12:57pm

Good point. I forgot about the whole pairing thing. Might have to call that neighbor and move the stick to the other Pi. Of course, if they are going to do that they might as well just swap out the SD card instead.

rpwong · August 4, 2021, 5:50pm

It’s not that I feel it’s too complex to manage. It’s that the Z-Wave controller makes it almost impossible to automate the recovery process, since it’s not possible to have a backup. Hence:

The benefit of the human is that they can determine what has failed and act accordingly to make physical changes. That makes up for the fact that humans are often less reliable than computers.