This is the part that makes it complicated. It is my understanding, though, that you can back up and restore the settings on many Zwave controllers. So you could back up your main Zwave controller, restore that backup to a second one, and plug the second controller into a clone of your RPi which is kept powered off. Then, to restore service, power off the main RPi and power on the spare RPi, and the new controller will take over.
But beware. There is a story I once heard, I think about DaVinci (maybe Aristotle, I can’t find the source again), who observed a stone obelisk supported at each end. Sometimes one of the supports fails, so to make the obelisk more secure he added a new support in the middle. Unfortunately that caused the obelisk to break, because one support sank a little under the weight and half the weight was cantilevered over the middle support. By adding the additional support the system was made worse, not better.
All of this failover stuff adds a ton of complexity to the system, and where there is complexity there is a greater chance of failure. It is often a better idea, or at least an idea worth exploring, to approach the problem in other ways. For example, I advocate letting smart devices remain smart devices and retain the ability to act autonomously. So even if OH is down, my thermostat and sprinkler system will still work on their own using their current settings.
It depends on how the backup is configured and designed. In your case you are increasing the overall likelihood of failure by adding the third OH that serves up the Zwave controller to the other two. There is also the chance that the remotely controlled switchover from the running system to the backup would itself fail.
I agree this is not an apt comparison. An apt comparison would be if you also created a robot to move the food from the broken one to the working one in the case of a failure. And as anyone who has ever messed with robots knows, that greatly increases the chance of failure of the whole system.
If you had two completely independent systems running in parallel, one offline and one online, then maybe the comparison would be apt. But that isn’t what is being proposed. What is proposed is a whole system of three RPis, one of which is shared by the other two. And that doesn’t even address the issue of all the devices that the OH system itself controls.
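To put rough numbers on why the shared RPi matters, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption on my part: the failure probabilities are made up, failures are treated as independent, and the function names are just for illustration. It only models the three-RPi layout as I understand it (the shared RPi must be up, and either the main RPi is up or the switchover and the backup RPi both work).

```python
def p_fail_single(p):
    """Chance that a lone RPi setup is down, given failure probability p."""
    return p


def p_fail_with_failover(p, p_switch_fail):
    """Chance the three-RPi setup is down (hypothetical model).

    Assumes independent failures, that every RPi fails with the same
    probability p, and that the switchover fails with p_switch_fail.
    """
    # The backup path only helps if the switchover AND the backup RPi work.
    p_backup_path_works = (1 - p_switch_fail) * (1 - p)
    # The OH side is down only if the main RPi fails and the backup path fails.
    p_oh_side_down = p * (1 - p_backup_path_works)
    # The shared Zwave RPi has to be up no matter which OH is running.
    p_system_up = (1 - p) * (1 - p_oh_side_down)
    return 1 - p_system_up


p = 0.05              # made-up chance that any one RPi fails
p_switch_fail = 0.20  # made-up chance the remote switchover fails

print(f"single RPi:    {p_fail_single(p):.4f}")
print(f"with failover: {p_fail_with_failover(p, p_switch_fail):.4f}")
```

With these made-up numbers the failover setup comes out at roughly 6% versus 5% for the single RPi. Under this model it can never do better than the single RPi, because the shared RPi is a single point of failure all on its own.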
I always recommend in cases like this that one run the risk calculation. Risk = Impact * Likelihood. That means one can decrease the risk by either reducing the impact or reducing the likelihood. So far you are focused on reducing the impact. But what about steps to reduce the likelihood? These are steps such as:
- don’t update or change anything when you are not physically present
- put the RPi on a UPS so a power outage doesn’t corrupt the SD card
- simplify the system as much as possible; more parts and more subsystems increase the likelihood of failure
- where possible make the end devices autonomous so the house still functions to some degree even when OH is offline
Finally, when you run that risk calculation (and it’s more a mental exercise, not a true equation with right answers) make sure that whatever you do to mitigate the risk:
- actually reduces the risk; for example decreasing the impact but increasing the likelihood can result in the same or an even higher risk
- costs less (in time, money, whatever units you are using) than the risk itself. When calculating the cost of the mitigations, don’t forget to account for opportunity costs (e.g. because I’ve done X I can no longer use Y) and the cost of your own time implementing them.
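To make that mental exercise a bit more concrete, here is a rough sketch in Python. All the numbers are invented for illustration, the units are whatever you like (I’m thinking hours of hassle), and the names `risk` and `worth_it` are just mine. Plug in your own guesses.

```python
def risk(impact, likelihood):
    """Risk = Impact * Likelihood."""
    return impact * likelihood


def worth_it(base_risk, mitigated_risk, mitigation_cost):
    """Apply the two checks: the mitigation must actually reduce the risk,
    and it must cost less than the risk it is meant to address."""
    reduces_risk = mitigated_risk < base_risk
    cheaper_than_risk = mitigation_cost < base_risk
    return reduces_risk and cheaper_than_risk


# Baseline: one RPi, no mitigations (made-up numbers).
base = risk(impact=40, likelihood=0.10)

# Failover setup: lower impact, but more parts so higher likelihood,
# plus a lot of time to build and maintain (made-up numbers).
failover = risk(impact=10, likelihood=0.50)
failover_cost = 30

# Keep it simple: UPS, no changes while away. Same impact, lower
# likelihood, small cost (made-up numbers).
simple = risk(impact=40, likelihood=0.05)
simple_cost = 1

print("failover worth it:", worth_it(base, failover, failover_cost))
print("simple worth it:  ", worth_it(base, simple, simple_cost))
```

With these guesses the failover setup fails both checks while the simple steps pass both. Different guesses will give different answers, which is the point of running the exercise with your own numbers.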
Personally I see it much like Markus and Russ do. The proposed mitigations do not actually reduce the risk enough to be worth the cost. They lower the impact but raise the likelihood, and the cost in complexity is as high as or higher than the original risk, assuming mitigations like “don’t change anything when not physically present” are applied.
As Markus can tell you, this stuff is hard. You have to take a holistic view of your entire end-to-end system, and often what makes intuitive sense to do (e.g. just throw in a backup system) can make matters worse, not better.