My openHAB system has become more and more important for controlling our house in a very convenient way. As a result, openHAB has achieved a very high WAF (wife acceptance factor).
I have started to look into various ways of making openHAB more resilient with a reasonable amount of effort and money, in a way that can be implemented and maintained by semi-professionals. Yes, there are many solutions out there, but in the end, after reading the docs, I am reluctant to start such a complex project with an uncertain outcome.
As a result, I want to take a step-by-step approach.
Basic conditions are as follows:
High availability of 99.9-whatever% is not required. It is OK if the system needs some time to re-establish the service
BUT it has to work in an unattended way
For now, when I talk about increasing resilience I mean physical resilience (server and storage), excluding the network and excluding “logical” resilience (processes crashing, a corrupt DB, etc.).
This is the first draft so far:
Storage: Even though I have been booting a Pi 4 from an SSD without a single problem ever since Pi OS supported it, I want to move the storage to a Synology NAS with mirrored SSDs.
Server: Reconfigure the Pi 4 to PXE-boot from the NAS.
Later: Add another (standby) Pi 4 that boots from the NAS and replaces the primary Pi if it is no longer available. OK, this is where the magic starts. I can think of approaches where the standby Pi detects the absence of the primary Pi and boots the primary image itself, so that the IP address does not change. To avoid conflicts, the standby Pi would physically cut the primary's power supply via a relay, for example (see the sketch below).
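To make the idea a bit more concrete, here is a rough, purely illustrative sketch of the kind of watchdog I have in mind on the standby Pi. It assumes the standby runs Linux with the RPi.GPIO library installed and that a relay on one GPIO pin cuts the primary's power; the IP address, pin number and timings are placeholders, not a tested setup.

```python
#!/usr/bin/env python3
# Hypothetical watchdog on the standby Pi: ping the primary and, after several
# consecutive failures, cut the primary's power via a GPIO-driven relay before
# the standby takes over. All values below are placeholders.

import subprocess
import time

import RPi.GPIO as GPIO  # assumes the standby is a Pi with RPi.GPIO installed

PRIMARY_IP = "192.168.1.10"   # placeholder address of the primary Pi
RELAY_PIN = 17                # placeholder BCM pin driving the relay
MAX_FAILURES = 5              # consecutive failed pings before failover
CHECK_INTERVAL = 10           # seconds between checks


def primary_alive() -> bool:
    """One ping with a short timeout; True if the primary answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", PRIMARY_IP],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def cut_primary_power() -> None:
    """Energize the relay so the primary is definitely off the network."""
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(RELAY_PIN, GPIO.OUT)
    GPIO.output(RELAY_PIN, GPIO.HIGH)


def main() -> None:
    failures = 0
    while True:
        failures = 0 if primary_alive() else failures + 1
        if failures >= MAX_FAILURES:
            cut_primary_power()
            # ...here the standby would boot/mount the primary image and
            # take over the IP address.
            break
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```

The part I care about is the ordering: cut the primary's power first, take over second, so the two Pis never hold the same IP address at the same time.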
Let me know what you think, what could be improved, what I need to consider, or simply share the experience you have had so far.
IMO, this is a step backwards in resilience and ease of recovery. Although I don’t use a Pi, I’d think using an SSD plus an extra “cold” mirror would make recovery much easier.
You could still use a second Pi 4 (with its own independent SSDs) and rsync the config to it if you want it as a hot spare.
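Roughly something like this, as a sketch only: it assumes rsync and passwordless SSH between the two Pis, and the paths are just the typical openHAB locations on a package-based install; the host name is a placeholder, so adjust everything to your own setup.

```python
#!/usr/bin/env python3
# Hypothetical one-way sync of the openHAB config from the primary Pi to a
# hot-spare Pi. Assumes rsync and passwordless SSH between the two machines.

import subprocess

SPARE_HOST = "pi-spare.local"      # placeholder host name of the spare Pi
CONFIG_DIRS = [
    "/etc/openhab/",               # typical config location on a package install
    "/var/lib/openhab/",           # typical userdata location
]


def sync_to_spare() -> None:
    for src in CONFIG_DIRS:
        subprocess.run(
            ["rsync", "-a", "--delete", src, f"{SPARE_HOST}:{src}"],
            check=True,
        )


if __name__ == "__main__":
    sync_to_spare()   # e.g. run this from a nightly cron job on the primary
```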
Thanks Jim.
But if I understand this correctly, this does not sound like a solution that recovers in an unattended way and lets the spare Pi re-use the master's IP address.
You should still be able to take over the IP if you want. AFAIK it doesn’t matter what the IP of your openHAB is, though, so I don’t see why you’d need to do the IP takeover at all.
Edit: it might matter for your web clients, e.g. a wall display?
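If you did want the takeover anyway, on a Linux spare it could look roughly like this (just a sketch: it needs root, assumes the `ip` and `arping` command-line tools are installed, and the address, prefix length and interface are placeholders):

```python
#!/usr/bin/env python3
# Hypothetical IP takeover on the spare Pi: add the primary's address to the
# spare's interface and announce it with gratuitous ARP so clients update
# their ARP caches quickly. Values below are placeholders.

import subprocess

PRIMARY_IP = "192.168.1.10"   # placeholder address the primary used to hold
INTERFACE = "eth0"            # placeholder network interface on the spare


def take_over_ip() -> None:
    # Attach the primary's address as a secondary address on the spare.
    subprocess.run(
        ["ip", "addr", "add", f"{PRIMARY_IP}/24", "dev", INTERFACE],
        check=True,
    )
    # Send gratuitous ARP so switches and clients learn the new MAC quickly.
    subprocess.run(
        ["arping", "-U", "-c", "3", "-I", INTERFACE, PRIMARY_IP],
        check=True,
    )


if __name__ == "__main__":
    take_over_ip()
```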
I think this is a step backwards in terms of resilience. Back up to the NAS, for sure. Restore from the NAS, absolutely. Run from the NAS? Never. That adds complexity and increases the number of failure points.
See the point above; I wholly agree with @JimT that this would be a step backwards. You are creating a single point of failure that is even more likely to fail, because now you have a network between the RPi and the NAS. That is the opposite of resilient.
This path has been gone down before in many threads. I’ve yet to see a solution that is not overly complex and brittle, and that doesn’t ultimately lead to a reduction in resilience by most measures.
IMO your time and effort are better spent abandoning the fully automated part, decreasing the likelihood that the RPi will go down in the first place (e.g. with a UPS), and developing a super simple, though maybe not automated, way to recover in case of a disaster. The simplest approach I’ve seen for that is the SD card mirroring built into openHABian. In a complete failure, all someone needs to do is swap out the SD cards and reboot. Very minimal effort to recover, and simple enough that a child could do it with a couple of minutes of training.
Or, if you really want automated failover, you need to look into building a cluster of some sort. It could be a VM cluster or a container-based cluster (e.g. Kubernetes). Leverage the countless hours of engineering work done by people smarter than me to solve this problem. Don’t try to roll your own solution.
The NAS only addresses data resiliency, and that’s why it’s a great place to back up to and restore from. Setting up your system in the described way certainly increases data resiliency, but it reduces the resiliency of almost everything else as a result (not to mention the performance hit to file I/O). It’s not a good trade-off. For all that complexity, all you get is a little bit of data resilience, which you could achieve in other, much simpler ways.
And the security engineer in me has to ask, what’s the risk here that you are trying to mitigate? Keep in mind that risk = likelihood * impact.
What’s the likelihood that your system is going to fail in a way that the proposed solution would actually mitigate?