Increase resilience of openHAB server

My openHAB system has become more and more important for controlling our house in a very convenient way. As a result, openHAB has achieved a very high WAF :slight_smile:.
I have started to look into ways of making openHAB more resilient with a reasonable amount of effort and money, in a form that semi-professionals can implement and maintain. Yes, there are many solutions out there, but after reading the docs I am reluctant to start such a complex project with an uncertain outcome.

I therefore want to take a step-by-step approach.
The basic conditions are as follows:

  • High availability of 99.9whatever% is not required. It is OK if the system needs some time to re-establish the service
  • BUT it has to recover unattended
  • For now, when talking about increasing resilience I mean physical resilience (server and storage), excluding the network and excluding “logical” resilience (processes crashing, corrupt DB, etc.).

This is the first draft so far:

  1. Storage: Despite the fact that I have been using a Pi 4 booting from an SSD since it was supported by PiOS without one single problem, I want to move the storage to a Synology NAS with mirrored SSDs.
  2. Server: Reconfigure Pi 4 to PXE Boot from NAS
  3. Later: Add another (standby) Pi 4 that boots from the NAS and replaces the primary Pi if it is no longer available. OK, this is where the magic starts. I can think of approaches where the standby Pi detects the absence of the primary Pi and reboots from the primary image, so that the IP address does not change. To avoid conflicts, the standby Pi would physically cut the power supply of the master, for example with a relay.
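
The detection step in point 3 could be sketched roughly as follows. This is only a minimal illustration in Python, not a tested failover solution. It assumes the standby polls the primary over TCP on port 8080 (openHAB's default HTTP port) and only acts after several consecutive failures, so that one missed check does not trigger a takeover. The function names, the thresholds and the `take_over` placeholder are all hypothetical; the relay and reboot handling is deliberately left out.

```python
import socket
import time
from typing import Callable


def primary_is_up(host: str, port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_failure(check: Callable[[], bool],
                     max_failures: int = 3,
                     interval: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until check() has failed max_failures times in a row.

    A successful check resets the failure counter.
    Returns the total number of checks performed.
    """
    failures = 0
    checks = 0
    while failures < max_failures:
        checks += 1
        if check():
            failures = 0
        else:
            failures += 1
        if failures < max_failures:
            sleep(interval)
    return checks


def take_over() -> None:
    """Hypothetical placeholder: cut the master's power via the relay, then
    reboot this Pi from the primary image. Both steps are setup-specific."""
    raise NotImplementedError("relay and reboot handling are setup-specific")
```

On the standby this might be called as `wait_for_failure(lambda: primary_is_up("192.168.1.10"))` followed by `take_over()`, where the address is a made-up example. Note that a check against port 8080 also fails when only the openHAB process is down; checking the SSH port (22) instead would be one way to react only to a dead Pi.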

Let me know what you think, what can be improved, and what I need to consider, or simply share the experience you have gained so far.


IMO, this is a step backwards in resilience and ease of recovery. Although I don’t use a Pi, I’d think using SSD + an extra “cold” mirror would make it much easier to recover.

You could still use a second Pi4 (with its own independent SSDs) and rsync the config if you want it as a hot spare.
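
A rough sketch of what such a sync could look like, written in Python around the standard `rsync` CLI. The directory list assumes a package-based install (`/etc/openhab` and `/var/lib/openhab`), and the user name `openhabian` and host name `spare-pi.local` are hypothetical examples, so adjust all of them to your setup. The remote user also needs write access to the target directories, which this sketch does not handle.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumed locations on a package-based install; adjust for your setup.
CONFIG_DIRS = ("/etc/openhab/", "/var/lib/openhab/")


def build_rsync_command(src_dir: str, spare_host: str,
                        user: str = "openhabian") -> List[str]:
    """Build an rsync call that mirrors src_dir to the same path on the spare.

    -a        archive mode (recursive, keeps permissions and timestamps)
    --delete  remove files on the spare that no longer exist on the source
    """
    return ["rsync", "-a", "--delete", src_dir, f"{user}@{spare_host}:{src_dir}"]


def sync_to_spare(spare_host: str,
                  dirs: Sequence[str] = CONFIG_DIRS,
                  run: Callable = subprocess.run) -> None:
    """Mirror every directory in dirs to the spare; raise if rsync fails."""
    for src_dir in dirs:
        run(build_rsync_command(src_dir, spare_host), check=True)
```

Calling `sync_to_spare("spare-pi.local")` from a cron job or systemd timer on the primary would keep the spare's config reasonably current. The trailing slashes on the paths matter: they tell rsync to copy the directory contents rather than nest the directory inside the target.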

Thanks Jim.
But this does not sound like a solution which recovers in an unattended way where the spare Pi can re-use the master’s IP address, if I understand this correctly.

You should still be able to take over the IP if you desire. AFAIK, it doesn’t matter what the IP of your openHAB is though, so I don’t see why you’d need to do the IP takeover at all.

Edit: it might matter for your web clients for a wall display?

A harder problem is if you have a ZigBee dongle.

You could even have the Pi use a dynamic DHCP address and use mDNS, e.g. openhab.local

Then in your startup script, reboot your wall displays so they’d pick up the new mDNS address.

There are other ways of doing IP takeover, e.g. VRRP, but keeping things simple is better/easier when simple is possible.

It is not because of openHAB. It is because of my devices, which communicate with a specific IP address.

Such as?

Shellys and devices which send HTTP GET requests to the openHAB server (and where host names are not working)

Why do you think it is a step backwards if you replace a directly attached SSD with a NAS where the storage is mirrored?

Because now you need to fix and set up two things before you’re up and running. Just not my preference but if you’re happy with that, sure.

I think this is a step backwards in terms of resilience. Backup to the NAS for sure. Restore from the NAS, absolutely. Run from the NAS? Never. That adds complexity and increases the number of failure points.

See the point above; I wholly agree with @JimT that this would be a step backwards. You are creating a single point of failure that is even more likely to fail, because now you have a network between the RPi and the NAS. That is the opposite of resilient.
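
To put rough numbers on that: when every component in a chain must work for the service to run, the availabilities multiply, so each added dependency can only lower the total. The sketch below illustrates this in Python. All the figures are made up purely for illustration and are not measurements of any real Pi, network or NAS.

```python
from math import prod


def series_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up (values in 0..1)."""
    return prod(parts)


# Hypothetical figures, for illustration only.
PI = 0.999       # the Pi itself
LOCAL_SSD = 0.999
NETWORK = 0.999  # switch, cabling, PXE/NFS path
NAS = 0.9995     # NAS with mirrored SSDs

pi_with_local_ssd = series_availability(PI, LOCAL_SSD)
pi_booting_from_nas = series_availability(PI, NETWORK, NAS)

print(f"local SSD: {pi_with_local_ssd:.4%}")
print(f"NAS boot:  {pi_booting_from_nas:.4%}")
```

Even with a NAS assumed to be more reliable than a single SSD, the extra network hop means the chain is no better than before in this example. The mirrored disks protect the data, not the uptime of the Pi that depends on them.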

This path has been gone down before on many threads. I’ve yet to see a solution that is not overly complex, brittle, and ultimately a reduction in resilience by most measures.

IMO your time and effort is better spent abandoning the fully automated part, decreasing the likelihood that the RPi will go down in the first place (e.g. with a UPS), and developing a super simple, though maybe not automated, way to recover in case of a disaster. The simplest approach I’ve seen for that is the SD card mirroring built into openHABian. In a complete failure all someone needs to do is swap out the SD cards and reboot. Very minimal effort to recover, and simple enough that a child could do it with a couple of minutes of training.

Or, if you really want automated failover, you need to look into building a cluster of some sort. It could be a VM cluster or a container-based cluster (e.g. Kubernetes). Leverage the countless hours of engineering that people smarter than me have put into solving this problem. Don’t try to roll your own solution.

The NAS only addresses data resiliency, and that’s why it’s a great place to back up to and restore from. Setting up your system in the described way does for sure increase data resiliency, but it reduces the resiliency of almost everything else as a result (not to mention the performance hit to file I/O). It’s not a good trade-off. For all that complexity, all you get is a little bit of data resilience, which you could achieve in other, much simpler ways.

And the security engineer in me has to ask, what’s the risk here that you are trying to mitigate? Keep in mind that risk = likelihood * impact.
What’s the likelihood that your system is going to fail in a way that the proposed solution would actually mitigate it?

Daft question, maybe I’m missing something:

Why don’t you use VMs on your NAS and do away with the Pi?

If the NAS fails, you’ve lost the system either way, from what I can tell.
