Best practice with primary and secondary zwave controllers

I used an Aeon zstick with Openhab1 for several months with approx. 30 zwave sensors and outputs. It worked very reliably until the zstick seemed to get clogged up: a few minutes after a reboot the outbound queue would build up. Although it seemed unlikely to be a hardware fault, I bought a second zstick and used the backup and restore tool to transfer the configuration from the original zstick to the new one. This didn’t help: the new zstick failed in the same way, implying that something in the accumulated state information is causing the problem. I then configured the new zstick as a secondary controller and transferred the configuration from the master. Openhab now uses the secondary controller, which sort of works OK, but many of the sensors need to be reconfigured to send reports to the secondary controller rather than the primary.

I could rebuild the whole network with a new master but this would be painful and I still need to ensure it doesn’t fail a few months later.

Is the best plan to start out with two controllers? A master that doesn’t do anything other than keep the network configuration (and hence, I hope, wouldn’t get clogged up over time), and a secondary that can be configured to do all the work and, in the event of failure, could easily be replaced.

Hmm. With respect to reliability in case of HW failures, the best plan is to have a backup of your (single) working stick and restore it to new hardware in case your old stick fails. You can deploy a secondary controller, but (as you found out yourself) it won’t be able to accomplish everything your failed primary used to handle (unless you manually change a number of things, such as moving the device associations). It’s probably faster to restore in such a case. Honestly, I never fully understood what a secondary controller is good for w.r.t. reliability.
Now that obviously won’t help you with your problem. To find out the root cause, you could try browsing the zwave debug logs and system logs for any irregularities, but I know that’s far from being a cookbook or guarantee that you’ll find anything.
The only reliable route to a solution I see, though probably painful, is to find out when and what messed up your stick’s config, and then to revert to a state that doesn’t have this problem. You could try removing (excluding) the devices you added most recently and go back in time as needed.
I recently had a case where including a specific new device reproducibly killed part of my OH functionality, and excluding it recovered the system (btw, it was device #125, so I have been running a lot of zwave for a long time and still am, but this was the first time I encountered this type of problem).
I don’t know whether the Aeon tools provide a capability to modify stick-stored data, such as to ‘zap’ nodes without having to exclude them… I’m using RaZberry, where this can be done using the z-way software.

If that all does not result in success, your best option probably is to start from scratch, i.e. wipe the stick and start including all your devices again. This time, make stick backups at several stages (after a couple of devices added).
This just reminded me I should do that, too, as I am overdue to back up my current config.

Thanks for the input. With hindsight I should obviously have taken a backup earlier, but if the fault is due to the zstick accumulating garbage over time, it is still a painful approach. I would need to take a backup when the network was first built and then, every time I wanted to include a new device, restore the previous backup to ensure the zstick was “FRESH”, include the new device, and take a new backup for the next time I wanted to add something.

I have looked at the zwave logs - I’m no expert but all I can see is that the transmit queue seems to start getting longer as it slowly grinds to a halt.
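
In case anyone wants to check for the same symptom without reading the log by eye, here is a rough Python sketch that pulls the reported queue length out of each log line and checks whether it is trending upwards. It assumes the debug log contains lines with the text "Queue length = N"; that wording is an assumption and may need adjusting to whatever your binding version actually writes. The function names and the window size are made up for illustration.

```python
import re

# ASSUMPTION: debug lines report the send queue as "Queue length = <n>".
# Adjust the pattern if your log uses different wording.
QUEUE_RE = re.compile(r"Queue length = (\d+)")


def queue_lengths(lines):
    """Return the queue length found on each matching log line, in order."""
    lengths = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            lengths.append(int(match.group(1)))
    return lengths


def is_backing_up(lengths, window=5):
    """Crude trend check: is the average of the last `window` samples
    higher than the average of the first `window` samples?"""
    if len(lengths) < 2 * window:
        return False  # not enough samples to compare
    first = sum(lengths[:window]) / window
    last = sum(lengths[-window:]) / window
    return last > first
```

To use it, pass in the lines of the zwave log (an open file object works, since it iterates line by line) and hand the result to `is_backing_up`. It only confirms that the queue is growing; it says nothing about why.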

I could try your approach of excluding devices in reverse order but the fault didn’t occur shortly after adding something new.

I could try Aeon but I don’t have much hope - the installation of the backup tool hasn’t even been translated into English!