When you name your home automation project after gods, you'd better deliver

Why I spent way too much time picking names — and why it was worth it.

Every serious project deserves a name. Mine got four.

Penates is what the Romans called the gods of the household: the invisible forces that kept the home safe, stocked, and running. Not the grand gods of war or thunder. The quiet ones. The ones nobody noticed until they were gone. That felt exactly right for a home automation system that should do its job without anyone thinking about it.

The design principle behind Penates is simple: if someone in the house notices the system, it has already failed. It should just work. Silently, completely, and without asking for attention.

So Penates became the umbrella. Three gods beneath it.

Vesta: goddess of the eternal flame, the sacred hearth at the heart of every Roman home. In my setup, Vesta is OpenHAB running on a Raspberry Pi 5, the actual brain of the house. Lights, climate, energy monitoring, presence detection, appliance tracking. The fire that keeps everything warm and alive.

What that looks like in practice: when the dryer finishes, Vesta doesn’t use a timer. It watches the actual watt curve and knows the difference between “heating element cycling down” and “cycle complete”. The notification that lands includes the exact kWh consumed, peak wattage, and how that compares to the average across all previous cycles. Not “dryer done.” Data.
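
A minimal sketch of that kind of watt-curve logic, in Python. The thresholds, the sampling interval, and all names here are assumptions for illustration, not the actual Vesta rules. The cycle only counts as finished once power has stayed below an idle threshold for several consecutive samples, so a heating element that briefly switches off is not mistaken for the end of the cycle.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_WATTS = 5.0      # assumed: below this the dryer is considered idle
QUIET_SAMPLES = 3     # assumed: consecutive idle samples that end a cycle
SAMPLE_SECONDS = 60   # assumed: one power reading per minute


@dataclass
class CycleSummary:
    kwh: float
    peak_watts: float
    vs_average_pct: Optional[float]  # None when there is no history yet


def summarize_cycle(samples, previous_kwh):
    """Return a CycleSummary once the cycle has ended, otherwise None."""
    quiet = 0
    energy_ws = 0.0  # accumulated energy in watt-seconds
    peak = 0.0
    for watts in samples:
        if watts < IDLE_WATTS:
            quiet += 1
            if quiet >= QUIET_SAMPLES:
                break  # long enough quiet period: cycle complete
        else:
            quiet = 0  # heating element came back on, keep going
        energy_ws += watts * SAMPLE_SECONDS
        peak = max(peak, watts)
    else:
        return None  # samples ran out before a long enough quiet period

    kwh = energy_ws / 3_600_000
    vs_average = None
    if previous_kwh:
        average = sum(previous_kwh) / len(previous_kwh)
        vs_average = (kwh - average) / average * 100
    return CycleSummary(kwh, peak, vs_average)
```

The single dip to 3 W in a series like `[2000, 2000, 3, 2000, 3, 3, 3]` is ignored; only the final run of three idle samples ends the cycle.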

Another example: when a device bridge goes offline (a Zigbee gateway, say, or a smart plug hub), Vesta doesn’t wait for someone to notice. It detects the drop, disables the bridge, waits four seconds, and brings it back up. If that doesn’t work, it tries once more. If that doesn’t work either, it sends a notification. Most of the time, nobody ever knows there was a problem. That’s the point.
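
The recovery sequence described here (disable, wait four seconds, re-enable, retry once, then notify) could be sketched roughly like this in Python. The callables `disable`, `enable`, `is_online` and `notify` are hypothetical hooks standing in for whatever the real rule calls. This is an illustration of the control flow, not the actual openHAB rule.

```python
import time


def recover_bridge(disable, enable, is_online, notify,
                   attempts=2, pause_seconds=4.0, sleep=time.sleep):
    """Restart a bridge up to `attempts` times; return True if it came back."""
    for _ in range(attempts):
        disable()
        sleep(pause_seconds)  # give the bridge time to shut down cleanly
        enable()
        if is_online():
            return True       # recovered silently, nobody needs to know
    # Only after every attempt has failed does a human get involved
    notify("Bridge still offline after %d restart attempts" % attempts)
    return False
```

Passing `sleep` in as a parameter keeps the function easy to test without actually waiting.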

Hecate: Greek goddess of crossroads and thresholds. She watched who crossed the boundary. In my setup, Hecate watches the network: Pi-hole, ntopng, NetAlertX, arpwatch. Every device that joins gets noticed. Every DNS query gets logged.

What that looks like in practice: a new MAC address appeared on the network at 2am last month. Hecate had an alert on my phone before I’d rolled over. It was a guest device that had reconnected after a firmware update — but Hecate didn’t know that. She just knew it was different.

That distinction, “I don’t know what this is, but I know it’s different”, is the whole point.
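
In code, "different" can be as simple as a set difference against the known inventory. A small Python sketch, with made-up MAC addresses and function names (this is not how NetAlertX or Pi-hole implement it internally):

```python
def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def find_unknown_devices(seen_macs, known_macs):
    """Return MACs that were seen on the network but are not in the inventory."""
    known = {normalize_mac(m) for m in known_macs}
    unknown = []
    for mac in seen_macs:
        mac = normalize_mac(mac)
        if mac not in known and mac not in unknown:
            unknown.append(mac)  # keep first-seen order, drop duplicates
    return unknown
```

The check makes no attempt to decide whether the device is hostile. It only reports that the address is not in the list.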

Heimdall: the Norse watchman who never sleeps, can see for a hundred miles, and hears grass grow. His job is to sound the alarm when something crosses the bridge. In my setup, Heimdall is the active watchdog, monitoring 22+ system parameters, running a honeypot on silent canary ports, and dispatching alerts via a self-hosted notification server.

What that looks like in practice: anything that probes one of those canary ports gets flagged, timestamped, and reported. Not blocked first but logged first. Because knowing that someone knocked is more interesting than just slamming the door.
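
A canary port in its simplest form is a listener that nothing legitimate should ever connect to. The following Python sketch records who knocked before dropping the connection. The function names and log format are assumptions, and a real deployment would listen on an externally reachable address and run this in a loop.

```python
import socket
from datetime import datetime, timezone


def open_canary(host="127.0.0.1", port=0, timeout=5.0):
    """Open a listening TCP socket; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    srv.settimeout(timeout)  # accept() raises instead of blocking forever
    return srv


def record_next_probe(srv, log):
    """Accept one connection, log it, then close it without replying."""
    conn, (source_ip, source_port) = srv.accept()
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "source_ip": source_ip,
        "source_port": source_port,
        "canary_port": srv.getsockname()[1],
    }
    log.append(entry)  # logged first...
    conn.close()       # ...and only then is the door closed
    return entry
```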

The deeper layer: Heimdall isn’t just software. If the Pi locks up completely (kernel panic, deadlock, whatever), a hardware timer kicks in after 14 seconds and forces a cold reboot. No remote session needed. No one has to notice. The house just comes back up. You never hope to need this. But it’s there.
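
On Linux, a hardware watchdog is typically exposed as a device file such as `/dev/watchdog`: opening it arms the timer, and every write resets the countdown. If the writes stop, the hardware reboots the machine. A rough Python sketch of the keep-alive side follows. The device path, the interval, and the `healthy` check are assumptions, and the 14-second timeout itself is assumed to be configured elsewhere.

```python
import time


def run_keepalive(device_path="/dev/watchdog", interval=5.0,
                  healthy=lambda: True, sleep=time.sleep, max_beats=None):
    """Reset the watchdog countdown for as long as the system looks healthy.

    Returns the number of keep-alive writes. `max_beats` exists only so the
    loop can be exercised against an ordinary file in a test.
    """
    beats = 0
    # Unbuffered, so every write reaches the device immediately
    with open(device_path, "wb", buffering=0) as device:
        while healthy() and (max_beats is None or beats < max_beats):
            device.write(b"\0")  # any write counts as "still alive"
            beats += 1
            sleep(interval)
    # Leaving the loop means no more writes: on most drivers the timer
    # keeps running after an ordinary close and the reboot follows.
    return beats
```

The interval has to stay comfortably below the hardware timeout, otherwise a healthy system reboots itself.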

All of it runs on a single Pi. No cloud subscriptions. No monthly fees. No data leaving the house, except for one deliberate exception I’ll leave for another post.

Running: OpenHAB 4.3 · InfluxDB · Grafana · Pi-hole · Loki/Alloy · custom Python watchdog

It really starts to get interesting when you branch out to other services (document management, media management, document sharing, automated backups including synchronized DB dumps, etc.).

Are you using arpwatch because NetAlertX’s alerting doesn’t work with the services you use?

I too use mythic names, but mostly for machine hostnames and always monsters/creatures, with a couple of exceptions. The RPi I use to control the garage door openers and monitor their open/closed status is called “cerberos”, for example. It’s always fun naming things.

Ha, Cerberus for the garage door is perfect. He literally guards the entrance
to the underworld. I might have to steal that logic for my own garage :wink:

On the other services: I’ve deliberately kept the Pi focused. One machine, one
purpose, home automation and network security. The moment I start adding
Jellyfin or Nextcloud I know I’ll be debugging resource contention at 2am
instead of sleeping.

On arpwatch vs NetAlertX: they cover different layers. NetAlertX handles device
inventory and new MAC alerts, the “who’s on my network” question. arpwatch sits
at the ARP level and specifically catches flip-flops: a device suddenly
advertising a different MAC for the same IP. That’s the ARP spoofing signal.
NetAlertX wouldn’t catch that the same way.
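
The core of that check is small: remember the last MAC seen for each IP and flag any change. A simplified Python sketch, with illustrative names and none of arpwatch's finer distinctions between event types:

```python
def detect_mac_changes(observations):
    """Flag IPs whose MAC address changes between observations.

    `observations` is an iterable of (ip, mac) pairs in arrival order.
    Returns a list of (ip, old_mac, new_mac) tuples.
    """
    last_seen = {}
    events = []
    for ip, mac in observations:
        mac = mac.lower()
        old = last_seen.get(ip)
        if old is not None and old != mac:
            events.append((ip, old, mac))  # same IP, different hardware
        last_seen[ip] = mac
    return events
```

A brand-new IP produces no event here. That case belongs to the inventory layer, which is exactly the division of labour described above.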

The third layer is ntopng, which is less about the family and more about the
things. I didn’t buy these IoT devices to trust them. I bought them to control
them. A Zigbee hub, a solar inverter, a smart energy meter, smart plugs, they
all run closed-source firmware and phone home to clouds I didn’t choose. ntopng
tells me what each of them is actually doing: which domains they contact, how
often, and whether that pattern changes. A solar inverter that checks in with
servers in China every few minutes is fine, until the interval suddenly doubles.
Then you want to know why.
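
Spotting that kind of change does not need much: compare the typical gap between check-ins with a known baseline. A Python sketch, assuming the contact timestamps (in seconds) have already been exported from whatever tool records them. The function name and the 50% tolerance are made up for illustration.

```python
from statistics import median


def interval_changed(timestamps, baseline_seconds, tolerance=0.5):
    """Return True when the typical check-in gap drifts from the baseline.

    `tolerance` is the allowed deviation as a fraction of the baseline.
    """
    if len(timestamps) < 3:
        return False  # not enough data to say anything
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    typical = median(gaps)  # median shrugs off a single missed check-in
    return abs(typical - baseline_seconds) > tolerance * baseline_seconds
```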

On architecture: the subproject structure helps a lot. Vesta, Hecate and Heimdall
each have a clear scope and never touch each other’s config. A system like this
gets too complex to hold entirely in your head. At some point you need something
that remembers the decisions you made six months ago and why. The entire system
state lives in a set of structured markdown files: architecture decisions,
runbooks, integration states, security posture, everything. I keep those up to
date with Claude as a technical partner. That combination has honestly changed
how I work on this. But that’s probably worth its own post.

I would actually recommend moving the network stuff to its own machine at some point, though we all work with the hardware we have on hand. I find that keeping the watchers separate from the movers provides more flexibility long term.

OK, so you are not using arpwatch for discovery. I was confused because one of the default scanners in NetAlertX does work at the arp level so it wasn’t clear what arpwatch provides over what NetAlertX does.

I tend to block these devices from reaching the internet in the first place.

I used to use ntopng and it does provide a lot of information but I personally found nothing really actionable there that I wasn’t getting from other sources. I also ran it on my opnSense machine and it was taking up an unacceptable amount of resources. I’ve since upgraded that machine so I may go back to it at some point but other stuff is higher priority.

That’s the main reason I use Ansible. Nothing gets manually configured (except on my laptop where it’s a free-for-all). Every package that gets installed, every container that gets deployed, and every configuration change that is made is done through Ansible. I don’t have to remember anything that way. It’s all documented there in code (and comments). And if I want to rebuild I can without missing anything.

One of the things that is high on my list though is to have Claude/Gemini/et al review my playbooks and offer improvements. But first I need to clear out some technical debt built up over the past 12 years of a gradually growing home lab. I mainly wanted to move my openHAB off of a VM and into an LXC on my proxmox server, but to do that I needed to get and move all my zwave/zigbee devices off of the machine and make them accessible over the network to eliminate the hardware passthrough. Then I needed to improve my host migration strategy (when a service is first deployed to a host it should restore from backup on that new host). And that meant improving my backup strategy which is where I am now.

For that I’m using restic with a NAS for local backup and Backblaze for remote backup (3-2-1 strategy).

Each service I deploy that I want backed up “depends” on a standardized generic_backup role. This role:

  1. installs restic if it is not already installed, or updates it if it is (the same Ansible playbooks I run to set everything up I also run to upgrade, because the playbooks are idempotent)
  2. deploys a password file for the service (each service’s backup repo could have its own encryption password)
  3. installs the postgresql client if the service is using postgresql
  4. configures log rotation for the backup logs (the backup script generates logs)
  5. installs the backup script and systemd .service and .timer files to run the backup script on the desired schedule
  6. initializes the restic repos if they were just created
  7. deploys and rebuilds the Backrest configuration for monitoring and interacting with the backups through a web based UI
  8. (todo) uses the healthchecks API to add this backup to the list of watchdogs so I get an alert if the backup fails to run for some reason

I’d never remember all of that and doing it manually for each and every service I run would be a major job. But it’s all here in code and I can deploy and configure backups for 20+ services all in an identical and repeatable way. If I decide to change something (e.g. move to Amazon AWS instead of Backblaze) I can change it in one place and it gets deployed for all the services in one run of the playbook.

There are a few things I can’t do (or haven’t bothered to do) through Ansible yet though, and those are woefully undocumented at the moment (e.g. my HAProxy config, which enables https with a trusted cert for my services on my LAN). Eventually I’m hoping to use some LLM agents to help build that documentation and/or add it to the playbooks. But first I wanted to upgrade my backups from a bunch of cron-driven rsyncs to a local NAS to something more professional and supportable long term.

Other apt names:

  • argus: camera NVR/DVR (many eyed guardian of Hera’s sacred cattle)
  • fafnir: NAS (in Norse myth, a dragon that guards a famous treasure)
  • huginn or muninn: Ollama or other LLM host (Memory and Thought, Odin’s ravens)
  • sphinx: captive portal (“Oedipus, you must answer this riddle to pass”)
  • hydra: Type one hypervisor like Proxmox (many headed → many VMs)
  • Anansi/Loki/Coyote/Raven/Hermes/…: honey pot (trickster gods)
  • Clio: documents server (muse of history)
  • Euterpe: media server (muse of music)
  • Mnemosyne: notes server like Joplin (goddess of memory)

The separate hardware point is well taken and it’s on my list. The challenge
in my case is one specific constraint: Qbus. It’s a Belgian home automation
brand that controls all the lighting in the house, 38 circuits. The controller
lives on the ISP modem subnet (192.168.1.x) while everything else runs on a
TP-Link Deco mesh (192.168.68.x). The gateway bridges those two worlds. A
previous iptables change taught me the hard way what happens when you get that
wrong. The lights went out. That’s the moment you stop being a hobbyist and
start being the person your family looks at.

I ran OpenHAB 3 with Qbus for about ten years, mostly on autopilot. It worked
but it was messy and upgrading was a genuine pain. I only started taking this
seriously in February. The lighting was always the foundation though. Everything
else, the monitoring, the energy tracking, the watchdogs, that’s all built on
top of something that just has to work. VLAN segmentation would be the clean
architectural move but it’s not worth the risk until I fully understand the
connection path. That’s the todo item I’m most careful about.

On Ansible vs markdown files: I think we’re solving the same problem differently.
You want reproducibility, I want legibility. My system evolved fast enough that
I needed something I could reason about quickly, not just replay. Whether that
scales the way Ansible does is a fair question and probably one for the future.

One thing I’ve been careful about is keeping as much offline as possible. Remote
access runs entirely through NordVPN Meshnet. No open ports, no reverse proxy
exposed to the internet, no dynamic DNS. I had the subscription for years before
I understood that was actually the right way to run this. The Pi is just not
reachable unless you’re on the mesh. Where I’ve deliberately crossed the offline
line is the watchdog layer. When something goes wrong, the alert goes through
the Claude API for analysis before it hits my phone. The notification that
arrives already includes context about what’s likely causing the issue and a set
of action buttons tailored to that specific problem. One tap, problem handled,
no SSH session needed.

That API surface is something I track carefully though. I keep a dedicated
markdown file that logs every field sent to any external API, with a risk
assessment for each one. It starts small and innocent, a CPU temperature here,
a swap value there. But these things grow and at some point you no longer know
what picture you’re painting about your household. Keeping that register strict
is one of the less glamorous parts of the project but probably one of the more
important ones.
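
One way to make such a register enforceable rather than just descriptive is to treat it as an allowlist that every outbound payload has to pass through. A Python sketch with hypothetical field names and risk labels (the real register is a markdown file, and its contents are not shown in this thread):

```python
# Hypothetical register: field name -> assessed risk
REGISTER = {
    "cpu_temp_c": "low",
    "swap_used_mb": "low",
}


def filter_outbound(payload, register):
    """Split a payload into fields that may leave the house and fields that may not."""
    allowed = {key: value for key, value in payload.items() if key in register}
    blocked = sorted(key for key in payload if key not in register)
    return allowed, blocked
```

Anything that shows up in `blocked` is a field somebody added without assessing it first, which is precisely the creep described above.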

You’ve pointed me towards a few paths I hadn’t considered yet. Restic and
Backblaze in particular are going on the research list.

Those name suggestions are fantastic by the way. Huginn and Muninn for an LLM
host is almost too perfect. Memory and Thought. Odin’s ravens that fly out
every day and come back with what they’ve seen. I thought I was the only one
who went down this particular rabbit hole.

Your HAProxy point stuck with me. That’s exactly the problem I was trying to
solve too, the stuff that works so well you never document it until the day you
have to rebuild it from memory.

What changed things for me was flipping the approach. Instead of treating
documentation as something you do after the work, I made it the foundation
before the work. Everything lives in a set of structured markdown files,
architecture decisions, runbooks, integration states, security posture. Before
anything gets built or changed, those get read first. It sounds like overhead
but it’s the opposite. It means I never lose context between sessions and I
never have to hold the whole system in my head.

I should be honest: I’m not a strong programmer. I understand architecture and
I understand what I want a system to do, but the actual implementation has gaps.
What changed is that Claude fills those gaps, not by guessing, but because it
has the full documented context of every decision I’ve made and why. That
combination, knowing what you want the system to be plus a tool that can help
you build it properly, turns out to be surprisingly effective.

On the healthchecks todo you mentioned: I went one step further with ntfy.
When the watchdog fires it doesn’t just alert, it sends a notification with
action buttons specific to the problem detected. Backup failed? Button to run
it immediately. Service down? Button to restart it. All routed through Meshnet,
no open ports, one tap from anywhere. It’s the kind of thing that feels like
a small detail until it’s three steps ahead of where you need to be.
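
For anyone curious what that looks like on the wire: ntfy accepts a JSON body with an `actions` array, where each `http` action becomes a button that fires a request when tapped. A Python sketch of building and sending such a message follows. The topic, hostnames, and action URLs are placeholders, not the real setup.

```python
import json
import urllib.request


def build_alert(topic, title, message, buttons):
    """Build an ntfy JSON message; `buttons` is a list of (label, url) pairs."""
    if len(buttons) > 3:
        raise ValueError("ntfy allows at most three action buttons")
    return {
        "topic": topic,
        "title": title,
        "message": message,
        "actions": [
            {"action": "http", "label": label, "url": url,
             "method": "POST", "clear": True}  # clear: dismiss after tapping
            for label, url in buttons
        ],
    }


def send_alert(server_url, payload, timeout=10):
    """POST the message to the ntfy server root and return the HTTP status."""
    request = urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (placeholder hostnames):
# payload = build_alert("watchdog", "Backup failed", "restic exited with 1",
#                       [("Run backup now", "http://pi.example.internal/run-backup")])
# send_alert("http://ntfy.example.internal", payload)
```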

The privacy register I mentioned is worth thinking about early if you go down
the API route. I track every field sent to any external API with a risk
assessment before it goes in. Hardware metrics feel harmless. But you add
process names, then service states, then behavioral patterns and somewhere
along the way you’ve built a detailed portrait of your household without
noticing. Easier to think about upfront than untangle later.

None of this requires being a great programmer. It requires knowing what
you’re building and why.

I would recommend looking very hard to isolate that Qbus part so that you would never be editing anything that will affect it unless you are actually wanting to. Editing iptables for something else should not have impacted that.

Easier said than done, I know. But something like qbusMQTT could help provide that separation. All the Qbus stuff can be isolated off on its own, never touched. The only complicated thing is making sure that whatever is hosting qbusMQTT can see both the Qbus network and the MQTT broker. After that, the only way you’ll break Qbus accidentally is if you break one of those connections.

Just an idea. I’ve been the person everyone is staring at more than once. For many years no new device could join my LAN without being given a static lease in DHCP. If they did, they could not get to the internet (I think DNS was ultimately what didn’t work for them). It’s kind of embarrassing to have to say “oh, you want on the wifi? I need to disable random mac addresses on your phone and then add you to the list of static addresses.”

Now I only have to do that if they want to be exempted from the parental controls filtering. :smiling_face_with_horns:

Way back in the time before Google when you actually had to register your website with the search engines I ran one of the largest mythology related link aggregation sites on the Internet. Google pretty much killed that whole concept and the forum I hosted was used too little to keep it up. So I let it go and it lives on only in the vaults of the wayback machine now.

For the interested, here’s my role so far.

Note: an Ansible role is separated into a folder structure.

Prerequisites: a Backblaze account with an application key that grants access to the bucket, plus some local file storage reachable over sftp, with SSH keys deployed so that whatever user runs the backup (root in my case) can sftp to the destination.

The rule of thumb is to choose a Backblaze region far from your home so region wide disaster scenarios do not wipe out both your home and your backups.

generic_backup/tasks/main.yml

---
# roles/generic_backup/tasks/main.yml

- name: Check if Restic is installed
  ansible.builtin.command: which restic
  register: restic_check
  failed_when: false
  changed_when: false

- name: Install Restic from official binary (if missing)
  when: restic_check.rc != 0
  block:
    - name: Get latest release version
      ansible.builtin.uri:
        url: https://api.github.com/repos/restic/restic/releases/latest
        method: GET
        return_content: true
      register: restic_latest_release

    - name: Set Restic version fact
      ansible.builtin.set_fact:
        restic_version: "{{ restic_latest_release.json.tag_name | replace('v', '') }}"

    # yamllint disable rule:line-length
    - name: Set Restic download URL
      ansible.builtin.set_fact:
        restic_url: https://github.com/restic/restic/releases/download/v{{ restic_version }}/restic_{{ restic_version }}_linux_{{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }}.bz2
        checksum_url: "https://github.com/restic/restic/releases/download/v{{ restic_version }}/SHA256SUMS"
    # yamllint enable rule:line-length

    - name: Download Restic binary and checksum file
      ansible.builtin.get_url:
        url: "{{ item }}"
        dest: "/tmp/{{ item | basename }}"
        mode: '0644'
      loop:
        - "{{ restic_url }}"
        - "{{ checksum_url }}"

    - name: Get expected checksum from file
      ansible.builtin.command: >-
        awk '/restic_{{ restic_version }}_linux_{{ "arm64" if ansible_architecture == "aarch64" else "amd64" }}.bz2/ {print $1}' /tmp/SHA256SUMS
      register: expected_checksum
      changed_when: false

    - name: Calculate local binary checksum
      ansible.builtin.stat:
        path: "/tmp/restic_{{ restic_version }}_linux_{{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }}.bz2"
        checksum_algorithm: sha256
      register: local_binary

    - name: Verify checksum matches
      ansible.builtin.fail:
        msg: "Checksum mismatch! Expected {{ expected_checksum.stdout }}, got {{ local_binary.stat.checksum }}"
      when: local_binary.stat.checksum != expected_checksum.stdout

    - name: Extract Restic binary
      ansible.builtin.command: "bzip2 -d -f /tmp/restic_{{ restic_version }}_linux_{{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }}.bz2"
      changed_when: false

    - name: Move Restic to /usr/local/bin
      ansible.builtin.copy:
        src: "/tmp/restic_{{ restic_version }}_linux_{{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }}"
        dest: /usr/local/bin/restic
        mode: '0755'
        remote_src: true
      become: true

- name: Update Restic if already installed
  ansible.builtin.command: restic self-update
  register: restic_update
  changed_when: "'is up to date' not in restic_update.stdout"
  when: restic_check.rc == 0
  become: true

- name: Create secure Restic configuration directory
  ansible.builtin.file:
    path: /etc/restic
    state: directory
    mode: '0700'
  become: true

- name: Deploy password file for {{ service_name }}
  ansible.builtin.copy:
    content: "{{ restic_repo_password }}"
    dest: "/etc/restic/{{ service_name }}.pass"
    mode: '0600'
  become: true

- name: Set Restic repo location with port for {{ service_name }}
  ansible.builtin.set_fact:
    restic_repo: "sftp://{{ restic_user }}@{{ restic_server }}:{{ restic_port | default('22') }}/{{ restic_vault_path }}/{{ ansible_hostname }}/{{ service_name }}"

- name: Set remote Restic repo location for {{ service_name }}
  ansible.builtin.set_fact:
    remote_restic_repo: "s3:{{ restic_backblaze_bucket_endpoint }}/{{ restic_backblaze_bucket }}/{{ ansible_hostname }}/{{ service_name }}"

- name: Install PostgreSQL client for {{ service_name }}
  become: true
  when: (db_name | default('')) | length > 0
  block:
    - name: Add PostgreSQL GPG key
      ansible.builtin.get_url:
        url: https://www.postgresql.org/media/keys/ACCC4CF8.asc
        dest: /usr/share/keyrings/postgresql-archive-keyring.asc
        mode: '0644'
      become: true

    - name: Add PostgreSQL repository
      ansible.builtin.apt_repository:
        # Note the 'signed-by' pointing to the exact file we just downloaded
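        # Match the architecture handling used for the Restic download above instead of hard-coding amd64
        repo: "deb [arch={{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }} signed-by=/usr/share/keyrings/postgresql-archive-keyring.asc] http://apt.postgresql.org/pub/repos/apt {{ ansible_distribution_release }}-pgdg main"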
        state: present
        filename: pgdg
        update_cache: true
      become: true

    - name: Ensure PostgreSQL client is installed if DB dump is required for {{ service_name }}
      ansible.builtin.package:
        name: postgresql-client
        state: present
      become: true

- name: Configure log rotation for restic backups for {{ service_name }} # noqa: risky-file-permissions
  ansible.builtin.copy:
    dest: "/etc/logrotate.d/restic-{{ service_name }}"
    content: |
      {{ restic_log_dir }}/{{ service_name }}.log {
          rotate 7
          daily
          compress
          missingok
          create 0644 root root
      }
  become: true

- name: Install execution script for {{ service_name }}
  ansible.builtin.template:
    src: restic-backup.sh.j2
    dest: "/usr/local/bin/restic-backup-{{ service_name }}.sh"
    mode: '0700'
  become: true

- name: Deploy systemd service unit for {{ service_name }}
  ansible.builtin.template:
    src: restic-backup.service.j2
    dest: "/etc/systemd/system/restic-backup-{{ service_name }}.service"
    mode: '0644'
  become: true
  register: service_unit

- name: Deploy systemd timer unit for {{ service_name }}
  ansible.builtin.template:
    src: restic-backup.timer.j2
    dest: "/etc/systemd/system/restic-backup-{{ service_name }}.timer"
    mode: '0644'
  become: true
  register: timer_unit

- name: Reload systemd daemon for {{ service_name }}
  ansible.builtin.systemd:
    daemon_reload: true
  when: service_unit.changed or timer_unit.changed
  become: true

- name: Enable and start systemd timer for {{ service_name }}
  ansible.builtin.systemd:
    name: "restic-backup-{{ service_name }}.timer"
    state: started
    enabled: true
  become: true

- name: Check if local Restic repository exists for {{ service_name }}
  ansible.builtin.command: >-
    restic list keys -r {{ restic_repo }}
  register: restic_check_repo
  failed_when: false
  changed_when: false
  environment:
    RESTIC_PASSWORD_FILE: "/etc/restic/{{ service_name }}.pass"
  become: true

- name: Initialize local Restic repository if it does not exist for {{ service_name }}
  ansible.builtin.command: >-
    restic init -r {{ restic_repo }}
  when: restic_check_repo.rc != 0
  changed_when: true
  environment:
    RESTIC_PASSWORD_FILE: "/etc/restic/{{ service_name }}.pass"
  become: true

- name: Check if remote Restic repository exists for {{ service_name }}
  ansible.builtin.command: >-
    restic list keys -r {{ remote_restic_repo }}
  register: restic_check_b2_repo
  failed_when: false
  changed_when: false
  environment:
    RESTIC_PASSWORD_FILE: "/etc/restic/{{ service_name }}.pass"
    AWS_ACCESS_KEY_ID: "{{ backblaze_restic_key_id }}"
    AWS_SECRET_ACCESS_KEY: "{{ backblaze_restic_api_key }}"
  become: true

- name: Initialize remote Restic repository if it does not exist for {{ service_name }}
  ansible.builtin.command: >-
    restic init -r {{ remote_restic_repo }}
  when: restic_check_b2_repo.rc != 0
  changed_when: true
  environment:
    RESTIC_PASSWORD_FILE: "/etc/restic/{{ service_name }}.pass"
    AWS_ACCESS_KEY_ID: "{{ backblaze_restic_key_id }}"
    AWS_SECRET_ACCESS_KEY: "{{ backblaze_restic_api_key }}"
  become: true

- name: Create local repo configuration fragment on backrest host for {{ service_name }}
  delegate_to: "{{ backrest_host }}"
  ansible.builtin.template:
    src: repo_fragment.json.j2
    dest: "{{ backrest_home }}/fragments/local-{{ service_name }}.json"
    mode: '0600'
    owner: "{{ backrest_uid }}"
    group: "{{ backrest_uid }}"
  vars:
    # Pass all variables required by the template here
    repo_type: "sftp"
    repo_id: "local-{{ service_name }}"
    # Use ansible_hostname here so the path matches restic_repo defined earlier
    repo_uri: "sftp:{{ restic_user }}@{{ restic_server }}:/{{ restic_vault_path }}/{{ ansible_hostname }}/{{ service_name }}"
  become: true
  register: local_fragment

- name: Create remote repo configuration fragment on backrest host for {{ service_name }}
  delegate_to: "{{ backrest_host }}"
  ansible.builtin.template:
    src: repo_fragment.json.j2
    dest: "{{ backrest_home }}/fragments/remote-{{ service_name }}.json"
    mode: '0600'
    owner: "{{ backrest_uid }}"
    group: "{{ backrest_uid }}"
  vars:
    # Pass all variables required by the template here
    repo_type: "s3"
    repo_id: "remote-{{ service_name }}"
    repo_uri: "{{ remote_restic_repo }}"
  become: true
  register: remote_fragment

- name: Rebuild the Backrest config and restart
  when: local_fragment is changed or remote_fragment is changed
  block:
    - name: Rebuild Backrest Config for {{ service_name }}
      become: true
      delegate_to: "{{ backrest_host }}"
      ansible.builtin.shell: |
        set -o pipefail
        # Assemble and overwrite the config
        jq --slurpfile new_repos <(jq -s '{repos: .}' {{ backrest_home }}/fragments/*.json) \
          '.repos = $new_repos[0].repos' {{ backrest_home }}/config/config.json > {{ backrest_home }}/fragments/config.json.tmp && \
        mv {{ backrest_home }}/fragments/config.json.tmp {{ backrest_home }}/config/config.json
      args:
        executable: /bin/bash
      changed_when: true

    - name: Restart Backrest container for {{ service_name }}
      become: true
      delegate_to: "{{ backrest_host }}"
      community.docker.docker_container:
        name: backrest
        state: started
        restart: true

Stuff in {{ }} are variables defined elsewhere.

generic_backup/templates/repo_fragment.json.j2

Note: templates are Jinja.

{
  "id": "{{ repo_id }}",
  "uri": "{{ repo_uri }}",
  "guid": "{{ ( inventory_hostname + service_name + repo_uri) | hash('sha256') }}",
  "password": "{{ restic_repo_password }}",
  {% if repo_type == 'sftp' %}
  "flags": ["--option=sftp.args='-oBatchMode=yes -p {{ restic_port | default('22') }}'"],
  {% elif repo_type == 's3' %}
  "env": [
    "AWS_ACCESS_KEY_ID={{ backblaze_restic_key_id }}",
    "AWS_SECRET_ACCESS_KEY={{ backblaze_restic_api_key }}"
  ],
  {% endif %}
  "prunePolicy": {
    "schedule": { "disabled": true, "clock": "CLOCK_LAST_RUN_TIME" },
    "maxUnusedPercent": 10
  },
  "checkPolicy": {
    "schedule": { "cron": "0 0 1 * *", "clock": "CLOCK_LAST_RUN_TIME" }
  },
  "commandPrefix": {},
  "forgetPolicy": {
    "schedule": { "disabled": true, "clock": "CLOCK_LAST_RUN_TIME" },
    "retention": {
      "policyTimeBucketed": { "hourly": 24, "daily": 30, "monthly": 12 }
    }
  }
}

generic_backup/templates/restic-backup.service.j2

[Unit]
Description=Restic backup for {{ service_name }}
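# After= only orders the units; Wants= is what actually pulls the target in
Wants=network-online.target
After=network-online.target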

[Service]
Type=oneshot
ExecStart=/usr/local/bin/restic-backup-{{ service_name }}.sh
# Low priority to avoid impacting service performance
CPUSchedulingPolicy=batch
IOSchedulingClass=best-effort
IOSchedulingPriority=7
User=root
Environment=LANG=en_US.UTF-8
Environment=LC_ALL=en_US.UTF-8

generic_backup/templates/restic-backup.sh.j2

#!/bin/bash

set -o pipefail
export LANG=en_US.UTF-8
export LANGUAGE=en_US:en
export LC_ALL=en_US.UTF-8

# Define constants
SERVICE_NAME="{{ service_name }}"
CACHE_DIR="/var/cache/restic/${SERVICE_NAME}"
SENTINEL_FILE="${CACHE_DIR}/last_run_failed"
LOG_FILE="{{ restic_log_dir }}/{{ service_name }}.log"

export RESTIC_PASSWORD_FILE="/etc/restic/${SERVICE_NAME}.pass"
export RESTIC_CACHE_DIR="${CACHE_DIR}"

# --- Create needed directories and files
mkdir -p "${CACHE_DIR}"
mkdir -p "{{ restic_log_dir }}"
[ ! -f "$LOG_FILE" ] && touch "$LOG_FILE" && chmod 0644 "$LOG_FILE"

# --- Helper Functions ---

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

send_alert() {
    local REASON="$1"
    log "ALERT: $REASON"
    if [ ! -f "${SENTINEL_FILE}" ]; then
        echo -e "Subject: Restic Backup Failed: ${SERVICE_NAME}\n\nBackup failed at $(date).\nReason: ${REASON}" | msmtp {{ email_login }}
        touch "${SENTINEL_FILE}"
    fi
}

run_restic() {
    # $1 = command (e.g., backup)
    # $2 = repository (e.g., sftp:user@host:/path or s3:...)
    # ${@:3} = remaining arguments (e.g., source_dir)
    
    local CMD="$1"
    local REPO="$2"
    shift 2

    local CMD_OUT
    log "Running restic $CMD on repo ${REPO}..."
    
    CMD_OUT=$(/usr/local/bin/restic -r "$REPO" "$CMD" "$@" 2>&1)
    local EXIT_CODE=$?
    
    # Exit code 0 is success. "restic backup" returns 3 when some source files
    # could not be read, which still produces a usable (partial) snapshot.
    if [ $EXIT_CODE -eq 0 ] || [ $EXIT_CODE -eq 3 ]; then
        log "Restic $CMD completed (Exit $EXIT_CODE)."
    else
        send_alert "Restic $CMD failed with critical error (Exit $EXIT_CODE): ${CMD_OUT}"
        exit $EXIT_CODE
    fi
}

cleanup() {
    {% if db_name is defined and db_name | length > 0 %}
    # Clean up DB dump
    [ -f "$DUMP_FILE" ] && rm -f "$DUMP_FILE"
    {% endif %}
    
    # Clean up credentials
    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
}

trap cleanup EXIT

# --- Database Logic ---
{% if db_name is defined and db_name | length > 0 %}
DUMP_FILE="{{ source_dir }}/{{ service_name }}_db_dump.sql"
DB_HOST="{{ db_host | default(postgresql_host | default('localhost')) }}"
DB_PORT="{{ db_port | default(postgresql_port | default('5432')) }}"
DB_USER="{{ db_user | default(postgresql_user) }}"
DB_PASS="{{ db_password | default(postgresql_password) }}"

log "Starting database dump for {{ db_name }} to $DUMP_FILE..."
export PGPASSWORD="${DB_PASS}"
# stdout is the dump itself and stderr is appended to the log, so there is
# nothing left to capture in a variable; the alert points at the log instead
{% if db_name == 'all' %}
pg_dumpall -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}" > "$DUMP_FILE" 2>> "$LOG_FILE"
{% else %}
pg_dump -Fc -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}" {{ db_name }} > "$DUMP_FILE" 2>> "$LOG_FILE"
{% endif %}
if [ $? -ne 0 ]; then
    send_alert "Database dump failed; see ${LOG_FILE} for the error output."
    exit 1
fi
log "Database dump created successfully."

log "Validating dump integrity..."
if [ ! -s "$DUMP_FILE" ]; then
    send_alert "Dump file is empty."
    exit 1
fi

{% if db_name == 'all' %}
# Verify the SQL file is complete
if ! tail -n 5 "$DUMP_FILE" | grep -qEi "PostgreSQL.*database.*cluster.*dump.*complete"; then
    log "Debug: Last 5 lines of dump:"
    tail -n 5 "$DUMP_FILE" >> "$LOG_FILE"
    send_alert "Dump integrity check FAILED: 'dump complete' marker missing or malformed."
    exit 1
fi
{% else %}
# Validate binary dump integrity
ERROR_MSG=$(pg_restore -l "$DUMP_FILE" 2>&1 > /dev/null)
if [ $? -ne 0 ]; then
    send_alert "Invalid DB dump file format: ${ERROR_MSG}"
    exit 1
fi
{% endif %}
{% endif %}

# --- Execution ---

# 1. Run Backup (Primary)
log "Starting primary backup of {{ source_dir }}..."
run_restic "backup" "{{ restic_repo }}" "{{ source_dir }}"

# 2. Run Backup (Secondary - Backblaze B2)
# Ensure these environment variables are set either here or in your system environment
export AWS_ACCESS_KEY_ID="{{ backblaze_restic_key_id }}"
export AWS_SECRET_ACCESS_KEY="{{ backblaze_restic_api_key }}"

log "Starting secondary backup to Backblaze..."
# Note: We use a different repository path/variables here
run_restic "backup" "{{ remote_restic_repo }}" "{{ source_dir }}"

# No separate status check is needed here: run_restic sends the alert and
# exits the script itself when the backup fails.

# 3. If we get here, the backup succeeded
rm -f "${SENTINEL_FILE}"

# 4. Maintenance & Integrity
log "Backup successful. Starting maintenance..."

# Maintenance on Primary
run_restic "forget" "{{ restic_repo }}" --keep-last 7 --prune
run_restic "check" "{{ restic_repo }}"

# Maintenance on Secondary (Backblaze)
# Note: Ensure the environment variables above are still set for this!
run_restic "forget" "{{ remote_restic_repo }}" --keep-last 7 --prune
run_restic "check" "{{ remote_restic_repo }}"

log "Backup cycle completed successfully."

generic_backup/templates/restic-backup.timer.j2

[Unit]
Description=Timer for {{ service_name }} backup

[Timer]
# Use the passed variable, default to 03:00:00 if not provided
OnCalendar={{ backup_schedule | default('*-*-* 03:00:00') }}
Persistent=true
# Randomly delay start by up to 15 minutes to stagger traffic
RandomizedDelaySec=900 

[Install]
WantedBy=timers.target

In each role that deploys a service I want backed up, I add this to <rolename>/meta/main.yml:

dependencies:
  - role: generic_backup
    vars:
      service_name: "openhab"
      source_dir: "{{ openhab_home }}"
      backup_schedule: "*-*-* 01:00:00"

If the backup failed, I want to know why it failed and figure out if there is a way I can avoid it in the future. I don’t just want to blindly re-run it, and frankly doing so may destroy the evidence I need to diagnose the original problem.

That’s the same reason I don’t automatically restart stuff.

If I’m in a hurry, I can ssh to a machine and run a script, or bring up Semaphore and run the needed Ansible role(s) to get everything back up and returned to a known good state. Both can be done from my phone, and with Tailscale both can be done from anywhere in the world (ish; I do use geoblocking on my firewall, which I need to disable when I go overseas).

By focusing on diagnosing and mitigating problems as they occur, I’ve kept the number of times I need to do either of the above to about once a quarter, maybe less. It’s just not worth the extra complexity to add a button to the notification when I wouldn’t push the button anyway.

I’m not arguing against the feature overall, it’s just not for me. A convenience that I’d never use is no convenience at all.

Healthchecks is just a convenient way to get an alert when a backup completely failed to run at all. If it ran and failed, I already get an alert.

Just to be clear, while I do have two open ports (port 80 and port 443), the only things on those ports are a few services that I’ve found no other way to make accessible from the internet to my wife’s students (Nextcloud shares) and my in-laws (Vaultwarden). I cannot require my wife to install Tailscale on each of her students’ machines, and asking my in-laws to do anything on the computer is a recipe for disaster.

But if it were not for these, I’d not have any open ports. Yet I would still keep my leased DNS name and use my reverse proxy. Why? Because with DNS from a provider that has an ACME plug-in, I can get a trusted certificate for *.mydomain.com. With that certificate, internal to my LAN, my family can go to https://openhab.mydomain.com and not only does it resolve, but the certificate is trusted, so I don’t have to mess with setting up my own CA. The DNS server is set up so even those DNS requests do not leave the LAN.

I always approach everything as local first, cloud only if required. If I go down the API route, it will be with local LLMs. Only if I exhaust what they can do will I consider a subscription to Claude or one of the others.

One thing you mentioned caught my attention: using LLM agents to document your
HAProxy config and eventually review your playbooks. That’s exactly the problem
I was trying to solve, just from a different starting point.

I’d be happy to write up how the markdown approach works in practice with Claude
as a technical partner. Not as a replacement for what you’ve built, your setup
is far more mature than mine, but as a different angle on the same problem.
Keeping a system legible to someone who wasn’t there when the decisions were
made, including yourself six months later.

The watchdog and AI analysis layer might also be worth a separate post. It’s
not complicated but the combination of structured context, Claude API and
actionable notifications took a few iterations to get right. Given where you
want to go with your playbooks I think there’s something in there that
translates.

Would either of those be useful?

I’m sure many would find such writeups useful, including myself. To keep it relevant one could focus it on analyzing an openHAB config.

I do know the overall setup is generally the same, with an ARCHITECTURE.md file explaining how everything is laid out and how it’s supposed to work, and README.md files in the various folders to help guide the AI on the purpose of everything. I just haven’t got past the research phase of things.

Analyzing and documenting HAProxy and such will probably be relatively easy. OPNsense exports its entire configuration as one big XML file, and I think the AdGuard Home config is in a single file also; if not, it’s a small number of files, so it just needs to analyze a few files in a flat directory structure. I could probably just paste them into a chat to analyze. But my Ansible config has dozens of folders and tens of thousands of lines of YAML. I need something local to process that.

On the “something local” problem: the part that makes this tractable actually
has nothing to do with the tool you pick. Your role structure already breaks
down the context problem. You document one role per session, not the whole
repo at once. That holds regardless of what you run it through.

I looked into Aider because its repo-map feature is genuinely interesting for
large codebases. The map works by extracting symbol definitions, functions,
classes, call signatures, using tree-sitter. For Python or Ruby that’s
powerful. For YAML, which is data structure rather than code with named
symbols, the benefit is significantly reduced. I was going to lead with it
but that would have been misleading for your specific case.

The more honest approach for Ansible documentation is a script that iterates
through your roles directory, reads the YAML per role, sends it to an LLM
with a documentation instruction, and writes a README.md back into each role
folder. Fully controlled, repeatable, no third-party tool sitting between you
and the output. The upfront cost is writing the script, but you’re in control
of exactly what goes in and what comes out.
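
A minimal Python sketch of that loop, under stated assumptions: the function names, the prompt wording, and the `ask_llm` callable are placeholders made up for illustration, and `ask_llm` stands in for whatever client ends up being used (a cloud API or a local Ollama instance). Files that are entirely Ansible Vault ciphertext are skipped, since they start with the `$ANSIBLE_VAULT` header and carry nothing worth documenting.

```python
from pathlib import Path
from typing import Callable

# Hypothetical instruction text; adjust to taste.
DOC_PROMPT = (
    "Document this Ansible role as a README.md: purpose, variables, "
    "tasks in order, and dependencies.\n\n"
)


def collect_role_yaml(role_dir: Path) -> str:
    """Concatenate every YAML file in one role, each under a path header."""
    parts = []
    for path in sorted(role_dir.rglob("*")):
        if not path.is_file() or path.suffix not in {".yml", ".yaml"}:
            continue
        text = path.read_text(encoding="utf-8")
        if text.startswith("$ANSIBLE_VAULT"):
            continue  # whole-file vault ciphertext, nothing to document
        parts.append(f"### {path.relative_to(role_dir).as_posix()}\n{text}")
    return "\n\n".join(parts)


def document_roles(roles_root: Path, ask_llm: Callable[[str], str]) -> list[str]:
    """Write a README.md into each role folder; return the roles processed."""
    done = []
    for role_dir in sorted(p for p in roles_root.iterdir() if p.is_dir()):
        yaml_text = collect_role_yaml(role_dir)
        if not yaml_text:
            continue  # no YAML found in this role
        readme = ask_llm(f"{DOC_PROMPT}Role: {role_dir.name}\n\n{yaml_text}")
        (role_dir / "README.md").write_text(readme, encoding="utf-8")
        done.append(role_dir.name)
    return done
```

Passing the model call in as a function keeps the walk-and-write logic identical whichever backend sits behind it, and makes it easy to dry-run with a stub before spending any API calls.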

Two paths from there.

If the Claude API is acceptable, Ansible Vault already protects your secrets.
What goes to the API is structure and logic. Your network topology and service
names do travel along for the ride, worth being aware of.

If you want truly offline, Ollama is a real option for documentation generation
specifically. The quality gap versus a cloud model exists but generating
README files from structured YAML is a different task than complex reasoning.
What hardware you can dedicate to it on Proxmox is a better guide than
anything I can tell you from here.

The honest unknown is how much review the output needs. That depends on model
quality and how consistent your conventions are across twelve years of roles,
and that varies more than any tool can predict upfront.

My laptop has the best GPU in the house. My Proxmox server is too old to support GPU passthrough and I’m hesitant to install an AI stack on Proxmox itself, though I may do so at some point. It’s a really old NVIDIA GPU though; the one on my Framework 16 is probably better anyway.

What I’ve done is use Open WebUI to pass the prompt to Ollama running on my laptop when it’s present. When the laptop isn’t on the network, stuff the family expects to happen immediately (e.g. parsing Mealie recipes) falls back to ChatGPT (mainly because it was the first one I played with). Paperless-NGX jobs will just wait. It’s not ideal but I’m not in a position to get better hardware right now.
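
The routing rule described above can be sketched in a few lines of Python. This is purely illustrative and is not how Open WebUI is configured: the function names and the backend labels are made up, and the only outside fact relied on is that Ollama listens on port 11434 by default.

```python
import socket

OLLAMA_PORT = 11434  # Ollama's default listening port


def host_reachable(host: str, port: int = OLLAMA_PORT, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend(laptop_online: bool, urgent: bool) -> str:
    """Local first; cloud only for jobs that cannot wait; everything else queues."""
    if laptop_online:
        return "ollama"
    return "cloud" if urgent else "queue"
```

In use, a recipe import would call `choose_backend(host_reachable("laptop.lan"), urgent=True)` (the hostname is a placeholder), while a document-tagging job would pass `urgent=False` and simply wait for the laptop to come back.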

I haven’t gone down the path of agents yet though.

But the big thing I’m hoping to get some documentation for is the overall structure of the homelab. Analyzing each Ansible role in isolation isn’t actually super useful for me. But there is an overall structure to the playbooks.

For example, inventory.yml defines which roles are running on which hosts. buildAll.yml is set up to run all the roles based on the inventory. And each is tagged so they can easily be run individually. So I can run each task individually, all the tasks for a single host, a set of tasks on a single host, or a combo.

And “DRY” certainly does not apply here. I’ve lots of copy pasta since most roles need to do the same things: create the user, create the data folders, pull the image, start the container, and deploy the backup. Sometimes there’s a little extra DB stuff to do.

My hope would be that the AI could say “hey, you can consolidate this stuff into a separate role since you follow the same pattern over and over.” I know I could do this on my own, but with nearly 30 roles, if an agent could do it I would be more willing to take the plunge. Right now it’s a little overwhelming thinking about doing it by hand.

Because I almost always start a new role by copy/pasting from other roles, my roles are actually remarkably consistent. So the AI shouldn’t have trouble with that. And I know I can get to a good state on my own. But I also want to learn me some AI for great good, and if it can save me some effort in the long run, so much the better.

So far all the models I’ve experimented with, local and cloud based, have been very knowledgeable about Ansible. While they’ve led me astray by making up crap about volumes to mount to containers or environment variables to pass into a container, they’ve yet to lead me astray on Ansible related stuff.

That’s a fair correction and it changes the picture. Analyzing each role in
isolation misses the point entirely if what you’re after is the overall
structure and the consolidation opportunities.

The consistency you describe is actually what makes this tractable in one pass.
The task logic lives in tasks/main.yml in each role. Not the templates, not
the vars, the actual pattern. If each of those is around 100 to 200 lines,
thirty roles lands somewhere around 4000 to 5000 lines of YAML total. That
fits in a single API call to ChatGPT or Claude with room to spare, with
inventory.yml and buildAll.yml added at the top to give the architectural
context.
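
That estimate is easy to check against the real repo before committing to a single call. A small sketch, assuming the usual `<role>/tasks/main.yml` layout (the function name is made up for illustration):

```python
from pathlib import Path


def task_line_counts(roles_root: Path) -> dict[str, int]:
    """Map role name to the line count of its tasks/main.yml.

    Roles without a tasks/main.yml are left out of the result.
    """
    counts = {}
    for role_dir in sorted(p for p in roles_root.iterdir() if p.is_dir()):
        main = role_dir / "tasks" / "main.yml"
        if main.is_file():
            counts[role_dir.name] = len(main.read_text(encoding="utf-8").splitlines())
    return counts
```

Summing the values gives the real total, and sorting by count shows which roles are outliers worth a closer look before they go into the prompt.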

A short Python script that walks the roles directory, reads tasks/main.yml
from each role with the role name as a header, prepends the inventory and
build files, and sends it in one shot with a specific question: identify the
repeated patterns, flag what could be a shared role. Given that your roles
all come from the same template you’ve been copying for years, the signal
should be clean enough for the model to work with.
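
As a sketch of the prompt-building half of that script: the file names inventory.yml and buildAll.yml come from this thread, but their location at the repo root, the `roles/` directory name, the question wording, and the function name are all assumptions. The actual API call is deliberately left out, since it depends on which provider is used.

```python
from pathlib import Path

# Hypothetical wording of the one-shot question.
QUESTION = (
    "Identify the task patterns repeated across these Ansible roles and "
    "flag what could be extracted into a shared role."
)


def build_prompt(
    repo_root: Path,
    context_files: tuple[str, ...] = ("inventory.yml", "buildAll.yml"),
    roles_dirname: str = "roles",
) -> str:
    """Assemble one prompt: question, architecture files, then each role's tasks."""
    sections = []
    # Architectural context goes first so the model sees the big picture.
    for name in context_files:
        path = repo_root / name
        if path.is_file():
            sections.append(f"===== {name} =====\n{path.read_text(encoding='utf-8')}")
    # Then every role's tasks/main.yml under a header naming the role.
    roles_root = repo_root / roles_dirname
    for role_dir in sorted(p for p in roles_root.iterdir() if p.is_dir()):
        main = role_dir / "tasks" / "main.yml"
        if main.is_file():
            sections.append(
                f"===== role: {role_dir.name} =====\n{main.read_text(encoding='utf-8')}"
            )
    return QUESTION + "\n\n" + "\n\n".join(sections)
```

Writing the returned string to a file first is a cheap way to eyeball exactly what would leave the LAN before any call is made.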

The honest constraint is that this is the one case where I’d steer away from
Ollama on the laptop. Local models typically have smaller context windows than
the cloud equivalents and you don’t want the analysis to quietly truncate
halfway through role twenty. ChatGPT already handles your fallback so no new
accounts needed.

One thing worth keeping in mind based on your own observation: these models
are solid on Ansible logic but drift on container specifics, volumes and
environment variables in particular. Any consolidation suggestions that touch
those details are worth reviewing carefully before you act on them.

The script itself is maybe thirty lines. The analysis is one call. The
implementation is still yours to do but at least you go in knowing exactly
what the pattern is and where it repeats.