Bot Filesystem Provisioning: Strategy Analysis

Current Approach: Full `rsync` Copy

How it works today (see chart/openclaw/container/init.sh + statefulset.yaml):

Init container runs rsync -axH / /main_container/ on first boot -> copies the entire Debian 13 rootfs (~2.7 GB) to PVC.
On subsequent boots, init container only updates managed scripts (bot-wrapper.sh, bot-install.sh) and systemd units.
Main container mounts PVC subdirectories (/usr, /var, /home, /root, /opt, /etc) via subPath.
Container image itself is pulled but largely unused after initial rsync – it’s just a “source” of the rootfs.
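
The split between first boot and later boots can be sketched as a small shell function. This is a minimal sketch, not the contents of the real init.sh: the marker file `.provisioned`, the function names, the `cp -a` fallback, and the location of `bot-install.sh` are assumptions for illustration. Only the `rsync -axH` flags and the script names come from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of the init container's copy logic.

# copy_tree SRC DST: copy the contents of SRC into DST.
copy_tree() {
    if command -v rsync >/dev/null 2>&1; then
        # -a archive, -x stay on one filesystem, -H preserve hard links
        rsync -axH "$1/" "$2/"
    else
        # Fallback for environments without rsync (assumption).
        cp -a "$1/." "$2/"
    fi
}

# provision SRC DST: full copy on first boot, managed scripts only afterwards.
provision() {
    src=$1 dst=$2
    marker="$dst/.provisioned"    # hypothetical marker file
    if [ ! -f "$marker" ]; then
        copy_tree "$src" "$dst" && touch "$marker"
        echo "first-boot"
    else
        for script in bot-wrapper.sh bot-install.sh; do
            if [ -f "$src/usr/local/bin/$script" ]; then
                mkdir -p "$dst/usr/local/bin"
                cp -p "$src/usr/local/bin/$script" "$dst/usr/local/bin/$script"
            fi
        done
        echo "refresh"
    fi
}
```

In the real init container the call would presumably be `provision / /main_container`; user-modified files are left alone on every boot after the first.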

Storage cost per bot: ~2.7 GB base + user data.

| ✅ Pros | ❌ Cons |
| --- | --- |
| Simple to understand and debug | 2.7 GB × N bots = massive storage waste |
| Full VM-like experience – user controls everything | ZFS deduplication needed externally (CPU/RAM hungry) |
| `apt install` works perfectly | Slow first boot (rsync of 2.7 GB) |
| User can modify any system file | No clean way to update base OS across all bots |
| Init container update works for our scripts | Users who `apt install` diverge from base -> can’t safely rebase |

Alternative 1: OverlayFS Inside the Pod

Concept: Use Linux OverlayFS inside the container to stack a read-only base layer (from the container image or a shared ReadOnlyMany PVC) with a per-bot writable layer (per-bot PVC).

┌────────────────────────────────────────┐
│          Merged View (what user sees)  │
├────────────────────────────────────────┤
│  upperdir (per-bot PVC: ~200 MB)      │  <- user changes only
│  workdir  (per-bot PVC)               │
├────────────────────────────────────────┤
│  lowerdir (read-only base image)      │  <- 2.7 GB shared across all bots
└────────────────────────────────────────┘

Implementation:

Init container runs mount -t overlay overlay -o lowerdir=/image,upperdir=/data/upper,workdir=/data/work /merged.
Main container uses /merged as the rootfs.
The per-bot PVC only stores diffs (created/modified/deleted files).
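
The mount step can be sketched as follows. This is an illustration under assumptions: the function names and the `DRY_RUN` switch are hypothetical, and the sketch only prints the command by default because a real overlay mount needs `CAP_SYS_ADMIN`. The paths are the ones used in the step above.

```shell
#!/bin/sh
# overlay_opts LOWER UPPER WORK: build the option string for an overlay mount.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

# mount_overlay LOWER UPPER WORK TARGET: prepare directories and mount.
# With DRY_RUN=1 (the default here) the mount command is only printed.
mount_overlay() {
    opts=$(overlay_opts "$1" "$2" "$3")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "mount -t overlay overlay -o $opts $4"
    else
        # upperdir and workdir must live on the same filesystem.
        mkdir -p "$2" "$3" "$4"
        mount -t overlay overlay -o "$opts" "$4"
    fi
}

mount_overlay /image /data/upper /data/work /merged
```

Keeping `upperdir` and `workdir` on the same per-bot PVC satisfies the same-filesystem requirement that OverlayFS places on those two directories.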

| ✅ Pros | ❌ Cons |
| --- | --- |
| Dramatic storage savings: per-bot PVC is ~50-500 MB instead of 2.7 GB | Requires `privileged: true` or `SYS_ADMIN` cap in init container for `mount` – conflicts with User Namespaces (`hostUsers: false`) which is your security model |
| Fast first boot – no rsync needed | Mounting OverlayFS from inside a user namespace requires kernel ≥5.11 (earlier kernels only with distro-specific patches); k3s may not support it out of the box |
| Clean base update: swap lowerdir image -> all bots see new OS files | `apt` metadata inconsistency: user installs `vim`, you update base -> dpkg DB shows conflicting state (old lowerdir has package DB without vim, upperdir has dpkg fragments) |
| User still has full VM-like experience | Whiteout files (`.wh.*`) grow over time, especially after `apt upgrade`; can be confusing |
| ZFS dedup no longer needed for base files | Testing/debugging overlay issues in k8s is painful |
| Natural deduplication at filesystem level | If user does `apt upgrade`, the entire upgraded file tree goes into upperdir, losing dedup benefit |

Verdict: ⚠️ The hostUsers: false (User Namespaces) security model is a hard blocker for OverlayFS mount syscall inside the container on most kernels. You’d need to either drop User Namespaces (bad for security) or run with SYS_ADMIN capability (partially defeats isolation). Not recommended unless you abandon User Namespaces.
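
A node's kernel version is a necessary condition for overlay mounts inside a user namespace, though not a sufficient one. The sketch below checks only that condition; the function name is hypothetical and the 5.11 threshold is the one quoted in the table above.

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR: succeed if RELEASE (as printed by
# `uname -r`, e.g. "6.1.0-18-amd64") is at least MAJOR.MINOR.
kernel_at_least() {
    release=$1
    major=${release%%.*}      # text before the first dot
    rest=${release#*.}        # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "${minor:-0}" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 5 11; then
    echo "kernel version allows overlay mounts inside a user namespace"
else
    echo "kernel too old for overlay mounts inside a user namespace"
fi
```

Passing this check says nothing about whether the container runtime grants the capabilities the mount needs, so it is best treated as an early filter rather than a go-ahead.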

Alternative 2: ZFS Clone From Snapshot (Infrastructure-Level)

Concept: Instead of creating an empty PVC and rsyncing into it, use ZFS to create a clone from a pre-populated snapshot of a “golden” bot volume.

zfs snapshot tank/bot-golden@v0.2.2
zfs clone tank/bot-golden@v0.2.2 tank/bot-u1000b1-data   # instant, ~0 bytes initially

Implementation:

Pre-provision a “golden” ZFS dataset with the full bot rootfs at a specific version.
Snapshot it: zfs snapshot tank/bot-golden@v0.2.2.
When a new bot PVC is created, use a CSI driver (e.g., OpenEBS ZFS LocalPV) to create the PVC from the snapshot.
Block-level CoW: only blocks that the user modifies are duplicated.
Version update: create new snapshot @v0.2.3, clone it, diff-merge user’s changes.
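
Creating the PVC from the snapshot (step 3) could look like a claim that names a `VolumeSnapshot` as its `dataSource`. The sketch below renders such a manifest from shell. The PVC name, snapshot name, and size are hypothetical, and it assumes a `VolumeSnapshot` object for the golden volume already exists; `zfs-raid0` is the storage class this document mentions.

```shell
#!/bin/sh
# render_clone_pvc PVC_NAME SNAPSHOT_NAME SIZE: print a PVC manifest that is
# pre-populated from an existing VolumeSnapshot.
render_clone_pvc() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: zfs-raid0
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: $2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}

# Hypothetical names and size; pipe the output to `kubectl apply -f -`.
render_clone_pvc data-bot-u1000b1 bot-golden-v0-2-2 10Gi
```

With a StatefulSet, the equivalent `dataSource` stanza would go into `volumeClaimTemplates` instead of a standalone claim.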

| ✅ Pros | ❌ Cons |
| --- | --- |
| Instant provisioning – clone is O(1) | Requires OpenEBS ZFS CSI driver or custom provisioner |
| Zero additional storage at creation – true CoW at block level | All bots are pinned to the parent snapshot; can’t delete old snapshots until all clones are destroyed or promoted |
| No ZFS dedup cron job needed – dedup is inherent via CoW | Version update is complex: need to `zfs promote`, `zfs send` … |
| Full VM-like experience preserved | `apt install` works but fragments CoW – user-modified blocks become unique |
| No kernel version or User Namespace conflict | Requires ZFS-specific PVC provisioner; less portable |
| Block-level efficiency is better than file-level overlay | Snapshot chain grows; need periodic “squash” operations |
| User can still modify any file | Backup/restore needs ZFS-aware tooling |

Verdict: ✅ Strong candidate. You already use ZFS (storageClass: zfs-raid0). This is the most natural fit for your infrastructure. The OpenEBS ZFS LocalPV CSI driver supports snapshot-based volume creation. The init container becomes trivial: it only needs to update managed scripts on existing volumes, and for new volumes, the ZFS clone is already pre-populated.

Alternative 3: Kubernetes Image Volumes (KEP-4639)

Concept: Kubernetes v1.31+ (alpha) / v1.33+ (beta) supports image volume sources – mount an OCI image as a read-only volume directly. Combine with a per-bot writable PVC.

volumes:
  - name: base-os
    image:
      reference: git.kabakaev.com/runabot/bot:0.2.2
      pullPolicy: IfNotPresent
  - name: user-data
    persistentVolumeClaim:
      claimName: data

Implementation:

Base OS files are served read-only from the container image itself (no rsync, no PVC copy).
Per-bot PVC stores only /home, /root, /var/lib/dpkg, /etc diffs.
Init container merges: copies specific config files from image volume to user PVC if missing.
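
The "copy if missing" merge in the last step can be sketched as below. The function name and the example paths are hypothetical. The sketch handles regular files only (symlinks and filenames containing newlines are out of scope) and never overwrites a file the user already has.

```shell
#!/bin/sh
# seed_missing SRC DST: copy every regular file under SRC to the same relative
# path under DST, but only where DST does not already have that path.
seed_missing() {
    src=$1 dst=$2
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        rel=${rel#./}
        if [ ! -e "$dst/$rel" ]; then
            mkdir -p "$(dirname "$dst/$rel")"
            cp -p "$src/$rel" "$dst/$rel"
            echo "seeded $rel"
        fi
    done
}
```

A call such as `seed_missing /base-os/etc /data/etc` (both paths are assumptions) would fill in config files that appeared in a newer image while leaving user edits untouched.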

| ✅ Pros | ❌ Cons |
| --- | --- |
| Zero storage duplication – image layers are shared by containerd | Alpha in k3s 1.31, beta in 1.33; your cluster runs k3s 1.35, so beta should be available |
| Elegant Kubernetes-native solution | Image volumes are read-only – user can’t `apt install` system packages directly (they’d go to a separate writable layer) |
| Base update = new image tag, pods restart | Not truly a “VM-like experience” – user sees split between RO base and RW overlay |
| No ZFS-specific tooling needed | Requires `ImageVolume` feature gate; may not be stable enough for production |
| Fast pod startup | More complex volume mount setup in StatefulSet |
|  | Users who do `apt install` would need a writable bind-mount overlaid on /usr – back to OverlayFS problem |

Verdict: 🟡 Promising for the future but breaks the “full VM-like experience” unless combined with OverlayFS (which has the User Namespace problem). Best suited if you’re willing to constrain what users can do (no apt install, use Homebrew/nix/pip instead).

Alternative 4: Selective Sync + Shared Read-Only Base PVC

Concept: Instead of copying the whole rootfs, only copy user-mutable directories (/home, /root, /var, /etc) to the per-bot PVC. Mount the rest (/usr, /opt, /lib) directly from the container image (read-only).

# statefulset.yaml changes:
containers:
  - name: bot
    volumeMounts:
      # Read-write from PVC (user-owned)
      - name: data
        mountPath: /home
        subPath: home
      - name: data
        mountPath: /root
        subPath: root
      - name: data
        mountPath: /var
        subPath: var
      - name: data
        mountPath: /etc
        subPath: etc
      # /usr, /opt, /lib come from the image itself (read-only)

Implementation:

Init container only copies /home, /root, /var, /etc if PVC is new (~200-500 MB).
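/usr, /opt, /lib, /bin, /sbin come directly from the container image layers (shared by containerd). They are only truly read-only if the container sets securityContext.readOnlyRootFilesystem: true; without it, writes land in the container’s ephemeral writable layer and are lost on restart.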
User’s homebrew installs go to /home/linuxbrew (on PVC ✅).
Node.js, npm are in /usr (from image, read-only).
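
The first-boot copy for this layout can be sketched as a loop over the mutable directories only. The function and variable names are hypothetical, and the sketch treats any directory that already exists on the PVC as provisioned, which would not detect an interrupted copy.

```shell
#!/bin/sh
# Directories that live on the per-bot PVC in the selective layout.
MUTABLE_DIRS="home root var etc"

# selective_sync IMAGE_ROOT PVC_ROOT: on a new PVC, copy only the mutable
# directories; skip any directory that already exists on the PVC.
selective_sync() {
    image_root=$1 pvc_root=$2
    for dir in $MUTABLE_DIRS; do
        if [ -d "$pvc_root/$dir" ]; then
            echo "keep $dir"
        elif [ -d "$image_root/$dir" ]; then
            mkdir -p "$pvc_root/$dir"
            cp -a "$image_root/$dir/." "$pvc_root/$dir/"
            echo "copy $dir"
        fi
    done
}
```

Because /usr, /opt, and /lib never appear in `MUTABLE_DIRS`, nothing from them reaches the PVC, which is where the storage saving comes from.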

| ✅ Pros | ❌ Cons |
| --- | --- |
| ~80% storage reduction: PVC only needs ~500 MB base instead of 2.7 GB | `apt install` is broken – `/usr` is read-only from the image |
| Fast initial sync (~10s instead of ~60s) | User can’t modify system packages or libraries |
| No special kernel features needed | Less “VM-like” – user notices they can’t write to `/usr` |
| Base update is automatic with new image tag | Homebrew works (lives in /home), pip works, npm works, but system packages don’t |
| Works with User Namespaces (no privilege escalation) | `/var/lib/dpkg` is on PVC but `/usr` is RO -> dpkg state mismatch |
| Simple to implement – just change `statefulset.yaml` and `init.sh` | Some tools expect a writable `/opt`, which is on the PVC today but would come read-only from the image under this layout |

Verdict: ✅ Practical middle ground. This is the easiest to implement and cuts per-bot storage by roughly 80%. The tradeoff is that apt install won’t work for system packages, but brew install, pip install, and npm install all work fine since they target user-writable paths. For your target audience of “very inexperienced users,” they won’t be using apt anyway.

Alternative 5: Nix/Guix-Style Immutable Base + User Profile

Concept: Replace the Debian base with an immutable OS (like NixOS or a read-only Debian). All user-desired packages are installed via Nix or Homebrew into the user’s profile directory on the PVC.

| ✅ Pros | ❌ Cons |
| --- | --- |
| Reproducible, immutable base | Steep learning curve for users (nix, not apt) |
| Per-user package installs are completely isolated | Requires significant Containerfile rework |
| Trivial base updates | Nix store can grow large (~1 GB+ for common packages) |
| Good dedup with Nix store | Very different UX from what users expect |

Verdict: ❌ Too radical for your user base and timeline. Adds complexity without proportional benefit for your use case.

⚠️ The Clone Drift Problem (Why ZFS Clone Fails Long-Term)

ZFS clones share blocks with their parent snapshot only – sibling clones do NOT share blocks with each other. This creates a hidden O(N × time) storage tax:

Month 0:  Golden Snapshot @v0.2.2 (2.7 GB)
          ├── Bot A clone: +0 MB (all blocks shared with parent) ✅
          ├── Bot B clone: +0 MB
          └── Bot C clone: +0 MB
          Total: ~2.7 GB

Month 3:  Golden Snapshot @v0.2.2 (2.7 GB, frozen -- never apt-upgraded)
          ├── Bot A: user ran apt upgrade -> glibc, nodejs, clang updated -> +800 MB unique
          ├── Bot B: same apt upgrade -> +800 MB unique (NOT shared with Bot A!)
          └── Bot C: same apt upgrade -> +800 MB unique
          Total: 2.7 GB + 2.4 GB = 5.1 GB ❌

Month 6:  Golden Snapshot @v0.2.2 (2.7 GB, even more stale)
          ├── Bot A: +1.6 GB diverged from base
          ├── Bot B: +1.6 GB diverged (identical to A, but NOT shared!)
          └── Bot C: +1.6 GB diverged
          Total: 2.7 GB + 4.8 GB = 7.5 GB ❌❌ WORSE than rsync + rmlint dedup!
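
The totals in the timeline are the golden snapshot plus one private copy of the drifted blocks per clone. A minimal sketch of that arithmetic, using a hypothetical helper and the figures from the timeline in MB:

```shell
#!/bin/sh
# clone_total_mb BASE_MB BOTS DRIFT_MB: golden snapshot plus one private copy
# of the drifted blocks per clone (siblings share nothing with each other).
clone_total_mb() {
    echo $(( $1 + $2 * $3 ))
}

echo "month 3, 3 bots:   $(clone_total_mb 2700 3 800) MB"
echo "month 6, 3 bots:   $(clone_total_mb 2700 3 1600) MB"
echo "month 6, 100 bots: $(clone_total_mb 2700 100 1600) MB"
```

The drift term scales with the number of bots, so at 100 bots the shared 2.7 GB base is a rounding error next to roughly 160 GB of per-bot copies.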

The critical insight: ZFS clone CoW ≠ ZFS block dedup.

Clone CoW: dedup against the parent snapshot only. Siblings are strangers.
ZFS block dedup: dedup across everything, but costs ~320 bytes RAM per block (~5 GB RAM per 1 TB data, assuming a ~64 KB average block size). Impractical at scale.
rmlint/file-level dedup: what you do now. Better cross-sibling coverage, but periodic.
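
The RAM estimate for block dedup is a simple product: number of blocks times bytes per table entry. The sketch below reproduces it with a hypothetical helper, taking the ~320 bytes per block figure from above and assuming the average block size as an input.

```shell
#!/bin/sh
# ddt_ram_bytes DATA_BYTES AVG_BLOCK_BYTES: estimated dedup-table RAM,
# using the ~320 bytes per block figure quoted above.
ddt_ram_bytes() {
    echo $(( $1 / $2 * 320 ))
}

TIB=$(( 1024 * 1024 * 1024 * 1024 ))
MIB=$(( 1024 * 1024 ))

# 1 TiB of data at an assumed 64 KiB average block size.
echo "64 KiB blocks:  $(( $(ddt_ram_bytes "$TIB" 65536) / MIB )) MiB"
# The same data at 128 KiB blocks needs half as many table entries.
echo "128 KiB blocks: $(( $(ddt_ram_bytes "$TIB" 131072) / MIB )) MiB"
```

The estimate is inversely proportional to block size, so a rootfs full of small files sits toward the expensive end of the range.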

Once the golden snapshot falls behind package updates, every bot that runs apt upgrade stores its own unique copy of the same updated packages. After 6 months the clone approach is worse than the current rsync + rmlint approach because you’re paying for the stale snapshot blocks plus all the diverged per-bot blocks, none of which are shared between siblings.

You could periodically rebase (create new golden snapshot, migrate clones), but this is operationally complex and error-prone with user data mixed in.

Recommendation Matrix

| Approach | Day 1 Storage | Month 6 Storage | Year 2 Storage | `apt install` | User Namespaces | Recommended? |
| --- | --- | --- | --- | --- | --- | --- |
| Current (rsync + rmlint) | 2.7 GB/bot | 2.7 GB/bot (deduped) | ~2.7 GB/bot | ✅ Full | ✅ Preserved | Baseline |
| Alt 1 (OverlayFS in pod) | ~200 MB | ~1 GB (drift) | ~2 GB+ | ✅ Full | ❌ Broken | ❌ |
| Alt 2 (ZFS Clone) | ~0 MB | ~1.6 GB (drift!) | ~2.7 GB (fully diverged) | ✅ Full | ✅ Preserved | ❌ Trap! |
| Alt 3 (K8s Image Volumes) | ~0 MB | ~0 MB (base shared) | ~0 MB | ❌ RO only | ✅ Preserved | 🟡 Future |
| Alt 4 (Selective Sync) | ~500 MB | ~600 MB | ~800 MB | ❌ `/usr` RO | ✅ Preserved | ✅✅ Best |
| Alt 5 (Nix/Guix) | ~1 GB | ~1.5 GB | ~2 GB | ❌ Nix only | ✅ Preserved | ❌ |

Recommended Path

The Answer: Alternative 4 – Selective Sync (Immutable Base)

Selective Sync is not just a “quick win” – it is the correct long-term architecture:

containerd already deduplicates image layers via its content-addressable store. If 100 bots use runabot/bot:0.2.3, the 2.7 GB of /usr, /lib, /bin exists once per node as shared OCI layers, not once per bot. This is free, requires zero ZFS trickery, and scales to thousands of bots.
Base OS updates become trivial: bump the image tag in the Helm chart -> all bots get the new glibc/nodejs/clang on next restart. No rebase, no clone migration, no dpkg conflicts.
Per-bot PVC shrinks to ~200-500 MB (only /home, /root, /var, /etc). At this size, even without dedup, 100 bots = ~50 GB instead of ~270 GB.
apt install being broken is actually a feature for your target audience:
- Inexperienced users won’t use apt – and a clear error is better than silent divergence.
- brew install already works (lives in /home/linuxbrew, on PVC ✅).
- pip install, npm install, openclaw plugins install all work.
- Power users get packages via bot addons (Helm-managed, versioned, controlled by you).
Security improves: read-only /usr means a compromised bot can’t trojanize system binaries.
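
The fleet-level numbers above follow from a per-bot multiplication. A small sketch with hypothetical helper names, using decimal units and the per-bot sizes quoted in this section:

```shell
#!/bin/sh
# fleet_gb BOTS PER_BOT_MB: total PVC storage in decimal GB.
fleet_gb() {
    echo $(( $1 * $2 / 1000 ))
}

# saving_percent OLD_MB NEW_MB: reduction relative to OLD_MB, rounded down.
saving_percent() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

echo "selective sync: $(fleet_gb 100 500) GB"
echo "full rsync:     $(fleet_gb 100 2700) GB"
echo "reduction:      $(saving_percent 2700 500)%"
```

This counts PVC storage only; the image layers add a further 2.7 GB per node regardless of how many bots that node hosts.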

Implementation:

Modify init.sh to only copy /home, /root, /var, /etc on first boot (~10s).
Modify statefulset.yaml to stop mounting /usr and /opt from PVC.
/usr, /opt, /lib, /bin, /sbin come from the container image (read-only, auto-shared).
Risk: low. Rollback: revert to full rsync.

Migration Path for Existing Bots

For bots that already have a full rsync volume:

# In init.sh, detect old-style volume and migrate:
if [ -f "/main_container/usr/local/bin/bot-wrapper.sh" ] && [ ! -f "/main_container/.migrated-v2" ]; then
    echo "Migrating to selective sync layout..."
    # Save list of user-installed apt packages for reference
    dpkg --root=/main_container --get-selections > /main_container/home/.user-packages.txt 2>/dev/null || true
    # Mark as migrated (old /usr will simply be ignored)
    touch /main_container/.migrated-v2
fi
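
The saved selections file lists every package on the old volume, not only the ones the user added. To isolate the user's additions, it can be compared against the same dump taken from the base image. The sketch below assumes two such files exist; the function names and file names are hypothetical.

```shell
#!/bin/sh
# installed_names FILE: package names marked "install" in a
# `dpkg --get-selections` dump, sorted for use with comm(1).
installed_names() {
    awk '$2 == "install" { print $1 }' "$1" | LC_ALL=C sort -u
}

# user_added BASE_FILE USER_FILE: names present in USER_FILE but not BASE_FILE.
user_added() {
    base_tmp=$(mktemp)
    user_tmp=$(mktemp)
    installed_names "$1" > "$base_tmp"
    installed_names "$2" > "$user_tmp"
    # -1 hides lines unique to the base list, -3 hides lines common to both.
    LC_ALL=C comm -13 "$base_tmp" "$user_tmp"
    rm -f "$base_tmp" "$user_tmp"
}
```

A call like `user_added base-packages.txt /main_container/home/.user-packages.txt` would then print one package name per line, which is a reasonable starting list for deciding what to offer as addons.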

Long-term Enhancement: K8s Image Volumes (Alt 3)

When k3s matures ImageVolume support (stable by k8s ~1.37+), consider using it to mount the base OS even more elegantly.
Compatible with Selective Sync – it’s the same philosophy (immutable base from OCI image).

The `apt` Database Consistency Problem

Problem: If you use any overlay/CoW approach and update the base layer, the dpkg database (/var/lib/dpkg/) on the user’s writable layer reflects packages the user installed, but the files for those packages may have been replaced/removed in the new base layer.

Solutions by approach:

ZFS Clone (Alt 2): No dpkg conflict per se – each clone has its own dpkg DB. But clone drift means you pay double for every updated package.
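Selective Sync (Alt 4): apt install fails (RO /usr), so users cannot introduce conflicts. ✅ The dpkg DB under /var/lib/dpkg lives on the PVC, however, so it can go stale after an image tag bump unless init.sh refreshes it from the new image.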
OverlayFS (Alt 1): Dpkg DB is in upperdir, files in lowerdir. After base swap: dpkg --audit would report issues. Fragile.

Pragmatic answer: For your target audience (“very inexperienced users”), apt install should not be the primary package management story. Point users to brew install (already set up in bot-wrapper.sh) or provide an “addon” system for common tools.
