Bot Filesystem Provisioning: Strategy Analysis#
Current Approach: Full rsync Copy#
How it works today (see chart/openclaw/container/init.sh + statefulset.yaml):
- Init container runs `rsync -axH / /main_container/` on first boot, which copies the entire Debian 13 rootfs (~2.7 GB) to the PVC.
- On subsequent boots, the init container only updates managed scripts (`bot-wrapper.sh`, `bot-install.sh`) and systemd units.
- Main container mounts PVC subdirectories (`/usr`, `/var`, `/home`, `/root`, `/opt`, `/etc`) via `subPath`.
- The container image itself is pulled but largely unused after the initial rsync; it is just a “source” of the rootfs.
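The first-boot versus later-boot split described above could be sketched as follows. The marker file name `.rootfs-provisioned` and the function name are assumptions for illustration; the real `init.sh` may detect a fresh PVC differently.

```sh
# Hypothetical marker file written after the first successful copy.
MARKER=".rootfs-provisioned"

# Print which action the init container should take for the PVC root in $1.
plan_boot_action() (
  pvc_root="$1"
  if [ -f "$pvc_root/$MARKER" ]; then
    echo "update-managed-files"   # later boots: refresh scripts and systemd units only
  else
    echo "full-rsync"             # first boot: copy the whole rootfs
  fi
)

# How the result might be applied (shown as comments, not executed here):
#   full-rsync           -> rsync -axH / /main_container/ && touch /main_container/$MARKER
#   update-managed-files -> copy bot-wrapper.sh, bot-install.sh and the unit files
```

Keeping the decision separate from the copy makes the branch easy to test against a scratch directory without touching a real volume.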
Storage cost per bot: ~2.7 GB base + user data.
| ✅ Pros | ❌ Cons |
|---|---|
| Simple to understand and debug | 2.7 GB × N bots = massive storage waste |
| Full VM-like experience – user controls everything | ZFS deduplication needed externally (CPU/RAM hungry) |
| apt install works perfectly | Slow first boot (rsync of 2.7 GB) |
| User can modify any system file | No clean way to update base OS across all bots |
| Init container update works for our scripts | Users who apt install diverge from base -> can’t safely rebase |
Alternative 1: OverlayFS Inside the Pod#
Concept: Use Linux OverlayFS inside the container to stack a read-only base layer (from the container image or a shared ReadOnlyMany PVC) with a per-bot writable layer (per-bot PVC).
┌────────────────────────────────────────┐
│ Merged View (what user sees) │
├────────────────────────────────────────┤
│ upperdir (per-bot PVC: ~200 MB) │ <- user changes only
│ workdir (per-bot PVC) │
├────────────────────────────────────────┤
│ lowerdir (read-only base image) │ <- 2.7 GB shared across all bots
└────────────────────────────────────────┘

Implementation:
- Init container runs `mount -t overlay overlay -o lowerdir=/image,upperdir=/data/upper,workdir=/data/work /merged`.
- Main container uses `/merged` as the rootfs.
- The per-bot PVC only stores diffs (created/modified/deleted files).
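A minimal sketch of how the init container might assemble that mount, using the same paths as the example. The function names are hypothetical, and running the printed command still needs the mount privilege discussed in the table below.

```sh
# Create the per-bot writable directories. OverlayFS requires workdir to be
# an empty directory on the same filesystem as upperdir.
prepare_overlay_dirs() (
  data="$1"
  mkdir -p "$data/upper" "$data/work"
)

# Print (not run) the overlay mount command for the given layers.
#   $1 = read-only lowerdir, $2 = per-bot data dir, $3 = merged mount point
overlay_mount_cmd() (
  lower="$1"; data="$2"; merged="$3"
  printf 'mount -t overlay overlay -o lowerdir=%s,upperdir=%s/upper,workdir=%s/work %s\n' \
    "$lower" "$data" "$data" "$merged"
)
```

For example, `overlay_mount_cmd /image /data /merged` prints the exact command from the list above, which makes the layer layout easy to review before anything is mounted.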
| ✅ Pros | ❌ Cons |
|---|---|
| Dramatic storage savings: per-bot PVC is ~50-500 MB instead of 2.7 GB | Requires privileged: true or SYS_ADMIN cap in init container for mount – conflicts with User Namespaces (hostUsers: false) which is your security model |
| Fast first boot – no rsync needed | Mounting OverlayFS from inside a user namespace needs kernel ≥5.11 (earlier kernels only with distro-specific patches); k3s may not support it out of the box |
| Clean base update: swap lowerdir image -> all bots see new OS files | apt metadata inconsistency: user installs vim, you update base -> dpkg DB shows conflicting state (old lowerdir has package DB without vim, upperdir has dpkg fragments) |
| User still has full VM-like experience | Whiteouts (0/0 character devices that mark deleted files) accumulate in upperdir over time, especially after apt upgrade; can be confusing |
| ZFS dedup no longer needed for base files | Testing/debugging overlay issues in k8s is painful |
| Natural deduplication at filesystem level | If user does apt upgrade, the entire upgraded file tree goes into upperdir, losing dedup benefit |
Verdict: ⚠️ The hostUsers: false (User Namespaces) security model is a hard blocker for OverlayFS mount syscall inside the container on most kernels. You’d need to either drop User Namespaces (bad for security) or run with SYS_ADMIN capability (partially defeats isolation). Not recommended unless you abandon User Namespaces.
Alternative 2: ZFS Clone From Snapshot (Infrastructure-Level)#
Concept: Instead of creating an empty PVC and rsyncing into it, use ZFS to create a clone from a pre-populated snapshot of a “golden” bot volume.
zfs snapshot tank/bot-golden@v0.2.2
zfs clone tank/bot-golden@v0.2.2 tank/bot-u1000b1-data   # instant, ~0 bytes initially

Implementation:
- Pre-provision a “golden” ZFS dataset with the full bot rootfs at a specific version.
- Snapshot it: `zfs snapshot tank/bot-golden@v0.2.2`.
- When a new bot PVC is created, use a CSI driver (e.g., OpenEBS ZFS LocalPV) to create the PVC from the snapshot.
- Block-level CoW: only blocks that the user modifies are duplicated.
- Version update: create new snapshot `@v0.2.3`, clone it, diff-merge user’s changes.
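For reference, the manual equivalent of what the CSI driver would do per bot can be written as a dry-run helper. This is only a sketch: the function name is hypothetical, and the dataset naming (`bot-golden`, `bot-<id>-data`) is taken from the example above rather than from any real provisioner.

```sh
# Print the zfs command that would clone the golden snapshot for one bot.
#   $1 = pool, $2 = golden version (without the leading "v"), $3 = bot id
zfs_clone_plan() (
  pool="$1"; version="$2"; bot_id="$3"
  snap="$pool/bot-golden@v$version"
  echo "zfs clone $snap $pool/bot-$bot_id-data"
)

# After cloning, `zfs get -H -o value origin <dataset>` reports the snapshot
# a clone depends on, which is what pins old snapshots in place.
```

Calling `zfs_clone_plan tank 0.2.2 u1000b1` reproduces the clone command shown earlier without touching a pool.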
| ✅ Pros | ❌ Cons |
|---|---|
| Instant provisioning – clone is O(1) | Requires OpenEBS ZFS CSI driver or custom provisioner |
| Zero additional storage at creation – true CoW at block level | All bots are pinned to the parent snapshot; can’t delete old snapshots until all clones are destroyed or promoted |
| No ZFS dedup cron job needed – dedup is inherent via CoW | Version update is complex: need to `zfs promote`, `zfs send` … |
| Full VM-like experience preserved | apt install works but fragments CoW – user-modified blocks become unique |
| No kernel version or User Namespace conflict | Requires ZFS-specific PVC provisioner; less portable |
| Block-level efficiency is better than file-level overlay | Snapshot chain grows; need periodic “squash” operations |
| User can still modify any file | Backup/restore needs ZFS-aware tooling |
Verdict: ✅ Strong candidate. You already use ZFS (storageClass: zfs-raid0). This is the most natural fit for your infrastructure. The OpenEBS ZFS LocalPV CSI driver supports snapshot-based volume creation. The init container becomes trivial: it only needs to update managed scripts on existing volumes, and for new volumes, the ZFS clone is already pre-populated.
Alternative 3: Kubernetes Image Volumes (KEP-4639)#
Concept: Kubernetes v1.31+ (alpha) / v1.33+ (beta) supports image volume sources – mount an OCI image as a read-only volume directly. Combine with a per-bot writable PVC.
volumes:
  - name: base-os
    image:
      reference: git.kabakaev.com/runabot/bot:0.2.2
      pullPolicy: IfNotPresent
  - name: user-data
    persistentVolumeClaim:
      claimName: data

Implementation:
- Base OS files are served read-only from the container image itself (no rsync, no PVC copy).
- Per-bot PVC stores only `/home`, `/root`, `/var/lib/dpkg`, `/etc` diffs.
- Init container merges: copies specific config files from image volume to user PVC if missing.
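The "copy if missing" merge in the last step might look like the sketch below. The function name and the idea of passing an explicit file list are assumptions; which config files actually need seeding is not specified here.

```sh
# Copy each listed file from the read-only image volume to the PVC,
# but only when the PVC does not already have it.
#   $1 = image volume root, $2 = PVC root, remaining args = relative paths
seed_missing() (
  src_root="$1"; dst_root="$2"; shift 2
  for rel in "$@"; do
    src="$src_root/$rel"; dst="$dst_root/$rel"
    if [ ! -e "$src" ]; then
      continue                      # nothing to seed from
    fi
    if [ -e "$dst" ]; then
      continue                      # never overwrite the user's copy
    fi
    mkdir -p "$(dirname "$dst")"
    cp -p "$src" "$dst"
    echo "seeded $rel"
  done
)
```

Because existing files are skipped, the function is safe to run on every boot: user edits survive, and only files that are new in the image get copied.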
| ✅ Pros | ❌ Cons |
|---|---|
| Zero storage duplication – image layers are shared by containerd | Alpha in k3s 1.31, beta in 1.33; your cluster runs k3s 1.35, so beta should be available |
| Elegant Kubernetes-native solution | Image volumes are read-only – user can’t apt install system packages directly (they’d go to a separate writable layer) |
| Base update = new image tag, pods restart | Not truly a “VM-like experience” – user sees split between RO base and RW overlay |
| No ZFS-specific tooling needed | Requires ImageVolume feature gate; may not be stable enough for production |
| Fast pod startup | More complex volume mount setup in StatefulSet |
| | Users who do apt install would need a writable bind-mount overlaid on /usr – back to OverlayFS problem |
Verdict: 🟡 Promising for the future but breaks the “full VM-like experience” unless combined with OverlayFS (which has the User Namespace problem). Best suited if you’re willing to constrain what users can do (no apt install, use Homebrew/nix/pip instead).
Alternative 4: Selective Sync + Shared Read-Only Base PVC#
Concept: Instead of copying the whole rootfs, only copy user-mutable directories (/home, /root, /var, /etc) to the per-bot PVC. Mount the rest (/usr, /opt, /lib) directly from the container image (read-only).
# statefulset.yaml changes:
containers:
  - name: bot
    volumeMounts:
      # Read-write from PVC (user-owned)
      - name: data
        mountPath: /home
        subPath: home
      - name: data
        mountPath: /root
        subPath: root
      - name: data
        mountPath: /var
        subPath: var
      - name: data
        mountPath: /etc
        subPath: etc
      # /usr, /opt, /lib come from the image itself (read-only)

Implementation:
- Init container only copies `/home`, `/root`, `/var`, `/etc` if PVC is new (~200-500 MB).
- `/usr`, `/opt`, `/lib`, `/bin`, `/sbin` come directly from the container image layers (read-only, shared by containerd).
- User’s homebrew installs go to `/home/linuxbrew` (on PVC ✅).
- Node.js, npm are in `/usr` (from image, read-only).
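A sketch of the selective first-boot copy, assuming a directory counts as "new" when it is missing or empty on the PVC. The function name is hypothetical, and `cp -a` stands in for the `rsync` call used by the current `init.sh`.

```sh
# Directories that live on the per-bot PVC under the selective-sync layout.
MUTABLE_DIRS="home root var etc"

# Populate each mutable directory from the image, skipping any that already
# has content from an earlier boot.
#   $1 = image root, $2 = PVC root
sync_mutable_dirs() (
  image_root="$1"; pvc_root="$2"
  for d in $MUTABLE_DIRS; do
    if [ -d "$pvc_root/$d" ] && [ -n "$(ls -A "$pvc_root/$d")" ]; then
      echo "skip $d"
      continue
    fi
    mkdir -p "$pvc_root/$d"
    if [ -d "$image_root/$d" ]; then
      cp -a "$image_root/$d/." "$pvc_root/$d/"   # preserve owners, modes, links
    fi
    echo "copied $d"
  done
)
```

Checking each directory separately means a partially provisioned PVC is completed on the next boot instead of being recopied from scratch.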
| ✅ Pros | ❌ Cons |
|---|---|
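| ~80% storage reduction: PVC only needs ~500 MB base instead of 2.7 GB | apt install is broken: /usr comes read-only from the image (this assumes readOnlyRootFilesystem: true; otherwise writes land in the container’s ephemeral layer and vanish on restart) |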
| Fast initial sync (~10s instead of ~60s) | User can’t modify system packages or libraries |
| No special kernel features needed | Less “VM-like” – user notices they can’t write to /usr |
| Base update is automatic with new image tag | Homebrew works (lives in /home), pip works, npm works, but system packages don’t |
| Works with User Namespaces (no privilege escalation) | /var/lib/dpkg is on PVC but /usr is RO -> dpkg state mismatch |
| Simple to implement – just change statefulset.yaml and init.sh | Some tools expect writable /opt (currently on PVC, so OK) |
Verdict: ✅ Practical middle ground. This is the easiest to implement and gives 80% of the savings. The tradeoff is that apt install won’t work for system packages, but brew install, pip install, and npm install all work fine since they target user-writable paths. For your target audience of “very inexperienced users,” they won’t be using apt anyway.
Alternative 5: Nix/Guix-Style Immutable Base + User Profile#
Concept: Replace the Debian base with an immutable OS (like NixOS or a read-only Debian). All user-desired packages are installed via Nix or Homebrew into the user’s profile directory on the PVC.
| ✅ Pros | ❌ Cons |
|---|---|
| Reproducible, immutable base | Steep learning curve for users (nix, not apt) |
| Per-user package installs are completely isolated | Requires significant Containerfile rework |
| Trivial base updates | Nix store can grow large (~1 GB+ for common packages) |
| Good dedup with Nix store | Very different UX from what users expect |
Verdict: ❌ Too radical for your user base and timeline. Adds complexity without proportional benefit for your use case.
⚠️ The Clone Drift Problem (Why ZFS Clone Fails Long-Term)#
ZFS clones share blocks with their parent snapshot only – sibling clones do NOT share blocks with each other. This creates a hidden O(N × time) storage tax:
Month 0: Golden Snapshot @v0.2.2 (2.7 GB)
├── Bot A clone: +0 MB (all blocks shared with parent) ✅
├── Bot B clone: +0 MB
└── Bot C clone: +0 MB
Total: ~2.7 GB
Month 3: Golden Snapshot @v0.2.2 (2.7 GB, frozen -- never apt-upgraded)
├── Bot A: user ran apt upgrade -> glibc, nodejs, clang updated -> +800 MB unique
├── Bot B: same apt upgrade -> +800 MB unique (NOT shared with Bot A!)
└── Bot C: same apt upgrade -> +800 MB unique
Total: 2.7 GB + 2.4 GB = 5.1 GB ❌
Month 6: Golden Snapshot @v0.2.2 (2.7 GB, even more stale)
├── Bot A: +1.6 GB diverged from base
├── Bot B: +1.6 GB diverged (identical to A, but NOT shared!)
└── Bot C: +1.6 GB diverged
Total: 2.7 GB + 4.8 GB = 7.5 GB ❌❌ nearly the 8.1 GB of three plain rsync copies, with nothing shared between bots

The critical insight: ZFS clone CoW ≠ ZFS block dedup.
- Clone CoW: dedup against the parent snapshot only. Siblings are strangers.
- ZFS block dedup: dedup across everything, but costs ~320 bytes RAM per block (~5 GB RAM per 1 TB data). Impractical at scale.
- rmlint/file-level dedup: what you do now. Better cross-sibling coverage, but periodic.
Once the golden snapshot falls behind package updates, every bot that runs apt upgrade
stores its own unique copy of the same updated packages. After 6 months the clone approach
costs nearly as much as the current rsync approach and, unlike rsync + rmlint, shares none
of the upgraded files across bots: you’re paying for the stale snapshot blocks plus all the
diverged per-bot blocks. Once a bot has fully diverged, it costs more than a plain rsync copy.
You could periodically rebase (create new golden snapshot, migrate clones), but this is operationally complex and error-prone with user data mixed in.
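The drift arithmetic from the timeline above can be reproduced with two small helpers (sizes in MB; the function names are illustrative). It makes it easy to check where the clone layout overtakes a plain per-bot copy for a given drift.

```sh
# Clone layout: one pinned snapshot plus unique (diverged) blocks per bot.
#   $1 = base size, $2 = number of bots, $3 = drift per bot
clone_total_mb() ( echo $(( $1 + $2 * $3 )) )

# Plain rsync layout before any dedup: one full copy per bot.
#   $1 = base size, $2 = number of bots
rsync_total_mb() ( echo $(( $1 * $2 )) )

clone_total_mb 2700 3 800     # month 3: 5100
clone_total_mb 2700 3 1600    # month 6: 7500
clone_total_mb 2700 3 2700    # fully diverged: 10800
rsync_total_mb 2700 3         # three full copies: 8100
```

With these inputs the clone layout passes the raw rsync total once per-bot drift exceeds `base * (bots - 1) / bots`, i.e. 1800 MB for three bots, because the stale snapshot can never be freed.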
Recommendation Matrix#
| Approach | Day 1 Storage | Month 6 Storage | Year 2 Storage | apt install | User Namespaces | Recommended? |
|---|---|---|---|---|---|---|
| Current (rsync + rmlint) | 2.7 GB/bot | 2.7 GB/bot (deduped) | ~2.7 GB/bot | ✅ Full | ✅ Preserved | Baseline |
| Alt 1 (OverlayFS in pod) | ~200 MB | ~1 GB (drift) | ~2 GB+ | ✅ Full | ❌ Broken | ❌ |
| Alt 2 (ZFS Clone) | ~0 MB | ~1.6 GB (drift!) | ~2.7 GB (fully diverged) | ✅ Full | ✅ Preserved | ❌ Trap! |
| Alt 3 (K8s Image Volumes) | ~0 MB | ~0 MB (base shared) | ~0 MB | ❌ RO only | ✅ Preserved | 🟡 Future |
| Alt 4 (Selective Sync) | ~500 MB | ~600 MB | ~800 MB | ❌ /usr RO | ✅ Preserved | ✅✅ Best |
| Alt 5 (Nix/Guix) | ~1 GB | ~1.5 GB | ~2 GB | ❌ Nix only | ✅ Preserved | ❌ |
Recommended Path#
The Answer: Alternative 4 – Selective Sync (Immutable Base)#
Selective Sync is not just a “quick win” – it is the correct long-term architecture:
- containerd already deduplicates image layers via its content-addressable store. If 100 bots use `runabot/bot:0.2.3`, the 2.7 GB of `/usr`, `/lib`, `/bin` exists exactly once per node on disk as shared OCI layers. This is free, requires zero ZFS trickery, and scales to thousands of bots.
- Base OS updates become trivial: bump the image tag in the Helm chart -> all bots get the new glibc/nodejs/clang on next restart. No rebase, no clone migration, no dpkg conflicts.
- Per-bot PVC shrinks to ~200-500 MB (only `/home`, `/root`, `/var`, `/etc`). At this size, even without dedup, 100 bots = ~50 GB instead of ~270 GB.
- `apt install` being broken is actually a feature for your target audience:
  - Inexperienced users won’t use `apt`, and a clear error is better than silent divergence.
  - `brew install` already works (lives in `/home/linuxbrew`, on PVC ✅).
  - `pip install`, `npm install`, `openclaw plugins install` all work.
  - Power users get packages via bot addons (Helm-managed, versioned, controlled by you).
- Security improves: read-only `/usr` means a compromised bot can’t trojanize system binaries.
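The fleet-level numbers quoted above (100 bots, ~50 GB versus ~270 GB, roughly 80% saved) follow from simple integer arithmetic. The helper names below are illustrative, and the 500 MB figure is the upper end of the per-bot estimate.

```sh
# Total PVC storage for a fleet, in GB (1 GB = 1000 MB here).
#   $1 = number of bots, $2 = MB per bot
fleet_gb() ( echo $(( $1 * $2 / 1000 )) )

# Percentage saved per bot when moving from $1 MB to $2 MB (rounded down).
savings_pct() ( echo $(( ($1 - $2) * 100 / $1 )) )

fleet_gb 100 500        # selective sync: 50
fleet_gb 100 2700       # full rsync: 270
savings_pct 2700 500    # 81
```

Image layers are not counted here because they are stored once per node by containerd rather than once per bot.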
Implementation:
- Modify `init.sh` to only copy `/home`, `/root`, `/var`, `/etc` on first boot (~10s).
- Modify `statefulset.yaml` to stop mounting `/usr` and `/opt` from PVC.
- `/usr`, `/opt`, `/lib`, `/bin`, `/sbin` come from the container image (read-only, auto-shared).
- Risk: low. Rollback: revert to full rsync.
Migration Path for Existing Bots#
For bots that already have a full rsync volume:
# In init.sh, detect old-style volume and migrate.
# bot-wrapper.sh is a regular file, so test it with -f (a -d test would never match).
if [ -f "/main_container/usr/local/bin/bot-wrapper.sh" ] && [ ! -f "/main_container/.migrated-v2" ]; then
  echo "Migrating to selective sync layout..."
  # Save the package selection list so user-installed apt packages can be identified later
  dpkg --root=/main_container --get-selections > /main_container/home/.user-packages.txt 2>/dev/null || true
  # Mark as migrated (old /usr will simply be ignored)
  touch /main_container/.migrated-v2
fi

Long-term Enhancement: K8s Image Volumes (Alt 3)#
- When k3s matures ImageVolume support (stable by k8s ~1.37+), consider using it to mount the base OS even more elegantly.
- Compatible with Selective Sync – it’s the same philosophy (immutable base from OCI image).
The apt Database Consistency Problem#
Problem: If you use any overlay/CoW approach and update the base layer, the dpkg
database (/var/lib/dpkg/) on the user’s writable layer reflects packages the user
installed, but the files for those packages may have been replaced/removed in the new
base layer.
Solutions by approach:
- ZFS Clone (Alt 2): No dpkg conflict per se; each clone has its own dpkg DB. But clone drift means you pay double for every updated package.
- Selective Sync (Alt 4): `apt install` fails (RO `/usr`). No conflict possible. ✅
- OverlayFS (Alt 1): Dpkg DB is in upperdir, files in lowerdir. After base swap: `dpkg --audit` would report issues. Fragile.
Pragmatic answer: For your target audience (“very inexperienced users”), apt install
should not be the primary package management story. Point users to brew install (already
set up in bot-wrapper.sh) or provide an “addon” system for common tools.
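One way to give users a clear error instead of a confusing apt failure is a small guard that a wrapper could run before calling apt. This is a hypothetical helper, not something `bot-wrapper.sh` is known to contain, and the message text is only a suggestion.

```sh
# Return 0 if system packages can be installed, otherwise explain the
# alternative and return 1. $1 defaults to /usr and is a parameter only
# so the check can be exercised against other paths.
apt_guard() {
  usr_dir="${1:-/usr}"
  if [ -w "$usr_dir" ]; then
    return 0
  fi
  echo "System packages are read-only on this bot. Try: brew install <package>" >&2
  return 1
}
```

A wrapper might call `apt_guard || exit 1` before handing off to the real apt binary, so the user sees the Homebrew hint up front.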