Commit graph

32 commits

Author SHA1 Message Date
Felix Wolf 4133ad8f24 fix(matrix): raise rc_login burst limit to stop M_LIMIT_EXCEEDED 2026-05-04 20:33:32 +02:00
Felix Wolf 0095c7ee7f feat(matrix): wire Synapse into monitoring stack
- New headless Service matrix-synapse-metrics exposing port 9090
  (Synapse's /_synapse/metrics listener), labeled matrix_metrics=enabled
- VictoriaMetrics scrape job 'matrix' targets endpoints in matrix ns
  with that label + port name 'metrics'
- Grafana picks up the official Synapse dashboard from
  element-hq/synapse v1.152.0 contrib/grafana/synapse.json via URL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:26:46 +02:00
Felix Wolf 137705bfe0 feat(matrix): add Synapse + Element Web deployment
Personal homeserver with bridges deferred. Single host
matrix.{cluster.domain} with path-based routing: /_matrix, /_synapse,
/.well-known/matrix → Synapse; / → Element Web. Both share matrix-tls.

Stack: ananace/matrix-synapse + element-web charts, CNPG postgres
(LC_COLLATE=C), in-cluster alpine redis (no auth, replaces bitnami
subchart), mittwald-generated synapse-secrets for registration_shared/
macaroon/form_secret, custom idempotent signing-key init Job (replaces
chart's bitnami/kubectl publisher).

Sync waves:
  -3 Namespace
  -2 synapse-secrets (mittwald head-start), signing-key RBAC
  -1 signing-key Job, CNPG Cluster, redis
   0 Synapse, Element, Ingress

Synapse pod waits in extraCommands until synapse-secrets is populated,
then writes zz-overrides.yaml to override chart placeholders for the
three secret values without churning the chart-managed Secret on every
render. Resources tightened for 1-2 user scale: Synapse 256Mi/512Mi,
Postgres 64Mi/128Mi.

ArgoCD destination.namespace overridden to matrix via prototype-level
argocd overlay so both apps share the matrix ns instead of creating
unused matrix-synapse and element-web namespaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:04:29 +02:00
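The sync-wave ordering listed in 137705bfe0 can be illustrated with a short sketch. It relies on ArgoCD applying lower waves first. The resource names are transcribed from the commit message, and `apply_order` is a hypothetical helper for illustration, not an ArgoCD API or code from this repository.

```python
# Sketch of the sync-wave ordering from commit 137705bfe0.
# Resource names are transcribed from the commit message; apply_order is a
# hypothetical helper, not part of ArgoCD.
WAVES = {
    "Namespace": -3,
    "synapse-secrets": -2,
    "signing-key RBAC": -2,
    "signing-key Job": -1,
    "CNPG Cluster": -1,
    "redis": -1,
    "Synapse": 0,
    "Element": 0,
    "Ingress": 0,
}


def apply_order(waves):
    """Group resources by wave and return the groups in ascending wave order."""
    grouped = {}
    for name, wave in waves.items():
        grouped.setdefault(wave, []).append(name)
    return [(wave, sorted(names)) for wave, names in sorted(grouped.items())]


if __name__ == "__main__":
    for wave, names in apply_order(WAVES):
        print(f"{wave:>3}  {', '.join(names)}")
```

Resources within one wave have no ordering guarantee relative to each other, which is presumably why the Synapse pod still waits for synapse-secrets to be populated before starting.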
Felix Wolf fe51c8c1bc feat(minikube): add minikube environment with garage S3 backend
Adds a self-contained minikube environment for local development and
testing alongside the existing production env.

env: minikube
  - cluster.domain: minikube (browser DNS routes *.minikube → minikube ip)
  - tls issuer: mkcert (CA-signed via cert-manager mkcert ClusterIssuer)
  - storageClass: standard (minikube hostpath provisioner)
  - backups disabled; storagebox disabled
  - excludes argocd, forgejo, hcloud-csi (manual kubectl apply for testing)

prototypes/garage:
  - hand-rolled S3-compatible object store (single Deployment + PVC)
  - mittwald-generated rpc_secret + admin_token (hex)
  - PostSync init Job: assigns cluster layout, ensures bucket and access
    key, writes ocis-s3-credentials cross-namespace into ocis ns
  - idempotent: skips if k8s secret already populated; otherwise rotates
    the garage key (admin API only returns secretAccessKey on create)
  - cross-ns RBAC re-pinned via zz-cross-ns-rbac-fix overlay (ns.ytt.yaml
    clobbers explicit namespace fields)

ocis:
  - new admin-user-id init Job ensures secret.user-id is a valid UUID v4
    (mittwald can't generate UUIDs; ocis-settings rejects non-UUID ids)
  - mittwald no longer manages user-id; existing prod UUIDs preserved
  - insecure flag (oidcIdpInsecure / ocisHttpApiInsecure / ocmInsecure)
    parameterized; defaults to false; minikube sets true for self-signed
    OIDC issuer URL trust

other prototypes:
  - victoria-metrics-single helm values ytt-ified (storageClassName)
  - grafana admin secret now generated by mittwald (was hand-created in
    prod; manifest is no-op there since mittwald only fills empty fields)

flake.nix: minikube + docker + postgresql added to dev shell.
2026-05-03 17:23:57 +02:00
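The admin-user-id behaviour described in fe51c8c1bc (keep an existing valid UUID v4, otherwise create one) could look roughly like the sketch below. The real init Job is not shown in this log, so `is_uuid_v4` and `ensure_user_id` are hypothetical names, and requiring the canonical hyphenated form is an assumption.

```python
import uuid


def is_uuid_v4(value):
    """Return True only for a canonical, hyphenated UUID version 4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Reject other accepted spellings (bare hex, braces, urn:uuid:...) so that
    # opaque random hex is never mistaken for a UUID.
    return parsed.version == 4 and str(parsed) == value.lower()


def ensure_user_id(existing):
    """Keep an existing valid UUID v4; otherwise generate a fresh one."""
    if existing and is_uuid_v4(existing):
        return existing
    return str(uuid.uuid4())
```

Under these assumptions an already valid production id passes through unchanged, while an empty value or a 32-character hex string left over from the mittwald migration is replaced.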
Felix Wolf 279cd0d19f refactor(prototypes): parameterize env-specific values for multi-env support
Extract domain, ingress class, TLS issuer, storage classes, S3 endpoints,
backup toggles, and forgejo node selector into env-data values. Each
prototype's app-data declares its subdomain alongside namespace; templates
compute host as <subdomain>.<cluster.domain>.

Schema is shape-only with safe defaults; production env-data sets values
explicitly. Backup CronJobs and external-secret prechecks gate on
backups.enabled and ocis.s3.external. Adds mkcert ClusterIssuer + precheck
Job for local-dev TLS, gated on cluster.tls.issuer == "mkcert".

forgejo argocd-deploy-key Job: REPO_URL/FORGEJO_URL moved to container env
vars to keep the script ytt-templatable; runtime behavior unchanged.

Production render verified byte-identical (excluding the deploy-key Job
env-var refactor and chart-volatile UUID ConfigMaps).
2026-05-03 15:08:48 +02:00
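The host rule from 279cd0d19f, `<subdomain>.<cluster.domain>`, is implemented in ytt templates in the repository. The Python sketch below only illustrates the rule; `compute_host` is a hypothetical name, and stripping stray dots is an assumption rather than documented behaviour.

```python
def compute_host(subdomain, cluster_domain):
    """Join a prototype's subdomain and the environment's cluster domain."""
    if not subdomain or not cluster_domain:
        raise ValueError("subdomain and cluster_domain must be non-empty")
    # Strip stray dots so "drive." and ".example" do not yield "drive..example".
    return f"{subdomain.strip('.')}.{cluster_domain.strip('.')}"


if __name__ == "__main__":
    print(compute_host("drive", "tr1ceracop.de"))
    print(compute_host("grafana", "minikube"))
```

With the production domain this yields hosts such as drive.tr1ceracop.de, and with the minikube environment the same prototype resolves under *.minikube.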
Felix Wolf 122e03f3ec feat(ocis-backup): adds oCIS volume backup CronJobs
Implements daily online backups for oCIS persistent volumes.

Each CronJob uses `rclone` to sync its respective PVC to a Storage Box, mounting the volume read-only to ensure zero downtime. Pod affinity is configured to schedule the backup job on the same node as the consuming application pod. This covers `idm`, `storagesystem`, and `storageusers` data volumes.
2026-05-03 02:52:53 +02:00
Felix Wolf d65181de78 fix(ocis-backup): Fix S3 backup permissions and update config IDs
Adds `fsGroup` to the S3 backup cronjob's security context to ensure proper volume ownership. Increases the SSH key secret's `defaultMode` to grant group read access, resolving permission failures when reading the SSH key.
2026-05-03 02:16:02 +02:00
Felix Wolf d048bbb2a5 feat(monitoring): Add comprehensive oCIS monitoring
Integrates oCIS services into the monitoring stack by:
- Adding a new scrape configuration to VictoriaMetrics to collect metrics from oCIS services in the 'ocis' namespace.
- Introducing a new "ocis Overview" Grafana dashboard. This dashboard includes panels for user experience (proxy), service health, storage activity (uploads/downloads), and resource utilization, all leveraging the VictoriaMetrics datasource.
2026-05-03 01:48:11 +02:00
Felix Wolf 33c52be1c5 feat(pss): drop 5 namespaces from PSS privileged to restricted
argocd, cert-manager, cloudnative-pg already compliant — label flip only.
ocis: add overlay injecting seccompProfile=RuntimeDefault, drop ALL caps,
allowPrivilegeEscalation=false across all chart Deployments/CronJobs;
patch idm initContainer; harden custom precheck Job; refactor s3-backup
to rclone/rclone image (avoids apk-add-as-root).
victoria-metrics-single: overlay sets full restricted SC on the StatefulSet
that ships with empty securityContext: {}.

forgejo, traefik, kube-system stay privileged (hostPort / CSI driver).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:24:59 +02:00
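The three settings that 33c52be1c5 injects can be checked with a small sketch like the one below. It covers only the fields named in the commit message, not the full Pod Security Standards restricted profile, and `restricted_violations` is a hypothetical helper operating on plain dicts shaped like a pod-level and a container-level securityContext.

```python
def restricted_violations(pod_sc, container_sc):
    """List which of the three settings named in the commit are missing."""
    problems = []
    # seccompProfile may be set on the container or inherited from the pod.
    seccomp = container_sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
    if seccomp.get("type") != "RuntimeDefault":
        problems.append("seccompProfile.type must be RuntimeDefault")
    dropped = (container_sc.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        problems.append("capabilities.drop must include ALL")
    if container_sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    return problems


if __name__ == "__main__":
    # An empty securityContext, as shipped by the StatefulSet mentioned above.
    print(restricted_violations({}, {}))
```

An empty `securityContext: {}` fails all three checks, which matches the commit's reason for adding an overlay to the victoria-metrics-single StatefulSet.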
Felix Wolf bf0cf0a11d fix(forgejo): force-replace argocd-deploy-key-init Job
Replace=true alone uses kubectl replace, which rejects updates on Job
immutable fields (spec.selector, spec.template.metadata.labels) when
the cluster already has a Job with auto-generated values. Add Force=true
so ArgoCD does kubectl replace --force (delete + recreate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:05:45 +02:00
Felix Wolf 85b8fec6b3 feat: replace secret-init Jobs with mittwald operator + cert-manager
Migrate ~180 LOC of openssl/kubectl init Jobs to declarative Secret
manifests reconciled by mittwald/kubernetes-secret-generator (random
strings, SSH keypair) and cert-manager Certificates (RSA private key +
self-signed CA chain). mittwald only fills empty fields, so existing
populated Secrets keep their current values across the migration.

Changes:

- New prototype kubernetes-secret-generator (chart 3.4.1, mittwald helm
  repo). Cluster-wide informer reconciler, no webhook -> cold-bootstrap
  safe via ArgoCD retries.
- New cert-manager selfsigned ClusterIssuer (in-cluster trust root).
  letsencrypt remains for public-DNS endpoints.
- forgejo: admin-secret Job replaced with a mittwald-annotated Secret
  (hex-encoded 24-char password). Deploy-key Job split: mittwald
  ssh-keypair Secret + slim Job that uploads pubkey to Forgejo and
  copies privkey into the argocd repo Secret.
- ocis: 13 Secrets / 16 random fields now mittwald-managed (UUIDs
  replaced with opaque random hex; ocis treats user-id as opaque). IDP
  RSA signing key, LDAP self-signed CA, and LDAP server cert produced
  by cert-manager. Per-Deployment ytt overlay remaps volume key paths
  (tls.crt -> ldap-ca.crt, tls.key -> private-key.pem, etc.) since the
  ocis chart mounts Secrets raw without items support. Old multi-secret
  s3-secret-job replaced with a slim external-secret precheck Job that
  only validates pre-created Hetzner S3/Storage Box credentials.
- Application sync-wave -10 on cert-manager and kubernetes-secret-
  generator so they install before consumers. ArgoCD selfHeal handles
  any residual races.

CLAUDE.md: remove the "all namespaces use privileged PodSecurity"
convention. Existing namespaces still carry the label and will be
audited separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:00:07 +02:00
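The property that 85b8fec6b3 relies on, that only empty fields are filled so populated Secrets keep their values, can be modelled with the sketch below. It mirrors the behaviour described in the commit message, not the operator's actual code. `fill_empty_fields` and the field names in the example are hypothetical, and the 24-character default is taken from the Forgejo admin password mentioned above.

```python
import secrets


def fill_empty_fields(data, fields, length=24):
    """Return a copy of data where only missing or empty fields get random hex."""
    result = dict(data)
    for field in fields:
        if not result.get(field):
            # token_hex(n) yields 2*n hex characters, so halve the length.
            result[field] = secrets.token_hex(length // 2)
    return result


if __name__ == "__main__":
    existing = {"password": "keep-me", "token": ""}
    print(fill_empty_fields(existing, ["password", "token"]))
```

Running the function repeatedly over the same data changes nothing after the first pass, which is what makes the migration safe for Secrets that already exist in production.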
Felix Wolf 9112153e8a fix(ocis): resolve large file upload timeouts and enable stale upload cleanup
Increase Traefik readTimeout from 600s to 3600s to prevent connection drops during large uploads, and enable the suspended cleanUpExpiredUploads CronJob so stale TUS sessions are automatically purged.
2026-04-24 20:12:24 +02:00
Felix Wolf 88fa8c4df3 fix(traefik): increase read-timeout to avoid crashing ocis for large uploads
Traefik's default readTimeout of 60s was killing the upload connection. The cascade was:

  1. Large upload exceeds 60s → Traefik kills connection
  2. storageusers floods with NetworkTimeoutError
  3. Aborted uploads generate tons of NATS events
  4. NATS gets overwhelmed → no response from stream
  5. Proxy can't resolve user roles → login returns 500
2026-04-12 18:49:02 +02:00
Felix Wolf f57d29d1d3 chore: update service resource requests and identifiers
Increases memory requests for the IDM and NATS services to enhance stability and performance.
Updates application, service account, and storage UUIDs in configuration maps, reflecting a re-initialization or re-rendering of OIDC settings.
2026-04-12 18:26:47 +02:00
Felix Wolf d1eae1546e fix(victoriametrics): Remove nodeselector for old node
The node ubuntu-4gb-nbg1-1 was drained and replaced with a new x86 C33 machine, so the nodeSelector pinning VictoriaMetrics to it needs to be removed.
2026-04-12 17:30:26 +02:00
Felix Wolf f442255833 feat: configure storageusers resources and anti-affinity
Assigns specific CPU and memory requests and limits to the storageusers service to ensure stable operation and efficient resource utilization.

Introduces pod anti-affinity for storageusers to prevent it from being scheduled on the same node as victoria-metrics-single, improving resilience and preventing potential resource contention.
2026-04-06 16:39:24 +02:00
Felix Wolf 1122c3f0e2 feat: Implement S3 to Storage Box backup
Introduces a daily Kubernetes CronJob that utilizes rclone to perform compressed backups of oCIS S3 data to a Hetzner Storage Box via SFTP.

This new backup mechanism requires the manual creation of an 'ocis-storagebox-credentials' secret, which holds the Storage Box host, user, and SSH private key. A check is added to the secret initialization job to ensure this essential external secret exists.
2026-04-06 15:24:14 +02:00
Felix Wolf a3143ac33c feat: Configure oCIS for Hetzner Cloud storage
Sets `hcloud-volumes` as the default storage class for oCIS components including storageusers, storagesystem, and idm.
2026-04-06 14:25:35 +02:00
Felix Wolf 4e48df73d3 feat(ocis): Transition to oCIS and enhance deployment
Removes the full Nextcloud stack (PostgreSQL/CNPG, Valkey, Caddy) and
deploys oCIS at drive.tr1ceracop.de. oCIS is self-contained — no
external database or cache needed.

Key design decisions:
- S3ng storage backend on Hetzner Object Storage (ocis-tr1ceracop)
- Chart fetched via vendir git source (not published to a Helm repo)
- All secrets generated in-cluster via PreSync init Job (never in git)
- Memory requests on all pods to prevent node overcommit
- Persistence on local-path for metadata (idm, nats, search, storage)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 14:01:55 +02:00
Felix Wolf ffa171bfb0 feat: Replace Nextcloud with oCIS (ownCloud Infinite Scale)
Removes the full Nextcloud stack (PostgreSQL/CNPG, Valkey, Caddy sidecar)
and replaces it with oCIS at drive.tr1ceracop.de. oCIS is self-contained
(no external DB/cache needed) with S3ng storage backend on Hetzner Object
Storage (bucket: ocis-tr1ceracop). Chart sourced from git via vendir since
it is not published to a Helm repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 20:19:54 +02:00
Felix Wolf d1959dd6cf feat: Adds Nextcloud application deployment
Deploys Nextcloud using an FPM-alpine image with a Caddy sidecar for web serving.

Integrates with an external CloudNativePG cluster for PostgreSQL and a dedicated Valkey instance for caching. Configures S3-compatible object storage for file data.

Includes an initialization Job to create essential admin and Valkey secrets. Sets up Ingress for external access with automated TLS provisioning via cert-manager.

Configures local-path persistence for Nextcloud's core data to ensure state is maintained across pod restarts. Centralizes hostname configuration and migrates various Nextcloud settings to environment variables for streamlined management.

Adds ArgoCD ignore rules for `batch/Job` resource selectors and template labels, preventing spurious out-of-sync states caused by Kubernetes mutations and improving synchronization stability.
2026-04-04 19:24:50 +02:00
Felix Wolf 524ccc2611 fix(grafana): Use existing secret for admin credentials
Switch to admin.existingSecret to avoid rendering the admin password
into git. The secret must be created manually in the cluster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:59:58 +02:00
Felix Wolf aa55722803 feat: Add node selector for Victoria Metrics server
Configures the Victoria Metrics single server to be scheduled on a specific host, `ubuntu-4gb-nbg1-1`. This ensures the pod is scheduled on the node its PVC is bound to, since it uses a local-path volume.
2026-04-04 15:35:56 +02:00
Felix Wolf 09ecd5ba78 feat: Add kubelet and cAdvisor scrape jobs
Enables direct scraping of kubelet and cAdvisor metrics from Kubernetes nodes.
This provides more granular insights into node health and container resource utilization.
Configures secure HTTPS scraping using Kubernetes node service discovery.
2026-04-04 15:15:06 +02:00
Felix Wolf 8af1321177 feat: Add metrics-server for pod/node resource metrics
Enables CPU/memory visibility in k9s and kubectl top by deploying
the Kubernetes metrics-server via the metrics.k8s.io API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:34:32 +02:00
Felix Wolf 167fc62b92 feat: Add automated backups for Forgejo (Postgres + git repos)
- CNPG Barman backup to Hetzner S3 (s3://k8s-and-chill-backups/forgejo/cnpg/)
- ScheduledBackup CR: daily at 2 AM, 30d retention, prefer-standby
- Git repo rclone sync to S3 (s3://k8s-and-chill-backups/forgejo/git/) via CronJob at 3 AM
- Requires secrets: forgejo-backup-s3 (S3 creds), hcloud-token (not used but created)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:29:03 +02:00
Felix Wolf 25714eeef6 feat: Migrate Forgejo to CNPG PostgreSQL + Hetzner CSI volumes
- Add hcloud-csi prototype (Hetzner Cloud CSI driver)
- Add cloudnative-pg prototype (CNPG operator)
- Add CNPG Cluster CR for Forgejo (2 instances, lean config for 4GB nodes)
- Add 20Gi hcloud-volumes PVC for Forgejo git repos
- Switch Forgejo from SQLite to PostgreSQL (forgejo-cnpg-rw service)
- Switch Forgejo persistence to hcloud-volumes (forgejo-git-storage)
- Fix ClusterRoleBinding subject namespaces for hcloud-csi and CNPG
- Fix CNPG webhook service namespace references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:37:13 +02:00
Felix Wolf f096bba68b chore(forgejo): scale down forgejo for postgres-migration 2026-04-03 16:28:14 +02:00
Felix Wolf a92c5d8dc2 feat: Add VictoriaMetrics monitoring stack
Adds victoria-metrics-single, grafana, kube-state-metrics, and
node-exporter to the cluster. Enables metrics endpoints on traefik,
argocd, and cert-manager for scraping. Grafana available at
grafana.tr1ceracop.de with VictoriaMetrics as default datasource.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 00:20:23 +02:00
Felix Wolf c7bfd4953c feat: Wire ArgoCD to Forgejo for GitOps management
Configure myks with global repoURL pointing to Forgejo, in-cluster
destination, and disabled placeholder cluster Secret. Implement App of
Apps pattern with a root Application that syncs all child apps.

Add argocd-deploy-key-init Job that generates an ed25519 SSH keypair,
registers it as a deploy key via Forgejo API, and creates the ArgoCD
repository secret with insecure host key verification (avoids
chicken-and-egg with ArgoCD managing its own known hosts ConfigMap).

Additional changes:
- Ignore /status field diffs globally (K8s 1.32 compat)
- Add Replace=true sync option on Jobs (immutable resource compat)
- Switch job images from bitnami/kubectl to alpine/k8s
- Update CLAUDE.md with ArgoCD status and no-bitnami rule

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 23:09:50 +02:00
Felix Wolf 14cb67369d feat: Switch Forgejo SSH to hostPort 222
Use hostPort instead of NodePort for SSH access to avoid cross-node
asymmetric routing issues with kube-proxy nftables mode. Pin Forgejo
pod to node 3 (DNS target) and use port 222 to bypass ISP port 22
blocking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:56:38 +02:00
Felix Wolf 6f717a602f feat: Initial setup of GitOps-managed Kubernetes cluster
Configures `myks` for Helm chart rendering with `ytt` overlays to manage cluster applications.
Defines prototypes and environment-specific configurations for core applications including ArgoCD, Traefik, Cert-Manager, and Forgejo.
Adds comprehensive documentation covering cluster setup, GitOps structure, and development environment.
Integrates `direnv` for environment variable management, `.gitignore` for file exclusion, and `sops` for secret encryption.
Includes rendered Kubernetes manifests and ArgoCD application resources for initial deployment.
2026-03-30 18:21:05 +02:00