Commit graph

28 commits

Author SHA1 Message Date
Felix Wolf 0389eb5d20 feat(monitoring): Add comprehensive oCIS monitoring
Integrates oCIS services into the monitoring stack by:
- Adding a new scrape configuration to VictoriaMetrics to collect metrics from oCIS services in the 'ocis' namespace.
- Introducing a new "ocis Overview" Grafana dashboard. This dashboard includes panels for user experience (proxy), service health, storage activity (uploads/downloads), and resource utilization, all leveraging the VictoriaMetrics datasource.
2026-05-03 01:25:15 +02:00
Felix Wolf 33c52be1c5 feat(pss): drop 5 namespaces from PSS privileged to restricted
argocd, cert-manager, cloudnative-pg already compliant — label flip only.
ocis: add overlay injecting seccompProfile=RuntimeDefault, drop ALL caps,
allowPrivilegeEscalation=false across all chart Deployments/CronJobs;
patch idm initContainer; harden custom precheck Job; refactor s3-backup
to rclone/rclone image (avoids apk-add-as-root).
victoria-metrics-single: overlay sets full restricted SC on the StatefulSet
that ships with empty securityContext: {}.
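
As a rough illustration of what the overlay has to achieve, the container-level checks of the restricted Pod Security Standard can be sketched as below. This is a simplified model, not the admission controller's code: the function name `restricted_violations` is hypothetical, and it ignores that `runAsNonRoot` and `seccompProfile` may also be satisfied at pod level.

```python
from typing import Any

# Seccomp profile types accepted by the restricted profile.
ALLOWED_SECCOMP = {"RuntimeDefault", "Localhost"}


def restricted_violations(sc: dict[str, Any]) -> list[str]:
    """Return the restricted-profile checks a container securityContext fails.

    Hypothetical helper for illustration; only container-level fields are checked.
    """
    problems: list[str] = []
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    caps = sc.get("capabilities") or {}
    if "ALL" not in (caps.get("drop") or []):
        problems.append("capabilities.drop must include ALL")
    if set(caps.get("add") or []) - {"NET_BIND_SERVICE"}:
        problems.append("capabilities.add may only contain NET_BIND_SERVICE")
    if sc.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    seccomp_type = (sc.get("seccompProfile") or {}).get("type")
    if seccomp_type not in ALLOWED_SECCOMP:
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# The settings named in this commit, plus runAsNonRoot (assumed to be part
# of the "full restricted SC").
hardened = {
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": True,
    "seccompProfile": {"type": "RuntimeDefault"},
}

# An empty securityContext, like the one the StatefulSet ships with.
empty: dict[str, Any] = {}

print(restricted_violations(hardened))
print(restricted_violations(empty))
```

An empty `securityContext: {}` fails every check except the `capabilities.add` one, which is why a label flip alone would not be enough for that StatefulSet.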

forgejo, traefik, kube-system stay privileged (hostPort / CSI driver).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:24:59 +02:00
Felix Wolf bf0cf0a11d fix(forgejo): force-replace argocd-deploy-key-init Job
Replace=true alone uses kubectl replace, which rejects updates on Job
immutable fields (spec.selector, spec.template.metadata.labels) when
the cluster already has a Job with auto-generated values. Add Force=true
so ArgoCD does kubectl replace --force (delete + recreate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:05:45 +02:00
Felix Wolf 85b8fec6b3 feat: replace secret-init Jobs with mittwald operator + cert-manager
Migrate ~180 LOC of openssl/kubectl init Jobs to declarative Secret
manifests reconciled by mittwald/kubernetes-secret-generator (random
strings, SSH keypair) and cert-manager Certificates (RSA private key +
self-signed CA chain). mittwald only fills empty fields, so existing
populated Secrets keep their current values across the migration.
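
The "only fills empty fields" behavior is what makes the migration safe, and it can be modeled in a few lines. The sketch below is an assumption-laden model of that semantics, not the operator's implementation; `fill_empty_fields` and the field names are hypothetical.

```python
import secrets


def fill_empty_fields(
    data: dict[str, str], fields: list[str], hex_chars: int = 24
) -> dict[str, str]:
    """Return a copy of `data` where missing or empty `fields` get random hex.

    Fields that already hold a value are left untouched, which mirrors the
    behavior described in the commit message (hypothetical model only).
    """
    result = dict(data)
    for name in fields:
        if not result.get(name):
            # token_hex(n) yields 2*n hex characters.
            result[name] = secrets.token_hex(hex_chars // 2)
    return result


# A Secret that was populated by the old init Job, plus one new empty field.
existing = {"admin-password": "value-from-old-init-job", "jwt-secret": ""}
reconciled = fill_empty_fields(existing, ["admin-password", "jwt-secret"])

print(reconciled["admin-password"])   # unchanged
print(len(reconciled["jwt-secret"]))  # newly generated, 24 hex characters
```

Running the function a second time on its own output changes nothing, which is the property that lets the reconciler run repeatedly without rotating credentials.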

Changes:

- New prototype kubernetes-secret-generator (chart 3.4.1, mittwald helm
  repo). Cluster-wide informer reconciler, no webhook -> cold-bootstrap
  safe via ArgoCD retries.
- New cert-manager selfsigned ClusterIssuer (in-cluster trust root).
  letsencrypt remains for public-DNS endpoints.
- forgejo: admin-secret Job replaced with a mittwald-annotated Secret
  (hex-encoded 24-char password). Deploy-key Job split: mittwald
  ssh-keypair Secret + slim Job that uploads pubkey to Forgejo and
  copies privkey into the argocd repo Secret.
- ocis: 13 Secrets / 16 random fields now mittwald-managed (UUIDs
  replaced with opaque random hex; ocis treats user-id as opaque). IDP
  RSA signing key, LDAP self-signed CA, and LDAP server cert produced
  by cert-manager. Per-Deployment ytt overlay remaps volume key paths
  (tls.crt -> ldap-ca.crt, tls.key -> private-key.pem, etc.) since the
  ocis chart mounts Secrets raw without items support. Old multi-secret
  s3-secret-job replaced with a slim external-secret precheck Job that
  only validates pre-created Hetzner S3/Storage Box credentials.
- Application sync-wave -10 on cert-manager and kubernetes-secret-
  generator so they install before consumers. ArgoCD selfHeal handles
  any residual races.

CLAUDE.md: remove the "all namespaces use privileged PodSecurity"
convention. Existing namespaces still carry the label and will be
audited separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:00:07 +02:00
Felix Wolf 9112153e8a fix(ocis): resolve large file upload timeouts and enable stale upload cleanup
Increase Traefik readTimeout from 600s to 3600s to prevent connection drops during large uploads, and enable the suspended cleanUpExpiredUploads CronJob so stale TUS sessions are automatically purged.
2026-04-24 20:12:24 +02:00
Felix Wolf 40a32730c9 feat(talos): enable swap with zswap compression on control plane nodes
Configure 2GiB swap volume on system disk with LimitedSwap behavior
and zswap compression (20% max pool) to improve memory utilization
on the CAX11 nodes.
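
For a sense of scale, the zswap pool cap is a percentage of total RAM, so the limit follows from simple arithmetic. The sketch below assumes 4 GiB of RAM per CAX11 node, which is not stated in this commit, and the helper name is hypothetical.

```python
GIB = 1024**3


def zswap_pool_limit_bytes(ram_bytes: int, max_pool_percent: int) -> int:
    """Upper bound of the compressed zswap pool, as a share of total RAM."""
    return ram_bytes * max_pool_percent // 100


ram = 4 * GIB   # assumption: 4 GiB of RAM per node
swap = 2 * GIB  # swap volume size from the commit message
pool = zswap_pool_limit_bytes(ram, 20)

print(f"zswap pool cap: {pool / GIB:.2f} GiB of RAM")
print(f"backing swap:   {swap / GIB:.2f} GiB on disk")
```

Under that assumption the compressed pool may occupy at most about 0.8 GiB of RAM before pages are written out to the 2 GiB swap volume.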

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 19:33:06 +02:00
Felix Wolf 88fa8c4df3 fix(traefik): increase read-timeout to avoid crashing ocis for large uploads
Traefik's default readTimeout of 60s was killing the upload connection. The cascade was:

  1. Large upload exceeds 60s → Traefik kills connection
  2. storageusers floods with NetworkTimeoutError
  3. Aborted uploads generate tons of NATS events
  4. NATS gets overwhelmed → no response from stream
  5. Proxy can't resolve user roles → login returns 500
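
Step 1 is easy to quantify: a read timeout bounds how much data a single request can carry at a given throughput. The sketch below assumes a sustained client upload rate of 5 MB/s, an illustrative figure that does not come from this repository, and the function name is hypothetical.

```python
MB = 1_000_000


def max_upload_bytes(timeout_s: float, throughput_bytes_per_s: float) -> float:
    """Largest request body that can be fully read before the timeout fires."""
    return timeout_s * throughput_bytes_per_s


throughput = 5 * MB  # assumption: 5 MB/s sustained upload rate

for timeout in (60, 600, 3600):
    limit = max_upload_bytes(timeout, throughput)
    print(f"readTimeout={timeout:>4}s -> about {limit / MB:,.0f} MB per request")
```

At that rate a 60s timeout cuts off any single request larger than roughly 300 MB, so slower clients hit the limit with much smaller files.
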
2026-04-12 18:49:02 +02:00
Felix Wolf f57d29d1d3 chore: update service resource requests and identifiers
Increases memory requests for the IDM and NATS services to enhance stability and performance.
Updates application, service account, and storage UUIDs in configuration maps, reflecting a re-initialization or re-rendering of OIDC settings.
2026-04-12 18:26:47 +02:00
Felix Wolf d1eae1546e fix(victoriametrics): Remove nodeselector for old node
The node ubuntu-4gb-nbg1-1 was drained and replaced with a new x86 C33 machine, so the nodeselector needs to be removed.
2026-04-12 17:30:26 +02:00
Felix Wolf f442255833 feat: configure storageusers resources and anti-affinity
Assigns specific CPU and memory requests and limits to the storageusers service to ensure stable operation and efficient resource utilization.

Introduces pod anti-affinity for storageusers to prevent it from being scheduled on the same node as victoria-metrics-single, improving resilience and preventing potential resource contention.
2026-04-06 16:39:24 +02:00
Felix Wolf 1122c3f0e2 feat: Implement S3 to Storage Box backup
Introduces a daily Kubernetes CronJob that utilizes rclone to perform compressed backups of oCIS S3 data to a Hetzner Storage Box via SFTP.

This new backup mechanism requires the manual creation of an 'ocis-storagebox-credentials' secret, which holds the Storage Box host, user, and SSH private key. A check is added to the secret initialization job to ensure this essential external secret exists.
2026-04-06 15:24:14 +02:00
Felix Wolf a3143ac33c feat: Configure oCIS for Hetzner Cloud storage
Sets `hcloud-volumes` as the default storage class for oCIS components including storageusers, storagesystem, and idm.
2026-04-06 14:25:35 +02:00
Felix Wolf 4e48df73d3 feat(ocis): Transition to oCIS and enhance deployment
Removes the full Nextcloud stack (PostgreSQL/CNPG, Valkey, Caddy) and
  deploys oCIS at drive.tr1ceracop.de. oCIS is self-contained — no
  external database or cache needed.

  Key design decisions:
  - S3ng storage backend on Hetzner Object Storage (ocis-tr1ceracop)
  - Chart fetched via vendir git source (not published to a Helm repo)
  - All secrets generated in-cluster via PreSync init Job (never in git)
  - Memory requests on all pods to prevent node overcommit
  - Persistence on local-path for metadata (idm, nats, search, storage)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 14:01:55 +02:00
Felix Wolf ffa171bfb0 feat: Replace Nextcloud with oCIS (ownCloud Infinite Scale)
Removes the full Nextcloud stack (PostgreSQL/CNPG, Valkey, Caddy sidecar)
and replaces it with oCIS at drive.tr1ceracop.de. oCIS is self-contained
(no external DB/cache needed) with S3ng storage backend on Hetzner Object
Storage (bucket: ocis-tr1ceracop). Chart sourced from git via vendir since
it is not published to a Helm repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 20:19:54 +02:00
Felix Wolf d1959dd6cf feat: Adds Nextcloud application deployment
Deploys Nextcloud using an FPM-alpine image with a Caddy sidecar for web serving.

Integrates with an external CloudNativePG cluster for PostgreSQL and a dedicated Valkey instance for caching. Configures S3-compatible object storage for file data.

Includes an initialization Job to create essential admin and Valkey secrets. Sets up Ingress for external access with automated TLS provisioning via cert-manager.

Configures local-path persistence for Nextcloud's core data to ensure state is maintained across pod restarts. Centralizes hostname configuration and migrates various Nextcloud settings to environment variables for streamlined management.

Adds ArgoCD ignore rules for `batch/Job` resource selectors and template labels, preventing spurious out-of-sync states caused by Kubernetes mutations and improving synchronization stability.
2026-04-04 19:24:50 +02:00
Felix Wolf 27647e6c5c docs: Removes Forgejo admin password hardcoding TODO
The issue regarding the hardcoded Forgejo admin password has been addressed and no longer needs to be tracked as a known issue or TODO.
2026-04-04 17:37:44 +02:00
Felix Wolf 596e10f226 docs: add Nextcloud infrastructure transparency guide
Provides a comprehensive overview of the Nextcloud service setup. Explains data residency, technical architecture for reliability, data safety guarantees with multi-region backups, and the specific technologies utilized.

Also details infrastructure costs, privacy considerations, and recovery plans for different incident types to ensure user data integrity and availability.
2026-04-04 17:18:29 +02:00
Felix Wolf 524ccc2611 fix(grafana): Use existing secret for admin credentials
Switch to admin.existingSecret to avoid rendering the admin password
into git. The secret must be created manually in the cluster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 16:59:58 +02:00
Felix Wolf aa55722803 feat: Add node selector for Victoria Metrics server
Configures the Victoria Metrics single server to be scheduled on a specific host, `ubuntu-4gb-nbg1-1`. This ensures the pod is scheduled on the node its PVC is bound to, since it uses a local-path volume.
2026-04-04 15:35:56 +02:00
Felix Wolf 09ecd5ba78 feat: Add kubelet and cAdvisor scrape jobs
Enables direct scraping of kubelet and cAdvisor metrics from Kubernetes nodes.
This provides more granular insights into node health and container resource utilization.
Configures secure HTTPS scraping using Kubernetes node service discovery.
2026-04-04 15:15:06 +02:00
Felix Wolf 8af1321177 feat: Add metrics-server for pod/node resource metrics
Enables CPU/memory visibility in k9s and kubectl top by deploying
the Kubernetes metrics-server via the metrics.k8s.io API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:34:32 +02:00
Felix Wolf 167fc62b92 feat: Add automated backups for Forgejo (Postgres + git repos)
- CNPG Barman backup to Hetzner S3 (s3://k8s-and-chill-backups/forgejo/cnpg/)
- ScheduledBackup CR: daily at 2 AM, 30d retention, prefer-standby
- Git repo rclone sync to S3 (s3://k8s-and-chill-backups/forgejo/git/) via CronJob at 3 AM
- Requires secrets: forgejo-backup-s3 (S3 creds), hcloud-token (not used but created)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:29:03 +02:00
Felix Wolf 25714eeef6 feat: Migrate Forgejo to CNPG PostgreSQL + Hetzner CSI volumes
- Add hcloud-csi prototype (Hetzner Cloud CSI driver)
- Add cloudnative-pg prototype (CNPG operator)
- Add CNPG Cluster CR for Forgejo (2 instances, lean config for 4GB nodes)
- Add 20Gi hcloud-volumes PVC for Forgejo git repos
- Switch Forgejo from SQLite to PostgreSQL (forgejo-cnpg-rw service)
- Switch Forgejo persistence to hcloud-volumes (forgejo-git-storage)
- Fix ClusterRoleBinding subject namespaces for hcloud-csi and CNPG
- Fix CNPG webhook service namespace references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:37:13 +02:00
Felix Wolf f096bba68b chore(forgejo): scale down forgejo for postgres-migration
2026-04-03 16:28:14 +02:00
Felix Wolf a92c5d8dc2 feat: Add VictoriaMetrics monitoring stack
Adds victoria-metrics-single, grafana, kube-state-metrics, and
node-exporter to the cluster. Enables metrics endpoints on traefik,
argocd, and cert-manager for scraping. Grafana available at
grafana.tr1ceracop.de with VictoriaMetrics as default datasource.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 00:20:23 +02:00
Felix Wolf c7bfd4953c feat: Wire ArgoCD to Forgejo for GitOps management
Configure myks with global repoURL pointing to Forgejo, in-cluster
destination, and disabled placeholder cluster Secret. Implement App of
Apps pattern with a root Application that syncs all child apps.

Add argocd-deploy-key-init Job that generates an ed25519 SSH keypair,
registers it as a deploy key via Forgejo API, and creates the ArgoCD
repository secret with insecure host key verification (avoids
chicken-and-egg with ArgoCD managing its own known hosts ConfigMap).

Additional changes:
- Ignore /status field diffs globally (K8s 1.32 compat)
- Add Replace=true sync option on Jobs (immutable resource compat)
- Switch job images from bitnami/kubectl to alpine/k8s
- Update CLAUDE.md with ArgoCD status and no-bitnami rule

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 23:09:50 +02:00
Felix Wolf 14cb67369d feat: Switch Forgejo SSH to hostPort 222
Use hostPort instead of NodePort for SSH access to avoid cross-node
asymmetric routing issues with kube-proxy nftables mode. Pin Forgejo
pod to node 3 (DNS target) and use port 222 to bypass ISP port 22
blocking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:56:38 +02:00
Felix Wolf 6f717a602f feat: Initial setup of GitOps-managed Kubernetes cluster
Configures `myks` for Helm chart rendering with `ytt` overlays to manage cluster applications.
Defines prototypes and environment-specific configurations for core applications including ArgoCD, Traefik, Cert-Manager, and Forgejo.
Adds comprehensive documentation covering cluster setup, GitOps structure, and development environment.
Integrates `direnv` for environment variable management, `gitignore` for file exclusion, and `sops` for secret encryption.
Includes rendered Kubernetes manifests and ArgoCD application resources for initial deployment.
2026-03-30 18:21:05 +02:00