jfroy/flatops

⛵ flatops

A GitOps-managed Kubernetes homelab cluster running on Talos Linux.

📋 Overview

This repository contains the declarative configuration for kantai, a bare-metal Kubernetes cluster. The cluster is designed for home infrastructure workloads with a focus on:

  • GitOps-driven operations via FluxCD
  • Secure networking with Cilium in kube-proxy replacement mode
  • Distributed storage using Rook-Ceph
  • GPU workloads with NVIDIA GPU Operator
  • Comprehensive observability using VictoriaMetrics and Grafana
  • Automated dependency updates via Renovate

🏗️ Cluster Architecture

Nodes

Node: kantai1
Role: Hyper-converged control plane and workloads
Hardware:
  • AMD EPYC 7443P, 64 GiB
  • NVIDIA RTX 4000 Ada Generation, 24 GB
  • Micron 9300 PRO, 4 TB, x7
  • Seagate Exos X20, 18 TB, x15
  • NVIDIA ConnectX-5
  • LSI 9500-8e
  • 45Drives HL-15

Node: kantai2
Role: Virtual arm64 control plane and workloads
Hardware:
  • Apple M2 Mac Mini, 16 GB (mem), 500 GB (block)
  • UTM + QEMU hypervisor

Node: kantai3
Role: Hyper-converged control plane and workloads
Hardware:
  • AMD Ryzen Embedded V1500B, 32 GB
  • NVIDIA T400, 4 GB
  • Seagate Exos X18, 18 TB, x6
  • NVIDIA ConnectX-3
  • QNAP TS-673A

Infrastructure Stack

┌─────────────────────────────────────────────────────────────────────────┐
│                              Applications                               │
├─────────────────────────────────────────────────────────────────────────┤
│  Envoy Gateway │ external-dns │ Tailscale │ cert-manager │ Pocket ID    │
├─────────────────────────────────────────────────────────────────────────┤
│  VictoriaMetrics │ Grafana │ fluent-bit │ kube-prometheus-stack         │
├─────────────────────────────────────────────────────────────────────────┤
│  Rook-Ceph │ OpenEBS ZFS │ Samba │ VolSync → Cloudflare R2              │
├─────────────────────────────────────────────────────────────────────────┤
│  CloudNative-PG │ NVIDIA GPU Operator │ Multus CNI                      │
├─────────────────────────────────────────────────────────────────────────┤
│  Cilium (kube-proxy replacement, BGP, Network Policies)                 │
├─────────────────────────────────────────────────────────────────────────┤
│                        Talos Linux + Kubernetes                         │
└─────────────────────────────────────────────────────────────────────────┘

Network infrastructure

kantai sits on top of an all-Ubiquiti network, with a Hi-Capacity Aggregation switch as the top-of-rack (ToR) switch and a Dream Machine Pro as the gateway, router, and firewall. Recent versions of UniFi Network and UniFi OS support BGP, which the cluster uses to advertise load balancer addresses so that traffic to cluster services is balanced across nodes.

🔧 Core Components

GitOps & Cluster Management

FluxCD

The cluster is managed entirely through GitOps using FluxCD. All resources are declared in this repository and automatically reconciled to the cluster. The Flux Operator manages the FluxCD instance.

  • Kustomizations define the desired state of each application
  • HelmReleases manage Helm chart deployments
  • OCIRepositories pull charts from OCI registries
  • Drift detection ensures cluster state matches Git
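As a rough sketch of how these resource kinds fit together, the manifests below show an OCIRepository, a HelmRelease that references it, and a Kustomization that reconciles a directory from Git. The names, namespaces, chart URL, tag, path, and chart value are hypothetical placeholders, not values taken from this repository, and the API versions assume a recent Flux release (older releases serve OCIRepository under v1beta2).

```yaml
# Hypothetical example: names, URL, tag, and path are illustrative only.
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: example-app
  namespace: default
spec:
  interval: 1h
  url: oci://ghcr.io/example-org/charts/example-app  # placeholder chart location
  ref:
    tag: 1.2.3                                       # placeholder chart version
  layerSelector:
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app
  namespace: default
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: example-app
  driftDetection:
    mode: enabled      # correct out-of-band changes to Helm-managed objects
  values:
    replicaCount: 1    # placeholder chart value
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-app
  namespace: flux-system
spec:
  interval: 1h
  path: ./kubernetes/apps/default/example-app  # placeholder path
  prune: true          # delete objects that were removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system  # assumed name of the Git source
  targetNamespace: default
  wait: true
```

With `prune: true`, deleting a manifest from Git also removes the object from the cluster, which is what keeps the cluster state aligned with the repository.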

tuppr

Automated Talos and Kubernetes upgrades are managed by tuppr. Upgrade CRDs (TalosUpgrade, KubernetesUpgrade) define version targets with health checks that ensure VolSync backups complete and Ceph cluster health is OK before proceeding.

Renovate

This repository is continuously updated using Renovate and flux-local. Minor and patch updates are applied automatically, while major releases require human approval.

Networking

Cilium

Cilium serves as the CNI in kube-proxy replacement mode, providing:

  • eBPF-based networking with native routing
  • BGP Control Plane for advertising service IPs to the network with load-balancing
  • Network Policies for pod-level traffic control
  • Bandwidth Manager with BBR congestion control
  • IPv4/IPv6 dual-stack with BIG TCP support
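To illustrate the BGP piece, here is a minimal sketch of Cilium BGP Control Plane resources that peer with a router and advertise LoadBalancer service IPs. The ASNs, peer address, resource names, and the `bgp: advertise` service label are assumptions made for the example (the address comes from a documentation range), not this cluster's actual configuration. Newer Cilium releases also serve these kinds under `cilium.io/v2`.

```yaml
# Hypothetical example: ASNs, addresses, names, and labels are placeholders.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: bgp-cluster
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux   # select every Linux node
  bgpInstances:
    - name: instance-64513
      localASN: 64513           # placeholder private ASN for the cluster
      peers:
        - name: gateway
          peerASN: 64512        # placeholder private ASN for the router
          peerAddress: 192.0.2.1
          peerConfigRef:
            name: gateway-peer
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: gateway-peer
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: bgp        # selects the advertisement below
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: loadbalancer-ips
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: Service
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchLabels:
          bgp: advertise        # hypothetical opt-in label on Services
```

When every selected node advertises the same service address, the upstream router can spread traffic across nodes if it has multipath routing enabled.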

Envoy Gateway

Envoy Gateway implements the Kubernetes Gateway API for HTTP/HTTPS routes and load balancing. It provides the primary entry points for cluster services.
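For illustration, a service is typically exposed through a Gateway with an HTTPRoute like the one sketched below. The Gateway name and namespace, listener name, hostname, and backend service are hypothetical placeholders.

```yaml
# Hypothetical example: the Gateway, hostname, and backend are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: default
spec:
  parentRefs:
    - name: external        # assumed Gateway name
      namespace: network    # assumed Gateway namespace
      sectionName: https    # assumed listener name
  hostnames:
    - app.example.com
  rules:
    - backendRefs:
        - name: example-app # Service in the route's namespace
          port: 8080
```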

external-dns

external-dns automatically manages DNS records for services:

  • Cloudflare for public DNS
  • UniFi for internal DNS

Tailscale

The Tailscale Operator provides secure remote access to cluster services via a mesh VPN, including API server proxy functionality.

Multus

Multus CNI enables attaching multiple network interfaces to pods. It is used for workloads that require direct LAN access via macvlan interfaces, with dual-stack networking support.
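A minimal sketch of a macvlan attachment and a pod that requests it follows. The parent interface `eth0`, the `host-local` IPAM plugin, and the documentation-range subnets are assumptions for the example; the actual interface names and address management in this cluster may differ.

```yaml
# Hypothetical example: interface, subnets, and names are placeholders.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: lan-macvlan
  namespace: default
spec:
  # The CNI configuration is embedded as a JSON string.
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{ "subnet": "192.0.2.0/24" }],
          [{ "subnet": "2001:db8::/64" }]
        ]
      }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: lan-client
  namespace: default
  annotations:
    # Ask Multus to add a second interface from the attachment above.
    k8s.v1.cni.cncf.io/networks: lan-macvlan
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```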

Secrets Management

external-secrets + 1Password

external-secrets synchronizes secrets from 1Password into Kubernetes using the 1Password Connect server. A ClusterSecretStore provides cluster-wide access to secrets.
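As an illustration, an application would request its secret with an ExternalSecret along these lines. The store name `onepassword`, the item key, and the target Secret name are hypothetical, and the `external-secrets.io/v1` API version assumes a recent release (older releases use `v1beta1`).

```yaml
# Hypothetical example: store name, item key, and Secret name are placeholders.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: example-app
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: onepassword          # assumed ClusterSecretStore name
  target:
    name: example-app-secret   # Kubernetes Secret to create
  dataFrom:
    - extract:
        key: example-app       # 1Password item whose fields become Secret keys
```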

Certificate Management

cert-manager + trust-manager

cert-manager automates certificate lifecycle management:

  • ACME (Let's Encrypt) certificates for public services
  • Internal CA for cluster services
  • trust-manager distributes CA bundles across namespaces
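A sketch of a Let's Encrypt issuer is shown below. It assumes the DNS-01 challenge is solved through Cloudflare, which this document does not state; the email address and Secret names are placeholders.

```yaml
# Hypothetical example: the DNS-01 solver, email, and Secret names are assumptions.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-production-account-key  # ACME account key storage
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token         # placeholder Secret
              key: api-token
```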

Identity & Authentication

Pocket ID

Pocket ID serves as the in-cluster OIDC provider, enabling:

  • Kubernetes API server OIDC authentication
  • OAuth2 authentication for cluster services via Envoy Gateway
  • Centralized identity management for applications

Storage

Rook-Ceph

Rook-Ceph provides distributed storage across the cluster:

  • Block Storage (ceph-block) - Default storage class with 3-way replication, LZ4 compression
  • Object Storage (ceph-bucket) - S3-compatible storage with erasure coding (2+1)
  • Dashboard exposed via Envoy Gateway
  • Encrypted OSDs for data-at-rest security
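The two data-protection schemes above could be declared roughly as follows. Resource names, the `host` failure domain, and the gateway settings are assumptions for the sketch, and passing `compression_algorithm` through `parameters` is assumed to be forwarded to Ceph as a pool property in the same way as `compression_mode`.

```yaml
# Hypothetical example: names, failure domains, and gateway settings are placeholders.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ceph-blockpool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3                        # 3 copies: 3 bytes of raw space per byte stored
  parameters:
    compression_mode: aggressive
    compression_algorithm: lz4     # assumed to be applied as a pool property
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: ceph-objectstore
  namespace: rook-ceph
spec:
  metadataPool:
    failureDomain: host
    replicated:
      size: 3
  dataPool:
    failureDomain: host
    erasureCoded:
      dataChunks: 2                # 2 data + 1 coding chunk:
      codingChunks: 1              # 1.5 bytes of raw space per byte stored
  gateway:
    port: 80
    instances: 1
```

Both layouts tolerate some loss within the failure domain, but 2+1 erasure coding uses half the raw capacity of 3-way replication for the same amount of data and survives only one lost chunk, where 3-way replication keeps two spare copies.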

OpenEBS ZFS

OpenEBS ZFS LocalPV exposes existing ZFS pools on nodes as Kubernetes storage:

  • Provides access to large media and data pools
  • Supports ZFS features (compression, snapshots, datasets)
  • Used for workloads requiring high-capacity local storage

Samba

Samba deployments on storage nodes share ZFS-backed volumes to the local network via SMB, enabling access to cluster-managed data from non-Kubernetes clients.

VolSync + Kopia

VolSync backs up persistent volumes to Cloudflare R2 using Kopia:

  • Daily snapshots with 7 daily, 4 weekly, 12 monthly retention
  • Clone-based backups (no application downtime)
  • Zstd compression for efficient storage

Database

CloudNative-PG

CloudNative-PG manages PostgreSQL clusters for applications:

  • PostgreSQL 18 with vchord vector extensions for AI/ML workloads
  • WAL archiving via barman-cloud plugin
  • Automated backups and point-in-time recovery

GPU Compute

NVIDIA GPU Operator

The NVIDIA GPU Operator enables GPU workloads:

  • Automatic container toolkit management
  • CDI (Container Device Interface) support
  • Time-slicing for GPU sharing
  • DCGM metrics for monitoring
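Time-slicing is usually configured through a ConfigMap consumed by the device plugin. The sketch below advertises each physical GPU as four schedulable `nvidia.com/gpu` resources; the ConfigMap name, the `any` key, and the replica count are assumptions, not this cluster's settings.

```yaml
# Hypothetical example: names and the replica count are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each GPU is exposed as 4 shareable slots
```

The operator picks this up when its `devicePlugin.config.name` setting points at the ConfigMap and `devicePlugin.config.default` names the key. Time-slicing shares compute time only and does not isolate GPU memory between pods.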

Observability

Metrics: VictoriaMetrics

The VictoriaMetrics Operator manages the metrics stack:

  • VMSingle for metrics storage (12-week retention on Ceph block storage)
  • VMAgent for metric collection
  • VMAlert + VMAlertmanager for alerting
  • OpenTelemetry integration with Prometheus naming
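As a sketch of the storage component, a VMSingle with 12-week retention on Ceph block storage might look like this. The resource name, namespace, and volume size are placeholders; `ceph-block` is the storage class named earlier in this document.

```yaml
# Hypothetical example: name, namespace, and volume size are placeholders.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: metrics
  namespace: observability
spec:
  retentionPeriod: "12w"        # keep 12 weeks of samples
  storage:
    storageClassName: ceph-block
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi           # placeholder size
```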

Dashboards: Grafana Operator

The Grafana Operator manages Grafana instances and dashboards:

  • Declarative dashboard management via GrafanaDashboard CRDs
  • Automated datasource configuration
  • Integrated with VictoriaMetrics
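For example, a dashboard can be declared by pointing a GrafanaDashboard at a grafana.com dashboard ID. The instance label `dashboards: grafana` and the resource name and namespace are assumptions; 1860 is the public "Node Exporter Full" dashboard, chosen only for illustration.

```yaml
# Hypothetical example: the instance label, name, and namespace are placeholders.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: node-exporter-full
  namespace: observability
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana   # must match labels on the Grafana resource
  grafanaCom:
    id: 1860                # "Node Exporter Full" on grafana.com
```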

Logs: fluent-bit

fluent-bit collects container logs from all nodes, running as a DaemonSet in the observability-agents namespace.

kube-prometheus-stack

The kube-prometheus-stack provides:

  • ServiceMonitors for Kubernetes components (API server, kubelet, etcd, scheduler, controller-manager)
  • kube-state-metrics for resource metrics
  • Dashboards via Grafana Operator integration

Note: Prometheus and Alertmanager from this stack are disabled in favor of VictoriaMetrics. The stack is primarily used for its comprehensive ServiceMonitor definitions and dashboards.

📁 Repository Structure

├── kubernetes/                  # Kubernetes resources
│   ├── apps/                    # Deployments by namespace
│   │   ├── cert-manager/
│   │   ├── cnpg-system/
│   │   ├── database/            # Databases (postgres, influxdb)
│   │   ├── default/             # Most applications
│   │   ├── external-secrets/
│   │   ├── flux-system/
│   │   ├── gpu-operator/        # NVIDIA GPU operator
│   │   ├── kube-system/         # Core infrastructure (Cilium, CoreDNS, etc.)
│   │   ├── network/             # Networking (Envoy Gateway, external-dns, etc.)
│   │   ├── observability/       # Observability stack
│   │   ├── observability-agents/# Privileged observability agents
│   │   ├── openebs-system/
│   │   ├── rook-ceph/
│   │   ├── storage/             # Samba
│   │   ├── tailscale/
│   │   ├── talos-admin/         # Talos management (backups, tuppr)
│   │   └── volsync-system/
│   ├── components/              # Reusable Kustomize components
│   └── transformers/            # Global Kustomize transformers
├── talos/                       # Talos configuration
└── Taskfile.yaml                # Task runner commands

🚀 Getting Started

Bootstrap

Bootstrap is currently broken and unusable. I love my pets.

Maintenance

Update Talos node configuration:

task talos:gen-mc
task talos:apply-mc

🔒 Security

  • Talos Linux provides an immutable, minimal OS with no SSH access
  • Secure Boot enabled on supported nodes with TPM-backed disk encryption
  • Pod Security Standards enforced via ValidatingAdmissionPolicies
  • Network Policies via Cilium restrict pod-to-pod traffic
  • OIDC authentication for Kubernetes API via Pocket ID

📊 Monitoring

Lots of dashboards are available on the in-cluster Grafana instance. Alerts go out to Discord.

🙏 Acknowledgments

  • This cluster originally started from onedr0p/cluster-template, which is absolutely amazing. It makes running Kubernetes at home easy.
  • The Home Operations community is amazing as well and will help you. Please join us.
  • Sidero Labs for creating an amazing Kubernetes-native system.
  • All the Kubernetes SIGs for maintaining and evolving the world's open, extensible, at-scale resource and workload orchestration system.
