
LGTM PoC: Multi-Tenant Centralized Observability

Transform your multi-cluster Kubernetes observability with a centralized, scalable, and secure LGTM stack.

This Proof of Concept demonstrates how to deploy Grafana's LGTM Stack (Loki + Grafana + Tempo + Mimir) on a central Kubernetes cluster, enabling secure collection of metrics, logs, and traces from multiple remote clusters with proper tenant isolation.

🎯 What You'll Achieve

  • Centralized Observability: Consolidate telemetry data from multiple Kubernetes clusters
  • Multi-Tenant Architecture: Secure data isolation between different teams/environments
  • Zero Trust Security: Encrypted inter-cluster communication via Service Mesh or Cilium ClusterMesh
  • Scalable Design: Battle-tested components that grow with your infrastructure
  • Flexible Deployment: Multiple agent configurations (Prometheus + Vector + Alloy, Alloy-only, or OpenTelemetry Collector)

πŸ—οΈ Key Technologies

Component | Purpose | Why This Choice
--------- | ------- | ---------------
Mimir | Metrics storage | Horizontally scalable Prometheus backend with multi-tenancy
Loki | Log aggregation | Simpler than ELK stack, designed for cloud-native environments
Tempo | Distributed tracing | Cost-effective trace storage with seamless Grafana integration
Grafana | Visualization | Unified dashboards for metrics, logs, and traces

Service Mesh Options: Linkerd, Istio, or Cilium ClusterMesh for secure inter-cluster communication.

Architecture Overview

(Architecture diagram)

Core Design Principles

Our architecture follows these principles:

  • Security First: All inter-cluster communication is encrypted and authenticated
  • Tenant Isolation: Each cluster operates as a separate tenant with data isolation
  • Observability Coverage: Complete telemetry collection (metrics, logs, traces)
  • Operational Simplicity: Minimal configuration required for new cluster onboarding

Deployment Scenarios

We demonstrate three different agent deployment patterns:

Scenario 1: Traditional Stack

  • Prometheus: Kubernetes metrics + ServiceMonitor/PodMonitor CRDs
  • Vector: Log collection and forwarding
  • Grafana Alloy: Trace collection and processing

Scenario 2: Unified Agent (Grafana Alloy)

  • Single Agent: Alloy handles all telemetry types
  • Simplified Operations: Fewer components to manage
  • Prometheus Compatibility: Supports existing ServiceMonitor configurations

Scenario 3: OpenTelemetry Native

  • OTEL Collector: Industry-standard telemetry pipeline
  • OTLP Protocol: Direct application instrumentation support
  • Hybrid Approach: Can coexist with Prometheus for cluster metrics

A central cluster runs Grafana's LGTM stack on Kubernetes, and several remote clusters connect to it over a cluster mesh to send metrics, logs, and traces.

Each remote cluster demonstrates a different way of deploying the collection agents.

Detailed Architecture

This PoC implements a hub-and-spoke model where:

  • Central cluster (lgtm-central): Hosts the complete LGTM stack with Grafana UI
  • Remote clusters: Send telemetry data to the central cluster via secure service mesh connections
  • Tenant isolation: Each cluster operates as a separate tenant in Mimir, Loki, and Tempo

Data Flow Patterns

Scenario 1 (Traditional Stack): Remote clusters use specialized agents for each telemetry type (see the sketch after this list):

  • Prometheus → Remote Write → Central Mimir (metrics)
  • Vector DaemonSet → Central Loki (logs)
  • Grafana Alloy → Central Tempo (traces)
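
As a sketch, the Prometheus side of this flow is a remote-write block pointing at central Mimir with the tenant header set. The mirrored hostname assumes the Linkerd scenario; the port, path, and tenant follow this PoC's conventions rather than being copied from the repo's manifests:

# prometheus.yml (sketch): ship metrics to central Mimir as tenant "remote01"
remote_write:
  - url: http://mimir-distributor-lgtm-central.mimir.svc:8080/api/v1/push
    headers:
      X-Scope-OrgID: remote01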

Scenario 2 (Unified Agent, Grafana Alloy): Hybrid architecture using two Alloy installations:

  • Alloy DaemonSet → Pod logs, kubelet metrics, cAdvisor metrics (node-local)
  • Alloy Deployment → ServiceMonitor/PodMonitor scraping, traces, events (cluster-wide)
  • Native support for Prometheus Operator CRDs (no Prometheus Operator required)
  • Single agent type reduces operational complexity

Scenario 3 (OpenTelemetry Native): Hybrid approach combining cloud-native standards (see the sketch after this list):

  • Prometheus → Central Mimir (cluster metrics)
  • OTEL Collector → Central LGTM Stack (application telemetry via OTLP)
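
A minimal OTEL Collector pipeline for this flow might look like the following sketch; the Tempo hostname and tenant are assumptions based on this PoC's conventions (4318 is the standard OTLP/HTTP port):

# otel-collector config (sketch): receive OTLP and forward traces to central Tempo
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlphttp:
    endpoint: http://tempo-distributor.tempo.svc:4318  # shared-service FQDN assumed
    headers:
      X-Scope-OrgID: remote03
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]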

Demo Clusters

Cluster | Purpose | Demo Application
------- | ------- | ----------------
lgtm-central | LGTM Stack + Grafana UI | Internal monitoring
lgtm-remote | Scenario 1 demonstration | TNS Demo App
lgtm-remote-alloy | Scenario 2 demonstration | TNS Demo App
lgtm-remote-otel | Scenario 3 demonstration | OpenTelemetry Demo

Infrastructure Details

Kubernetes Distribution: Kind for local development

  • Better performance than minikube for multi-node clusters
  • Excellent ARM Mac compatibility
  • Native Docker integration

Container Networking Interface (CNI): Cilium

  • eBPF-based networking for performance
  • Built-in LoadBalancer capabilities (eliminates MetalLB dependency)
  • Optional: Can be disabled in favor of default CNI + MetalLB

Load Balancer IP Segments (see the pool sketch after this list):

  • Central cluster: x.x.x.248/29
  • Remote cluster: x.x.x.240/29
  • Alloy remote cluster: x.x.x.224/29
  • OTEL remote cluster: x.x.x.232/29
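
With Cilium's LB-IPAM, each segment can be expressed as an IP pool. A sketch for the central cluster, assuming Kind's Docker network is 172.19.0.0/16 as in the outputs below (the field is named cidrs on older Cilium releases):

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lgtm-central-pool     # hypothetical name
spec:
  blocks:                     # "cidrs" on older Cilium versions
    - cidr: 172.19.255.248/29 # the central cluster's x.x.x.248/29 segment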

Security: Zero Trust communication via Service Mesh

  • Automatic encryption for inter-cluster communication
  • Mutual TLS (mTLS) without manual certificate management
  • Service discovery across cluster boundaries

Service Mesh Options

Why Service Mesh? Traditional Kubernetes networking lacks:

  • Automatic encryption between clusters
  • Identity-based access control
  • Advanced traffic management

Cilium ClusterMesh vs Traditional Service Mesh:

  • Cilium: eBPF kernel-level performance, WireGuard encryption between nodes
  • Linkerd/Istio: Full mTLS encryption including same-node pod communication

Prerequisites

System Requirements

  • CPU: 8 cores minimum (tested on Intel i3-8350K @ 4.00GHz, Intel i9 @ 2.4GHz, and Apple M1 Pro)
  • RAM: 32GB recommended (16GB minimum for central + one remote cluster on Intel; 32GB minimum required for Apple Silicon)
  • OS: macOS or Linux (tested on Intel-based MBP with OrbStack, Apple Silicon M1 Pro with Docker Desktop, and Rocky Linux 9/10)

💡 Performance Tip: OrbStack significantly outperforms Docker Desktop on macOS and provides native IP access to containers.

Required Tools

Tool | Purpose | Installation
---- | ------- | ------------
Docker | Container runtime | Download
Kind | Local Kubernetes clusters | brew install kind or releases
kubectl | Kubernetes CLI | Installation guide
Helm | Package manager for Kubernetes | Installation guide
Step CLI | Certificate generation | Installation guide
jq | JSON processing | brew install jq or download

Service Mesh Tools (Choose One)

Service Mesh | CLI Tool | When to Use
------------ | -------- | -----------
Linkerd | Linkerd CLI | Simplicity, automatic mTLS, low resource overhead
Istio | Istio CLI | Advanced traffic management, enterprise features
Cilium | Cilium CLI | eBPF-based networking, kernel-level performance

⚠️ Important for Linkerd Users: Always use the latest edge release to avoid multicluster regressions:

curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
export PATH=$HOME/.linkerd2/bin:$PATH

🚀 Quick Start

Choose Your Service Mesh

Default (Linkerd) - Best for getting started:

# No additional setup required - Linkerd is the default

Istio with Proxy Mode - For advanced traffic management:

export CILIUM_CLUSTER_MESH_ENABLED=no
export ISTIO_ENABLED=yes

Istio with Ambient Mode - For sidecar-less mesh:

export CILIUM_CLUSTER_MESH_ENABLED=no
export ISTIO_ENABLED=yes
export ISTIO_PROFILE=ambient

Cilium ClusterMesh - For eBPF-based networking:

export CILIUM_CLUSTER_MESH_ENABLED=yes

Disable Cilium - Use Kind's default CNI + MetalLB:

export CILIUM_ENABLED=no

💡 Note: All scripts automatically handle these configurations. The above commands disable conflicting service mesh options as needed.

Deploy the Stack

  1. Generate certificates:

    ./deploy-certs.sh
  2. Deploy central cluster (LGTM Stack):

    ./deploy-central.sh
  3. Deploy remote cluster (TNS Demo App with Traditional Stack):

    ./deploy-remote.sh
  4. Optional: Deploy unified Alloy cluster (TNS Demo App with Alloy):

    ./deploy-remote-alloy.sh
  5. Optional: Deploy OTEL demo cluster:

    ./deploy-remote-otel.sh
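
For a minimal end-to-end run (required steps only; append the optional scripts as needed):

./deploy-certs.sh && ./deploy-central.sh && ./deploy-remote.sh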

Access Grafana

Get the ingress gateway IP:

kubectl get service --context kind-lgtm-central \
  -n observability cilium-gateway-lgtm-external-gateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Add to /etc/hosts:

# Add to /etc/hosts (replace with actual IP)
192.168.x.x grafana.example.com
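
Alternatively, a small shell sketch that combines both steps:

GRAFANA_IP=$(kubectl get service --context kind-lgtm-central \
  -n observability cilium-gateway-lgtm-external-gateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "${GRAFANA_IP} grafana.example.com" | sudo tee -a /etc/hosts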

Visit: https://grafana.example.com (accept the self-signed certificate warning)

🐳 Docker Desktop Users: Run ./deploy-proxy.sh and use 127.0.0.1 grafana.example.com instead.

🎯 Success Criteria

After completing the deployment, you should be able to:

✅ Access Grafana Dashboard

  • Navigate to https://grafana.example.com
  • Login with admin / Adm1nAdm1n
  • See healthy data sources in Configuration > Data Sources

✅ Query Telemetry Data

Verify data collection using Grafana's Explore tab (an API-level check follows the list):

  • Metrics (PromQL): up{cluster="lgtm-central"} should show healthy targets from the central cluster
  • Logs (LogQL): {cluster="lgtm-remote"} | json should display structured logs from the remote cluster
  • Traces (TraceQL): {service.name="tns-app"} should show distributed traces from the TNS application
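
Beyond the Grafana UI, you can query Loki's HTTP API directly to confirm tenant-scoped data. A sketch, assuming the Loki Helm chart's usual loki-gateway service in the loki namespace (adjust to whatever the repo actually deploys):

# Port-forward central Loki, then run a tenant-scoped metric query
kubectl --context kind-lgtm-central -n loki port-forward svc/loki-gateway 3100:80 &
sleep 2  # give the port-forward a moment to establish
curl -G -s -H "X-Scope-OrgID: remote01" \
  "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query=count_over_time({cluster="lgtm-remote"}[5m])'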

✅ Multi-Cluster Visibility

Confirm tenant isolation by switching between data sources:

  • Local data sources: Mimir Local, Loki Local, Tempo Local
  • Remote data sources: Mimir Remote TNS, Loki Remote TNS, Tempo Remote TNS

✅ Service Mesh Connectivity

Use the validation commands in the respective service mesh sections to verify secure inter-cluster communication.

📚 Learn More: New to observability query languages? Check out PromQL tutorial, LogQL guide, and TraceQL documentation.

🔗 Understanding Service Mesh Integration

Each service mesh approach provides different trade-offs:

Linkerd Multi-Cluster

  • Automatic mTLS: Zero-configuration mutual TLS between clusters
  • Service Mirroring: Creates servicename-clustername mirrors automatically
  • Low Overhead: Minimal resource consumption with Rust-based proxy
  • Example: Access central Mimir from remote: mimir-distributor-lgtm-central.mimir.svc

Istio Multi-Cluster

  • Cross-Network Support: Designed for clusters across different networks
  • Gateway-Based: Uses Istio Gateway for secure inter-cluster communication
  • Transparent Routing: Services accessible via original FQDN across clusters
  • Protocol Intelligence: Automatic protocol detection with appProtocol hints

Cilium ClusterMesh

  • eBPF Foundation: Kernel-level networking with superior performance
  • Shared Services: Manual service replication with service.cilium.io/shared=false
  • WireGuard Encryption: Secure node-to-node communication
  • Limitation: No pod-to-pod encryption within the same node

🎨 Deployment Scenario Details

Scenario 1: Traditional Multi-Agent Stack

Architecture: Specialized agents for each telemetry type

Components:

  • Prometheus Operator + Prometheus: Metrics collection with ServiceMonitor/PodMonitor CRDs
  • Vector DaemonSet: Log collection from /var/log and container logs
  • Grafana Alloy Deployment: Trace collection (OTLP, Jaeger, OpenCensus)

Pros:

  • Mature, battle-tested components
  • Rich ecosystem of ServiceMonitor configurations
  • Separate resource allocation per telemetry type

Cons:

  • Multiple components to manage and upgrade
  • Higher resource overhead (3 different agents)
  • Complex troubleshooting across multiple systems

Deployment: ./deploy-remote.sh

Tenant ID: remote01


Scenario 2: Unified Grafana Alloy

Architecture: Hybrid DaemonSet + Deployment with single agent type

(Unified Alloy architecture diagram)

Why Hybrid? The Grafana Alloy Helm chart supports only one controller.type per installation. To achieve complete observability coverage, we deploy two separate Helm releases:

Alloy DaemonSet (alloy-daemonset)

Purpose: Node-local data collection requiring host path access

Responsibilities:

  • Pod Logs: Collects logs from all pods via loki.source.kubernetes
    • Mounts /var/log and /var/lib/docker/containers from host
    • Filters out service mesh proxy logs (linkerd-proxy, istio-proxy)
    • Adds cluster labels for multi-tenant routing
  • Kubelet Metrics: Scrapes node-level metrics from kubelet API
    • CPU, memory, disk usage per node
    • Requires service account with node proxy access
  • cAdvisor Metrics: Container runtime metrics
    • Per-container resource usage
    • Network and filesystem statistics

Key Configuration:

controller:
  type: daemonset
alloy:
  mounts:
    varlog: true
    dockercontainers: true

Resource Profile: 100m CPU / 128Mi memory per node
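
In Alloy's configuration language, the log-collection half of this DaemonSet boils down to a pipeline like the following sketch; the central Loki URL and tenant are assumptions based on this PoC's conventions, not copied from the repo:

// Sketch: tail pod logs and push them to central Loki as tenant "remote02"
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.central.receiver]
}

loki.write "central" {
  endpoint {
    url       = "http://loki-gateway-lgtm-central.loki.svc/loki/api/v1/push" // hostname assumed
    tenant_id = "remote02"
  }
}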

Alloy Deployment (alloy-deployment)

Purpose: Cluster-wide discovery and trace collection

Responsibilities:

  • ServiceMonitor Discovery: Native support via prometheus.operator.servicemonitors
    • No Prometheus Operator installation required
    • Automatic target discovery from CRDs
    • Clustering enabled for distributed scrape load
  • PodMonitor Discovery: Support via prometheus.operator.podmonitors
    • Direct pod-level metric collection
    • Label-based pod selection
  • Kubernetes Events: Captures cluster events via loki.source.kubernetes_events
  • Distributed Tracing: Multi-protocol trace receivers
    • OTLP (gRPC/HTTP): Modern instrumentation
    • Jaeger (Thrift/gRPC): Legacy compatibility
    • OpenCensus: Service mesh telemetry (Linkerd)

Key Configuration:

controller:
  type: deployment
  replicas: 2
alloy:
  clustering:
    enabled: true
  extraPorts:
    - name: otlp-grpc
      port: 4317
    - name: jaeger-thrift-compact
      port: 6831

Resource Profile: 200m CPU / 256Mi memory per replica
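
The metrics half of the Deployment is, in sketch form, a ServiceMonitor-discovery component feeding a remote-write component; the Mimir URL and tenant are again assumptions:

// Sketch: scrape ServiceMonitor targets and remote-write them to central Mimir
prometheus.operator.servicemonitors "all" {
  forward_to = [prometheus.remote_write.central.receiver]
}

prometheus.remote_write "central" {
  endpoint {
    url = "http://mimir-distributor-lgtm-central.mimir.svc:8080/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "remote02"
    }
  }
}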

Benefits Over Traditional Stack:

  • ✅ Single Agent Type: One component to learn, upgrade, and monitor
  • ✅ Native CRD Support: Use existing ServiceMonitor/PodMonitor without Prometheus Operator
  • ✅ Reduced Resource Usage: ~40% less memory than Prometheus + Vector + Alloy combined
  • ✅ Simplified Configuration: Unified Alloy configuration language for all telemetry
  • ✅ Built-in Clustering: HA support with automatic scrape target distribution

Migration Path: Existing ServiceMonitor/PodMonitor resources work without modification

Deployment: ./deploy-remote-alloy.sh

Tenant ID: remote02

Verification:

# Check DaemonSet (should have one pod per node)
kubectl --context kind-lgtm-remote-alloy -n observability get ds grafana-alloy-daemonset

# Check Deployment (should have 2 replicas)
kubectl --context kind-lgtm-remote-alloy -n observability get deployment grafana-alloy-deployment

# View DaemonSet logs (log collection)
kubectl --context kind-lgtm-remote-alloy -n observability logs ds/grafana-alloy-daemonset

# View Deployment logs (metrics and traces)
kubectl --context kind-lgtm-remote-alloy -n observability logs deployment/grafana-alloy-deployment

Scenario 3: OpenTelemetry Native

Architecture: Hybrid Prometheus + OTEL Collector

Components:

  • Prometheus: Cluster-level metrics (kubelet, cAdvisor, node-exporter)
  • OpenTelemetry Collector: Application telemetry via OTLP protocol
  • OTEL Demo App: Pre-instrumented microservices showing OTLP in action

Pros:

  • Industry-standard OTLP protocol
  • Rich application instrumentation libraries
  • Vendor-neutral approach

Cons:

  • Dual collection stack (Prometheus + OTEL)
  • ServiceMonitor support requires OTEL Collector configuration
  • Learning curve for OTLP instrumentation

Deployment: ./deploy-remote-otel.sh

Tenant ID: remote03


Linkerd Multi-Cluster

(Linkerd multi-cluster architecture diagram)

Linkerd creates a mirrored service automatically when linking clusters, appending the cluster name as a suffix to the service name. For instance, in lgtm-central, accessing Mimir locally would be mimir-distributor.mimir.svc, whereas accessing it from the lgtm-remote cluster would be mimir-distributor-lgtm-central.mimir.svc.

Service Naming Comparison:

Service Mesh | Service Discovery Pattern | Example
------------ | ------------------------- | -------
Linkerd | Mirrored service with cluster suffix | mimir-distributor-lgtm-central.mimir.svc
Istio | Original service name, cross-cluster DNS | mimir-distributor.mimir.svc
Cilium ClusterMesh | Original service name, shared services | mimir-distributor.mimir.svc

💡 Note: The deployment scripts automatically patch configuration files when using Istio or Cilium ClusterMesh, removing the -lgtm-central suffix from service URLs to match their respective service discovery patterns.

Due to a change Buoyant made to how Linkerd artifacts are published, the latest stable version available via Helm charts is 2.14 (even though the actual latest version is newer). Because of that, we use the edge release by default.

⚠️ Version Requirement: This PoC requires Linkerd edge-25.12.x or later (stable 2.18.x+) for multicluster functionality. Earlier versions used a deprecated linking approach that no longer works correctly.
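
A quick way to confirm which release you have installed (the --short flag prints only the version string):

linkerd version --client --short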

Istio Multi-Cluster

Setting appProtocol: tcp on all gRPC services (especially memberlist) helps Istio with protocol selection. Ensuring that headless services exist (i.e., clusterIP: None) improves traffic routing by guaranteeing that the proxy has endpoints per Pod IP, which lets all the Grafana applications work correctly (some microservices require direct pod-to-pod communication by Pod IP). Modern Helm charts for Loki, Tempo, and Mimir allow configuring appProtocol, and headless services already exist for all the microservices. The configuration flexibility varies, but everything appears to work.

The PoC assumes Istio multi-cluster using multi-network, which requires an Istio Gateway. In other words, the environment assumes we're interconnecting two clusters from different networks using Istio.

Unlike Linkerd, the services declared on the central cluster are reachable using the same FQDN as in the local cluster. The Istio proxies are configured so that DNS resolution and routing work as intended.

Cilium ClusterMesh

When using Cilium ClusterMesh, the user is responsible for creating the service with the same configuration on each cluster (although annotated with service.cilium.io/shared=false). That means reaching Mimir from lgtm-remote would be exactly like accessing it from lgtm-central (similar to Istio).
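
In sketch form, the replicated Service on the remote cluster is a regular Service without a selector (so its endpoint set stays empty), carrying the ClusterMesh annotations; ports follow mimir-distributor as shown in the Istio section:

apiVersion: v1
kind: Service
metadata:
  name: mimir-distributor
  namespace: mimir
  annotations:
    service.cilium.io/global: "true"   # join same-name services across clusters
    service.cilium.io/shared: "false"  # do not export this cluster's (empty) endpoints
spec:
  ports:
    - name: http
      port: 8080
    - name: grpc
      port: 9095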

Validation

Linkerd Multi-Cluster

The linkerd CLI can help verify that inter-cluster communication is working. From the lgtm-remote cluster, you can run:

Check multicluster status:

linkerd mc check --context kind-lgtm-remote

Expected output:

linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
  * lgtm-central
√ remote cluster access credentials are valid
  * lgtm-central
√ clusters share trust anchors
  * lgtm-central
√ service mirror controller has required permissions
  * lgtm-central
√ service mirror controllers are running
  * lgtm-central
√ all gateway mirrors are healthy
  * lgtm-central
√ all mirror services have endpoints
√ all mirror services are part of a Link
√ multicluster extension proxies are healthy
√ multicluster extension proxies are up-to-date
√ multicluster extension proxies and cli versions match

Status check results are √

Check gateway connectivity:

linkerd mc gateways --context kind-lgtm-remote

Expected output:

CLUSTER       ALIVE    NUM_SVC      LATENCY
lgtm-central  True           4          2ms

Verify mirrored services:

# List mirrored services from the central cluster
kubectl get svc --context kind-lgtm-remote -A | grep lgtm-central

You should see services like mimir-distributor-lgtm-central, tempo-distributor-lgtm-central, etc.

💡 Note: If you're using the OpenTelemetry Demo cluster, replace lgtm-remote with lgtm-remote-otel.

Istio Multi-Cluster

Here is a sequence of commands demonstrating that multi-cluster works, assuming you deployed the TNS remote cluster:

❯ istioctl remote-clusters --context kind-lgtm-remote
NAME             SECRET                                            STATUS     ISTIOD
lgtm-remote                                                        synced     istiod-64f7d85469-ljhhm
lgtm-central     istio-system/istio-remote-secret-lgtm-central     synced     istiod-64f7d85469-ljhhm

If you're running in proxy-mode (using mimir-distributor as reference):

❯ istioctl --context kind-lgtm-remote proxy-config endpoint $(kubectl --context kind-lgtm-remote get pod -l name=app -n tns -o name | sed 's|.*/||').tns | grep mimir-distributor
192.168.97.249:15443                                    HEALTHY     OK                outbound|8080||mimir-distributor.mimir.svc.cluster.local
192.168.97.249:15443                                    HEALTHY     OK                outbound|9095||mimir-distributor.mimir.svc.cluster.local

❯ kubectl get svc -n istio-system lgtm-gateway --context kind-lgtm-central
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                                                           AGE
lgtm-gateway   LoadBalancer   10.12.201.116   192.168.97.249   15021:31614/TCP,15443:32226/TCP,15012:32733/TCP,15017:30681/TCP   21m

❯ kubectl --context kind-lgtm-remote exec -it -n tns $(kubectl --context kind-lgtm-remote get pod -n tns -l name=app -o name) -- nslookup mimir-distributor.mimir.svc.cluster.local
Name:      mimir-distributor.mimir.svc.cluster.local
Address 1: 10.12.92.57

❯ kubectl --context kind-lgtm-central get svc -n mimir mimir-distributor
NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
mimir-distributor   ClusterIP   10.12.92.57   <none>        8080/TCP,9095/TCP   17m

❯ kubectl --context kind-lgtm-central get pod -n mimir -l app.kubernetes.io/component=distributor -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
mimir-distributor-78b6d8b96b-72cmn   2/2     Running   0          15m   10.11.3.14   lgtm-central-worker2   <none>           <none>
mimir-distributor-78b6d8b96b-k8w6g   2/2     Running   0          15m   10.11.2.59   lgtm-central-worker    <none>           <none>

If you're running in ambient-mode:

❯ kubectl get gatewayclass
NAME              CONTROLLER                     ACCEPTED   AGE
istio             istio.io/gateway-controller    True       4m30s
istio-east-west   istio.io/eastwest-controller   True       4m30s
istio-remote      istio.io/unmanaged-gateway     True       4m30s
istio-waypoint    istio.io/mesh-controller       True       4m30s

❯ kubectl get gateway -A
NAMESPACE      NAME                    CLASS             ADDRESS          PROGRAMMED   AGE
istio-system   istio-eastwestgateway   istio-east-west   192.168.97.248   True         4m13s

The following uses the mimir-distributor as reference:

❯ istioctl zc service --service-namespace mimir --context kind-lgtm-remote
NAMESPACE SERVICE NAME      SERVICE VIP                 WAYPOINT ENDPOINTS
mimir     mimir-distributor 10.12.168.116,10.22.130.140 None     1/1

❯ istioctl zc workload --workload-namespace mimir -o json --context kind-lgtm-remote
[
    {
        "uid": "lgtm-central/SplitHorizonWorkload/istio-system/istio-eastwestgateway/192.168.97.248/mimir/mimir-distributor.mimir.svc.cluster.local",
        "workloadIps": [],
        "networkGateway": {
            "destination": "lgtm-central/192.168.97.248"
        },
        "protocol": "HBONE",
        "name": "lgtm-central/SplitHorizonWorkload/istio-system/istio-eastwestgateway/192.168.97.248/mimir/mimir-distributor.mimir.svc.cluster.local",
        "namespace": "mimir",
        "serviceAccount": "default",
        "workloadName": "",
        "workloadType": "pod",
        "canonicalName": "",
        "canonicalRevision": "",
        "clusterId": "lgtm-central",
        "trustDomain": "cluster.local",
        "locality": {},
        "node": "",
        "network": "lgtm-central",
        "status": "Healthy",
        "hostname": "",
        "capacity": 2,
        "applicationTunnel": {
            "protocol": ""
        }
    }
]

❯ istioctl zc services --service-namespace mimir -o json --context kind-lgtm-remote
[
    {
        "name": "mimir-distributor",
        "namespace": "mimir",
        "hostname": "mimir-distributor.mimir.svc.cluster.local",
        "vips": [
            "lgtm-central/10.12.168.116",
            "lgtm-remote/10.22.130.140"
        ],
        "ports": {
            "8080": 0,
            "9095": 0
        },
        "endpoints": {
            "lgtm-central/SplitHorizonWorkload/istio-system/istio-eastwestgateway/192.168.97.248/mimir/mimir-distributor.mimir.svc.cluster.local": {
                "workloadUid": "lgtm-central/SplitHorizonWorkload/istio-system/istio-eastwestgateway/192.168.97.248/mimir/mimir-distributor.mimir.svc.cluster.local",
                "service": "",
                "port": {
                    "8080": 0,
                    "9095": 0
                }
            }
        },
        "subjectAltNames": [
            "spiffe://cluster.local/ns/mimir/sa/mimir-sa"
        ],
        "ipFamilies": "IPv4"
    }
]

From a DNS resolution perspective:

❯ kubectl --context kind-lgtm-remote exec -it -n tns $(kubectl --context kind-lgtm-remote get pod -n tns -l name=app -o name) -- nslookup mimir-distributor.mimir.svc.cluster.local
nslookup: can't resolve '(null)': Name does not resolve

Name:      mimir-distributor.mimir.svc.cluster.local
Address 1: 10.22.130.140 mimir-distributor.mimir.svc.cluster.local

Cilium ClusterMesh

The cilium CLI can help verify that inter-cluster communication is working. From each context, you can run:

cilium clustermesh status --context ${ctx}

The following shows what it looks like with both remote clusters deployed:

for ctx in central remote remote-otel; do
  echo "Checking cluster ${ctx}"
  cilium clustermesh status --context kind-lgtm-${ctx}
  echo
done

The result is:

Checking cluster central
✅ Service "clustermesh-apiserver" of type "LoadBalancer" found
✅ Cluster access information is available:
  - 172.19.255.249:2379
✅ Deployment clustermesh-apiserver is ready
✅ All 4 nodes are connected to all clusters [min:2 / avg:2.0 / max:2]
🔌 Cluster Connections:
  - lgtm-remote: 4/4 configured, 4/4 connected
  - lgtm-remote-otel: 4/4 configured, 4/4 connected
🔀 Global services: [ min:0 / avg:0.0 / max:0 ]

Checking cluster remote
✅ Service "clustermesh-apiserver" of type "LoadBalancer" found
✅ Cluster access information is available:
  - 172.19.255.241:2379
✅ Deployment clustermesh-apiserver is ready
✅ All 2 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
  - lgtm-central: 2/2 configured, 2/2 connected
🔀 Global services: [ min:4 / avg:4.0 / max:4 ]

Checking cluster remote-otel
✅ Service "clustermesh-apiserver" of type "LoadBalancer" found
✅ Cluster access information is available:
  - 172.19.255.233:2379
✅ Deployment clustermesh-apiserver is ready
✅ All 2 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
🔌 Cluster Connections:
  - lgtm-central: 2/2 configured, 2/2 connected
🔀 Global services: [ min:4 / avg:4.0 / max:4 ]

🔧 Troubleshooting

Common Issues & Solutions

Problem | Solution
------- | --------
"too many open files" on Linux | sudo sysctl fs.inotify.max_user_watches=524288 fs.inotify.max_user_instances=512
High resource usage | Deploy only central + one remote cluster, or increase system resources
Certificate errors | Regenerate with ./deploy-certs.sh and redeploy affected clusters
Service mesh connectivity issues | Check validation commands in the respective service mesh sections
Istio Ambient: metrics not flowing | Restart ztunnel: kubectl rollout restart daemonset/ztunnel -n istio-system
Kind cluster creation fails | Ensure Docker has sufficient resources allocated (8GB+ recommended)
Pods stuck in Pending state | Check node resources with kubectl top nodes --context <cluster-context>
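
To make the inotify fix from the first row survive reboots on Linux, persist it under /etc/sysctl.d:

echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-kind.conf
echo 'fs.inotify.max_user_instances=512' | sudo tee -a /etc/sysctl.d/99-kind.conf
sudo sysctl --system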

Validation Commands

Check cluster connectivity:

# Linkerd
linkerd --context kind-lgtm-central multicluster gateways
linkerd --context kind-lgtm-remote multicluster gateways

# Istio
istioctl --context kind-lgtm-central proxy-status
istioctl --context kind-lgtm-remote proxy-status

# Cilium
cilium --context kind-lgtm-central status
cilium --context kind-lgtm-remote status

Verify data sources in Grafana:

  1. Navigate to Configuration > Data Sources
  2. Test each data source connection
  3. Look for green "Data source is working" messages

Shutdown

kind delete cluster --name lgtm-central
kind delete cluster --name lgtm-remote
kind delete cluster --name lgtm-remote-alloy
kind delete cluster --name lgtm-remote-otel

Or, to delete everything at once:

kind delete clusters --all

Warning: Be careful with the above command if you have clusters you don't want to remove.

If you started the HAProxy:

docker stop haproxy
docker rm haproxy
