diff --git a/INSTALL_RHEL.md b/INSTALL_RHEL.md
@@ -0,0 +1,392 @@
+# Installing NVIDIA GPU Operator on RHEL
+
+This guide explains how to install the NVIDIA GPU Operator on Red Hat Enterprise Linux (RHEL) systems, either from the Helm chart in the cloud-native-stack repository or from the NVIDIA Helm repository.
+
+## Prerequisites
+
+### System Requirements
+
+1. **RHEL Version**: RHEL 8.x or 9.x
+2. **Kubernetes Cluster**: Running Kubernetes cluster (version >= 1.16.0)
+3. **GPU Hardware**: NVIDIA GPU(s) installed on worker nodes
+4. **Helm**: Version 3.x installed
+5. **kubectl**: Configured to access your cluster
+
+### Pre-installation Checks
+
+Before installing, ensure:
+
+- Your RHEL system is registered and has access to required repositories
+- The Kubernetes cluster is running and accessible
+- You have cluster-admin privileges
+- GPU nodes are labeled (optional; the GPU Operator detects GPU nodes automatically)
+
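+The checks above can be scripted. The following is a minimal sketch, not part of the GPU Operator tooling: the function names `version_ge` and `preflight` are hypothetical, it assumes GNU `sort -V` (available on RHEL), and it uses the kubelet version of the first node as a stand-in for the cluster version.
+
+```bash
+# version_ge A B: succeed when version A is greater than or equal to version B.
+version_ge() {
+  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
+}
+
+# preflight: check the CLI tools, Helm 3.x, Kubernetes >= 1.16.0, and cluster-admin rights.
+preflight() {
+  local tool helm_ver k8s_ver
+  for tool in kubectl helm; do
+    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
+  done
+
+  # `helm version --short` prints something like v3.14.0+g3fc9f4b
+  helm_ver="$(helm version --short | sed -E 's/^v([0-9]+\.[0-9]+\.[0-9]+).*/\1/')"
+  [ "${helm_ver%%.*}" = "3" ] || { echo "Helm 3.x required, found $helm_ver" >&2; return 1; }
+
+  # Kubelet version of the first node, for example v1.28.4
+  k8s_ver="$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \
+    | sed -E 's/^v([0-9]+\.[0-9]+\.[0-9]+).*/\1/')"
+  version_ge "$k8s_ver" 1.16.0 || { echo "Kubernetes >= 1.16.0 required, found $k8s_ver" >&2; return 1; }
+
+  # Succeeds only when the current user may perform any action on any resource
+  kubectl auth can-i '*' '*' --all-namespaces >/dev/null \
+    || { echo "cluster-admin privileges required" >&2; return 1; }
+  echo "preflight OK"
+}
+```
+
+Run `preflight` from a shell where `kubectl` already points at the target cluster. It returns a non-zero status at the first failed check.
+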
+## Installation Methods
+
+You can install the GPU Operator using either:
+1. **Local Helm Chart** (from this repository)
+2. **NVIDIA Helm Repository** (recommended for production)
+
+---
+
+## Method 1: Install from Local Helm Chart (Development/Testing)
+
+This method uses the Helm chart directly from this cloud-native-stack repository.
+
+### Step 1: Navigate to the Repository
+
+```bash
+# Replace /path/to with the location of your local clone
+cd /path/to/cloud-native-stack/test/gpu-operator
+```
+
+### Step 2: Install Helm Dependencies
+
+```bash
+# Update Helm dependencies for the chart
+helm dependency update deployments/gpu-operator
+```
+
+### Step 3: Create Namespace
+
+```bash
+kubectl create namespace gpu-operator
+```
+
+### Step 4: Install GPU Operator
+
+**Basic Installation:**
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --wait
+```
+
+**With Custom Values (RHEL-specific):**
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --set operator.defaultRuntime=containerd \
+  --set driver.enabled=true \
+  --wait
+```
+
+**For RHEL with Docker Runtime:**
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --set operator.defaultRuntime=docker \
+  --wait
+```
+
+### Step 5: Verify Installation
+
+```bash
+# Check operator pod
+kubectl get pods -n gpu-operator
+
+# Check all GPU operator components
+kubectl get pods -n gpu-operator --show-labels
+
+# Verify GPU nodes are recognized
+kubectl get nodes -l nvidia.com/gpu.present=true
+```
+
+---
+
+## Method 2: Install from NVIDIA Helm Repository (Recommended)
+
+### Step 1: Add NVIDIA Helm Repository
+
+```bash
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
+helm repo update
+```
+
+### Step 2: Install GPU Operator
+
+```bash
+helm install gpu-operator nvidia/gpu-operator \
+  -n gpu-operator \
+  --create-namespace \
+  --wait
+```
+
+---
+
+## RHEL-Specific Configuration Options
+
+### For RHEL with SELinux Enabled
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --set operator.defaultRuntime=containerd \
+  --set driver.enabled=true \
+  --set toolkit.enabled=true \
+  --wait
+```
+
+### For RHEL with Specific Kernel Version
+
+If you need to specify a driver version compatible with your RHEL kernel:
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --set driver.version="535.129.03" \
+  --wait
+```
+
+### For RHEL with Custom Driver Installation Path
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --set hostPaths.driverInstallDir="/opt/nvidia/driver" \
+  --wait
+```
+
+---
+
+## Advanced Configuration
+
+### Custom Values File
+
+Create a custom values file for RHEL:
+
+```bash
+cat > rhel-values.yaml <<EOF
+platform:
+  openshift: false
+
+operator:
+  defaultRuntime: containerd
+  upgradeCRD: true
+
+driver:
+  enabled: true
+  repository: nvcr.io/nvidia
+  image: driver
+  version: "535.129.03"
+
+toolkit:
+  enabled: true
+
+devicePlugin:
+  enabled: true
+
+dcgm:
+  enabled: true
+
+dcgmExporter:
+  enabled: true
+
+nodeStatusExporter:
+  enabled: true
+
+gfd:
+  enabled: true
+
+daemonsets:
+  tolerations:
+  - key: nvidia.com/gpu
+    operator: Exists
+    effect: NoSchedule
+  priorityClassName: system-node-critical
+EOF
+```
+
+Install with custom values:
+
+```bash
+helm install gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  -f rhel-values.yaml \
+  --wait
+```
+
+---
+
+## Verification Steps
+
+### 1. Check Operator Status
+
+```bash
+kubectl get pods -n gpu-operator
+```
+
+All of the following pods should be in the `Running` state:
+- gpu-operator-*
+- gpu-feature-discovery-*
+- nvidia-container-toolkit-daemonset-*
+- nvidia-dcgm-exporter-*
+- nvidia-device-plugin-daemonset-*
+- nvidia-driver-daemonset-* (on GPU nodes)
+
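+On a busy cluster it can be easier to list only the pods that are not healthy yet. The helper below is a sketch under stated assumptions: `unhealthy_pods` is a hypothetical name, and it relies on the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS).
+
+```bash
+# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
+# print every pod that is neither fully ready nor finished.
+unhealthy_pods() {
+  awk '
+    $3 == "Completed" { next }                 # one-shot pods that finished successfully
+    $3 != "Running"   { print $1, $3; next }   # Pending, Init:0/1, CrashLoopBackOff, ...
+    { split($2, r, "/"); if (r[1] != r[2]) print $1, "NotReady:" $2 }
+  '
+}
+```
+
+Usage: `kubectl get pods -n gpu-operator --no-headers | unhealthy_pods`. Empty output means every pod is either fully ready or completed.
+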
+### 2. Verify GPU Resources
+
+```bash
+kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
+```
+
+### 3. Test GPU Workload
+
+Create a test pod:
+
+```bash
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test
+spec:
+  restartPolicy: OnFailure
+  containers:
+  - name: cuda-container
+    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+EOF
+```
+
+Check the logs:
+
+```bash
+kubectl logs gpu-test
+```
+
+The log should contain `Test PASSED`.
+
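+The wait-and-check step can also be automated. This is a minimal sketch: `gpu_test_result` is a hypothetical helper name, and the `--for=jsonpath` form of `kubectl wait` assumes a kubectl client of version 1.23 or later.
+
+```bash
+# gpu_test_result: wait for the gpu-test pod to finish, then inspect its log.
+gpu_test_result() {
+  # Block until the pod phase is Succeeded, or give up after five minutes
+  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=300s || return 1
+
+  if kubectl logs gpu-test | grep -q 'Test PASSED'; then
+    echo "GPU test passed"
+  else
+    echo "GPU test did not report 'Test PASSED'" >&2
+    return 1
+  fi
+}
+```
+
+Call `gpu_test_result` after applying the pod manifest, then remove the pod with `kubectl delete pod gpu-test`.
+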
+### 4. Check DCGM Metrics (if enabled)
+
+```bash
+kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
+```
+
+---
+
+## Troubleshooting
+
+### Common Issues on RHEL
+
+#### 1. Driver Installation Fails
+
+**Check driver logs:**
+```bash
+kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100
+```
+
+**Solution:** On each GPU node, ensure that the `kernel-devel` and `kernel-headers` packages matching the running kernel are installed:
+```bash
+sudo dnf install -y "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
+```
+
+#### 2. SELinux Denials
+
+**Check SELinux status:**
+```bash
+sudo getenforce
+sudo ausearch -m avc -ts recent
+```
+
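+**Solution:** To confirm that SELinux is the cause, switch to permissive mode temporarily:
+```bash
+sudo setenforce 0
+```
+
+`setenforce` does not persist across reboots. Once the cause is confirmed, re-enable enforcing mode with `sudo setenforce 1` and create a custom SELinux policy for the GPU Operator rather than leaving SELinux permissive.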
+
+#### 3. Container Runtime Issues
+
+**Verify container runtime:**
+```bash
+kubectl get nodes -o wide
+```
+
+**Check containerd/docker configuration:**
+```bash
+# For containerd
+sudo grep nvidia /etc/containerd/config.toml
+
+# For docker
+sudo cat /etc/docker/daemon.json
+```
+
+#### 4. Node Not Recognized as GPU Node
+
+**Manually label the node** with the same label that Step 5 queries:
+```bash
+kubectl label nodes <node-name> nvidia.com/gpu.present=true
+```
+
+---
+
+## Uninstallation
+
+### Remove GPU Operator
+
+```bash
+helm uninstall gpu-operator -n gpu-operator
+```
+
+### Clean up namespace and CRDs (optional)
+
+```bash
+kubectl delete namespace gpu-operator
+
+# If CRDs need to be removed
+kubectl delete crd clusterpolicies.nvidia.com
+kubectl delete crd nvidiadrivers.nvidia.com
+```
+
+---
+
+## Upgrade
+
+### Upgrade from Local Chart
+
+```bash
+# Update the repository (replace /path/to with the location of your local clone)
+cd /path/to/cloud-native-stack/test/gpu-operator
+git pull
+
+# Update dependencies
+helm dependency update deployments/gpu-operator
+
+# Upgrade
+helm upgrade gpu-operator deployments/gpu-operator \
+  -n gpu-operator \
+  --wait
+```
+
+### Upgrade from NVIDIA Repository
+
+```bash
+helm repo update
+helm upgrade gpu-operator nvidia/gpu-operator \
+  -n gpu-operator \
+  --wait
+```
+
+---
+
+## Additional Resources
+
+- **Official Documentation**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html
+- **Platform Support**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html
+- **Troubleshooting Guide**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
+- **GitHub Repository**: https://github.com/NVIDIA/gpu-operator
+
+---
+
+## Notes
+
+- The GPU Operator will automatically install NVIDIA drivers on GPU nodes if `driver.enabled=true`
+- On RHEL systems, ensure your subscription is active and repos are configured
+- For air-gapped environments, you'll need to mirror the required container images
+- The operator requires privileged access to install drivers and configure the container runtime
+