Commit dae7499

Shiva Kumar (shivakunv) authored and committed
vgpu-manager: enable kernel module configuration via KernelModuleConfig
Signed-off-by: Shiva Kumar (SW-CLOUD) <[email protected]>
1 parent 8ca5c55 commit dae7499

File tree: 12 files changed, +885 -22 lines


INSTALL_RHEL.md

Lines changed: 392 additions & 0 deletions
@@ -0,0 +1,392 @@
# Installing NVIDIA GPU Operator on RHEL

This guide provides instructions for installing the NVIDIA GPU Operator from the cloud-native-stack repository on Red Hat Enterprise Linux (RHEL) systems.

## Prerequisites

### System Requirements

1. **RHEL Version**: RHEL 8.x or 9.x
2. **Kubernetes Cluster**: A running Kubernetes cluster (version >= 1.16.0)
3. **GPU Hardware**: NVIDIA GPU(s) installed on worker nodes
4. **Helm**: Version 3.x installed
5. **kubectl**: Configured to access your cluster

### Pre-installation Checks

Before installing, ensure:

- Your RHEL system is registered and has access to the required repositories
- The Kubernetes cluster is running and accessible
- You have cluster-admin privileges
- Nodes with GPUs are properly labeled (optional; the GPU Operator will auto-detect them)
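A quick sanity check of the tooling and hardware can save time later. The commands below are a minimal pre-flight sketch; the GPU and kernel checks must be run on a GPU worker node, and `lspci` requires the `pciutils` package.

```bash
# Client tooling and cluster access
helm version --short
kubectl version --client
kubectl cluster-info

# On a GPU worker node: confirm the GPU is visible on the PCI bus
lspci | grep -i nvidia

# On a GPU worker node: confirm kernel build packages are available
# (useful later if the driver container needs to compile against your kernel)
dnf list --available kernel-devel-$(uname -r) kernel-headers-$(uname -r)
```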
## Installation Methods

You can install the GPU Operator using either:

1. **Local Helm Chart** (from this repository)
2. **NVIDIA Helm Repository** (recommended for production)

---

## Method 1: Install from Local Helm Chart (Development/Testing)

This method uses the Helm chart directly from this cloud-native-stack repository.

### Step 1: Navigate to the Repository

```bash
cd /Users/shivaku/go/src/gitlab.com/nvidia/cloud-native-stack/test/gpu-operator
```
### Step 2: Install Helm Dependencies

```bash
# Update Helm dependencies for the chart
helm dependency update deployments/gpu-operator
```

### Step 3: Create Namespace

```bash
kubectl create namespace gpu-operator
```
### Step 4: Install GPU Operator

**Basic Installation:**

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --wait
```

**With Custom Values (RHEL-specific):**

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --set operator.defaultRuntime=containerd \
  --set driver.enabled=true \
  --wait
```

**For RHEL with Docker Runtime:**

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --set operator.defaultRuntime=docker \
  --wait
```
### Step 5: Verify Installation

```bash
# Check operator pod
kubectl get pods -n gpu-operator

# Check all GPU operator components
kubectl get pods -n gpu-operator --show-labels

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
```
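Driver compilation and toolkit setup can take several minutes. As a convenience, the sketch below blocks until every pod in the namespace reports Ready; the timeout is arbitrary, and validator pods that run to completion may cause it to report an error even when the install is healthy, so treat it as a hint rather than a gate.

```bash
# Wait until all GPU Operator pods report Ready (timeout is arbitrary)
kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=600s
```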
---

## Method 2: Install from NVIDIA Helm Repository (Recommended)

### Step 1: Add NVIDIA Helm Repository

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

### Step 2: Install GPU Operator

```bash
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --create-namespace \
  --wait
```
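To confirm the release deployed cleanly and to record which chart version you received, the standard Helm status commands are sufficient:

```bash
# Show release status and any install notes
helm status gpu-operator -n gpu-operator

# List the deployed chart and app version
helm list -n gpu-operator
```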
---

## RHEL-Specific Configuration Options

### For RHEL with SELinux Enabled

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --set operator.defaultRuntime=containerd \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --wait
```
### For RHEL with Specific Kernel Version

If you need to specify a driver version compatible with your RHEL kernel:

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --set driver.version="535.129.03" \
  --wait
```
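Before pinning a version, check the kernel actually running on the node against the NVIDIA driver release notes for that branch. After installation you can also confirm which driver was loaded; this is a sketch, and the daemonset name `nvidia-driver-daemonset` is the usual default but may differ in your deployment.

```bash
# Kernel currently running on the GPU node
uname -r

# After install: driver version actually loaded (daemonset name is an assumption)
kubectl exec -n gpu-operator daemonset/nvidia-driver-daemonset -- \
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
```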
### For RHEL with Custom Driver Installation Path

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --set hostPaths.driverInstallDir="/opt/nvidia/driver" \
  --wait
```

---
## Advanced Configuration

### Custom Values File

Create a custom values file for RHEL:

```bash
cat > rhel-values.yaml <<EOF
platform:
  openshift: false

operator:
  defaultRuntime: containerd
  upgradeCRD: true

driver:
  enabled: true
  repository: nvcr.io/nvidia
  image: driver
  version: "535.129.03"

toolkit:
  enabled: true

devicePlugin:
  enabled: true

dcgm:
  enabled: true

dcgmExporter:
  enabled: true

nodeStatusExporter:
  enabled: true

gfd:
  enabled: true

daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  priorityClassName: system-node-critical
EOF
```

Install with custom values:

```bash
helm install gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  -f rhel-values.yaml \
  --wait
```
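The operator renders these values into its ClusterPolicy resource, so one way to confirm they were picked up is to read them back. This is a sketch; the resource name `cluster-policy` is the usual default and is assumed here.

```bash
# Confirm the rendered ClusterPolicy reflects the values file (resource name assumed)
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.driver.version}{"\n"}'
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.operator.defaultRuntime}{"\n"}'
```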
---

## Verification Steps

### 1. Check Operator Status

```bash
kubectl get pods -n gpu-operator
```

Expected output should show all pods in Running state:

- gpu-operator-*
- gpu-feature-discovery-*
- nvidia-container-toolkit-daemonset-*
- nvidia-dcgm-exporter-*
- nvidia-device-plugin-daemonset-*
- nvidia-driver-daemonset-* (on GPU nodes)

### 2. Verify GPU Resources

```bash
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```
### 3. Test GPU Workload

Create a test pod:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```

Check the logs:

```bash
kubectl logs gpu-test
```

Expected output should show: `Test PASSED`
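Once the test passes, clean up the test pod:

```bash
kubectl delete pod gpu-test
```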
### 4. Check DCGM Metrics (if enabled)

```bash
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
```
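To spot-check the metrics themselves, you can port-forward to the exporter and scrape it directly. This is a sketch; the service name `nvidia-dcgm-exporter` and port 9400 are the usual defaults and are assumed here.

```bash
# In one terminal: forward the exporter port (service name and port are assumed defaults)
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400

# In another terminal: scrape GPU utilization metrics
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```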
---

## Troubleshooting

### Common Issues on RHEL

#### 1. Driver Installation Fails

**Check driver logs:**

```bash
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100
```

**Solution:** Ensure kernel-devel and kernel-headers are installed:

```bash
sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
```
#### 2. SELinux Denials

**Check SELinux status:**

```bash
sudo getenforce
sudo ausearch -m avc -ts recent
```

**Solution:** You may need to set SELinux to permissive mode temporarily (for debugging only; re-enable with `sudo setenforce 1`):

```bash
sudo setenforce 0
```

Or create a custom SELinux policy for the GPU Operator from the recorded denials, as sketched below.
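One common way to build such a policy is to feed the AVC denials into `audit2allow` (provided by the `policycoreutils-python-utils` package on RHEL). This is a sketch; the module name is arbitrary, and the generated rules should be reviewed before loading.

```bash
# Generate a local policy module from recent AVC denials (module name is arbitrary)
sudo ausearch -m avc -ts recent | sudo audit2allow -M gpu-operator-local

# Review the generated .te file, then load the module
sudo semodule -i gpu-operator-local.pp
```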
#### 3. Container Runtime Issues

**Verify container runtime:**

```bash
kubectl get nodes -o wide
```

**Check containerd/docker configuration:**

```bash
# For containerd
sudo cat /etc/containerd/config.toml | grep nvidia

# For docker
sudo cat /etc/docker/daemon.json
```
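Beyond grepping for the nvidia runtime entry, it can help to confirm which runtime each node reports to Kubernetes and which handler CRI treats as the default. This is a sketch; `crictl` must be installed on the node, and the `defaultRuntimeName` field is what containerd's CRI config normally exposes.

```bash
# Runtime version reported by each node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

# On the node: default runtime handler according to CRI (requires crictl)
sudo crictl info | grep -i defaultRuntimeName
```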
#### 4. Node Not Recognized as GPU Node

**Manually label the node** (this matches the label checked in the verification steps above):

```bash
kubectl label nodes <node-name> nvidia.com/gpu.present=true
```
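Before labeling by hand, it is worth checking whether node-feature-discovery and GPU Feature Discovery (deployed by the operator chart) have already applied any `nvidia.com/*` labels, since a missing label usually points at those components rather than the node itself:

```bash
# Inspect existing nvidia.com labels on the node
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep nvidia.com

# Check that the discovery components are healthy
kubectl get pods -n gpu-operator | grep -E 'node-feature-discovery|gpu-feature-discovery'
```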
---

## Uninstallation

### Remove GPU Operator

```bash
helm uninstall gpu-operator -n gpu-operator
```

### Clean up namespace and CRDs (optional)

```bash
kubectl delete namespace gpu-operator

# If CRDs need to be removed
kubectl delete crd clusterpolicies.nvidia.com
kubectl delete crd nvidiadrivers.nvidia.com
```
---

## Upgrade

### Upgrade from Local Chart

```bash
# Update the repository
cd /Users/shivaku/go/src/gitlab.com/nvidia/cloud-native-stack/test/gpu-operator
git pull

# Update dependencies
helm dependency update deployments/gpu-operator

# Upgrade
helm upgrade gpu-operator deployments/gpu-operator \
  -n gpu-operator \
  --wait
```
### Upgrade from NVIDIA Repository

```bash
helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --wait
```
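If you prefer to upgrade to a specific chart release rather than whatever is latest, list the available versions first and pin one with `--version`:

```bash
# List available chart versions
helm search repo nvidia/gpu-operator --versions | head

# Upgrade to a pinned chart version
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version <chart-version> \
  --wait
```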
---

## Additional Resources

- **Official Documentation**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html
- **Platform Support**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html
- **Troubleshooting Guide**: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
- **GitHub Repository**: https://github.com/NVIDIA/gpu-operator

---
## Notes

- The GPU Operator will automatically install NVIDIA drivers on GPU nodes if `driver.enabled=true`
- On RHEL systems, ensure your subscription is active and repos are configured
- For air-gapped environments, you'll need to mirror the required container images (see the sketch below for one way to enumerate them)
- The operator requires privileged access to install drivers and configure the container runtime
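One way to enumerate the images to mirror is to render the chart locally and extract the image references. This is only a sketch: the ClusterPolicy spec splits several images into separate repository/image/version fields, so cross-check the rendered ClusterPolicy and the official air-gapped installation documentation as well.

```bash
# Render the chart and list the container images it references
helm template gpu-operator nvidia/gpu-operator -n gpu-operator \
  | grep -E '^\s*image:' | sort -u
```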
