Commit 51c92ab

Merge pull request #2296 from MicrosoftDocs/main
Auto Publish – main to live - 2025-09-19 17:00 UTC
2 parents 3359f98 + d72c0e7 commit 51c92ab

10 files changed

+334
-0
lines changed
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
---
2+
title: Azure HPC Guest Health Reporting - FAQ
3+
description: Frequently asked questions for Azure Guest Health Reporting.
4+
author: rolandnyamo
5+
ms.author: ronyamo
6+
ms.topic: faq
7+
ms.service: azure
8+
ms.date: 09/18/2025
9+
ms.custom: template-overview
10+
---
11+
12+
# Azure Guest Health Reporting FAQ (preview)
13+
> [!IMPORTANT]
14+
> Azure Guest Health Reporting is currently in Preview. See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
15+
16+
Here are answers to common questions about Azure Guest Health Reporting.
17+
18+
## What happens if I don’t deallocate the node after sending the request to GHR?
19+
20+
For a regular GHR request that marks the node UA/OFR, if the customer doesn't deallocate the VMs within 30 days after the node becomes UA, the node automatically moves to HI (HumanInvestigate). For a reset request, there's no timeout because it doesn't require the customer to deallocate VMs. For a reboot request, if the customer doesn't deallocate the VMs within 30 days after the node becomes UA, the node is set to Available, which means the customer's request to reboot the node is ignored.
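
The timeout rules in this answer can be summarized as a small decision function. This is only an illustrative sketch: the function name, the request-type labels (`"unhealthy"`, `"reset"`, `"reboot"`), and the returned state strings are hypothetical and not part of any GHR API. The 30-day threshold and the outcomes come from the answer above; treating day 30 itself as the point where the timeout fires is an assumption.

```python
from typing import Optional

TIMEOUT_DAYS = 30  # threshold stated in the FAQ answer


def state_after_timeout(request_type: str, days_since_ua: int, deallocated: bool) -> Optional[str]:
    """Return the node state forced by the timeout, or None if no timeout applies.

    request_type is a hypothetical label: "unhealthy", "reset", or "reboot".
    """
    if request_type not in ("unhealthy", "reset", "reboot"):
        raise ValueError(f"unknown request type: {request_type}")
    # Reset requests never time out; other requests only time out when the
    # VMs are still allocated once the threshold is reached (assumed inclusive).
    if request_type == "reset" or deallocated or days_since_ua < TIMEOUT_DAYS:
        return None
    # Unhealthy reports escalate to manual investigation; reboot requests are dropped.
    return "HumanInvestigate" if request_type == "unhealthy" else "Available"
```
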
21+
22+
## How do I upload logs?
23+
24+
1. Get an access token for the customer's storage account/container via
25+
`/subscriptions/[subscriptionId]/providers/Microsoft.Impact/getUploadtoken?api-version=2025-01-01-preview`.
26+
27+
2. Upload logs using the upload URL/token:
28+
```bash
29+
az storage blob upload --file "path/to/local/file.zip" --blob-url \
30+
"https://[storageAccount].blob.core.windows.net/[container]/[datetime]_[randomHash].zip?[SasToken]"
31+
```
32+
3. Trim off the SAS token and send the report with the `LogUrl` field:
33+
```json
34+
{
35+
"properties": {
36+
"startDateTime": "2025-09-15T01:06:21.3886467Z",
37+
"impactCategory": "Resource.Hpc.Unhealthy.IBPerformance",
38+
"impactDescription": "IB low bandwidth",
39+
"impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>",
40+
"additionalProperties": {
41+
"LogUrl": "https://someurl.blob.core.windows.net/rma",
42+
"PhysicalHostName": "GGBB90904476",
43+
"VmUniqueId": "1111111-22dr-3345-22rf-34454g89j"
44+
}
45+
}
46+
}
47+
48+
```
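
Step 3 asks for the upload URL without its SAS token. A minimal sketch of that trimming step, assuming the SAS token is the entire query string of the upload URL as in the step 2 example; `strip_sas_token` is a hypothetical helper name, not part of any Azure SDK.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_sas_token(upload_url: str) -> str:
    """Drop the query string (the SAS token) and any fragment from a blob URL."""
    parts = urlsplit(upload_url)
    # Rebuild the URL from scheme, host, and path only.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

The returned value is what would go into `LogUrl`; the token itself should not be included in the report.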
49+
50+
## Next steps
51+
* [What is Guest Health Reporting](guest-health-overview.md)
52+
* [Report Node Health](guest-health-impact-report.md)
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
---
2+
title: Azure HPC Guest Health Reporting - Impact Categories
3+
description: View GHR impact categories
4+
author: rolandnyamo
5+
ms.author: ronyamo
6+
ms.service: azure
7+
ms.topic: overview
8+
ms.date: 09/18/2025
9+
ms.custom: template-overview
10+
---
11+
12+
# Guest Health Reporting impact categories (preview)
13+
> [!IMPORTANT]
14+
> Guest Health Reporting is currently in Preview. See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
15+
16+
To properly report issues to Guest Health Reporting, you must use an impact category that starts with `Resource.Hpc`.
17+
18+
There are three main types of HPC impact categories:
19+
1. `Reset`: Request a refresh of the node health state.
20+
2. `Reboot`: Request a node reboot.
21+
3. `Unhealthy`: Issues are observed on the node. Node should be taken out of production for further diagnostics and repair.
22+
23+
## Detailed HPC impact categories
24+
25+
| Category | Description | Mark OFR |
26+
|--------------------------------------------------|-----------------------------------------------|----------|
27+
| Resource.Hpc.Reset | Reset node health status | No |
28+
| Resource.Hpc.Reboot | Reboot node | No |
29+
| Resource.Hpc.Unhealthy.HpcMissingGpu | Missing GPU | Yes |
30+
| Resource.Hpc.Unhealthy.MissingIB | Missing InfiniBand port | Yes |
31+
| Resource.Hpc.Unhealthy.IBPerformance | Degraded InfiniBand performance | Yes |
32+
| Resource.Hpc.Unhealthy.IBPortDown | InfiniBand port is in DOWN state | Yes |
33+
| Resource.Hpc.Unhealthy.IBPortFlapping | InfiniBand port flapping | Yes |
34+
| Resource.Hpc.Unhealthy.HpcGpuDcgmDiagFailure | GPU DCGMI diagnostic failure | Yes |
35+
| Resource.Hpc.Unhealthy.HpcRowRemapFailure | GPU row remap failure | Yes |
36+
| Resource.Hpc.Unhealthy.HpcInforomCorruption | GPU infoROM corruption | Yes |
37+
| Resource.Hpc.Unhealthy.HpcGenericFailure | Issue doesn't fall into any other category | Yes |
38+
| Resource.Hpc.Unhealthy.ManualInvestigation | Request further manual investigation by HPC team | Yes |
39+
| Resource.Hpc.Unhealthy.XID95UncontainedECCError | GPU uncontained ECC error (Xid 95) | Yes |
40+
| Resource.Hpc.Unhealthy.XID94ContainedECCError | GPU contained ECC error (Xid 94) | Yes |
41+
| Resource.Hpc.Unhealthy.XID79FallenOffBus | GPU fallen off PCIe bus (Xid 79) | Yes |
42+
| Resource.Hpc.Unhealthy.XID48DoubleBitECC | GPU reports double bit ECC error (Xid 48) | Yes |
43+
| Resource.Hpc.Unhealthy.UnhealthyGPUNvidiasmi | nvidia-smi hangs and might not recover | Yes |
44+
| Resource.Hpc.Unhealthy.NvLink                    | NVLink is down                                | Yes      |
45+
| Resource.Hpc.Unhealthy.HpcDcgmiThermalReport | DCGMI reports thermal violations | Yes |
46+
| Resource.Hpc.Unhealthy.ECCPageRetirementTableFull| Double-bit ECC error page retirements over threshold | Yes |
47+
| Resource.Hpc.Unhealthy.DBEOverLimit | GPU has more than 10 double-bit ECC retired pages in seven days | Yes |
48+
| Resource.Hpc.Unhealthy.GpuXIDError               | GPU reports Xid error other than 48, 79, 94, 95 | Yes      |
49+
| Resource.Hpc.Unhealthy.AmdGpuResetFailed | AMD GPU unrecoverable reset failure error | Yes |
50+
| Resource.Hpc.Unhealthy.EROTFailure | GPU memory EROT failure | Yes |
51+
| Resource.Hpc.Unhealthy.GPUMemoryBWFailure | GPU memory bandwidth failure | Yes |
52+
| Resource.Hpc.Unhealthy.CPUPerformance | CPU performance issue | Yes |
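
The table follows a simple pattern: `Reset` and `Reboot` don't mark the node OFR, while every `Unhealthy.*` category does. A rough sketch of that rule, covering only the categories listed on this page; `classify_category` is a hypothetical helper, and the service itself remains the authority on which strings are accepted.

```python
from typing import Tuple

PREFIX = "Resource.Hpc."


def classify_category(category: str) -> Tuple[str, bool]:
    """Return (main type, marks_ofr) for an impact category from the table above."""
    if not category.startswith(PREFIX):
        raise ValueError(f"not an HPC impact category: {category}")
    # The main type is the first segment after the prefix.
    kind = category[len(PREFIX):].split(".", 1)[0]
    if kind in ("Reset", "Reboot"):
        return kind, False
    if kind == "Unhealthy":
        return kind, True
    raise ValueError(f"category type not listed in the table: {kind}")
```
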
53+
54+
## Next steps
55+
* [What is Guest Health Reporting](guest-health-overview.md)
56+
* [Report Node Health](guest-health-impact-report.md)
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
---
2+
title: Azure Guest Health Reporting - Report Node Health
3+
description: Share supercomputing VM device health status with Azure.
4+
author: rolandnyamo
5+
ms.author: ronyamo
6+
ms.service: azure
7+
ms.topic: overview
8+
ms.date: 09/18/2025
9+
ms.custom: template-overview
10+
---
11+
12+
# Report Guest Health status (preview)
13+
> [!IMPORTANT]
14+
> Guest Health Reporting is currently in Preview. See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
15+
16+
Review health reporting [prerequisites](guest-health-overview.md#onboarding-process).
17+
18+
## REST Client Reporting
19+
```
20+
PUT https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Impact/workloadImpacts/{workloadImpactName}?api-version=2023-02-01-preview
21+
```
22+
Descriptions of URI parameters are as follows:
23+
24+
| **Field Name** | **Description** |
25+
|---------------------|--------------------|
26+
| subscriptionId | Subscription previously allow-listed. |
27+
| workloadImpactName  | A unique name that would identify a specific impact. You can use a GUID as well. |
28+
| api-version | API version to be used for this operation. Use `2023-02-01-preview` |
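
As a sketch of how the request URI is assembled from these parameters, the helper below fills in the template from the `PUT` line and generates a GUID when no impact name is supplied. The function name `build_impact_url` is hypothetical; only the URI template and the API version come from this page.

```python
import uuid
from typing import Optional
from urllib.parse import quote

API_VERSION = "2023-02-01-preview"


def build_impact_url(subscription_id: str, impact_name: Optional[str] = None) -> str:
    """Build the workloadImpacts PUT URL; a random GUID is used if no name is given."""
    name = impact_name or str(uuid.uuid4())
    return (
        "https://management.azure.com"
        f"/subscriptions/{quote(subscription_id, safe='')}"
        "/providers/Microsoft.Impact"
        f"/workloadImpacts/{quote(name, safe='')}"
        f"?api-version={API_VERSION}"
    )
```
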
29+
30+
### [Healthy Node](#tab/healthy/)
31+
32+
```json
33+
{
34+
"properties": {
35+
"startDateTime": "2025-09-15T01:06:21.3886467Z",
36+
"impactCategory": "Resource.Hpc.Healthy",
37+
"impactDescription": "Node is healthy",
38+
"impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>",
39+
"additionalProperties": {
40+
"PhysicalHostName": "GGBB90904476"
41+
}
42+
}
43+
}
44+
45+
```
46+
47+
### [Missing GPU](#tab/missingGPU/)
48+
49+
```json
50+
{
51+
"properties": {
52+
"startDateTime": "2025-09-15T01:06:21.3886467Z",
53+
"impactCategory": "Resource.Hpc.Unhealthy.HpcMissingGpu",
54+
"impactDescription": "Missing GPU device",
55+
"impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>",
56+
"additionalProperties": {
57+
"LogUrl": "https://someurl.blob.core.windows.net/rma",
58+
"PhysicalHostName": "GGBB90904476",
59+
"VmUniqueId": "1111111-22dr-3345-22rf-34454g89j",
60+
"Manufacturer": "Nvidia",
61+
"SerialNumber": "12345679",
62+
"ModelNumber": "NV3LB225",
63+
"Location": "0"
64+
}
65+
}
66+
}
67+
68+
```
69+
70+
### [Investigate Node](#tab/investigate/)
71+
72+
```json
73+
{
74+
"properties": {
75+
"startDateTime": "2025-09-15T01:06:21.3886467Z",
76+
"impactCategory": "Resource.Hpc.Investigate.NVLink",
77+
"impactDescription": "NvLink may be down",
78+
"impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>",
79+
"additionalProperties": {
80+
"LogUrl": "https://someurl.blob.core.windows.net/rma",
81+
"VmUniqueId": "1111111-22dr-3345-22rf-34454g89j",
82+
"CollectTelemetry": "0"
83+
}
84+
}
85+
}
86+
87+
```
88+
89+
### [Unhealthy Non GPU](#tab/unhealthynongpu/)
90+
91+
```json
92+
{
93+
"properties": {
94+
"startDateTime": "2025-09-15T01:06:21.3886467Z",
95+
"impactCategory": "Resource.Hpc.Unhealthy.IBPerformance",
96+
"impactDescription": "IB low bandwidth",
97+
"impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>",
98+
"additionalProperties": {
99+
"LogUrl": "https://someurl.blob.core.windows.net/rma",
100+
"PhysicalHostName": "GGBB90904476",
101+
"VmUniqueId": "1111111-22dr-3345-22rf-34454g89j"
102+
}
103+
}
104+
}
105+
106+
```
107+
---
108+
109+
110+
| **Field Name** | **Required** | **Data Type** | **Description** |
111+
|-----------------------|--------------|---------------|---------------------------------------------------------------------------------|
112+
| startDateTime | Y | datetime | Time (UTC) when the impact happened. |
113+
| impactCategory        | Y            | string        | Observation type or fault scenario. Only approved category strings are allowed. |
114+
| impactDescription | Y | string | Description of the reported impact. |
115+
| impactedResourceId | Y | string | Fully qualified resource URI for the Azure resource. |
116+
| PhysicalHostName      | Y            | string        | Node identifier, available in metadata.                                         |
117+
| VmUniqueId | Y | string | Virtual machine unique ID. Queryable inside VM. |
118+
| LogUrl                | N*           | string        | URL to saved logs.                                                              |
119+
| Manufacturer          | N*           | string        | GPU manufacturer.                                                               |
120+
| SerialNumber          | N*           | string        | GPU serial number.                                                              |
121+
| ModelNumber           | N*           | string        | GPU model number.                                                               |
122+
| Location              | N*           | string        | PCIe location.                                                                  |
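
A minimal sketch of assembling a request body from the fields in this table. `build_impact_payload` is a hypothetical helper; the field names mirror the JSON samples on this page, and the timestamp is emitted as UTC with a `Z` suffix on the assumption that the service accepts any ISO 8601 UTC value like the one in the samples.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_impact_payload(
    category: str,
    description: str,
    resource_id: str,
    additional: Optional[Dict[str, str]] = None,
    start: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble the request body; `start` must be timezone-aware if given."""
    # Default to the current time and normalize to UTC.
    when = (start or datetime.now(timezone.utc)).astimezone(timezone.utc)
    properties: Dict[str, Any] = {
        "startDateTime": when.isoformat().replace("+00:00", "Z"),
        "impactCategory": category,
        "impactDescription": description,
        "impactedResourceId": resource_id,
    }
    # Optional fields such as LogUrl or PhysicalHostName go here unchanged.
    if additional:
        properties["additionalProperties"] = dict(additional)
    return {"properties": properties}
```
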
123+
124+
>[!NOTE]
125+
> Providing optional information can speed up the node recovery time.
126+
> PhysicalHostName can be retrieved from within the VM using this script: [Utilities/kvp_client.c at main·jeseszhang1010/Utilities·GitHub](https://github.com/jeseszhang1010/Utilities/blob/main/kvp_client.c)
127+
128+
**Use the following commands to get the PhysicalHostName:**
129+
```shell
130+
timeout 100 gcc -o /root/scripts/GPU/kvp_client /root/scripts/GPU/kvp_client.c
131+
timeout 60 sudo /root/scripts/GPU/kvp_client | grep "PhysicalHostName;" | awk '{print$4}' | tee PhysicalHostName.txt
132+
```
133+
134+
### HPC additional properties
135+
136+
To aid Guest Health Reporting in taking the correct action, you can provide more information about the issue using the additionalProperties field. <br>
137+
`Resource.Hpc.*` fields:
138+
* `LogUrl` (string) – URL to relevant log file
139+
* `PhysicalHostName` (string) – physical host name of the node (alphanumeric)
140+
* `VmUniqueId` (string) – virtual machine unique ID (GUID)
141+
142+
> [!IMPORTANT]
143+
> All HPC impact requests must include either a PhysicalHostName or VmUniqueId (PhysicalHostName is preferred). The VM in question can be from any subscription and isn't limited to the VMs in the subscription that you're reporting from.
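
The either/or rule in this note can be checked on the client before a report is sent. The sketch below is an assumption-laden illustration: `validate_hpc_impact` is a hypothetical helper, the required top-level fields are taken from the field table on this page, and the service may apply further checks that are not modeled here.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("startDateTime", "impactCategory", "impactDescription", "impactedResourceId")


def validate_hpc_impact(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the payload; an empty list means none."""
    problems: List[str] = []
    props = payload.get("properties") or {}
    for name in REQUIRED_FIELDS:
        if not props.get(name):
            problems.append(f"missing required field: {name}")
    category = str(props.get("impactCategory") or "")
    if category.startswith("Resource.Hpc."):
        extra = props.get("additionalProperties") or {}
        # HPC requests need at least one node identifier.
        if not (extra.get("PhysicalHostName") or extra.get("VmUniqueId")):
            problems.append("HPC impact needs PhysicalHostName or VmUniqueId")
    return problems
```
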
144+
145+
`Resource.Hpc.Unhealthy.*` fields that are specific to GPUs only:
146+
* `Manufacturer` (string) – manufacturer of GPU
147+
* `SerialNumber` (string) – serial number of GPU
148+
* `ModelNumber` (string) – model number of GPU
149+
* `Location` (string) – physical location of GPU
150+
151+
`Resource.Hpc.Investigate.*` fields:
152+
* `CollectTelemetry` (Boolean - 0/1) – tell HPC to collect telemetry from the impacted VM
153+
154+
`gpu_row_remap_failure` field:
155+
* `SerialNumber` (string) – serial number of GPU
156+
* Row remapping flag:
157+
* "`gpu_row_remap_failure`: GPU # (SXM# SN:#): row remap failure. This is an official end of life condition: decommission the GPU"
158+
159+
`gpu_row_remap_*` fields:
160+
* `UCE` (string) - count of uncorrectable errors in histogram data
161+
* `SerialNumber` (string) – serial number of GPU
162+
* "`gpu_row_remap_*`: GPU # (SXM# SN:#): bank with multiple row remaps: partial 1, low 0, none 0. CE: 0, UCE: #"
163+
164+
> [!IMPORTANT]
165+
> Customers are advised to include detailed row remapping fields with the specified information in their claims to expedite node restoration.
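
One way to populate the row remapping fields is to pull the serial number and UCE count out of the diagnostic message. The sketch below assumes the message follows the templates quoted above, with the `#` placeholders replaced by concrete values; the regular expressions and the helper name `row_remap_properties` are hypothetical, and real messages may differ.

```python
import re
from typing import Dict

# Assumed shapes: "SN:<serial>)" and "UCE: <count>", per the templates above.
_SERIAL = re.compile(r"SN:\s*([^)\s]+)")
_UCE = re.compile(r"\bUCE:\s*(\d+)")


def row_remap_properties(message: str) -> Dict[str, str]:
    """Extract SerialNumber and UCE (when present) from a row remap message."""
    props: Dict[str, str] = {}
    serial = _SERIAL.search(message)
    if serial:
        props["SerialNumber"] = serial.group(1)
    uce = _UCE.search(message)
    if uce:
        props["UCE"] = uce.group(1)
    return props
```

The resulting dictionary can be merged into `additionalProperties` alongside the other fields.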
166+
167+
168+
## Next steps
169+
* [What is Guest Health Reporting](guest-health-overview.md)
170+
* [HPC Impact Categories](guest-health-impact-categories.md)
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
---
2+
title: Azure HPC Guest Health Reporting - Overview
3+
description: Report Azure supercomputing VM device health status to Microsoft.
4+
author: rolandnyamo
5+
ms.author: ronyamo
6+
ms.service: azure
7+
ms.topic: overview
8+
ms.date: 09/18/2025
9+
ms.custom: template-overview
10+
---
11+
12+
# What is Guest Health Reporting (GHR)? (preview)
13+
> [!IMPORTANT]
14+
> Guest Health Reporting is currently in Preview. See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
15+
16+
The Guest Health Reporting service allows Azure supercomputing customers to provide VM device health statuses to Azure. Based on these status updates, Azure HPC can decide to take problematic nodes out of production and send them for repair.
17+
18+
## Onboarding process
19+
20+
To use Guest Health Reporting to report the health of a node, the subscription that hosts the resources needs to be onboarded to the Impact service using the following steps:
21+
22+
1. Go to the Azure portal -> Subscriptions (select) -> Resource Providers in the left menu. <br>
23+
[ ![Screenshot that shows subscription settings with Resource Providers link.](images/guest-health-onboarding-subscription.png) ](images/guest-health-onboarding-subscription.png#lightbox)
24+
2. Search for the `Microsoft.Impact` resource provider.
25+
3. Select and Register it.<br>
26+
[ ![Screenshot that shows the Microsoft.Impact RP selection and registration option.](images/guest-health-registration.png) ](images/guest-health-registration.png#lightbox)
27+
4. Once registered, in the left pane select Settings -> Preview Features, search for "Allow Impact Reporting", select it and select "Register". <br>
28+
[ ![Screenshot that shows guest health reporting preview feature registration.](images/guest-health-preview-feature-selection.png) ](images/guest-health-preview-feature-selection.png#lightbox)
29+
5. Go to the left pane -> Settings -> Overview, retrieve your Subscription ID, and send it to the Azure team member assisting you to complete the onboarding process.
30+
6. **Wait for confirmation that the onboarding process is complete before submitting GHR requests.**
31+
32+
## Access management and role assignment
33+
34+
To submit GHR requests from a resource within Azure, the appropriate access management roles must be assigned.
35+
1. Create a user-assigned or system-assigned managed identity.
36+
2. Go to Access Control (IAM) in the left menu -> select Add Role Assignment. <br>
37+
[ ![Screenshot that shows how to add a role assignment.](images/guest-health-add-role.png) ](images/guest-health-add-role.png#lightbox)
38+
3. Search for the `Impact Reporter` role in the search box and select it. <br>
39+
[ ![Screenshot that shows the impact reporter role.](images/guest-health-impact-reporter-role.png) ](images/guest-health-impact-reporter-role.png#lightbox)
40+
4. Go to the Members tab, search for the user identity, app ID, or service principal in the search box, and select it -> Select Members. The app ID is the service principal of the app to be used for reporting.
41+
5. Once the app ID or managed identity is selected, select Review + assign.
42+
43+
44+
## Next steps
45+
* [Report node health](guest-health-impact-report.md)
46+
* [HPC Impact Categories](guest-health-impact-categories.md)

articles/azure-impact-reporting/toc.yml

Lines changed: 10 additions & 0 deletions
@@ -28,6 +28,16 @@
2828
href: connectors-troubleshooting-guide.md
2929
- name: Impact Reporting Connectors FAQ
3030
href: connectors-faq.md
31+
- name: Guest Health Reporting (Preview)
32+
items:
33+
- name: Overview
34+
href: guest-health-overview.md
35+
- name: Report Guest Health Impact
36+
href: guest-health-impact-report.md
37+
- name: HPC Impact Categories
38+
href: guest-health-impact-categories.md
39+
- name: FAQ
40+
href: guest-health-faq.md
3141
- name: FAQ
3242
href: faq.md
3343
