|
| 1 | +--- |
| 2 | +title: Azure Guest Health Reporting - Report Node Health |
| 3 | +description: Share supercomputing VM device health status with Azure. |
| 4 | +author: rolandnyamo |
| 5 | +ms.author: ronyamo |
| 6 | +ms.service: azure |
| 7 | +ms.topic: overview |
| 8 | +ms.date: 09/18/2025 |
| 9 | +ms.custom: template-overview |
| 10 | +--- |
| 11 | + |
| 12 | +# Report Guest Health status (preview) |
| 13 | +> [!IMPORTANT] |
| 14 | +> Guest Health Reporting is currently in Preview. See the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/) for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability. |
| 15 | +
|
| 16 | +Before you begin, review the health reporting [prerequisites](guest-health-overview.md#onboarding-process). |
| 17 | + |
| 18 | +## REST Client Reporting |
| 19 | +``` |
| 20 | +PUT https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Impact/workloadImpacts/{workloadImpactName}?api-version=2023-02-01-preview |
| 21 | +``` |
| 22 | +The request URI takes the following parameters: |
| 23 | + |
| 24 | +| **Field Name** | **Description** | |
| 25 | +|---------------------|--------------------| |
| 26 | +| subscriptionId | ID of a subscription that was previously allow-listed. | |
| 27 | +| workloadImpactName | A unique name that identifies a specific impact. You can also use a GUID. | |
| 28 | +| api-version | API version to use for this operation. Use `2023-02-01-preview`. | |
| 29 | + |
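The URI above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the function name `build_impact_url` and the fallback to a random GUID when no impact name is supplied are illustrative assumptions. Only the URI template and the API version come from this article.

```python
import uuid
from typing import Optional
from urllib.parse import quote

API_VERSION = "2023-02-01-preview"


def build_impact_url(subscription_id: str, workload_impact_name: Optional[str] = None) -> str:
    """Return the PUT URL for a workload impact report (hypothetical helper)."""
    # Assumption: fall back to a random GUID when the caller gives no name.
    name = workload_impact_name or str(uuid.uuid4())
    return (
        "https://management.azure.com"
        f"/subscriptions/{quote(subscription_id, safe='')}"
        "/providers/Microsoft.Impact"
        f"/workloadImpacts/{quote(name, safe='')}"
        f"?api-version={API_VERSION}"
    )


print(build_impact_url("<subscriptionId>", "impact-1"))
```

Percent-encoding each path segment keeps an unusual impact name from changing the shape of the URI.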
| 30 | +### [Healthy Node](#tab/healthy/) |
| 31 | + |
| 32 | +```json |
| 33 | +{ |
| 34 | + "properties": { |
| 35 | + "startDateTime": "2025-09-15T01:06:21.3886467Z", |
| 36 | + "impactCategory": "Resource.Hpc.Healthy", |
| 37 | + "impactDescription": "Node is healthy", |
| 38 | + "impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>", |
| 39 | + "additionalProperties": { |
| 40 | + "PhysicalHostName": "GGBB90904476" |
| 41 | + } |
| 42 | + } |
| 43 | +} |
| 44 | + |
| 45 | +``` |
| 46 | + |
| 47 | +### [Missing GPU](#tab/missingGPU/) |
| 48 | + |
| 49 | +```json |
| 50 | +{ |
| 51 | + "properties": { |
| 52 | + "startDateTime": "2025-09-15T01:06:21.3886467Z", |
| 53 | + "impactCategory": "Resource.Hpc.Unhealthy.HpcMissingGpu", |
| 54 | + "impactDescription": "Missing GPU device", |
| 55 | + "impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>", |
| 56 | + "additionalProperties": { |
| 57 | + "LogUrl": "https://someurl.blob.core.windows.net/rma", |
| 58 | + "PhysicalHostName": "GGBB90904476", |
| 59 | + "VmUniqueId": "1111111-22dr-3345-22rf-34454g89j", |
| 60 | + "Manufacturer": "Nvidia", |
| 61 | + "SerialNumber": "12345679", |
| 62 | + "ModelNumber": "NV3LB225", |
| 63 | + "Location": "0" |
| 64 | + } |
| 65 | + } |
| 66 | +} |
| 67 | + |
| 68 | +``` |
| 69 | + |
| 70 | +### [Investigate Node](#tab/investigate/) |
| 71 | + |
| 72 | +```json |
| 73 | +{ |
| 74 | + "properties": { |
| 75 | + "startDateTime": "2025-09-15T01:06:21.3886467Z", |
| 76 | + "impactCategory": "Resource.Hpc.Investigate.NVLink", |
| 77 | + "impactDescription": "NvLink may be down", |
| 78 | + "impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>", |
| 79 | + "additionalProperties": { |
| 80 | + "LogUrl": "https://someurl.blob.core.windows.net/rma", |
| 81 | + "VmUniqueId": "1111111-22dr-3345-22rf-34454g89j", |
| 82 | + "CollectTelemetry": "0" |
| 83 | + } |
| 84 | + } |
| 85 | +} |
| 86 | + |
| 87 | +``` |
| 88 | + |
| 89 | +### [Unhealthy Non GPU](#tab/unhealthynongpu/) |
| 90 | + |
| 91 | +```json |
| 92 | +{ |
| 93 | + "properties": { |
| 94 | + "startDateTime": "2025-09-15T01:06:21.3886467Z", |
| 95 | + "impactCategory": "Resource.Hpc.Unhealthy.IBPerformance", |
| 96 | + "impactDescription": "IB low bandwidth", |
| 97 | + "impactedResourceId": "/subscriptions/111111-f1122-2233-11bc-bb00123/resourceGroups/<rg_name>/providers/Microsoft.Compute/virtualMachines/<vm_name>", |
| 98 | + "additionalProperties": { |
| 99 | + "LogUrl": "https://someurl.blob.core.windows.net/rma", |
| 100 | + "PhysicalHostName": "GGBB90904476", |
| 101 | + "VmUniqueId": "1111111-22dr-3345-22rf-34454g89j" |
| 102 | + } |
| 103 | + } |
| 104 | +} |
| 105 | + |
| 106 | +``` |
| 107 | +--- |
| 108 | + |
| 109 | + |
| 110 | +| **Field Name** | **Required** | **Data Type** | **Description** | |
| 111 | +|-----------------------|--------------|---------------|---------------------------------------------------------------------------------| |
| 112 | +| startDateTime | Y | datetime | Time (UTC) when the impact happened. | |
| 113 | +| impactCategory | Y | string | Observation type or fault scenario. Only values from the approved list are allowed. | |
| 114 | +| impactDescription | Y | string | Description of the reported impact. | |
| 115 | +| impactedResourceId | Y | string | Fully qualified resource URI for the Azure resource. | |
| 116 | +| PhysicalHostName | Y | string | Node identifier, available in metadata. | |
| 117 | +| VmUniqueId | Y | string | Virtual machine unique ID. Queryable inside the VM. | |
| 118 | +| LogUrl | N* | string | URL to saved logs. | |
| 119 | +| Manufacturer | N* | string | GPU manufacturer. | |
| 120 | +| SerialNumber | N* | string | GPU serial number. | |
| 121 | +| ModelNumber | N* | string | GPU model number. | |
| 122 | +| Location | N* | string | PCIe location. | |
| 123 | + |
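The request body can be built from the fields in the table above. This is a hedged Python sketch: `build_impact_body` is a hypothetical helper, and the timestamp is formatted with six fractional digits, which is assumed (not confirmed by this article) to be accepted by the service. The category, description, and host name values are copied from the examples in this article.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_impact_body(
    impact_category: str,
    impact_description: str,
    impacted_resource_id: str,
    additional_properties: dict,
    start: Optional[datetime] = None,
) -> dict:
    """Assemble the request body; `start` must be timezone-aware."""
    start = start or datetime.now(timezone.utc)
    return {
        "properties": {
            # Report the impact time in UTC.
            "startDateTime": start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "impactCategory": impact_category,
            "impactDescription": impact_description,
            "impactedResourceId": impacted_resource_id,
            # Copy so later changes by the caller do not alter the body.
            "additionalProperties": dict(additional_properties),
        }
    }


body = build_impact_body(
    "Resource.Hpc.Unhealthy.IBPerformance",
    "IB low bandwidth",
    "/subscriptions/<subscription_id>/resourceGroups/<rg_name>"
    "/providers/Microsoft.Compute/virtualMachines/<vm_name>",
    {"PhysicalHostName": "GGBB90904476"},
    start=datetime(2025, 9, 15, 1, 6, 21, 388646, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` guarantees strictly valid JSON, with no trailing commas or comments.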
| 124 | +>[!NOTE] |
| 125 | +> Providing optional information can speed up node recovery. |
| 126 | +> You can retrieve PhysicalHostName from within the VM by compiling and running this C utility: [kvp_client.c](https://github.com/jeseszhang1010/Utilities/blob/main/kvp_client.c) |
| 127 | +
|
| 128 | +**Use the following commands to compile the utility and get the PhysicalHostName:** |
| 129 | +```shell |
| 130 | +timeout 100 gcc -o /root/scripts/GPU/kvp_client /root/scripts/GPU/kvp_client.c |
| 131 | +timeout 60 sudo /root/scripts/GPU/kvp_client | grep "PhysicalHostName;" | awk '{print$4}' | tee PhysicalHostName.txt |
| 132 | +``` |
| 133 | + |
| 134 | +### HPC additional properties |
| 135 | + |
| 136 | +To help Guest Health Reporting take the correct action, provide more information about the issue in the `additionalProperties` field. <br> |
| 137 | +`Resource.Hpc.*` fields: |
| 138 | +* `LogUrl` (string) – URL to the relevant log file |
| 139 | +* `PhysicalHostName` (string) – physical host name of the node (alphanumeric) |
| 140 | +* `VmUniqueId` (string) – virtual machine unique ID (GUID) |
| 141 | + |
| 142 | +> [!IMPORTANT] |
| 143 | +> All HPC impact requests must include either a PhysicalHostName or VmUniqueId (PhysicalHostName is preferred). The VM in question can be from any subscription and isn't limited to the VMs in the subscription that you're reporting from. |
| 144 | +
|
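The identifier rule above can be checked on the client before a report is sent. The sketch below is illustrative only: `missing_hpc_identifier` is a hypothetical helper, and it assumes the request body has the shape shown in the JSON examples in this article.

```python
def missing_hpc_identifier(body: dict) -> bool:
    """Return True when neither PhysicalHostName nor VmUniqueId is set."""
    props = body.get("properties", {}).get("additionalProperties", {})
    # Empty strings count as missing.
    return not (props.get("PhysicalHostName") or props.get("VmUniqueId"))


report = {
    "properties": {
        "impactCategory": "Resource.Hpc.Investigate.NVLink",
        "additionalProperties": {"VmUniqueId": "<vm_unique_id>"},
    }
}

if missing_hpc_identifier(report):
    raise ValueError("Add PhysicalHostName (preferred) or VmUniqueId to additionalProperties")
```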
| 145 | +`Resource.Hpc.Unhealthy.*` fields that are specific to GPUs only: |
| 146 | +* `Manufacturer` (string) – manufacturer of GPU |
| 147 | +* `SerialNumber` (string) – serial number of GPU |
| 148 | +* `ModelNumber` (string) – model number of GPU |
| 149 | +* `Location` (string) – physical location of GPU |
| 150 | + |
| 151 | +`Resource.Hpc.Investigate.*` fields: |
| 152 | +* `CollectTelemetry` (Boolean, 0/1) – whether HPC should collect telemetry from the impacted VM |
| 153 | + |
| 154 | +`gpu_row_remap_failure` fields: |
| 155 | +* `SerialNumber` (string) – serial number of GPU |
| 156 | +* Row remapping flag: |
| 157 | + * "`gpu_row_remap_failure`: GPU # (SXM# SN:#): row remap failure. This is an official end of life condition: decommission the GPU" |
| 158 | + |
| 159 | +`gpu_row_remap_*` fields: |
| 160 | +* `UCE` (string) – count of uncorrectable errors in histogram data |
| 161 | +* `SerialNumber` (string) – serial number of GPU |
| 162 | + * "`gpu_row_remap_*`: GPU # (SXM# SN:#): bank with multiple row remaps: partial 1, low 0, none 0. CE: 0, UCE: #" |
| 163 | + |
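If your health checks emit messages in the format quoted above, the `SerialNumber` and `UCE` values can be pulled out with regular expressions. This is a rough sketch under stated assumptions: each `#` placeholder after `SN:` and `UCE:` is assumed to be a run of letters or digits, the sample message is made up for illustration, and `remap_properties` is a hypothetical helper.

```python
import re

# Assumption: the serial number follows "SN:" and the count follows "UCE:".
_SERIAL = re.compile(r"SN:\s*(\w+)")
_UCE = re.compile(r"\bUCE:\s*(\d+)")


def remap_properties(message: str) -> dict:
    """Extract row remapping fields for additionalProperties from a message."""
    props = {}
    serial = _SERIAL.search(message)
    if serial:
        props["SerialNumber"] = serial.group(1)
    uce = _UCE.search(message)
    if uce:
        # The field list above types UCE as a string, so keep it as one.
        props["UCE"] = uce.group(1)
    return props


sample = (
    "GPU 3 (SXM4 SN:1234567890): bank with multiple row remaps: "
    "partial 1, low 0, none 0. CE: 0, UCE: 2"
)
print(remap_properties(sample))
```

The resulting dictionary can be merged into `additionalProperties` alongside `PhysicalHostName` or `VmUniqueId`.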
| 164 | +> [!IMPORTANT] |
| 165 | +> To expedite node restoration, include the row remapping fields described above in your claims. |
| 166 | +
|
| 167 | + |
| 168 | +## Next steps |
| 169 | +* [What is Guest Health Reporting](guest-health-overview.md) |
| 170 | +* [HPC Impact Categories](guest-health-impact-categories.md) |