Add crashdump example and include snapshot/scratch in core dumps by jsturtevant · Pull Request #1264 · hyperlight-dev/hyperlight

jsturtevant · 2026-02-26T01:10:51Z

Core dumps generated by Hyperlight were missing the snapshot and scratch memory regions, making post-mortem debugging with GDB incomplete — register state was present but the guest's code, stack, heap, and page tables were absent. This adds the snapshot and scratch regions to the ELF core dump alongside any dynamically mapped regions so that GDB can show full backtraces, disassemble at the crash site, and inspect guest memory. A new runnable crashdump example demonstrates automatic dumps (VM-level faults), on-demand dumps (guest-caught exceptions), and per-sandbox opt-out, with GDB-based integration tests that validate register and memory content in the generated ELF files. The debugging docs are also updated with practical GDB commands for inspecting crash dumps.

src/hyperlight_host/examples/crashdump/main.rs

src/hyperlight_host/src/hypervisor/hyperlight_vm.rs

syntactically · 2026-03-03T13:32:17Z

@jsturtevant I believe you were working on another version of this that explicitly walks through the whole guest virtual address space?

jsturtevant · 2026-03-03T16:36:51Z

@jsturtevant I believe you were working on another version of this that explicitly walks through the whole guest virtual address space?

Yes, been working with @dblnz on getting it working. Should have something today

Co-authored-by: Doru Blânzeanu <dblnz@pm.me> Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

The runtime config was passed by reference into set_up_hypervisor_partition then immediately cloned, but no caller needs it afterward so it is now passed by value. The entry point field uses Option<u64> instead of a bare zero default so a missing value is detectable rather than silently producing a bogus AT_ENTRY. Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

generate_crashdump_to_dir accepts the output directory as a parameter instead of requiring callers to set the HYPERLIGHT_CORE_DUMP_DIR environment variable. This removes the need for unsafe std::env::set_var in tests while preserving the existing env-var fallback path for the automatic dump and the no-argument generate_crashdump method. Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

dblnz · 2026-03-04T13:29:51Z

src/hyperlight_host/src/sandbox/snapshot.rs

-    } else if gpa >= SandboxMemoryLayout::BASE_ADDRESS as u64 {
+    } else if gpa >= SandboxMemoryLayout::BASE_ADDRESS as u64
+        && gpa < SandboxMemoryLayout::BASE_ADDRESS as u64 + snap.mem_size() as u64
+    {


I remember this was a bug fix for some regions that are not in the snapshot and not in the scratch region either, but in between, but the previous logic returned the snapshot memory

this makes it explicit that it will look for a GPA in both were as previously it was also looking at values that where past the end of either. Here snap is

hyperlight/src/hyperlight_host/src/mem/layout.rs

Lines 20 to 35 in 7741d51

//! A snapshot freshly created from an empty VM will result in roughly

//! the following physical layout:

//!

//! +-------------------------------------------+

//! | Guest Page Tables |

//! +-------------------------------------------+

//! | Init Data | (GuestBlob size)

//! +-------------------------------------------+

//! | Guest Heap |

//! +-------------------------------------------+

//! | PEB Struct | (HyperlightPEB size)

//! +-------------------------------------------+

//! | Guest Code |

//! +-------------------------------------------+ 0x1_000

//! | NULL guard page |

//! +-------------------------------------------+ 0x0_000

and the scratch is

hyperlight/src/hyperlight_host/src/mem/layout.rs

Lines 48 to 61 in 7741d51

//! There is also a scratch region at the top of physical memory,

//! which is mostly laid out as a large undifferentiated blob of

//! memory, although at present the snapshot process specially

//! privileges the statically allocated input and output data regions:

//!

//! +-------------------------------------------+ (top of physical memory)

//! | Exception Stack, Metadata |

//! +-------------------------------------------+ (1 page below)

//! | Scratch Memory |

//! +-------------------------------------------+

//! | Output Data |

//! +-------------------------------------------+

//! | Input Data |

//! +-------------------------------------------+ (scratch size)

Yes, in the case of the snapshot code, which was formerly the only user of this code, this was being handled in the function guest_page, and the entire call to access_gpa was avoided. Now that there are more users of this VA enumeration, and crashdump/gdb are likely to continue to need it, we should see if the interface can be factored out a bit better (made to not rely on exclusive memory access, but to take advantage of it where available; and made to unify scratch/snapshot handling and mapped region handling).

src/hyperlight_host/src/sandbox/initialized_multi_use.rs

dblnz · 2026-03-04T13:52:52Z

src/hyperlight_host/src/mem/mgr.rs

+                        });
+                    } else {
+                        // GPA not in snapshot/scratch — check dynamic mmap regions
+                        resolve_from_mmap_regions(


I am not 100%, but from what I see, this function creates a region that exactly matches the mapping in size.
I am wondering, wouldn't the inclusion of dynamic mmap regions below take care of this by including the whole mmap region?

Note: I remember that when I was working on this, the dumps wouldn't correctly load without this function call, but I'm just saying, maybe I'm missing something.

this makes sure the dump preserves the GVA (virt_base, virt_end) instead of the physical address which is what the other loop does. The below mmap regions includes the GVA locations just for full dump but likely won't be used in gdb

src/hyperlight_host/src/mem/mgr.rs

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

dblnz

LGTM! Having participated in these changes, I'd feel more comfortable if other people had a look/approved this 😄

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

ludfjig

this is great, i like the added tests!

ludfjig · 2026-03-05T00:16:16Z

src/hyperlight_host/src/mem/memory_region.rs

+/// `usize` host addresses.  The crash dump path only reads host memory
+/// through raw pointers, so it never needs the file-mapping metadata
+/// stored in [`HostRegionBase`] on Windows.
+#[cfg(feature = "crashdump")]


The new types in memory_region.rs and helpers in mgr.rs use #[cfg(feature = "crashdump")] while most of the rest of the crate the #[cfg(crashdump)] alias. Maybe we can remove the alias since it doens't serve a purpose (I seem to remember it was only usable in debug profile before, but that no longer seems to be the case), OR switch to using the alias everywhere

ludfjig · 2026-03-05T00:17:17Z

src/hyperlight_host/src/mem/mgr.rs

+    /// With `init-paging` enabled, walks the guest page tables to discover
+    /// GVA→GPA mappings and translates them to host-backed regions. Also
+    /// includes any dynamic mmap regions that may not be covered by the
+    /// page table walk, and a sentinel page at the stack top for clean


outdated doc since the sentinel page was removed i believe

ludfjig · 2026-03-05T00:24:44Z

src/hyperlight_host/examples/crashdump/main.rs

+        if !gdb_is_available() {
+            eprintln!("Skipping test: {GDB_COMMAND} not found on PATH");
+            return;
+        }


Was this needed in CI? Do we not want the test to fail here?

syntactically

The direction of this looks good! I have a few comments, most of which are minor nits or just signposting things I thought were interesting points to discuss; the only thing that I'm actually particularly concerned about with the present state of the code is the mmap regions being present in the dump at VAs where they are not actually present in the sandbox.

At a higher level, it occurs to me that an alternative for crashdumps might be to just take a snapshot, and then write a separate utility that prepares a crashdump from a snapshot. That's probably worse, since in the future we might not be able to take a snapshot at arbitrary times, but only when I/O to the guest is quiet? However, I wanted to mention the idea to see what anyone else thought.

syntactically · 2026-03-04T17:56:20Z

src/hyperlight_host/examples/crashdump/main.rs

+/// back to the host as `GuestAborted` errors.
+///
+/// However, some faults are intercepted by the **hypervisor** before the
+/// CPU delivers them to the guest. For example, when the guest writes to


Minor nit, but this example is not the hypervisor "intercepting a fault that was originally destined for the guest": the fault was always targeted at the hypervisor, because it was the hypervisor's own page tables that caused it. There are intercepts that allow us to redirect faults to the hv (on amd64), but we don't use a tonne of them.

syntactically · 2026-03-04T17:57:32Z

src/hyperlight_host/examples/crashdump/main.rs

+
+    // This call triggers a ud2 instruction in the guest. The guest's IDT
+    // catches the #UD exception and reports it back to the host as a
+    // GuestAborted error via I/O. This does NOT trigger an automatic crash


Why do we have this distinction? We should be clear about exactly which exceptions result in dumps and which do not, and whether there is any relationship between that and which faults poison the sandbox.

syntactically · 2026-03-05T00:34:01Z

src/hyperlight_host/src/sandbox/uninitialized.rs

+    /// so GDB can compute the correct load offset for PIE binaries.
+    #[cfg(crashdump)]
+    pub(crate) entry_point: u64,
 }


This is not new to this PR, so feel free to file an issue and defer, but I wanted to see if you had thoughts in this context first. This addition brings to my attention the question: what is the expected lifecycle of SandboxRuntimeConfig? Some parts, like debug_info and guest_core_dump, seem like they are config of a specific sandbox instance (regardless of what guest/snapshot is being run inside of that instance), while things like entry_point and guest_core_dump seem like they are related fundamentally to the snapshot that's underlying the sandbox, and would need to be reset on restore to have sensible behaviour for a single sandbox that urns multiple different guests at different times.

syntactically · 2026-03-05T00:39:46Z

src/hyperlight_host/src/sandbox/snapshot.rs

 ) -> Option<(&'a ExclusiveSharedMemory, usize)> {
    let scratch_base = scratch_base_gpa(scratch_size);
-    if gpa >= scratch_base {
+    if gpa >= scratch_base && gpa < scratch_base + scratch_size as u64 {


I think this change makes sense, since it's good for the contract here to be that the returned offset exists in the memory (and an error would happen fast if not).

However, I'm curious if this change was motivated by actually finding a case where it would make a difference or not---it feels like a bug if there is a pointer to above the scratch region (only 1 unmapped page exists there), so if that was the cause of the change I'd be curious where the access was coming from.

syntactically · 2026-03-05T00:43:09Z

src/hyperlight_host/src/sandbox/snapshot.rs

-    } else if gpa >= SandboxMemoryLayout::BASE_ADDRESS as u64 {
+    } else if gpa >= SandboxMemoryLayout::BASE_ADDRESS as u64
+        && gpa < SandboxMemoryLayout::BASE_ADDRESS as u64 + snap.mem_size() as u64
+    {


Yes, in the case of the snapshot code, which was formerly the only user of this code, this was being handled in the function guest_page, and the entire call to access_gpa was avoided. Now that there are more users of this VA enumeration, and crashdump/gdb are likely to continue to need it, we should see if the interface can be factored out a bit better (made to not rely on exclusive memory access, but to take advantage of it where available; and made to unify scratch/snapshot handling and mapped region handling).

syntactically · 2026-03-05T00:45:47Z

src/hyperlight_host/src/mem/mgr.rs

+            if *executable {
+                flags |= MemoryRegionFlags::EXECUTE;
+            }
+            (flags, MemoryRegionType::Code)


This MemoryRegionType doesn't seem right, but then again I'm not sure if we're actually using the MemoryRegionTypes for anything, and they were already a bit inconsistent/nonsensical...

syntactically · 2026-03-05T00:49:58Z

src/hyperlight_host/src/mem/mgr.rs

+            })
+        })???;
+
+        // Include dynamic mmap regions at their GVA addresses as a fallback


This doesn't seem quite right. The mmap region guest_regions are GPAs not GVAs, and the rest of the crashdump is (much like a snapshot) handled entirely in GVA space. Code could well be doing something important at some GVA that is equal to a GPA where something entirely different is present, and this would overwrite the former with the contents of the latter as far as I understand it. Regions that are mapped into some guest physical address space, but not mapped from there to anywhere in VA space, are basically inaccessible to any code (unless it bothers to map them in first), since we don't have a complete physmap anyway...

syntactically · 2026-03-05T00:50:37Z

src/hyperlight_host/src/mem/mgr.rs

+            let host_base = HostGuestMemoryRegion::to_addr(rgn.host_region.start) + offset;
+            let host_end = host_base + mapping.len as usize;
+
+            regions.push(CrashDumpRegion {


Should the coalesce logic above run here as well?

jsturtevant requested review from danbugs, dblnz, devigned, jprendes, ludfjig, marosset, simongdavies and syntactically as code owners February 26, 2026 01:10

jsturtevant marked this pull request as draft February 26, 2026 01:11

dblnz reviewed Feb 27, 2026

View reviewed changes

src/hyperlight_host/examples/crashdump/main.rs Show resolved Hide resolved

src/hyperlight_host/examples/crashdump/main.rs Outdated Show resolved Hide resolved

src/hyperlight_host/src/hypervisor/hyperlight_vm.rs Outdated Show resolved Hide resolved

jsturtevant force-pushed the crashdump branch from b585863 to 93c4c75 Compare March 3, 2026 23:12

jsturtevant marked this pull request as ready for review March 3, 2026 23:14

jsturtevant added the kind/bugfix For PRs that fix bugs label Mar 4, 2026

jsturtevant force-pushed the crashdump branch 3 times, most recently from ba9bd81 to 4181d57 Compare March 4, 2026 04:00

Include snapshot and scratch regions in crash dumps

7080aa2

Co-authored-by: Doru Blânzeanu <dblnz@pm.me> Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

jsturtevant force-pushed the crashdump branch 2 times, most recently from 6ab40fe to 21b7183 Compare March 4, 2026 05:48

jsturtevant added 3 commits March 3, 2026 21:52

Add sentinel page for clean GDB backtraces

7bd6eaa

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

jsturtevant force-pushed the crashdump branch from 21b7183 to 137b016 Compare March 4, 2026 05:52

dblnz reviewed Mar 4, 2026

View reviewed changes

fixup! Include snapshot and scratch regions in crash dumps

c93e8c2

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

dblnz previously approved these changes Mar 4, 2026

View reviewed changes

jsturtevant dismissed dblnz’s stale review via a843212 March 4, 2026 22:59

fixup! Add sentinel page for clean GDB backtraces

f85b5a3

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>

jsturtevant force-pushed the crashdump branch from a843212 to f85b5a3 Compare March 4, 2026 23:06

ludfjig approved these changes Mar 5, 2026

View reviewed changes

syntactically reviewed Mar 5, 2026

View reviewed changes

	//! A snapshot freshly created from an empty VM will result in roughly
	//! the following physical layout:
	//!
	//! +-------------------------------------------+
	//! \| Guest Page Tables \|
	//! +-------------------------------------------+
	//! \| Init Data \| (GuestBlob size)
	//! +-------------------------------------------+
	//! \| Guest Heap \|
	//! +-------------------------------------------+
	//! \| PEB Struct \| (HyperlightPEB size)
	//! +-------------------------------------------+
	//! \| Guest Code \|
	//! +-------------------------------------------+ 0x1_000
	//! \| NULL guard page \|
	//! +-------------------------------------------+ 0x0_000

	//! There is also a scratch region at the top of physical memory,
	//! which is mostly laid out as a large undifferentiated blob of
	//! memory, although at present the snapshot process specially
	//! privileges the statically allocated input and output data regions:
	//!
	//! +-------------------------------------------+ (top of physical memory)
	//! \| Exception Stack, Metadata \|
	//! +-------------------------------------------+ (1 page below)
	//! \| Scratch Memory \|
	//! +-------------------------------------------+
	//! \| Output Data \|
	//! +-------------------------------------------+
	//! \| Input Data \|
	//! +-------------------------------------------+ (scratch size)

Conversation

jsturtevant commented Feb 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

syntactically commented Mar 3, 2026

Uh oh!

jsturtevant commented Mar 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jsturtevant Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dblnz left a comment

Choose a reason for hiding this comment

Uh oh!

ludfjig left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

syntactically left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jsturtevant Mar 4, 2026 •

edited

Loading