Refactor 'run' #1021

ludfjig · 2025-11-11T23:20:20Z

Gives Hypervisor::run a default implementation and removes each specialized implementation from kvm/mshv/whp. Adds run_vcpu to just run vm for 1 single exit. The rest of the changes are just made to enable the above. Also refactors guest function cancellation slightly (simpler). Certain logic is introduced in this PR that will be removed in my next pr.

ludfjig · 2025-11-12T01:14:04Z

src/hyperlight_host/src/hypervisor/mod.rs

+                        // TODO in next PR: make sure to remove hw breakpoint here
+                        // In case the hw breakpoint is the entry point, remove it to
+                        // avoid hanging here as gdb does not remove breakpoints it
+                        // has not set.
+                        // Gdb expects the target to be stopped when connected.
+                        // self.remove_hw_breakpoint(vcpu_fd, entrypoint)?;


note this self.remove_hw_breakpoint has been moved to another place (handle_debug in each of the 3 drivers), it will be brought back here in next PR.

dblnz

Great work, I appreciate the effort to split the Vm trait PR into multiple smaller PRs
I left some comments, the only "concerning" one is related to the tracing spans reordering.

src/hyperlight_host/src/hypervisor/hyperv_linux.rs

src/hyperlight_host/src/hypervisor/hyperv_windows.rs

src/hyperlight_host/src/hypervisor/mod.rs

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

simongdavies

Made a few comments, two general comments:

I feel like this PR is doing 2 things and could be split up to make it easier to review:

Refactor Run - which is what the title says
Changing the way cancellation is implemented

The second part could (should?) be done in another PR .

I also feel like we need more information about how cancellation works, this really needs to be well documented , maybe we need a diagram or two somewhere

src/hyperlight_host/src/hypervisor/gdb/arch.rs

simongdavies · 2025-11-14T11:46:30Z

src/hyperlight_host/src/hypervisor/mod.rs

-    /// The operation should be retried
-    /// On Linux this can happen where a call to run the CPU can return EAGAIN
-    /// On Windows the platform could cause a cancelation of the VM run
+    /// The operation should be retried, for example this can happen on Linux where a call to run the CPU can return EAGAIN


Dont understand why we need to do this? Cant we exclude this from whatever match is using it on Windows? Also I dont understand why we are not seeing this on Windows since IIRC the platform itself can choose to cancel an execution which we then captured and retried, what happened to this?

I'm not sure I know what you mean. We only construct VmExit::Retry() on linux by this mapping libc::EAGAIN => VmExit::Retry(). There is no equivalent exit on windows that needs to be retried as far as I know. Regardless, if windows exits with a VmExit::Cancelled(), but there is no cancellation requested for the sandbox, then we retry it (but we still don't construct any VmExit::Retry() which is why the warning is supressed on windows).

simongdavies · 2025-11-14T11:49:22Z

src/hyperlight_host/src/hypervisor/mod.rs

+                    // ===== KILL() TIMING POINT 3: Before calling run_vcpu() =====
+                    // If kill() is called and ran to completion BEFORE this line executes:
+                    //    - CANCEL_BIT will be set, but it's too late to prevent entering the guest this iteration
+                    //    - Signals will interrupt the guest (RUNNING_BIT=true), causing VmExit::Cancelled()


This isnt quite true since Windows does not use signals?

simongdavies · 2025-11-14T11:51:16Z

src/hyperlight_host/src/hypervisor/mod.rs

+                    //    - CANCEL_BIT will be set, but it's too late to prevent entering the guest this iteration
+                    //    - Signals will interrupt the guest (RUNNING_BIT=true), causing VmExit::Cancelled()
+                    //    - If the guest completes before any signals arrive, kill() may have no effect
+                    //      - If there are more iterations to do (IO/host func, etc.), the next iteration will be cancelled


What happens if there are not more iterations to do , is it possible to lie about completion or cancel the next use of this Sandbox?

simongdavies · 2025-11-14T11:52:11Z

src/hyperlight_host/src/hypervisor/mod.rs

+            // ===== KILL() TIMING POINT 4: Before capturing cancel_requested =====
+            // If kill() is called and ran to completion BEFORE this line executes:
+            //    - CANCEL_BIT will be set
+            //    - Signals may still be sent (RUNNING_BIT=true) but are harmless no-ops


What about windows?

simongdavies · 2025-11-14T11:57:59Z

src/hyperlight_host/src/hypervisor/mod.rs

+                    }
+                }
+                Ok(VmExit::Cancelled()) => {
+                    // If cancellation was not requested for this specific guest function call,


I don't understand the scope of this problem, is it possible that we try to kill a long running execution and that we end up delivering the signal to the wrong sandbox? If so does this happen in the current implementation? Either way if this does happen we should really think about a way to prevent this from happening as we don't really want one sandboxes behavior to influence the execution of another sandbox.

It does happen in the current implementation and as far as I am aware it's impossible to avoid. Signals are delivered async so there's no guaranteeing what happens in between sending the signal and the signal being delivered. A totally new vm could have started on the same thread or a new guest call could have been made. I think the only thing we can do here is ignore it if signal was not intended for this guest-call or sandbox.

simongdavies · 2025-11-14T12:15:22Z

src/hyperlight_host/src/hypervisor/mod.rs

-    /// This is acceptable because the generation tracking provides an additional
-    /// safety layer. Even if a stale kill somehow stamped cancel_requested, the
-    /// generation mismatch would cause it to be ignored.
-    call_active: AtomicBool,


did you delete this? Seemed like a reasonable optimization to me , it avoided a lot of useless processing

This flag provided no optimization as far as I can tell. We always already exit early if running flag is not true.

simongdavies · 2025-11-14T12:44:08Z

src/hyperlight_host/src/hypervisor/mod.rs

+    /// - CANCEL_BIT remains unchanged
+    ///
+    /// # Memory Ordering
+    /// Uses `Release` ordering to ensure that the `tid` store (which uses `Relaxed`)


This comment is extremely confusing, I don't understand the relationship between store of this running atomic and the tid store or the relevance of the relaxed orderings, I think generally where we are use atomics and ordering we should document why the orderings are the way they are , these things are hard to understand when looking at the code later.

I think it will also make reviewing the code easier.

Nice catch, typo in the docs. And agree. I'll create a new document for this

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

ludfjig added the kind/refactor For PRs that restructure or remove code without adding new functionality. label Nov 11, 2025

ludfjig force-pushed the intermediate_pr branch 3 times, most recently from 8602907 to 027fa17 Compare November 12, 2025 01:08

ludfjig commented Nov 12, 2025

View reviewed changes

ludfjig marked this pull request as ready for review November 12, 2025 01:53

ludfjig requested review from danbugs, dblnz, devigned, jprendes, marosset, simongdavies and syntactically as code owners November 12, 2025 01:53

dblnz reviewed Nov 12, 2025

View reviewed changes

ludfjig added 2 commits November 13, 2025 11:42

Refactor 'run'

5cbb30d

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Make sure run_vcpu does not create spans

0360070

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

ludfjig force-pushed the intermediate_pr branch from be64e09 to 0360070 Compare November 13, 2025 19:42

simongdavies reviewed Nov 14, 2025

View reviewed changes

ludfjig added 2 commits November 14, 2025 15:01

Some PR feedback fixes + docs

8ba0396

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Remove interrupt generation number + add docs

d7344f0

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

ludfjig marked this pull request as draft November 19, 2025 19:07

Refactor 'run' #1021

Are you sure you want to change the base?

Refactor 'run' #1021

Conversation

ludfjig commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dblnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

simongdavies left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ludfjig Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ludfjig Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ludfjig Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ludfjig commented Nov 11, 2025 •

edited

Loading

ludfjig Nov 14, 2025 •

edited

Loading

ludfjig Nov 14, 2025 •

edited

Loading

ludfjig Nov 14, 2025 •

edited

Loading